What gives ATI's graphics architecture its advantages?

[edit] Wait, in your comparison are you referring to Fermi or a theoretical 320 scalar ALU Cypress @ 850 MHz?
Yeah, I'm referring to the latter. I think ATI gets something like 70% packing efficiency in games, and let's say 25% of the total render time is ALU-limited, so overall that means ATI gets a 62% boost by choosing a 5x design over a 1x design.
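(Rough arithmetic behind that 62% figure, using just the numbers above; the 3.5x factor is simply 5 slots times 70% packing. Treat it as a sketch, not a measurement.)

```python
# Sketch of the 62% figure: 5 VLIW slots at 70% packing, with 25% of the
# VLIW design's frame time being ALU-limited (all numbers from the post above).
slots, packing, alu_fraction = 5, 0.70, 0.25

effective_alu = slots * packing                               # 3.5x a 1-wide design's ALU rate
time_1x = (1 - alu_fraction) + alu_fraction * effective_alu   # 1.625x as long on the 1-wide design
print(f"VLIW5 comes out ~{100 * (time_1x - 1):.0f}% faster")  # ~62%
```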
The 1.5-2x advantage almost never materializes, however, and depends on the workload.
See above. I wasn't comparing to NVidia, but talking about the decision to go VLIW. ATI knows that it's not efficient in terms of unit utilization, but it is smart in terms of overall chip utilization (i.e. perf/$).

The real difference is, NVidia is betting on workloads (tessellation, double precision, CPU-like problems, ECC, etc) that haven't materialized yet.
This is actually the complete opposite of the ATI/NVidia of old. R100 did dependent texturing while NV10/15 didn't; R200 did PS1.4 because it was the path to PS2.0, whereas NVidia's choice of PS1.0 limited itself to fixed math before texturing (hence the mess of tex instructions); NV18 was really lacking in features; NV30 survived because games didn't use DX9 enough yet; and G7x was a barebones PS3.0 implementation while ATI designed R5xx's shader core for fast dynamic branching, which never showed up in games at the time.

NVidia changed its tune with G80, and ATI stopped making design decisions on far-fetched workload projections after R600.

I wouldn't actually say Fermi's architecture is bad; it's just that, from a cost-benefit analysis summed over all current workloads, it doesn't look cost-effective. NVidia may be OK with that, because they might have their sights set on altering the market into one where their card becomes cost-effective.
Yup, I agree. I think NVidia is doing everything you could expect them to do in attempting this alteration, but whether it works is yet to be seen. The danger is that once the market has been created, ATI swoops in and attacks it with its next generation by making a few tweaks to its architecture. You need some pretty contrived examples for GT200b not to get crushed by Cypress in GPGPU workloads, and if you scaled Cypress to 55nm it would be roughly the same size.

I have to say that, despite enormous boosts in theoretical FLOPS and pixel power in recent years, games don't look all that much more impressive to me; we seem to be getting diminishing returns. Game budgets are skyrocketing (tens of millions), FLOPS have gone through the roof, but nothing has really blown me away in the last few years.
Yeah, but despite this, the market for the high-end GPU hasn't died. Maybe it will soon, just as the lack of benefit from high-end CPUs has killed ASPs in recent years once consumers started realizing how little CPU performance matters, or maybe someone will make a killer game engine that needs those FLOPS. Needless to say, I hope it's the latter...
 
It has 4 times more SPs than e.g. G80, but only 2.5 times the performance. That's why I think they aren't bottlenecking the GPU.
I haven't seen a comparison to G80, but apart from the fact that it's not quite 4 times more SPs (due to the disabled cluster), compared to the highest-clocked G92 it's only slightly less than 3 times the peak ALU rate. And that's not factoring in that interpolation is now (most likely) done in the main ALUs rather than the SFUs, and that there aren't more SFUs in total than GT200 has, so in theory there could also be a bottleneck in special functions (though my theory is that interpolation moved to the main ALUs precisely to free the SFUs up for special functions, since there are fewer of them). Then again, G92 apparently wasn't terribly limited by its ALUs either if you compare its results to G94...
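(For reference, a rough peak-ALU comparison behind the "slightly less than 3 times" statement, counting MAD issue only and assuming stock shader clocks of 1836 MHz for the 9800 GTX+ and 1401 MHz for the GTX 480.)

```python
# Peak MAD throughput only (ignoring SFUs / the co-issued MUL); stock shader clocks assumed.
g92_peak   = 128 * 1.836e9 * 2 / 1e9   # 9800 GTX+: ~470 GFLOPS
gf100_peak = 480 * 1.401e9 * 2 / 1e9   # GTX 480:  ~1345 GFLOPS
print(gf100_peak / g92_peak)           # ~2.86, i.e. slightly less than 3x
```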
 
Btw. what is the primary bottleneck of GF100 performance?
I don't think anything is bottlenecking it. Aside from geometry, it's just a design with rather low peak throughput in all areas given its size and power consumption.

It doesn't have an advantage over Cypress in texturing, ALUs, or ROPs. It has an advantage in bandwidth, but even that came in lower than expected given the 384-bit bus.
 
Blend rate is fine in theory, but not so much in practice according to the hardware.fr numbers (pretty much outclassed by Cypress).
Some of those numbers are definitely wrong.
http://www.hardware.fr/articles/787-6/dossier-nvidia-geforce-gtx-480-470.html

27.2 Gpix/s for "32-bit 4xINT8" with blending would need 218 GB/s, which the 5870 doesn't have (probably a typo, looking at the 5850 numbers). The most important fillrates - blended 4xINT8, blended FP16, and Z-only - are faster on the GTX480. The other scenarios don't really matter in games because they don't have trivial pixel shaders, and rarely will the shading engines pump out pixels faster than a few Gpix/s.
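(The 218 GB/s figure works out as follows, assuming blending costs a 4-byte read plus a 4-byte write per pixel at the ROPs, and taking the 5870's nominal 153.6 GB/s.)

```python
# Blended 4xINT8: 4 bytes read + 4 bytes written per pixel at the ROPs.
rate_gpix = 27.2e9             # the suspect hardware.fr number
traffic   = rate_gpix * (4 + 4)   # bytes/s of ROP traffic
hd5870_bw = 153.6e9            # 256-bit GDDR5 @ 4.8 Gbps
print(traffic / 1e9)           # 217.6 GB/s needed
print(traffic <= hd5870_bw)    # False - the 5870 can't sustain it
```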
 
It does have more overhead. But it makes life a lot easier.

For instance, suppose you have a function call in your kernel. Handling a function call per scalar is a hell of a lot easier than figuring out how to emit a function call across a 4-wide vector. One lets you use the same C code that you had before (a function operating on a scalar); the other requires that you either lose performance or regenerate the function call.
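(A conceptual sketch of the quoted point, with Python standing in for what the compiler has to emit; `attenuate` is just a made-up helper, not anything from a real kernel.)

```python
def attenuate(d):                 # scalar helper, written once, plain-C style
    return 1.0 / (1.0 + d * d)

def kernel_scalar(dists):         # per-lane "scalar" ISA: reuse the function as-is
    return [attenuate(d) for d in dists]

def kernel_vec4_serial(d4):       # 4-wide vector ISA, option 1: call once per channel
    return [attenuate(d) for d in d4]        # correct, but throws away the vector width

def attenuate_vec4(d4):           # option 2: regenerate a 4-wide version of the function
    return [1.0 / (1.0 + d * d) for d in d4]
```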
One can take the easy solution: simply don't vectorize and let the compiler figure out how the ILP can be used to fill the VLIW slots.
Remember, one only needs about a 50% utilization rate to arrive at the same shader performance as the competition. I think this can be achieved on average workloads without special optimizations; vectorization often just narrows the gap to the theoretical peak performance. Vectorized code is routinely (a lot) faster on ATI than on NVidia; one really needs to hit some weak spot to reverse that. Furthermore, embarrassingly parallel code is often fairly easy to vectorize. And when I think about it, what are all the SSEx extensions good for? And has it helped much that the AMD K8 CPUs had the same throughput with scalar DP SSE2 as with vector instructions?
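(The 50% number roughly checks out against peak rates, assuming stock clocks: 1600 ALUs at 850 MHz for Cypress versus 480 at 1401 MHz for GF100, both counted at 2 flops per ALU per clock.)

```python
cypress_peak = 1600 * 2 * 0.850e9 / 1e12   # HD 5870: ~2.72 TFLOPS peak
gtx480_peak  =  480 * 2 * 1.401e9 / 1e12   # GTX 480: ~1.35 TFLOPS peak
print(gtx480_peak / cypress_peak)          # ~0.49, so ~50% VLIW utilization breaks even
```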
 
Some of those numbers are definitely wrong.
http://www.hardware.fr/articles/787-6/dossier-nvidia-geforce-gtx-480-470.html

27.2 Gpix/s for "32-bit 4xINT8" with blending would need 218 GB/s, which the 5870 doesn't have (probably a typo, looking at the 5850 numbers).
Yes, I guess you're right. That is only one of the numbers, though; I can't see any obvious mistake with the others.
The most important fillrates - blended 4xINT8, blended FP16, and Z-only - are faster on the GTX480.
For the former two the ROPs don't matter one bit, though, as those tests are just measuring memory bandwidth. In all the cases where the ROPs actually matter (except Z and 1x FP32), it is slower.
The other scenarios don't really matter in games because they don't have trivial pixel shaders, and rarely will the shading engines pump out pixels faster than a few GPix/s.
Maybe, but the fact remains that those 48 ROPs are slower than AMD's 32 (except for Z fill, where they have twice the rate per ROP anyway). I'm not sure the others don't really matter (no less than the three you mentioned, at least).

edit: FP32 is actually interesting with GF100. Evergreen can do half-speed FP32 whether it's one channel or four (bandwidth permitting, of course). GF100, however, can do full-speed single-channel FP32 but only quarter-speed (it seems) 4-channel FP32 - well, without blending at least. With blending, Evergreen looks like quarter speed to me, and GF100 again shows some channel-count-dependent factor (which I can't quite figure out, but it seems to be a bit less than half for one channel). The GF100 way of doing things seems to make sense (because you don't have the bandwidth for multi-channel FP32 anyway), except that the rate is so low that even for the really bandwidth-demanding 4xFP32 blend it doesn't get anywhere close to the bandwidth limit.
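(Rough bandwidth ceilings for the blended 4xFP32 case, assuming 32 bytes of ROP traffic per pixel and the cards' nominal bandwidths; the GF100 ROP rate assumes 48 ROPs at the 700 MHz core clock.)

```python
bytes_per_pixel = 16 + 16                  # 4xFP32 read + write when blending
gtx480_bw, hd5870_bw = 177.4e9, 153.6e9    # nominal memory bandwidths
print(gtx480_bw / bytes_per_pixel / 1e9)   # ~5.5 Gpix/s ceiling for GF100
print(hd5870_bw / bytes_per_pixel / 1e9)   # ~4.8 Gpix/s ceiling for Cypress
# 48 ROPs * 0.7 GHz = 33.6 Gpix/s of INT8, so even a quarter-rate FP32 path
# (8.4 Gpix/s) sits above these ceilings - yet the measured rate falls short of them.
```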
 
Yup, I agree. I think NVidia is doing everything you could expect them to do in attempting this alteration, but whether it works is yet to be seen. The danger is that once the market has been created, ATI swoops in and attacks it with its next generation by making a few tweaks to its architecture. You need some pretty contrived examples for GT200b not to get crushed by Cypress in GPGPU workloads, and if you scaled Cypress to 55nm it would be roughly the same size.

Well yeah, they must know that if they succeed in creating new markets then ATi can join the dance later on with competitive hardware. But that's not the whole strategy. Being a first-mover has advantages in a market where intangibles like reputation and support are important. And it goes without saying that software is still king.
 
Yes, I guess you're right. That is only one of the numbers, though; I can't see any obvious mistake with the others.
True, but that is the one that makes Fermi look 'outclassed', IMO. The FP10 format is rarely used today, but it probably will be used more in the future, so it's relevant.

For the former 2 the rops don't matter one bit though as they are just measuring memory bandwidth.
Exactly. ATI has two ROP quads per memory channel because one would be insufficient to saturate BW, but due to BW limitations the advantage is pretty superficial in other cases.
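(Back-of-the-envelope version of that, taking the two-quads-per-channel figure above and assuming the 256-bit bus is split into four 64-bit channels of 4.8 Gbps GDDR5, with blended INT8 taken as 8 bytes of traffic per pixel.)

```python
channel_bw   = 64 / 8 * 4.8e9          # 38.4 GB/s per assumed 64-bit channel
quad_traffic = 4 * 0.850e9 * 8         # one ROP quad blending INT8: 27.2 GB/s
print(1 * quad_traffic >= channel_bw)  # False - one quad can't saturate the channel
print(2 * quad_traffic >= channel_bw)  # True  - two quads can
```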

And yeah, the FP32 results are interesting. Not sure why Fermi was designed to be so slow with that.

Well yeah, they must know that if they succeed in creating new markets then ATi can join the dance later on with competitive hardware. But that's not the whole strategy. Being a first-mover has advantages in a market where intangibles like reputation and support are important. And it goes without saying that software is still king.
Of course, but if we move to open standards like DirectCompute and OpenCL, the impact of being the first mover won't matter too much. If some of the features of Fermi start showing advantages in software, ATI will match them. If the application has a big market, the dev will do a little tweaking for ATI hardware.

Untrue. See Folding@Home, where GT200b stomps all over anything ATi has, including Hemlock. Math rate isn't the only factor in solving any problem, GPGPU or not.
I knew someone was going to bring this up.

That's because the workloads and algorithms aren't the same. I don't even think they fold the same types of proteins. I'm pretty sure that they don't just share the code with ATI/NVidia and let them write a client.

Did NV need to "completely re-write" F@H to get such good performance? Whatever they did, they did it right, and willingly.
Both got a GPU2 client written for them once. ATI got it written during the R600 era, which is architecturally very different from Cypress, and NVidia got it written during the G80 era, which is very similar to GT200. There have been some tweaks to support new hardware since then, but that's it.
 
The flops calculation already accounts for clock speeds.

edit: not awake yet, damn need more coffee.

What I'm referring to is the overall performance that is perceived. Guess that wasn't exactly the question. :)
 

Heh, indeed. Well, from those numbers the GTX480, with almost exactly 4 times the peak ALU throughput (not factoring in SFUs), manages to be 2-3 times faster than the 8800GTX. Too bad some of the newer titles (which are more likely to have more complicated shaders) run into memory limits... Though from that point of view it indeed doesn't look like the GTX480 is really suffering from a lack of ALUs...
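(The "almost exactly 4 times" ratio, counting MAD issue only and assuming stock shader clocks of 1350 MHz for the 8800GTX and 1401 MHz for the GTX480.)

```python
g80_peak   = 128 * 1.350e9 * 2 / 1e9   # 8800GTX: ~346 GFLOPS (MAD only, no SFU MUL)
gf100_peak = 480 * 1.401e9 * 2 / 1e9   # GTX480:  ~1345 GFLOPS
print(gf100_peak / g80_peak)           # ~3.9x the peak ALU rate for 2-3x the performance
```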
 
G7x was a barebones PS3.0 implementation while ATI designed R5xx's shader core for fast dynamic branching which never showed up in games at the time.
Well NV40 was a hella lot more interesting than R420. ;) Too bad that R400 didn't work out. I think G70 was originally NV48, before they decided it would be their "new" product line and so gave it a fresh number.

ATI did have a thing about adding nifty forward-looking features with R100 and R200, but their performance wasn't very competitive and image quality was fugly at times. They couldn't pull together all of the odds and ends into a really great product. Even R300, clearly a superb piece of engineering, was held back by terrible drivers for a while. The Ti 4600 actually held its own in OpenGL. The company sure has changed since then though, entirely in good ways too.
 
It does have more overhead. But it makes life a lot easier.

For instance, suppose you have a function call in your kernel. Handling a function call per scalar is a hell of a lot easier than figuring out how to emit a function call across a 4-wide vector. One lets you use the same C code that you had before (a function operating on a scalar); the other requires that you either lose performance or regenerate the function call.
The VLIW cross channel dependencies don't really matter to you if you are writing scalar code in C though ... only the compiler has to worry about them. The branch granularity gets worse of course, but if you just want to write scalar code that's what you have to live with.
 
The VLIW cross channel dependencies don't really matter to you if you are writing scalar code in C though ... only the compiler has to worry about them. The branch granularity gets worse of course, but if you just want to write scalar code that's what you have to live with.

Unfortunately, compilers are far from perfect when it comes to vectorization. For graphics it can be pretty easy to vectorize, but that's a function of the workload. Not all HPC is trivial to vectorize, and a lot of it is downright difficult or impossible.

Getting back to OP's question - this is one of the areas where NV designed their architecture to be more robust across a wider range of workloads. That cost them die area, power, design effort, etc. that isn't a huge benefit for graphics, but is very helpful elsewhere.
 
Vectorization for SIMD is much harder than for VLIW though.
 
Well NV40 was a hella lot more interesting than R420. ;) Too bad that R400 didn't work out. I think G70 was originally NV48, before they decided it would be their "new" product line and so gave it a fresh number.

ATI did have a thing about adding nifty forward-looking features with R100 and R200, but their performance wasn't very competitive and image quality was fugly at times. They couldn't pull together all of the odds and ends into a really great product.
True, but NV40's implementations of PS3.0 and VS3.0 were almost useless due to bad branching and bad vertex texturing. IMO the biggest blunder ATI made with R420 was omitting FP16 blending. All the HDR effects in that era could be done with that coupled with PS2.0. However, I have to give props to NVidia for NV43, as that was a blazing fast midrange card.

R100 and R200 had problems, especially with drivers, but they were solid engineering and technology efforts. All the fancy water effects we saw with DX8 were actually things that would look better with EMBM, so to me GF1/GF2 were a big step down in image quality if devs treated the hardware equally (and don't forget the S3TC bug on GF1/2). When you look at the HiZ, Z compression, 16-bit vs. 32-bit rendering speed, even the vertex and pixel shaders (yes, they were there, but slightly short of DX8), R100 was far closer to GF3/4 in architecture than GF1/2. I even liked R100's ugly AF, as it still had gobs more detail than trilinear filtering and was a godsend in racing games. R200's PS1.4 was capable of parallax mapping years before it became popular, which is sort of tragic, IMO. You could actually decide what math you do before a texture lookup rather than choosing a predefined calculation in PS1.1-1.3.

Anyway, I still think my analogy of NVidia and ATI switching philosophies is fairly apt.
 
Vectorization for SIMD is much harder than for VLIW though.

Not if your SIMD is a complete SIMD with gather and scatter though (see LRBni for an example). CUDA does automatic vectorization for their SIMD because of this (it looks like scalar but the underlying architecture is actually a SIMD).
 
Not if your SIMD is a complete SIMD with gather and scatter though (see LRBni for an example). CUDA does automatic vectorization for their SIMD because of this (it looks like scalar but the underlying architecture is actually a SIMD).

I don't know if the architecture is SIMD.

The microarchitecture is definitely SIMD.

There's a really big difference.

David
 
Getting back to OP's question - this is one of the areas where NV designed their architecture to be more robust across a wider range of workloads. That cost them die area, power, design effort, etc. that isn't a huge benefit for graphics, but is very helpful elsewhere.
I've argued that ATI can do this with very minor modifications*, so I don't buy this. Besides, Cypress can already do some dependent operations between the channels (e.g. a*b*c takes 1 cycle, 2 slots instead of 2 cycles, 1 slot), so NVidia's advantage is almost gone. In fact, a scalar sequence of dependent adds or muls will be much faster on Cypress than the similarly sized GF104, despite using only two of its four simple ALUs.

The main costs that NVidia is dealing with are, as I mentioned earlier, a more complex scheduler that isn't clause-based, a smaller wavefront size, a less restrictive register file, and the cache.

*Just change the simple ALUs to 64x1 instead of 16x4 and operate on one channel at a time. Instruction group issue rate, branch rate, register access, etc. can stay the same. The main modification would be the need to round-robin 8 wavefronts every 32 clocks instead of 2 every 8.
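(A quick sanity check that the proposed 64x1 arrangement keeps the aggregate rate of the 16x4 one; the wavefront and clock counts are the ones from the footnote above, everything else is an assumption.)

```python
# Current: 16 VLIW processors x 4 simple ALUs, 2 wavefronts rotated every 8 clocks.
ops_now, wf_now, clocks_now = 16 * 4, 2, 8
# Proposed: 64 single-channel ALUs, one channel at a time, 8 wavefronts per 32 clocks.
ops_new, wf_new, clocks_new = 64 * 1, 8, 32

print(ops_now == ops_new)                          # True: 64 simple ops/clock either way
print(wf_now / clocks_now == wf_new / clocks_new)  # True: same wavefront turnover rate,
                                                   # you just need 4x as many in flight
```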
 