Yeah, I'm referring to the latter. I think ATI gets something like 70% packing efficiency in games, and let's say 25% of the total render time is ALU-limited, so overall that means ATI gets a 62% boost by choosing a 5x design over a 1x design.

[edit] Wait, in your comparison are you referring to Fermi or a theoretical 320 scalar ALU Cypress @ 850 MHz?
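For what it's worth, here's the back-of-envelope math I'm using (a minimal sketch, assuming the 25% ALU-limited figure is measured on the VLIW design, so a scalar design would stretch that portion by the effective packing factor):

```python
# Back-of-envelope: benefit of a 5-wide VLIW design over a scalar (1x) design.
# Assumptions, not measurements: 70% packing efficiency, and 25% of the render
# time on the VLIW part is ALU-limited.
packing_efficiency = 0.70
vliw_width = 5
alu_fraction = 0.25                # fraction of render time that is ALU-limited

effective_alu_speedup = vliw_width * packing_efficiency        # 3.5x
# On a hypothetical scalar design the ALU-limited portion takes 3.5x longer.
scalar_render_time = (1 - alu_fraction) + alu_fraction * effective_alu_speedup
print(f"Scalar design render time: {scalar_render_time:.3f}x")   # 1.625x
print(f"VLIW advantage: {(scalar_render_time - 1) * 100:.1f}%")  # ~62.5%
```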
See above. I wasn't comparing to NVidia, but talking about the decision to go VLIW. ATI knows that it's not efficient in terms of unit utilization, but it is smart in terms of overall chip utilization (i.e. perf/$).

The 1.5-2x advantage almost never materializes, however, and is based on workload.
This is actually completely opposite of the ATI/NVidia of old. R100 did dependent texturing while NV10/15 didn't, R200 did PS1.4 because it was the path to PS2.0 whereas NVidia's choice of PS1.0 limited itself to fixed math before texturing (hence the mess of tex instructions), and NV18 was really lacking features. NV30 survived because games didn't use DX9 enough yet, and G7x was a barebones PS3.0 implementation while ATI designed R5xx's shader core for fast dynamic branching, which never showed up in games at the time.

The real difference is, NVidia is betting on workloads (tessellation, double precision, CPU-like problems, ECC, etc.) that haven't materialized yet.
Yup, I agree. I think NVidia is doing everything you could expect them to do in attempting this alteration, but whether it works is yet to be seen. The danger is that once the market has been created, ATI swoops in and attacks it with its next generation by making a few tweaks to its architecture. You need some pretty contrived examples for GT200b not to get crushed by Cypress in GPGPU workloads, and if you scaled Cypress to 55nm it would be roughly the same size.

I wouldn't actually say Fermi's architecture is bad, it's just that from a cost-benefit analysis summing over all current workloads, it looks non-cost-effective. NVidia may be OK with that, because they might have their sights on trying to alter the market into one where their card becomes cost effective.
Yeah, but despite this, the market for the high-end GPU hasn't died. Maybe it will soon, just like the lack of benefit from high-end CPUs has killed ASPs in recent years as consumers started realizing how little CPU performance matters, or maybe someone will make a killer game engine that needs those FLOPS. Needless to say, I hope it's the latter...

I have to say that, despite enormous boosts in theoretical FLOPS and pixel power in recent years, games don't look all that more impressive to me; we seem to be getting diminishing returns. Game budgets are skyrocketing (tens of millions), FLOPS have gone through the roof, but nothing has really blown me away in the last few years.
Haven't seen a comparison to G80, but apart from the fact that it's not quite 4 times more SPs (due to the disabled cluster), compared to the highest-clocked G92 it's only slightly less than 3 times the peak ALU rate. And that's not factoring in that interpolation is now done in the main ALUs (most likely, or at least not in the SFUs any longer), and there aren't more SFUs in total than GT200 has, so in theory there could also be a bottleneck in special functions (though my theory is that interpolation moved to the main ALUs precisely to free the SFUs up for special functions, since there are fewer of them). Then again, G92 apparently wasn't terribly limited by its ALUs either if you compare its results to G94...

It has 4 times more SPs than e.g. G80, but only 2.5 times more performance. That's why I think they aren't bottlenecking the GPU.
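To put rough numbers on that (the clock speeds and SP counts below are from memory and should be treated as assumptions, and G92's mostly-unusable extra MUL is ignored):

```python
# Rough peak ALU rates, counting one MAD/FMA per SP per clock.
# SP counts and shader clocks are assumed/from memory, not authoritative.
def peak_gflops(sps, shader_clock_ghz, flops_per_sp_per_clock=2):
    return sps * shader_clock_ghz * flops_per_sp_per_clock

g80   = peak_gflops(128, 1.350)   # 8800 GTX                      ~346 GFLOPS
g92   = peak_gflops(128, 1.836)   # 9800 GTX+ (highest-clocked)   ~470 GFLOPS
gf100 = peak_gflops(480, 1.401)   # GTX 480, one cluster disabled ~1345 GFLOPS

print(f"GF100 vs G80: {gf100 / g80:.1f}x")   # ~3.9x
print(f"GF100 vs G92: {gf100 / g92:.1f}x")   # ~2.9x, i.e. a bit under 3x
```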
I don't think anything is bottlenecking it. Aside from geometry, it's just a design with rather low peak throughput in all areas given its size and power consumption.

Btw, what is the primary bottleneck of GF100 performance?
Some of those numbers are definitely wrong.

Blend rate is fine in theory, not so much in practice according to the hardware.fr numbers (outclassed by Cypress pretty much).
One can take the easy solution: simply don't vectorize and let the compiler figure out how the ILP can be used to fill the VLIW slots.

It does have more overhead. But it makes life a lot easier.
For instance, suppose you have a function call in your kernel. Handling a function call per scalar is a hell of a lot easier than figuring out how to emit a function call across a 4-wide vector. One lets you use the same C code that you had before (a function operating on a scalar), the other requires that you either lose performance or regenerate the function call.
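A contrived sketch of the bookkeeping difference (plain Python standing in for kernel code; attenuate is a made-up helper): the scalar style reuses the existing function untouched, while the vectorized style forces you to either regenerate a 4-wide variant of it or fall back to per-lane calls and lose the benefit.

```python
def attenuate(d):                  # existing scalar helper, reused untouched
    return 1.0 / (1.0 + d * d)

def kernel_scalar(distances):
    # "Scalar" style: one element per lane, the original helper is called as-is.
    return [attenuate(d) for d in distances]

def attenuate_vec4(d4):            # hand-regenerated 4-wide version of the helper
    return tuple(1.0 / (1.0 + d * d) for d in d4)

def kernel_vectorized(distances):
    # "Vectorized" style: data must be packed into 4-wide groups, and every
    # helper the kernel calls needs a matching 4-wide variant (or you call the
    # scalar helper per lane and give up the vectorization win).
    out = []
    for i in range(0, len(distances), 4):
        out.extend(attenuate_vec4(tuple(distances[i:i + 4])))
    return out
```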
Yes, I guess you're right. That is only one of the numbers; I can't see any obvious mistake with the others.

Some of those numbers are definitely wrong.
http://www.hardware.fr/articles/787-6/dossier-nvidia-geforce-gtx-480-470.html
27.2 Gpix/s for "32-bit 4xINT8" with blending would need 218 GB/s, which the 5870 doesn't have (probably a typo, looking at the 5850 numbers).
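The bandwidth math behind that, assuming a 4-byte framebuffer read plus a 4-byte write per blended pixel:

```python
# Bandwidth needed to sustain 27.2 Gpix/s of blended 4xINT8 fill.
# Assumes each blended pixel costs a 4-byte destination read plus a 4-byte write.
pixels_per_second = 27.2e9
bytes_per_pixel = 4 + 4                       # read destination + write result
required_bw = pixels_per_second * bytes_per_pixel / 1e9
print(f"Required bandwidth: {required_bw:.0f} GB/s")   # ~218 GB/s

hd5870_bw = 256 / 8 * 4.8                     # 256-bit bus, 4.8 Gbps GDDR5
print(f"HD 5870 bandwidth:  {hd5870_bw:.1f} GB/s")     # 153.6 GB/s
```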
For the former two the ROPs don't matter one bit, though, as those tests are just measuring memory bandwidth. For all the cases where the ROPs actually matter (except Z and 1x FP32), it is slower.

The most important fillrates - blended 4xINT8, blended FP16, and Z-only - are faster on the GTX 480.
Maybe, but the fact remains that those 48 ROPs are slower than AMD's 32 (except Z fill, where they have twice the rate per ROP anyway). I'm not sure the others don't really matter (no less than the three you mentioned, at least).

The other scenarios don't really matter in games because games don't have trivial pixel shaders, and rarely will the shading engines pump out pixels faster than a few GPix/s.
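A rough sanity check on that last point (the shader length and the Cypress-like configuration are illustrative assumptions):

```python
# How fast can the shader core emit pixels when the pixel shader isn't trivial?
# Illustrative assumptions, not measurements: a Cypress-like core with 320
# 5-wide VLIW units at 850 MHz, and a pixel shader ~50 VLIW instructions long.
vliw_units = 320
core_clock = 0.85e9                 # Hz
shader_instructions = 50            # VLIW bundles issued per pixel (assumed)

pixel_rate = vliw_units * core_clock / shader_instructions
print(f"Shader-limited pixel rate: {pixel_rate / 1e9:.1f} Gpix/s")   # ~5.4 Gpix/s
# Well below the ~27 Gpix/s that 32 ROPs at 850 MHz could theoretically handle.
```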
Might wanna start here:

Haven't seen comparison to G80,…
True, but that is the one that makes Fermi look 'outclassed', IMO. The FP10 format is rarely used today, but it probably will be used more in the future, so it's relevant.

Yes, I guess you're right. That is only one of the numbers; I can't see any obvious mistake with the others.
Exactly. ATI has two ROP quads per memory channel because one would be insufficient to saturate BW, but due to BW limitations the advantage is pretty superficial in other cases.

For the former two the ROPs don't matter one bit, though, as those tests are just measuring memory bandwidth.
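A rough sketch of that trade-off, assuming a Cypress-like setup (850 MHz core, 8 ROP quads spread over four 64-bit memory controllers, 153.6 GB/s total); the exact controller layout is an assumption on my part:

```python
# One ROP quad vs. two per memory channel, Cypress-like numbers (assumed).
core_clock = 0.85e9                              # Hz
channel_bw = 153.6 / 4                           # GB/s per 64-bit channel = 38.4

quad_write_bw = 4 * 4 * core_clock / 1e9         # 4 px/clk * 4 B writes  = 13.6 GB/s
quad_blend_bw = 4 * (4 + 4) * core_clock / 1e9   # blended: read + write  = 27.2 GB/s

# One quad can't saturate a channel even when blending...
print(f"One quad, INT8 blend:  {quad_blend_bw:.1f} GB/s vs {channel_bw:.1f} GB/s per channel")
# ...while two quads already want more bandwidth than the channel can deliver,
# so the extra ROP throughput buys little in bandwidth-heavy cases.
print(f"Two quads, INT8 blend: {2 * quad_blend_bw:.1f} GB/s -> bandwidth-limited")
```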
Of course, but if we move to open standards like DirectCompute and OpenCL, the impact of being the first mover won't matter too much. If some of the features of Fermi start showing advantages in software, ATI will match them. If their application has a big market, the dev will do a little tweaking on ATI hardware.

Well yeah, they must know that if they succeed in creating new markets then ATI can join the dance later on with competitive hardware. But that's not the whole strategy. Being a first mover has advantages in a market where intangibles like reputation and support are important. And it goes without saying that software is still king.
I knew someone was going to bring this up.

Untrue. See Folding@Home, where GT200b stomps all over anything ATI has, including Hemlock. Math rate isn't the only factor in solving any problem, GPGPU or not.
Both got a GPU2 client written for them once. ATI got it written during the R600 era, which is architecturally very different from Cypress, and NVidia got it written during the G80 era, which is very similar to GT200. There have been some tweaks to support new hardware since then, but that's it.

Did NV need to "completely re-write" F@H to get such good performance? Whatever they did, they did it right, and willingly.
The flops calculation already accounts for clock speeds.
Well, NV40 was a hella lot more interesting than R420. Too bad that R400 didn't work out. I think G70 was originally NV48, before they decided it would be their "new" product line and so gave it a fresh number.

G7x was a barebones PS3.0 implementation while ATI designed R5xx's shader core for fast dynamic branching, which never showed up in games at the time.
The VLIW cross-channel dependencies don't really matter to you if you are writing scalar code in C, though... only the compiler has to worry about them. The branch granularity gets worse, of course, but if you just want to write scalar code, that's what you have to live with.

It does have more overhead. But it makes life a lot easier.
True, but NV40's implementations of PS3.0 and VS3.0 were almost useless due to bad branching and bad vertex texturing. IMO the biggest blunder ATI made with R420 was omitting FP16 blending; all the HDR effects of that era could have been done with that coupled with PS2.0. However, I have to give props to NVidia for NV43, as that was a blazing fast midrange card.

Well, NV40 was a hella lot more interesting than R420. Too bad that R400 didn't work out. I think G70 was originally NV48, before they decided it would be their "new" product line and so gave it a fresh number.
ATI did have a thing about adding nifty forward-looking features with R100 and R200, but their performance wasn't very competitive and image quality was fugly at times. They couldn't pull together all of the odds and ends into a really great product.
Vectorization for SIMD is much harder than for VLIW, though.
Not if your SIMD is a complete SIMD with gather and scatter, though (see LRBni for an example). CUDA does automatic vectorization for their SIMD because of this (it looks like scalar code, but the underlying architecture is actually a SIMD).
I've argued that ATI can do this with very minor modifications*, so I don't buy this. Besides, Cypress can already do some dependent operations between the channels (e.g. a*b*c takes 1 cycle and 2 slots instead of 2 cycles and 1 slot), so NVidia's advantage is almost gone. In fact, a scalar sequence of dependent adds or muls will be much faster on Cypress than on the similarly sized GF104, despite using only two of its four simple ALUs.

Getting back to the OP's question - this is one of the areas where NV designed their architecture to be more robust across a wider range of workloads. That cost them die area, power, design effort, etc. that isn't a huge benefit for graphics, but is very helpful elsewhere.