Sir Eric Demers on AMD R600

That's pretty interesting, as most reviews put this GPU far behind. But did you test with Cat 7.6, maybe? And I heard there are some problems with AF through CCC?
Nope, we didn't, as it was not available at launch time. And R600 was significantly fast only in plain vanilla mode, once again. With AF enabled, performance dropped as could be expected when putting 16 TMUs against 64.
 
OK, now let's see how this stands up to what Eric's ratios have to say.

Let's say there are games with a 10:1 ratio of ALU to TMU needs. For now AMD seems content with the same ratio of TMUs to ROPs as well, so that means a 10:1:1 ratio of ALUs to TMUs to ROPs. If FEAR is already bottlenecked by ROPs, and FEAR is a 7:1:1 ratio, then a 10:1:1 game will be beneficial to the R600 when AA is not activated, since AA is done through the shaders. That's a 30% increase of ALU usage over TMUs and ROPs. The R600 has the extra shader power to accommodate this, but the G80 also has higher utilization of its shader units, as noted by the synthetic shader tests, so where does that leave us? It's still something of an open question, since I don't have the tools to check it out.

But all of this assumes that fillrates and texture usage stay the same, and I don't think future games are going to stay the same in those two areas the way AMD wants us to believe; otherwise we would not see Oblivion (or other games) take such a hit when AF is activated, nor would we see hits like the ones in FEAR. Eric is correct in his assessment that workloads with ratios shifted towards ALU will benefit the R600, but only if the G80 becomes more bottlenecked in the shader department than the R600 is bottlenecked by TMUs or ROPs. So what are the fillrate and texturing differences? Well, the G80 has 56% more filtering ability and 17% more fillrate.
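To make that bookkeeping concrete, here's a minimal sketch of the ratio argument. The unit counts and clocks are the usual published figures, but treating every unit at peak throughput is my simplifying assumption; real utilization (above all of R600's 5-wide ALUs) will be lower:

```python
# Back-of-envelope bottleneck check for a given ALU:TEX:ROP workload.
# Peak rates are assumed; real utilization will be lower.

def bottleneck(workload, rates):
    """workload: ops per fragment per unit class; rates: peak ops/sec.
    The class needing the most time per fragment is the limiter."""
    times = {k: workload[k] / rates[k] for k in workload}
    limiter = max(times, key=times.get)
    return limiter, {k: round(t * 1e9, 3) for k, t in times.items()}  # ns

workload = {"ALU": 10, "TEX": 1, "ROP": 1}  # the hypothetical 10:1:1 shader

# R600-ish: 320 scalar lanes (64 units x 5-wide) at 742 MHz, 16 TMUs, 16 ROPs.
r600 = {"ALU": 320 * 742e6, "TEX": 16 * 742e6, "ROP": 16 * 742e6}
# G80-ish: 128 scalar ALUs at 1350 MHz, 64 filter units and 24 ROPs at 575 MHz.
g80 = {"ALU": 128 * 1350e6, "TEX": 64 * 575e6, "ROP": 24 * 575e6}

print("R600:", bottleneck(workload, r600))
print("G80: ", bottleneck(workload, g80))
```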

A game like FEAR on the G80 is very well balanced across fillrate, filtering, and bandwidth: when we increase the core clock we get a very nice, roughly proportional increase in frame rate. Now, if we instead increase the shader core clock, we get an increase in overall frame rates, but to a lesser degree (the beauty of clock domains is that they help us test these things out). So the ALUs aren't as bottlenecked as the fillrate and/or the filtering, even on the G80 with its increased fillrate and filtering power. That leads us to believe that even at 10:1:1 the shift will help the R600, but it will still be at a disadvantage because of the areas where it lags the competition. Now, at 20:1 I agree it should do much better on the R600, but take what we have now and cut the performance in half, because that's how hard such a workload will hit these cards. These cards won't be able to handle that.
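That clock-domain experiment is easy to formalize. A small sketch of how I read the scaling, where the FPS numbers are invented purely for illustration:

```python
# Probe a clock domain: raise it, hold the others, and see how much of the
# clock increase shows up in frame rate. All FPS figures below are made up.

def sensitivity(fps_base, fps_oc, clock_base, clock_oc):
    """Fraction of the clock gain reflected in frame rate:
    ~1.0 means that domain is the bottleneck, ~0.0 means it isn't."""
    fps_gain = fps_oc / fps_base - 1.0
    clock_gain = clock_oc / clock_base - 1.0
    return round(fps_gain / clock_gain, 2)

# Hypothetical FEAR-like results on a G80 with separate core/shader domains:
print(sensitivity(60.0, 67.0, 575, 650))    # core (TMU/ROP) domain: ~0.9
print(sensitivity(60.0, 61.5, 1350, 1500))  # shader domain: ~0.2
```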

Now let's cut through all of this and look at some empirical proof:

http://iiswc.org/iiswc2006/IISWC2006P1.2.pdf

Page 22

Oblivion has a 10.4 ALU to TEX ratio. Where is the R600 advantage when such high ALU to TEX ratios are already being seen?

One missing acknowledgement I see is that the bottlenecks aren't even talked about...
 
Hmm... there are a lot of other considerations here; it's not so simple, otherwise you would see a 50 to 60% advantage in favor of G80 in all non-ALU-limited cases (if there are no other bottlenecks).
I.e. if you use FP16 texture filtering or higher, G80's fill rate should decrease, but as far as I can understand, the fill rate should stay the same on R600 (there are 16 TFUs, but they are all 32-bit capable). Again, I saw no tests online about this, so I don't know if this is really the case. And it's difficult to compare the ALU-to-TEX ratio between R580 and R600, because an R600 ALU is 5D superscalar and R580's is not, so we should see which instructions are really used to make a fair comparison between the two. Another point is that I don't know if the ALU:TEX ratio shown in the document is an average over the scene, an average over the shaders used, or refers to a single shader only; maybe only the first option could lead to some conclusion.
If 80% of the math power were really unused, then there should be absolutely no advantage of R580 relative to R520, which is not the case.
The DX10 environment, moreover, is still an unknown world.
 
Well, maybe it depends on the type of shader workload and dependency chain to the texturing.
And don't forget that in a unified-type GPU you should take the vertex workload into consideration when evaluating those multiple ratios.
 
OK, now let's see how this stands up to what Eric's ratios have to say.
...

Now let's cut through all of this and look at some empirical proof:

http://iiswc.org/iiswc2006/IISWC2006P1.2.pdf

Page 22

Oblivion has a 10.4 ALU to TEX ratio. Where is the R600 advantage when such high ALU to TEX ratios are already being seen?

One missing acknowledgement I see is that the bottlenecks aren't even talked about...

Demirug has criticized this kind of "games are so ALU heavy" proof: they compare the TEX vs. ALU ratio, but the more interesting thing is to compare the number of cycles for each instruction.
See both small tables on page 22 (UT2004, D3 & Q4), or Table XIII: http://personals.ac.upc.edu/vmoya/docs/IISWC-Workload.pdf
"These API statistics are not enough though to determine the real ratio between shader ALU processing and texture processing. There is a dynamic component in texture processing related with the implementation of the bilinear, trilinear and anisotropy filtering algorithms. The texture processing throughput of modern GPUs is fixed to 1 bilinear sample per cycle and fragment pipeline and better than bilinear filter algorithms take additional throughput cycles to complete (1 more for trilinear, up to 32 more with a 16 sample anisotropy filtering algorithm, see [28]). TABLE XIII shows firstly the average number of bilinear samples required per texture request for the three simulated benchmarks. Then, when combining these data with the corresponding TABLE XII ratios, the ratio between shader ALU processing and texture processing (bilinear requests) is below 1 and therefore the disbalanced architectures would not be able to efficiently use their increased shader processing power."
 
Well, the G80 has 56% more filtering ability and 17% more fillrate.
Those numbers are wildly wrong: the 8800 GTS-640 has 102% more AF capability, 68% more colour+AA fillrate, and 237% more Z-only rate.
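For what it's worth, those percentages reconstruct from the usual per-clock figures; the per-ROP sample rates below are my assumptions about what's being counted, so treat this as a sanity check rather than gospel:

```python
# Reconstructing the 102% / 68% / 237% figures. Per-ROP sample rates are
# assumptions: G80 at 4 colour+AA and 8 Z-only samples per ROP per clock,
# R600 at 2 samples per ROP per clock.

def pct_more(a, b):
    return round((a / b - 1) * 100)

# 8800 GTS-640: 500 MHz core, 48 bilinear filter units, 20 ROPs.
gts_af  = 48 * 500e6
gts_caa = 20 * 4 * 500e6
gts_z   = 20 * 8 * 500e6

# HD 2900 XT: 742 MHz, 16 TMUs, 16 ROPs.
r600_af  = 16 * 742e6
r600_caa = 16 * 2 * 742e6
r600_z   = 16 * 2 * 742e6

print(pct_more(gts_af, r600_af))    # ~102% more AF capability
print(pct_more(gts_caa, r600_caa))  # ~68% more colour+AA fillrate
print(pct_more(gts_z, r600_z))      # ~237% more Z-only rate
```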

Also, R600 is quite happy with Oblivion:

http://www.hardware.fr/articles/671-14/ati-radeon-hd-2900-xt.html

What's continually entertaining about this is that people are unwilling to ask why G80 is so wildly wasteful of its texturing and raster capabilities, being massively over-specified for the performance it delivers.

Jawed
 
What's continually entertaining about this is that people are unwilling to ask why G80 is so wildly wasteful of its texturing and raster capabilities, being massively over-specified for the performance it delivers.
How balanced you are depends on the size of your units... Consider that also in the context that the ALU-TEX (and TEX-ROP, etc.) ratio varies from game to game, from frame to frame, and within a single frame.

If NVIDIA's engineers have smaller/more efficient TMUs than AMD's, then their ideal ratio for peak perf/mm² on today's games might be different from AMD's. The same is true the other way around: it would be unfair to say that R580's number of ALUs was "wasteful" when, clearly, they were quite damn cheap in terms of transistor count.

I do agree that NVIDIA's ALU ratio is too low though, and their triangle setup performance is also too low given everything else. In a Z-only or shadowmap pass, G80 must be so horribly triangle setup-limited that it's not even funny. Similarly, R600 and RV630 must be horribly Z-limited in the same scenarios.
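A rough illustration of the setup argument, assuming (and it is an assumption) setup at ~1 triangle per core clock and G80's full 24-ROP Z-only rate:

```python
# If setup peaks at ~1 triangle/clock while the ROPs can reject/write many
# Z-only samples per clock, small triangles leave the Z hardware idle.

setup_tris_per_clock = 1        # assumed G80 setup rate
z_samples_per_clock = 24 * 8    # 24 ROPs x 8 Z-only samples/clock (assumed)

# Triangles covering fewer samples than this cannot keep the Z units fed:
print(z_samples_per_clock / setup_tris_per_clock)  # 192 samples/triangle
```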

If NV and AMD got all their ratios and all their units absolutely perfect on their first try, then where would be the fun in discussing all this?! :)
 
Demirug has criticized this kind of "games are so ALU heavy" proof: they compare the TEX vs. ALU ratio, but the more interesting thing is to compare the number of cycles for each instruction.
See both small tables on page 22 (UT2004, D3 & Q4), or Table XIII: http://personals.ac.upc.edu/vmoya/docs/IISWC-Workload.pdf


That is true, but then Eric's mention of ratios falls into the same category; even worse, he was talking about specific shaders, not in-game scenes. My point is that without looking at the bottlenecks, the ALU to TEX ratio by itself is a waste.

Jawed, Oblivion is heavily TMU bottlenecked. If we turn on nothing but AF, you will see the X2900 take a 30% or greater hit at the higher levels of AF.

http://www.computerbase.de/artikel/..._hd_2900_xt/8/#abschnitt_aa_und_af_skalierung

Now, I don't think nV really needs that many TMUs, but it does help it out.

Oops, Oblivion isn't shown here, but I remember a website doing an AF test with Oblivion.

Edit: btw, those are the same paper (the one you linked to and the one I linked earlier), just with different layouts.
 
How balanced you are depends on the size of your units... Consider that also in the context that the ALU-TEX (and TEX-ROP, etc.) ratio varies from game to game, from frame to frame, and within a single frame.
And, sadly, we're still in a world where all credence is given to average frame rates. It makes my blood boil. The distortions brought about by silly-high max-FPS have no place in architectural analysis.

[...] it would be unfair to say that R580's number of ALUs was "wasteful" when, clearly, they were quite damn cheap in terms of transistor count.
Oh, I agree completely. Even when you take into account the associated register file to support gazillions of fragments in flight, R580's ALUs were a low-cost upgrade. I'm still sure ATI could have produced the same performance in the same games with 32 ALU pipes, not 48, but the difference in die size would have barely been worth discussing.

I do agree that NVIDIA's ALU ratio is too low though, and their triangle setup performance is also too low given everything else. In a Z-only or shadowmap pass, G80 must be so horribly triangle setup-limited that it's not even funny. Similarly, R600 and RV630 must be horribly Z-limited in the same scenarios.
The most damning aspect of R6xx, for me, is the fact that ATI put together all that bandwidth and then left it on the table. Now is the time for 4xAA samples per loop, not in another 2 years' time. NVidia got this right, even if only just about so. Xenos was a huge misdirection. ARGH.

If NV and AMD got all their ratios and all their units absolutely perfect on their first try, then where would be the fun in discussing all this?! :)
I can't help thinking that R600 looks like an autumn 2005/spring 2006 GPU that was just a bit too ambitious for the process it was due to come out on... Vista slippage took the pressure off ATI. Eric has said himself that it was never designed to be 2x the performance of R5xx (even if it occasionally gets there).

I would really like to see NVidia go to a 4-SIMDs per cluster design (or double the width of each SIMD). It would also be really good if they dropped the "free trilinear" TMU architecture - G84 seems quite happy. I think G80's problem is that the thread scheduling costs so much that they had to cap the number of clusters, so bumped up the TMUs to compensate, but in doing so had to move "NVIO" off-die, because TMU-scaling is very coarse-grained.

ATI, meanwhile, was determined to implement an architecture that will run until "D3D12" - I'm referring to the virtualisation and threading model ...

Jawed
 
I.e. if you use FP16 texture filtering or higher, G80 fill rate should decrease, but as far as I can understand, the fill rate should stay the same in R600 (there are 16 TFU, but they are all 32-bit capable).
With FP16 and FP32, the texturing rate is always twice as fast on G80 as it is on R600, AFAIK.
 
I can't help thinking that R600 looks like an autumn 2005/spring 2006 GPU that was just a bit too ambitious for the process it was due to come out on... Vista slippage took the pressure off ATI. Eric has said himself that it was never designed to be 2x the performance of R5xx (even if it occasionally gets there).
It seems to me that part of R600 is designed to be 2x the performance of R5xx (512 bit bus, more ALUs with increased scheduling flexibility, etc.) while other parts are not.
ATI, meanwhile, was determined to implement an architecture that will run until "D3D12" - I'm referring to the virtualisation and threading model ...
That's good IF this extra complexity doesn't take up a significant area of the chip, but I'm not really sure about that.
 
With FP16 and FP32, the texturing rate is always twice as fast on G80 as it is on R600, AFAIK.

Hmm... FP16, OK, that's 2x performance per clock, but according to the B3D article

http://www.beyond3d.com/content/reviews/1/8

FP32 filtering should be half the FP16 rate.
I remember ATI stated each TFU in R600 can fetch one FP32 value per clock, so R600's total FP16 rate should be 64% of G80's, and its FP32 filtering rate 129%.
Not that I've heard of any game using heavy FP32 texture filtering, though...
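For reference, here is how those two percentages fall out of the per-clock assumptions in this thread (G80 filtering FP16 at half its INT8 rate and FP32 at half its FP16 rate; R600's 16 TFUs taken at full rate for both, as ATI's statement suggests):

```python
# The 64% / 129% figures, reconstructed from assumed per-clock rates.

g80_clock, r600_clock = 575e6, 742e6

g80_fp16 = 32 * g80_clock    # 64 filter units at half rate for FP16
g80_fp32 = 16 * g80_clock    # half again for FP32
r600_fp16 = 16 * r600_clock  # 16 TFUs at full rate
r600_fp32 = 16 * r600_clock  # one FP32 fetch per TFU per clock, as stated

print(round(r600_fp16 / g80_fp16 * 100))  # ~65% of G80's FP16 rate
print(round(r600_fp32 / g80_fp32 * 100))  # ~129% of G80's FP32 rate
```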
 
Hmm... FP16, OK, that's 2x performance per clock, but according to the B3D article

http://www.beyond3d.com/content/reviews/1/8

FP32 filtering should be half the FP16 rate.
I remember ATI stated each TFU in R600 can fetch one FP32 value per clock, so R600's total FP16 rate should be 64% of G80's, and its FP32 filtering rate 129%.
Not that I've heard of any game using heavy FP32 texture filtering, though...
Fetching != filtering. AFAIK R600 filters FP32 textures at half the FP16 rate, so it's still twice as slow as G80 on a per-clock basis.
R600's strength (or weakness, imho) is being able to filter FP16 and RGBA8 textures at the same rate.
 
Fetching != filtering. AFAIK R600 filters FP32 textures at half the FP16 rate, so it's still twice as slow as G80 on a per-clock basis.
R600's strength (or weakness, imho) is being able to filter FP16 and RGBA8 textures at the same rate.


And how about FP64?

Is this what we see when HDR is activated in Far Cry, where the R600 gets hurt?
 
And how about FP64?

Is this what we see when HDR is activated in Far Cry, where the R600 gets hurt?
My terminology is not always correct... by FP16 (or FP32) I meant FP16 (or FP32) per-component textures. I'm taking for granted that filtering runs at full speed even when all colour components are being filtered, but I could be completely wrong on that ;) (and probably I am...)
 
It seems to me that part of R600 is designed to be 2x the performance of R5xx (512 bit bus, more ALUs with increased scheduling flexibility, etc.) while other parts are not.

That's good IF this extra complexity does not take a relevant area of the chip, but I'm not really sure about that
Generally I think of R600 as a "sacrificial" GPU, in much the same way as R520 was: get the architecture going, prepare the ground for the next revision and get drivers/developers on the ramp. What irks me about R600 is that I don't see it being refreshed/replaced by something worth buying any time soon, whereas R580 was a great bit of kit.

Jawed
 
I don't believe in masochistic GPU design, I believe in wrong design decisions.
 