R6XX Performance Problems

That IMHO paints a pretty clear picture: the R600/RV670 is definitely suffering from a lack of texture filtering capacity at times, and I'd guess that if it had, say, 24 TMUs instead of 16, that would greatly help it be more competitive in those situations.
Crysis is even worse:
http://www.bit-tech.net/hardware/2007/11/02/nvidia_geforce_8800_gt/8

GTS and GT lose 16%, 2900XT loses 33%.

I see no value in playing without AF, as reducing resolution and enabling it provides more detail and higher framerates even on ATI.

On the plus side, ATI seems to have better memory management. A 512MB card from NVidia experiences a more precipitous drop at high res and AA settings than one from ATI, and we see this in many games.
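For a rough sense of the scale of the filtering gap, here's a back-of-the-envelope bilinear fillrate comparison (a sketch only: the unit counts and clocks below are the commonly quoted reference specs, assumed here purely for illustration):

Code:
# Rough peak bilinear INT8 texel fillrate = filtering units * core clock.
# Unit counts and clocks are assumed reference specs, not measurements.
cards = {
    "8800GT (G92, 56 TMUs)":    (56, 600),
    "HD2900XT (R600, 16 TMUs)": (16, 742),
    "HD3870 (RV670, 16 TMUs)":  (16, 775),
}
for name, (tmus, mhz) in cards.items():
    print(f"{name:26s} {tmus * mhz / 1000:5.1f} GTexels/s")

On those assumptions the G92 part has roughly three times the bilinear filtering throughput, which lines up with AF hurting the R6xx parts disproportionately.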
 
Crysis is even worse:
http://www.bit-tech.net/hardware/2007/11/02/nvidia_geforce_8800_gt/8

GTS and GT lose 16%, 2900XT loses 33%.
Well, Gothic is clearly worse than that by comparison: the 8800GT loses 4%, and the RV670 32%.
But in any case, there seems to be general agreement that the RV670 is indeed, at least sometimes, quite performance-limited by its relatively low texture filtering capacity.
On the plus side, ATI seems to have better memory management. A 512MB card from NVidia experiences a more precipitous drop at high res and AA settings than one from ATI, and we see this in many games.
I wonder though if (and if so, by how much) this could be improved with better drivers. Maybe the RV670 indeed has better hardware memory management (maybe even full memory virtualization, e.g. it can swap pages between main memory and card memory based on usage patterns?), but it's possible NVIDIA just didn't optimize very well for such low-memory situations.
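As a purely illustrative sketch of what "swap pages between main memory and card memory based on usage patterns" could mean, here's a toy LRU pager. Everything in it (the class, its names, the policy) is hypothetical; real driver/hardware memory virtualization involves page tables, DMA and residency tracking, none of which is modeled:

Code:
from collections import OrderedDict

class VramPager:
    """Toy model: a fixed number of VRAM pages, LRU-evicted to system memory."""
    def __init__(self, vram_pages):
        self.capacity = vram_pages
        self.resident = OrderedDict()   # page id -> data, kept in LRU order
        self.system = {}                # pages evicted to system memory

    def touch(self, page, data=None):
        if page in self.resident:
            self.resident.move_to_end(page)      # mark most recently used
        else:
            if page in self.system:
                data = self.system.pop(page)     # swap back into VRAM
            if len(self.resident) >= self.capacity:
                victim, vdata = self.resident.popitem(last=False)
                self.system[victim] = vdata      # evict least recently used
            self.resident[page] = data
        return self.resident[page]

pager = VramPager(vram_pages=2)
pager.touch("texA", "..."); pager.touch("texB", "...")
pager.touch("texC", "...")   # texA, least recently used, moves to system memory
assert "texA" in pager.system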
 
That IMHO paints a pretty clear picture: the R600/RV670 is definitely suffering from a lack of texture filtering capacity at times, and I'd guess that if it had, say, 24 TMUs instead of 16, that would greatly help it be more competitive in those situations.
You can't just tack 8 extra TMUs onto this architecture. Trilinear filtering for INT8 would be more feasible.
 
mczak said:
That IMHO paints a pretty clear picture: the R600/RV670 is definitely suffering from a lack of texture filtering capacity at times, and I'd guess that if it had, say, 24 TMUs instead of 16, that would greatly help it be more competitive in those situations.
So, I don't think there's much of a "mystery" for these "performance problems", though I'd definitely like to see more benchmarks investigating this.

In R600 each sampler unit can bilinear filter 4x INT8 or 4x FP16! I really wish it could have doubled the number of INT8 bilerps, so... 8x INT8 or 4x FP16.
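Taking those per-clock rates at face value, the arithmetic looks like this (4 (x4) sampler units and the 742 MHz R600 reference clock are assumptions here):

Code:
# Throughput implied by "each sampler unit does 4 bilerps per clock".
sampler_units, bilerps_each, mhz = 4, 4, 742       # assumed R600 configuration
rate = sampler_units * bilerps_each * mhz / 1000   # GTexels/s
print(f"as quoted (4x INT8/FP16): {rate:.1f} GTexels/s")
print(f"wished-for 8x INT8      : {rate * 2:.1f} GTexels/s")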
 
You can't just tack 8 extra TMUs onto this architecture. Trilinear filtering for INT8 would be more feasible.
Didn't want to imply it was that simple. But unlike the GF8, where the TMUs are directly tied to the shaders, it should indeed be possible in theory to change the TEX:ALU ratio - though it seems you can't increase the TEX:ALU ratio without a hit to branch granularity. (I'm not sure if the design can handle shader clusters without a power-of-two number of shader units, though I think this is how the 2900GT works - or does the 2900GT have 3 shader clusters with 16 units rather than 4 clusters with 12? In that case the 2900GT should still have 16 TMUs.)
I think this configuration should be possible (with a similar transistor count to what the RV670 has?):
5 (x4) samplers, and 3 shader clusters with 20 shader units (so 5 quads) each - this would reduce the number of shader units from 64 to 60, or 320 to 300 if you like. Maybe other changes would be needed too (e.g. enough interpolators to feed the extra TMUs).
Granted, I can't come up with a sensible design with 24 texture units right now - it would require either 2 shader clusters with 24 units each, which probably has too low an ALU:TEX ratio (and quite bad branching performance already) to be useful, or 3 clusters with 24 units each, which clearly would require more transistors (more TMUs than the RV670 and more shader units in total).
That said, I'm sure AMD has looked into what the optimal arrangement should be, and took the one which made the most sense overall (given the approximate transistor count).
Doubling only the filtering units (at least for INT8) may thus indeed make more sense, but of course that's not free in terms of transistor count either.
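Just to sanity-check the arithmetic on the configurations floated above, here's a quick sketch (the mapping of one (x4) sampler per quad position, shared across the clusters, is taken from the posts themselves):

Code:
# Unit counts for the R6xx-style configurations discussed in this thread.
def config(clusters, quads_per_cluster):
    units = clusters * quads_per_cluster * 4   # 4 shader units per quad
    tmus = quads_per_cluster * 4               # one (x4) sampler per quad slot
    return units, units * 5, tmus              # units, "SPs" (5-wide VLIW)

for name, c, q in [("RV670 as shipped", 4, 4),
                   ("3 clusters x 20 units", 3, 5),
                   ("2 clusters x 24 units", 2, 6),
                   ("3 clusters x 24 units", 3, 6)]:
    units, sps, tmus = config(c, q)
    print(f"{name:22s} units={units:2d} SPs={sps:3d} TMUs={tmus:2d} "
          f"ALU:TEX={units / tmus:.1f}")

The 3x20 case indeed gives 60 units (300 SPs) with 20 TMUs, and the 3x24 case ends up bigger than the RV670 on both counts, as stated.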
 
I wonder though if (and if so, by how much) this could be improved with better drivers. Maybe the RV670 indeed has better hardware memory management (maybe even full memory virtualization, e.g. it can swap pages between main memory and card memory based on usage patterns?), but it's possible NVIDIA just didn't optimize very well for such low-memory situations.

I think it's more of a bandwidth / access latency problem. The HD3870 has much more BW than the 8800GT, and the ringbus is theoretically superior when it comes to latencies (never seen a measurement though).

Didn't want to imply it was that simple. But unlike the GF8, where the TMUs are directly tied to the shaders, it should indeed be possible in theory to change the TEX:ALU ratio

In the R6xx, the TMUs are not hardwired to the shaders, but they're still bound logically - so TMU unit #1 always serves shader cluster #1, and so on. While it may be possible to change this, the R6xx architecture seems less flexible in these matters than the competition.
I see doubling the texture filtering units as the only viable alternative - I was hoping they'd do it in the RV670, until I saw the transistor count.

I wonder if AMD could do the same trick with the shader core clock as nVidia - halving the number of VLIW units would definitely make some room for other enhancements.
 
halving the number of VLIW units would definitely make some room for other enhancements.
Sounds like a good way to either destroy batch performance or square compiler/scheduling complexity :oops:
 
I think it's more of a bandwidth / access latency problem. The HD3870 has much more BW than the 8800GT, and the ringbus is theoretically superior when it comes to latencies (never seen a measurement though).

I would think the opposite - a crossbar should have better average access latency.
 
I think it's more of a bandwidth / access latency problem. The HD3870 has much more BW than the 8800GT
I'm not really convinced of that. The HD3850 has less RAM bandwidth (and only half the memory) than the 8800GT, but it generally seems to behave better in this regard - no sharp dropoffs.
Look for instance at this benchmark: http://techreport.com/articles.x/13603/6 - the 8800GTS 320 trails the 8800GTS 640 by a factor of two, and the 3850 (with even less RAM than the 8800GTS 320!) easily beats it because of that (and definitely has a much smaller performance drop relative to the 3870 - in fact, without AA there doesn't seem to be any drop due to insufficient RAM at all).
In the R6xx, the TMUs are not hardwired to the shaders, but they're still bound logically - so TMU unit #1 always serves shader cluster #1, and so on.
While it may be possible to change this, the R6xx architecture seems less flexible in these matters than the competition.
No, that's not true. They aren't bound to the shader clusters. You can increase the shader cluster count to whatever you want while keeping the sampler count intact (example: the RV630 has 3 shader clusters, 2 (x4) samplers). As I said, the samplers are however bound to the quads inside the clusters (hence my theoretically possible GPU with 5 (x4) sampler units and 3 shader clusters of 20 units each). Well, according to the B3D R600 article anyway :).

I wonder if AMD could do the same trick with the shader core clock as nVidia - halving the number of VLIW units would definitely make some room for other enhancements.
Well, doubling the shader core clock certainly doesn't come for free - in theory you'd need pretty much twice the transistor count for twice the maximum achievable clock (if you want to retain the IPC). Maybe it's slightly less than that (because certain areas can probably tolerate a much higher clock already), but don't expect miracles.
 
The texture units are more like another shader SIMD themselves. They run in parallel with the shader SIMDs and serve the quads across them. This is where we get the two-dimensional scalability from.
 
The texture units are more like another shader SIMD themselves. They run in parallel with the shader SIMDs and serve the quads across them. This is where we get the two-dimensional scalability from.
Yes, but since each sampler feeds one quad within a shader SIMD, the TEX:ALU ratio is constant there. The only way to change that ratio is to increase or decrease the number of shader clusters, if I interpret that architecture article correctly. Or is that what you're saying?
 
Crossbar vs. ring bus: as far as I remember the high-level operation of the ring bus, it was mentioned that on average data has to travel half the ring, as the bus is bi-directional. Doesn't that translate to better average latency when a mass of data is transferred? Please enlighten me if I'm totally wrong here.

Sounds like a good way to either destroy batch performance or square compiler/scheduling complexity :oops:

Why do you think that halving the batch size would also destroy performance?

I'm not really convinced of that. The HD3850 has less RAM bandwidth (and only half the memory) than the 8800GT, but it generally seems to behave better in this regard - no sharp dropoffs.

Well, if it's not BW or latency, then I can think of one last thing. I remember when I saw the first GTS 320 tests, it seemed odd that it ran out of memory earlier than its 256MB competitors. I put it down to some peculiar issue with the 320-bit memory bus, but it may be a common "feature" of the G8x series after all.

No, that's not true. They aren't bound to the shader clusters. You can increase the shader cluster count to whatever you want while keeping the sampler count intact (example: the RV630 has 3 shader clusters, 2 (x4) samplers).

There's a predefined route in each GPU - that's what I mean by bound - but my assumption about the lack of flexibility is obviously wrong, given the RV630 example. And now I'm even more confused why AMD didn't slot in at least two more quads in the RV670.

Well, doubling the shader core clock certainly doesn't come for free - in theory you'd need pretty much twice the transistor count for twice the maximum achievable clock (if you want to retain the IPC). Maybe it's slightly less than that (because certain areas can probably tolerate a much higher clock already), but don't expect miracles.

Can you elaborate here? I don't imagine you can get double the clock with the same number of transistors (if nothing else, you need more redundancy to maintain yields), but I don't see why you would need double.
 
Why do you think that halving the batch size would also destroy performance?
Hmm, I may have misunderstood the suggestion. I took it as adjusting the layout while keeping the same number of ALUs, i.e. either 2*32*(4+1)=320, which would be bad for batch size, or 8*8*(4+1)=320, which would make the already complex compilation worse.
I see now you're talking about 4*8*(4+1)=160, but at 2x clock = 320 effectively.
Still the same scheduling issue, and the RV670 is already smaller in transistors than the G92 - I think you'd be down to 8 texture units?
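Writing the layouts out shows they're all equivalent in raw ALU throughput (the (4+1)-wide VLIW and the 2x clock multiplier are as given in the posts), which is why the trade-off comes down to batch size and scheduling:

Code:
# Effective ALU throughput for the layouts discussed (clusters x units).
layouts = [
    ("4 x 16 (RV670 as shipped)", 4, 16, 1.0),
    ("2 x 32 (bigger batches)",   2, 32, 1.0),
    ("8 x 8 (harder scheduling)", 8,  8, 1.0),
    ("4 x 8 at 2x clock",         4,  8, 2.0),
]
for name, clusters, units, clk in layouts:
    sps = clusters * units * 5      # (4+1)-wide VLIW per unit
    print(f"{name:27s} SPs={sps:3d} effective={sps * clk:4.0f}")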
 
Crossbar vs. ring bus: as far as I remember the high-level operation of the ring bus, it was mentioned that on average data has to travel half the ring, as the bus is bi-directional. Doesn't that translate to better average latency when a mass of data is transferred? Please enlighten me if I'm totally wrong here.

Going even halfway around the ring will still take longer on average than going point-to-point on a crossbar.
 
Can you elaborate here? I don't imagine you can get double the clock with the same number of transistors (if nothing else, you need more redundancy to maintain yields), but I don't see why you would need double.
Consider how you do multiplication by hand. You do a lot of single-digit muls, adding the results together, shifting, etc. The point is that subsequent stages depend on the results of previous ones, so you can't easily do that in parallel. If you implemented that directly in hardware, it would be cheap in terms of transistor count, but it would have far too many stages (logic gates) in series to be clocked fast, since you have to wait for the signal to propagate through all the gates in series (I won't delve into pipelined designs and such; it doesn't change the fundamental issue).

So consider another multiplier design: assume you have basic blocks which can do not just one-bit multiplication but, say, 4 bits at once. This would cut down your serial chain a lot. Of course, to really be faster, those 4-bit multipliers need fewer serial stages than chained 1-bit multipliers would have. It turns out that's easily possible (think, for instance, of hardwired tables for all possible 4-bit input combinations), but it comes at the expense of an increased logic gate count.
That's going a bit too off-topic though, if you're interested in this type of stuff I suggest you read a book on logic design.
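A minimal shift-add multiplier makes the serial dependency visible (illustrative only; a real fast multiplier would spend gates on e.g. wider basic blocks or carry-save adder trees to shorten the chain, which is exactly the transistors-for-clock trade-off described above):

Code:
# Shift-add multiplication, like the "by hand" method: each partial-
# product add depends on the previous accumulator value, so an N-bit
# multiply needs roughly N dependent add stages in series.
def shift_add_mul(a, b, bits=8):
    acc = 0
    for i in range(bits):          # one dependent stage per bit of b
        if (b >> i) & 1:
            acc += a << i          # must wait for the previous acc
    return acc

assert shift_add_mul(13, 11) == 143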
 
Crossbar vs. ring bus: as far as I remember the high-level operation of the ring bus, it was mentioned that on average data has to travel half the ring, as the bus is bi-directional. Doesn't that translate to better average latency when a mass of data is transferred? Please enlighten me if I'm totally wrong here.

If you want to cross a ring from the middle of the top edge to the middle of the bottom edge, you'll travel 1/2L + L + 1/2L = 2 die lengths.

With a crossbar, you could go straight through the chip -> 1L.

Now that's the best-case situation for a crossbar, but you get the idea.

In addition, data on the entry ramps of a ring can be stalled by completely unrelated data that happens to already be in that slot, which adds more latency. In a fully populated crossbar, this can never happen.

In general, from a logical architecture and area point of view, it's better to avoid rings, but there may be sufficient layout-related reasons not to, and if you have sufficient latency-hiding capability it may not matter.
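A toy hop-count model of the two topologies (N evenly spaced ring stops assumed, the shorter direction always taken, and the crossbar idealized as a single hop; the entry-ramp stalling just mentioned is ignored):

Code:
# Average source-to-destination distance, in "stop" hops.
def ring_avg_hops(n):
    dists = [min(d, n - d)
             for src in range(n)
             for d in ((dst - src) % n for dst in range(n) if dst != src)]
    return sum(dists) / len(dists)

for n in (4, 8, 16):
    print(f"{n:2d} stops: ring avg {ring_avg_hops(n):.2f} hops vs crossbar 1 hop")

The ring's average works out to roughly N/4 hops, growing with the number of stops, while the idealized crossbar stays constant.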
 
R600 was EOL'd and achieved practically zero market penetration. I fail to see the fascination with it. RV670 appears to be (mostly) more of the same, but is likely to achieve much higher market penetration thanks to its competitive pricing. Still won't hold a candle to GF8800 sales when all's said and done though, so why even bother?

Sorry, I'm just so down on AMD right now I just can't find much of anything positive to say about them or their products.
I'm a bit late to the discussion here, but I feel inclined to comment that you don't need to find something positive when trying to figure out why an architecture isn't performing as expected. Something that doesn't meet expectations can be just as fascinating technologically as something that exceeds them.

Yes, but since each sampler feeds one quad within a shader SIMD, the TEX:ALU ratio is constant there. The only way to change that ratio is to increase or decrease the number of shader clusters, if I interpret that architecture article correctly. Or is that what you're saying?
I think you're saying the same thing as Dave.
 
R600's Z-fill rate is too low.
Anyone want to elaborate a bit on exactly how important Z-fill is for current games? It seems to be the only aspect where the GeForce 8x00 series has a significant advantage in theoretical benchmarks.

[attached benchmark chart: 63.png]
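For scale, here's the theoretical Z-only fill gap (the per-clock Z rates below are figures commonly cited in reviews of the era, taken purely as assumptions here, not verified):

Code:
# Theoretical Z-only fill = Z samples per clock * core clock.
parts = {
    "R600 (HD2900XT)": (32, 742),   # cited as 16 ROPs x 2 Z per clock
    "G80 (8800GTX)":  (192, 575),   # cited as up to 192 Z-only per clock
}
for name, (z_per_clk, mhz) in parts.items():
    print(f"{name:16s} {z_per_clk * mhz / 1000:6.1f} GZ-samples/s")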
 