Quote:
Ah, I see. Silly percent normalization... Give me FPS, damnit!

I don't know what silly browser you're using, but I get FPS all the time. Place the mouse pointer over one of the cards, and all the data in the chart gets normalized to that card.
mczak said:
That imho paints a pretty clear picture: the R600/RV670 is definitely suffering from a lack of texture filtering capacity sometimes, and I'd guess that if it had, say, 24 TMUs instead of 16, that would greatly help it be more competitive in those situations.

Well, Gothic is comparatively clearly worse than that: the 8800GT loses 4%, and the RV670 32%. Crysis is even worse:
http://www.bit-tech.net/hardware/2007/11/02/nvidia_geforce_8800_gt/8
The GTS and GT lose 16%; the 2900XT loses 33%.
Quote:
On the plus side, ATI seems to have better memory management. A 512MB card from NVIDIA experiences a more precipitous drop at high resolution and AA settings than one from ATI, and we see this in many games.

I wonder, though, whether (and if so, by how much) this could be improved with better drivers. Maybe the RV670 really does have better hardware memory management (maybe even full memory virtualization, e.g. it can swap pages between main memory and card memory based on usage patterns?), but it's possible NVIDIA just didn't optimize very well for these low-memory situations.
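Purely as an illustration of the kind of usage-based swapping speculated about above, here's a minimal sketch of an LRU pager that migrates pages between card memory and host memory. The class, the numbers, and the policy are all made up for the example; this is not how any particular driver or GPU is known to behave.

```python
# Toy model: VRAM holds a fixed number of pages; on overflow, the least
# recently used page is evicted to host memory and the new one brought in.
from collections import OrderedDict

class VideoMemoryPager:                      # hypothetical, for illustration
    def __init__(self, vram_pages: int):
        self.vram_pages = vram_pages
        self.resident = OrderedDict()        # page id -> True, kept in LRU order
        self.swaps = 0

    def touch(self, page: int) -> None:
        """Record an access, migrating the page into VRAM if necessary."""
        if page in self.resident:
            self.resident.move_to_end(page)  # mark as most recently used
            return
        if len(self.resident) >= self.vram_pages:
            self.resident.popitem(last=False)  # evict LRU page to host RAM
            self.swaps += 1
        self.resident[page] = True           # page is now resident in VRAM

# An oversubscribed working set (512 pages on a 256-page card) thrashes badly
# under a sequential access pattern -- the sharp-dropoff scenario in question.
pager = VideoMemoryPager(vram_pages=256)
for frame in range(3):
    for page in range(512):
        pager.touch(page)
print("pages swapped:", pager.swaps)
```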
mczak said:
That imho paints a pretty clear picture: the R600/RV670 is definitely suffering from a lack of texture filtering capacity sometimes, and I'd guess that if it had, say, 24 TMUs instead of 16, that would greatly help it be more competitive in those situations. So, I don't think there's much of a "mystery" behind these "performance problems", though I'd definitely like to see more benchmarks investigating this.

You can't just tack 8 extra TMUs onto this architecture. Trilinear filtering for Int8 would be more feasible.
Quote:
You can't just tack 8 extra TMUs onto this architecture. Trilinear filtering for Int8 would be more feasible.

Didn't want to imply it was that simple. But unlike the GF8, where the TMUs are directly tied to the shaders, it should indeed be possible in theory to change the TEX:ALU ratio - though it seems you can't increase the TEX:ALU ratio without a hit to branch granularity. (I'm not sure, though, whether the design can handle shader clusters without a power-of-two number of shader units - though I think this is how the 2900GT works. Or does the 2900GT have 3 shader clusters with 16 units rather than 4 clusters with 12? In that case the 2900GT should still have 16 TMUs.)
Quote:
...halving the number of VLIW units would definitely make some room for other enhancements.

Sounds like a good way to either destroy batch performance or square the compiler/scheduling complexity.
I think it's more of a bandwidth / access latency problem. The HD3870 has much more bandwidth than the 8800GT, and the ring bus is theoretically superior when it comes to latencies (I've never seen a measurement, though).
Quote:
...and the ring bus is theoretically superior when it comes to latencies (I've never seen a measurement, though).

Quite the opposite, probably. A point-to-point crossbar will likely have better (lower) latency. The point of the ring bus is distribution.
Quote:
I think it's more of a bandwidth / access latency problem. The HD3870 has much more bandwidth than the 8800GT...

I'm not really convinced of that. The HD3850 has less memory bandwidth (and only half the memory) than the 8800GT, but it generally seems to behave better in this regard - no sharp dropoffs.
Quote:
In the R6xx, the TMUs are not hardwired to the shaders, but they're still bound logically - so TMU unit #1 always serves shader cluster #1, and so on. While it may be possible to change this, the R6xx architecture seems less flexible in these matters than the competition.

No, that's not true. They aren't bound to the shader clusters. You can increase the shader cluster count to whatever you want while keeping the sampler count intact (example: the RV630 has 3 shader clusters and 2 (x4) samplers). As I said, the samplers are, however, bound to the quads inside the clusters (hence my theory of a theoretically possible GPU with 5 (x4) sampler units and 3 shader clusters (x20x5 units)). Well, according to the B3D R600 article, anyway.
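To put rough numbers on the configurations being discussed: the counts below are the ones cited in this thread and the B3D article (5 ALUs per VLIW unit, 4 texels per sampler unit per clock); the last row is the hypothetical part sketched above, not a real product.

```python
# TEX:ALU arithmetic for the cluster/sampler combinations mentioned above.
configs = [
    # (name, shader clusters, VLIW units per cluster, x4 sampler units)
    ("R600/RV670",   4, 16, 4),
    ("RV630",        3,  8, 2),
    ("hypothetical", 3, 20, 5),
]
for name, clusters, units, samplers in configs:
    alus = clusters * units * 5   # 5 ALUs per VLIW unit
    tex = samplers * 4            # each sampler unit handles a quad (4 texels)
    print(f"{name:12s}: {alus:3d} ALUs, {tex:2d} tex/clk, TEX:ALU = {tex/alus:.3f}")
```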
Quote:
I wonder if AMD could do the same trick with the shader core clock as nVidia - halving the number of VLIW units would definitely make some room for other enhancements.

Well, doubling the shader core clock certainly doesn't come for free - in theory you'd need pretty much twice the transistor count for twice the maximum achievable clock (if you want to retain the IPC). Maybe it's slightly less than that (because certain areas can probably already tolerate a much higher clock), but don't expect miracles.
Quote:
The texture units are more like another shader SIMD themselves. They run in parallel with the shader SIMDs and serve the quads across them. This is where we get the two-dimensional scalability from.

Yes, but since each sampler feeds one quad within a shader SIMD, the TEX:ALU ratio is constant there. The only way to change that ratio is to increase or decrease the number of shader clusters, if I interpret that architecture article correctly. Or is that what you're saying?
Quote:
Why do you think that halving the batch size also would destroy performance?

Hmm, I may have misunderstood the suggestion. I took it as adjusting the layout while keeping the same number of ALUs, i.e. either 2*32*(4+1) = 320, which would be bad for batch size, or 8*8*(4+1) = 320, which would make the already complex compilation worse.
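For concreteness, here are those three 320-ALU layouts side by side, with the batch size each would imply if, as on R600, a batch is the SIMD width times 4 clocks (an assumption carried over from this discussion, not a spec):

```python
# Same ALU budget, different shapes: SIMD count x VLIW units x ALUs per unit.
layouts = [
    (4, 16, 5),   # actual R600 layout
    (2, 32, 5),   # fewer, wider SIMDs: batch size doubles (worse for branching)
    (8,  8, 5),   # more, narrower SIMDs: more parallel units to schedule for
]
for simds, units, alus in layouts:
    total = simds * units * alus
    batch = units * 4             # threads per batch at 4 clocks per instruction
    print(f"{simds} x {units:2d} x {alus} = {total} ALUs, batch = {batch:3d} threads")
```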
Crossbar vs. ring bus: as far as I remember the high-level operation of the ring bus, it was mentioned that data has to travel half the ring on average, since the bus is bi-directional. Doesn't that translate to better average latency when a mass of data is being transferred? Please enlighten me if I'm totally wrong here.
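For what it's worth, "half the ring" is the worst case on a bidirectional ring; the average shortest path works out closer to a quarter of the stops. Here's a quick hop-count sketch (hops only, ignoring wire length, arbitration, and clocking, which likely dominate real latency):

```python
# Average shortest-path hop count between distinct stops on a bidirectional
# ring, versus a point-to-point crossbar (always one hop by construction).
def ring_avg_hops(n_stops: int) -> float:
    total = sum(min(abs(i - j), n_stops - abs(i - j))
                for i in range(n_stops) for j in range(n_stops) if i != j)
    return total / (n_stops * (n_stops - 1))

for n in (4, 8, 16):
    print(f"{n:2d} stops: ring avg {ring_avg_hops(n):.2f} hops, crossbar 1 hop")
```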
Quote:
Can you elaborate here? I don't imagine you can get the double clock with the same amount of transistors (if for nothing else, you need more redundancy to keep yield at bay), but I don't see why you would need double.

Consider how you do multiplication by hand: you do a lot of single-digit multiplications, adding results together, shifting, etc. The point is that subsequent stages depend on the results of previous ones, so you can't easily do them in parallel. If you implement that directly in hardware, it's cheap in terms of transistor count but has far too many stages (logic gates) in series to be clocked fast, since you have to wait until the signal has propagated through every gate in the chain (I won't digress into pipelined designs and such; they don't change the fundamental issue). So consider another multiplier design: assume you have basic blocks that can do not just one-bit multiplication but, say, 4 bits at once. That cuts down your serial chain a lot. Of course, to actually be faster, those 4-bit multipliers need fewer serial stages than chained 1-bit multipliers would have. It turns out that's easily possible (think, for instance, of hardwired tables for all possible 4-bit input combinations), but it comes at the expense of an increased logic gate count.
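A software analogy of that trade-off, counting dependent steps as a crude stand-in for gate depth (the function names and the step counting are made up for illustration; nothing here models real hardware):

```python
# Two ways to multiply: one bit of b per step (long serial chain, minimal
# 'hardware'), or four bits per step via a precomputed table (a quarter of
# the serial steps, paid for with extra 'area' -- the table / wider adders).

def mul_1bit(a: int, b: int, width: int = 16):
    """Classic shift-and-add: 'width' dependent steps, almost no extra logic."""
    acc, steps = 0, 0
    for i in range(width):
        if (b >> i) & 1:
            acc += a << i
        steps += 1                       # each iteration depends on the last
    return acc, steps

def mul_4bit(a: int, b: int, width: int = 16):
    """Consume 4 bits of b per step using a table of a*0 .. a*15."""
    table = [a * d for d in range(16)]   # the 'hardwired' 4-bit combinations
    acc, steps = 0, 0
    for i in range(0, width, 4):
        acc += table[(b >> i) & 0xF] << i
        steps += 1
    return acc, steps

assert mul_1bit(1234, 567)[0] == mul_4bit(1234, 567)[0] == 1234 * 567
print(mul_1bit(1234, 567)[1], "steps vs", mul_4bit(1234, 567)[1], "steps")
```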
Crossbar VS ring bus: as far as I remember the high level operation of the ring bus, it was mentioned that the average data has to travel half the ring as the bus is bi-directional. Doesn't it translate to better average latency where a mass of data is transferred? Pls enlighten if I'm totally wrong here.
Quote:
R600 was EOL'd and achieved practically zero market penetration. I fail to see the fascination with it. RV670 appears to be (mostly) more of the same, but is likely to achieve much higher market penetration thanks to its competitive pricing. Still won't hold a candle to GF8800 sales when all's said and done, though, so why even bother?

I'm a bit late to the discussion here, but I feel inclined to comment that you don't need to find something positive when trying to figure out why an architecture isn't performing as expected. Something that doesn't meet expectations can be just as fascinating technologically as something that exceeds them.
Sorry, I'm just so down on AMD right now that I can't find much of anything positive to say about them or their products.
Quote:
Yes, but since each sampler feeds one quad within a shader SIMD, the TEX:ALU ratio is constant there. The only way to change that ratio is to increase or decrease the number of shader clusters, if I interpret that architecture article correctly. Or is that what you're saying?

I think you're saying the same thing as Dave.
Quote:
R600's Z-fill rate is too low.

Does anyone want to elaborate a bit on exactly how important Z-fill is for current games? It seems to be the only aspect where the GeForce 8x00 series has a significant advantage in theoretical benchmarks.