Hmm, that's a very interesting point I had forgotten about completely - oopsie, and good catch!
It would also be worth doing a per-transistor/per-watt comparison - with the caveat that RV635 has more bandwidth than it needs, to a greater extent than RV670. As far as I can tell, RV670 is up to 3x the performance of RV635, but at <2x the power consumption. I don't know the precise numbers though (might only be able to compare 256MB cards: HD3650 and HD3850).
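As a rough sketch of what that comparison implies, using only the two bounds above (up to 3x the performance, under 2x the power). The function name is made up for illustration and the inputs are the guesses from this post, not measurements:

```python
def perf_per_watt_gain(perf_ratio, power_ratio):
    """Relative performance per watt of the bigger chip versus the smaller one."""
    return perf_ratio / power_ratio

# Best case from the post: 3x the performance at 2x the power.
# Since the power figure is "<2x", this is a lower bound for that case.
print(perf_per_watt_gain(3.0, 2.0))
```

So if RV670 really does hit 3x the performance at under 2x the power, it is better than 1.5x per watt; where the performance gap is smaller than 3x, the per-watt gain shrinks with it.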
Given that AMD implied in the past that the register file was basically as big as the ALUs themselves
I don't remember that, I'm guessing foggy memory on my part.
(excluding the scheduling overhead I presume?) then this might imply a substantial part of that density benefit is due to SRAM (which is inherently much denser), which also has lower (rather than higher) power density and better (not worse) yields. Hmmm!
For what it's worth the SRAM in Cell SPEs is ~4x the density of the remainder of the SPE. I dunno how we'd assess this factor in ATI GPUs - and we still don't know how much register file there is (512KB seems like a reasonable guesstimate) or whether it is actually double-implemented (to get dual-porting). L2 cache is another 256KB. The RBE caches could be 10s of KB, I suppose. There's also hierarchical-Z/stencil buffers - but I presume they're small enough to ignore.
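To put a very rough transistor count on those sizes, here is a sketch that assumes a plain 6-transistor SRAM cell and counts bit cells only (no decoders, sense amps or extra ports). The 6T figure and the helper name are my assumptions, not anything stated for R6xx, and a multi-ported register file would likely need more transistors per bit:

```python
BITS_PER_BYTE = 8
TRANSISTORS_PER_BIT = 6  # assumption: plain 6T SRAM cell, not an R6xx figure

def sram_transistors_m(kilobytes, transistors_per_bit=TRANSISTORS_PER_BIT):
    """Millions of transistors in the bit cells of an SRAM of the given size."""
    bits = kilobytes * 1024 * BITS_PER_BYTE
    return bits * transistors_per_bit / 1e6

register_file_kb = 512  # guesstimate from the post
l2_cache_kb = 256       # L2 size from the post

single = sram_transistors_m(register_file_kb + l2_cache_kb)
doubled = sram_transistors_m(2 * register_file_kb + l2_cache_kb)
print(f"single-copy register file + L2: ~{single:.1f}M transistors")
print(f"double-implemented register file + L2: ~{doubled:.1f}M transistors")
```

Under those assumptions the register file plus L2 would come to somewhere in the tens of millions of transistors, before counting the RBE caches or any overhead.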
I'd presume that the TMUs, for example, would use a lot less memory (although you *might* want to scale the texture cache along with them)
RV670 has 2x the L2 texture cache that RV635 has, so I think there's a strong chance the scaling is linear.
but the fact remains that what you just pointed out made my argument quite different; I still suspect ALU/TMU density is higher than the density of the 'unique' stuff for a variety of reasons, but certainly power density and yields become a much more complex equation now. Hmm!
One of the things I've been fiddling with is just how much of R6xx is "fixed": PCI Express, command processor, input assembler, rasteriser, setup, inter-stage buffers, instruction/constant-buffer caches, virtual memory system, ring bus buffering/command-processing etc.
So far I've narrowed it down to anywhere in the range 90-254M transistors.
If I simply plump for a number in the middle, 180M, then 27% of RV670's transistors are "fixed". That's 48% of RV635.
I don't know how to narrow down R6xx any further...
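The percentages above can be checked with a quick sketch. The die totals used here (roughly 666M transistors for RV670 and 378M for RV635) are the commonly quoted figures and are an assumption on my part, not something stated in this thread; the names are made up for illustration:

```python
# Total transistor counts in millions. Assumption: commonly quoted figures,
# not numbers given in this thread.
TOTAL_M = {"RV670": 666, "RV635": 378}

def fixed_share(fixed_m, total_m):
    """Fraction of a chip's transistors taken by the 'fixed' portion."""
    return fixed_m / total_m

# Low end, middle guess and high end of the 90-254M range.
for fixed_m in (90, 180, 254):
    shares = ", ".join(
        f"{chip}: {fixed_share(fixed_m, total):.0%}"
        for chip, total in TOTAL_M.items()
    )
    print(f"{fixed_m}M fixed -> {shares}")
```

With those totals the 180M guess comes out at about 27% and 48%, and the full 90-254M range spans roughly 14-38% of RV670 and 24-67% of RV635, which shows how loose the estimate still is.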
So, in summary we have:
- a substantial part of R6xx is fixed
- R6xx is comparatively low in performance per watt (against NVidia) despite R6xx's "low" unit counts (TUs, RBEs)
- super-linear scaling (from RV635 to RV670) both in per-unit performance and performance per watt
It seems to me that the "fixed" portion of R6xx GPUs is distorting things substantially. Obviously the scaling of RV635->RV670 is a fairly risky basis for this point of view, being a single data point. But what you gonna do?...
---
I can think of one spanner to throw in here: ALU utilisation. It seems to me it's very hard to find any code that exercises R6xx ALUs (whilst also fully exercising the rest of the GPU). So with "easy" code what we might be seeing is that power consumption in RV670 is never topping out, whereas RV635 might be closer to being fully utilised (due to the lower ALU:TEX ratio).
Jawed