I don't buy that. It may be a bit more efficient but I doubt it's really that much.
20-30% on a good day with a following wind, I'd say. You need predominantly serial-scalar operations with minimal transcendentals/texturing. Transcendentals and texturing both increase overall utilisation of the ALUs (MI is an expensive unit) but both increase the chances of putting bubbles into the MAD lane.
If you quote "shader bound cases" I'll point you at the 3DMark06 Perlin noise test (which, coincidentally, is pretty much the only benchmark which shows roughly the scaling you'd expect between G92 and G94, so it's probably REALLY shader ALU bound). In this benchmark, an HD3870 runs neck and neck with an 8800GTS-512, so it doesn't look that much less efficient (based on peak MAD rates the HD3870 should be just a tad faster).
Yep this is truly an ALU-limited shader, 9.31:1 ALU:TEX ratio (in the D3D assembly).
And, as I posted before, it runs at 93% scalar utilisation on R6xx (197 instruction slots - 916 scalar operations - 4.65 scalars per instruction slot) - ignoring the TEX instructions, that is.
So NVidia's design is theoretically ~8% faster here, if you ignore dependencies across the MAD and MI units.
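For anyone who wants to check the arithmetic, here is a minimal sketch using only the figures quoted above. The one assumption (mine, for illustration) is that a serial-scalar design keeps every lane busy on the same work, which is the "ignore dependencies across the MAD and MI units" case.

```python
# Figures quoted in the post for the Perlin noise shader on R6xx (TEX ignored).
instruction_slots = 197   # VLIW instruction slots
scalar_ops = 916          # scalar operations packed into those slots
lanes = 5                 # R6xx can issue up to 5 scalars per slot

scalars_per_slot = scalar_ops / instruction_slots
utilisation = scalars_per_slot / lanes

# Assumption: the serial-scalar design runs the same work at 100% utilisation,
# so its theoretical advantage is the reciprocal of R6xx's utilisation.
ideal_advantage = 1 / utilisation - 1

print(f"{scalars_per_slot:.2f} scalars per instruction slot")
print(f"{utilisation:.0%} scalar utilisation")
print(f"{ideal_advantage:.1%} theoretical advantage for serial-scalar issue")
```

The last figure comes out at about 7.5%, which is where the "~8%" comes from.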
More fundamentally, I think NVidia's ALU design is extremely costly in a number of areas:
Register file:
- G92 has 16 register files
- RV670 has 4 register files (though I think each one prolly has a ghost copy in order to support operand fetch bandwidth)
When a shader uses lots of registers (say more than about 4 vec4s), NVidia's design suffers from a severe drop in "occupancy" due to the relatively small size of each register file. So performance hits a brick wall, much like register file pressure affected performance in NV40...G71. Decoupled texturing in G92 significantly lowers the costliness of low thread counts, though.
Thread issue:
- G92 scoreboards every operand for each instruction in a shader - "is r0.z ready to issue a MAD?"
- RV670 scoreboards texture-clauses ("have these 3 texture instructions produced their result yet?")
NVidia's design hugely increases the amount of per thread state data (since it all needs to be scoreboarded) which also means the instruction/thread issue logic is pretty complex. Registers have read-after-write latency that can lower the population of available threads, something that never affects RV670.
Branching:
- G92 is forced to flush the ALU pipeline (~ 8 clocks?) when a thread turns out not to need to run that clause - if the clause is at least 5 or 8 instructions long (compiler makes a guess about the threshold for clause flushing)
- RV670 never flushes its ALU pipeline due to branching, it always swaps the thread for this test - stalls only arise when available threads are entirely exhausted
This hurts most in code that loops an indeterminate number of times, I guess (variable loop count per object within a thread). Arguably this is a style of coding that's rare, so doesn't matter.
There are advantages in NVidia's design, though:
Instruction-issue:
- G92 issues 2 operations per processor
- RV670 issues 5 operations
G92 has relatively simple ALU compilation aided by the fact that attribute-interpolation increases opportunities to maximise ALU utilisation. The requirement to schedule attribute-interpolation instructions does make compilation more complex, of course. RV670 is comparatively easy to run at very low utilisation with serially-dependent scalar instructions - but such code doesn't make up the majority of graphics shaders.
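To put a number on the "very low utilisation" case, here is a toy sketch of packing scalar operations into 5-wide instruction slots. It is not how the real compiler works: `pack_slots` is a hypothetical helper, and it ignores the real hardware's restrictions (the transcendental lane, operand read limits, clause boundaries). It only honours the rule that an operation cannot share a slot with something it depends on.

```python
def pack_slots(deps, width=5):
    """Greedily pack ops into slots of `width` lanes.

    deps[i] lists the earlier op indices that op i reads.
    Returns the number of instruction slots used.
    """
    slot_of = []   # slot chosen for each op, in program order
    fill = {}      # slot index -> number of lanes already used
    for d in deps:
        # An op must issue in a later slot than every op it depends on.
        slot = max((slot_of[j] + 1 for j in d), default=0)
        # Skip forward past slots whose lanes are all taken.
        while fill.get(slot, 0) >= width:
            slot += 1
        fill[slot] = fill.get(slot, 0) + 1
        slot_of.append(slot)
    return max(slot_of) + 1 if slot_of else 0


# Ten ops where each reads the previous result, versus ten independent ops.
serial = [[]] + [[i - 1] for i in range(1, 10)]
independent = [[] for _ in range(10)]

for name, deps in (("serial", serial), ("independent", independent)):
    slots = pack_slots(deps)
    print(name, slots, "slots,", f"{len(deps) / (slots * 5):.0%} utilisation")
```

In this toy model the fully serial chain needs one slot per operation, i.e. one lane in five is used, while the independent operations fill every lane.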
Texture (memory fetch) latency hiding:
- G92 issues all texture operations independently
- RV670 prefers to issue texture operations in clauses (e.g. 4 TEX operations)
This reduces the number of threads that G92 needs, per SIMD, in order to hide texturing latency, which has a knock-on effect of lowering register file consumption.
Register file usage:
- G92 serial-scalar instruction issue
- RV670 5-way instruction issue - but the ALU pipeline contains 5 scalar registers per object
G92 can use fewer registers in the compiled shader by re-ordering instructions (e.g. splatting vector instructions such that a single scalar register can be used as "scratch" for all channels of the vector result over the duration of the vector instructions). RV670's pipeline registers (effectively a mini register file of 8 clocks * 5 scalars per processor * 64 objects = 2560 scalars, 10KB) mean that it, too, can reduce consumption of the register file for purely "scratch" registers (results that are only needed for one "clock" - actually 8 clocks of pipeline time).
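The pipeline-register sizing above works out as follows. The only assumption I've added is 4 bytes (32 bits) per scalar, which is what the 10KB figure implies.

```python
clocks_in_flight = 8        # pipeline depth in clocks, as quoted
scalars_per_processor = 5   # one per ALU lane
objects = 64                # objects per thread on RV670
bytes_per_scalar = 4        # assumption: 32-bit scalars

total_scalars = clocks_in_flight * scalars_per_processor * objects
total_kib = total_scalars * bytes_per_scalar / 1024

print(total_scalars, "scalars,", total_kib, "KB")
```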
Branching granularity:
- G92 has a basic thread granularity of 16 - though this doesn't apply to pixel shading (32) and will prolly be 32 in future designs
- RV670 has thread granularity of 64. Future designs can either be smaller or larger. I'd tend to expect larger.
By comparison, RV670 suffers on any shader code that allows incoherent branching, with the worst effect seen in geometry or vertex shaders.
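A rough way to see why granularity matters is sketched below. It assumes every object takes a branch independently with the same probability, which is a deliberately crude assumption (real pixels are spatially correlated, so real divergence is lower). `divergence_probability` is a hypothetical helper for illustration, not anything from either vendor.

```python
def divergence_probability(batch_size, p_taken):
    """Chance that a batch contains both taken and not-taken objects,
    assuming each object decides independently with probability p_taken."""
    all_taken = p_taken ** batch_size
    none_taken = (1 - p_taken) ** batch_size
    return 1 - all_taken - none_taken


# Granularities mentioned in the post: 16 and 32 for G92, 64 for RV670.
for batch_size in (16, 32, 64):
    p = divergence_probability(batch_size, p_taken=0.01)
    print(f"batch of {batch_size}: {p:.1%} chance of running both paths")
```

Even when only 1 object in 100 takes the branch, the larger batch is much more likely to end up executing both sides.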
Ultimately NVidia saved ALU die space by running the ALUs at 2x (plus) the GPU's core clock. By comparison, I think the lack of a fast ALU clock is ATI's key drawback in terms of ALU implementation (not the lack of "serial scalar" issue), while I think the simplicity of its thread-control and the high thread:register-file ratio mean ALU capability will scale very rapidly.
Jawed