Well, a wide SIMD architecture is always going to look good in pure throughput tests like those. But what about more realistic workloads like these?
Which of those is an ALU-specific test? I know 3DMark06 Perlin Noise is ALU-bound (just about).
And doesn't more general code have a lot more scalar dependencies by nature since it's not working against vectorized data as much as a typical 3D process would?
HD4870 can't get any slower than the serial MAD test I linked, i.e. 68% performance per mm2 or 37% of the absolute performance of GTX285.
As to the "nature" of more general code, the issue is really about the memory system. Some general code is so compute bound it barely uses any kind of memory resources, either video RAM or on-die shared RAM - just registers, basically. That code will be quite happy in naive scalar form.
But any time bandwidth/latency are part of performance you have to forget about programming a scalar machine in purely scalar terms. You're now programming a vector memory architecture. Gathers should be maximally coherent, you don't want to induce waterfalls in register/constant fetches, and the memory system needs nicely aligned operations to maximise memory controller and cache performance.
The SIMDness of the GPU, the 32-wide batches, is simply not enough to save you. By vectorising your use of data you're naturally making it work well on a vector GPU. It's why texturing is in quads, because the cost of not doing so is terrible.
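To put a rough number on why coherent, aligned access matters, here's a toy model (not any real GPU's memory controller) that counts how many aligned memory segments one 32-wide batch touches. The 64-byte segment size, the 4-byte word and the three access patterns are all assumptions picked for illustration.

```python
# Toy model: count the aligned memory segments touched by one batch.
# ASSUMPTIONS (not from any vendor documentation): 64-byte segments,
# 4-byte floats, and the three illustrative access patterns below.

BATCH = 32          # batch width mentioned in the post
WORD = 4            # bytes per float (assumed)
SEGMENT_BYTES = 64  # hypothetical aligned transaction size


def segments_touched(addresses, segment_bytes=SEGMENT_BYTES):
    """Return the number of distinct aligned segments the byte addresses fall in."""
    return len({addr // segment_bytes for addr in addresses})


# Consecutive floats starting on a segment boundary.
coherent = [lane * WORD for lane in range(BATCH)]
# Same pattern shifted by one word, so it straddles an extra segment.
misaligned = [lane * WORD + WORD for lane in range(BATCH)]
# Each lane reads a float 64 bytes away from its neighbour.
strided = [lane * 16 * WORD for lane in range(BATCH)]

for name, addrs in [("coherent", coherent),
                    ("misaligned", misaligned),
                    ("strided", strided)]:
    print(name, segments_touched(addrs))
```

Under those assumptions the same 128 bytes of useful data cost anywhere from the minimum number of segments up to one segment per lane, which is the kind of penalty the batch width alone does nothing to hide.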
Well F@H performance seems to say that they're not enough....
Eh? Until AMD re-writes the core to use LDS/GDS, F@H tells us precisely nothing.
True, it's just that right now a branch costs AMD at least 5x what it costs Nvidia in terms of idle resources. That's gotta catch up to them at some point.
I think large batch sizes are a far more pressing problem. Oh, by the way, I've realised that the batch size of RV770 is really double what I've been thinking. Because a pair of batches runs together in the ALUs in AAAABBBBAAAABBBB etc., any incoherency in either batch kills the other batch, too.
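A crude way to see the batch-size effect is sketched below. It assumes every work-item takes a branch independently with the same probability, which real workloads won't do, so treat the output as a trend only. The widths and the probability are hypothetical inputs, not measurements of any chip.

```python
# Crude divergence model. ASSUMPTION: each work-item takes the branch
# independently with probability p_taken. Widths are hypothetical inputs.


def p_incoherent(width, p_taken):
    """Probability that a batch of `width` items does not all go the same way."""
    all_taken = p_taken ** width
    none_taken = (1.0 - p_taken) ** width
    return 1.0 - all_taken - none_taken


def p_pair_hit(width, p_taken):
    """Probability that at least one of two interleaved batches is incoherent,
    i.e. the AAAABBBB case where either batch's incoherency costs both."""
    coherent_one = 1.0 - p_incoherent(width, p_taken)
    return 1.0 - coherent_one ** 2


p = 0.01  # illustrative branch probability
for width in (32, 64):
    print(width, round(p_incoherent(width, p), 3), round(p_pair_hit(width, p), 3))
```

In this model, pairing two batches always makes things at least as bad as a single batch of the same width, which is the sense in which the effective batch size doubles.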
---
If GTX285's ALUs are ~25% of the die, that's about 118mm2. Meanwhile HD4870's ALUs are ~30% of the die, about 77mm2.
So a purely ALU-based comparison of performance per mm2 for HD4870 against GTX285:
- float MAD serial - 57%
- float4 MAD parallel - 273%
- float SQRT serial - 221%
- Float 5-inst. Issue - 239%
- int MAD serial - 137%
- int4 MAD parallel - 279%
Worst case, AMD's ALUs are 76% bigger than NVidia's when running serial scalar code. Most of the time they're effectively 50% of the size in terms of performance per mm2.
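The arithmetic behind those figures can be sketched like this. The two ALU areas and the percentages come from the post; the 0.37 absolute ratio is the rounded serial MAD figure quoted earlier, so the results are only good to a percent or so. The function names are mine, purely for illustration.

```python
# Sketch of the perf/mm2 arithmetic. Areas and percentages are the post's
# own estimates; function names are hypothetical helpers for illustration.

GTX285_ALU_MM2 = 118.0  # ~25% of the die, per the post
HD4870_ALU_MM2 = 77.0   # ~30% of the die, per the post


def perf_per_mm2_ratio(abs_perf_ratio,
                       area_amd=HD4870_ALU_MM2,
                       area_nv=GTX285_ALU_MM2):
    """HD4870 perf/mm2 divided by GTX285 perf/mm2, given the absolute
    performance ratio (HD4870 / GTX285)."""
    return abs_perf_ratio * area_nv / area_amd


def effective_area_ratio(per_mm2_ratio):
    """Relative ALU area AMD needs to match NVidia's performance."""
    return 1.0 / per_mm2_ratio


# Serial float MAD: 37% of GTX285's absolute performance (rounded figure).
serial_mad = perf_per_mm2_ratio(0.37)
print(f"float MAD serial: {serial_mad:.0%} perf/mm2, "
      f"effective size {effective_area_ratio(serial_mad):.2f}x")

# The other tests, taken directly from the list as perf/mm2 ratios.
others = {
    "float4 MAD parallel": 2.73,
    "float SQRT serial": 2.21,
    "Float 5-inst. Issue": 2.39,
    "int MAD serial": 1.37,
    "int4 MAD parallel": 2.79,
}
for name, ratio in others.items():
    print(f"{name}: effective size {effective_area_ratio(ratio):.2f}x")
```

The effective size is just the reciprocal of the perf/mm2 ratio, so anything above 200% in the list means AMD's ALUs behave as if they were under half the area of NVidia's for that test.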
Jawed