It's a combination of all the factors I listed earlier: pipeline stalls, imbalances between vertex and pixel pipe utilisation, ALUs sitting idle due to instruction dependencies (including texturing), etc.
Xenos is the only GPU with this architecture and there aren't any usable benchmarks that I'm aware of.
A simple example of Xenos's gain from the USA (unified shader architecture) is any rendering pass that works purely on geometry, with no pixel shading required, e.g. Z-buffer pre-fill. Such a pass can utilise all 48 shader pipelines, so in the same time it can tackle a higher geometry complexity than, say, an SM3 GPU with 8 vertex pipelines running at the same clock. In this scenario, an 8v/24p GPU is wasting 75% of its pipelines.
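To make the arithmetic behind that 75% figure explicit, here's a rough back-of-envelope sketch. It assumes every pipe does the same amount of work per clock and that both GPUs run at the same clock, which is an idealisation rather than a measurement; the function names are just made up for illustration.

```python
def idle_fraction(vertex_pipes, pixel_pipes):
    """Fraction of shader pipes sitting idle during a geometry-only pass
    on a GPU with a fixed vertex/pixel split (pixel pipes have no work)."""
    return pixel_pipes / (vertex_pipes + pixel_pipes)


def geometry_throughput_ratio(unified_pipes, vertex_pipes):
    """Idealised ratio of geometry throughput: unified pipes vs. dedicated
    vertex pipes. Assumes equal per-pipe throughput and equal clocks."""
    return unified_pipes / vertex_pipes


# 8v/24p part during a Z pre-fill pass: 24 of 32 pipes have nothing to do.
print(idle_fraction(8, 24))

# 48 unified pipes versus 8 dedicated vertex pipes, under the same assumptions.
print(geometry_throughput_ratio(48, 8))
```

The second number is only an upper bound on paper; setup rate, bandwidth and so on would pull real results well below it.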
This thread, intermittently, is quite useful:
http://www.beyond3d.com/forum/showpost.php?p=581315&postcount=64
That posting refers to 5 vertex shader passes.
Currently there's no evidence from R520 to support or deny that out-of-order thread scheduling is a win (a feature shared by Xenos, and directly related to efficiency gains centred on ameliorating texturing latency). That's because GPUs already have advanced texture-latency hiding techniques. It's frustrating not getting a clear indication.
RV530 might provide such evidence (since it utilises a 3:1 ALU:Texture ratio, like Xenos) but it's clouded by too much other stuff. With only 4 texture pipes it seems to be radically better at texturing than RV515 (also 4 texture pipes and nearly identical core and memory clocks), hinting that a higher ALU:Texture ratio is a good thing. But, as I say, that's clouded by other variables.
So, right now, we have no evidence for a range of efficiency gains in Xenos that relate specifically to the ALU:texture workload of typical game shaders.
Jawed