http://blogs.msdn.com/shawnhar/archi...er-second.aspx
http://texhnologix.blogzine.jp/texhn...shader_pe.html
http://texhnologix.blogzine.jp/texhn...madd_perf.html
http://texhnologix.blogzine.jp/texhn...llrate__1.html
http://texhnologix.blogzine.jp/texhn...fillrate_.html
Xenos's ALUs are a small evolution from the baseline set by R300, towards R600. R300's MAD+ADD was effectively chopped down for Xenos into MAD+SF (or MAD + scalar ADD, if I remember right).
Simplistically, R300 and Xenos can each issue two independent instructions per clock cycle, MAD + SF. In R300 this is vec3 MAD + SF. In Xenos it's vec4 MAD + SF. R300 can do a vec4 MAD too, with the SF unit joining in. The rationale for Xenos's design is that it has to do both vertex and pixel shading, and vertex shading more commonly needs to operate on vec4 data (x, y, z, w), whereas in pixel shading vec3 (red, green, blue) is often all that's needed (hence the vec3 bias of R300's pixel shaders).
R300 uses the ADD ALU as a pre-processor for MAD instructions (mostly for DirectX 8 "fixed functions", like scaling by 2x). At best you can get 3 instructions out of R300 (which is the same all the way up to R580): MAD for RGB, SF (e.g. reciprocal) and ADD/DX8-FF. The latter must always deliver its result to the MAD+SF ALU, though; it cannot write to a register (took me ages to realise this restriction). As far as I can tell Xenos integrates the DX8-FFs and there's no "auxiliary ALU" like R300's ADD on the side.
R600 is vec4 MAD+SF, but the twist is that it can issue 5 entirely independent instructions. On a good day it is 2x faster than R300 per clock, per ALU, but it probably averages 30-50% faster.
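To put the figures above side by side, here's a rough tally in Python. The class, field and function names are made up for illustration, and the numbers are only the ones quoted in this post (peak issue per clock, per ALU), so treat it as bookkeeping rather than a performance model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AluModel:
    """Hypothetical record of the per-clock figures quoted in the post."""
    name: str
    vector_lanes: int       # width of the vector MAD unit
    scalar_lanes: int       # the SF (scalar / special function) unit
    peak_instructions: int  # independent instructions per clock, best case

    @property
    def lanes(self) -> int:
        # Total component lanes that can be busy in one clock.
        return self.vector_lanes + self.scalar_lanes


MODELS = [
    # R300: vec3 MAD + SF, plus the ADD pre-processor feeding the MAD+SF ALU.
    AluModel("R300", vector_lanes=3, scalar_lanes=1, peak_instructions=3),
    # Xenos: vec4 MAD + SF, no auxiliary ADD on the side.
    AluModel("Xenos", vector_lanes=4, scalar_lanes=1, peak_instructions=2),
    # R600: vec4 MAD + SF, but every lane can run its own instruction.
    AluModel("R600", vector_lanes=4, scalar_lanes=1, peak_instructions=5),
]

BY_NAME = {m.name: m for m in MODELS}


def peak_ratio(a: AluModel, b: AluModel) -> float:
    """Best-case instruction issue of a relative to b, per clock."""
    return a.peak_instructions / b.peak_instructions


if __name__ == "__main__":
    for m in MODELS:
        print(f"{m.name}: {m.lanes} lanes, up to {m.peak_instructions} instructions/clock")
    print("R600 vs R300 peak issue ratio:", peak_ratio(BY_NAME["R600"], BY_NAME["R300"]))
```

This only compares best-case issue counts. Real shaders rarely fill every slot, which is why the average gain is a lot lower than the peak.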
Jawed