I'd agree they have to be doing some sort of packing. With no hint that ATI's ALUs run at 2x the core clock, it's highly unlikely they're going scalar.
The Hexus report of 2GHz being the target is a hint - though many around here treat it with total derision. I interpret it as a hint that the ALU pipes are targeted to run at 2GHz, with the rest of the GPU running at 1GHz.
Separately, there are so many patent application documents that refer to just 2 operands being fetched that it seems like a strong hint that R600 is 2-clock MAD. If that's the case, I don't think it's a huge leap to the ALUs being simplified but fast, i.e. 2GHz (or 2x the main clock, whatever that ends up being).
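To make the 2-operand argument concrete, here's a toy model of how a 3-operand MAD could be split across two ALU clocks when the register file only delivers two operands per clock. The phase split and the 2x-clock framing are my reading of the hint, not anything stated in the patents:

```c
#include <stdio.h>

/* Toy model of a 2-clock MAD: if the register file delivers only two
 * operands per clock, a*b + c can be split into a multiply phase and
 * an add phase.  With the ALU running at 2x the scheduler clock, one
 * MAD still retires per scheduler clock.  This is my interpretation,
 * not a description of R600's actual pipeline. */
typedef struct { float product; } mad_pipe;

static void  mad_phase0(mad_pipe *p, float a, float b) { p->product = a * b; }
static float mad_phase1(mad_pipe *p, float c)          { return p->product + c; }

int main(void)
{
    mad_pipe p;
    mad_phase0(&p, 2.0f, 3.0f);            /* ALU clock 0: fetch a, b */
    printf("%f\n", mad_phase1(&p, 4.0f));  /* ALU clock 1: fetch c    */
    return 0;
}
```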
Which brings up the question: which setup is easier to pack for? While Nvidia did make the SF more compact, it's still not as small as a MADD, and outside of GPGPU work it just doesn't seem like the SF would show up very much.
Special functions seem to be used a reasonable amount.
The MI/SF unit in G80 is a bit peculiar because each clock it can produce either a single SF result or four interpolated values - so raw throughput comparisons are a bit deceptive. It's quite clever because the four interpolation calculations all share a single look-up table (well, set of tables). So most of the effort goes into interpolation.
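As a rough illustration of that either/or arrangement - this is a toy model, not G80's actual datapath - one set of looked-up plane coefficients can feed planar interpolation for all four pixels of a quad in a single clock:

```c
#include <stdio.h>

/* One shared set of coefficients (a, b, c) drives
 *   attr(x, y) = a + b*x + c*y
 * for all four pixels of a 2x2 quad - the "four interpolated values
 * per clock" case.  Purely illustrative. */
typedef struct { float a, b, c; } plane;

static float interp(plane p, float x, float y)
{
    return p.a + p.b * x + p.c * y;
}

int main(void)
{
    plane p = { 0.5f, 0.25f, 0.125f };   /* one shared table lookup */
    for (int y = 0; y < 2; y++)          /* the 2x2 pixel quad      */
        for (int x = 0; x < 2; x++)
            printf("pixel(%d,%d) = %f\n", x, y, interp(p, (float)x, (float)y));
    return 0;
}
```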
Things might get a bit interesting, but what about 4x the pixels instead of 2x? That should make packing substantially easier: Vec1->Vec4, Vec2->2*Vec4, Vec3->3*Vec4 (four pixels' worth of each).
I don't understand. If the array is 32 components wide, then it can process either 8 vec4s, 16 vec2s or 32 scalars in parallel (the odd one out being vec3 - probably 8 of those in parallel).
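The lane arithmetic is easy to sketch (32 is just the width assumed above, not a confirmed R600 figure; naive packing actually fits 10 vec3s with 2 lanes idle, while vec4-aligned slots give the 8 mentioned):

```c
#include <stdio.h>

/* How many n-component vectors fit into a SIMD array `width` components
 * wide, and how many lanes sit idle.  Width 32 is assumed, not known. */
static void pack(int width, int n)
{
    int fit  = width / n;         /* vectors issued per clock */
    int idle = width - fit * n;   /* wasted lanes             */
    printf("vec%d: %2d in parallel, %d lanes idle\n", n, fit, idle);
}

int main(void)
{
    for (int n = 1; n <= 4; n++)
        pack(32, n);
    /* prints:
       vec1: 32 in parallel, 0 lanes idle
       vec2: 16 in parallel, 0 lanes idle
       vec3: 10 in parallel, 2 lanes idle   <- or 8 if each vec3 is kept
       vec4:  8 in parallel, 0 lanes idle      in a vec4-aligned slot   */
    return 0;
}
```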
The problem with packing seems to me to be primarily one of fetching the operands. I posted some thoughts on a staggered striped register file in the R600 thread.
There are a few video-processing-related patent applications that explicitly describe some intriguing data routings from operands to multiple ALU components (in parallel) that are almost enough to support my packing theory...
Something like a Vec4+SF would be very similar to Xenos, but with improved packing to increase efficiency. Vec3/4 could still benefit from the cross/dot operations, and anything smaller could be pushed through sideways to keep the hardware happy. If anything branches, just kick it back out until you find more of them. Besides, the memory bus is 512 bits - it'd make sense if they were passing around 512-bit data units.
Yeah I'm thinking that once you can do packing, it also dramatically improves dynamic branching.
I think ATI mentioned somewhere that R600 was to be Xenos done right, or something to that effect. Other than the SF, they should be able to keep most of the ALUs working happily regardless of the type of data going through. The SF, I suppose, could even be part of a TMU if those are becoming programmable.
From the patent I linked earlier:
[0062] With a step size h of 2^-5, the max error for the approximation method described herein is: (1/24)h^4 = (1/24)(2^-20) = 0.0000000397, or 24 bits of precision. According to one embodiment, either the sine or cosine is calculated in one clock cycle using a pipelined circuit with 5 multipliers and a 4-input adder. For comparison with the Taylor method described in the DirectX specification, the computations instead could be performed in an 8-clock macro in the following manner: P(x) = f(x0) + (x - x0)(P1 + (x - x0)(P2 + (x - x0)P3))
[0063] The operation would start with an initial subtraction to find Δx = (x - x0), followed by three multiply-accumulate operations for each function. If the initial subtraction used to move the angle into the first quadrant is added, then there are two subtractions and six multiply-accumulates to get both function results. This compares with the three multiply-accumulates, three multiplies, and two adds needed for the Taylor method described in the DirectX specification.
I think this is a strong hint that R600 uses an 8-clock macro to perform SF calculations. And I like to infer that every ALU in the SIMD array can perform the SF calculation in parallel - though the patent describes significantly lowered precision being adequate for some of the multiplies, which militates against that theory.
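For what it's worth, here's a minimal software sketch of that macro - table lookup of f(x0) plus a cubic correction in Horner form. The step size matches the patent's 2^-5, but the coefficient values (plain Taylor terms about x0) are my assumption, since the patent doesn't give the table contents:

```c
#include <math.h>
#include <stdio.h>

#define H (1.0 / 32.0)   /* the patent's step size, 2^-5 */

/* P(x) = f(x0) + dx*(P1 + dx*(P2 + dx*P3)), dx = x - x0.
 * One subtraction plus three multiply-accumulates per function,
 * matching the macro described in [0063]. */
static double sin_macro(double x)
{
    double x0 = floor(x / H) * H;      /* nearest table entry below x */
    double dx = x - x0;                /* the initial subtraction     */

    /* stand-ins for the table lookup; assumed to hold Taylor terms */
    double f  = sin(x0);
    double p1 = cos(x0);               /*  f'(x0)      */
    double p2 = -sin(x0) / 2.0;        /*  f''(x0)/2!  */
    double p3 = -cos(x0) / 6.0;        /*  f'''(x0)/3! */

    /* three multiply-accumulates in Horner form */
    return f + dx * (p1 + dx * (p2 + dx * p3));
}

int main(void)
{
    /* worst-case error should come out near (1/24)*2^-20 ~ 4e-8,
     * i.e. the ~24 bits quoted in [0062] */
    double worst = 0.0;
    for (double x = 0.0; x < 1.5707963; x += 0.0001) {
        double e = fabs(sin_macro(x) - sin(x));
        if (e > worst) worst = e;
    }
    printf("max error: %g\n", worst);
    return 0;
}
```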
Jawed