Change of insight

whereas the vector grouping of previous GPU generations' ALUs (and possibly R600) meant a group of ALUs (e.g., "5D," or 4D + scalar) work on a single vector in parallel (though that "+ scalar" is still a mystery to me as to whether it simultaneously works on another vector or something else--time to re-read some B3D reviews).
Vector+scalar issue is just what it says: one vec(n) instruction AND one scalar instruction processed together. That's not the case for all architectures, though (the NV3x pixel ALU was a straight vec4 type, AFAIK).
 
whereas the vector grouping of previous GPU generations' ALUs (and possibly R600) meant a group of ALUs (e.g., "5D," or 4D + scalar) work on a single vector in parallel (though that "+ scalar" is still a mystery to me as to whether it simultaneously works on another vector or something else--time to re-read some B3D reviews).
4D + 1D means that in one instruction (one clock cycle) you could calculate the following:
C.x += A.x * B.x
C.y += A.y * B.y
C.z += A.z * B.z
C.w += A.w * B.w
or a swizzle of this, like:
C.x += A.y * B.x
C.y += A.z * B.y
C.z += A.x * B.z
C.w += A.w * B.w

And in addition to that, you could also do:
s += u * v
where s, u, and v are independent of A, B, and C.

On a scalar machine, which does one multiply-add per clock, these five multiply-adds would take 5 clock cycles.
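
If it helps to see it spelled out, here is a rough Python sketch of one 4D + 1D issue slot as described above. It is only a toy model, not how any real chip is programmed. The names (issue_4d_plus_1d, swizzle, scalar_cycles) are made up for illustration, and the cycle count assumes one multiply-add per clock on the scalar machine.

```python
# Toy model of one "4D + 1D" issue slot: a vec4 multiply-add (with an
# optional swizzle on A) plus one independent scalar multiply-add.
# All names here are hypothetical; this is not a real GPU API.

LANES = "xyzw"

def swizzle(vec, pattern):
    """Reorder the components of a 4-tuple, e.g. pattern "yzxw"."""
    return tuple(vec[LANES.index(c)] for c in pattern)

def issue_4d_plus_1d(C, A, B, s, u, v, swz_a="xyzw"):
    """Model one clock: C += swizzle(A) * B on four lanes, and s += u * v."""
    a = swizzle(A, swz_a)
    new_C = tuple(c + x * y for c, x, y in zip(C, a, B))
    new_s = s + u * v  # scalar operands are independent of A, B, and C
    return new_C, new_s

def scalar_cycles(vec_width=4, scalar_ops=1):
    """Clocks needed on a scalar machine, assuming one multiply-add per clock."""
    return vec_width + scalar_ops

A = (1.0, 2.0, 3.0, 4.0)
B = (10.0, 20.0, 30.0, 40.0)
C = (0.0, 0.0, 0.0, 0.0)

# The swizzled case from the post: C.x += A.y * B.x, C.y += A.z * B.y, ...
C2, s2 = issue_4d_plus_1d(C, A, B, s=1.0, u=2.0, v=3.0, swz_a="yzxw")
print(C2, s2, scalar_cycles())
```

The "yzxw" pattern reproduces the swizzled example in the post. Leaving swz_a at its default gives the straight component-wise version.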
 