More Shader Questions

Mat3

Newcomer
A couple quick questions about shaders...

1.
With the mostly vector-based shaders from DX9 and earlier, which were typically vec3-4 + scalar, if the work called for two non-dependent scalar operations, did (or could) one of the scalar instructions go through the vector unit?

2.
In the R600 shaders, each 5-unit block works on a pixel, correct? Each array has 16 such units, so 16 pixels are processed per clock. I've seen information that vectors could be split up to fill all five units:

When you look at the PS rate you see that it's actually issuing 1.25 vector MADDs (x 64) per cycle, indicating that rather than scheduling vectors as:

RGBA_
RGBA_
RGBA_
RGBA_
RGBA_

It's scheduling:

RGBAR
RGBAG
RGBAB
RGBAA


Would that mean it's working on 20 pixels instead of 16? Is it only the fifth unit that can do that (the fat one)?

Thanks
 
Think of that list of RGBAs as a list of sequential instructions for one "pixel". In the latter case it just means that it's working on more than one instruction for a pixel in the same cycle.
 
A couple quick questions about shaders...

1.
With the mostly vector-based shaders from DX9 and earlier, which were typically vec3-4 + scalar, if the work called for two non-dependent scalar operations, did (or could) one of the scalar instructions go through the vector unit?
Yes - with the unused channels masked off.
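
As a rough illustration of "masked off", here is a minimal Python sketch of a 4-wide multiply-add that only writes the channels named in a write mask. The function name `masked_mad` and the register values are made up for the example; it models the idea only, not any real hardware or shader assembler.

```python
def masked_mad(dst, a, b, c, mask):
    """Per-channel a*b+c that writes only the channels listed in mask."""
    out = list(dst)
    for i, ch in enumerate("xyzw"):
        if ch in mask:
            out[i] = a[i] * b[i] + c[i]
    return out

# Hypothetical register contents; only the .x channel holds useful data.
r0 = [0.0, 0.0, 0.0, 0.0]
r1 = [2.0, 9.0, 9.0, 9.0]
r2 = [3.0, 9.0, 9.0, 9.0]
r3 = [1.0, 9.0, 9.0, 9.0]

# A scalar MAD issued on the vector unit: y, z and w are masked off.
result = masked_mad(r0, r1, r2, r3, mask="x")
print(result)
```

The other three lanes still exist in this model; their results are simply never written, which is why a scalar op on a vector unit leaves most of it idle.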

2.
In the R600 shaders, each 5-unit block works on a pixel, correct? Each array has 16 such units, so 16 pixels are processed per clock. I've seen information that vectors could be split up to fill all five units:

Would that mean it's working on 20 pixels instead of 16? Is it only the fifth unit that can do that (the fat one)?
No, it's working on a minimum of 5 (all vec4) or a maximum of 20 (all scalar) instructions that are all independent. It's always merely instruction level parallelism. The number of pixels is unchanging.

In:

MAD r0.xyz r1 r2 r3

there are, in effect, 3 parallel (independent) instructions:

MAD r0.x r1.x r2.x r3.x
MAD r0.y r1.y r2.y r3.y
MAD r0.z r1.z r2.z r3.z

R600 treats them as independent for the sake of compilation. They could all be scheduled for execution in the same instruction group (of up to 5 instructions), or they could be scheduled in separate instruction groups.
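
To make that concrete, here is a small Python sketch that splits a vector instruction into per-channel scalar instructions and then packs scalar instructions into groups of up to 5. It assumes every instruction is independent, so it does no dependency checking, and the greedy packing is only a stand-in for whatever the real compiler does. The helper names and the source registers `a`, `b`, `c` are hypothetical.

```python
GROUP_WIDTH = 5  # scalar slots per instruction group, as described above

def split_vector_op(opcode, dst, mask, srcs):
    """Expand one vector instruction into one scalar instruction per channel."""
    ops = []
    for ch in mask:
        operands = " ".join(f"{src}.{ch}" for src in srcs)
        ops.append(f"{opcode} {dst}.{ch} {operands}")
    return ops

def pack_groups(scalar_ops, width=GROUP_WIDTH):
    """Greedily fill groups of up to `width` ops (assumes all are independent)."""
    return [scalar_ops[i:i + width] for i in range(0, len(scalar_ops), width)]

# The MAD r0.xyz example from the post becomes three scalar MADs.
mad_xyz = split_vector_op("MAD", "r0", "xyz", ["r1", "r2", "r3"])
for op in mad_xyz:
    print(op)

# Five independent vec4 MADs (hypothetical sources a, b, c) are 20 scalar ops.
five_vec4 = []
for n in range(5):
    five_vec4 += split_vector_op("MAD", f"r{n}", "xyzw", ["a", "b", "c"])

groups = pack_groups(five_vec4)
vec4_per_group = 5 / len(groups)
print(len(groups), vec4_per_group)
```

Under those assumptions the five vec4 instructions fit in four groups, which is the RGBAR / RGBAG / RGBAB / RGBAA pattern and the 1.25 vec4 MADs per cycle quoted in the first post. The pixel count never enters the calculation.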

Jawed
 
OK, so hopefully I'm understanding this correctly. Then would it be possible to take the 5th scalar in the 5-pack and separate it from the other 4 so that they wouldn't have to all work on the same pixel? So instead of [16 X 5], it would be [[16 X 4] + 16 ].

That way, for a 5-component vertex shader, the 5th could be separated and sent to scalars.
And for pixel shaders, the scalars would work on different pixels from the superscalars (but still in the same batch or thread or whatever it's called). So assuming a group of pixels needs to have 4 parallel instructions done, the hardware would do 20 pixels per clock (16 + 4, the extra 4 coming from the 16 scalars, each taking one component). Or if 2 parallel instructions were needed, it would be 24 pixels (16 + 8).

Maybe some of the density in the shader arrays is lost, but better utilization is gained. Is that possible with VLIW, and can pixel groups be varying sizes?
 
This kind of split would double the scheduling cost for a small average benefit in utilisation. Utilisation in the 4-way VLIW unit is still going to suffer from most of the same scenarios that we have already.

Anything's possible, but trying to discern what's desirable is trickier. If you look around at stream processor designs you'll find that there are all sorts of interesting combinations of ALUs and scheduling...

Jawed
 
No, that wouldn't be possible. I think you're not understanding how the ALU blocks in Rx6xx work. Each 5-ALU cluster operates on 4 pixels (1 pixel per clock) over a 4-clock period, then moves on to the next pixel batch. Due to the arrangement of the ALUs within the cluster, they can execute a total of 5 scalar instructions per clock, or 20 ops within the 4-clock period over 4 pixels. The number of pixels (or primitives, vertices, etc.) in flight is constant. An Rx6xx SIMD array processes a batch of 64 pixels every 4 clocks. The variable is the number of instructions performed across that 64-pixel batch in the 4-clock period (up to a maximum of 320).
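
The arithmetic behind those figures can be checked with a few lines of Python. The constants come from this thread (16 clusters per array, 5 ALUs per cluster, 4 clocks per batch); the variable names are just for illustration.

```python
CLUSTERS_PER_SIMD = 16   # 5-ALU clusters in one SIMD array
ALUS_PER_CLUSTER = 5     # scalar ALUs per cluster
CLOCKS_PER_BATCH = 4     # each cluster handles 1 pixel per clock for 4 clocks

# Pixels in a batch: fixed, regardless of how well the ALUs are filled.
pixels_per_batch = CLUSTERS_PER_SIMD * CLOCKS_PER_BATCH

# Upper bound on scalar ops per cluster and per array over one batch.
max_ops_per_cluster = ALUS_PER_CLUSTER * CLOCKS_PER_BATCH
max_ops_per_batch = CLUSTERS_PER_SIMD * max_ops_per_cluster

print(pixels_per_batch, max_ops_per_cluster, max_ops_per_batch)
```

Only the op count varies with how full each instruction group is; the 64 pixels per batch stay the same.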
 