NV30,35 & R300/R350 Pixel Shader Pipes Compared (New inf

Dave Baumann

OK, in another thread I described R300's pipeline as "shallow and wide" and NV30/35's as "narrow and deep". However, Eric Demers (sireric) gave a presentation at Shader Day that surprised me, because he gave a little more detail on R300's shader pipeline that appears to make that statement a little erroneous. Unbeknown to many before, there are not two ALUs (1 Vec3 + 1 scalar) per R300/R350 pipeline but four: full Vec3/scalar ALUs and "mini" Vec3/scalar ALUs (capable of some basic math functions).

When we compare R300/R350 to NV30 and NV35, this appears to be what we have (as far as I can discern so far):

Code:
   NV30 (x4)           NV35 (x4)           R300/R350 (x8) 
----------------   ---------------       --------------- 
|              |   |             |       |             | 
| Tex addr     |   |  Tex addr   |       |Tex addr proc| 
| proc /       |   |   proc /    |       |             | 
| Full FP32    |   | Full FP32   |       --------------- 
|     ALU      |   |     ALU     |               |
----------------   ---------------         -------------
       |                  |               |            |
----------------   ---------------   ------------ ----------- 
|              |   |    Mini     |   | Full Vec | |  Full   | 
|   FX12 ALU   |   |    FP32     |   | FP24 ALU | | Scalar  | 
|              |   |    ALU      |   |          | | FP24 ALU| 
----------------   ---------------   ------------ ----------- 
       |                  |               |            |
----------------   ---------------   ------------ ----------- 
|              |   |    Mini     |   | Mini Vec | |   Mini  | 
|   FX12 ALU   |   |    FP32     |   | FP24 ALU | | Scalar  | 
|              |   |    ALU      |   |          | | FP24ALU | 
----------------   ---------------   ------------ ----------- 
       |                  |               |            | 
       |                  |               --------------
       |                  |                      |

Apart from the very basic Math ops Eric wasn't keen to give away any more details on what instructions the Mini ALU's support. So far we know that the 'Mini' FP32 ALU's in NV35 support ADD, SUB, MUL, DP3, DP4 (off the top of my head).

Update: A little more detail on the known functionality / limitations of the mini ALU's...

R300/R350 - ADD, SUB, MUL

NV35 - Both can do ADD, SUB, MUL, DP3, DP4. Both combined can do an FP32 MAD, or each can do an FP16 MAD. PS1.4 range still [-2,2]. Each of NV30's FX12 units could do 2 MUL per cycle - the new FP units can't, so some PS1.1-1.4 shaders will be slower here.
 
Dave, Interesting the mini-ALU of R3x0 !
Some months ago, I talked about it on this forum. I found it in my tests, but ATI told me that this was probably an error in my tests or an optimisation made by the shader engine.

As I told you, the NV35 mini-FP32 units can work together to do an FP32 MAD. Each of them can do a MAD on its own, but only in FP16.

The NV35 loses the double MUL ability of the NV20-NV34's FX units.

Unfortunately, the range in PS1.4 is still -2/2.
 
I would imagine that the Mini FP24 units would be used for the PS1.x instruction modifiers. They could possibly be used for other things too though.
 
Luminescent said:
Edit: Wait a minute, you got to go to shader day.

Did you think I pulled this out of my backside? ;)

Anyway, I've made an update to the original post with some information on the "mini" ALU's that I know and based on Damien's (Tridam) findings.
 
Tridam said:
Colourless said:
I would imagine that the Mini FP24 units would be used for the PS1.x instruction modifiers. They could possibly be used for other things too though.

http://www.beyond3d.com/forum/viewtopic.php?p=131019#131019

I have strange results with R350/R300. It seems able to do one MUL for free with every instruction.

That behaviour might further suggest that the extra units are intended to do the PS1.x instruction modifiers, since they are all just muls by 0.125, 0.25, 0.5, 2, 4 or 8.

Might have to run a test or two myself. If I'm correct, then you shouldn't get the mul for free if you use an instruction modifier.
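
To make the "modifiers are just muls" point concrete, here is a rough Python sketch. It is only an illustrative model, not a description of what either chip does internally: the table maps the PS1.x shift modifier suffixes to the scale factors listed above, and the helper names (`apply_modifier`, `add_with_modifier`) are made up for this example.

```python
# Illustrative model only: a PS1.x result modifier treated as a plain multiply.
# The suffix names follow the DirectX ps_1_x modifier syntax; the helper
# functions below are hypothetical and exist only for this sketch.
MODIFIER_SCALE = {
    "_d8": 0.125,
    "_d4": 0.25,
    "_d2": 0.5,
    "_x2": 2.0,
    "_x4": 4.0,
    "_x8": 8.0,
}


def apply_modifier(value, modifier):
    """Scale an instruction result by the factor its modifier stands for."""
    return value * MODIFIER_SCALE[modifier]


def add_with_modifier(a, b, modifier):
    """Model of e.g. 'add_d2 r0, a, b': one add followed by one scaling mul."""
    return apply_modifier(a + b, modifier)
```

If the mini units really are there to absorb these scalings, then an instruction that already carries a modifier would have no spare mul left over, which is what the proposed test would look for.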
 
IIRC ABS is free on R300. I am not sure whether it's the mini-Vec or others. I'll do some tests to find out.

EDIT: a quick and dirty test suggests that the full units can do free ABS, but the mini units can't. The results are:

Code:
mov r1, c1
mad r0, v0, c0, r1
mad r0, r0, c1, r0
mul r0, r0, v1
texld r1, t0, s0
add r0, r0, r1
mov oC0, r0

takes 3 cycles.

Code:
mov r1, c1
mad r0, v0, c0, r1
mad r0, r0, c1, r0
mul r0, r0, v1
abs r0, r0
texld r1, t0, s0
add r0, r0, r1
mov oC0, r0

takes 3 cycles.

Code:
mov r1, c1
mad r0, v0, c0, r1
mad r0, r0, c1, r0
mul r0, r0, v1
texld r1, t0, s0
add r0, r0, r1
abs r0, r0
mov oC0, r0

takes 4 cycles.

Code:
mov r1, c1
mad r0, v0, c0, r1
mad r0, r0, c1, r0
mul r0, r0, v1
abs r0, r0
texld r1, t0, s0
add r0, r0, r1
abs r0, r0
mov oC0, r0

takes 4 cycles.
 
Dave, I'm assuming that the large alu's of the R3xx can compute a single cycle fmad/dp, while the small alu's are limited to either a mul, add, or sub instruction, right?
 
Oh great... Now I don't understand even R300/R350 anymore... :(
Judging by this info following shader should run 2 clocks per pixel R300/R350 and 1 clock on NV35:
Code:
add_sat r0, c0, v1 
add_sat r0, r0, c1 
add_sat r0, r0, -c2 
mov oC0, r0
Yet it doesn't run like that. It takes 3 clocks on R300 and 4 clocks on NV35... Now a direct question: what am I missing?

pcchen: I guess that in your case the driver is just being smart and rearranges your shader into:
Code:
mov r1, c1 
mad r0, v0, c0, r1 
texld r1, t0, s0
mad r0, r0, c1, r0 
mad r0, r0, v1, r1
mov oC0, r0
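
As a sanity check on that guess, the two instruction sequences can be modelled on plain scalars in Python. This is a sketch under stated assumptions: each register is a single float rather than a 4-component vector, `tex` stands in for whatever `texld` returns, and the function names are invented for this example. It says nothing about what the driver actually does; it only shows that the rearranged listing computes the same value.

```python
# Scalar model of the two listings. Registers are single floats here,
# which is a simplification of the real 4-component registers.

def mad(a, b, c):
    """Multiply-add, as in 'mad dst, a, b, c' -> a * b + c."""
    return a * b + c


def original_shader(v0, v1, c0, c1, tex):
    """pcchen's first listing: mul and add issued as separate instructions."""
    r1 = c1                 # mov r1, c1
    r0 = mad(v0, c0, r1)    # mad r0, v0, c0, r1
    r0 = mad(r0, c1, r0)    # mad r0, r0, c1, r0
    r0 = r0 * v1            # mul r0, r0, v1
    r1 = tex                # texld r1, t0, s0
    r0 = r0 + r1            # add r0, r0, r1
    return r0               # mov oC0, r0


def rearranged_shader(v0, v1, c0, c1, tex):
    """The guessed rearrangement: the mul and add fused into one mad."""
    r1 = c1                 # mov r1, c1
    r0 = mad(v0, c0, r1)    # mad r0, v0, c0, r1
    r1 = tex                # texld r1, t0, s0
    r0 = mad(r0, c1, r0)    # mad r0, r0, c1, r0
    r0 = mad(r0, v1, r1)    # mad r0, r0, v1, r1
    return r0               # mov oC0, r0
```

Both functions perform the same multiply followed by the same add, so they agree for any inputs; the rearranged version simply needs one arithmetic instruction fewer.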
 
MDolenc said:
Oh great... Now I don't understand even R300/R350 anymore... :(
Judging by this info following shader should run 2 clocks per pixel R300/R350 and 1 clock on NV35:
Code:
add_sat r0, c0, v1 
add_sat r0, r0, c1 
add_sat r0, r0, -c2 
mov oC0, r0
Yet it doesn't run like that. It takes 3 clocks on R300 and 4 clocks on NV35... Now a direct question: what am I missing?

It seems that constant read is not free on NV35.
 
Dave or some other bigbrain, could you please translate this into "thicky-type-geek-with-little-to-no-coding-experience" lingo for me please?

What does it mean, what benefits/limitations/new things does it imply?
 
Dio said:
Reverse engineering by committee? :)
Yeah, it's just like design by committee except that nothing is actually built and so it doesn't matter if a crap job was done :)

Oh, and guys, I'd be really surprised if the code you wrote mapped in a 1:1 way with the HW. Take a look at the description of the XBOX shader HW that was leaked - completely different instruction set.
 
Tridam said:
It seems that constant read is not free on NV35.
Might be, but what about R300? I have read your comment about a free mul with every instruction, but I can't get a free mul (unless I have a mul, add combo)...

Oh, and guys, I'd be really surprised if the code you wrote mapped in a 1:1 way with the HW. Take a look at the description of the XBOX shader HW that was leaked - completely different instruction set.
Might be, but I'd still expect hardware to map muls and adds 1:1... Where is that description of the XBox HW shader?
 
MDolenc said:
Oh great... Now I don't understand even R300/R350 anymore... :(
Judging by this info following shader should run 2 clocks per pixel R300/R350 and 1 clock on NV35:
Code:
add_sat r0, c0, v1 
add_sat r0, r0, c1 
add_sat r0, r0, -c2 
mov oC0, r0
Yet it doesn't run like that. It takes 3 clocks on R300 and 4 clocks on NV35... Now a direct question: what am I missing?
Actually, because of constant propagation optimization, it should execute in 1 cycle on an R3x0 (eventually). Something like:
add_sat oC0, (c0+c1-c2), v1

We are working hard on improving our current PS compiler, so that it can map PS ops to our HW in an optimal way. The current stuff is pretty simple. The HW is naturally very fast and executes well. However, it will get better. That's also why one should be careful when trying to determine our internal architecture based on shader code.
 
digitalwanderer said:
Dave or some other bigbrain, could you please translate this into "thicky-type-geek-with-little-to-no-coding-experience" lingo for me please?

What does it mean, what benefits/limitations/new things does it imply?
It means developers have more to pay attention to when optimizing their shaders and therefore can potentially squeeze out a few more fps than they previously would have been able to.
 
Ostsol said:
digitalwanderer said:
Dave or some other bigbrain, could you please translate this into "thicky-type-geek-with-little-to-no-coding-experience" lingo for me please?

What does it mean, what benefits/limitations/new things does it imply?
It means developers have more to pay attention to when optimizing their shaders and therefore can potentially squeeze out a few more fps than they previously would have been able to.
For both the NV3x and the R3xx series? Does this give hope to nVidia for improved-to-the-point-of-decent DX9 gameplay? (BTW-Thanks for taking the time to explain, it's most appreciated. :) )
 
sireric said:
MDolenc said:
Oh great... Now I don't understand even R300/R350 anymore... :(
Judging by this info following shader should run 2 clocks per pixel R300/R350 and 1 clock on NV35:
Code:
add_sat r0, c0, v1 
add_sat r0, r0, c1 
add_sat r0, r0, -c2 
mov oC0, r0
Yet it doesn't run like that. It takes 3 clocks on R300 and 4 clocks on NV35... Now a direct question: what am I missing?
Actually, because of constant propagation optimization, it should execute in 1 cycle on an R3x0 (eventually). Something like:
add_sat oC0, (c0+c1-c2), v1
Be careful now - saturating additions are not associative (and thus cannot be constant-folded) unless you make sure that all inputs have the same sign. If v1, c0, c1 and c2, for example, have the following values:
v1 = 1
c0 = 1
c1 = 1
c2 = 1
then MDolenc's instruction sequence will return 0, whereas your suggested optimization will return 1.
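
The counter-example is easy to check numerically. The following is a minimal Python sketch, assuming `_sat` clamps the result to [0, 1] after each instruction and treating each register as a single scalar; the function names are made up for this illustration.

```python
# Minimal model of add_sat: add, then clamp the result to [0, 1].
# Registers are single floats here; real registers have four components.

def saturate(x):
    """Clamp x to the [0, 1] range, as the _sat modifier does."""
    return max(0.0, min(1.0, x))


def add_sat(a, b):
    return saturate(a + b)


def sequential(v1, c0, c1, c2):
    """The three add_sat instructions exactly as written in the shader."""
    r0 = add_sat(c0, v1)    # add_sat r0, c0, v1
    r0 = add_sat(r0, c1)    # add_sat r0, r0, c1
    r0 = add_sat(r0, -c2)   # add_sat r0, r0, -c2
    return r0


def folded(v1, c0, c1, c2):
    """The constant-folded form: add_sat oC0, (c0+c1-c2), v1."""
    return add_sat(c0 + c1 - c2, v1)
```

With all four inputs set to 1, `sequential` clamps to 1 twice and then subtracts down to 0, while `folded` adds 1 + 1 and clamps to 1, so the two forms disagree.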
 
arjan de lumens said:
sireric said:
MDolenc said:
Oh great... Now I don't understand even R300/R350 anymore... :(
Judging by this info following shader should run 2 clocks per pixel R300/R350 and 1 clock on NV35:
Code:
add_sat r0, c0, v1 
add_sat r0, r0, c1 
add_sat r0, r0, -c2 
mov oC0, r0
Yet it doesn't run like that. It takes 3 clocks on R300 and 4 clocks on NV35... Now a direct question: what am I missing?
Actually, because of constant propagation optimization, it should execute in 1 cycle on an R3x0 (eventually). Something like:
add_sat oC0, (c0+c1-c2), v1
Be careful now - saturating additions are not associative (and thus cannot be constant-folded) unless you make sure that all inputs have the same sign. If v1, c0, c1 and c2, for example, have the following values:
v1 = 1
c0 = 1
c1 = 1
c2 = 1
then MDolenc's instruction sequence will return 0, whereas your suggested optimization will return 1.

Good point. 3 instructions then.
 