Discuss NV40's shader unit architecture

991060

Since some detailed info about NV40's shader units has already been revealed, I think it's time for us to have some discussion.

What I know about NV40:

SU1 (shader unit 1) can fetch 1 texture at full speed or do one 4-component ALU op per clock, and it has free FP16 normalization.

SU2 cannot fetch textures, but can do one 4-component ALU op per clock. No free FP16 normalization.

Both SUs can co-issue in a 3/1 or 2/2 manner, and the operations in the two SUs seem to be independent.
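
To illustrate what I mean by co-issue (hypothetical instructions, not from any NVIDIA doc; the write-mask style is borrowed from the snippet further down):
Code:
# 3/1 co-issue: one 3-component op plus an independent scalar op on the same SU, same clock
mul r0.rgb, r1, r2
add r0.a, r3.a, c0.a
# 2/2 co-issue: two 2-component ops sharing the SU
mad r2.rg, r0.a, t7, -c1
mul r2.ba, r1.a, r0.a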

And maybe there's a mini-ALU in each SU; I don't know what they're for. Maybe for register/instruction modifiers, just like what we saw in R3XX.
 
Also, nVIDIA claimed NV40's pixel shader unit is SIMD style; does that mean its dynamic branching scheme is implemented through conditional write masks instead of "true" branching?
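
For comparison, here's roughly what I mean by the two schemes (made-up snippets, and the ps_3_0 syntax is from memory, so treat it as a sketch):
Code:
# conditional write mask / select style: both sides get computed, the result is picked per pixel
cmp r0, r2.x, r0, r1       # r0 = (r2.x >= 0) ? r0 : r1
# "true" branching style (ps_3_0): only the taken side should execute
if_gt r2.x, c0.x
    texld r0, t0, s0
else
    mov r0, c1
endif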
 
I didn't see this thread when I posted to your other one, so I'll re-post here with some additions:

Damn, that is one very impressive architecture. It may be 222 million transistors, but they'll do a lot better than double NV30's shader performance, especially since texture instructions don't prevent a simultaneous arithmetic instruction anymore.

I still think R420 will keep pace in shader speed, because most of the time you can pair a texture op with an ALU op (at least with current multitexture-ish shaders like in HL2 and Doom3), so R300/R420 should have similar per-pipe performance to NV40 according to that diagram. R420 should have a higher clock too, right?

Now it's just a question of how efficient all that is. Each pipe seems to be a lot more transistor-efficient than an NV30 pipe, yet many times faster with more features, so they may well have sacrificed FIFO sizes or other aspects. I'll make a tentative prediction that dependent texture access will not be as fast as on R300/R420 per pipe, and that the register limitation will still be significant.
 
Hmm, I'm not very sure tex and ALU ops can be paired in NV40. The diagram I saw suggests no, but there's an example given by nVIDIA that suggests otherwise:
Code:
#pass 1
texld r0, t0, s0              #tex fetch
mad r0, r0, c1.r, c1.g        #_bx2 in tex
nrm_pp r1.rgb, t4             #nrm in SU1
dp3 r1.r, r1, r0              #dp3 in SU2
mul r0.a, r0, r0              #co-issue in SU2
#pass 2
mul r1.a, r0.a, c2.a          #co-issue in SU1
mul r0.rgb, r1.r, r0          #co-issue in SU1
add r0.a, r1.r, r1.r          #x2 in SU1
mad r0.rg, r0.a, t7, -c1      #mad w/2 const in SU2
mul r1.ba, r1.a, r0.a, c2     #co-issue in SU2

I left the comments unchanged. Now, can anyone tell me how this shader can be executed in 2 passes without tex/ALU op pairing?
 
991060 said:
Also, nVIDIA claimed NV40's pixel shader unit is SIMD style; does that mean its dynamic branching scheme is implemented through conditional write masks instead of "true" branching?
No. The released nVidia documents don't indicate this:
• Divergent (data-dependent) branching is more expensive
– Depends on which pixels take which branches
There would be no difference in performance if all instructions were always executed.
 
Well, I'm not referring to the GDC paper, actually.

Much of the info I've seen about NV40 seems to be conflicting, as if it was written in a rush. :?
 
That code seems to correspond with NVidia's diagram quite accurately. It was just an ideal example of how NV40's pipes could get 10 ops done in just 2 clock cycles, for a net throughput of 8 pixels per clock with this lengthy shader.

Note: For anyone that might be confused (I know you aren't, 991060. This is for others), "pass" is not a full pass like in multipass rendering. It's internal, like what R200, R300, and NV30 (maybe NV2x too, but doubt it) all do when running shaders. A pass is done on a bunch of pixels that get shoved into a FIFO, then the next pass starts when all the texture accesses are complete.
 
991060 said:
Hmm, I'm not very sure tex and ALU ops can be paired in NV40. The diagram I saw suggests no, but there's an example given by nVIDIA that suggests otherwise:
Code:
#pass 1
texld r0, t0, s0              #tex fetch
mad r0, r0, c1.r, c1.g        #_bx2 in tex
nrm_pp r1.rgb, t4             #nrm in SU1
dp3 r1.r, r1, r0              #dp3 in SU2
mul r0.a, r0, r0              #co-issue in SU2
#pass 2
mul r1.a, r0.a, c2.a          #co-issue in SU1
mul r0.rgb, r1.r, r0          #co-issue in SU1
add r0.a, r1.r, r1.r          #x2 in SU1
mad r0.rg, r0.a, t7, -c1      #mad w/2 const in SU2
mul r1.ba, r1.a, r0.a, c2     #co-issue in SU2

I left the comments unchanged. Now, can anyone tell me how this shader can be executed in 2 passes without tex/ALU op pairing?

I think it's dependency driven. In the above example, SU1 is doing very little -- just the FP16 normalization, which they indicated was a pre-ALU operation. I wonder how much more complex the compiler has to be, given the asymmetry of the hardware.
 
991060 said:
I left the comments unchanged. Now, can anyone tell me how this shader can be executed in 2 passes without tex/ALU op pairing?

If you saw the other "passes" in the said shader, it's even more confusing :)
 
To be honest, I'm confused. :oops:

There are tex fetch and ALU ops in pass 1, but the diagram says SU1 can only do one kind of instruction OR the other, not at the same time (clock). What did I miss? :?
 
If you consider pass 3 or later, it looks like shader 0 can work with the texture units, i.e. you can have a computation instruction, followed by a texld, followed by more computation instructions, all in one pass.

However, I thought about a possibility: is it possible that the texture address must come from shader 0? That is, if a computation instruction puts its result in r0, and a texld uses r0 as the texture address, they will work together. However, if a computation instruction puts its result in r0, but the texld uses r1 as the texture address, they can't work together.

Just my speculation, though :)
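
Something like this (made-up instructions, just to show what I mean):
Code:
# could pair (speculation): the texture address is the result the shader unit just produced
mad r0.xy, r1, c0, c1
texld r2, r0, s0
# couldn't pair (speculation): the texld address has nothing to do with the preceding op
mad r0.xy, r1, c0, c1
texld r2, t3, s0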
 
991060 said:
To be honest, I'm confused. :oops:

There are tex fetch and ALU ops in pass 1, but the diagram says SU1 can only do one kind of instruction OR the other, not at the same time (clock). What did I miss? :?

But it's only doing the "free" normalization that you were talking about, not some arbitrary arithmetic op. This is in addition to what is shown in the diagram, probably because the average reader doesn't know what a normalization is anyway. Maybe it can only be used with data from texture ops (just a guess, since that's when you'd most often do it, and you can guarantee it's FP16 or lesser data).
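
If that guess is right, the free case would look something like this (hypothetical two-liner, not from the NV40 snippet):
Code:
texld r0, t0, s0
nrm_pp r1.rgb, r0     # the free normalize applied straight to the fetched vector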
 
Mintmaster said:
991060 said:
pcchen said:
If you saw the other "passes" in the said shader, it's even more confusing :)
Yes, it is. Do you understand it?
Did you take a peek at my explanation of "pass" above? I figured you already knew that, but maybe I was being presumptuous.

My understanding of this internal pass is pretty vague; is it equal to a clock?

My confusion comes from the fact that tex ops seem to be free when paired with ALU ops in NV40; that's exactly what we saw in R3XX, rather than in NV3X, which is supposed to share some common characteristics with NV40.
 
Mintmaster said:
Note: For anyone that might be confused (I know you aren't, 991060. This is for others), "pass" is not a full pass like in multipass rendering. It's internal, like what R200, R300, and NV30 (maybe NV2x too, but doubt it) all do when running shaders. A pass is done on a bunch of pixels that get shoved into a FIFO, then the next pass starts when all the texture accesses are complete.
Thx Mintmaster. I am looking forward to the development of this thread.
 
991060 said:
My understanding of this internal pass is pretty vague; is it equal to a clock?
Not really, but it generally takes one clock to complete on R300 and NV40 (assuming there are no stalls).

Let's say it takes 50 clocks from the time the pixel pipe asks for a texture access to the time it comes back fully filtered. Then you want to buffer at least 50 pixels at a time in each pipe. Each clock you do pass #1 on a pixel, and put any temporary values for this pixel into the FIFO. When 50 pixels are done, you configure the shader to do pass #2, and now you feed the pixels coming out of the FIFO back into the shader, and repeat until you've completed all passes. Then you repeat for the next 50 pixels.

Note that when you do this, the net throughput is one "pass" per pixel per clock per pipe, even though you have to wait 50 cycles for the texture access.
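
To put some made-up numbers on it: with that 50-clock texture latency and 50 pixels buffered per pipe, a shader needing 4 internal passes takes roughly 4 x 50 = 200 clocks per batch, i.e. 4 clocks per pixel per pipe, and the 50-clock latency itself never shows up as long as the batch is at least as deep as the latency.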

Chances are NVidia doesn't have much of a FIFO going from SU2 back into SU1, since there's no texture access allowed there. The FIFO is the reason for the register dependency in NV30 (AFAIK): each register takes up more room in the fixed-size FIFO, so fewer pixels fit and there are fewer clock cycles to absorb texture latency before you have to catch the pixels coming back out of the FIFO. Then you get stalls (you can't do any more work while you wait for the texel), and performance slows down.

So for R300, when they say a 64 instruction limit, that really means only 64 internal passes as far as the hardware is concerned. It doesn't seem to me like it's a big deal to support more instructions, but I guess the R300 team probably thought people weren't going to push that boundary very far, since you need at least 64 cycles to complete each pixel. Guess they underestimated the power of their beast :)

This whole explanation is oversimplifying it a bit, but hopefully you get the idea.
 
R3x0 passes are a kind of texture dependency delimiter. Each texture dependency step needs one pass. It's not correlated to clocks.

NV3x does not have such passes; it can do dependent texture reads in every instruction. You could say that it can do one "pass" every clock.

If that code snippet is right, then it seems like it's NOT the same kind of "pass" as with R3x0. I think it should effectively be read as "clock". The reason they don't just write "clock" may be that it's not all done in the same clock cycle, even if the throughput is one "pass" per clock. Remember, there are lots of "threads" running, and the execution jumps around between them.

More things from that code:
The FP16 norm runs in parallel with the SU1/TEX unit.
The SU1/TEX unit has an extra R3x0-like mini-alu.
The SU2 unit also has an extra R3x0-like mini-alu.
There are no mini-ALU-specific instructions, but when the compiler finds normal instructions that follow the rather specific mini-ALU rules, they're changed into mini-ALU instructions. ("_bx2" or "x2" here.)
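
E.g., the "_bx2" and "x2" lines from the snippet, assuming c1.r = 2 and c1.g = -1 (my reading of the comments):
Code:
mad r0, r0, c1.r, c1.g    # r0 = r0*2 - 1, i.e. the _bx2 scale-and-bias, foldable into the tex unit's mini-alu
add r0.a, r1.r, r1.r      # r0.a = r1.r*2, i.e. the x2 modifier case for SU1's mini-alu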


[Edit] Didn't see your last post Mintmaster.

Oh, you use "pass" a bit differently than I did. I was thinking of the "phase" in R200, which is also present in R300 even if it isn't explicitly stated. After reading your last post I agree with you. One pass = one round through all quads in flight. But if the whole multithreading is abstracted away, it could read as "one clock".
 
From the "source of an acquaintance of an acquaintance of a..." department, I had been told each pipe can deal with up to seven instructions per cycle. One texture lookup, four vector instructions and two helper/complex operations. By helper instructions, I mean the more complex sin/cos/sqrt type instructions that aren't simple multiply-and-accumulate. Each unit takes care of two of the four instructions and the pairings are 3/1 and 2/2. Of course a full four component instruction is also possible for each pair.

The information turned out to be correct for the pairings, so I wouldn't be too surprised if the other tidbits are correct. I don't know who the source's source is so who knows :)

I don't have any information regarding dependencies between the two units, or the shader program and pixel flow in general. It seems to me that at minimum the monolithic shader block would be able to switch between pixels so as to hide the latency of the texture lookups and more complex operations. This may be why they are saying the branch penalty depends on the path that each pixel takes. If all pixels in a quad take the same path, then the branch would probably be "free", and if only one pixel takes a given path, then the latency can probably be masked by switching to the others. If two pixels take one path and the other two take the other path, then there could be a latency problem.

Actually it's going to be heavily dependent on the operations performed in each path in any of these cases, and I don't think that this is where branching is the most useful anyway. It's easy to use branching to avoid doing texture lookups and blending, and you're certainly going to see a performance gain in many cases even if the latency of the branch is high.
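
Something along these lines (hypothetical ps_3_0-style snippet, just to illustrate the skip-the-work idea):
Code:
if_gt r0.a, c0.x          # e.g. only fetch and blend the detail map where it's actually needed
    texld r1, t1, s1
    mad r2, r1, c1, r2
endif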
 