What do you think of this NV30 Pipeline diagram?

Arun

Hey everyone,

After thinking a little bit more about the NV30 pipeline organization, I came up with the following:

NV30Pipelines.png


A little "simplification" I added is that I check for COL - in practice, I'd guess it's a separate value determined by the driver.
Also, I'm not sure about the organization of register combiners and stuff, but you get the point hopefully.

What I mean by FP32 Part 1 and FP32 Part 2 is simple: both units have sufficient silicon to do the work of one TMU, but each has only *part* of the additional silicon required for an FP32 operation, and the two parts are complementary.
Another explanation is that they're identical units, and that each can do TMU work in 1 cycle or FP32/FP16 work in 2 cycles. I don't find that as logical though, because it would seem to complicate matters needlessly (although you'd have the advantage of fewer different types of units to bugfix / optimize...)


So, what do you think? It's obviously not perfect, and maybe even very wrong, but still, it does look quite good to me :)


Uttar
 
It's not a case of the "TMU" being the FP32 unit. It's the texture address processor that is the "FP32" unit in NV30.
 
Oopsy. Never figured that one out...
Hmm, does that mean there also are texture lookup units?


Uttar
 
Yes.

pixelshader.gif


If you look at the R300 PS diagram then I'd say that you'd double up on the texture units but have the "Floating Point Address Processor" and "Floating Point Color Processor" combined in the same functional unit in the NV30 pipe.
 
When doing just Z/Stencil, there's no situation when any of the fragment processing (FP or FX) can be used. That doesn't mean that the FP/FX hardware isn't used, but probably the pipeline would be designed more around the 4 pixel case with the 8Z being an addition, instead of a main feature.

This is speculation, but I'd guess the pipeline is more along the lines of:

Code:
RASTERIZER
  |
DIVIDE-PIXELS-TO-4-PIPELINES
    |
   registerbank (R0,R1,..,H0,H1,..)
    |
    | float vertex attributes *equations* f[WPOS],f[TEX0],f[TEX1]...
    | |
   FLOAT/FP
     | \
     |  \
     | TEXTURE
     |  /
    /| |
   / | | integer vertex attributes *equations* f[COL0],f[COL1]
  |  | |  |
  |  | | INTERPOLATE-INTEGER-ATTRIBUTE
  |  | |  |
  | REGCOMB/FX
  |  | |  |
  | REGCOMB/FX
   \  |
   registerbank (R0,R1,..,H0,H1,..)
    |
COLLECT-PIXELS-FROM-4-PIPELINES
  |
  | (either 4xRGBA+Z or 8xZ, both are 256 bits total).
  |
STENCIL/DEPTHTEST/BLEND/FRAMEBUFFER-ACCESS (16 pipelines)

Blending/framebuffer access is best done in as short a pipeline as possible to avoid dependencies, so it's probably a separate optimized step (and it seems to be pretty similar to GF4, at least in features). For full-speed 4 pixels/cycle with 4xAA they'd need at least 16 units.

So what about the 8Z mode? In normal mode the fragment pipelines together output 4 pairs of 32-bit values (RGBA color and Z). Perhaps in 8Z mode each pipeline instead outputs two 32-bit Z values. This would mean every pixel pipeline calculates two Z values instead of one: the first value is the normal one and the second is calculated using the FP32 unit and output as the color. This would also explain why there can be no color output (even flat shading) in this mode.
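
The bit counting behind that guess is easy to check. Here is a minimal Python sketch; the names (`PIPELINES`, `output_bits`) are made up for illustration and say nothing about how the hardware actually labels things.

```python
# Sketch of the output-width arithmetic from the post above.
# Assumption: every value leaving a pipeline (packed RGBA color or Z) is 32 bits.
PIPELINES = 4
BITS_PER_VALUE = 32

def output_bits(values_per_pipeline):
    """Total bits leaving the fragment pipelines per clock."""
    return PIPELINES * values_per_pipeline * BITS_PER_VALUE

normal_mode = output_bits(2)  # one RGBA color + one Z per pipeline
z_only_mode = output_bits(2)  # two Z values per pipeline (second one in the color slot)

z_values_per_clock = PIPELINES * 2

print(normal_mode, z_only_mode, z_values_per_clock)
```

Both modes come out at the same width, which is the whole point of the guess: the second Z simply rides in the slot the color would normally use.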

With 2X multisampling 16 units are still enough to do double the Z per cycle. With 4X multisampling the Z-only speed advantage should disappear (unless there are more than 16 units). Haven't tested this though.

Another related note: at least the floating point vertex attributes appear not to be stored directly in registers. Instead they are interpolated as they are needed. This explains why using a f[TEXx] register in an arithmetic operation costs an extra cycle: the first cycle is used by the FLOAT unit to actually interpolate the value (to a temp register), and the second cycle (round) performs the actual calculation.

This is actually a common method, since one can store a per triangle equation for each attribute (3-4 values per attribute depending on how it's represented) instead of per fragment information. There are many more fragments than triangles in flight, so this should save register space. It seems the fixed point values (f[COL0/1]) do not have this penalty, so I added a dedicated hardware unit (INTERPOLATE-INTEGER-ATTRIBUTE) for that, although it could be that those two attributes are rasterized and stored directly instead of as equations.
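
As a rough illustration of the per-triangle equation idea, here is a small Python sketch. It assumes a plain linear plane equation `v = a*x + b*y + c` (three values per attribute) and ignores perspective correction; the function names are hypothetical, not anything NV30-specific.

```python
# Sketch: store one plane equation per attribute per triangle,
# then evaluate it per fragment only when the attribute is needed.
# Assumption: screen-space linear interpolation, no perspective correction.

def setup_plane(p0, p1, p2):
    """Each p is (x, y, value). Returns (a, b, c) with value = a*x + b*y + c."""
    (x0, y0, v0), (x1, y1, v1), (x2, y2, v2) = p0, p1, p2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if det == 0:
        raise ValueError("degenerate triangle")
    a = ((v1 - v0) * (y2 - y0) - (v2 - v0) * (y1 - y0)) / det
    b = ((x1 - x0) * (v2 - v0) - (x2 - x0) * (v1 - v0)) / det
    c = v0 - a * x0 - b * y0
    return a, b, c

def interpolate(plane, x, y):
    """Evaluate the stored equation at one fragment position."""
    a, b, c = plane
    return a * x + b * y + c

plane = setup_plane((0, 0, 0.0), (4, 0, 4.0), (0, 4, 8.0))
print(plane, interpolate(plane, 1, 1))
```

The storage argument is then just counting: three numbers per attribute per triangle, versus one number per attribute per fragment in flight.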

Finally, I strongly suspect the 4 pipelines are not totally independent. Since pixels are processed in 2x2 squares (necessary for mipmapping and DDX/DDY), both the interpolation and texture fetch operations of the adjacent pixels are not independent. Interpolation can be simplified by first calculating the value for one of the pixels, and the adjacent ones are then easier. Similarly adjacent pixels probably request nearby texels, so optimizations would be possible there as well.

One "proof" of this is the TXD instruction, which allows one to give the DDX/DDY derivatives for the texture (and thus control mipmapping/anisotropy). This instruction runs a bit over 4 times slower than a normal texture fetch. Probably the pipelines have to take turns to do the texture fetches, as the fetches are now truly independent and not adjacent like in the normal case.
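
For contrast with TXD, here is a hedged Python sketch of the normal case, where the four pixels of a 2x2 quad share work. It assumes a linear attribute `v = a*x + b*y + c` and takes the derivatives as simple differences inside the quad; the function names are made up for the example, and real hardware may well do this differently.

```python
# Sketch: evaluate an attribute for a 2x2 quad incrementally and
# derive DDX/DDY from the quad itself.
# Assumption: the attribute is linear in screen space (a, b, c are per-triangle).

def shade_quad(plane, x, y):
    """Return (top_left, top_right, bottom_left, bottom_right) for the quad at (x, y)."""
    a, b, c = plane
    tl = a * x + b * y + c   # one full evaluation
    tr = tl + a              # neighbours only need an add
    bl = tl + b
    br = tl + a + b
    return tl, tr, bl, br

def quad_derivatives(tl, tr, bl, br):
    """Finite-difference DDX/DDY, as available for free within a quad."""
    return tr - tl, bl - tl

quad = shade_quad((1.0, 2.0, 0.0), 2, 2)
print(quad, quad_derivatives(*quad))
```

With TXD the derivatives are supplied per pixel rather than read off the quad, so this kind of sharing no longer applies, which would fit the roughly 4x slowdown.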

Edit: Noticed the diagram by Hyp-X, which also looks very possible. I missed the issue about where the normal Z is interpolated. It could very well come from the rasterizer, and in that case the second Z might come directly from there as well. Or it could be calculated by the FP units as I guessed. There's really no way to tell. But in any case the 8Z mode is probably a feature of the framebuffer operations after the fragment pipeline instead of the pipeline itself.
 
thepkrl said:
With 2X multisampling 16 units are still enough to do double the Z per cycle. With 4X multisampling the Z-only speed advantage should disappear (unless there are more than 16 units). Haven't tested this though.

Someone else (I forget who) did and posted the results here some time ago (soon after the 4x2 configuration was discovered). They were as you guessed: 8 Z/clock in 2xMSAA, but only 4 Z/clock in 4xMSAA.
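
Those numbers match a very simple throughput model. A minimal Python sketch follows, assuming each of the 16 Z units handles one sample per clock and that the fragment pipelines can feed at most 8 Z values (Z-only) or 4 colored pixels per clock. The constant and function names are illustrative only.

```python
# Sketch: pixels per clock as limited by either the fragment pipelines
# or the 16 Z/stencil units, whichever runs out first.
# Assumption: one multisample per Z unit per clock.
Z_UNITS = 16
MAX_Z_ONLY_PER_CLOCK = 8   # 4 pipelines x 2 Z values
MAX_COLOR_PER_CLOCK = 4    # 4 pipelines x 1 color+Z

def z_only_rate(samples):
    return min(MAX_Z_ONLY_PER_CLOCK, Z_UNITS // samples)

def color_rate(samples):
    return min(MAX_COLOR_PER_CLOCK, Z_UNITS // samples)

for samples in (1, 2, 4):
    print(samples, z_only_rate(samples), color_rate(samples))
```

Under these assumptions the Z-only advantage is 2x with no AA or 2xMSAA and disappears at 4xMSAA, while color rendering stays at 4 pixels/clock throughout.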
 
I've always wondered why both texture lookup units take input from one texture address calculator. Shouldn't each texture lookup unit have its own, discrete, texture address calculator? Wouldn't this be a possibility in NV35, because it contains two FP32 units per pipeline?
 
cho said:
the combiner hardwares can be used in the middle of fragment programs? :rolleyes:

Yes and no.

There's no way to specify combiner operations inside a fragment program, so technically no. However, the number of FX12-operations you can perform between each FP32 operation exactly matches what you could do with two combiners (without the extra scale/bias tricks or dual issue for color and alpha, which don't require additional arithmetic units). Also, register combiners on NV30 have 10 fractional bits (unlike GF4 which has 8 ), just like FX12 numbers (I've tested this).
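
To show what the difference between 10 and 8 fractional bits means in practice, here is a small Python sketch. It only models the fractional precision (round to the nearest step of 2^-bits) and deliberately ignores range clamping and sign handling, so treat it as an illustration rather than a description of the actual FX12 format.

```python
import math

# Sketch: quantize a value to a fixed-point grid with a given number
# of fractional bits. Assumption: round-to-nearest, no range clamping.

def quantize(value, frac_bits):
    scale = 1 << frac_bits
    return math.floor(value * scale + 0.5) / scale

step = 1.0 / 1024  # smallest step with 10 fractional bits

print(quantize(step, 10))  # survives with 10 fractional bits
print(quantize(step, 8))   # rounds away with 8 fractional bits
print(quantize(0.3, 10), quantize(0.3, 8))
```

A test along these lines (feed in a value that only survives with 10 fractional bits and see whether it comes out nonzero) is one way to tell the two precisions apart.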

So I assume it's the same hardware doing both FX12 and Register Combiner ops. And to be precise, the FP32 unit can also do FX12, but presumably no combiner operations.
 
Yep, both of those diagrams certainly make more sense to me.

My bet would be on Hyp-X's diagram, although some of thepkrl's ideas really seem interesting to me, including using the FP unit for doing interpolation - maybe some type of hybrid, eh...

I think my primary error when doing my diagram is that I assumed the NV31 and NV30 were nearly identical - sounds like they aren't really.
My diagram is pretty much some type of hybrid of NV31 and NV30: possibility to do shading with 4 pipelines, but only loopback with 2, as in the NV31 - but the NV31 really has dedicated color & Z units it seems.

Although, that makes me wonder...
AFAIK, nVidia claims 4 Z outputs per pipeline in all of the NV3x ( which actually indirectly means the NV30 is 4 real pipelines, considering it was proved it had 16 Z output units, LOL ) - but...
Does the NV31 have 16 Z output units, or does it only have 8? Never seen any tests on that one...


Uttar
 
Hyp-X,

re the diagram, are you sure you didn't mean the 2nd tex sampler to go to the 2nd reg combiner, rather than the first reg combiner taking the fp32 shader and the two tex samplers as input and the second reg combiner just the output from the first?
 
darkblu said:
Hyp-X,

re the diagram, are you sure you didn't mean the 2nd tex sampler to go to the 2nd reg combiner, rather than the first reg combiner taking the fp32 shader and the two tex samplers as input and the second reg combiner just the output from the first?

It's more complicated than the way I drew it.
If you want to be precise there should be a register file involved, but that's the cloudy part of the NV30.

If I understand correctly, the register combiner can take its input from 6-8 FX12 registers. The FP32 unit, the texture samplers, and the register combiner can write to those registers.
 