Luminescent said: Wow, could it be there are Nvidia lurkers here?
Absolutely. The company just doesn't seem too keen on letting them post here.
They are most likely register combiner units as described in the NV_register_combiners extension spec.
Uttar said:
8 FX12 MUL/ADD units
8 FX12 MUL units
The FX12 MUL units make 8 LRP ops/clock possible in FX12 mode instead of 4. They are also less sophisticated than the other parts of the pipeline: heck, they can't even be used when the op is dependent, unlike the rest of the pipeline. They're obviously different, but I'm wondering in which way they differ...
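One way to read the 8-versus-4 LRP figure, as a rough Python sketch. It assumes an LRP is evaluated as t*a + (1 - t)*b, so each LRP costs one plain multiply plus one multiply-add, and that the (1 - t) complement comes for free. Neither assumption is confirmed for this hardware, and the function names and unit counts are only illustrative.

```python
def lrp(t: float, a: float, b: float) -> float:
    """Linear interpolation written as two products: t*a + (1 - t)*b."""
    product = (1.0 - t) * b      # plain multiply (would map to a MUL unit)
    return t * a + product       # multiply-add (would map to a MUL/ADD unit)


def lrp_per_clock(mad_units: int, mul_units: int) -> int:
    """Upper bound on LRPs per clock under the assumed MUL + MUL/ADD split.

    Assumption: when no dedicated MUL unit is free, a second MUL/ADD unit
    has to do the plain multiply instead.
    """
    lrps = 0
    mads, muls = mad_units, mul_units
    while mads >= 1 and (muls >= 1 or mads >= 2):
        mads -= 1              # the multiply-add half of the LRP
        if muls >= 1:
            muls -= 1          # multiply half on a dedicated MUL unit
        else:
            mads -= 1          # fall back to another MUL/ADD unit
        lrps += 1
    return lrps


if __name__ == "__main__":
    print(lrp_per_clock(mad_units=8, mul_units=8))  # with the extra MUL units
    print(lrp_per_clock(mad_units=8, mul_units=0))  # MUL/ADD units only
```

Under these assumptions the extra MUL units double LRP throughput, which matches the quoted figures, but the real units may split the work differently.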
Luminescent said: If it is true that only the fp/tex unit has ddx/ddy filtering capability, what about all the other special functions? Does this mean the other fp units will not be capable of cos/sin/rsq, etc.?
Where the bolded items represent the new additions and subtractions to the NV30 pipeline. Accordingly, an NV30 pipeline resembles this:
temporary registers (R0,R1,..,H0,H1,..)
|
FLOAT (perhaps does DDX/DDY for dependent fetches)
| \
| TEXTURE <-- f[TEX0],f[TEX1],.. (DDX/DDY is free)
| /
FLOAT <-- temp registers
|
FLOAT <-- temp registers
|
(loopback to temporary registers or output)
These models are in accordance with thepkrl's architectural findings and are consistent with the data we have to date.
(thepkrl)
temporary registers (R0,R1,..,H0,H1,..)
|
FLOAT (perhaps does DDX/DDY for dependent fetches)
| \
| TEXTURE <-- f[TEX0],f[TEX1],.. (DDX/DDY is free)
| /
INTEGER <-- f[COL0],f[COL1]
|
INTEGER <-- f[COL0],f[COL1]
|
(loopback to temporary registers or output)
Basically the texture unit can only make two fetches when the textures are directly given as inputs and there are no pending results from a previous operation (dependency). In NV30, one fp shader unit seems to share its resources with the texture unit (maybe ddx/ddy, as previously surmised), perhaps explaining NV35's 2fp+2tex vs. 3fp tradeoff.
(thepkrl)
When textures are fetched, nothing else can be done in the FLOAT/TEXTURE unit. The following INTEGER unit can use the fetched texture result freely. FLOAT/TEXTURE unit can perform one fetch if the coordinates come from a previous calculation (meaning 4 textures/cycle). If the coordinates come directly from inputs as in PS1.1, the unit can perform a pair of texture fetches (meaning 8 textures/cycle).
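The fetch-rate arithmetic in thepkrl's description can be sketched in a few lines of Python. The pipeline count of 4 is inferred from the quoted "4 textures/cycle" at one dependent fetch per unit; the function name and the `dependent` flag are hypothetical, not taken from any driver or spec.

```python
PIPELINES = 4  # assumption: inferred from 4 textures/cycle at 1 fetch per unit


def textures_per_cycle(dependent: bool, pipelines: int = PIPELINES) -> int:
    """Peak texture fetches per cycle for the FLOAT/TEXTURE units.

    dependent=True  : coordinates come from a previous calculation,
                      so each unit manages one fetch per cycle.
    dependent=False : coordinates come straight from inputs (as in PS1.1),
                      so each unit manages a pair of fetches per cycle.
    """
    fetches_per_unit = 1 if dependent else 2
    return pipelines * fetches_per_unit


if __name__ == "__main__":
    print("dependent fetches:", textures_per_cycle(dependent=True))
    print("direct fetches:   ", textures_per_cycle(dependent=False))
```

These are peak rates only; while a unit is fetching, it cannot do float work in the same cycle, so real shaders will land below them.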
Would this explain why shader benchmarks such as rightmark (see pixel shader results here) run suboptimally on NV35 (considering its 12 fp shader units)? It seems either the drivers or the applications are not yet ready to properly distribute the fp ops to the 4 pipelines of 3 pipelined fp units each. The R3xx's pipeline is already configured to distribute all operations evenly, with 8 parallel fp shaders. Given its parallel nature, a pixel block is less necessary to make full use of resources (there is one shader assigned to at least 1 pixel every clock), while NV35, on the other hand, should require the pixel blocks to exploit the serial nature of its pipelined fp units.
psurge:
I strongly suspect that shader ops (including ddx/ddy) are executed on 2x2 pixel blocks - i.e. each "pipe" operates not on pixels, but on pixel stamps.
Accelerated pixel shaders allow for up to 12 pixel shader operations/clock
If you correctly pack the instructions that are dispatched to the execution units in this stage you can yield significantly more than 8 pixel shader operations per clock. For example, in NVIDIA's architecture a multiply/add can be done extremely fast and efficiently in these units, which would be one scenario in which you'd yield more than 8 pixel shader ops per clock.
It all depends on what sort of parallelism can be extracted from the instructions and data coming into this stage of the pipeline.
thepkrl:
Register usage is the key to performance, as has been mentioned earlier. For maximum performance, it seems you can only use 2 FP32-registers or 4 FP16-registers. Every two new registers slow down things, and going over 8 regs slows even more:
4.2 cyc/pix: 1reg (2 movs, 16 adds)
4.5 cyc/pix: 2reg (2 movs, 16 adds)
5.8 cyc/pix: 3reg (2 movs, 16 adds)
5.5 cyc/pix: 4reg (2 movs, 16 adds)
7.5 cyc/pix: 5reg (2 movs, 16 adds)
7.1 cyc/pix: 6reg (2 movs, 16 adds)
9.9 cyc/pix: 7reg (2 movs, 16 adds)
9.9 cyc/pix: 8reg (2 movs, 16 adds)
15.0 cyc/pix: 9reg (2 movs, 16 adds)
In the above test the N registers are used in order. If the register usage order is very mixed, performance seems to drop even more. This suggests there are about 2-4 real registers for each pixel in flight (depending on whether the output register is counted or extra temporaries are reserved). If more registers are used, data is moved between active registers and some slower memory buffer, which adds extra instructions.
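To make the "every two new registers" pattern easier to see, here is a small Python sketch that takes the cycle counts quoted above and computes the slowdown relative to the 1-register case, plus the average for each pair of register counts. Only the measurements come from the post; the names `CYCLES_PER_PIXEL`, `slowdown` and `pair_averages` are made up for illustration.

```python
# Cycles per pixel as quoted above (2 movs, 16 adds), keyed by register count.
CYCLES_PER_PIXEL = {
    1: 4.2, 2: 4.5, 3: 5.8, 4: 5.5, 5: 7.5,
    6: 7.1, 7: 9.9, 8: 9.9, 9: 15.0,
}


def slowdown(registers: int) -> float:
    """Cost relative to the 1-register case."""
    return CYCLES_PER_PIXEL[registers] / CYCLES_PER_PIXEL[1]


def pair_averages() -> dict:
    """Average cycles/pixel for register counts grouped as (1,2), (3,4), ..."""
    averages = {}
    for low in range(1, 8, 2):
        high = low + 1
        total = CYCLES_PER_PIXEL[low] + CYCLES_PER_PIXEL[high]
        averages[(low, high)] = total / 2
    return averages


if __name__ == "__main__":
    for regs in sorted(CYCLES_PER_PIXEL):
        print(f"{regs} reg: {slowdown(regs):.2f}x")
    for pair, avg in pair_averages().items():
        print(pair, f"{avg:.2f} cyc/pix")
```

Grouping by pairs smooths out the small inversions in the raw numbers (3 vs. 4 and 5 vs. 6 registers), which may just be measurement noise.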
DaveBaumann said: I still can't see that if you have 3 times the float power then that wouldn't somehow translate into more performance in a new shader benchmark. I'd like to see a few more new shader benchmarks for the NV35 preview here