Why does the NV3x's register usage limitation exist?

Ostsol

Veteran
Just a request for some speculations. . . :)

Is there any precedent for this in CPU architecture?

The only reasoning I can think of is that there really are only two FP32 registers and an internal cache for temporarily storing data when more than two registers are needed. If this were true, the more registers needed, the more swapping back and forth between the registers and the cache would be necessary. Still, knowing the possible ramifications of this, making such a design decision is quite incomprehensible. . .
 
x86 has 6 general-purpose registers (maybe 7 with bp) and MIPS has 32.
x86 has to spill registers to memory (or use tricks such as a bigger hidden register file with register renaming to get the performance).
 
Registers are a shared resource, it seems: the more context the chip has to store for a pixel, the fewer pixels it can keep in flight and the greater the chance of a pipeline stall.

In vertical multithreading, which this seems a form of, the more threads the better (ignoring cache effects, which are not relevant here).
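To put rough numbers on the shared-register idea, here is a toy sketch in Python. Everything in it is an assumption for illustration: the pool size `POOL_SLOTS`, the one-quad-per-cycle issue rate, and the function names are made up, and none of them are real NV3x figures.

```python
# Toy model of a register pool shared by every quad in flight.
# POOL_SLOTS is a hypothetical number picked for illustration only.
POOL_SLOTS = 64


def quads_in_flight(temps_per_quad: int, pool_slots: int = POOL_SLOTS) -> int:
    """How many quads fit in the pool if each needs temps_per_quad slots."""
    if temps_per_quad < 1:
        raise ValueError("a shader needs at least one temporary register")
    return pool_slots // temps_per_quad


def can_hide_latency(temps_per_quad: int, latency_cycles: int,
                     pool_slots: int = POOL_SLOTS) -> bool:
    """Assuming one quad is issued per cycle, a stall is avoided only if
    at least latency_cycles quads are available to switch between."""
    return quads_in_flight(temps_per_quad, pool_slots) >= latency_cycles


for temps in (1, 2, 4, 8):
    print(temps, quads_in_flight(temps), can_hide_latency(temps, 16))
```

With these made-up numbers, a shader using 2 temps leaves room for 32 quads, while one using 8 temps leaves room for only 8, which is no longer enough to cover a 16-cycle latency.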
 
The only reasoning I can think of is that there really are only two FP32 registers and an internal cache for temporarily storing data when more than two registers are needed.

2? I was always under the impression that it had the same number of temp registers as the R300 (32). The main difference between the two (at least in their fragment shader stages) was the complete lack of constant registers (since it stores constants in instruction slots)...
 
Take a look at this article:

http://www.3dcenter.de/artikel/cinefx/index_e.php

On page 5:
One special recommendation from nVidia to the use of pixel shaders is to use as few temporary registers as possible. While analyzing the Gatekeeper function we noticed that the number of quads in the pipeline depends straight on the number of temp registers. The less temp registers are used, the more quads fit into memory.

...

Before a quad can take another pass through the entire pipeline, it is necessary to send an empty quad down the pipe for technical reasons. This is of course detrimental to the usable performance. But this influence is smaller the less empty quads are necessary. And that can be achieved by increasing the number of quads in the pipeline.
There were other performance considerations that made it easier if more quads were kept in the pipeline, but the above is the most explainable.

So, if the amount of cache was increased significantly, part of the problem would be eliminated. Another solution would be to make the two resources non-shared (in a future architecture).
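A quick way to see why more quads in the pipeline softens the empty-quad cost is the sketch below. It assumes, as my own reading of the article rather than anything it states outright, that exactly one empty quad follows each batch of real quads; `useful_fraction` is a hypothetical helper name.

```python
def useful_fraction(quads_in_pipeline: int) -> float:
    """Share of pipeline slots doing real work, assuming one empty quad
    is sent after each batch of quads_in_pipeline real quads."""
    if quads_in_pipeline < 1:
        raise ValueError("need at least one quad in the pipeline")
    return quads_in_pipeline / (quads_in_pipeline + 1)


for n in (1, 3, 7, 31):
    print(n, useful_fraction(n))
```

Under that assumption the wasted share is 1/(n+1), so anything that cuts the number of quads that fit (such as using more temp registers) raises the overhead.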
 
FWIW, slide 23 of this Graphics Hardware 2003 talk (PowerPoint) plots the NV register usage against performance.

The following slide then describes the "gotchas" with ATI's hardware.

I think you should assume that no system is 'perfect',
 
Actually it's 96 "4D" instructions (64 color + 32 texture). If you separate a 4D instruction into 3D vector and scalar parts, you can double the number of color instructions to 128 (the 160 number comes from 128 + 32), but that's not realistic. Furthermore, IIRC D3D does not allow this kind of variable limit of shader length.
 
Heck, maybe NV are using a stack architecture (with SIMD) for their shaders. :)

The shader would then hold TOS (top of stack) and TOS+1 internally in the shader for super fast 2 register operation, and spill to some external RAM for TOS+2 ... TOS+n
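Purely to illustrate that speculation, here is a small Python sketch of a stack whose top two entries sit in fast storage while deeper ones spill. The class name `TwoEntryStack`, its counters, and the refill-on-pop policy are all my own assumptions, not anything known about NV hardware.

```python
class TwoEntryStack:
    """Stack whose top FAST_SLOTS entries are 'fast'; the rest are spilled."""

    FAST_SLOTS = 2  # TOS and TOS+1

    def __init__(self):
        self.fast = []    # fast[-1] is TOS
        self.slow = []    # spilled entries (TOS+2 ... TOS+n), deepest first
        self.spills = 0   # writes to slow storage
        self.fills = 0    # reads back from slow storage

    def push(self, value):
        # If both fast slots are taken, the deeper one is spilled first.
        if len(self.fast) == self.FAST_SLOTS:
            self.slow.append(self.fast.pop(0))
            self.spills += 1
        self.fast.append(value)

    def pop(self):
        if not self.fast:
            raise IndexError("pop from empty stack")
        value = self.fast.pop()
        # Refill from slow storage so TOS+1 is fast again.
        if self.slow:
            self.fast.insert(0, self.slow.pop())
            self.fills += 1
        return value


s = TwoEntryStack()
for v in (1.0, 2.0, 3.0, 4.0):
    s.push(v)
print(s.spills)  # two values had to leave the fast slots
```

In this model, anything that needs more than two live values at once pays for every extra value with a spill and later a fill, which would match the "more registers, more swapping" behaviour guessed at earlier in the thread.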

Cheers
Gubbi
 