Why does the NV3x's register usage limitation exist?

Ostsol · Sep 13, 2003

Just a request for some speculations. . .

Is there any precedent for this in CPU architecture?

The only reasoning I can think of is that there really are only two FP32 registers and an internal cache for temporarily storing data when more than two registers are needed. If this were true, the more registers needed, the more swapping back and forth between the registers and the cache would be necessary. Still, knowing the possible ramifications of this, making such a design decision is quite incomprehensible. . .

dominikbehr · Sep 14, 2003

x86 has 6 (maybe 7 with bp) general-purpose registers, while MIPS has 32.
x86 has to spill registers to memory (or use tricks such as a bigger hidden register file and register renaming to get the performance).

MfA · Sep 14, 2003

Registers seem to be a shared resource: the more context the chip has to store for a pixel, the fewer pixels it can keep in flight and the greater the chance of a pipeline stall.

In vertical multithreading, which this seems to be a form of, the more threads the better (ignoring cache effects, which are not relevant here).
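
To make the shared-resource argument concrete, here is a minimal sketch in Python. It assumes a fixed pool of register slots divided evenly among all pixels in flight. The pool size of 1024 and the function name are made up for illustration and are not real NV3x figures.

```python
def quads_in_flight(register_file_slots: int, temps_per_pixel: int,
                    pixels_per_quad: int = 4) -> int:
    """Number of whole quads that fit if every pixel reserves temps_per_pixel slots."""
    if temps_per_pixel < 1:
        raise ValueError("a shader needs at least one temporary register")
    return register_file_slots // (temps_per_pixel * pixels_per_quad)


# Hypothetical pool size, chosen only to show the trend.
POOL_SLOTS = 1024

for temps in (2, 4, 8):
    print(f"{temps} temps per pixel -> {quads_in_flight(POOL_SLOTS, temps)} quads in flight")
```

Under this assumption, doubling the number of temporaries halves the number of quads available to hide latency, which is the trade-off described above.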

archie4oz · Sep 14, 2003

The only reasoning I can think of is that there really are only two FP32 registers and an internal cache for temporarily storing data when more than two registers are needed.

2? I was always under the impression that it had the same number of temp registers as the R300 (32). The main difference between the two (at least in their fragment shader stages) was the complete lack of constant registers (since it stores constants in instruction slots)...

KimB · Sep 14, 2003

Take a look at this article:

http://www.3dcenter.de/artikel/cinefx/index_e.php

On page 5:

One special recommendation from nVidia to the use of pixel shaders is to use as few temporary registers as possible. While analyzing the Gatekeeper function we noticed that the number of quads in the pipeline depends straight on the number of temp registers. The less temp registers are used, the more quads fit into memory.

...

Before a quad can take another pass through the entire pipeline, it is necessary to send an empty quad down the pipe for technical reasons. This is of course detrimental to the usable performance. But this influence is smaller the less empty quads are necessary. And that can be achieved by increasing the number of quads in the pipeline.

There were other performance considerations that made it easier if more quads were kept in the pipeline, but the above is the easiest to explain.

So, if the amount of cache was increased significantly, part of the problem would be eliminated. Another solution would be to make the two resources non-shared (in a future architecture).
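
A rough way to see why the empty quad hurts less with a fuller pipeline is sketched below in Python. It assumes exactly one empty quad is needed per batch of quads looping through the pipeline. That is one reading of the article, not a confirmed hardware detail, and the function name is hypothetical.

```python
def useful_fraction(quads_in_pipeline: int, empty_quads_per_pass: int = 1) -> float:
    """Share of pipeline slots doing real work when each pass carries
    quads_in_pipeline real quads plus empty_quads_per_pass empty ones."""
    if quads_in_pipeline < 1:
        raise ValueError("need at least one real quad in the pipeline")
    return quads_in_pipeline / (quads_in_pipeline + empty_quads_per_pass)


for n in (1, 3, 15, 63):
    print(f"{n} quads per pass -> {useful_fraction(n):.1%} useful slots")
```

In this model the overhead shrinks as 1/(n+1), so anything that cuts the number of quads in the pipeline, such as heavier temp register use, makes the empty quad proportionally more expensive.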

Simon F · Sep 15, 2003

FWIW, slide 23 of this Graphics Hardware 2003 talk (PowerPoint) plots NV register usage against performance.

The following slide then describes the "gotchas" with ATI's hardware.

I think you should assume that no system is 'perfect'.

Dio · Sep 15, 2003

Slightly wrong wrt ATI hardware; it's 160 instructions...

pcchen · Sep 15, 2003

Actually it's 96 "4D" instructions (64 color + 32 texture). If you separate a 4D instruction into 3D vector and scalar parts, you can double the number of color instructions to 128 (the 160 number comes from 128 + 32), but that's not realistic. Furthermore, IIRC D3D does not allow this kind of variable limit on shader length.
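
The two instruction counts quoted in this exchange can be reproduced with simple arithmetic. The Python snippet below uses only the numbers given in the post, and the variable names are illustrative.

```python
COLOR_4D_SLOTS = 64   # color (arithmetic) instructions, counted as full 4D ops
TEXTURE_SLOTS = 32    # texture instructions

# Counting each color slot as a single 4D instruction.
full_4d_limit = COLOR_4D_SLOTS + TEXTURE_SLOTS

# Counting each color slot twice: once for the 3D vector part, once for the scalar part.
split_color_slots = COLOR_4D_SLOTS * 2
split_limit = split_color_slots + TEXTURE_SLOTS

print(full_4d_limit, split_color_slots, split_limit)
```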

Gubbi · Sep 15, 2003

Heck, maybe NV are using a stack architecture (with SIMD) for their shaders.

The shader would then hold TOS (top of stack) and TOS+1 internally for super-fast two-register operation, and spill to some external RAM for TOS+2 ... TOS+n

Cheers
Gubbi
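
The stack idea can be modelled with a toy Python class that keeps the top two entries in fast slots and counts every move to or from external storage. This is purely a sketch of the speculation in this post, not a description of how NV3x actually works. The class and its counters are hypothetical.

```python
class TwoSlotStack:
    """Toy stack: the top FAST_SLOTS entries are 'on chip', deeper ones are spilled."""

    FAST_SLOTS = 2

    def __init__(self) -> None:
        self.items = []   # items[-1] is TOS
        self.spills = 0   # values pushed out to external RAM
        self.fills = 0    # values pulled back into a fast slot

    def push(self, value: float) -> None:
        # If both fast slots are occupied, the older one is spilled to make room.
        if len(self.items) >= self.FAST_SLOTS:
            self.spills += 1
        self.items.append(value)

    def pop(self) -> float:
        value = self.items.pop()
        # If entries remain beyond the fast slots, one is filled back in.
        if len(self.items) >= self.FAST_SLOTS:
            self.fills += 1
        return value


stack = TwoSlotStack()
for v in (1.0, 2.0, 3.0, 4.0):
    stack.push(v)
print(stack.spills, stack.fills)
```

With only two live values nothing ever leaves the fast slots. Every extra live value costs one spill on the way in and one fill on the way out, which would match the observation that performance drops as more registers are used.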
