Simon F said:
As I said before, the NV chip seems to suffer performance drops with increasing register usage which would seem to indicate a lack of register space and/or some strange limitations on access.

The reason for the register usage performance hit is explained here:
http://www.3dcenter.de/artikel/cinefx/index5_e.php
While analyzing the Gatekeeper function we noticed that the number of quads in the pipeline depends directly on the number of temp registers. The fewer temp registers are used, the more quads fit into memory.
The recommendation from nVidia aims at having as many quads as possible in the pipeline. Why is this so important? We found three central reasons:
* Before a quad can take another pass through the entire pipeline, it is necessary to send an empty quad down the pipe for technical reasons. This is of course detrimental to the usable performance. But this influence is smaller the fewer empty quads are necessary. And that can be achieved by increasing the number of quads in the pipeline.
* Because of the length of the pipeline and the latencies of sampling textures it is possible that the pipeline is full before the first quad reaches its end. In this case the Gatekeeper has to wait as long as it takes the quad to reach the end. Every clock cycle that passes then means wasted performance. An increased number of quads in the pipeline lowers the risk of such pipeline stalls.
* The textures to read from can change in every pass through the pipeline. Because few quads result in few texture samples read in a row, the cache hit rate decreases. More memory bandwidth is required.
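
To put rough numbers on the first two points, here is a toy model in Python. It is only a sketch: the article gives no figures for register space or pipeline length, so REGISTER_SLOTS and PIPELINE_DEPTH are made-up values, and the function names are mine. It assumes each quad needs 4 pixels' worth of temp registers, that one empty quad follows the batch on every pass, and that the Gatekeeper idles whenever the batch is shorter than the pipeline. The cache effect from the third point is not modelled.

```python
# Toy model only: both constants are hypothetical, not taken from the article.
REGISTER_SLOTS = 1024   # assumed total temp register slots available to quads
PIPELINE_DEPTH = 200    # assumed pipeline length in cycles (one quad per cycle)


def quads_in_flight(register_slots, temps_per_pixel):
    """Quads that fit if every quad stores temps for its 4 pixels."""
    if temps_per_pixel < 1:
        raise ValueError("temps_per_pixel must be at least 1")
    return register_slots // (4 * temps_per_pixel)


def utilization(quads, pipeline_depth):
    """Fraction of cycles in one pass that carry a real quad.

    A pass costs the batch plus one empty quad, or the full pipeline
    depth if the batch is too short to keep the pipe busy.
    """
    if quads == 0:
        return 0.0
    cycles = max(quads + 1, pipeline_depth)
    return quads / cycles


for temps in (1, 2, 4, 8):
    n = quads_in_flight(REGISTER_SLOTS, temps)
    u = utilization(n, PIPELINE_DEPTH)
    print(f"{temps} temps -> {n} quads in flight, utilization {u:.3f}")
```

With these made-up values, doubling the temp register count halves the quads in flight, and once the batch is shorter than the pipeline the utilization halves with it. The real chip will not behave this linearly, but the direction matches what the article describes.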