Demirug said: Yes, most of the stages (nVidia calls them slots) are used to cover the texture latency (nv: > 176). But even if you do not use the texture unit, you have to go through all these stages, because the texture unit bypass has the same size. This is necessary because the order of the quads cannot change. There is no real thread scheduler.
Is it just me, or is that an obvious thing to change for the NV40? If they need branching in it, they NEED to be smarter than that anyway.
So assuming that the number of "necessary" stages/slots/threads in the NV40 will drop sharply would be a safe bet IMO. If they reduced the average number of slots by 50-70% and doubled the register file again, as they did in the NV35, you might end up with a very reasonable performance hit from register usage.
Heck, if you needed 8 registers to see any sort of real performance hit, and the hit from 16 registers were roughly the same as the one from 4-6 registers on the NV30, I'd even say that unless NVIDIA badly ****s up their ALUs, their performance might be quite excellent indeed!
Of course, that's a BIG if.
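To put rough numbers on the scaling argument above, here is a back-of-the-envelope sketch in Python. It rests on one simplifying assumption that is mine and not a confirmed hardware detail: the register budget per slot is simply the register file size divided by the number of slots in flight. The names `budget_scale` and `baseline_equivalent` are made up for illustration, and "baseline" just means whichever chip the slot count and register file are being compared against.

```python
def budget_scale(slot_reduction: float, file_growth: float) -> float:
    """Factor by which the per-slot register budget grows.

    Hypothetical model: budget = register_file_size / slots_in_flight.
    slot_reduction is the fraction of slots removed (0.5 means 50% fewer),
    file_growth is the register file size relative to the baseline.
    """
    if not 0.0 <= slot_reduction < 1.0:
        raise ValueError("slot_reduction must be in [0, 1)")
    if file_growth <= 0.0:
        raise ValueError("file_growth must be positive")
    return file_growth / (1.0 - slot_reduction)


def baseline_equivalent(registers: float, scale: float) -> float:
    """Register count on the baseline that uses the same share of the budget."""
    return registers / scale


if __name__ == "__main__":
    # The two ends of the 50-70% slot reduction range, with a doubled file.
    for reduction in (0.5, 0.7):
        scale = budget_scale(reduction, file_growth=2.0)
        equivalent = baseline_equivalent(16, scale)
        print(f"{reduction:.0%} fewer slots, 2x register file: "
              f"budget x{scale:.2f}, 16 registers ~ {equivalent:.1f} on the baseline")
```

Under this model, cutting the slots by 50-70% and doubling the register file grows the per-slot budget by somewhere between 4x and roughly 6.7x. The real penalty curve depends on details of the pipeline that this sketch ignores, so treat it as an order-of-magnitude check only.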
Uttar