Jawed
Legend
Sigh, total brain fade on the strand-shared registers: what I said about them is completely wrong and should basically be ignored. Only 128 of these can be allocated.
One way to think about register pressure is to consider the maximum amount of thread state possible while the ALUs can still hide their own systematic latency. In GT200, 6 threads per multiprocessor are required to hide latency (register read-after-write); in RV770, 4 threads (clause switch).
So the register space per strand, in bytes:
- GT200 - 64KB / (32 strands * 6 threads) = 341
- RV770 - 256KB / (64 strands * 4 threads) = 1024
On NVidia it's possible to get away with only 2 threads, but to maintain full throughput the code must be compiled with no serial instruction dependencies within 3 instructions of each other. One thread will work on NVidia, giving 2KB, but I doubt that would use more than half the ALU cycles available. One thread on ATI will only use half the ALU cycles (with additional clause-switch overhead), and with a fair bit of kludging, using normal registers (1KB) plus strand-shared registers (2KB), it would be possible to get 3KB.
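As a rough sketch of the arithmetic above: the register file is divided evenly across every strand of every resident thread. The function and variable names are made up for illustration, and 1KB is assumed to be 1024 bytes, which puts the GT200 figure at about 341 bytes (taking 64KB as 64,000 bytes gives 333 instead).

```python
KB = 1024  # assumption: register file sizes are quoted in units of 1024 bytes


def bytes_per_strand(register_file_kb, strands_per_thread, resident_threads):
    """Register-file bytes each strand gets when the file is split evenly."""
    return register_file_kb * KB / (strands_per_thread * resident_threads)


# Figures taken from the post: file size, strands per thread, threads needed
# to hide latency.
gt200 = bytes_per_strand(64, 32, 6)
rv770 = bytes_per_strand(256, 64, 4)

# The single-thread NVidia case mentioned in the post (2KB per strand).
gt200_one_thread = bytes_per_strand(64, 32, 1)

print(f"GT200, 6 threads: {gt200:.0f} bytes per strand")
print(f"RV770, 4 threads: {rv770:.0f} bytes per strand")
print(f"GT200, 1 thread:  {gt200_one_thread:.0f} bytes per strand")
```

This only models the even split of the register file. It says nothing about per-strand allocation limits or how many ALU cycles actually get used at a given thread count.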
So the best case for 1 thread on ATI is 2KB, which makes it really pointless to use 1 thread instead of 2, since 2 threads will also have 2KB.
Jawed