Simon F said: This seems like a rather daft benchmark. Unless I've misread the code, it looks like every instruction is dependent on the previous one. That is highly unlikely to give a realistic estimate of system performance. At least try coding something where there is a decent percentage of independent instructions. (Think throughput vs. latency.)
I think that with the NV40, the register performance hit is more related to the number of registers required in a single clock cycle than the total number of registers used. In other words, I think it has to do with keeping all of the execution units active. There may be other aspects of the performance impact of many registers, of course, but I believe this is the primary issue. I say this because the Shadermark results are independent of the number of registers used (the benchmarks in question use nothing but MADs, of which only one can be executed in a single clock on the NV40 anyway).
kayvonf said: Note that the instruction throughput tests are designed so that at least TWO consecutive instructions are independent, allowing for dual issue on the modern chips to be apparent. Back-to-back instructions are NOT dependent in many of the tests. To enable additional independent instructions, we'd have to generate shaders that kept more registers live, and that would skew the results on NVidia platforms.
Actually, I think the NV3x had the additional problem of using a VLIW architecture (this was stated in one of the interviews around the launch of the NV40), where with the limited available compile time for shaders, it became challenging to obtain optimal use of the hardware.
ERP said: I'd guess that the limitation is the same on NV4x as it was on NV3x, and either instruction latency is significantly lower or the register bank is significantly larger. Texture loads probably just have much larger latency so they expose the register pressure.
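To make the dependent vs. independent distinction discussed above concrete, here is a minimal Python sketch that emits two instruction sequences: one where every instruction reads the previous result, and one that interleaves several independent chains. The function names and the assembly-like text are hypothetical and purely illustrative; this is not the benchmark's actual shader generator.

```python
def dependent_chain(op, length):
    """Every instruction reads the result of the previous one,
    so the sequence is bound by instruction latency."""
    return [f"{op} R0, R0, R0;" for _ in range(length)]


def interleaved_chains(op, length, chains=2):
    """Round-robin over `chains` registers. With chains >= 2, no
    instruction depends on the one immediately before it, which
    leaves room for dual issue. The cost is `chains` live registers."""
    return [
        f"{op} R{i % chains}, R{i % chains}, R{i % chains};"
        for i in range(length)
    ]


if __name__ == "__main__":
    print("\n".join(dependent_chain("ADD", 4)))
    print()
    print("\n".join(interleaved_chains("ADD", 4, chains=2)))
```

Raising `chains` exposes more independent work but also keeps more registers live, which is the trade-off kayvonf describes.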
4-Component Floating Point Input Bandwidth [GB/sec]
floatbandwidth -n -c 4 -f 4 -a single
512 SGL 1 0.5828 6.7027
512 SGL 2 0.8645 9.0374
512 SGL 3 1.1638 10.0696
512 SGL 4 1.7613 8.8714
floatbandwidth -n -c 4 -f 4 -a seq
512 SEQ 1 1.0247 3.8122
512 SEQ 2 1.8765 4.1634
512 SEQ 3 2.7843 4.2089
512 SEQ 4 3.8230 4.0872
floatbandwidth -n -c 4 -f 4 -a random -d
512 DEP-RAND 0 1.0247 3.8120
512 DEP-RAND 1 5.5978 1.3956
512 DEP-RAND 2 10.2947 1.1383
512 DEP-RAND 3 15.1948 1.0283
512 DEP-RAND 4 20.0996 0.9717
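The columns in the bandwidth tables above are not labelled. One reading that reproduces the numbers is: texture size, access pattern, fetch count, time in milliseconds, and bandwidth in GB/sec with 1 GB taken as 2^30 bytes. This is an inference from the arithmetic, not something stated in the output. For the DEP-RAND rows the figures only match if the count column is treated as N dependent fetches on top of one initial fetch. A small sketch to check this, with hypothetical names:

```python
GIB = 2 ** 30  # assumption: "GB" in the output means 2^30 bytes


def bandwidth_gib_per_sec(size, fetches, time_ms,
                          components=4, bytes_per_component=4):
    """Bytes fetched over a size x size domain divided by elapsed time.
    The 4 components x 4 bytes default matches a 4-component float texel."""
    total_bytes = size * size * fetches * components * bytes_per_component
    return total_bytes / (time_ms * 1e-3) / GIB


if __name__ == "__main__":
    # (label, fetches, time_ms, reported GB/sec) copied from the tables above
    rows = [
        ("SGL 1", 1, 0.5828, 6.7027),
        ("SEQ 4", 4, 3.8230, 4.0872),
        ("DEP-RAND 4", 5, 20.0996, 0.9717),  # 4 dependent + 1 initial fetch
    ]
    for label, fetches, time_ms, reported in rows:
        computed = bandwidth_gib_per_sec(512, fetches, time_ms)
        print(f"{label}: computed {computed:.4f}, reported {reported}")
```

Small differences in the last digit are expected because the printed times are rounded.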
Bandwidth: Readback
Fixed Hostmem GL_RGBA Mpix/sec: 47.14 MB/sec: 179.82
Fixed Hostmem GL_BGRA Mpix/sec: 47.36 MB/sec: 180.66
Float Hostmem GL_RGBA Mpix/sec: 11.95 MB/sec: 182.37
Float Hostmem GL_BGRA Mpix/sec: 11.90 MB/sec: 181.55
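The two readback columns appear to be related by pixel size: 4 bytes per pixel for the fixed-point formats and 16 bytes per pixel for the float formats, with 1 MB taken as 2^20 bytes. That relationship is an assumption checked against the figures above, not something the output states. A short sketch with a hypothetical helper:

```python
MIB = 2 ** 20  # assumption: "MB" in the output means 2^20 bytes


def mpix_to_mib_per_sec(mpix_per_sec, bytes_per_pixel):
    """Convert a pixel rate (millions of pixels/sec) to MiB/sec."""
    return mpix_per_sec * 1e6 * bytes_per_pixel / MIB


if __name__ == "__main__":
    print(mpix_to_mib_per_sec(47.14, 4))   # Fixed GL_RGBA, reported 179.82
    print(mpix_to_mib_per_sec(11.95, 16))  # Float GL_RGBA, reported 182.37
```

The float rows agree less tightly because the Mpix/sec figure is only printed to two decimals.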
Bandwidth: Download [MB/sec]
1 473.510000
2 326.464789
3 454.808158
4 620.986023
Instruction Issue[4D]
512 2.6629 ADD 4 64
512 2.6626 SUB 4 64
512 2.6607 MUL 4 64
512 3.9384 MAD 4 64
512 1.6465 EX2 4 64
512 1.6465 LG2 4 64
512 0.8415 POW 4 64
512 1.6714 FLR 4 64
512 1.6714 FRC 4 64
512 0.8377 RSQ 4 64
512 1.6469 RCP 4 64
512 1.6469 SIN 4 64
512 1.6469 COS 4 64
512 1.5541 SCS 4 64
512 4.3933 DP3 4 64
512 4.3898 DP4 4 64
512 0.8169 XPD 4 64
Scalar vs Vector Instruction Issue
512 5.2231 ADD 1 40
512 2.5690 ADD 4 40
512 4.8560 SUB 1 40
512 2.5683 SUB 4 40
512 5.2231 MUL 1 40
512 2.5690 MUL 4 40
512 4.1138 MAD 1 40
512 3.5116 MAD 4 40
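One way to read the scalar vs. vector table is to compare the rate for each instruction at 1 component against the rate at 4 components. The sketch below assumes the second column is an issue rate and the column after the opcode is the component count; both are inferred from the section heading, and the units are not given in the output. The ratios are unit-free either way.

```python
# (scalar rate, vector rate) copied from the table above
rates = {
    "ADD": (5.2231, 2.5690),
    "SUB": (4.8560, 2.5683),
    "MUL": (5.2231, 2.5690),
    "MAD": (4.1138, 3.5116),
}

# Ratio of the 1-component rate to the 4-component rate per instruction
ratios = {op: scalar / vector for op, (scalar, vector) in rates.items()}

if __name__ == "__main__":
    for op, ratio in ratios.items():
        print(f"{op}: scalar/vector = {ratio:.2f}")
```

On these numbers ADD, SUB and MUL issue roughly twice as fast in scalar form, while MAD gains much less.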
Instruction Precision
RSQ absolute 9.621456e-007 1.010693e-005
RCP absolute 4.758402e-006 9.399302e-005
SIN absolute 6.390911e-008 2.272320e-007
COS absolute 7.506861e-008 2.453860e-007
EX2 absolute 7.112015e-008 3.608786e-007
LG2 absolute 2.202467e-007 1.165460e-006
SIN absolute 6.077859e-002 9.999976e-001
COS absolute 7.203143e-009 5.912550e-008
EX2 absolute 7.902257e-008 1.192086e-007