GPUBench - a CPU benchmark for GPUs

Simon F said:
This seems like a rather daft benchmark. Unless I've misread the code, it looks like every instruction is dependent on the previous one. That is highly unlikely to give a realistic estimate of system performance. At least try coding something where there is a decent percentage of independent instructions. (Think throughput vs. latency.)

I would tend to agree with you, Simon, although it depends to some extent on how the chips hide ALU latency in the pixel shader, which probably depends on the number of in-flight quads. There is some scope for reordering, at least in the MUL example tb posted, but it's limited.

The bigger problem is the same one a lot of CPU benchmarks have: it's going to be very difficult to come up with instruction sequences that basically do nothing, allow for reasonable reordering, and aren't grossly simplified by the pixel shader assembler. And unlike in the CPU world, the output is a black box, so you have no way to know whether instruction sequences have been collapsed. Probably the best you can do is test multiple sequences and look for the modal average.
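
To make the throughput-versus-latency point concrete, here is a rough sketch of the two kinds of shader bodies, written as ARB fragment program strings the way a C++ harness might embed them. This is my own illustration, not code from GPUBench: the first body is a fully serial MUL chain, the second interleaves two independent chains so back-to-back instructions never depend on each other.

Code:
// Hypothetical ARB_fragment_program bodies, for illustration only.
// Fully dependent: every MUL reads the result of the previous one, so the
// issue rate is bounded by ALU latency unless enough quads are in flight.
static const char* kDependentChain =
    "!!ARBfp1.0\n"
    "TEMP r0;\n"
    "MOV r0, fragment.color;\n"
    "MUL r0, r0, r0;\n"
    "MUL r0, r0, r0;\n"
    "MUL r0, r0, r0;\n"
    "MUL r0, r0, r0;\n"
    "MOV result.color, r0;\n"
    "END\n";

// Two interleaved chains: back-to-back instructions are independent, so the
// shader core can overlap their execution (throughput rather than latency).
static const char* kInterleavedChains =
    "!!ARBfp1.0\n"
    "TEMP r0, r1;\n"
    "MOV r0, fragment.color;\n"
    "MOV r1, fragment.texcoord[0];\n"
    "MUL r0, r0, r0;\n"
    "MUL r1, r1, r1;\n"
    "MUL r0, r0, r0;\n"
    "MUL r1, r1, r1;\n"
    "ADD result.color, r0, r1;\n"
    "END\n";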
 
some comments from the benchmark authors

The off-the-charts performance of the ADD and RCP benchmarks is due to the driver optimizer being intelligent: it folds multiple ADD operations into a MUL and removes repeated RCP instructions.
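
As a purely illustrative example of that folding (this is pseudo-assembly I made up, not the shader the test actually emits): a chain of dependent ADDs of the same operand is just a scale, and repeated RCPs of the same operand only need to be computed once.

Code:
// Purely illustrative pseudo-assembly, not GPUBench output.
// What the benchmark intends to issue:
static const char* kAsWritten =
    "ADD r0, c0, c0;\n"   // r0 = 2 * c0
    "ADD r0, r0, c0;\n"   // r0 = 3 * c0
    "ADD r0, r0, c0;\n"   // r0 = 4 * c0
    "RCP r1, c0.x;\n"
    "RCP r1, c0.x;\n";

// What an intelligent driver optimizer can legally reduce it to, which is
// why the ADD and RCP numbers come out looking far too good:
static const char* kAfterFolding =
    "MUL r0, c0, 4.0;\n"  // the whole ADD chain is a single scale by 4
    "RCP r1, c0.x;\n";    // the redundant RCPs collapse to one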

Note that the instruction throughput tests are designed so that at least TWO consecutive instructions are independent, allowing dual issue on modern chips to show up. Back-to-back instructions are NOT dependent in many of the tests. To enable additional independent instructions, we'd have to generate shaders that kept more registers live, and that would skew the results on NVIDIA platforms.

Yes, as someone else on the board noted, to generate parameterized tests we unfortunately resorted to generating the shaders in the C++ code.
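
A minimal sketch of what that kind of generation might look like (the function name and structure are my own, not GPUBench's): emit an ARB fragment program with the requested number of MAD instructions round-robined over a few independent temporaries, so consecutive instructions stay independent.

Code:
#include <sstream>
#include <string>

// Hypothetical generator, for illustration only: emits an ARB fragment
// program with 'numInstructions' MAD ops spread round-robin over
// 'numChains' independent temporaries. More chains means more live
// registers and more instruction-level parallelism for the hardware.
std::string GenerateMadShader(int numInstructions, int numChains) {
    std::ostringstream s;
    s << "!!ARBfp1.0\n" << "TEMP";
    for (int c = 0; c < numChains; ++c)
        s << (c ? ", r" : " r") << c;
    s << ";\n";
    for (int c = 0; c < numChains; ++c)
        s << "MOV r" << c << ", fragment.color;\n";
    for (int i = 0; i < numInstructions; ++i) {
        int c = i % numChains;  // round-robin over the independent chains
        s << "MAD r" << c << ", r" << c << ", r" << c << ", r" << c << ";\n";
    }
    s << "MOV result.color, r0;\n" << "END\n";
    return s.str();
}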

GPUBench tests are designed to measure performance in contrived situations for developers writing GPGPU applications; that is why we test unfiltered floating point bandwidth. By measuring GPUs in isolated, directed tests, we hoped to give developers information that would be useful in determining whether particular numerical applications are good targets for GPU acceleration. Floating point bandwidth, download, and readback rates are critical pieces of information to have when considering using GPUs for these other purposes.
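
For anyone wondering what a readback measurement boils down to, here is a rough sketch under my own assumptions (a current GL context and an already-rendered floating point RGBA framebuffer); the GL calls are standard, the framing around them is mine. Flush the pipeline with glFinish, then time repeated glReadPixels calls.

Code:
#include <GL/gl.h>
#include <chrono>
#include <vector>

// Rough sketch of a readback bandwidth measurement. Assumes a current GL
// context and a width x height float RGBA render target already drawn.
double MeasureReadbackMBps(int width, int height, int iterations) {
    std::vector<float> pixels(static_cast<size_t>(width) * height * 4);
    glFinish();  // make sure rendering has completed before starting the clock
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, pixels.data());
    auto stop = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    double megabytes = iterations * pixels.size() * sizeof(float) / (1024.0 * 1024.0);
    return megabytes / seconds;  // MB/sec
}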

For those asking questions about what the tests do, we've tried to supply some level of documentation at http://gpubench.sf.net
 
Re: some comments from the benchmark authors

kayvonf said:
Note that the instruction throughput tests are designed so that at least TWO consecutive instructions are independent, allowing dual issue on modern chips to show up. Back-to-back instructions are NOT dependent in many of the tests. To enable additional independent instructions, we'd have to generate shaders that kept more registers live, and that would skew the results on NVIDIA platforms.
I think that with the NV40, the register performance hit is related more to the number of registers required in a single clock cycle than to the total number of registers used. In other words, I think it has to do with keeping all of the execution units active. There may be other aspects to the performance impact of many registers, of course, but I believe this is the primary issue. I claim this because Shadermark results are independent of the number of registers used (the benchmarks in question use nothing but MADs, of which only one can be executed per clock on the NV40 anyway).
 
register pressure

No, I believe the issue does indeed involve the total number of active registers in a shader.

However, I stated things incorrectly in my post. As long as no texture fetches are performed, which is the case with the instruction throughput test in question, NV40 should not take a hit when using more live registers in a shader, so this test could be written to use additional registers if you wanted to see how wide the hardware can go. I suspect you won't see instruction throughput increase (over the dual-issue case) on boards that are currently out on the market.
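
If anyone wants to try that, a hedged sketch of the idea, reusing the hypothetical GenerateMadShader() sketched earlier in the thread: sweep the number of independent accumulator chains (and therefore live registers) and see whether the measured issue rate moves.

Code:
// Reuses the hypothetical GenerateMadShader() from the earlier sketch.
// Compiling the program and timing the draw are left to the existing harness.
for (int chains = 2; chains <= 8; chains *= 2) {
    std::string src = GenerateMadShader(/*numInstructions=*/128, chains);
    // load 'src' as an ARB fragment program, draw a full-screen quad many
    // times, and report the MAD issue rate for this chain count
}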
 
Oh, right, I'd forgotten about the limitation that, apparently, register memory limited the number of quads you could have in flight at any point in time on the NV3x. Are you thinking, then, that on the NV4x, using more registers results in reduced texture cache per quad instead?
 
I'd guess that the limitation is the same on NV4x as it was on NV3x, and either instruction latency is significantly lower or the register bank is significantly larger. Texture loads probably just have much larger latency so they expose the register pressure.
 
ERP said:
I'd guess that the limitation is the same on NV4x as it was on NV3x, and either instruction latency is significantly lower or the register bank is significantly larger. Texture loads probably just have much larger latency so they expose the register pressure.
Actually, I think the NV3x had the additional problem of using a VLIW architecture (this was stated in one of the interviews around the launch of the NV40): with the limited compile time available for shaders, it was challenging to make optimal use of the hardware.

The NV4x architecture, on the other hand, uses an on-chip instruction decoder that, according to said interview, improves instruction throughput by reducing the pressure on the compiler to optimize shaders before sending them to the hardware. From what I gather, this essentially amounted to changing the instruction set so that it no longer handled low-level aspects of the hardware, but let the hardware take care of many of these things itself.
 
I tried to run GPUBench on the S3G DeltaChrome, but most tests report an unknown IHV.

Here are my NV35 benchmark results:

GeForce FX 5900 Ultra (450 MHz core / 850 MHz memory), driver 66.51
Code:
4-Component Floating Point Input Bandwidth [GB/sec]
floatbandwidth -n -c 4 -f 4 -a single 
512     SGL     1   0.5828   6.7027
512     SGL     2   0.8645   9.0374
512     SGL     3   1.1638   10.0696
512     SGL     4   1.7613   8.8714

floatbandwidth -n -c 4 -f 4 -a seq 
512     SEQ     1   1.0247   3.8122
512     SEQ     2   1.8765   4.1634
512     SEQ     3   2.7843   4.2089
512     SEQ     4   3.8230   4.0872

floatbandwidth -n -c 4 -f 4 -a random -d 
512     DEP-RAND     0   1.0247   3.8120
512     DEP-RAND     1   5.5978   1.3956
512     DEP-RAND     2   10.2947   1.1383
512     DEP-RAND     3   15.1948   1.0283
512     DEP-RAND     4   20.0996   0.9717

Code:
Bandwidth: Readback 
Fixed	Hostmem  GL_RGBA       Mpix/sec: 47.14  MB/sec: 179.82
Fixed	Hostmem  GL_BGRA       Mpix/sec: 47.36  MB/sec: 180.66
Float	Hostmem  GL_RGBA       Mpix/sec: 11.95  MB/sec: 182.37
Float	Hostmem  GL_BGRA       Mpix/sec: 11.90  MB/sec: 181.55

Code:
Bandwidth: Download [MB/sec]
1	473.510000
2	326.464789
3	454.808158
4	620.986023

Code:
Instruction Issue[4D] 
512      2.6629       ADD          4         64
512      2.6626       SUB          4         64
512      2.6607       MUL          4         64
512      3.9384       MAD          4         64
512      1.6465       EX2          4         64
512      1.6465       LG2          4         64
512      0.8415       POW          4         64
512      1.6714       FLR          4         64
512      1.6714       FRC          4         64
512      0.8377       RSQ          4         64
512      1.6469       RCP          4         64
512      1.6469       SIN          4         64
512      1.6469       COS          4         64
512      1.5541       SCS          4         64
512      4.3933       DP3          4         64
512      4.3898       DP4          4         64
512      0.8169       XPD          4         64


Code:
Scalar vs Vector Instruction Issue 
512      5.2231       ADD          1         40
512      2.5690       ADD          4         40
512      4.8560       SUB          1         40
512      2.5683       SUB          4         40
512      5.2231       MUL          1         40
512      2.5690       MUL          4         40
512      4.1138       MAD          1         40
512      3.5116       MAD          4         40

Code:
Instruction Precision 
RSQ absolute 9.621456e-007 1.010693e-005
RCP absolute 4.758402e-006 9.399302e-005
SIN absolute 6.390911e-008 2.272320e-007
COS absolute 7.506861e-008 2.453860e-007
EX2 absolute 7.112015e-008 3.608786e-007
LG2 absolute 2.202467e-007 1.165460e-006
SIN absolute 6.077859e-002 9.999976e-001
COS absolute 7.203143e-009 5.912550e-008
EX2 absolute 7.902257e-008 1.192086e-007
 
There also seem to be files for ATI cards and others for NVIDIA cards, so it seems each card has a specific codepath. If you don't have a card from one of those IHVs, the "benchmark" does not run.

Talk about a benchmark... :rolleyes:
 