GPUBench - a CPU benchmark for GPU's

Discussion in 'Architecture and Products' started by tb, Oct 8, 2004.

  1. wzh100

    Newcomer

    Joined:
    Aug 26, 2004
    Messages:
    44
    Likes Received:
    0
    Location:
    heaven
    kaka,interesting????????
     
  2. ERP

    ERP
    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    I would tend to agree with you, Simon, although it depends to some extent on how the chips hide the ALU latency in the pixel shader, which is probably dependent on the number of in-flight quads. There is some scope for reordering, at least in the mul example tb posted, but it's limited.

    The bigger problem is the same as with a lot of CPU benchmarks: it's going to be very difficult to come up with instruction sequences that basically do nothing, allow for reasonable reordering and aren't grossly simplified by the pixel shader assembler. And unlike in the CPU world, your output is a black box, so you have no way to know if instruction sequences have been collapsed. Probably the best you can do is test multiple sequences and look for the modal average.
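A minimal sketch of the "modal average" idea, assuming throughput is measured once per instruction sequence and the readings are grouped into fixed-width bins. The function name, the bin width and the sample values are illustrative only and are not taken from GPUBench.

```python
from collections import Counter


def modal_throughput(samples, bin_width=0.05):
    """Return the centre of the most populated bin of `samples`.

    Binning is needed because repeated timings rarely match exactly.
    `bin_width` is an assumed tolerance, not a GPUBench parameter.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    if not samples:
        raise ValueError("need at least one sample")
    # Map each reading to the index of its nearest bin and count occupants.
    bins = Counter(round(s / bin_width) for s in samples)
    index, _count = bins.most_common(1)[0]
    return index * bin_width


# Made-up readings: four sequences agree, one was collapsed by the
# driver and reports an implausibly high rate.
readings = [2.66, 2.66, 2.67, 5.20, 2.65]
typical = modal_throughput(readings)
```

The mode ignores the single outlier, whereas a mean would be pulled towards it. That is the reason to prefer it when some sequences may have been silently simplified.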
     
  3. kayvonf

    Newcomer

    Joined:
    Jul 23, 2004
    Messages:
    4
    Likes Received:
    0
    some comments from the benchmark authors

    The off-the-charts performance of the ADD and RCP benchmarks is due to the driver optimizer being intelligent: it folds multiple ADD operations into a MUL and removes repeated RCP instructions.

    Note that the instruction throughput tests are designed so that at least TWO consecutive instructions are independent, allowing for dual issue on the modern chips to be apparent. Back to back instructions are NOT dependent in many of the tests. To enable additional independent instructions, we'd have to generate shaders that kept more registers live, and that would skew the results on NVidia platforms.

    Yes, as was previously stated by someone else on the board, unfortunately, to generate parameterized tests, we resorted to generating the shaders in the C++ code.
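As a rough illustration of generating a parameterized issue test with independent back-to-back instructions, here is a sketch that emits ARB_fragment_program-style text. It is written in Python for brevity, whereas GPUBench builds its shaders in C++, and the thread does not show the actual shader text, so the function name, register names and layout below are assumptions.

```python
def make_issue_shader(op="MUL", count=64, chains=2):
    """Build fragment program text with `count` `op` instructions spread
    round-robin over `chains` registers, so that consecutive instructions
    do not depend on each other when chains >= 2.
    """
    if op not in ("ADD", "SUB", "MUL"):
        raise ValueError("sketch only handles two-operand vector ops")
    # Upper bound is an assumption: each chain is seeded from its own
    # texture coordinate set, and the number available varies by hardware.
    if count < 1 or not 1 <= chains <= 8:
        raise ValueError("count must be >= 1 and chains in 1..8")

    regs = [f"r{i}" for i in range(chains)]
    lines = ["!!ARBfp1.0", "TEMP " + ", ".join(regs) + ";"]

    # Seed every chain from a different input so they are not identical.
    for i, reg in enumerate(regs):
        lines.append(f"MOV {reg}, fragment.texcoord[{i}];")

    # Interleave the chains: instruction n touches only register n % chains.
    for n in range(count):
        reg = regs[n % chains]
        lines.append(f"{op} {reg}, {reg}, {reg};")

    # Fold the chains together so none of them is dead code.
    for reg in regs[1:]:
        lines.append(f"ADD {regs[0]}, {regs[0]}, {reg};")

    lines.append(f"MOV result.color, {regs[0]};")
    lines.append("END")
    return "\n".join(lines)


shader_text = make_issue_shader("MUL", count=4, chains=2)
```

Raising `chains` exposes more independent work per clock but keeps more registers live, which is the trade-off described in the post.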

    GPUBench tests are designed to measure performance in contrived situations for developers writing GPGPU applications. We test unfiltered floating point bandwidth for this reason. By measuring GPUs in isolated, directed tests, we hoped to give developers access to information that would be useful in determining whether particular numerical applications are good targets for GPU acceleration. Floating point bandwidth, download, and readback rates are critical pieces of information to have when considering using GPUs for these other purposes.

    For those asking questions about what the tests do, we've tried to supply some level of documentation at http://gpubench.sf.net
     
  4. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Re: some comments from the benchmark authors

    I think that with the NV40, the register performance hit is more related to the number of registers required in a single clock cycle than the total number of registers used. In other words, I think it has to do with keeping all of the execution units active. There may be other aspects of the performance impact of many registers, of course, but I believe this is the primary issue. I claim this due to the independence of Shadermark results on the number of registers used (the benchmarks in question use nothing but MAD's, of which only one can be executed in a single clock on the NV40 anyway).
     
  5. kayvonf

    Newcomer

    Joined:
    Jul 23, 2004
    Messages:
    4
    Likes Received:
    0
    register pressure

    No, I believe the issue does indeed involve the total number of active registers in a shader.

    However, I stated things incorrectly in my post. As long as no texture fetches are performed, which is the case with the instruction throughput test in question, NV40 should not take a hit when using more live registers in a shader, so this test could be written to use additional registers if you wanted to see how wide the hardware can go. I suspect you won't see instruction throughput increase (over the dual-issue case) on boards that are currently out on the market.
     
  6. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Oh, right, I'd forgotten about the limitation that, apparently, register memory limited the number of quads you could have in flight at any point in time on the NV3x. Are you thinking, then, that on the NV4x, using more registers results in reduced texture cache per quad instead?
     
  7. ERP

    ERP
    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    I'd guess that the limitation is the same on NV4x as it was on NV3x, and either instruction latency is significantly lower or the register bank is significantly larger. Texture loads probably just have much larger latency so they expose the register pressure.
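The argument can be made concrete with a toy model, assuming a fixed register file shared by all in-flight quads and one quad issued per clock. Every number below is invented for illustration and none describes NV3x or NV4x hardware.

```python
def quads_in_flight(regfile_slots, live_regs):
    """Quads that fit when each one needs `live_regs` register slots."""
    if regfile_slots < 1 or live_regs < 1:
        raise ValueError("arguments must be positive")
    return regfile_slots // live_regs


def utilisation(regfile_slots, live_regs, latency_cycles):
    """Fraction of issue slots kept busy in this toy model.

    A latency of N cycles is fully hidden once N quads are in flight;
    with fewer, the pipeline idles for the remainder.
    """
    if latency_cycles < 1:
        raise ValueError("latency_cycles must be positive")
    return min(1.0, quads_in_flight(regfile_slots, live_regs) / latency_cycles)


SLOTS = 256        # hypothetical register file size
ALU_LATENCY = 8    # hypothetical short ALU latency, in cycles
TEX_LATENCY = 200  # hypothetical long texture latency, in cycles

alu_2 = utilisation(SLOTS, 2, ALU_LATENCY)  # 128 quads in flight
alu_4 = utilisation(SLOTS, 4, ALU_LATENCY)  # 64 quads in flight
tex_2 = utilisation(SLOTS, 2, TEX_LATENCY)
tex_4 = utilisation(SLOTS, 4, TEX_LATENCY)
```

With these made-up figures the ALU-only case stays fully utilised whether 2 or 4 registers are live, while the texture case drops as soon as the register count halves the number of quads in flight.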
     
  8. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Actually, I think the NV3x had the additional problem of using a VLIW architecture (this was stated in one of the interviews around the launch of the NV40), where with the limited available compile time for shaders, it became challenging to obtain optimal use of the hardware.

    The NV4x architecture, on the other hand, uses an on-chip instruction decoder that, according to said interview, improves instruction throughput by reducing the pressure on the compiler to optimize shaders before sending them to the hardware. From what I gather, this essentially amounted to changing the instruction set so that it no longer handled low-level aspects of the hardware, but let the hardware take care of many of these things.
     
  9. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    422
    Likes Received:
    16
    I tried to run GPUBench on the S3G DeltaChrome, but most tests report an unknown IHV.

    Here are my NV35 benchmark results:

    GeForce 5900 Ultra 450MHz|850MHz 66.51
    Code:
    4-Component Floating Point Input Bandwidth [GB/sec]
    floatbandwidth -n -c 4 -f 4 -a single 
    512     SGL     1   0.5828   6.7027
    512     SGL     2   0.8645   9.0374
    512     SGL     3   1.1638   10.0696
    512     SGL     4   1.7613   8.8714
    
    floatbandwidth -n -c 4 -f 4 -a seq 
    512     SEQ     1   1.0247   3.8122
    512     SEQ     2   1.8765   4.1634
    512     SEQ     3   2.7843   4.2089
    512     SEQ     4   3.8230   4.0872
    
    floatbandwidth -n -c 4 -f 4 -a random -d 
    512     DEP-RAND     0   1.0247   3.8120
    512     DEP-RAND     1   5.5978   1.3956
    512     DEP-RAND     2   10.2947   1.1383
    512     DEP-RAND     3   15.1948   1.0283
    512     DEP-RAND     4   20.0996   0.9717
    Code:
    Bandwidth: Readback 
    Fixed	Hostmem  GL_RGBA       Mpix/sec: 47.14  MB/sec: 179.82
    Fixed	Hostmem  GL_BGRA       Mpix/sec: 47.36  MB/sec: 180.66
    Float	Hostmem  GL_RGBA       Mpix/sec: 11.95  MB/sec: 182.37
    Float	Hostmem  GL_BGRA       Mpix/sec: 11.90  MB/sec: 181.55
    Code:
    Bandwidth: Download [MB/sec]
    1	473.510000
    2	326.464789
    3	454.808158
    4	620.986023
    Code:
    Instruction Issue[4D] 
    512      2.6629       ADD          4         64
    512      2.6626       SUB          4         64
    512      2.6607       MUL          4         64
    512      3.9384       MAD          4         64
    512      1.6465       EX2          4         64
    512      1.6465       LG2          4         64
    512      0.8415       POW          4         64
    512      1.6714       FLR          4         64
    512      1.6714       FRC          4         64
    512      0.8377       RSQ          4         64
    512      1.6469       RCP          4         64
    512      1.6469       SIN          4         64
    512      1.6469       COS          4         64
    512      1.5541       SCS          4         64
    512      4.3933       DP3          4         64
    512      4.3898       DP4          4         64
    512      0.8169       XPD          4         64

    Code:
    Scalar vs Vector Instruction Issue 
    512      5.2231       ADD          1         40
    512      2.5690       ADD          4         40
    512      4.8560       SUB          1         40
    512      2.5683       SUB          4         40
    512      5.2231       MUL          1         40
    512      2.5690       MUL          4         40
    512      4.1138       MAD          1         40
    512      3.5116       MAD          4         40
    Code:
    Instruction Precision 
    RSQ absolute 9.621456e-007 1.010693e-005
    RCP absolute 4.758402e-006 9.399302e-005
    SIN absolute 6.390911e-008 2.272320e-007
    COS absolute 7.506861e-008 2.453860e-007
    EX2 absolute 7.112015e-008 3.608786e-007
    LG2 absolute 2.202467e-007 1.165460e-006
    SIN absolute 6.077859e-002 9.999976e-001
    COS absolute 7.203143e-009 5.912550e-008
    EX2 absolute 7.902257e-008 1.192086e-007
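The two columns of the readback listing can be cross-checked against each other. The sketch below assumes 4 bytes per pixel for the fixed-point RGBA/BGRA rows, 16 bytes per pixel for the float rows, Mpix meaning 10^6 pixels and MB meaning 2^20 bytes. None of this is stated in the thread; it is inferred because the posted numbers agree under those assumptions.

```python
BYTES_PER_MB = 2 ** 20  # assumed: MB reported as 2^20 bytes


def readback_mb_per_sec(mpix_per_sec, bytes_per_pixel):
    """Convert a readback rate in Mpix/sec (10^6 pixels) to MB/sec."""
    if mpix_per_sec < 0 or bytes_per_pixel < 1:
        raise ValueError("invalid rate or pixel size")
    return mpix_per_sec * 1e6 * bytes_per_pixel / BYTES_PER_MB


# Rates copied from the readback listing above.
fixed_rgba = readback_mb_per_sec(47.14, 4)   # listing reports 179.82
float_rgba = readback_mb_per_sec(11.95, 16)  # listing reports 182.37
```

The fixed-point row matches to two decimals. The float row is off by a few hundredths, which is consistent with the Mpix/sec value having been rounded before printing. Either way, the byte rate is nearly the same for both formats, so readback on this board looks limited by bytes transferred rather than by pixel count.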
     
  10. Robbitop

    Newcomer

    Joined:
    Oct 22, 2003
    Messages:
    77
    Likes Received:
    4
    Location:
    Rostock - Germany
    GPUBench doesn't recognize my DeltaChrome and the app aborts.
     
  11. vnet

    Newcomer

    Joined:
    Jan 24, 2004
    Messages:
    70
    Likes Received:
    0
    There also seem to be some files for ATI and others for nVidia cards, so it seems each IHV has a specific codepath. If you don't have a card from one of those IHVs, the "benchmark" does not run.

    Talk about a benchmark... :roll:
     
  12. tb

    tb
    Newcomer

    Joined:
    Feb 7, 2002
    Messages:
    241
    Likes Received:
    0
    Location:
    Germany / Thuringia