GPCBenchmark - An OpenCL General Purpose Computing benchmark

Discussion in 'GPGPU Technology & Programming' started by Arnold Beckenbauer, Apr 30, 2010.

  1. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    AFAIK, the only available 64-bit ops in Cypress are FMA, Mul or Add per cycle. :???:
     
  2. Florin

    Florin Merrily dodgy
    Veteran Subscriber

    Joined:
    Aug 27, 2003
    Messages:
    1,650
    Likes Received:
    220
    Location:
    The colonies
  3. Arnold Beckenbauer

    Veteran

    Joined:
    Oct 11, 2006
    Messages:
    1,416
    Likes Received:
    350
    Location:
    Germany
    #23 Arnold Beckenbauer, May 1, 2010
    Last edited by a moderator: May 1, 2010
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Yes, DP-ADD and DP-FMA result in the same FLOPs on Cypress.
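
     As a back-of-the-envelope check (assuming the commonly quoted Cypress rates of one DP FMA or two DP ADDs per VLIW unit per clock, and HD 5870's 320 VLIW units at 850MHz):

         DP FMA: 320 x 1 op  x 2 flops x 0.85GHz ≈ 544 GFLOPS
         DP ADD: 320 x 2 ops x 1 flop  x 0.85GHz ≈ 544 GFLOPS
         DP MUL: 320 x 1 op  x 1 flop  x 0.85GHz ≈ 272 GFLOPS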
     
  5. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I wonder if they used a naive BW limited algorithm.
     
  6. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    I've actually had the opposite experience with DirectCompute, so I'd be wary of pronouncing architectural judgments. Particularly in OpenCL everyone's drivers/compilers seem pretty immature so far in my experience.

    I'd also like to point out that these sorts of benchmarks are becoming less and less useful... performance portability is basically dead for anything non-trivial - even across the same GPU line, but definitely between vendors. Writing efficient code increasingly involves architecture-specific paths. Mint's note about the different matrix multiply algorithms is a perfect example, but more complex code has even more variations, peaks and valleys.
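
     As a concrete sketch of the matrix multiply point (OpenCL C, illustrative only - not the benchmark's code - with a TILE constant and work-group shape that would need retuning per architecture; assumes N is a multiple of TILE and a TILE x TILE work-group):

         // Naive version: one work-item per output element, every operand
         // fetched straight from global memory. Usually bandwidth-bound.
         __kernel void sgemm_naive(__global const float *A,
                                   __global const float *B,
                                   __global float *C,
                                   const int N)
         {
             int row = get_global_id(1);
             int col = get_global_id(0);
             float acc = 0.0f;
             for (int k = 0; k < N; ++k)
                 acc += A[row * N + k] * B[k * N + col];
             C[row * N + col] = acc;
         }

         // Tiled version: stage TILE x TILE blocks of A and B in local memory.
         // TILE and the work-group shape are exactly the knobs that have to be
         // retuned per vendor and per generation.
         #define TILE 16
         __kernel void sgemm_tiled(__global const float *A,
                                   __global const float *B,
                                   __global float *C,
                                   const int N)
         {
             __local float Asub[TILE][TILE];
             __local float Bsub[TILE][TILE];
             int lx = get_local_id(0), ly = get_local_id(1);
             int col = get_group_id(0) * TILE + lx;
             int row = get_group_id(1) * TILE + ly;
             float acc = 0.0f;
             for (int t = 0; t < N / TILE; ++t) {
                 Asub[ly][lx] = A[row * N + t * TILE + lx];
                 Bsub[ly][lx] = B[(t * TILE + ly) * N + col];
                 barrier(CLK_LOCAL_MEM_FENCE);
                 for (int k = 0; k < TILE; ++k)
                     acc += Asub[ly][k] * Bsub[k][lx];
                 barrier(CLK_LOCAL_MEM_FENCE);
             }
             C[row * N + col] = acc;
         }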
     
  7. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    This is actually a pretty serious problem for GPGPU in general, IMHO. Of course, CPUs have similar problems too, but since the performance cliffs on CPUs are generally much more gradual, they tolerate suboptimal code (for a particular architecture) better, while GPUs, in general, can't.

    Especially since we currently have two major GPU vendors with very different underlying architectures, it's rare for one program to perform well on both. Even relatively simple kernels need to be designed specifically for each architecture, which increases costs. Not to mention that current OpenCL compilers are still relatively immature, despite all being based on LLVM.

    DirectCompute is, on the other hand, more mature thanks to its longer development history (the compiler is very similar to older HLSL compilers). However, currently DirectCompute is still not as powerful as OpenCL or CUDA (for example, it still doesn't accept thread sync in non-unrolled loops).
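
     For reference, the pattern at issue - a group sync inside a loop the compiler is not asked to unroll - looks like this in OpenCL C (a minimal reduction sketch, not from the benchmark; assumes a power-of-two work-group size):

         __kernel void group_reduce(__global const float *in,
                                    __global float *out,
                                    __local float *scratch)
         {
             int lid = get_local_id(0);
             scratch[lid] = in[get_global_id(0)];
             barrier(CLK_LOCAL_MEM_FENCE);

             // The trip count depends only on the group size, so every
             // work-item reaches the barrier the same number of times.
             for (int stride = (int)get_local_size(0) / 2; stride > 0; stride /= 2) {
                 if (lid < stride)
                     scratch[lid] += scratch[lid + stride];
                 barrier(CLK_LOCAL_MEM_FENCE);
             }
             if (lid == 0)
                 out[get_group_id(0)] = scratch[0];
         }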
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    256x256 is simply too small to exercise any algorithm on any high-end GPU. It's only ~32MFLOP to do the entire computation. Think about how long it takes to launch threads and the count of threads per SIMD required to hide latency.

    See the graphs in the best-case NVidia SGEMM here:

    http://forums.nvidia.com/index.php?showtopic=159033&st=0

    See table 6, figure 38 and figure 52. It's interesting to see how an improvement on Volkov's algorithm for "large N" actually has worse performance at N=256.

    Regardless, ~200GFLOPS at N=256 is clearly not exercising the GPU.

    Back to this test: ~1153 times per second on HD5870 is ~39GFLOPS. 2537 times per second on GTX480 is ~85GFLOPS.
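
     Working those numbers through (using 2 x 256^3 = 2^25 ≈ 33.6M flops per run, i.e. the ~32MFLOP figure above):

         1153 runs/s x 2^25 flops ≈ 38.7 GFLOPS (HD 5870)
         2537 runs/s x 2^25 flops ≈ 85.1 GFLOPS (GTX 480)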

    For whatever reason GF100 is much much happier here.

    Jawed
     
  9. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Not totally true. It won't accept syncs inside dynamic control flow, but I'm pretty sure CUDA has the same limitation (it's a programming model problem really). The HLSL compiler still has a few bugs and rejects stuff that should be valid, but it does properly accept non-unrolled static loops with coherent group syncs.
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    As to the overall question of variation amongst GPU architectures getting in the way, well I see three factors:
    1. I bet Larrabee isn't as picky - GPUs are super-picky and theoretically getting less picky as time goes by
    2. algorithms are tricky: if you can get a 50x speed-up (or performance per watt or per $, etc.), do you really care that it's possible to get 60x? 70x? etc.
    3. comparing with CPUs/existing HPC is fraught because those platforms often aren't optimal, either:
    http://www.drdobbs.com/high-perform...2JVN?cid=RSSfeed_DDJ_HighPerformanceComputing

    Nothing too surprising there - but the point being that conventional, non-GPU systems are still working out significant performance wrinkles. GPUs are hardly alone - and there's mileage in saying that GPUs are actually easier for certain kinds of problems.

    Jawed
     
  11. Arnold Beckenbauer

    Veteran

    Joined:
    Oct 11, 2006
    Messages:
    1,416
    Likes Received:
    350
    Location:
    Germany
    I like this benchmark, because it shows what my HD4850 doesn't support (and won't ever support). But its results are strange.
     
  12. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Right, but that's more of a CPU-style optimization/algorithm argument... and that's why very few people get very deep into CPU architecture-specific optimization for the most part. The problem is that on GPUs it's often more like 4x faster (or slower!) or more, and that's the difference between perfectly reasonable and non-interactive in some cases...
     
  13. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Well, I couldn't get it to work in my multi-loop NLM kernel back then (it only worked after I removed [loop] before all the loops, or after using [unroll], but that makes the kernel very slow).
    If this is caused by a compiler bug, then it's much more serious than other compiler bugs in CUDA or OpenCL.
     
  14. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Can you post the loop/code? It definitely works in some situations (and should in all) because even basic scans/reductions/etc. wouldn't work without that ability...
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Shortly, x86 is going to have three quite different vector instruction sets: SSE-based, AVX and LRBni. That's going to cause years of "pain", with a "4-fold" range of vector widths...

    In theory OpenCL is well-placed to smooth over some of that pain, since the baseline encompasses scalar, vec4, vec8 and vec16 processing as well as being task-parallel and data-parallel. Though OpenCL is still showing quite severe CPU performance problems...

    On the other side of the coin there seems to be serious research into using Atom as a cluster processor, because of its power efficiency. That versus LRBni should be interesting...

    So GPUs "look bad" because one doesn't have to cast one's eyes far to see most of the range of the problem. But with x86 the field is so big it's harder to spot the wild variations, to spot terrible absolute performance or awful scaling with core count.

    Though I admit there's little documentation of the "we failed with GPGPU" type, which obviously biases things.

    Jawed
     
  16. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I'll have to look it up; I don't do compute shaders very often :p
    It has quite complex loops (three levels in total, with the syncs in the second level), but all loops are static and predictable (with constant loop counts). I think it could just be a stupid compiler bug.
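
     Roughly this shape, in OpenCL C rather than HLSL (a hypothetical sketch with made-up bounds, just to show where the syncs sit; assumes src is padded past the global size by OUTER * MIDDLE elements):

         #define OUTER  4
         #define MIDDLE 8
         #define INNER  16

         __kernel void nlm_like(__global const float *src,
                                __global float *dst,
                                __local float *tile)
         {
             int lid = get_local_id(0);
             float acc = 0.0f;
             for (int i = 0; i < OUTER; ++i) {          // outer loop
                 for (int j = 0; j < MIDDLE; ++j) {     // syncs live in this level
                     tile[lid] = src[get_global_id(0) + i * MIDDLE + j];
                     barrier(CLK_LOCAL_MEM_FENCE);
                     for (int k = 0; k < INNER; ++k)    // inner loop, no syncs
                         acc += tile[(lid + k) % get_local_size(0)];
                     barrier(CLK_LOCAL_MEM_FENCE);
                 }
             }
             dst[get_global_id(0)] = acc;
         }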
     
  17. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    LRBni is special, and I don't think we'll see it in normal x86 CPUs soon. AVX, on the other hand, is "compatible" with SSE; that is, almost all SSE instructions have counterparts in AVX. So it makes the "upgrade" from SSE to AVX relatively easy.

    OpenCL is good at this, yes. Once you have a vec2 version of something (going from scalar to vec is the most difficult part), it's relatively easy to make a vec4, vec8, or vec16 version of the same thing.
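
     For example (a trivial, hypothetical sketch): the same saxpy-style kernel at two widths differs only in the declared vector type, with the host shrinking the global size accordingly:

         __kernel void saxpy_v4(__global const float4 *x, __global float4 *y,
                                const float a)
         {
             int i = get_global_id(0);
             y[i] = a * x[i] + y[i];   // scalar 'a' is widened to all lanes
         }

         __kernel void saxpy_v8(__global const float8 *x, __global float8 *y,
                                const float a)
         {
             int i = get_global_id(0);
             y[i] = a * x[i] + y[i];
         }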

    There are some blade servers using Atom IIRC. However, they are generally used for web servers or those relatively IO-heavy/computation-light servers.
     
  18. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0

    1. I disagree, I think certain benchmarks can be very useful, particularly in GPGPU, in deciding WHERE to begin optimizing. It isn't going to help much to optimize ALU usage if you are fetch bound, etc, etc... not only that, but I think they give a good indication as to where the bound exists in some cases, and this can decide the transformations that need to be made.

    2. I agree that code is becoming, and should be, more arch-specific. Now if only we could convince the non-GPGPU HPC community that not all GPUs are made the same, that would be something!

    3. I agree with Jawed, 256^2 is not a big problem size, it's pretty small; you won't really see a lot of latency hiding until you start hitting 3k^2 or 4k^2 and larger.

    Overall though, I wouldn't put much stock in this type of benchmark. It's written generically at a high level, though it makes for interesting conversation.
     
  19. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    If GPGPU wants to go mainstream, the code is going to have to be generic. If it can't do well with generic code that goes through an architecture-specific optimizer, then you have a bad design.

    Both big and small arrays are important. 256x256 is fine; the problem is that you probably also need a 4kx4k and a 16kx16k, and possibly a 32kx32k/64kx64k.

    Most code is written generically at a high level. People aren't going to optimize for every architecture, especially on GPUs where there are so many. Being able to get good performance from relatively generic code through a combination of software compilation/optimization and hardware will be important.

    Even for the HPC guys, given the problem space they are involved in, they spend the majority of their optimization/coding time working on much larger issues than the instruction stream for a particular GPU.
     
  20. Florin

    Florin Merrily dodgy
    Veteran Subscriber

    Joined:
    Aug 27, 2003
    Messages:
    1,650
    Likes Received:
    220
    Location:
    The colonies
    Would you consider this a disadvantage for any particular current GPU architecture, or are they all equally affected?
     