OpenCL demo program of a galaxy with 40,000 stars

Discussion in 'GPGPU Technology & Programming' started by g__day, Dec 16, 2009.

  1. g__day

    Regular

    Joined:
    Jun 22, 2002
    Messages:
    580
    Likes Received:
    2
    Location:
    Sydney Australia
  2. DudeMiester

    Regular

    Joined:
    Aug 10, 2004
    Messages:
    636
    Likes Received:
    10
    Location:
    San Francisco, CA
    Nice!

    I just finished a 500,000 element particle system with OpenCL and OpenGL interop. With some HDR bloom it looks rather like a nebula, actually.

    Now that this is working, I'm going to finish my photon mapper.
     
  3. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,112
    Likes Received:
    2,579
    Location:
    Germany
    Nice!
    But since it's OpenCL, why don't you make it available for GeForces too? It's missing a file named cal_source.il, which seems to be part of ATI Stream - that's not what open standards are for *SCNR*
     
  4. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,515
    Likes Received:
    441
    Location:
    Varna, Bulgaria
    That's a rather old N-body sim, featured on the Stream SDK developer page. The available distribution still does not include the new R800-specific optimizations.
    AFAIK, this one is written for CAL, not OCL?!
     
  5. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,112
    Likes Received:
    2,579
    Location:
    Germany
    Seems likely, but 1.8 TFLOPS in a "real application" (tm Intel) is nothing to sneeze at, especially if there are some potential optimizations missing.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,009
    Likes Received:
    1,001
    Location:
    London
    Optimised version

    http://galaxy.u-aizu.ac.jp/trac/note/wiki/Tests_With_RV870

    I wonder if this version is using shared memory? I suspect not, for what it's worth, as the focus is purely on VLIW issue. Probably just some unrolling.

    Also, counting "38 FLOPS" per force calculation is fine when making a comparison with CPU code - but other GPGPU N-body implementors usually assign far less - this is because GPUs have intrinsic SP sqrt and rcp instructions.
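
    To make the counting question concrete, here is a minimal Python sketch (not taken from the demo) that tallies the operations in one softened pair interaction when the reciprocal square root is a single instruction, then shows how the same run reports a different TFLOPS figure depending on the per-interaction count. The operation breakdown, the function names, and the 30 steps per second are assumptions for illustration only.

```python
# Operation tally for one softened gravitational pair interaction.
# ASSUMPTION: this breakdown is illustrative, not the demo's actual kernel.
OPS_NATIVE_RSQRT = {
    "separation dx, dy, dz (3 sub)": 3,
    "r^2 + eps^2 (3 mul, 3 add)": 6,
    "1/r via native rsqrt (1 instruction)": 1,
    "1/r^3 from 1/r (2 mul)": 2,
    "mass * 1/r^3 (1 mul)": 1,
    "accumulate acceleration (3 mul, 3 add)": 6,
}


def flops_per_interaction(ops):
    """Sum the per-step operation counts."""
    return sum(ops.values())


def reported_tflops(n_bodies, steps_per_second, flops_per_pair):
    """TFLOPS reported for a direct O(N^2) sum under a given counting rule."""
    interactions = n_bodies * n_bodies  # every body against every body
    return interactions * flops_per_pair * steps_per_second / 1e12


native = flops_per_interaction(OPS_NATIVE_RSQRT)

# ASSUMPTION: 30 steps per second is a made-up rate, not a measurement.
for label, count in (("38-flop convention", 38), ("native rsqrt tally", native)):
    print(label, count, reported_tflops(40_000, 30, count))
```

    Under these assumptions the same run reports exactly half the rate with the smaller tally, so any TFLOPS number for an N-body code only means something alongside its flop-counting rule.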

    AMD's SDK 2.0 beta 4 contains a basic graphical N-body, written in OpenCL.

    Jawed
     
  7. sebx

    Newcomer

    Joined:
    Dec 28, 2009
    Messages:
    1
    Likes Received:
    0
    Hello, I'm new here. I've been looking around for a while and tried this OpenCL demo, and I'm surprised how well it works on my Radeon HD5770: about 1 teraflops was measured with the 40,000-star galaxy. But I wonder how it can work, since I don't have ATI Stream installed, only the 9.12 hotfix driver on Vista SP2 x64?
     
  8. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,515
    Likes Received:
    441
    Location:
    Varna, Bulgaria
    From the description of the kernel sample:
     
  9. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,824
    Likes Received:
    500
    Location:
    Torquay, UK
    I've squeezed 2.2TF out of my 1GHz clocked HD5870 :smile:

    That's :shock: crunching power!
     
  10. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Likes Received:
    0
    That's perfectly normal, you only need the drivers to run applications using `ATI Stream'.

    In fact, changing drivers will introduce subtle compilation differences that can result in catastrophic performance regressions (like getting 10x slower) without any change to the original application. It's lovely.

    On a more on-topic note, as Jawed noticed, the specific computation that application does could achieve up to 1.6 `TFlop' on RV770 and 3.7 `TFlop' on RV870; as is customary for that field, it seems they have a very special definition of `Flop'.

    That said, with realistic Flop counts, keep in mind that computing gravitational force interactions can't consist only of multiply-add operations (about half of them are in practice), so getting more than 2/3 of the multiply-add performance is simply impossible.
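
    One way to arrive at the 2/3 figure is sketched below. It assumes that "half" means half of the counted flops come from multiply-adds, and that peak is quoted as one multiply-add (2 flops) per issue slot. Both the reading and the function name are assumptions, not something stated in the thread.

```python
from fractions import Fraction


def peak_fraction(mad_flop_share):
    """Fraction of multiply-add peak reachable for a given flop mix.

    mad_flop_share: share of all counted flops that sit inside multiply-adds.
    ASSUMPTION: a multiply-add delivers 2 flops per issue slot, any other
    operation delivers 1 flop per slot, and peak assumes all multiply-adds.
    """
    share = Fraction(mad_flop_share)
    # Slots needed per flop: half a slot for a MAD flop, a full slot otherwise.
    slots_per_flop = share / 2 + (1 - share)
    # Peak needs half a slot per flop, so the ratio is (1/2) / slots_per_flop.
    return Fraction(1, 2) / slots_per_flop


print(peak_fraction(Fraction(1, 2)))  # half of the flops inside multiply-adds
print(peak_fraction(1))               # pure multiply-add code
```

    With half of the flops in multiply-adds the ratio comes out to 2/3, and it only reaches 1 when every operation is a multiply-add.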

    One last detail: it should be relatively easy to exploit the symmetry of the problem and thus execute this workload twice as fast as presented in this work. Recent ATI hardware provides more than enough register space (shared memory isn't even needed) to do so (and thus RV870 would achieve 7 `TFlops').
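
    The symmetry in question is that the force of body j on body i is equal and opposite to the force of i on j, so each pair only needs to be evaluated once. The following is a minimal CPU-side Python sketch of that arithmetic, assuming a softened potential with G = 1. The function names and the softening value are hypothetical, and it says nothing about how a GPU kernel would organise the accumulation in registers.

```python
def accel_direct(pos, mass, eps2=1e-3):
    """Evaluate every ordered pair (i, j): N*(N-1) pair evaluations."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    evals = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            inv_r3 = (d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps2) ** -1.5
            evals += 1
            for k in range(3):
                acc[i][k] += mass[j] * inv_r3 * d[k]
    return acc, evals


def accel_symmetric(pos, mass, eps2=1e-3):
    """Evaluate each unordered pair once and apply it to both bodies."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    evals = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            inv_r3 = (d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps2) ** -1.5
            evals += 1
            for k in range(3):
                acc[i][k] += mass[j] * inv_r3 * d[k]  # pull of j on i
                acc[j][k] -= mass[i] * inv_r3 * d[k]  # equal and opposite
    return acc, evals


# Four bodies with arbitrary positions and masses (illustrative values).
pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 3.0]]
mass = [1.0, 2.0, 0.5, 1.5]
a_full, n_full = accel_direct(pos, mass)
a_half, n_half = accel_symmetric(pos, mass)
print(n_full, n_half)
```

    Both routines produce the same accelerations up to rounding, while the symmetric one performs half as many distance and inverse-cube evaluations.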

    PS: Of course, direct O(N^2) methods in single precision have dubious uses, as fast multipole methods can achieve higher precision and are orders of magnitude more efficient.
     