GPU Ray-tracing for OpenCL

Discussion in 'Rendering Technology and APIs' started by fellix, Dec 27, 2009.

  1. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,781
    Likes Received:
    445
    Location:
    Torquay, UK
    Today I finally decided to uninstall my Stream SDK beta and put the final version on W7 64-bit.
    As I reported previously, with the SDK beta I saw no performance difference between batch sizes.
    Now, on the final release, it behaves as it should!
    This suggests that the beta SDK was assuming a constant work size no matter what the program asked for.

    Here is a pic from my last run:
    [​IMG]
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    Two things you might like to try:

    Change the "__constant" declarations to "__global const". __constant has a special meaning for the way GPUs work (this is a feature of D3D10), and there are effective size limits. Additionally, the way fetches are performed from such a resource means that any time two work-items running in the same hardware thread want to fetch two different elements, you'll lose performance ("constant waterfalling").
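
    A rough way to picture the waterfalling cost described above: if every work-item in a hardware thread reads the same constant element, one fetch serves them all; if they read different elements, the fetches are serialized. The sketch below is a simplified cost model, not vendor documentation. The function name `constant_fetch_passes` and the hardware thread width of 64 work-items are assumptions made for illustration.

```python
def constant_fetch_passes(indices, wavefront_size=64):
    """Estimate serialized constant fetches per hardware thread.

    Simplified model (an assumption, not a hardware spec): each distinct
    index read within one hardware thread costs one pass.
    """
    if wavefront_size <= 0:
        raise ValueError("wavefront_size must be positive")
    passes = []
    for start in range(0, len(indices), wavefront_size):
        wavefront = indices[start:start + wavefront_size]
        passes.append(len(set(wavefront)))
    return passes


# Every work-item reads element 0: one pass per hardware thread.
uniform = constant_fetch_passes([0] * 128)
# Each work-item reads its own element: one pass per work-item.
divergent = constant_fetch_passes(list(range(128)))
```

    Under this model the divergent case costs 64 times more fetch passes than the uniform one, which is the kind of pattern "__global const" is meant to sidestep.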

    Try the "-cl-mad-enable" or "-cl-fast-relaxed-math" compiler options. Or, even, "-cl-unsafe-math-optimizations".
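
    To compare these flags one at a time, it can help to generate one options string per flag and rebuild the kernel with each. The helper below is a minimal sketch of that idea: the three flags are the ones named above, while `build_option_variants` and the `-D DEMO=1` macro are hypothetical names used only for illustration. Each resulting string is what would go into the `options` argument of `clBuildProgram`.

```python
# Math optimization flags named in the post above.
MATH_FLAGS = (
    "-cl-mad-enable",
    "-cl-unsafe-math-optimizations",
    "-cl-fast-relaxed-math",
)


def build_option_variants(base_options=""):
    """Return the unmodified baseline plus one options string per flag."""
    base = base_options.strip()
    variants = [base]
    for flag in MATH_FLAGS:
        # strip() drops the leading space when there are no base options.
        variants.append(f"{base} {flag}".strip())
    return variants


# "-D DEMO=1" is a hypothetical define standing in for existing options.
variants = build_option_variants("-D DEMO=1")
```

    Benchmarking the baseline alongside each variant also shows whether a given flag changes the rendered image, which speaks to the reliability concern raised in the next post.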

    Jawed
     
  3. chavvdarrr

    Veteran

    Joined:
    Feb 25, 2003
    Messages:
    1,165
    Likes Received:
    34
    Location:
    Sofia, BG
    Does it make sense?
    After all, if one wants reliable results, such flags are lose-lose.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    No harm in finding out the limits... Also, with OpenCL stuff being new/immature, things like this might escape from (or crash into...) problems in the compiler(s).

    Jawed
     
  5. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Thanks Jawed, I will try.

    Probably not many Mac users here, but we (at http://www.luxrender.net/forum/viewtopic.php?f=21&t=2947&start=240#p29397) have finally discovered why SmallptGPU wasn't working with Apple's OpenCL GPU devices. The problem was a bug in Apple's OpenCL compiler related to exactly these "__constant" memory buffers: http://www.khronos.org/message_boards/viewtopic.php?f=37&t=2148

    I uploaded a new version (1.6) to the usual place; it includes the fix for MacOS and a new optional rendering kernel for old-school ray tracing ('80s style :wink:):

    [​IMG]

    It is fun to use, mainly because it is very fast indeed.


    An update on the support for multiple OpenCL devices:

    [​IMG]


    This is SmallptGPU2 running at the same time on my OpenCL GPU device (ATI HD4870) and OpenCL CPU device (Q6600). You can read the workload distribution on the "Help" screen (i.e. 90.5% done by the GPU, 9.5% done by the CPU). It is also optionally visible on the screen as a green bar on the left (the zone rendered by the GPU) and a red bar (the one done by the CPU). This is going to be very interesting with multi-GPU setups.
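
    One simple way to arrive at a split like this is to divide the image rows between devices in proportion to their measured throughput. The sketch below illustrates that idea only; it is not taken from the SmallptGPU2 source, and `split_rows` and the 480-row image height are assumptions.

```python
def split_rows(height, throughputs):
    """Split `height` scanlines across devices in proportion to throughput.

    Uses largest-remainder rounding so the row counts always sum to `height`.
    """
    total = sum(throughputs)
    if height < 0 or total <= 0 or any(t < 0 for t in throughputs):
        raise ValueError("need non-negative inputs and a positive total throughput")
    exact = [height * t / total for t in throughputs]
    rows = [int(x) for x in exact]
    # Hand the leftover rows to the devices with the largest fractional parts.
    leftovers = height - sum(rows)
    by_fraction = sorted(
        range(len(exact)), key=lambda i: exact[i] - rows[i], reverse=True
    )
    for i in by_fraction[:leftovers]:
        rows[i] += 1
    return rows


# The 90.5% / 9.5% split from the post, applied to a hypothetical 480-row image.
gpu_rows, cpu_rows = split_rows(480, [90.5, 9.5])
```

    Re-measuring throughput every few frames and calling the function again would let the split follow changes in load, for example when the CPU is also busy driving the GPU.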
     
  6. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    15,375
    Likes Received:
    4,283
    Nice, and it looks like it automatically uses all 4 cores. I'm assuming you didn't do anything special to tell it to use more than 1 core? So far I'm really liking what I'm seeing out of all of these early OpenCL projects. :)

    Regards,
    SB
     
  7. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,601
    Likes Received:
    133
    Location:
    The colonies
    For whatever reason, the OSX OpenCL implementation for Nvidia GPUs does seem to be a fair bit faster than Nvidia's own OpenCL for Windows. As I posted earlier, a stock GTX280 scores around 2250K at most workgroup sizes in Windows 7. Which hasn't changed significantly with your new 1.6alpha.

    Interestingly though, my Mac Mini's humble 9400M already does about 880K using the same 1.6alpha on Snow Leopard.

    Which means:
    OSX G9400M = 16 stream processors at 1100 MHz - OpenCL reports 2 compute units = 880K
    Win7 GTX280 = 240 stream processors at 1296 MHz - OpenCL reports 30 compute units = 2250K

    Looks to me like there is plenty of room for improvement in Nvidia's OpenCL library.

    [​IMG]

    I did notice that workgroup sizes have a larger impact on my mini. There isn't much of a difference between 64 and 192 on the GTX280, but the larger number doubles the score on the 9400M (and 192 is the maximum size that works, whereas the G200 goes up to 512).

    Looking forward to your releases with multiple device support Dade. Good stuff :)
     
  8. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    V1.6

    CPU = 26%, 1/2 of my 295 = 99%:
    [​IMG]
     
  9. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,781
    Likes Received:
    445
    Location:
    Torquay, UK
    Direct Lighting looks great! (I know! I like old Imagine 3D looks of RT)
    Thanks Dade:!:

    Here is my pic:

    [​IMG]


    Also I can't wait for version 2.0!
     
  10. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,454
    Likes Received:
    343
    Wow ... I'd never have thought that it could run that fast with clFinish after each pass.
     
  11. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,601
    Likes Received:
    133
    Location:
    The colonies
    The changes in the rendering kernel between 1.6alpha and 1.6 hurt the GTX280 on Windows:
    alpha 64SIZE: 2250K
    1.6 64SIZE: 1459K

    Other sizes don't change much.

    The direct lighting one runs rather well on Windows:
    1.6 64SIZE_DL: 39965K

    Curiously, though, Nvidia's OpenCL library for Linux is about twice as fast on the same video card for the normal kernel, with hardly any difference between alpha and 1.6:
    alpha 64SIZE: 4391K
    1.6 64SIZE: 4398K

    But then again the direct lighting one runs at only half the speed of Windows:
    1.6 64SIZE_DL: 21235K

    Confused, yes I am.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    Wow, that's good. That should mean a GTX285 is about the same as HD5870, I guess.

    This also puts a very different perspective on the subject of register allocation. I have to admit I was surprised to see from the earlier results in the thread that workgroup size makes very little difference on NVidia.

    This, of course, might merely be reflecting the overall uselessness of the NVidia implementation currently - whereas the Apple implementation is showing a meaningful variation with workgroup size.

    Yeah, this is pretty interesting - Apple seems to be compiling direct to PTX but NVidia's doing "something else" and it's working pretty poorly.

    Jawed
     
  13. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
  14. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,601
    Likes Received:
    133
    Location:
    The colonies
    Yeah if you extrapolate the 9400M score a G200 should theoretically be able to do around 13000K+ if properly tuned.
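
    The arithmetic behind that extrapolation can be written out as below. This is a sketch that assumes the score scales linearly with stream-processor count and, optionally, with shader clock, which is optimistic; `extrapolate_score` is a hypothetical helper, and the inputs are the figures quoted earlier in the thread.

```python
def extrapolate_score(score_k, sps_from, sps_to, mhz_from=None, mhz_to=None):
    """Scale a score linearly by stream-processor count and, optionally,
    by shader clock. Linear scaling is an optimistic assumption."""
    scaled = score_k * sps_to / sps_from
    if mhz_from is not None and mhz_to is not None:
        scaled = scaled * mhz_to / mhz_from
    return scaled


# 9400M: 880K with 16 SPs at 1100 MHz; GTX280: 240 SPs at 1296 MHz.
by_sp_count = extrapolate_score(880, 16, 240)
by_sp_and_clock = extrapolate_score(880, 16, 240, 1100, 1296)
```

    Scaling by stream-processor count alone gives 13200K, which matches the "around 13000K+" figure; also scaling by clock would put the estimate near 15552K.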

    Around their developer site they're pretty consistent in calling OpenCL just another part of their CUDA initiative. Maybe the current implementation has some quick and dirty translation going on.

    Anyway, Linux does a little better already on G200 (up from 2250K):
    [​IMG]
     
    #114 Florin, Jan 9, 2010
    Last edited by a moderator: Jan 9, 2010
  15. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    I'm quite confused too. But look, the results you are obtaining under Linux are quite good. They are about where they should be when compared, for instance, with my 4870. This should be proof that it is mostly a driver issue: it looks like buffer transfer between CPU and GPU has some weird behaviour from the performance point of view.

    After all the tests we have done, I'm quite convinced that the NVIDIA OpenCL driver needs some more tuning before it shows consistent performance.

    @Silent_Buddha: yup, the AMD/ATI OpenCL CPU device spawns as many threads as there are cores available. At the moment I have the opposite problem: I would like some direct control over the number of threads spawned, in order not to overload the CPU (and slow down the threads dedicated to driving the GPUs).

    @Lightman: can I ask what tool you use for measuring GPU load? I guess it runs under Windows, but it looks quite useful ... your 19,000,00+ sample screenshot has sold me a brand new 5870, I placed my order yesterday :wink:

    @Talonman: you could try version 2.0 with your 3x GPU setup. I uploaded a preliminary version at http://davibu.interfree.it/opencl/smallptgpu2/smallptgpu-v2.0alpha1.tgz

    This version should run on all GPUs (and CPUs on Apple/ATI) available. It is highly untested and I'm experiencing a problem under Windows: my PC resets after a few seconds, as if my power supply couldn't sustain the load :shock: Everything works fine under Linux ... quite strange. Well, it may crash but you can give it a try.
     
  16. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,442
    Likes Received:
    325
    Location:
    Varna, Bulgaria
    Just to note something weird with v1.6:

    [​IMG]

    That strange pattern is visible for a fraction of a second, while the program loads up. I never saw this before with older versions... or maybe it's caused by the new Catalyst 10.1 beta driver?!
    Anyway, it doesn't appear to be a bug -- everything runs flawlessly as before. ;)
     
  17. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    No, no, Fellix, it is "intended": it is a kind of "debug" feature I added while we were tracking down the Apple problem with Jens. The framebuffer is initialized with that kind of pattern in order to detect what the GPU returns (i.e. black overwriting black is hard to detect, while it is easy to see over that pattern). I didn't remove the pattern because it could help to track down some future problem too.
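
    The idea can be sketched as follows: pre-fill the framebuffer with a recognizable pattern, run the render pass, then count the pixels that still hold the pattern. This is a minimal illustration and not the SmallptGPU code; the checkerboard, the sentinel colours, and both function names are assumptions.

```python
def make_debug_framebuffer(width, height, tile=8):
    """Fill a framebuffer with a checkerboard of two sentinel colours."""
    light, dark = (1.0, 0.0, 1.0), (0.0, 1.0, 0.0)
    return [
        light if ((x // tile) + (y // tile)) % 2 == 0 else dark
        for y in range(height)
        for x in range(width)
    ]


def count_unwritten(framebuffer, reference):
    """Count pixels that still hold the debug pattern after a render pass."""
    return sum(1 for got, expected in zip(framebuffer, reference) if got == expected)


reference = make_debug_framebuffer(16, 16)
framebuffer = list(reference)
# Simulate a device that only wrote the first row, and wrote black.
for x in range(16):
    framebuffer[x] = (0.0, 0.0, 0.0)
unwritten = count_unwritten(framebuffer, reference)
```

    With a framebuffer cleared to black, the same simulated pass would be indistinguishable from a device that wrote nothing at all, which is the detection problem the pattern solves.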
     
  18. Florin

    Florin Merrily dodgy
    Veteran

    Joined:
    Aug 27, 2003
    Messages:
    1,601
    Likes Received:
    133
    Location:
    The colonies
    Yeah I get the same pattern with Nvidia. And on the slow Mac, it's not just a fraction of a second either ;)
     
  19. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,442
    Likes Received:
    325
    Location:
    Varna, Bulgaria
    Attaboy! :lol:
    Click - it does run OK for me, although the GPU load time is a tad lower (98% vs. 93%) than v1.6 in single device mode.
     
  20. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,781
    Likes Received:
    445
    Location:
    Torquay, UK
    Dade, the software is MSI Afterburner. It is based on RivaTuner but has a very easy interface.

    I will give version 2.0 a go in the morning. :smile:
     
