GPU Ray-tracing for OpenCL

Discussion in 'Rendering Technology and APIs' started by fellix, Dec 27, 2009.

  1. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Talonman, you may be interested in trying http://davibu.interfree.it/opencl/smallluxgpu/smallluxgpu-v1.1beta1.tgz (note: the pre-compiled Windows executable there is slightly outdated and the camera movements could be a bit slow).

    SmallLuxGPU is showing some nice progress: it can now run on multiple OpenCL GPUs/CPUs and on native threads (i.e. you can use the CPU even if you don't have an OpenCL CPU device, as is the case with NVIDIA). This is SmallLuxGPU running on my brand new PC with 8 native threads (i.e. with hyper-threading) and 1 GPU:


    [​IMG]

    It looks like it works well with NVIDIA hardware (thanks to a much simpler kernel compared with SmallptGPU). Check this screenshot of Jens's Mac Pro with an 8800GT:

    [​IMG]

    The screenshot posted by Eros on the Luxrender forums is particularly interesting:


    [​IMG]

    Eros's screenshot stands out because it shows quite good results achieved with a GTX 295. Overall, I really like the capability to run on just about any OS and on any number of available CPUs and GPUs :wink:
     
  2. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    19,426
    Likes Received:
    10,320
    Yah, from all this preliminary work you guys are doing in OpenCL, I'm quite impressed with what it's able to do, bugs and all. I'm actually starting to get somewhat excited to see what OpenCL will bring in the coming years.

    Regards,
    SB
     
  3. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    I gave it a run Dave... ;)

    Note that Precision now displays each GPU's utilization. I have already verified that it reports the same % as GPU-Z.

    When I start the program, it begins at about 26,000K Samples/sec, then drops to about 18,500K Samples/sec. If I let it run, it slowly climbs back up.

    This is my first run, giving it about 8 minutes run time... 24,213K Samples/sec.
    [​IMG]
    Note that my GPU utilization is 8%, 10%, and 8%. CPU is running at 87% utilization.


    I started it up a second time, to see if my water cooled CPU was getting hot... Looks fine!
    I was getting 25,479K Samples/sec.
    Note that my GPU utilization is 9%, 11%, and 9%.
    [​IMG]


    This is the BigMonkey run...
    Note that my GPU utilization is 22%, 25%, and 17%.
    [​IMG]


    Running the Loft file: Note that my GPU utilization is 69%, 77%, and 78%.
    [​IMG]
     
    #163 Talonman, Jan 14, 2010
    Last edited by a moderator: Jan 14, 2010
  4. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Last one... LuxBall: :)
    Note that my GPU utilization is 42%, 62%, and 49%.
    [​IMG]
     
  5. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    19,426
    Likes Received:
    10,320
    Too many BIG images, it's actually making my browser hitch. :p

    Regards,
    SB
     
  6. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Mine automatically get re-sized to a smaller image when it's uploaded... (I thought?) :)

    Do you also have to click on the image to see it full size?
     
  7. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,969
    Likes Received:
    963
    Location:
    Torquay, UK
    Well, I really like the new and improved SmallPT 2.0 Beta 2 :smile:
    Thanks Dade:!:

    Here is what I promised to Talonman:
    [​IMG]

    and to show it's not a suicide shot ...

    [​IMG]

    :razz:

    I think I've also found a small bug. As you can see in my second screenshot, the average score is really low, but you can tell from the GPU and CPU load graphs that everything was running full bore for over 6000 passes. The counter just reset itself around pass 5900-6000 and started climbing again from 0. :mad:

    From other observations, sometimes GPU and CPU usage fluctuates a lot, especially at the beginning (first 1000 passes) but then stabilizes.
    My GPU is golden! Not so much my memory ... (it's not stable at stock, luckily SmallPT seems to not care about that) Mr. Baumann I will need HD5890 quite soon so I can send my card for RMA ;]
    The app reacts quite nicely to GPU mem overclock.
     
  8. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Wow... over 20,000K Samples/sec!!

    Sweet...

    Thanks for the post Lightman.
     
  9. CNCAddict

    Regular

    Joined:
    Aug 14, 2005
    Messages:
    290
    Likes Received:
    2
    Too bad he's covering up his wallpaper :wink:
     
  10. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Hehe, new record :!:

    @Lightman: the sample counter reset is probably an overflow problem ... it is easy to run out of bits on statistics counters at 20M Samples/sec :wink:
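
    A quick back-of-the-envelope check of the overflow guess, assuming (the thread does not confirm this) that the statistics are kept in 32-bit integer counters incremented once per sample. The 20M samples/sec rate comes from the posts above; the counter widths and the function name are hypothetical.

```python
def seconds_until_overflow(samples_per_sec: float, bits: int, signed: bool) -> float:
    """Time until a counter of the given width wraps, at one increment per sample."""
    max_value = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return max_value / samples_per_sec


RATE = 20_000_000  # roughly 20M samples/sec, as reported in the thread

# Assumed counter widths; the real counter type is not stated in the thread.
for bits, signed in [(32, True), (32, False), (64, False)]:
    kind = "signed" if signed else "unsigned"
    t = seconds_until_overflow(RATE, bits, signed)
    print(f"{bits}-bit {kind}: wraps after about {t:,.0f} s")
```

    Under that assumption a 32-bit counter wraps within a few minutes at this rate, while a 64-bit counter would not wrap in any practical run.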

    Just to give you an idea, Kevin Beason spent 124 minutes rendering this image at 5000 samples per pixel on a Q6600 (with the original SmallPT): http://www.kevinbeason.com/smallpt/
    Rendering the same image takes 3 minutes on your PC. Your HD5870 is roughly 40 times faster than a Q6600 (124 / 3 ≈ 41) :shock:


    @Talonman: you could try spawning native threads in order to use the CPU for rendering too. You only have to edit one of the ".bat" files and change the parameters to something like:

    SmallLuxGPU.exe 4 0 1 64 640 480 scenes\luxball.scn

    (the first parameter is the number of native threads to spawn).

    However, it looks like your CPU already has some trouble keeping your 3 GPUs busy, so this could just end up slowing down the rendering (maybe a value of 3 or 2 native threads would be better).

    Overall, I'm happy to see that NVIDIA cards work well with SmallLuxGPU and that they don't suffer from the problems shown with SmallptGPU.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    On your PC, David, does the HD5870 ever run at 100% utilisation?

    Jawed
     
  12. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Jawed, I definitely have to check. I have no tools on Linux to evaluate GPU performance/utilization; I need to find a good tool for Windows and check.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    In version 1.0 of SmallLuxGPU there are timings for the time taken by the GPU versus the time taken by the CPU. They show the GPU as active for only a small portion of the time. Does version 1.1 use multiple CPU threads to feed work to the GPU and process the results, in order to maximise GPU utilisation?

    It seems GPU utilisation will increase with triangle count and/or the volume of the scene - so it may not be a worry that utilisation is low with these scenes. But there could be a problem with multiple GPUs in rendering rigs if the CPU can't keep them fed.

    Have you tried installing your old graphics card into your new PC alongside the HD5870?

    Jawed
     
  14. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    For Windows you can use CCC to check GPU utilization, in the Overdrive section, if you don't want to use other tools like GPU-Z.

    In my case, the utilization never exceeds 97%, in contrast to some D3D/OGL utilities.
     
  15. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    You hit the nail on the head Dave...

    I edited the LuxBall file, since we already have my post above for comparison.
    I tried to spawn 4 CPU threads: CPU went to 100%, and the app stopped responding.
    I then went to 3 threads, and CPU went to 100%, and the app stopped responding.
    I then went for 2 CPU threads, and the app ran:
    [​IMG]
    CPU@ 97% utilization.
    GPU's @ 28%, 30%, and 30%.

    Looks like not enough CPU, even with a quad core@ 3.73GHz. :shock:

    My thinking is that until Nvidia lowers the CPU utilization needed just to keep the GPUs happy, we will continue to suffer performance issues.
    My performance in this test was cut in half by spawning 2 native CPU rendering threads.
    Further, the utilization of my 3 GPUs is lower and sporadic.
    Granted we may not be using the optimum work group size for me, but the point still holds.
     
    #175 Talonman, Jan 15, 2010
    Last edited by a moderator: Jan 15, 2010
  16. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Even better, Jawed: there are 2 threads for each GPU. One is dedicated to generating rays and collecting ray intersection results. The second just executes the OpenCL kernel. While the first thread works on ray buffer A, the second is intersecting the rays in ray buffer B (and vice versa at the next step).

    It works like a 2 stage pipeline, so the execution time is equal to the max(time stage CPU, time stage GPU).
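
    A minimal sketch of this two-buffer, two-thread scheme, using plain Python threads and queues in place of the real OpenCL host code. `generate_rays`, `intersect`, and the dictionary-based buffers are hypothetical stand-ins, not SmallLuxGPU's actual functions or data structures.

```python
import queue
import threading


def generate_rays(step):
    # Stand-in for the CPU stage: a batch of "rays" (plain numbers here).
    return [step * 10 + i for i in range(4)]


def intersect(rays):
    # Stand-in for the OpenCL kernel: pretend each ray's hit is ray * 2.
    return [r * 2 for r in rays]


def run_pipeline(steps):
    to_gpu = queue.Queue()  # buffers filled with rays, waiting for intersection
    to_cpu = queue.Queue()  # buffers that are empty or carry results
    collected = []

    # Two buffers, A and B, both initially empty and owned by the CPU stage.
    for name in ("A", "B"):
        to_cpu.put({"name": name, "rays": None, "hits": None})

    def cpu_stage():
        issued = done = 0
        while done < steps:
            buf = to_cpu.get()
            if buf["hits"] is not None:      # collect results of an earlier step
                collected.append((buf["name"], buf["hits"]))
                buf["hits"] = None
                done += 1
            if issued < steps:               # refill the buffer with new rays
                buf["rays"] = generate_rays(issued)
                to_gpu.put(buf)
                issued += 1
        to_gpu.put(None)                     # tell the GPU stage to stop

    def gpu_stage():
        while True:
            buf = to_gpu.get()
            if buf is None:
                break
            buf["hits"] = intersect(buf["rays"])
            to_cpu.put(buf)

    threads = [threading.Thread(target=cpu_stage), threading.Thread(target=gpu_stage)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return collected


if __name__ == "__main__":
    for name, hits in run_pipeline(3):
        print(name, hits)
```

    While the GPU-stage thread holds one buffer, the CPU-stage thread is free to drain and refill the other, which is what lets the two stages overlap.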

    Note: the first stage (i.e. the CPU stage) does a constant amount of work (its complexity doesn't scale up with the number of triangles in the scene), while the second stage (i.e. the GPU stage) scales up with the triangle count.

    This means that, no matter how slow your CPU is (or how fast your GPU is), there is a scene complex enough to keep your GPU busy.

    This pattern is clearly shown in Talonman's tests: in simpler scenes the CPU is the bottleneck, while GPU usage rises with more complex scenes.
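
    The bottleneck argument can be put into a toy model, assuming a fixed CPU cost per step and a GPU cost that grows with triangle count (linear growth is used here purely for simplicity). All numbers below are made up for illustration; they are not measurements from SmallLuxGPU.

```python
def step_time(cpu_ms: float, gpu_ms: float) -> float:
    """Two-stage pipeline: each step takes as long as the slower stage."""
    return max(cpu_ms, gpu_ms)


def gpu_utilization(cpu_ms: float, gpu_ms: float) -> float:
    """Fraction of each step during which the GPU stage is busy."""
    return gpu_ms / step_time(cpu_ms, gpu_ms)


CPU_MS = 10.0            # hypothetical constant CPU cost per step
GPU_MS_PER_KTRI = 0.05   # hypothetical GPU cost per thousand triangles

for ktris in (10, 100, 262, 1000):
    gpu_ms = ktris * GPU_MS_PER_KTRI
    util = gpu_utilization(CPU_MS, gpu_ms)
    print(f"{ktris:>5}k triangles: step {step_time(CPU_MS, gpu_ms):5.1f} ms, "
          f"GPU busy {util:4.0%}")
```

    In this model the GPU sits mostly idle on small scenes and only reaches full utilization once its per-step cost exceeds the CPU's constant cost.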

    This should work quite well in practice because current Luxrender scenes often use more than 1 million triangles (the Luxball scene is 262k triangles). For instance, I rendered this (1 million triangle) scene http://www.luxrender.net/gallery/main.php?g2_view=core.DownloadItem&g2_itemId=9019 for Pinko in about a weekend on a network of 6 quad-cores ... I find the idea of rendering this image in one night at home quite sexy :wink:

    Yup, with this approach you still need a bit of balance between the CPUs and GPUs in your system. You can always find a scene complex enough to keep your GPUs busy ... but you may have no need to render such a complex scene :idea:

    About the old card: I tried yesterday, but it wasn't recognized by the motherboard, so I removed the 4870. I plan to do some tests over the weekend (ahah, there is never enough time to do everything :roll:).
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    The 2 threaded pipeline with the alternating buffers is a nice idea. I guess there's no limit on the number of buffers you could set up, each assigned to a thread, with a final thread that handles OpenCL/kernel execution.

    It'll be interesting to see the kind of speed-up you get on that room interior. That's quite an impressive render :shock:

    Jawed
     
  18. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Are you converting that room interior scene to be rendered on our GPU(s)?

    I would love to see that... :razz:
     
  19. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    It kind of sucks that you need CPU intervention every time a ray hits something, particularly since you need over 10k samples per pixel to get a clean image.

    Have you thought about generating rays that weren't truly random? I would imagine that the incoherence of random rays traversing a BVH would be a nightmare for bandwidth efficiency, though if you write your code correctly then at least branch coherency shouldn't be a problem beyond the fact that different rays take different amounts of time to finish.

    What I'm thinking is that maybe you can figure out a way to have all rays go in approximately the same direction (i.e. small random perturbations about the same base ray, then warped as required by the BRDF if necessary), and do it in a way that after 1000 rays per pixel you have roughly uniform distribution.
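
    One possible reading of this idea, sketched under assumptions: split the unit square of random numbers into k x k strata, let every ray in a given pass draw its sample from the same stratum (so directions stay close together), and cycle through the strata across passes so that k*k passes cover the whole hemisphere. A cosine-weighted mapping stands in for "warped as required by the BRDF". The function names and the stratum scheme are hypothetical and are not taken from SmallLuxGPU.

```python
import math
import random


def cosine_hemisphere(u1, u2):
    """Map two numbers in [0, 1) to a cosine-weighted direction around +Z."""
    r = math.sqrt(u1)
    phi = 2.0 * math.pi * u2
    return (r * math.cos(phi), r * math.sin(phi), math.sqrt(max(0.0, 1.0 - u1)))


def pass_directions(pass_index, num_rays, k, rng):
    """Directions for one pass: every ray jitters inside the same stratum."""
    cell = pass_index % (k * k)      # cycle through all k*k strata over the passes
    cx, cy = cell % k, cell // k
    dirs = []
    for _ in range(num_rays):
        u1 = (cx + rng.random()) / k  # small random perturbation within the cell
        u2 = (cy + rng.random()) / k
        dirs.append(cosine_hemisphere(u1, u2))
    return dirs


if __name__ == "__main__":
    rng = random.Random(1)
    for p in range(3):
        d = pass_directions(p, 2, 8, rng)
        print(p, [tuple(round(c, 3) for c in v) for v in d])
```

    Within a pass the rays point in roughly the same direction, which is the coherence being asked for; whether the image converges as cleanly as with fully random rays would still need to be checked.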
     
  20. CNCAddict

    Regular

    Joined:
    Aug 14, 2005
    Messages:
    290
    Likes Received:
    2