GPU Ray-tracing for OpenCL

Discussion in 'Rendering Technology and APIs' started by fellix, Dec 27, 2009.

  1. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    OpenCL supports, or rather exposes, the presence of multiple OpenCL platforms on the system (i.e. ATI, NVIDIA, etc.). However, as with multiple devices, it requires an application capable of using them. At the moment, I'm not even sure ATI and NVIDIA drivers can coexist on the same system (it is probably quite untested).

    Anyway, I'm just grabbing the first platform available, and I definitely need to add a configuration file to SmallLuxGPU with an option to select/enable/disable the platforms and devices available on the system ... something to do next weekend :wink:
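    For reference, enumerating the available platforms with the OpenCL host API looks roughly like this (a sketch of the selection option described above, not SmallLuxGPU's actual code):

```
/* Sketch: enumerate OpenCL platforms and pick one by index;
   the index would come from the configuration file mentioned above. */
#include <CL/cl.h>

cl_platform_id select_platform(cl_uint wanted) {
    cl_uint count = 0;
    clGetPlatformIDs(0, NULL, &count);       /* query how many platforms */
    if (count == 0 || wanted >= count)
        return NULL;

    cl_platform_id platforms[16];
    clGetPlatformIDs(count < 16 ? count : 16, platforms, NULL);

    char name[256];
    clGetPlatformInfo(platforms[wanted], CL_PLATFORM_NAME,
                      sizeof(name), name, NULL);
    /* name now holds e.g. "NVIDIA CUDA" or "ATI Stream" */
    return platforms[wanted];
}
```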
     
  2. ElMoIsEviL

    Newcomer

    Joined:
    Nov 3, 2003
    Messages:
    21
    Likes Received:
    0
    Location:
    Ottawa, Canada
    Perfect... thanks :)

    That should be quite fun to play with :) Great work so far btw.
     
  3. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
  4. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
  5. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,969
    Likes Received:
    963
    Location:
    Torquay, UK
  6. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    From NVIDIA OpenCL SDK:

    [...]
    2. On Windows Vista and Windows 7 (32 and 64 bit) with driver 190.89, multi-GPU configurations and applications may not obtain parallel scaling for OpenCL apps from use of a second or additional GPU's.
    [...]

    It looks like they are aware of the problems.
     
  7. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    Let's hope they are aware of the solution too!
     
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Hi David,

    After going through a few driver revisions, I finally got SmallptGPU running on my comp (Core 2 Duo 2.4GHz, 8800 GTS 640MB). I was actually playing with the original Smallpt program first to understand the algorithm better and fool around with it.

    Running SmallptGPU, I was getting 370k samples/s, while I got 500k samples/s multithreaded on my CPU. I then realized that you made some algorithm changes in the direct lighting section. When I restored it to be similar to the original Smallpt code, I got 900k samples/s (and the image matched the CPU version, too).

    I see how the direct lighting lowers the pass count needed for any target image quality, but I didn't expect such a dropoff in performance (though looking back I can see how it doubles the ray tests for diffuse hits). So in that function I tried replacing "for (i = 0; i < sphereCount; i++)" with "i = 8;" (and the "continue" with a "return") and performance doubled to 635k samples/s. This is really bizarre, because it's completely coherent branching. I also tried taking out the ray tests in SampleLights() to get a feel for the performance (this is the ideal place to do it because it doesn't affect program flow), and doing some math I get 70 ns to test a ray, or about 8000 shader cycles on my card. That's pretty damn steep for nine ~20-cycle sphere-ray tests!

    (FYI, in the SampleLights(), you need to delete the FLOAT_PI factor to get the lighting to be the same as the regular unbiased path tracing algorithm. I worked through the math, and it makes sense.)

    It's not branch incoherence, because I eliminated all divergent branching by deleting everything but the DIFF block (thus making all objects diffuse), and performance was the same. I was hoping to use NVidia's OpenCL profiler, but it crashes for me.

    So I finally decided to play around with CUDA, and after a bunch of frustration I finally got a port working. Basing it on your original OpenCL code (370kSamples/s for me, or ~5MRays/s), I first got 10 MRays/s. A decent improvement, but still not where I wanted it to be, particularly because I didn't get constant buffers working yet. When I did, I got up to 330 MRays per second! Now that's more like it! (~20 instr * 9 spheres + ~60 instr. between rays) * 330M is about 70% of my card's peak rate, ignoring divergence and estimation errors, so it's probably about as optimal as it gets. Is it true that NVidia has no caching of memory access unless you have an image object? The sphere data is 400 freaking bytes, so it's shocking that it can be the cause of a 30x difference in performance.

    Anyway, this is so awesome. I had no idea that a brute-force method like path tracing was so close to real-time. Attached below is an image of what my lowly 96 SP card can do in just three seconds, and I wouldn't be surprised if the 5870 were 20x as fast as my card. So David, if you can figure out how to properly get data into the constant buffers with OpenCL, you'll see monstrous speed improvements. The second-best option is to use image objects.
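    The constant-buffer trick looks roughly like this in CUDA (a sketch; the Sphere layout and names are illustrative, not the actual port):

```
// Sketch: the ~400 bytes of sphere data live in constant memory, which is
// cached and broadcast to all threads reading the same address.
struct Sphere { float rad; float3 pos, emi, col; };

__constant__ Sphere d_spheres[9];

// Host side, once at startup:
//   cudaMemcpyToSymbol(d_spheres, h_spheres, sizeof(h_spheres));

__global__ void trace(/* ... */) {
    for (int i = 0; i < 9; i++) {
        // after warm-up this is a constant-cache hit, not a DRAM read
        const Sphere &s = d_spheres[i];
        /* ... intersection test against s ... */
    }
}
```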

    I only have a 64-bit executable, but for those of you with NVidia cards, here it is:
    http://www.its.caltech.edu/~nandra/SmallptCUDA.zip
     

    Attached Files:

  9. cho

    Regular

    Joined:
    Feb 9, 2002
    Messages:
    422
    Likes Received:
    16
    Hmm, on my GTX 275 (240 SPs), I got ~0.75 GRays/s.

     
  10. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Hehe, whoops, there's still some junk left in the console from the CUDA SDK sample I (heavily) modified.

    750 MRays/s is really good. I think your card will get around 2500 kSamples/s with SmallptGPU, which is around 30 MRays/s. I'll see if I can do anything to David's source code to get OpenCL performing where it should.
     
  11. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    CUDA version, running on 1/2 of my 295 (screenshot attached).
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
  13. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Direct light sampling usually reduces the noise present in each sample by a HUGE amount. What I'm trying to say is that it is useless to double samples/sec if you increase the generated noise 5 times. You are effectively slowing the rendering down by removing direct light sampling (i.e. increasing the overall time required to produce a noise-free image).

    Like I said above, performance = (samples/sec) / (noise per sample).

    The noise without direct light sampling can easily increase to an unacceptable level. Just do the following simple test: greatly reduce the radius of the light source in the scene (and increase the power of the light emitted).

    Your code will be practically unable to render the scene (i.e. the probability for a path to hit the light source is so low that you will generate just black samples 99% of the time).
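    The geometry behind this: with uniform hemisphere sampling, the chance that a bounce ray happens to hit a spherical light of radius r at distance d falls off with the light's subtended solid angle (a back-of-envelope sketch, assuming r << d):

```latex
P(\text{hit}) \approx \frac{\Omega}{2\pi}
             = 1 - \sqrt{1 - (r/d)^2}
             \approx \frac{r^2}{2d^2}
```

    Halving the radius quarters the hit probability, so without direct light sampling almost every path returns black, exactly as described.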

    Oh, well, I always throw a couple of PIs into the code I write; most of the time I'm even right ;)

    It is, but only for static scenes (quite a big limitation in some applications).

    It is quite easy to get into constant memory: just declare the pointer __constant inside the kernel :wink:

    The problem is that constant memory is limited to 32/64 KB on most hardware, so it isn't of any practical use for non-trivial scenes. BTW, constant memory is horribly bugged at the moment in Apple's OpenCL implementation (it is a known bug).

    I'm doing some tests with NVIDIA and it looks like the way you access memory is incredibly important for them. I have modified SmallLuxGPU to access memory with just 2 float4 reads (and then I unpack the data into other registers), and the result is more than 2 times faster on NVIDIA cards: http://www.luxrender.net/forum/viewtopic.php?f=21&t=2947&start=380#p30296

    So if you are looking for a way to speed up SmallptGPU, you could try changing the way data is read from global memory. You can probably run 2 or 3 times faster.
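    The packed-read idea looks roughly like this in OpenCL C (a sketch of the technique with an illustrative packing convention, not the actual SmallLuxGPU kernel):

```
// Sketch: fetch one sphere as two aligned float4 loads and unpack in
// registers; two 16-byte vector loads coalesce far better than nine
// separate float loads.
// Packing convention (illustrative): a = (rad, pos.xyz), b = (col.xyz, mat).
const float4 a = sphereData[2 * i];
const float4 b = sphereData[2 * i + 1];

const float  rad = a.x;
const float3 pos = (float3)(a.y, a.z, a.w);
const float3 col = (float3)(b.x, b.y, b.z);
const int    mat = (int)b.w;
```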
     
    #213 Dade, Jan 25, 2010
    Last edited by a moderator: Jan 25, 2010
  14. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Heh, I don't have any scene loader right now. It actually crashed for me ("Display driver has stopped responding and has been recovered") with SmallptGPU, but it should be linear in the number of spheres if there's no BVH.

    Actually, I just wrote this to see why the OpenCL version was running so slow and figure out what kind of speed can be achieved. Now that I've seen the kind of performance achievable with my own eyes, I have bigger plans :grin:

    Maybe all those people pushing for raytracing were onto something...
     
  15. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Oh, I absolutely agree, though maybe I wasn't clear in my earlier post. I implemented it in my CUDA code, though I changed the random point generation and some math to make it faster.

    The quality jump is huge, which is why I compared the CUDA perf to the 370k Samples I got in SmallptGPU instead of the 900k Samples I got without DL. The latter was just an exercise to compare with the original CPU Smallpt, and to compare DL quality to non-DL quality. Like I mentioned, once I took out the pi factor I got equal images, and now there's no going back...

    Not necessarily. Remember that realtime physics engines already use a BVH and simplified primitives to approximate real geometry, and it's updated every frame.

    Oh I know you tried to do everything you could, but what I'm telling you is that NVidia's (and probably ATI's) OpenCL driver is not using const memory. With CUDA, I get a 30x increase in speed by using constant memory, 60x compared to SmallptGPU. Not 2-3x.

    You can still put a big chunk of the BVH in there, and then go to memory after that. A scene can become a lot more manageable when you can divide it up into 5,000 spheres.
     
    #215 Mintmaster, Jan 25, 2010
    Last edited by a moderator: Jan 25, 2010
  16. Talonman

    Newcomer

    Joined:
    Jan 2, 2010
    Messages:
    64
    Likes Received:
    0
    We do still have your latest version posted here, correct buddy? :wink:

    Version SmallptGPU-v2.0alpha2B

    edit:
    Oops... never mind. You were talking about SmallLux; I was talking about SmallptGPU.
     
    #216 Talonman, Jan 25, 2010
    Last edited by a moderator: Jan 25, 2010
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Because traversing a tree in memory is expensive, what about intersecting triangles with rays, rather than rays with triangles?

    Code:
    for each triangle
        for each ray
            intersect-test
    
    So, each triangle is a work-item. Or, you can launch 4 triangles per work-item.

    Rays are linear in memory and logically coherent across all work-items when being accessed - which makes for great cache behaviour. (Though obviously the population of triangles is normally so large that load-balancing amongst work-groups will result in incoherence.)

    EDIT: Oh, and since the rays are now being coherently fetched, you can dump a huge number of them into local memory for sharing amongst all items in a work-group. This improves cache behaviour even further.
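    The local-memory staging step might look like this in OpenCL C (a sketch with illustrative names; TILE_SIZE and the Ray struct are assumptions):

```
// Sketch: each work-group cooperatively stages a tile of rays into local
// memory once, then every work-item (one triangle each) tests against the
// whole tile, so each ray is read from global memory only once per group.
__kernel void intersect(__global const Ray *rays, const uint rayCount,
                        __local Ray *tile /* TILE_SIZE entries */) {
    const uint lid   = get_local_id(0);
    const uint lsize = get_local_size(0);

    for (uint base = 0; base < rayCount; base += TILE_SIZE) {
        // coherent, coalesced copy: global -> local
        for (uint r = lid; base + r < rayCount && r < TILE_SIZE; r += lsize)
            tile[r] = rays[base + r];
        barrier(CLK_LOCAL_MEM_FENCE);

        // ... test this work-item's triangle against tile[0..TILE_SIZE) ...
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
```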

    Jawed
     
  18. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    @Jawed: your approach is O(n) in the number of triangles, while an acceleration structure (i.e. BVH, KD-tree, QBVH, etc.) makes the cost of tracing a ray O(log(n)). Your idea can be blazing fast, but only for a very small number of triangles. In any other case even a CPU with a BVH accelerator will be faster.

    The BVH of the scene with 2,700,000 triangles is more than 180 MBytes. There isn't much you can do with 32 KBytes.
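    The 180 MB figure is consistent with a straightforward node layout (a back-of-envelope sketch; the 36-byte node size is my assumption, not LuxRender's actual format):

```c
/* Back-of-envelope: a binary BVH with one triangle per leaf has 2n - 1
   nodes for n triangles. Node size is an assumed 36 bytes (an AABB of
   6 floats plus child/triangle indices); real layouts vary. */
long bvh_bytes(long triangles, long node_bytes) {
    long nodes = 2 * triangles - 1;      /* leaves + internal nodes */
    return nodes * node_bytes;
}
```

    For 2,700,000 triangles that is 5,399,999 nodes, about 185 MiB at 36 bytes each, in line with the figure above; a 32 KB constant buffer would hold only around 900 such nodes.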
     
  19. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    You mean rasterization? :razz:

    The problem with that, Jawed, is everything but the primary ray. Primary rays are automatically sorted in space, hence the speed of rasterization. Secondary rays (and beyond) are not, and are very difficult to sort. There may be some research into this area, but I'm not sure.

    Think of a balanced binary tree with 4095 nodes. It takes 12 accesses to get to the leaves. If you add another 12 levels below each leaf, you get 16.8 million nodes. Thus, by putting the first 4095 nodes in fast memory, you save half of the expensive accesses. Then, if you have coherency, the caches in each compute unit will store localized portions of the BVH instead of each one replicating those base nodes again and again.

    Putting the top of the tree in constant memory can give you very large gains in speed.
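    The node counts above follow from powers of two; a quick sanity check:

```c
/* A full binary tree with L levels has 2^L - 1 nodes: 12 levels give
   4095 nodes, and 12 more levels below each leaf give 2^24 - 1, i.e.
   ~16.8 million nodes in total. */
long full_tree_nodes(int levels) {
    return (1L << levels) - 1;
}
```

    So caching the top 4095 nodes covers 12 of the 24 accesses per traversal of the 16.8M-node tree, i.e. half of the expensive memory reads.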
     
  20. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    Rasterizing primary rays and raytracing secondary rays can give really good performance. There are lots of papers on this, including one I was on ;-). As for resorting secondary rays, some work has been done on this in various papers. You can literally resort individual rays and try to find the tradeoff of sort overhead vs. memory-access coherence. It's always going to be a tricky thing to balance and will be heavily scene-dependent. On real scenes some form of bin sort may work out really nicely. But for synthetic scenes, like a room full of glass/chrome spheres, things go incoherent so fast that I'm not sure you will get much speedup.

    As for where Jawed was heading, you could set up your tree/BVH to have more triangles at the leaves, enough to fit in the available scratchpad memory, and try to pound through those nicely. But that doesn't help much if all of the rays in a vector/warp/wavefront want to intersect different parts of the scene.
     