GPU Ray-tracing for OpenCL

EDIT: Upon removal and restart, the system now works with the ATi GPUs but can no longer do OpenCL on the nVIDIA GPU. It appears that each vendor's own OpenCL.dll doesn't pick up the other vendor's products.

OpenCL supports, or rather exposes, the presence of multiple OpenCL platforms on the system (i.e. ATI, NVIDIA, etc.). However, as with multiple devices, it requires an application capable of using them. At the moment, I'm not even sure ATI and NVIDIA drivers can coexist on the same system (it is probably something quite untested).

Anyway, I'm just grabbing the first platform available, and I definitely need to add a configuration file to SmallLuxGPU with an option to select/enable/disable the platforms and devices available on the system ... something to do next week-end ;)
 
Perfect... thanks :)

That should be quite fun to play with :) Great work so far btw.
 
From the NVIDIA OpenCL SDK:

[...]
2. On Windows Vista and Windows 7 (32 and 64 bit) with driver 190.89, multi-GPU configurations and applications may not obtain parallel scaling for OpenCL apps from use of a second or additional GPU's.
[...]

It looks like they are aware of the problems.
 
Hi David,

After going through a few driver revisions, I finally got SmallptGPU running on my comp (Core 2 Duo 2.4GHz, 8800 GTS 640MB). I was actually playing with the original Smallpt program first to understand the algorithm better and fool around with it.

Running SmallptGPU, I was getting 370k samples/s, while I got 500k samples/s multithreaded on my CPU. I then realized that you made some algorithm changes in the direct lighting section. If I restored it to be similar to the original Smallpt code, I got 900k samples/s (and the image matched the CPU version, too).

I see how the direct lighting lowers the pass requirement for any target image quality, but I didn't expect such a dropoff in performance (though looking back I can see how it doubles the ray tests for diffuse hits). So in that function I tried replacing "for (i = 0; i < sphereCount; i++)" with "i = 8;" (and the "continue" with a "return") and performance doubled to 635k samples/s. This is really bizarre, because it's completely coherent branching. I also tried taking out the ray tests in SampleLights() to get a feel for the performance (this is the ideal location to do it because it doesn't affect program flow), and doing some math I get 70 ns to test a ray, or 8000 shader cycles on my card. That's pretty damn steep for nine ~20-cycle sphere-ray tests!
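
To be concrete, here is a minimal sketch of that change (the IsLight() helper and the loop body are hypothetical stand-ins; the actual SmallptGPU source differs):

Code:
/* Hypothetical sketch of the SampleLights() experiment described above. */

/* Original: scan all spheres, skipping the ones that aren't lights. */
for (i = 0; i < sphereCount; i++) {
    if (!IsLight(&spheres[i]))
        continue;
    /* ... cast a shadow ray toward light i and accumulate ... */
}

/* Modified: index the single light (sphere 8) directly and bail out. */
i = 8;
if (!IsLight(&spheres[i]))
    return;
/* ... cast a shadow ray toward light i and accumulate ... */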

(FYI, in SampleLights(), you need to delete the FLOAT_PI factor to get the lighting to match the regular unbiased path tracing algorithm. I worked through the math, and it makes sense.)

It's not branch incoherence, because I eliminated all divergent branching by deleting everything but the DIFF block (thus making all objects diffuse), and performance was the same. I was hoping to use NVidia's OpenCL profiler, but it crashes for me.

So I finally decided to play around with CUDA, and after a bunch of frustration I finally got a port working. Basing it on your original OpenCL code (370kSamples/s for me, or ~5MRays/s), I first got 10 MRays/s. A decent improvement, but still not where I wanted it to be, particularly because I didn't get constant buffers working yet. When I did, I got up to 330 MRays per second! Now that's more like it! (~20 instr * 9 spheres + ~60 instr. between rays) * 330M is about 70% of my card's peak rate, ignoring divergence and estimation errors, so it's probably about as optimal as it gets. Is it true that NVidia has no caching of memory access unless you have an image object? The sphere data is 400 freaking bytes, so it's shocking that it can be the cause of a 30x difference in performance.

Anyway, this is so awesome. I had no idea that a brute force method like path tracing was so close to real-time. Attached below is an image of what my lowly 96 SP card can do in just three seconds, and I wouldn't be surprised if the 5870 was 20x as fast as my card. So David, if you can figure out how to properly get data into the constant buffers with OpenCL, you'll see monstrous speed improvements. The second best option is to use image objects.

I only have a 64-bit executable, but for those of you with NVidia cards, here it is:
http://www.its.caltech.edu/~nandra/SmallptCUDA.zip
 

Attachments

  • SmallptCUDA_3sec_8800GTS.jpg (43.9 KB)
Hmm, on my GTX 275 (240 SP), I got ~0.75 GRays/s.

 
Hehe, whoops, there's still some junk left in the console from the CUDA SDK sample I (heavily) modified.

750 MRays/s is really good. I think your card will get around 2500 kSamples/s with SmallptGPU, which is around 30 MRays/s. I'll see if I can do anything to David's source code to get OpenCL performing where it should.
 
After going through a few driver revisions, I finally got SmallptGPU running on my comp (Core 2 Duo 2.4GHz, 8800 GTS 640MB). I was actually playing with the original Smallpt program first to understand the algorithm better and fool around with it.

Running SmallptGPU, I was getting 370k samples/s, while I got 500k samples/s multithreaded on my CPU. I then realized that you made some algorithm changes in the direct lighting section. If I restored it to be similar to the original Smallpt code, I got 900k samples/s (and the image matched the CPU version, too).

Direct light sampling usually reduces the noise present in each sample by a HUGE amount. What I'm trying to say is that it is useless to double the samples/sec if you increase the generated noise by 5 times. You are effectively slowing down the rendering by removing direct light sampling (i.e. increasing the overall time required to produce a noise-free image).

I see how the direct lighting lowers the pass requirement for any target image quality, but I didn't expect such a dropoff in performance (though looking back I can see how it doubles the ray tests for diffuse hits).

Like I said above, performance = (samples/sec) / (noise per sample).

The noise without direct light sampling can easily increase to an unacceptable level. Just do the following simple test: greatly reduce the radius of the light source in the scene (and increase the power of the light emitted).

Your code will be practically unable to render the scene (i.e. the probability for a path to hit the light source is so low that you will generate just black samples 99% of the time).

(FYI, in SampleLights(), you need to delete the FLOAT_PI factor to get the lighting to match the regular unbiased path tracing algorithm. I worked through the math, and it makes sense.)

Oh, well, I always throw a couple of PIs into the code I write; most of the time I'm even right ;)

Anyway, this is so awesome. I had no idea that a brute force method like path tracing was so close to real-time.

It is, but only for static scenes (quite a huge limitation in some applications).

Attached below is an image of what my lowly 96 SP card can do in just three seconds, and I wouldn't be surprised if the 5870 was 20x as fast as my card. So David, if you can figure out how to properly get data into the constant buffers with OpenCL, you'll see monstrous speed improvements.

It is quite easy to get data into constant memory: just declare the pointer __constant inside the kernel ;)
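
For reference, such a declaration looks something like this (a minimal sketch; the Sphere struct and kernel signature are illustrative, not the actual SmallptGPU code):

Code:
/* Minimal sketch of using OpenCL constant memory; the types and names
   here are illustrative, not the real SmallptGPU kernel signature. */
typedef struct {
    float radius;
    float px, py, pz;    /* position */
    float ex, ey, ez;    /* emission */
    float cx, cy, cz;    /* diffuse color */
} Sphere;

__kernel void Radiance(
    __constant Sphere *spheres,     /* lives in the 32/64 KB constant space */
    const unsigned int sphereCount,
    __global float *pixels) {
    /* ... trace paths, reading spheres[i] from constant memory ... */
}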

The problem is that constant memory is limited to 32/64 KB on most hardware, so it isn't of any practical use for non-trivial scenes. BTW, constant memory is horribly bugged at the moment in Apple's OpenCL implementation (it is a known bug).

I'm doing some tests with NVIDIA and it looks like the way you access memory is incredibly important for them. I have modified SmallLuxGPU to access memory with just 2 float4 reads (and then I unpack the data into other registers), and the result is more than 2 times faster on NVIDIA cards: http://www.luxrender.net/forum/viewtopic.php?f=21&t=2947&start=380#p30296

So if you are looking for a way to speed up SmallptGPU, you could try changing the way data is read from global memory. You can probably run 2 or 3 times faster.
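
Something along these lines, as a rough sketch (the two-float4 packing below is made up for illustration; the actual SmallLuxGPU layout may differ):

Code:
/* Rough sketch (kernel fragment): fetch one sphere as two aligned
   float4 loads, then unpack into private registers. The packing is
   illustrative; "spheres" is the kernel's __global scene buffer. */
__global const float4 *sphereData = (__global const float4 *)spheres;

const float4 data0 = sphereData[2 * sphereIndex];      /* 16-byte read */
const float4 data1 = sphereData[2 * sphereIndex + 1];  /* 16-byte read */

const float px = data0.x, py = data0.y, pz = data0.z;  /* position */
const float rad = data0.w;                             /* radius */
const float er = data1.x, eg = data1.y, eb = data1.z;  /* emission */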
 
Very nice work, Mintmaster. There's a more complex scene with 783 spheres in it to give it more of a workout. I mentioned it before, but for some reason no one seemed to want to play:

http://forum.beyond3d.com/showthread.php?p=1378149#post1378149

Jawed
Heh, I don't have any scene loader right now. It actually crashed for me ("Display driver has stopped responding and has been recovered") with SmallptGPU, but performance should be linear in the number of spheres if there's no BVH.

Actually, I just wrote this to see why the OpenCL version was running so slow and figure out what kind of speed can be achieved. Now that I've seen the kind of performance achievable with my own eyes, I have bigger plans :D

Maybe all those people pushing for raytracing were onto something...
 
Direct light sampling usually reduces the noise present in each sample by a HUGE amount. What I'm trying to say is that it is useless to double the samples/sec if you increase the generated noise by 5 times. You are effectively slowing down the rendering by removing direct light sampling (i.e. increasing the overall time required to produce a noise-free image).
Oh, I absolutely agree, though maybe I wasn't clear in my earlier post. I implemented it in my CUDA code, though I changed the random point generation and some math to make it faster.

The quality jump is huge, which is why I compared the CUDA perf to the 370k Samples I got in SmallptGPU instead of the 900k Samples I got without DL. The latter was just an exercise to compare with the original CPU Smallpt, and to compare DL quality to non-DL quality. Like I mentioned, once I took out the pi factor I got equal images, and now there's no going back...

It is, but only for static scenes (quite a huge limitation in some applications).
Not necessarily. Remember that realtime physics engines already use a BVH and simplified primitives to approximate real geometry, and it's updated every frame.

It is quite easy to get data into constant memory: just declare the pointer __constant inside the kernel ;)
Oh I know you tried to do everything you could, but what I'm telling you is that NVidia's (and probably ATI's) OpenCL driver is not using const memory. With CUDA, I get a 30x increase in speed by using constant memory, 60x compared to SmallptGPU. Not 2-3x.

The problem is that constant memory is limited to 32/64 KB on most hardware, so it isn't of any practical use for non-trivial scenes.
You can still put a big chunk of the BVH in there, and then go to global memory after that. A scene can become a lot more manageable when you can divide it up into 5,000 spheres.
 
I'm doing some tests with NVIDIA and it looks like the way you access memory is incredibly important for them. I have modified SmallLuxGPU to access memory with just 2 float4 reads (and then I unpack the data into other registers), and the result is more than 2 times faster on NVIDIA cards: http://www.luxrender.net/forum/viewtopic.php?f=21&t=2947&start=380#p30296

So if you are looking for a way to speed up SmallptGPU, you could try changing the way data is read from global memory. You can probably run 2 or 3 times faster.

We do still have your latest version posted here, correct buddy? ;)

Version SmallptGPU-v2.0alpha2B

edit:
Oops... Never mind. You were talking SmallLux, I was talking SmallptGPU.
 
Because traversing a tree in memory is expensive, what about intersecting triangles with rays, rather than rays with triangles?

Code:
for each triangle
    for each ray
        intersect-test

So, each triangle is a work-item. Or, you can launch 4 triangles per work-item.

Rays are linear in memory and logically coherent across all work-items when being accessed - which makes for great cache behaviour. (Though obviously the population of triangles is normally so large that load-balancing amongst work-groups will result in incoherence.)

EDIT: Oh, and since the rays are now being coherently fetched, you can dump a huge number of them into local memory for sharing amongst all items in a work-group. This improves cache behaviour even further.
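
As a rough OpenCL sketch of the idea (the Ray struct, tile size, and the elided hit update are assumptions for illustration, not code from any program in this thread):

Code:
/* Sketch: one triangle per work-item, with rays staged through local
   memory in tiles. Layouts and sizes are illustrative assumptions. */
#define RAYS_PER_TILE 64

typedef struct { float4 orig; float4 dir; } Ray;

__kernel void IntersectTriMajor(__global const float4 *triangles,
                                __global const Ray *rays,
                                const unsigned int rayCount,
                                __global float *bestHit) {
    const unsigned int triIndex = get_global_id(0);
    const unsigned int lid = get_local_id(0);

    __local Ray tile[RAYS_PER_TILE];

    for (unsigned int base = 0; base < rayCount; base += RAYS_PER_TILE) {
        /* Cooperative, coherent load: adjacent work-items fetch
           adjacent rays, which is the point of the whole scheme. */
        if (lid < RAYS_PER_TILE && base + lid < rayCount)
            tile[lid] = rays[base + lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (unsigned int r = 0; r < RAYS_PER_TILE && base + r < rayCount; r++) {
            /* ... test tile[r] against triangle triIndex and min-combine
               the hit distance into bestHit[base + r] ... */
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}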

Jawed
 
@Jawed: your approach is O(n) in the number of triangles, while any acceleration structure (i.e. BVH, KD-tree, QBVH, etc.) reduces the cost of tracing a ray to O(log n). Your idea can be blazing fast, but only for a very small number of triangles. In any other case, even a CPU with a BVH will be faster.

You can still put a big chunk of the BVH in there, and then go to global memory after that. A scene can become a lot more manageable when you can divide it up into 5,000 spheres.

The BVH of the scene with 2,700,000 triangles is more than 180 MB. There isn't very much you can do with 32 KB.
 
Because traversing a tree in memory is expensive, what about intersecting triangles with rays, rather than rays with triangles?
You mean rasterization? :p

The problem with that, Jawed, is everything but the primary ray. Primary rays are automatically sorted in space, hence the speed of rasterization. Secondary rays (and beyond) are not, and are very difficult to sort. There may be some research into this area, but I'm not sure.

The BVH of the scene with 2,700,000 triangles is more than 180 MB. There isn't very much you can do with 32 KB.
Think of a balanced binary tree with 4095 nodes. It will take 12 accesses to get to the leaves. If you add another 12 levels below each leaf, you get 16.8 million nodes. Thus, by putting the first 4095 nodes in fast memory, you save half the number of expensive accesses. Then, if you have coherency, the caches on each compute unit will store localized portions of the BVH instead of each one replicating those base nodes again and again.

Putting the top of the tree in constant memory can give you very large gains in speed.
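
A sketch of that two-tier fetch (the node layout and names are invented for illustration):

Code:
/* Sketch: serve the top 12 levels of the BVH from constant memory and
   everything deeper from global memory. The node layout is made up. */
#define TOP_NODES 4095  /* 2^12 - 1: a full 12-level binary tree */

typedef struct {
    float4 bboxMin;  /* .w unused */
    float4 bboxMax;  /* .w could encode child/leaf indices */
} BVHNode;

BVHNode FetchNode(__constant BVHNode *topNodes,
                  __global const BVHNode *deepNodes,
                  const unsigned int index) {
    /* The first TOP_NODES nodes come from the fast constant space;
       deeper nodes fall through to ordinary global memory. */
    if (index < TOP_NODES)
        return topNodes[index];
    return deepNodes[index - TOP_NODES];
}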
 
Rasterizing primary rays and raytracing secondary rays can give really good performance. There are lots of papers on this, including one I was on ;-). As for resorting secondary rays, some work has been done on this in various papers. You can literally resort individual rays and try to find the tradeoff of sort overhead vs. memory access coherence. It's always going to be a tricky thing to balance and will be heavily scene dependent. On real scenes some form of bin sort may work out really nicely. But for synthetics, like a room full of glass/chrome spheres, things go incoherent so fast I'm not sure you will get much speedup.

As far as where Jawed was heading, you could set up your tree/BVH to have more triangles at the leaves, enough to, say, fit in the available scratch-pad memory, and try to pound through those nicely. But that doesn't help much if all of the rays in a vector/warp/wavefront want to intersect different parts of the scene.
 