GPU Ray-tracing for OpenCL

Talonman, you may be interested in trying http://davibu.interfree.it/opencl/smallluxgpu/smallluxgpu-v1.1beta1.tgz (note: the pre-compiled Windows executable there is slightly outdated and the camera movements could be a bit slow).

SmallLuxGPU is showing some nice progress, with the capability to run on multiple OpenCL GPUs/CPUs plus native threads (i.e. you can use the CPU even if you don't have an OpenCL CPU device, as is the case with NVIDIA's OpenCL). This is SmallLuxGPU running on my brand new PC with 8 native threads (i.e. hyper-threading) and 1 GPU:


[screenshot]


It looks like (thanks to a much simpler kernel compared with SmallptGPU) it works well on NVIDIA hardware; check this screenshot of Jens's Mac Pro with an 8800GT:

[screenshot]


The screenshot posted by Eros on the LuxRender forums is particularly interesting:


[screenshot]


Eros's screenshot shows quite good results achieved with a GTX 295. Overall, I really like the capability to run on just about any OS and on any number of CPUs and GPUs available ;)
 
Yah, from all this preliminary work you guys are doing in OpenCL, I'm quite impressed with what it's able to do, bugs and all. I'm actually starting to get somewhat excited to see what OpenCL will bring in the coming years.

Regards,
SB
 
I gave it a run, Dave... ;)

Note that Precision now displays each GPU's utilization. I have already verified that it reports the same % as GPU-Z.

When I start the program, it starts out at about 26,000K Samples/sec, then drops down to about 18,500K Samples/sec. If I let it run, it gradually climbs back up.

This is my first run, giving it about 8 minutes of run time... 24,213K Samples/sec.
simplebeta.jpg

Note that my GPU utilization is 8%, 10%, and 8%. CPU is running at 87% utilization.


I started it up a second time to see if my water-cooled CPU was getting hot... Looks fine!
I was getting 25,479K Samples/sec.
Note that my GPU utilization is 9%, 11%, and 9%.
simplebeta2.jpg



This is the BigMonkey run...
Note that my GPU utilization is 22%, 25%, and 17%.
bigmonkey.jpg



Running the Loft file: Note that my GPU utilization is 69%, 77%, and 78%.
loftc.jpg
 
Mine automatically get re-sized to smaller images when they're uploaded... (I thought?) :)

Do you also have to click on the image to see it full size?
 
Well, I really like the new and improved SmallPT 2.0 Beta 2 :smile:
Thanks, Dade :!:

Here is what I promised to Talonman:
smallpt2gpu11071282pnga.jpg


and to show it's not a suicide shot ...

smallpt2gpu11071282pngay.jpg


:p

I think I've also found a small bug. As you can see in my second screenshot, the average score is really low, but you can tell from the GPU load graph, as well as the CPU load graphs, that everything was running full bore for over 6000 passes. The counter just reset itself around pass 5900-6000 and started climbing up again from 0. :mad:

From other observations: GPU and CPU usage sometimes fluctuates a lot, especially at the beginning (the first 1000 passes), but then it stabilizes.
My GPU is golden! Not so much my memory ... (it's not stable at stock; luckily SmallPT doesn't seem to care about that). Mr. Baumann, I will need an HD5890 quite soon so I can send my card for RMA ;]
The app reacts quite nicely to a GPU memory overclock.
 
Wow... over 20,000K Samples/sec!!

Hehe, new record :!:

@Lightman: the sample counter reset is probably an overflow problem ... it is easy to run out of bits on statistics counters at 20M samples/sec ;)
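
For instance, assuming the counter is a plain 32-bit integer (my guess, I haven't checked the code), the numbers work out like this:

#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical 32-bit sample counter, as suspected above.
    const double rate = 20e6; // ~20M samples/sec
    std::printf("32-bit counter wraps after %.0f s (~%.1f min)\n",
                UINT32_MAX / rate, UINT32_MAX / rate / 60.0);
    // A 64-bit accumulator pushes the wrap point out to ~29,000 years.
    std::printf("64-bit counter wraps after ~%.0f years\n",
                UINT64_MAX / rate / (3600.0 * 24 * 365));
    return 0;
}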

Just to give you an idea, Kevin Beason spent 124 minutes rendering this image at 5000 samples per pixel with a Q6600 (with the original SmallPT): http://www.kevinbeason.com/smallpt/
Rendering the same image takes 3 minutes on your PC. Your HD5870 is nearly 40 times faster than a Q6600 (124 / 3 ≈ 41) :oops:


@Talonman: you could try to spawn native threads in order to use the CPU for rendering too. You only have to edit one of the ".bat" files and change the parameters to something like:

SmallLuxGPU.exe 4 0 1 64 640 480 scenes\luxball.scn

(the first parameter is the number of native threads to spawn).

However, it looks like your CPU already has some trouble keeping your 3 GPUs busy, so this could just end up slowing down the rendering (maybe a value of 3 or 2 native threads would be better).
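
For example, to try 3 native threads, only the first parameter changes:

SmallLuxGPU.exe 3 0 1 64 640 480 scenes\luxball.scn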

Overall, I'm happy to see that NVIDIA cards work well with SmallLuxGPU and they don't suffer from the problems shown with SmallptGPU.
 
On your PC, David, does the HD5870 ever run at 100% utilisation?

Jawed, I definitely have to check. I have zero tools on Linux to evaluate GPU performance/utilization; I need to find a good tool for Windows and check.
 
In version 1.0 of SmallLuxGPU there are timings for the time taken by the GPU versus the time taken by the CPU. They show the GPU as active for only a small portion of the time. Does version 1.1 use multiple CPU threads to feed work to the GPU and process the results, in order to maximise GPU utilisation?

It seems GPU utilisation will increase with triangle count and/or the volume of the scene - so it may not be a worry that utilisation is low with these scenes. But there could be a problem with multiple GPUs in rendering rigs if the CPU can't keep them fed.

Have you tried installing your old graphics card into your new PC alongside the HD5870?

Jawed
 
For Windows you can use CCC to check GPU utilization, in the Overdrive section, if you don't want to use other tools like GPU-Z.

In my case, the utilization never exceeds 97%, in contrast to some D3D/OGL utilities.
 
@Talonman: you could try to spawn native threads in order to use the CPU for rendering too. You only have to edit one of the ".bat" files and change the parameters to something like:

SmallLuxGPU.exe 4 0 1 64 640 480 scenes\luxball.scn

(the first parameter is the number of native threads to spawn).

However, it looks like your CPU already has some trouble keeping your 3 GPUs busy, so this could just end up slowing down the rendering (maybe a value of 3 or 2 native threads would be better).

Overall, I'm happy to see that NVIDIA cards work well with SmallLuxGPU and they don't suffer from the problems shown with SmallptGPU.
You hit the nail on the head, Dave...

I edited the LuxBall file, since we already have my post above for comparison.
I tried to spawn 4 CPU threads: the CPU went to 100%, and the app stopped responding.
I then went to 3 threads: the CPU went to 100%, and the app stopped responding.
I then went to 2 CPU threads, and the app ran:
2threads.jpg

CPU @ 97% utilization.
GPUs @ 28%, 30%, and 30%.

Looks like not enough CPU, even with a quad core @ 3.73GHz. :oops:

My thinking is that until NVIDIA gets the CPU utilization lower, just to keep the GPUs happy, we will continue to suffer performance issues.
My performance in this test was cut in half by spawning 2 native CPU rendering threads.
Further, my 3 GPUs' utilization is lower and sporadic.
Granted, we may not be using the optimum work group size for my setup, but the point still holds.
 
In version 1.0 of SmallLuxGPU there are timings for the time taken by the GPU versus the time taken by the CPU. They show the GPU as active for only a small portion of the time. Does version 1.1 use multiple CPU threads to feed work to the GPU and process the results, in order to maximise GPU utilisation?

Even better, Jawed: there are 2 threads for each GPU. One is dedicated to generating rays and collecting ray intersection results; the second just executes the OpenCL kernel. While the first thread works on ray buffer A, the second is intersecting rays in ray buffer B (and vice versa at the next step).

It works like a 2-stage pipeline, so the execution time is equal to max(CPU stage time, GPU stage time).

Note: the first stage (i.e. the CPU stage) does a constant amount of work (its complexity doesn't scale with the number of triangles in the scene), while the second stage (i.e. the GPU stage) scales with the triangle count.

This means that, no matter how slow your CPU is (or how fast your GPU is), there is a scene complex enough to keep your GPU busy.
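
Here is a minimal sketch of the idea (the names and structure are just an illustration; the actual SmallLuxGPU code is organized differently):

#include <cstdio>
#include <functional>
#include <thread>

// Hypothetical ray buffer; the real SmallLuxGPU types differ.
struct RayBuffer { int id; };

// Stage 1 (CPU): generate new rays and integrate the previous hit results.
// Roughly constant work per pass, independent of the triangle count.
void cpuStage(RayBuffer &rb) { std::printf("CPU fills buffer %d\n", rb.id); }

// Stage 2 (GPU): run the OpenCL ray/triangle intersection kernel.
// This work grows with the triangle count of the scene.
void gpuStage(RayBuffer &rb) { std::printf("GPU intersects buffer %d\n", rb.id); }

int main() {
    RayBuffer buf[2] = {{0}, {1}};
    for (int pass = 0; pass < 4; ++pass) {
        // Ping-pong: the CPU fills one buffer while the GPU intersects the
        // other, so a pass costs max(CPU stage, GPU stage), not the sum.
        std::thread gpu(gpuStage, std::ref(buf[(pass + 1) & 1]));
        cpuStage(buf[pass & 1]);
        gpu.join();
    }
    return 0;
}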

This pattern is clearly shown in Talonman's tests: in simpler scenes the CPU is the bottleneck, while GPU usage rises with more complex scenes.

This should work quite well in practice because current LuxRender scenes often use more than 1 million triangles (the Luxball scene is 262k triangles). For instance, I rendered this 1-million-triangle scene http://www.luxrender.net/gallery/main.php?g2_view=core.DownloadItem&g2_itemId=9019 for Pinko in about a weekend on a network of 6 quad-cores ... I find the idea of rendering this image in one night at home quite sexy ;)

It seems GPU utilisation will increase with triangle count and/or the volume of the scene - so it may not be a worry that utilisation is low with these scenes. But there could be a problem with multiple GPUs in rendering rigs if the CPU can't keep them fed.

Yup, you still need a bit of balance between the CPUs and GPUs in your system with this approach. However, you can always find a scene complex enough to keep your GPUs busy ... though you may have no need to render such a complex scene :idea:

Have you tried installing your old graphics card into your new PC alongside the HD5870?

I tried yesterday, but it wasn't recognized by the motherboard; I just removed the 4870, planning to do some tests over the weekend (ahah, there is never enough time to do everything :rolleyes:).
 
The 2-threaded pipeline with the alternating buffers is a nice idea. I guess there's no limit on the number of buffers you could set up, each assigned to a thread, with a final thread that handles OpenCL kernel execution.

It'll be interesting to see the kind of speed-up you get on that room interior. That's quite an impressive render :oops:

Jawed
 
Yup, you still need a bit of balance between the CPUs and GPUs in your system with this approach. However, you can always find a scene complex enough to keep your GPUs busy ... though you may have no need to render such a complex scene :idea:
It kind of sucks that you need CPU intervention every time a ray hits something, particularly since you need over 10k samples per pixel to get a clean image.

Have you thought about generating rays that weren't truly random? I would imagine that the incoherence of random rays traversing a BVH would be a nightmare for bandwidth efficiency, though if you write your code correctly then at least branch coherency shouldn't be a problem beyond the fact that different rays take different amounts of time to finish.

What I'm thinking is that maybe you can figure out a way to have all rays go in approximately the same direction (i.e. small random perturbations about the same base ray, then warped as required by the BRDF if necessary), and do it in a way that after 1000 rays per pixel you have roughly uniform distribution.
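
Something like this toy sketch, maybe (entirely hypothetical, not anything from SmallLuxGPU): every ray in a given pass jitters inside the same small stratum of the sampling domain, so the directions within a pass are nearly identical (coherent BVH traversal), while the strata tile the whole domain over many passes, so the aggregate distribution still converges to the intended one:

#include <cmath>
#include <cstdio>
#include <random>

struct Vec { float x, y, z; };

// Map a point in [0,1)^2 to a cosine-weighted hemisphere direction.
Vec cosineHemisphere(float u, float v) {
    const float r = std::sqrt(u), phi = 6.2831853f * v;
    return { r * std::cos(phi), r * std::sin(phi), std::sqrt(1.0f - u) };
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> jitter(0.0f, 1.0f);
    const int STRATA = 32; // 32x32 strata = full coverage after 1024 passes
    for (int pass = 0; pass < STRATA * STRATA; ++pass) {
        const int su = pass % STRATA, sv = pass / STRATA;
        // One stratum per pass; each pixel would add its own tiny jitter
        // within this same stratum (a single sample is shown here).
        const float u = (su + jitter(rng)) / STRATA;
        const float v = (sv + jitter(rng)) / STRATA;
        const Vec d = cosineHemisphere(u, v);
        if (pass < 3)
            std::printf("pass %d: dir = (%.3f, %.3f, %.3f)\n", pass, d.x, d.y, d.z);
    }
    return 0;
}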
 