GPU Ray-tracing for OpenCL

@Jawed: yup, I have the option to increase the number of threads to produce more work for the GPU. However, the GPU load isn't bad on Luxball; it is about 80% on my PC. It is a bit too variable, though: it can easily drop if you point the camera at an empty zone of the scene, so it may be necessary to introduce more threads in the future.

@Talonman: rather than converting the scene from Luxrender to SmallLux, I'm going to insert SmallLux into Luxrender ;)

@Mintmaster: moving ray intersection to the GPU is just the first step of a migration process that is going to take some time. The first goal was to safeguard all the time already spent developing Luxrender by choosing a migration path. The major problem for GPU applications, at the moment, is how to avoid throwing all the work already done out of the window. This is, in my opinion, a very important point that is somewhat underestimated. It is not like we can develop everything from scratch again.
Ray coherency can help a lot, but the main problem is that it is typically lost after the first bounce on a surface, so it is mainly useful for eye rays (and I'm tracing 16 rays per pixel after a few passes).

@CNCAddict: is it so fast ? I mean, the resolution of the video looks a bit low to me :?:

BTW, I have uploaded a new video to vimeo recorded with the latest version of SmallLuxGPU and my new i7 860/HD5870: http://vimeo.com/8799796

It includes a scene with 2,700,000 triangles:

orig-balls02.jpg


The new v1.1beta2 is available here: http://davibu.interfree.it/opencl/smallluxgpu/smallluxgpu-v1.1beta2.tgz
 
Thanks for the info, the video, and the new version of SmallLuxGPU. :)

Update: I can't find the above picture in the latest version.
 
Yes, thanks. Downloading now...

It's working... :)
luxballs.jpg


What does your .bat file look like? (Your picture is bigger.)

I am using this:
@echo off
SmallLuxGPU.exe 1 0 1 1 64 640 480 scenes\luxballs.scn
pause

I know I can't use 1920 x 1200, I have tried it on other .bat files, unsuccessfully.
 
Yes, thanks. Downloading now...

It's working... :)

658k samples/sec is quite a respectable result in "low latency" mode (more details below). I get about 1000k samples/sec in "high bandwidth" mode on my new PC (and under Linux ... for some reason Linux is quite a bit faster than Windows).

What does your .bat file look like? (Your picture is bigger.)

I am using this:
@echo off
SmallLuxGPU.exe 1 0 1 1 64 640 480 scenes\luxballs.scn
pause

I know I can't use 1920 x 1200, I have tried it on other .bat files, unsuccessfully.

Hehe, the command line arguments are starting to look a bit cryptic, so I added a brief description of their meanings here: http://www.luxrender.net/wiki/index.php?title=Luxrender_and_OpenCL#Comand_line_arguments

I rendered my picture at 1280x720.
 
Thanks again... ;)

So you are using:
@echo off
SmallLuxGPU.exe 1 0 1 1 64 1280 720 scenes\luxballs.scn
pause

Any other changes?

Do you know what the max res I could put in, that would work?
luxballs3.jpg


Note that going to the higher res, dropped both my CPU, and GPU utilization.
 
My findings under W7 x64.

CLCPU

SmallLuxGPU.exe 0 1 1 64 1280 720 scenes\luxball.scn
SmallLuxGPUbeta1.1b1 CLCPU ~2000K rays/s and 380K samples/s

SmallLuxGPU.exe 1 0 1 1 64 1280 720 scenes\luxball.scn
SmallLuxGPUbeta1.1b2 CLCPU ~1700K rays/s and 200K samples/s (the samples/s figure seems to be bugged: the longer I wait, the lower it gets, while rays/s stays about the same)

Conclusion - beta 2 slower :???:


Native threads
SmallLuxGPU.exe 3 0 1 64 1280 720 scenes\luxball.scn
SmallLuxGPUbeta1.1b1 3 threads ~2060K rays/s and 389K samples/s

SmallLuxGPU.exe 1 3 0 1 64 1280 720 scenes\luxball.scn
SmallLuxGPUbeta1.1b2 3 threads ~2530K rays/s and 171K samples/s (the samples/s figure seems to be bugged: the longer I wait, the lower it gets, while rays/s stays about the same)

Conclusion - beta 2 faster :)



High Bandwidth
SmallLuxGPU.exe 0 3 0 1 64 1280 720 scenes\luxball.scn
SmallLuxGPUbeta1.1b2 3 threads ~4480K rays/s and 540K samples/s

Quite a bit faster :smile:


Tested on:
HD5870 @900/1150
Phenom II 940 @3455


EDIT:

After adding second GPU - HD4670 @800/1000 [rv730]

High Bandwidth
SmallLuxGPU.exe 0 3 0 1 64 1280 720 scenes\luxball.scn
SmallLuxGPUbeta1.1b2 3 threads ~1700K rays/s

Quite a disaster :cry:

Seems like AMD OpenCL can't handle unbalanced GPU's.
 
Yes, thanks. Downloading now...

It's working... :)
luxballs.jpg


What does your .bat file look like? (Your picture is bigger.)

I am using this:
@echo off
SmallLuxGPU.exe 1 0 1 1 64 640 480 scenes\luxballs.scn
pause

I know I can't use 1920 x 1200, I have tried it on other .bat files, unsuccessfully.

It also works with a 5970 + 5870 + Core i7 and a 4870 X2 + 4870 + Core i7...

Tested both. :)

The first versions didn't work with multiple ATi cards for me but this one does.
 
So you are using:
@echo off
SmallLuxGPU.exe 1 0 1 1 64 1280 720 scenes\luxballs.scn
pause

I would change the first "1" to a "0" in order to use "high bandwidth mode" for benchmarking:

SmallLuxGPU.exe 0 0 1 1 64 1280 720 scenes\luxballs.scn
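
For anyone decoding these command lines, here is a minimal Python sketch that splits a v1.1beta2-style argument list into the fields this thread pins down. Only three things are taken from the posts: the first argument selects low latency (1) or high bandwidth (0) mode, and the last three arguments are width, height and scene file. The function name and field names are made up for illustration, and the four arguments in between are deliberately left unnamed; the wiki page linked earlier in the thread is the reference for those.

```python
def split_slg_args(args):
    """Split a v1.1beta2-style argument list into the fields known from the thread.

    Hypothetical helper: the four middle arguments are kept as opaque strings.
    """
    if len(args) != 8:
        raise ValueError("expected 8 arguments, got %d" % len(args))
    modes = {"1": "low latency", "0": "high bandwidth"}
    if args[0] not in modes:
        raise ValueError("first argument must be 0 or 1")
    return {
        "mode": modes[args[0]],
        "other": args[1:5],       # meaning not covered here, see the wiki
        "width": int(args[5]),
        "height": int(args[6]),
        "scene": args[7],
    }


cmd = r"SmallLuxGPU.exe 0 0 1 1 64 1280 720 scenes\luxballs.scn"
fields = split_slg_args(cmd.split()[1:])
print(fields["mode"], fields["width"], fields["height"], fields["scene"])
```

The older v1.1beta1 command lines in this thread have one argument fewer, so this sketch rejects them rather than guessing.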

Note that going to the higher res, dropped both my CPU, and GPU utilization.

That doesn't make any sense, but the NVIDIA OpenCL driver is a bit strange. Yesterday I installed the NVIDIA Linux SDK on my old PC (with a 240GT). BTW, they have an awesome GPU profiler ... did I say awesome? A W E S O M E ;)

I ran some tests with the NVIDIA OpenCL demos. They work more or less like my code does (I checked the sources) and they show the same unjustified high CPU load (50+% even when the CPU has nothing to do). In my opinion this is good evidence that their drivers still need some tweaking.

I did some profiling sessions with the NVIDIA OpenCL profiler and, as expected, simple.scn (i.e. fewer than 50 triangles) barely keeps the GPU busy half of the time. The profiler shows the time spent doing I/O, computation, etc.; the "white holes" are time spent doing nothing:

ocl_prof1.jpg


There are no such problems when the scene becomes more complex, as with luxball (262k triangles). The GPU is fully utilized:

ocl_prof2.jpg


And this picture is really interesting:

ocl_prof3.jpg


55% of the kernel execution time is spent reading memory (in chunks of 32 bits = 1 float). This suggests that using 128-bit accesses should improve performance considerably. It is also interesting to note that branching doesn't look like a problem.
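
To put a rough number on the 32-bit versus 128-bit remark, the Python sketch below counts read instructions only. The triangle layout (three vertices of three floats each, padded to four floats per vertex for the wide case) is an assumption for illustration, not necessarily the layout SmallLuxGPU uses, and fewer loads do not translate one-to-one into time saved.

```python
import math

BYTES_PER_FLOAT = 4
VERTICES_PER_TRIANGLE = 3


def loads_needed(num_floats, floats_per_load):
    """Number of read instructions needed to fetch num_floats values."""
    return math.ceil(num_floats / floats_per_load)


# Assumed layout A: tightly packed xyz, read one 32-bit float at a time.
packed_floats = VERTICES_PER_TRIANGLE * 3
scalar_loads = loads_needed(packed_floats, 1)

# Assumed layout B: each vertex padded to xyzw, read as one 128-bit value.
padded_floats = VERTICES_PER_TRIANGLE * 4
vector_loads = loads_needed(padded_floats, 4)

print("32-bit loads per triangle: ", scalar_loads)
print("128-bit loads per triangle:", vector_loads)
print("bytes per triangle:", packed_floats * BYTES_PER_FLOAT,
      "packed vs", padded_floats * BYTES_PER_FLOAT, "padded")
```

Under these assumptions the wide layout needs a third of the read instructions at the cost of a third more memory per triangle, which is the trade-off to measure with the profiler.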
 
My findings under W7 x64.

CLCPU

SmallLuxGPUbeta1.1b1 CLCPU ~2000K rays/s and 380K samples/s

SmallLuxGPUbeta1.1b2 CLCPU ~1700K rays/s and 200K samples/s (the samples/s figure seems to be bugged: the longer I wait, the lower it gets, while rays/s stays about the same)

Conclusion - beta 2 slower :???:

This is a bit expected because the code has been modified to keep the OpenCL device busy. In the case of OpenCL CPU device, you have the CPU trying to keep itself more busy :smile:

EDIT:

After adding second GPU - HD4670 @800/1000 [rv730]

High Bandwidth
SmallLuxGPUbeta1.1b2 3 threads ~1700K rays/s

Quite a disaster :cry:

Seems like AMD OpenCL can't handle unbalanced GPU's.

Lightman, what kind of motherboard do you have? I ask because on mine, if I install a second card, the PCIe bus drops from 16x to 8x on both card slots.

I assume the 4670 is very slow compared to the 5870, so the performance you lose going from 16x to 8x may not be compensated by the second card.

Just a hypothesis, but maybe this is the explanation :idea:
 
I would change the first "1" to a "0" in order to use "high bandwidth mode" for benchmarking:

SmallLuxGPU.exe 0 0 1 1 64 1280 720 scenes\luxballs.scn

High bandwidth mode is a no-go for me.
Program terminates:
highimem.jpg


Thanks for all the other interesting info you posted too...
Hopefully after the drivers get a much needed update, we will indeed get additional performance.

For trivia, we did get new drivers today...
Version 196.21
I don't expect a change in OpenCL, but if this fixes anything, I will report back.

Update: Never mind... My overclocking tool Precision won't work with these, so I won't mess with them yet.
I like Precision too much. Unwinder has been informed, and is currently working to resolve the issue.
Afterburner is also affected.
 
This is a bit expected because the code has been modified to keep the OpenCL device busy. In the case of OpenCL CPU device, you have the CPU trying to keep itself more busy :smile:

That makes sense :idea:

Lightman, what kind of motherboard do you have? I ask because on mine, if I install a second card, the PCIe bus drops from 16x to 8x on both card slots.

I assume the 4670 is very slow compared to the 5870, so the performance you lose going from 16x to 8x may not be compensated by the second card.

Just a hypothesis, but maybe this is the explanation :idea:

I have a GA-MA790FX, which uses the AMD 790FX chipset, so I'm running a 2x16 PCIe config on it (only 2 of the 4 PCIe x16 slots are populated). That means it's something else.
For the first second or two Cypress is doing ~65% of the rays and it produces 1700k rays/s, but then it plummets to the level of the RV730 and stays there.

This is the case in almost any OpenCL application using more than one GPU on my system, which is why I suspect AMD's implementation at this stage.

I was wondering whether you can create a separate kernel (thread? calls?) for each OpenCL device to feed them with data completely independently, or whether that's not possible.
The other idea I have is to try running 2 instances of your app, each assigned to a different GPU. Of course, to make that test we would need an option to select the OpenCL GPU device ...
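
The idea of feeding each device independently can be illustrated with a small deterministic simulation in Python. It is not OpenCL code and says nothing about how SmallLuxGPU or the AMD runtime actually schedule work; the device names and per-batch times are invented. It compares two policies: a static even split, where the job finishes only when the slower device does, and a shared queue, where each device takes the next batch as soon as it is free.

```python
import heapq


def static_split(batch_time, num_batches):
    """Give every device the same share; the job ends when the slowest ends."""
    share = -(-num_batches // len(batch_time))  # ceiling division
    return max(share * t for t in batch_time.values())


def shared_queue(batch_time, num_batches):
    """Each device pulls the next batch from a common queue when it is idle."""
    free_at = [(0, name) for name in batch_time]  # (time device is free, name)
    heapq.heapify(free_at)
    done = dict.fromkeys(batch_time, 0)
    finish = 0
    for _ in range(num_batches):
        t, name = heapq.heappop(free_at)   # earliest idle device
        t += batch_time[name]              # it works on one batch
        done[name] += 1
        finish = max(finish, t)
        heapq.heappush(free_at, (t, name))
    return finish, done


# Hypothetical devices: "fast" needs 1 time unit per batch, "slow" needs 4.
batch_time = {"fast": 1, "slow": 4}
even = static_split(batch_time, 100)
finish, done = shared_queue(batch_time, 100)
print("static even split finishes at", even)
print("shared queue finishes at", finish, "with batches per device", done)
```

With these made-up numbers the even split is paced by the slow device, which matches the symptom described above, while the queue lets the fast device take most of the batches.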

Thanks for your hard work Dave! ;)
 
Well there is a small issue.

In both my rigs I have a 9800GT (used with the NGO crack for PhysX)

As soon as I installed the nVIDIA drivers (with OpenCL support) it refused to use the AMD GPUs and now only sees the 9800GT. This is probably because the 9800GT was the first (primary) graphics device in my system when I installed Windows, and the AMD GPUs were moved up to primary duty afterward.

Is there a way to add an option to choose which GPU device to run this on? I'll need that in order to take a screenshot with the 5970+5870+Corei7
 
Well there is a small issue.

In both my rigs I have a 9800GT (used with the NGO crack for PhysX)

As soon as I installed the nVIDIA drivers (with OpenCL support) it refused to use the AMD GPUs and now only sees the 9800GT. This is probably because the 9800GT was the first (primary) graphics device in my system when I installed Windows, and the AMD GPUs were moved up to primary duty afterward.

Is there a way to add an option to choose which GPU device to run this on? I'll need that in order to take a screenshot with the 5970+5870+Corei7
If you remove the 9800GT are you then able to see the AMD GPU devices? If not, check your PATH and see if moving the ATI Stream\bin\* entries earlier helps. You may be picking up a different (older) OpenCL.dll from the nvidia driver install. AFAIK, our OpenCL.dll is the latest and should work with both vendors' products. If you are able to see the devices after removing the 9800GT, dump out the OpenCL platforms and see if one of them is AMD.
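
A quick way to check the PATH suggestion is to list every PATH directory that contains an OpenCL.dll, in the order they are searched. The Python sketch below does only that; the function name is made up, and it does not cover the executable's own directory or the Windows system directories, which Windows also searches.

```python
import os


def find_on_path(filename, path_value, sep=os.pathsep):
    """Return the PATH directories that contain filename, in search order."""
    hits = []
    for directory in path_value.split(sep):
        # Skip empty entries left by doubled or trailing separators.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    for directory in find_on_path("OpenCL.dll", os.environ.get("PATH", "")):
        print(directory)
```

If more than one directory is printed, the first one is the copy that a PATH lookup would pick up.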
 
If you remove the 9800GT are you then able to see the AMD GPU devices? If not, check your PATH and see if moving the ATI Stream\bin\* entries earlier helps. You may be picking up a different (older) OpenCL.dll from the nvidia driver install. AFAIK, our OpenCL.dll is the latest and should work with both vendors' products. If you are able to see the devices after removing the 9800GT, dump out the OpenCL platforms and see if one of them is AMD.

It would be hard for me to physically remove the 9800GT (seeing as my system is watercooled and it is part of the loop).

I did rename the opencl.dll in the syswow64 folder to opencl.bak and placed the ATi OpenCL dll in there, but it simply won't run at all when I do that.

I'm thinking that either both vendors are playing shady tricks with their OpenCL implementations thus far (locking out the competitor's products) or this particular app can only work with a single rendering device (which is why you need CrossFireX enabled in order to use multiple graphics cards on the AMD front).
 
Check your Windows tree to see if the install of the NVidia driver dumped its DLL in there.

Jawed

The nVIDIA drivers dump two OpenCL.dll files (one in system32 and the other in syswow64).

The ATi drivers don't seem to do that (maybe they do, but now I can't get their OpenCL files to install there if they do).

If I rename or remove those two files OpenCL obviously stops working.

EDIT: Upon removal and restart, the system now works with the ATi GPUs but can no longer do OpenCL on the nVIDIA GPU. It appears that either each vendor's own OpenCL.dll doesn't pick up the other vendor's products or this application can only use one rendering block (graphics-wise) at a time.

EDIT 2: I suppose I could test this by disabling CrossfireX. If only one ATi GPU is then found, I have my answer.
EDIT 3: OK, without CrossfireX it only uses the 5970 (not the 5870). So it can only use one rendering block/device at a time. Makes sense then. With the 9800GT as the primary rendering device, the 9800GT is the only card capable of running OpenCL with this ray tracing application. If I remove the nVIDIA OpenCL.dll files and reboot, it will use the ATi GPUs (5970+5870) provided they're in CrossfireX; if I disable CrossfireX it will use the first ATi rendering device it detects (5970). Therefore it appears to be a limitation of the application, not of the nVIDIA or ATi OpenCL implementations.
 
You can try copying ATI's opencl.dll to the same directory as the executable so you can use ATI's OpenCL implementation. The system should start searching for DLLs in the executable's directory first, so you shouldn't have to touch the DLLs in the system directories.
 