GPU Ray-tracing for OpenCL

Today I finally decided to uninstall my Stream SDK beta and install the final version on W7 64-bit.
As I reported previously, on the SDK beta I saw no performance difference between batch sizes.
Now, on the final release, it behaves as it should!
This suggests that the beta SDK was assuming a constant work size no matter what the program asked for.

Here is a pic from my last run:
smallptgpu10561248v1664.png
 
Two things you might like to try:

Change the "__constant" declarations to "__global const". __constant has a special meaning for the way GPUs work - this is a feature of D3D10, but there are effective size limits. Additionally, the way fetches are performed from such a resource means that any time two work-items running in the same hardware thread want to fetch two different elements, you'll lose performance ("constant waterfalling").

Try the "-cl-mad-enable" or "-cl-fast-relaxed-math" compiler options. Or even "-cl-unsafe-math-optimizations".

Jawed
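
A rough sketch of both suggestions, using Python only to manipulate the kernel source text. The kernel signature and the helper names are hypothetical and not taken from SmallptGPU. The first helper swaps the `__constant` qualifier for `__global const` (meant for kernel pointer arguments only), and the second assembles the string a host program would pass as the `options` argument of clBuildProgram.

```python
import re

# Hypothetical kernel signature, for illustration only.
KERNEL_SRC = """
__kernel void render_kernel(__global float *colors,
                            __constant Sphere *spheres,
                            const int sphereCount) { /* ... */ }
"""

def constant_to_global(src: str) -> str:
    """Replace the __constant address-space qualifier with __global const."""
    return re.sub(r"\b__constant\b", "__global const", src)

def build_options(mad=False, fast_relaxed=False, unsafe=False) -> str:
    """Assemble the option string for the `options` argument of clBuildProgram."""
    flags = []
    if mad:
        flags.append("-cl-mad-enable")
    if fast_relaxed:
        flags.append("-cl-fast-relaxed-math")
    if unsafe:
        flags.append("-cl-unsafe-math-optimizations")
    return " ".join(flags)

patched = constant_to_global(KERNEL_SRC)
options = build_options(mad=True, fast_relaxed=True)
print(patched)
print(options)
```

Whether either change helps will depend on the device and driver, so it is worth timing each variant separately.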
 
No harm in finding out the limits... Also, with OpenCL stuff being new/immature, things like this might escape from (or crash into...) problems in the compiler(s).

Jawed
 
Thanks Jawed, I will try.

Probably not many Mac users here, but we (at http://www.luxrender.net/forum/viewtopic.php?f=21&t=2947&start=240#p29397) have finally discovered why SmallptGPU wasn't working with Apple's OpenCL GPUs. The problem was a bug in Apple's OpenCL compiler related precisely to "__constant" memory buffers: http://www.khronos.org/message_boards/viewtopic.php?f=37&t=2148

I uploaded a new version (1.6) to the usual place; it includes the fix for MacOS and a new optional rendering kernel for old-school ray tracing ('80s style ;)):

gallery05.jpg


It is fun to use, mainly because it is very fast (indeed).


An update on the support for multiple OpenCL devices:

file.php



This is SmallptGPU2 running at the same time on my OpenCL GPU device (ATI HD4870) and OpenCL CPU device (Q6600). You can read the workload distribution on the "Help" screen (i.e. 90.5% done by the GPU, 9.5% done by the CPU). It is also optionally visible on the screen as a green bar on the left (the zone rendered by the GPU) and a red bar (the zone rendered by the CPU). This is going to be very interesting with multi-GPU setups.
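
The post doesn't say how SmallptGPU2 decides the split, so the following is only a sketch of one plausible scheme: give each device a band of image rows proportional to its measured throughput. The function name, the 480-row image and the throughput figures are assumptions, picked so the result lands near the 90.5% / 9.5% split quoted above.

```python
def split_rows(height, throughput):
    """Split `height` image rows among devices in proportion to throughput.

    `throughput` maps a device name to its measured samples/sec.
    Returns {name: (first_row, end_row_exclusive)}; the ranges tile the image.
    """
    total = sum(throughput.values())
    names = list(throughput)
    ranges, start, acc = {}, 0, 0.0
    for i, name in enumerate(names):
        acc += throughput[name]
        # The last device takes whatever is left so rounding never drops a row.
        end = height if i == len(names) - 1 else round(height * acc / total)
        ranges[name] = (start, end)
        start = end
    return ranges

# Hypothetical throughputs chosen to give roughly a 90.5% / 9.5% split.
measured = {"GPU (HD4870)": 9050.0, "CPU (Q6600)": 950.0}
rows = split_rows(480, measured)
for name, (lo, hi) in rows.items():
    print(f"{name}: rows {lo}..{hi - 1} ({100.0 * (hi - lo) / 480:.1f}%)")
```

Re-measuring throughput every few frames and re-splitting would let the bands track the devices as their load changes.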
 
Nice, and it looks like it automatically uses all 4 cores. I'm assuming you didn't do anything special to tell it to use more than 1 core? So far I'm really liking what I'm seeing out of all of these early OpenCL projects. :)

Regards,
SB
 
For whatever reason, the OSX OpenCL implementation for Nvidia GPUs does seem to be a fair bit faster than Nvidia's own OpenCL for Windows. As I posted earlier, a stock GTX280 scores around 2250K at most workgroup sizes in Windows 7. Which hasn't changed significantly with your new 1.6alpha.

Interestingly though, my Mac Mini's humble 9400M already does about 880K using the same 1.6alpha on Snow Leopard.

Which means:
OSX G9400M = 16 stream processors at 1100 MHz - OpenCL reports 2 compute units = 880K
Win7 GTX280 = 240 stream processors at 1296 MHz - OpenCL reports 30 compute units = 2250K

Looks to me like there is plenty of room for improvement in Nvidia's OpenCL library.

smallptgpumini1.jpg


I did notice that workgroup sizes have a larger impact on my mini. There isn't much of a difference between 64 and 192 on the GTX280, but the larger number doubles the score on the 9400M (and 192 is the maximum size that works, whereas the G200 goes up to 512).
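
For anyone wanting to sweep workgroup sizes the same way, a small sketch of generating the sizes to try, capped at the device limit (192 on the 9400M and 512 on the G200, as reported above). The step of 64 is an assumption; on a real device the limit would come from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE.

```python
def candidate_sizes(max_work_group_size, step=64):
    """Work-group sizes to benchmark: multiples of `step` up to the device limit."""
    return list(range(step, max_work_group_size + 1, step))

# Limits as reported in the post: 192 on the 9400M, 512 on the G200.
print("9400M:", candidate_sizes(192))
print("G200: ", candidate_sizes(512))
```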

Looking forward to your releases with multiple device support Dade. Good stuff :)
 
Direct Lighting looks great! (I know! I like old Imagine 3D looks of RT)
Thanks Dade:!:

Here is my pic:

smallptgpu10001248dl.png



Also I can't wait for version 2.0!
 
The changes in the rendering kernel between 1.6alpha and 1.6 hurt the GTX280 on Windows:
alpha 64SIZE: 2250K
1.6 64SIZE: 1459k

Other sizes don't change much.

The direct lighting one runs rather well on Windows:
1.6 64SIZE_DL: 39965k

Curious though is that Nvidia's OpenCL library for Linux is about twice as fast on the same video card for the normal kernel, with hardly any difference between alpha and 1.6:
alpha 64SIZE: 4391K
1.6 64SIZE: 4398K

But then again the direct lighting one runs at only half the speed of Windows:
1.6 64SIZE_DL: 21235K
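
To put the numbers above side by side, a quick calculation of the ratios. The scores are copied from the post as given; nothing else is assumed beyond treating "K" and "k" as the same unit.

```python
# Scores (K samples/sec) as quoted in the post: GTX280, workgroup size 64.
scores = {
    ("Windows", "alpha"): 2250,
    ("Windows", "1.6"): 1459,
    ("Windows", "1.6 DL"): 39965,
    ("Linux", "alpha"): 4391,
    ("Linux", "1.6"): 4398,
    ("Linux", "1.6 DL"): 21235,
}

def ratio(a, b):
    """Score of configuration `a` relative to configuration `b`."""
    return scores[a] / scores[b]

comparisons = {
    "Windows 1.6 vs Windows alpha": ratio(("Windows", "1.6"), ("Windows", "alpha")),
    "Linux alpha vs Windows alpha": ratio(("Linux", "alpha"), ("Windows", "alpha")),
    "Linux 1.6 vs Windows 1.6": ratio(("Linux", "1.6"), ("Windows", "1.6")),
    "Linux DL vs Windows DL": ratio(("Linux", "1.6 DL"), ("Windows", "1.6 DL")),
}
for label, value in comparisons.items():
    print(f"{label}: {value:.2f}x")
```

So the normal kernel is roughly 2x to 3x faster on Linux, while the direct lighting kernel runs at roughly half the Windows speed.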

Confused, yes I am.
 
Interestingly though, my Mac Mini's humble 9400M already does about 880K using the same 1.6alpha on Snow Leopard.
Wow, that's good. That should mean a GTX285 is about the same as HD5870, I guess.

This also puts a very different perspective on the subject of register allocation. I have to admit I was surprised to see from the earlier results in the thread that workgroup size makes very little difference on NVidia.

This, of course, might merely be reflecting the overall uselessness of the NVidia implementation currently - whereas the Apple implementation is showing a meaningful variation with workgroup size.

Looks to me like there is plenty of room for improvement in Nvidia's OpenCL library.
Yeah, this is pretty interesting - Apple seems to be compiling direct to PTX but NVidia's doing "something else" and it's working pretty poorly.

Jawed
 
Wow, that's good. That should mean a GTX285 is about the same as HD5870, I guess.

Yeah if you extrapolate the 9400M score a G200 should theoretically be able to do around 13000K+ if properly tuned.
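
The extrapolation is just linear scaling, sketched below with the figures quoted earlier in the thread (16 SPs at 1100 MHz scoring 880K, versus 240 SPs at 1296 MHz). Perfect linear scaling is of course an assumption; memory bandwidth and occupancy are ignored.

```python
def extrapolate(score_k, sp_from, sp_to, mhz_from=None, mhz_to=None):
    """Scale a score linearly by stream-processor count and, optionally, by clock."""
    scaled = score_k * sp_to / sp_from
    if mhz_from and mhz_to:
        scaled *= mhz_to / mhz_from
    return scaled

by_sp = extrapolate(880, 16, 240)                    # SP count only
by_sp_clock = extrapolate(880, 16, 240, 1100, 1296)  # SP count and clock
print(f"Scaled by SP count:           {by_sp:.0f}K")
print(f"Scaled by SP count and clock: {by_sp_clock:.0f}K")
```

Scaling by SP count alone gives 13200K, which is where the "13000K+" figure comes from; factoring in the higher shader clock pushes the estimate higher still.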

Yeah, this is pretty interesting - Apple seems to be compiling direct to PTX but NVidia's doing "something else" and it's working pretty poorly.

Jawed

Around their developer site they're pretty consistent in calling OpenCL just another part of their CUDA initiative. Maybe the current implementation has some quick and dirty translation going on.

Anyway, Linux does a little better already on G200 (up from 2250K):
smallptgpulinux1.jpg
 
Confused, yes I am.

I'm quite confused too. But look, the results you are obtaining under Linux are quite good. They are about where they should be when compared, for instance, with my 4870. This should be proof that it is mostly a driver issue: it looks like buffer transfer between CPU and GPU has some weird behaviour from the performance point of view.

After all the tests we have done, I'm quite convinced that the NVIDIA OpenCL driver needs some more tuning before it shows consistent performance.

@Silent_Buddha: yup, the AMD/ATI OpenCL CPU device spawns as many threads as there are cores available. At the moment I have the opposite problem: I would like to have some direct control over the number of threads spawned in order not to overload the CPU (and slow down the threads dedicated to driving the GPUs).

@Lightman: may I ask what tool you use for measuring GPU load? I guess it runs under Windows but it looks like something quite useful ... your 19,000,00+ sample screenshot has sold me a brand new 5870, I placed my order yesterday ;)

@Talonman: you could try the version 2.0 with your 3xGPUs setup. I uploaded a preliminary version at http://davibu.interfree.it/opencl/smallptgpu2/smallptgpu-v2.0alpha1.tgz

This version should run on all GPUs (and CPUs on Apple/ATI) available. It is highly untested and I'm experiencing a problem under Windows: my PC resets after a few seconds, as if my power supply couldn't sustain the load :oops: Everything works fine under Linux ... quite strange. Well, it may crash but you can give it a try.
 
Just to note something weird with v1.6:

68466846.jpg


That strange pattern is visible for a fraction of a second while the program loads up. I never saw this before with older versions... or maybe it's caused by the new Catalyst 10.1 beta driver?!
Anyway, it doesn't appear to be a bug -- everything runs flawlessly as before. ;)
 
No, no, Fellix, it is "intended": it is a kind of "debug" feature I added while we were tracking down the Apple problem with Jens. The framebuffer is initialized with that kind of pattern in order to detect what the GPU returns (i.e. black overwriting black is hard to detect, while it is easy over that pattern). I didn't remove the pattern because it could help to track down some future problem too.
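
A toy sketch of the idea, with a plain Python list standing in for the framebuffer. The checkerboard and the grey levels are assumptions (the actual pattern isn't described here); the point is only that any pixel still holding the pattern was never written by the device, even if the device's output is pure black.

```python
WIDTH, HEIGHT = 8, 4

def debug_pattern(x, y):
    """Checkerboard of two non-black grey levels (the real pattern is an assumption)."""
    return 0.25 if (x + y) % 2 == 0 else 0.75

def make_framebuffer():
    """Framebuffer pre-filled with the debug pattern instead of zeros."""
    return [[debug_pattern(x, y) for x in range(WIDTH)] for y in range(HEIGHT)]

def untouched_pixels(fb):
    """Pixels still holding the debug pattern, i.e. never written by the device."""
    return [(x, y) for y in range(HEIGHT) for x in range(WIDTH)
            if fb[y][x] == debug_pattern(x, y)]

fb = make_framebuffer()
# Simulate a device that only returned the top two rows, all black.
for y in range(2):
    for x in range(WIDTH):
        fb[y][x] = 0.0

missing = untouched_pixels(fb)
print(len(missing), "pixels were never written")
```

With a zero-initialized buffer the same failure would be invisible, since black written over black looks identical to nothing being written at all.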
 
Yeah I get the same pattern with Nvidia. And on the slow Mac, it's not just a fraction of a second either ;)
 
...19,000,00+ sample screenshot has sold me a brand new 5870, I placed my order yesterday ;)
Ata boy! :LOL:
@Talonman: you could try the version 2.0 with your 3xGPUs setup. I uploaded a preliminary version at http://davibu.interfree.it/opencl/smallptgpu2/smallptgpu-v2.0alpha1.tgz

This version should run on all GPUs (and CPUs on Apple/ATI) available. It is highly untested and I'm experiencing a problem under Windows: my PC resets after a few seconds, as if my power supply couldn't sustain the load :oops: Everything works fine under Linux ... quite strange. Well, it may crash but you can give it a try.
Click - it does run OK for me, although the GPU load is a tad lower than with v1.6 in single-device mode (93% vs. 98%).
 
Dade, the software is MSI Afterburner. It is based on RivaTuner but has a very easy interface.

I will give version 2.0 a go in the morning. :smile:
 