GPU Ray-tracing for OpenCL

Lightman · Jan 6, 2010

Today I finally decided to uninstall my Stream SDK beta and put the final version on W7 64-bit.
As I reported previously, on the SDK beta I saw no performance difference between different batch sizes.
Now, on the final release, it behaves as it should!
This suggests that the beta SDK was assuming a constant work size no matter what the program was asking for.

Here is a pic from my last run:

Jawed · Jan 8, 2010

Two things you might like to try:

Change the "__constant" declarations to "__global const". __constant has a special meaning for the way GPUs work - this is a feature of D3D10, but there are effective size limits. Additionally, the way fetches are performed from such a resource means that any time two work-items running in the same hardware thread want to fetch two different elements, you'll lose performance ("constant waterfalling").

Try the "-cl-mad-enable" or "-cl-fast-relaxed-math" compiler options. Or, even, "-cl-unsafe-math-optimizations".
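A minimal sketch of how one might try both suggestions from the host side before building the program. The kernel fragment and the helper names (`constant_to_global`, `build_options`) are made up for illustration and are not taken from SmallptGPU; the three option strings are the ones defined by the OpenCL specification, and the resulting string is what would be passed as the options argument of clBuildProgram.

```python
import re

# Hypothetical kernel fragment, for illustration only (not SmallptGPU's source).
KERNEL_SRC = """
__kernel void Radiance(__constant Sphere *spheres, __global float *out) {
    /* ... */
}
"""

# Math-related build options defined by the OpenCL specification.
MATH_OPTIONS = (
    "-cl-mad-enable",
    "-cl-unsafe-math-optimizations",
    "-cl-fast-relaxed-math",
)


def constant_to_global(src):
    """Rewrite every __constant address-space qualifier as __global const."""
    return re.sub(r"\b__constant\b", "__global const", src)


def build_options(*flags):
    """Join the chosen flags into one options string, rejecting unknown ones."""
    for flag in flags:
        if flag not in MATH_OPTIONS:
            raise ValueError("unknown option: " + flag)
    return " ".join(flags)


if __name__ == "__main__":
    print(constant_to_global(KERNEL_SRC))
    print(build_options("-cl-mad-enable", "-cl-fast-relaxed-math"))
```

Timing the same scene with and without each change is the only way to tell whether it helps on a given driver.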

Jawed

chavvdarrr · Jan 8, 2010

Jawed said:
Or, even, "-cl-unsafe-math-optimizations".

Jawed

Does that make sense?
After all, if one wants to have reliable results, such flags are lose-lose.

Jawed · Jan 8, 2010

No harm in finding out the limits... Also, with OpenCL stuff being new/immature, things like this might escape from (or crash into...) problems in the compiler(s).

Jawed

Dade · Jan 8, 2010

Jawed said:
Two things you might like to try:

Change the "__constant" declarations to "__global const". __constant has a special meaning for the way GPUs work - this is a feature of D3D10, but there are effective size limits. Additionally, the way fetches are performed from such a resource means that any time two work-items running in the same hardware thread want to fetch two different elements, you'll lose performance ("constant waterfalling").

Try the "-cl-mad-enable" or "-cl-fast-relaxed-math" compiler options. Or, even, "-cl-unsafe-math-optimizations".

Thanks Jawed, I will try.

Probably not many Mac users here, but we (at http://www.luxrender.net/forum/viewtopic.php?f=21&t=2947&start=240#p29397) have finally discovered why SmallptGPU wasn't working with Apple's OpenCL GPUs. The problem was a bug in Apple's OpenCL compiler related precisely to "__constant" memory buffers: http://www.khronos.org/message_boards/viewtopic.php?f=37&t=2148

I uploaded a new version (1.6) to the usual place. It includes the fix for MacOS and a new optional rendering kernel for old-school ray tracing ('80s style):

It is fun to use, mainly because it is very fast (indeed).

An update on the support for multiple OpenCL devices:

This is SmallptGPU2 running at the same time on my OpenCL GPU device (ATI HD4870) and my OpenCL CPU device (Q6600). You can read the workload distribution on the "Help" screen (i.e. 90.5% done by the GPU, 9.5% done by the CPU). It is also optionally visible on the screen as a green bar on the left (the zone rendered by the GPU) and a red bar (the zone rendered by the CPU). This is going to be very interesting with multi-GPU setups.
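For anyone wondering how a split like 90.5% / 9.5% can turn into screen zones, here is a minimal sketch. It assumes the image is divided into contiguous bands of rows in proportion to each device's measured throughput; that is only an assumption for illustration, not a description of what SmallptGPU2 actually does, and `split_rows` is a made-up name.

```python
def split_rows(height, rates):
    """Assign contiguous row ranges [start, end) to devices.

    height: number of image rows.
    rates:  one throughput figure per device (any unit, e.g. samples/sec).
    Each device gets a band proportional to its share of the total rate.
    """
    total = sum(rates)
    if total <= 0:
        raise ValueError("at least one device must have a positive rate")
    ranges = []
    start = 0
    accumulated = 0.0
    for index, rate in enumerate(rates):
        accumulated += rate
        if index == len(rates) - 1:
            end = height  # last device takes whatever rows remain
        else:
            end = round(height * accumulated / total)
        ranges.append((start, end))
        start = end
    return ranges


if __name__ == "__main__":
    # GPU doing 90.5% of the work, CPU doing 9.5%, on a 480-row image.
    print(split_rows(480, [90.5, 9.5]))
```

Re-measuring the rates every few frames and re-splitting would let the zones follow the load, for example when the CPU is also busy driving the GPU.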

Silent_Buddha · Jan 8, 2010

Nice, and it looks like it automatically uses all 4 cores. I'm assuming you didn't do anything special to tell it to use more than 1 core? So far I'm really liking what I'm seeing out of all of these early OpenCL projects.

Regards,
SB

Florin · Jan 9, 2010

Dade said:
Probably not many Mac users here, but we (at http://www.luxrender.net/forum/viewtopic.php?f=21&t=2947&start=240#p29397) have finally discovered why SmallptGPU wasn't working with Apple's OpenCL GPUs. The problem was a bug in Apple's OpenCL compiler related precisely to "__constant" memory buffers: http://www.khronos.org/message_boards/viewtopic.php?f=37&t=2148

I uploaded a new version (1.6) to the usual place. It includes the fix for MacOS and a new optional rendering kernel for old-school ray tracing ('80s style):

It is fun to use, mainly because it is very fast (indeed).

For whatever reason, the OSX OpenCL implementation for Nvidia GPUs does seem to be a fair bit faster than Nvidia's own OpenCL for Windows. As I posted earlier, a stock GTX280 scores around 2250K at most workgroup sizes in Windows 7. Which hasn't changed significantly with your new 1.6alpha.

Interestingly though, my Mac Mini's humble 9400M already does about 880K using the same 1.6alpha on Snow Leopard.

Which means:
OSX G9400M = 16 stream processors at 1100 MHz - OpenCL reports 2 compute units = 880K
Win7 GTX280 = 240 stream processors at 1296 MHz - OpenCL reports 30 compute units = 2250K

Looks to me like there is plenty of room for improvement in Nvidia's OpenCL library.
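To put a number on the gap, here is a rough sketch that normalises the two scores above by stream processor count and clock. It assumes throughput would scale linearly with both, which is a simplification; `score_per_sp_mhz` is just an illustrative name.

```python
def score_per_sp_mhz(score_k, stream_processors, clock_mhz):
    """Score (in K samples/sec) per stream processor per MHz."""
    return score_k / (stream_processors * clock_mhz)


# Figures taken from the two lines above.
mini_9400m = score_per_sp_mhz(880, 16, 1100)
win7_gtx280 = score_per_sp_mhz(2250, 240, 1296)

# How much more work each 9400M stream processor gets done per clock.
efficiency_ratio = mini_9400m / win7_gtx280

if __name__ == "__main__":
    print(mini_9400m, win7_gtx280, efficiency_ratio)
```

On these figures the 9400M under OSX gets roughly seven times more out of each stream processor per clock than the GTX280 does under Windows 7.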

I did notice that workgroup sizes have a larger impact on my mini. There isn't much of a difference between 64 and 192 on the GTX280, but the larger number doubles the score on the 9400M (and 192 is the maximum size that works, whereas the G200 goes up to 512).

Looking forward to your releases with multiple device support Dade. Good stuff

Talonman · Jan 9, 2010

V1.6

CPU = 26%, 1/2 of my 295 = 99%:

Lightman · Jan 9, 2010

Direct Lighting looks great! (I know! I like the old Imagine 3D look of RT)
Thanks Dade!

Here is my pic:

Also I can't wait for version 2.0!

MfA · Jan 9, 2010

Wow ... I'd never have thought that it could run that fast with clFinish after each pass.

Florin · Jan 9, 2010

The changes in the rendering kernel between 1.6alpha and 1.6 hurt the GTX280 on Windows:
alpha 64SIZE: 2250K
1.6 64SIZE: 1459K

Other sizes don't change much.

The direct lighting one runs rather well on Windows:
1.6 64SIZE_DL: 39965K

Curious though is that Nvidia's OpenCL library for Linux is about twice as fast on the same video card for the normal kernel, with hardly any difference between alpha and 1.6:
alpha 64SIZE: 4391K
1.6 64SIZE: 4398K

But then again the direct lighting one runs at only half the speed of Windows:
1.6 64SIZE_DL: 21235K

Confused, yes I am.

Jawed · Jan 9, 2010

Florin said:
Interestingly though, my Mac Mini's humble 9400M already does about 880K using the same 1.6alpha on Snow Leopard.

Wow, that's good. That should mean a GTX285 is about the same as HD5870, I guess.

This also puts a very different perspective on the subject of register allocation. I have to admit I was surprised to see from the earlier results in the thread that workgroup size makes very little difference on NVidia.

This, of course, might merely be reflecting the overall uselessness of the NVidia implementation currently - whereas the Apple implementation is showing a meaningful variation with workgroup size.

Florin said:
Looks to me like there is plenty of room for improvement in Nvidia's OpenCL library.

Yeah, this is pretty interesting - Apple seems to be compiling direct to PTX but NVidia's doing "something else" and it's working pretty poorly.

Jawed

Talonman · Jan 9, 2010

An outstanding post by dast:
http://forums.nvidia.com/index.php?showtopic=150015&st=0&gopid=978385&#entry978385

It helps to explain the performance difference that we are seeing...

Florin · Jan 9, 2010

Jawed said:
Wow, that's good. That should mean a GTX285 is about the same as HD5870, I guess.

Yeah, if you extrapolate the 9400M score, a G200 should theoretically be able to do around 13000K+ if properly tuned.
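The extrapolation can be sketched as follows, using the stream processor counts and clocks quoted earlier in the thread. It assumes perfectly linear scaling, which real hardware will not deliver, and `extrapolate_score` is only an illustrative helper.

```python
def extrapolate_score(score_k, sp_from, sp_to, mhz_from=None, mhz_to=None):
    """Scale a score by stream processor count and, optionally, by clock."""
    scaled = score_k * sp_to / sp_from
    if mhz_from is not None and mhz_to is not None:
        scaled = scaled * mhz_to / mhz_from
    return scaled


# 9400M: 880K with 16 stream processors at 1100 MHz.
# GTX280: 240 stream processors at 1296 MHz.
by_sp_count = extrapolate_score(880, 16, 240)
by_sp_and_clock = extrapolate_score(880, 16, 240, 1100, 1296)

if __name__ == "__main__":
    print(by_sp_count, by_sp_and_clock)
```

Scaling by stream processor count alone gives 13200K, which matches the 13000K+ figure; factoring in the higher shader clock would push the estimate to roughly 15500K.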

Jawed said:
Yeah, this is pretty interesting - Apple seems to be compiling direct to PTX but NVidia's doing "something else" and it's working pretty poorly.

Jawed

Around their developer site they're pretty consistent in calling OpenCL just another part of their CUDA initiative. Maybe the current implementation has some quick and dirty translation going on.

Anyway, Linux does a little better already on G200 (up from 2250K):

Dade · Jan 9, 2010

Florin said:
Confused, yes I am.

I'm quite confused too. But look, the results you are obtaining under Linux are quite good. They are about where they should be when compared, for instance, with my 4870. This should be proof that it is mostly a driver issue: it looks like buffer transfer between CPU and GPU has some weird behaviour from the performance point of view.

After all the tests we have done, I'm quite convinced that the NVIDIA OpenCL driver needs some more tuning before it shows consistent performance.

@Silent_Buddha: yup, the AMD/ATI OpenCL CPU device spawns as many threads as there are cores available. At the moment I have the opposite problem: I would like to have some direct control over the number of threads spawned in order not to overload the CPU (and slow down the threads dedicated to driving the GPUs).

@Lightman: can I ask what tool you use for measuring GPU load? I guess it runs under Windows but it looks like something quite useful ... your 19,000,00+ sample screenshot has sold me a brand new 5870, I placed my order yesterday.

@Talonman: you could try the version 2.0 with your 3xGPUs setup. I uploaded a preliminary version at http://davibu.interfree.it/opencl/smallptgpu2/smallptgpu-v2.0alpha1.tgz

This version should run on all GPUs (and CPUs on Apple/ATI) available. It is highly untested and I'm experiencing a problem under Windows: my PC resets after a few seconds, as if my power supply couldn't sustain the load.

Everything works fine under Linux ... quite strange. Well, it may crash but you can give it a try.

fellix · Jan 9, 2010

Just to note something weird with v1.6:

That strange pattern is visible for a fraction of a second, while the program loads up. Never saw this before with older versions... or maybe it's caused by the new Catalyst 10.1 beta driver?!
Anyway, it doesn't appear to be a bug -- everything runs flawlessly as before.

Dade · Jan 9, 2010

fellix said:
That strange pattern is visible for a fraction of a second, while the program loads up. Never saw this before with older versions... or maybe it's caused by the new Catalyst 10.1 beta driver?!
Anyway, it doesn't appear to be a bug -- everything runs flawlessly as before.

No, no, Fellix, it is "intended": it is a kind of "debug" feature I added while we were tracking down the Apple problem with Jens. The framebuffer is initialized with that kind of pattern in order to detect what the GPU returns (i.e. black overwriting black is hard to detect, while it is easy over that pattern). I didn't remove the pattern because it could help to track down some future problem too.
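A minimal sketch of the idea, assuming a simple checkerboard as the pattern (the thread doesn't say which pattern SmallptGPU really uses, and the function names here are made up). Any pixel that still matches the pattern after a pass was never written by the device.

```python
def debug_pattern(width, height, tile=8):
    """Build a framebuffer (rows of RGB tuples) filled with a checkerboard."""
    light = (1.0, 0.0, 1.0)     # assumed colours, chosen to be unlike
    dark = (0.25, 0.25, 0.25)   # anything a correct render would return
    return [
        [light if ((x // tile) + (y // tile)) % 2 == 0 else dark
         for x in range(width)]
        for y in range(height)
    ]


def count_unwritten(framebuffer, tile=8):
    """Count pixels that still hold their initial pattern value."""
    reference = debug_pattern(len(framebuffer[0]), len(framebuffer), tile)
    return sum(
        pixel == expected
        for row, reference_row in zip(framebuffer, reference)
        for pixel, expected in zip(row, reference_row)
    )


if __name__ == "__main__":
    fb = debug_pattern(16, 16)
    # Simulate a device that only returned (black) pixels for the top half.
    for y in range(8):
        fb[y] = [(0.0, 0.0, 0.0)] * 16
    print(count_unwritten(fb))
```

With a black initial framebuffer the same faulty pass would be indistinguishable from a correct all-black result.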

Florin · Jan 9, 2010

fellix said:
That strange pattern is visible for a fraction of a second, while the program loads up. Never saw this before with older versions... or maybe it's caused by the new Catalyst 10.1 beta driver?!
Anyway, it doesn't appear to be a bug -- everything runs flawlessly as before.

Yeah, I get the same pattern with Nvidia. And on the slow Mac, it's not just a fraction of a second either.

fellix · Jan 9, 2010

Dade said:
...19,000,00+ sample screenshot has sold me a brand new 5870, I placed my order yesterday

Attaboy!

Dade said:
@Talonman: you could try the version 2.0 with your 3xGPUs setup. I uploaded a preliminary version at http://davibu.interfree.it/opencl/smallptgpu2/smallptgpu-v2.0alpha1.tgz

This version should run on all GPUs (and CPUs on Apple/ATI) available. It is highly untested and I'm experiencing a problem under Windows: my PC resets after a few seconds, as if my power supply couldn't sustain the load. Everything works fine under Linux ... quite strange. Well, it may crash but you can give it a try.

Click - it does run OK for me, although the GPU load is a tad lower than with v1.6 in single device mode (93% vs. 98%).

Lightman · Jan 9, 2010

Dade, the software is MSI Afterburner. It is based on RivaTuner but has a very easy interface.

I will give version 2.0 a go in the morning.
