GPU Ray-tracing for OpenCL

Or a faster OpenCL implementation ;) (I got 1200 kSamples per second with my CUDA version on an 8800GTS).

So, good news: I figured out how to speed up the OpenCL version by a lot (30x on my GPU), and the change was trivially simple compared to the time I spent experimenting. In "geomfunc.h" and "rendering_kernel.cl", replace all occurrences of OCL_CONSTANT_BUFFER with __constant (all this time I thought the former was already defined as the latter). I knew constant memory wasn't being used properly. I'm curious whether ATI users get the same speedup.
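Concretely, the change amounts to redefining the macro in the shared headers. The "before" expansion below is my guess at why the constant path wasn't being used (the post only says it wasn't __constant):

```
/* Hypothetical "before": sphere data landing in ordinary global memory */
#define OCL_CONSTANT_BUFFER __global

/* After: route it through the dedicated constant address space */
#define OCL_CONSTANT_BUFFER __constant
```

Replacing every occurrence of the macro with __constant, as described above, is equivalent to redefining the macro this way.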

For my 8800GTS, there is also one other thing that has to be changed, or else the speedup is only 2x. On line 82 in "geomfunc.h", replace
Code:
    unsigned int i = sphereCount;
    for (; i--;) {
with
Code:
    for (unsigned int i = 0; i < sphereCount; i++) {
It's really strange that this would make such a difference, but it did for me. Even weirder, doing the same on line 102 had barely any effect, even though that loop is called almost as frequently. Probably a compiler bug. ATI users: do you see a difference?
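For what it's worth, the two loop forms are functionally identical; here is a minimal C stand-in (the real loop in "geomfunc.h" iterates over spheres) showing that both visit every element, so only the code the GPU compiler generates can differ:

```c
/* Toy stand-in for the sphere-intersection loop: the original reverse loop
 * and the rewritten forward loop touch exactly the same indices. */
static float sum_reverse(const float *d, unsigned int n) {
    float acc = 0.f;
    unsigned int i = n;
    for (; i--;)                /* original style: counts n-1 down to 0 */
        acc += d[i];
    return acc;
}

static float sum_forward(const float *d, unsigned int n) {
    float acc = 0.f;
    for (unsigned int i = 0; i < n; i++)  /* rewritten style */
        acc += d[i];
    return acc;
}
```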

I'll try to package it all later tonight along with other changes. I wish NVidia's 64-bit SDK didn't make it such a pain to create 32-bit binaries. I might have to uninstall it.

Changing OCL_CONSTANT_BUFFER to __constant has no effect on performance on an HD5870 with Cat. 8.12Beta.

Replacing the second bit of code gave me a minimal improvement, from 14,834 kS/s to 15,000 kS/s.

Tested on SmallPT 2.0 alpha 2 using only GPU acceleration, on an HD5870 at stock clocks.
 
Hi, I also played around with it on my 5870.

I was wondering whether the usual CPU alignment and pre-calculation rules would still apply. So I tried aligning the structures and padding them with pre-computed values, on the assumption that the cache banks will be filled to the padded sizes anyway, so we can use that space for free:

Vec becomes x, y, z, l (so 16 bytes; I didn't convert it to float4 though, it's still float[4])
Sphere becomes vectors first, then rad, rad2 = rad * rad (so hopefully 64 bytes)

I saw no efficient way to use l (the vector length), so that slot is just empty.
Interestingly, there was no negative performance impact, which could mean the structures were already fully aligned, or that the algorithm is not memory-throughput bound (I added ~15% padding; it's hard to say exactly how much because Vec is nested inside other structures, but it's less than Vec's own +33%). The Sphere buffer grew from 33k to 45k (on "complex.scn").
Analysing the situation with Stream Analyzer shows ALU occupation at 20% (on "complex.scn"), so I couldn't say where it stumbles over its own feet.
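A sketch of the layouts described above, in plain C. The Sphere field names (p, e, c) and the explicit pad are my assumptions about how the padding works out, not SmallptGPU's exact struct:

```c
/* Padded Vec: three components plus one spare slot, one float4-sized line. */
typedef struct {
    float x, y, z;
    float l;            /* padding slot: vector length, currently unused */
} Vec;                  /* 16 bytes */

/* Padded Sphere: vectors first, then scalars, padded up to 64 bytes. */
typedef struct {
    Vec p, e, c;        /* position, emission, colour (assumed names) */
    float rad;
    float rad2;         /* pre-computed rad * rad, rides along "for free" */
    float pad[2];       /* explicit padding to reach 64 bytes */
} Sphere;
```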

before change / after change
ALU 117619.04 117622.73
Fetch 25327.94 25329.79
Write 6.00 6.00
Wavefront 12288.00 12288.00
ALUBusy 20.28 18.16
ALUFetchRatio 4.64 4.64
ALUPacking 31.61 31.61
FetchUnitBusy 12.86 11.53
FetchUnitStalled 0.02 0.03
WriteUnitStalled 0.00 0.00
ALUStalledByLDS 0.00 0.00
LDSBankConflict 0.00 0.00

Worksize was 256, here is 64:

before change / after change
ALU 117619.04 117622.73
Fetch 25327.94 25329.79
Write 6.00 6.00
Wavefront 12288.00 12288.00
ALUBusy 22.69 22.17
ALUFetchRatio 4.64 4.64
ALUPacking 31.61 31.61
FetchUnitBusy 14.53 14.21
FetchUnitStalled 0.02 0.03
WriteUnitStalled 0.00 0.00
ALUStalledByLDS 0.00 0.00
LDSBankConflict 0.00 0.00

Fetch is busier less afterwards, even though it has more data to fetch ...
ALU has less to do, probably only because of rad2 = rad * rad.

When applied to "complex.scn", speed grows from 542k to 552k (2%); the measurement is quite stable because of the number of spheres, so this really can be attributed to the changes and not to some background task.
When applied to "cornell.scn", the performance is identical, except that the measurement no longer fluctuates wildly at the beginning; it is calmer and asymptotically approaches the same samples/sec as without alignment/padding.

Well, anyway, I know nothing about GPU specifics; I just had fun flipping the switches in various ways and finally getting into OpenCL. :)

Ciao
Niels
 
Probably the texture cache (hence it being the same speed as when I use an image texture). When NVidia compiles the shader, it has no idea how big the constant buffers are going to be, which is what Micah was talking about in the quote above. Thus they may be too big for the dedicated constant cache (CUDA reports 64 KB of constant memory on my machine). With ATI, the R700 ISA document says it can work with 16 constant buffers of 4K float4s each (i.e. 16x64 KB). In both cases the addressable space for constants is limited.
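A back-of-the-envelope capacity check, pairing the 64 KB CUDA constant-memory figure above with the 64-byte padded Sphere from the earlier alignment experiment (my pairing of the two numbers, not something measured here):

```c
/* 64 KB of constant memory / 64-byte padded spheres -> capacity in spheres */
enum {
    CONST_MEM_BYTES = 64 * 1024,  /* constant memory as reported by CUDA */
    SPHERE_BYTES    = 64,         /* padded Sphere from the experiment above */
    MAX_SPHERES     = CONST_MEM_BYTES / SPHERE_BYTES
};
```

So even the grown 45k sphere buffer from the padded layout would still fit under the 64 KB limit, though without much headroom.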

I'll take your word for it. Never heard of non-texture data getting pushed to the texture cache before.

I also forgot to reply to your earlier post:
Actually, the profilers and debuggers did nothing for me. NVidia's OpenCL profiler wouldn't even show a window before crashing, and their OpenCL compiler was a real pain, as it kept crashing on certain constructs in the source code (it seemed really random; one time I passed a variable to a function via a pointer instead of by value and the crash stopped).

Oh wasn't implying that you didn't find it on your own. Just saying that profilers and debuggers (that work properly) are key to finding this sort of unexpected behavior.

Regarding performance, the 0.69 Gr/s is equivalent to 69,000 kS/s, so OpenCL has some catching up to do to reach CUDA performance, but at least it's within a factor of two now. I was really hoping the newer cards would do over 1 Gr/s, hence the choice of units :LOL:

Are there 100 samples per ray? Not following the conversion :oops:
 
Oh wasn't implying that you didn't find it on your own. Just saying that profilers and debuggers (that work properly) are key to finding this sort of unexpected behavior.
I'm not trying to toot my own horn or anything. Just saying that the profilers, debuggers, and even the compiler sucked, and I basically had to go by trial and error. AMD seems to be getting some flak for not having OpenCL support in the standard Catalyst, but IMO NVidia's is definitely not ready for the public either.

Are there 100 samples per ray? Not following the conversion :oops:
Uh oh, math fail :p

69,000 kS/s * 10 r/S = 0.69 Gr/s

Actually, I made a slight error there, because I've been using a depth of 5 and the original SmallptGPU uses a depth of 6. Thus it should be 12 when comparing apples to apples.
 
Uh oh, math fail :p

69,000 kS/s * 10 r/S = 0.69 Gr/s

Actually, I made a slight error there, because I've been using a depth of 5 and the original SmallptGPU uses a depth of 6. Thus it should be 12 when comparing apples to apples.

Lol, yeah I realized but didn't feel like editing because the numbers still wouldn't have made sense without your clarification :)

So it's 10 rays per sample then? Thanks.
 
The feature to select the OpenCL platform and individual OpenCL devices was requested some time ago, and it is now available in http://davibu.interfree.it/opencl/smallluxgpu/smallluxgpu-v1.2beta3.tgz via a configuration file (check the render.cfg file for an example).

Now if someone is brave enough to install NVIDIA/ATI cards and drivers on the same PC :?:

There are also a few new features:
- added support for vertex colour interpolation;
- added support for a configuration file;
- added support for OpenCL platform and device selection via the configuration file;
- new surface integrator architecture, able to generate 2 rays per step.

The new surface integrator architecture decreases the CPU load required to keep the GPU busy, which means more spare CPU cycles to render more samples. It is faster:

[attached image]


Seeing 3.7M samples/sec on a scene with 150k triangles and 4 light sources is quite impressive. I wonder when this GPGPU thing will stop surprising me.
 

I have them both installed on my system. What would you like to know? (provided this works)

PS. I extracted the files yet don't see a render.cfg file in there.
 
I have them both installed on my system. What would you like to know? (provided this works)

PS. I extracted the files yet don't see a render.cfg file in there.

The render.cfg looks like this:

Code:
image.width = 1280
image.height = 720
# Use a value > 0 to enable batch mode
batch.halttime = 0
scene.file = scenes/simple.scn
scene.fieldofview = 60
opencl.latency.mode = 1
opencl.nativethread.count = 3
opencl.cpu.use = 0
opencl.gpu.use = 1
# Select the OpenCL platform to use (0=first platform available, 1=second, etc.)
opencl.platform.index = 0
# One character per device: '1' enables the corresponding OpenCL device,
# '0' disables it.
#opencl.devices.select = 10
# Use a value of 0 to use the default
opencl.gpu.workgroup.size = 64
screen.refresh.interval = 50
path.maxdepth = 6

You can use "opencl.platform.index" to select the platform, and the "opencl.devices.select" string to enable/disable individual devices.
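As a sketch of those semantics (my reading of the comments above, not SmallLuxGPU's actual parser): one character per device, '1' enables, '0' disables, and devices beyond the end of the string keep the default:

```c
#include <string.h>

/* Interpret an opencl.devices.select string: '1' enables the device at that
 * index, '0' disables it, and a missing character falls back to a default.
 * This is a guess at the semantics, not SmallLuxGPU's actual code. */
static int device_enabled(const char *select, size_t deviceIndex, int dflt) {
    if (select == NULL || deviceIndex >= strlen(select))
        return dflt;                 /* no entry: keep the default */
    return select[deviceIndex] == '1';
}
```

With "10", for example, the first device would be enabled and the second disabled, while any further devices keep their default.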

When you run SmallLuxGPU normally, you should see 2 platforms listed: ATI and NVIDIA (if the 2 OpenCL drivers really work and can coexist). In that case you can select which platform to use.
 
By normally running SmallLuxGPU, you should have 2 platforms listed: ATI and NVIDIA (if the 2 OpenCL drivers really work and can coexist). In that case you can select which platform to use.

In my own tests, the OpenCL ICD in NVIDIA's latest driver (196.21 or 196.34) still can't work with ATI Stream SDK 2.0. Apparently they use different function names for getting the platform ID (it's called clIcdGetPlatformIDsKHR in nvcuda.dll but clIcdDispatchGetPlatformIDsKHR in atiocl.dll and atiocl64.dll).
 
I have posted today the first rendering done with LuxrenderGPU on Lux forums: http://www.luxrender.net/forum/viewtopic.php?f=13&t=3439

[attached image]


This is quite an important milestone, because the OpenCL code is starting to move from the "toy" field (i.e. SmallptGPU, SmallLuxGPU) to the "production" field (i.e. Luxrender), even if it is still very much experimental.

LuxrenderGPU supports some of the nice features of Luxrender Classic out of the box, including network rendering:

[attached image]
 
Kewl!

But the GPU render output on the right is somewhat different -- the blue translucent enclosure is darker!?
 
Can we download that version?

I can't find the download link.

Talonman, it is still too experimental for the "public". Most of the configuration (buffer sizes, number of threads, etc.) is hard-coded in the sources, so you would have to modify the code and recompile to run it on your hardware.
 
Anyone with a GTX480 willing to join the party?
I'm curious about its performance in this great OpenCL raytracer :eek:
 