GPU Ray-tracing for OpenCL

Rasterizing primary rays and raytracing secondary rays can give really good performance.
We're doing path tracing in this thread, so only 1/5th of our rays are primary ;)

Long term, though, I definitely do see benefits of mixing them. We won't be pathtracing at 2560x1600 anytime soon, so the key is getting a limited number of secondary rays shot into a simplified scene to augment rasterization.
 
Very nice work Mintmaster. There's a more complex scene with 783 spheres in it to give it more of a workout. I mentioned it before, but for some reason no-one seemed to want to play:

http://forum.beyond3d.com/showthread.php?p=1378149#post1378149

Jawed


Sorry I missed your invitation!

Here it is, but at 1024x768 :smile:

[image: smallptgpucomplex.jpg]


smallptGPU.exe 0 1 64 1024 768 scenes\complex.scn

A lot slower .... need more GPUs :devilish:
 
Bring on Multi-Level raytracing (Reshetov's work at Intel, also the basis for id's SVO raycasting, although it's not limited to primary rays).
 
There's a follow-up paper too on his rather sparse bio page. Personally I feel he was on to something with the MLRTA paper, though, specifically the bit in section 6 about adapting Arvo's ray classification inside MLRTA. Then for the second paper he seems to have said "screw it, I don't know how to make this work ... let's implement something simpler to parallelize; I only have to compete with other raytracing algorithms after all, and they all suck" :)

MLRTA was the algorithm they used on Larrabee BTW (according to the Larrabee paper). That's why I mentioned it.
 
Personally I feel he was on to something with the MLRTA paper, though, specifically the bit in section 6 about adapting Arvo's ray classification inside MLRTA ...
It seems very difficult to me. A 5D space is huge, and you're going to run into performance bottlenecks with memory footprint and bandwidth when trying to sort general secondary rays. If you're just doing area lights and refraction I could see MLRT improving performance, but there are plenty of realtime techniques to get soft shadows and believable reflections/refractions.

MLRT will help simple raytracing become realtime, but it will never catch rasterization in speed for similar quality. IMO secondary diffuse lighting is the only place where raytracing has a chance to outdo rasterization for realtime graphics, and once you aim for a renderer with that quality, speeding up primary rays or coherent shadow rays does very little for you. People are doing things like spherical harmonics and light propagation volumes to help rasterization deal with this deficiency, but figuring out visibility along arbitrary paths is a fundamental weakness of rasterization while being a strength of RT.
 
A lot slower .... need more GPUs :devilish:
Or a faster OpenCL implementation ;) (I got 1200 kSamples per second with my CUDA version on an 8800GTS).

So good news: I figured out how to get the OpenCL version to speed up by a lot (30x on my GPU), and the change was so simple compared to the time I spent experimenting. In "geomfunc.h" and "rendering_kernel.cl", replace all occurrences of OCL_CONSTANT_BUFFER with __constant (I thought the former was already defined as the latter all this time). I knew constant memory wasn't being used properly. I'm curious if ATI users get the same speedup.
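
To make it concrete, the edit looks roughly like this at the function-signature level. This is only a sketch: the function and parameter names are illustrative, not the exact ones in "geomfunc.h".
Code:
	/* Before: the macro expanded to __global, so every sphere fetch went to
	   uncached global memory on this generation of hardware. */
	float SceneIntersect(__global Sphere *spheres,
		const unsigned int sphereCount, const Ray *ray);

	/* After: with __constant the sphere array sits in the small, cached
	   constant address space, which is very fast when all work-items read
	   the same sphere. */
	float SceneIntersect(__constant Sphere *spheres,
		const unsigned int sphereCount, const Ray *ray);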

For my 8800GTS, there is also one other thing that has to be changed, or else the speedup is only 2x. On line 82 in "geomfunc.h", replace
Code:
	unsigned int i = sphereCount;
	for (; i--;) {
with
Code:
	for (unsigned int i = 0; i < sphereCount; i++) {
It's really strange that this would make such a difference, but it did for me. Even weirder is that doing the same on line 102 had barely any effect, and it's called almost as frequently. Probably a compiler bug. ATI users: do you see a difference?

I'll try to package it all later tonight along with other changes. I wish NVidia's 64-bit SDK didn't make it such a pain to create 32-bit binaries. I might have to uninstall it.
 
So good news: I figured out how to get the OpenCL version to speed up by a lot (30x on my GPU) ... replace all occurrences of OCL_CONSTANT_BUFFER with __constant ... For my 8800GTS, there is also one other thing that has to be changed, or else the speedup is only 2x ...
Awesome :cool: Making those changes I went from 290 ksamples/sec to 936 - 984 ksamples/sec on an 8800M GTS.
 
If you know the compiler is going to be naive, putting that kind of idiosyncratic loop construction in your code isn't a great idea; it's probably using a conditional dependent branch rather than a simple loop instruction.

It's a rather expensive way to save a very small number of characters (hell, it comes at the cost of an extra line).
 
Awesome :cool: Making those changes I went from 290 ksamples/sec to 936 - 984 ksamples/sec on an 8800M GTS.
Oops, went a little too fast there in my copy-and-replace and left in the #define __constant :oops: Taking that out I get 7158 - 7520 ksamples/sec :LOL:
 
Oops, went a little too fast there in my copy-and-replace and left in the #define __constant :oops: Taking that out I get 7158 - 7520 ksamples/sec :LOL:
Well that's expected because I have the same GPU architecture as you. My 8800GTS 640MB is at 9300 kSamples/s now. I'm wondering if the ATI GPUs get any benefit.
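
Just guessing at what that leftover define did: if the blanket copy-and-replace also hit the header's own "#define OCL_CONSTANT_BUFFER ..." lines, something like this could survive. The exact leftover line below is purely hypothetical.
Code:
	/* Hypothetical leftover produced by the find-and-replace: */
	#define __constant __global
	/* The preprocessor then silently rewrites every __constant qualifier back
	   to __global, undoing the whole optimization until the stray define is
	   deleted. */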
 
Can you compile it for me?

I'm on x64 so I don't mind 64bit exec. ;)
Okay, but can you at least take 30 seconds to make those two changes and tell me what you get?

If you know the compiler is going to be naive, putting that kind of idiosyncratic loop construction in your code isn't a great idea; it's probably using a conditional dependent branch rather than a simple loop instruction.
Hey, I didn't put it there! The thing is that the idea behind the original Smallpt is to fit a full pathtracer in 99 lines of code. David just left that part of the code the way it was.
 
I tried Mintmaster's modifications on the 1.6 version.
On a Radeon HD 5850, there are no big differences between them (the Cornell scene at size 64 is about 12xxx K ~ 13xxx K samples/sec).
On GTX 285, the difference is huge. Before modification it's 16xx K samples/sec. After modification it's ~ 36xxx K samples/sec.
 
On GTX 285, the difference is huge. Before modification it's 16xx K samples/sec. After modification it's ~ 36xxx K samples/sec.
So apparently NVidia's OpenCL compiler defines __APPLE__, and the workaround that was put in place for Apple machines wound up making NVidia cards put the spheres in the __global memory space instead of __constant. Even when I put the spheres in a texture (i.e. image object) I got a huge speedup.
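
The guard in question looks something like this (sketching the idea, not quoting the exact code):
Code:
	#if defined(__APPLE__)
	#define OCL_CONSTANT_BUFFER __global	/* Apple workaround: avoid __constant */
	#else
	#define OCL_CONSTANT_BUFFER __constant	/* everyone else: use constant memory */
	#endif
	/* If NVidia's OpenCL compiler also defines __APPLE__, the first branch is
	   taken and the sphere array quietly lands in __global memory. */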

Is your GTX 285 on a 64-bit machine? Can you run my CUDA port? I think you can crack 1 Gsamples/s.

Too bad about the ATI performance. I was expecting a lot more from that FLOP monster, particularly given my experience with CUDA. It's probably just an immature compiler. I bet if I coded it as a pixel shader then it would be faster...
 
Only 30x faster? Lame :p

Nice job man. Goes to show how important profilers and debuggers are for catching this sort of thing.

GTX 285:

OpenCL Dade: 1,700 ks/s
OpenCL Mint Loop: 10,000 ks/s
OpenCL Mint Constant: 37,000 ks/s
CUDA Mint: 0.69 Gr/s
 
Is your GTX 285 on a 64-bit machine? Can you run my CUDA port? I think you can crack 1 Gsamples/s.

I'm using 64-bit Windows 7, but when I run the CUDA version it shows an error message:

cudaSafeCall() Runtime API error in file <.\SmallptCUDA.cpp>, line 176 : unknown error.

I've tested other CUDA programs (compiled with the 3.0 beta toolkit) and they seem to be fine.
 
What driver are you using? I'm on the same setup - Win 7 64, fw195.62. Got a missing cudart64 error at first but it worked fine after installing the 3.0 beta toolkit.
 
So apparently NVidia's OpenCL compiler defines __APPLE__, and the workaround that was put in place for Apple machines wound up making NVidia cards put the spheres in the __global memory space instead of __constant.

Ahah, I can't believe this, do they really define __APPLE__?!?

(Well, OpenCL is an Apple trademark. Maybe this bug is there because their code is also used for Apple's OpenCL implementation.)

For my 8800GTS, there is also one other thing that has to be changed, or else the speedup is only 2x. On line 82 in "geomfunc.h", replace Code:

unsigned int i = sphereCount;
for (; i--; ) {

with Code:

for (unsigned int i = 0; i < sphereCount; i++) {

I will refrain from writing comments about a compiler that requires that kind of "optimization" ;)

BTW, Chiaroscuro has posted some wonderful work done with SmallLuxGPU v1.2beta in the Luxrender forum. There is a new animation available here: http://www.youtube.com/watch?v=YlGVitBaaHE (awesome!)

He has also posted some beautiful stills:

[image: SmallLuxGPU still render]

And the Stanford Lucy model:

[image: Stanford Lucy model rendered with SmallLuxGPU]

The latest version of SmallLuxGPU is available here: http://davibu.interfree.it/opencl/smallluxgpu/smallluxgpu-v1.2beta2.tgz
 