GPU Ray-tracing for OpenCL

Mintmaster · Jan 25, 2010

mhouston said:
Rasterizing primary rays and raytracing secondary can give really good performance.

We're doing path tracing in this thread, so only 1/5th of our rays are primary.

Long term, though, I definitely do see benefits in mixing them. We won't be path tracing at 2560x1600 anytime soon, so the key is to shoot a limited number of secondary rays into a simplified scene to augment rasterization.

Lightman · Jan 25, 2010

Jawed said:
Very nice work Mintmaster. There's a more complex scene with 783 spheres in it to give it more of a work out. I mentioned it before but for some reason no-one seemed to want to play:

http://forum.beyond3d.com/showthread.php?p=1378149#post1378149

Jawed

Sorry I missed your invitation!

Here it is but in 1024x768 :smile:

smallptGPU.exe 0 1 64 1024 768 scenes\complex.scn

A lot slower .... need more GPUs

Jawed · Jan 25, 2010

Or, some kind of acceleration structure, ouch.

Jawed

MfA · Jan 26, 2010

Bring on Multi-Level raytracing (Reshetov's work at Intel, also the basis for iD's SVO raycasting although it's not limited to primary rays).

Jawed · Jan 26, 2010

I hadn't seen that before, so since academic firewalls tend to get in the way I found a freely-linkable version:

http://lukasz.dk/files/mlrta.pdf

Jawed

MfA · Jan 26, 2010

There's a follow-up paper too on his rather sparse bio page. Personally I feel he was on to something with the MLRTA paper though, in the bit in section 6 about adapting Arvo's ray classification inside MLRTA, and then he said "screw it, don't know how to make this work ... let's implement something simpler to parallelize, I only have to compete with other raytracing algorithms after all and they all suck" for that second paper.

MLRTA was the algorithm they used on Larrabee BTW (according to the Larrabee paper). That's why I mentioned it.

Mintmaster · Jan 27, 2010

MfA said:
There's a follow-up paper too on his rather sparse bio page. Personally I feel he was on to something with the MLRTA paper though, in the bit in section 6 about adapting Arvo's ray classification inside MLRTA, and then he said "screw it, don't know how to make this work ... let's implement something simpler to parallelize, I only have to compete with other raytracing algorithms after all and they all suck" for that second paper.

It seems very difficult to me. A 5D space is huge and you're going to run into performance bottlenecks with memory space and BW when trying to sort general secondary rays. If you're just doing area lights and refraction I could see MLRT improving performance, but there are plenty of realtime techniques to get soft shadows and believable reflections/refractions.
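To put a rough number on "huge", here is a back-of-the-envelope sketch. It assumes a uniform grid with the same resolution on all five axes (three for ray origin, two for direction). That is a simplification for illustration only: real ray-classification schemes subdivide adaptively, and the function name `cells_5d` is made up for this example.

```c
#include <assert.h>

/* Number of cells in a uniform 5D grid with `res` cells per axis.
   Hypothetical helper: a uniform grid is an assumption made here,
   not something MLRTA or Arvo's ray classification prescribes. */
unsigned long long cells_5d(unsigned int res) {
    /* 4096^5 = 2^60 still fits in 64 bits; larger values could overflow. */
    assert(res <= 4096u);

    unsigned long long total = 1;
    for (int axis = 0; axis < 5; axis++) {
        total *= res;   /* one factor per axis: x, y, z, u, v */
    }
    return total;
}
```

With 32 cells per axis this gives about 33.5 million cells, and with 64 per axis it is already over a billion, before storing anything per cell.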

MLRT will help simple raytracing become realtime, but it will never catch rasterization in speed for similar quality. IMO secondary diffuse lighting is the only place that raytracing has a chance to outdo rasterization for realtime graphics, and once you aim for a renderer with that quality, speeding up primary rays or coherent shadow rays does very little for you. People are doing things like spherical harmonics and light propagation volumes to help rasterization deal with this deficiency, but figuring out visibility along arbitrary paths is a fundamental weakness of rasterization while being a strength of RT.

Mintmaster · Jan 27, 2010

Lightman said:
A lot slower .... need more GPUs

Or a faster OpenCL implementation

(I got 1200 kSamples per second with my CUDA version on a 8800GTS).

So good news: I figured out how to get the OpenCL version to speed up by a lot (30x on my GPU), and the change was so simple compared to the time I spent experimenting. In "geomfunc.h" and "rendering_kernel.cl", replace all occurrences of OCL_CONSTANT_BUFFER with __constant (I thought the former was already defined as the latter all this time). I knew constant memory wasn't being used properly. I'm curious if ATI users get the same speedup.

For my 8800GTS, there is also one other thing that has to be changed, or else the speedup is only 2x. On line 82 in "geomfunc.h", replace

Code:

	unsigned int i = sphereCount;
	for (; i--;) {

with

Code:

	for (unsigned int i = 0; i < sphereCount; i++) {

It's really strange that this would make such a difference, but it did for me. Even weirder is that doing the same on line 102 had barely any effect, and it's called almost as frequently. Probably a compiler bug. ATI users: do you see a difference?
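For anyone who wants to see the two loop forms side by side, here is a minimal host-side C sketch. It is not the code from "geomfunc.h": the function names and the array of precomputed hit distances are invented for illustration. It only shows that both loops visit the same elements (in opposite order) and pick the same nearest hit when there are no exact ties.

```c
#include <assert.h>
#include <stddef.h>

/* Countdown form, as in the smallpt-derived loop: visits count-1 down to 0.
   `t` holds hypothetical precomputed hit distances; <= 0 means "no hit". */
int nearest_countdown(const float *t, unsigned int count) {
    assert(t != NULL || count == 0);
    int best = -1;
    unsigned int i = count;
    for (; i--;) {
        if (t[i] > 0.0f && (best < 0 || t[i] < t[best])) best = (int)i;
    }
    return best;
}

/* Count-up form: visits 0 up to count-1. */
int nearest_countup(const float *t, unsigned int count) {
    assert(t != NULL || count == 0);
    int best = -1;
    for (unsigned int i = 0; i < count; i++) {
        if (t[i] > 0.0f && (best < 0 || t[i] < t[best])) best = (int)i;
    }
    return best;
}
```

The only behavioural difference is which index wins when two distances are exactly equal, since the comparison is strict and the visiting order is reversed. Any performance difference comes from how a given compiler translates the loop, not from the work done.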

I'll try to package it all later tonight along with other changes. I wish NVidia's 64-bit SDK didn't make it such a pain to create 32-bit binaries. I might have to uninstall it.

Lightman · Jan 27, 2010

Can you compile it for me?

I'm on x64 so I don't mind 64bit exec.

hiro · Jan 27, 2010

Mintmaster said:
Or a faster OpenCL implementation (I got 1200 kSamples per second with my CUDA version on a 8800GTS).

So good news: I figured out how to get the OpenCL version to speed up by a lot (30x on my GPU), and the change was so simple compared to the time I spent experimenting. In "geomfunc.h" and "rendering_kernel.cl", replace all occurrences of OCL_CONSTANT_BUFFER with __constant (I thought the former was already defined as the latter all this time). I knew constant memory wasn't being used properly. I'm curious if ATI users get the same speedup.

For my 8800GTS, there is also one other thing that has to be changed, or else the speedup is only 2x. On line 82 in "geomfunc.h", replace

Code:

	unsigned int i = sphereCount;
	for (; i--;) {

with

Code:

	for (unsigned int i = 0; i < sphereCount; i++) {

It's really strange that this would make such a difference, but it did for me. Even weirder is that doing the same on line 102 had barely any effect, and it's called almost as frequently. Probably a compiler bug. ATI users: do you see a difference?

I'll try to package it all later tonight along with other changes. I wish NVidia's 64-bit SDK didn't make it such a pain to create 32-bit binaries. I might have to uninstall it.

awesome

making those changes I went from 290ksamples/sec to 936 - 984ksamples/sec on a 8800m gts

MfA · Jan 27, 2010

If you know the compiler is going to be naive, putting those kinds of idiosyncratic loop constructs in your code isn't a great idea; it's probably using a conditional dependent branch rather than a simple loop instruction.

It's a rather expensive way to save on a very small number of characters (hell it comes at the cost of an extra line).

hiro · Jan 27, 2010

hiro said:
awesome making those changes I went from 290ksamples/sec to 936 - 984ksamples/sec on a 8800m gts

Oops, went a little too fast there in my copy and replace and left in the #define __constant

taking that out I get 7158 - 7520ksamples/sec

Mintmaster · Jan 28, 2010

hiro said:
Oops, went a little too fast there in my copy and replace and left in the #define __constant. Taking that out I get 7158 - 7520ksamples/sec

Well that's expected because I have the same GPU architecture as you. My 8800GTS 640MB is at 9300 kSamples/s now. I'm wondering if the ATI GPUs get any benefit.

Mintmaster · Jan 28, 2010

Lightman said:
Can you compile it for me?

I'm on x64 so I don't mind 64bit exec.

Okay, but can you at least take 30 seconds to make those two changes and tell me what you get?

MfA said:
If you know the compiler is going to be naive, putting those kinds of idiosyncratic loop constructs in your code isn't a great idea; it's probably using a conditional dependent branch rather than a simple loop instruction.

Hey, I didn't put it there! The thing is that the idea behind the original Smallpt is to fit a full pathtracer in 99 lines of code. David just left that part of the code the way it was.

pcchen · Jan 28, 2010

I tried Mintmaster's modifications on the 1.6 version.
On Radeon HD 5850, there is no big difference between them (Cornell scene 64 size is about 12xxx K ~ 13xxx K samples/sec).
On GTX 285, the difference is huge. Before modification it's 16xx K samples/sec. After modification it's ~ 36xxx K samples/sec.

Mintmaster · Jan 28, 2010

pcchen said:
On GTX 285, the difference is huge. Before modification it's 16xx K samples/sec. After modification it's ~ 36xxx K samples/sec.

So apparently NVidia's OpenCL compiler defines __APPLE__, and the workaround that was put in place for Apple machines wound up making NVidia cards put the spheres in the __global memory space instead of __constant. Even when I put the spheres in a texture (i.e. image object) I got a huge speedup.
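Roughly, the failure mode looks like the sketch below. The exact guard in the kernel source is an assumption here (the thread only establishes that an Apple workaround keyed on `__APPLE__` ended up selecting `__global`), and the stringification helper `sphere_buffer_qualifier` is made up so the idea can be checked with an ordinary host C compiler.

```c
#include <assert.h>
#include <string.h>

/* Two-step stringification so the macro is expanded before being quoted. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* Assumed shape of the workaround: any compiler that defines __APPLE__
   gets the buffer macro mapped to __global instead of __constant. */
#ifdef __APPLE__
#define OCL_CONSTANT_BUFFER __global
#else
#define OCL_CONSTANT_BUFFER __constant
#endif

/* Reports which address-space qualifier the macro resolved to.
   Hypothetical helper for illustration only. */
const char *sphere_buffer_qualifier(void) {
    const char *q = STR(OCL_CONSTANT_BUFFER);
    assert(strcmp(q, "__global") == 0 || strcmp(q, "__constant") == 0);
    return q;
}
```

If a compiler defines `__APPLE__` unexpectedly, every buffer declared through the macro silently lands in the slower address space. That is why replacing the macro with a literal `__constant` sidesteps the problem.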

Is your GTX 285 on a 64-bit machine? Can you run my CUDA port? I think you can crack 1 Gsamples/s.

Too bad about the ATI performance. I was expecting a lot more from that FLOP monster, particularly given my experience with CUDA. It's probably just an immature compiler. I bet if I coded it as a pixel shader then it would be faster...

trinibwoy · Jan 28, 2010

Only 30x faster? Lame

Nice job man. Goes to show how important profilers and debuggers are for catching this sort of thing.

GTX 285:

OpenCL Dade: 1,700 ks/s
OpenCL Mint Loop: 10,000 ks/s
OpenCL Mint Constant: 37,000 ks/s
CUDA Mint: 0.69 Gr/s

pcchen · Jan 28, 2010

Mintmaster said:
Is your GTX 285 on a 64-bit machine? Can you run my CUDA port? I think you can crack 1 Gsamples/s.

I'm using Windows 7 64 bits, but when I run the CUDA version it shows an error message:

cudaSafeCall() Runtime API error in file <.\SmallptCUDA.cpp>, line 176 : unknown error.

I've tested other CUDA programs (compiled with 3.0 beta toolkit) and they seem to be fine.

trinibwoy · Jan 28, 2010

What driver are you using? I'm on the same setup - Win 7 64, fw195.62. Got a missing cudart64 error at first but it worked fine after installing the 3.0 beta toolkit.

Dade · Jan 28, 2010

Mintmaster said:
So apparently NVidia's OpenCL compiler defines __APPLE__, and the workaround that was put in place for Apple machines wound up making NVidia cards put the spheres in the __global memory space instead of __constant.

Ahah, I cannot believe this, do they really define __APPLE__ ?!?

(Well, OpenCL is an Apple trademark. Maybe this bug is present because their code is used for Apple's OpenCL too.)

Mintmaster said:
For my 8800GTS, there is also one other thing that has to be changed, or else the speedup is only 2x. On line 82 in "geomfunc.h", replace Code:

unsigned int i = sphereCount;
for (; i--; ) {

with Code:

for (unsigned int i = 0; i < sphereCount; i++) {

I will refrain from commenting on a compiler that requires that kind of "optimization".

BTW, Chiaroscuro has posted some wonderful work done with SmallLuxGPU v1.2beta in the Luxrender forum. There is a new animation available here: http://www.youtube.com/watch?v=YlGVitBaaHE (awesome!)

He has also posted some beautiful stills, as well as a render of the Stanford Lucy model.

The latest version of SmallLuxGPU is available here: http://davibu.interfree.it/opencl/smallluxgpu/smallluxgpu-v1.2beta2.tgz
