GPU Ray-tracing for OpenCL

fellix

Grab it here!

The result is in Ksamples/sec.
Here's what my precious one is capable of:

25247002.jpg

Code:
OpenCL Platform 0: Advanced Micro Devices, Inc.
OpenCL Device 0: Type = TYPE_GPU
OpenCL Device 0: Name = Cypress
OpenCL Device 0: Compute units = 20
OpenCL Device 0: Max. work group size = 256
Reading file 'rendering_kernel.cl' (size 2997 bytes)
OpenCL Device 0: kernel work group size = 256
Radeon HD 5870 @ 900/5000MHz
Q9450 @ 3608MHz, 0% load! (NV users may get noticeable CPU load for some reason)
 
Very nice! With my 8800 GT, I'm getting 430k samples/second :( Moreover, the driver locks up after 50 passes (roughly). This is under Vista x64.

Code:
OpenCL Platform 0: NVIDIA Corporation
OpenCL Device 0: Type = TYPE_GPU
OpenCL Device 0: Name = GeForce 8800 GT
OpenCL Device 0: Compute units = 14
OpenCL Device 0: Max. work group size = 512
Reading file 'rendering_kernel.cl' (size 2997 bytes)
OpenCL Device 0: kernel work group size = 192
 
~1250 samples/sec on a GTX 285. Maxes out one CPU core. David posts here, maybe he can give us some insight into his approach.
 
Nice!

smallptgpu10001248.jpg


There were some spikes up to 14480K samples/s every 4-6 seconds, so not too bad I think!

Win7 x64 and I think still SDK Beta on this one.
 
But then, if it only takes 16 to make a compute unit,
why does the app report only 20 compute units and not 100?
1600 shaders / 20 units = 80 shaders per unit, yes?
 
It's an ALU, not "shader", stream processor or any other marketing label with narrow meaning. ;)

80 ALUs per multiprocessor or 16-wide SIMD unit -- pick one. ;)
 
Ahh so the 8800gt has 112 shaders so it takes 14 of them to make a compute unit, is that right?
and the hd 5870 has 1600 shaders, but it takes 80 of them to make a compute unit ?

8800GT has (8*2)*7= 112 SPs.

My HD4850 has 10 Compute Units (10 16-way-SIMDs). The HD5870 has 20 Compute Units.
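To spell out the arithmetic behind this exchange: the "compute units" number that OpenCL reports counts multiprocessors/SIMDs, not individual ALUs. A quick Python sketch, where the compute-unit counts come from the logs and posts above and the ALUs-per-unit figures are assumptions taken from the commonly quoted specs of each chip (the app itself does not report them):

```python
# ALUs ("shaders" / SPs) = compute units reported by OpenCL x ALUs per compute unit.
# The per-unit figures below are assumptions from commonly quoted chip specs.
GPUS = {
    "Radeon HD 5870": (20, 80),   # each unit: 16-wide SIMD x 5 VLIW lanes = 80 ALUs
    "Radeon HD 4850": (10, 80),
    "GeForce 8800 GT": (14, 8),   # (8*2)*7: 7 clusters x 2 multiprocessors x 8 SPs
}


def total_alus(compute_units, alus_per_unit):
    """Total ALU count for a device."""
    return compute_units * alus_per_unit


for name, (units, per_unit) in GPUS.items():
    print(f"{name}: {units} x {per_unit} = {total_alus(units, per_unit)} ALUs")
```

So the 8800 GT's 112 SPs are grouped 8 per compute unit (14 units), while the HD 5870's 1600 are grouped 80 per compute unit (20 units).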
 
Hi, I'm David, the author of SmallptGPU. I think I can clarify a few points:

- About the poor performance on NVIDIA: I developed SmallptGPU on an ATI HD4870. Both the ATI and NVIDIA OpenCL drivers are at an early stage of development and have their fair share of problems and bugs. I avoided the problematic paths on ATI because that is my card, but I have never tested SmallptGPU on NVIDIA. I assume I'm doing something that the NVIDIA OpenCL driver doesn't like at all. The high CPU usage is a good hint of this problem.

- The sources are available on the web site, so if anyone has a fix for NVIDIA cards, I will be happy to apply it.

- SmallptGPU uses the first GPU device available (there is a command line option to run on a CPU device). Nearly all the load should be on the GPU; the CPU is almost unused. It cannot use multiple devices at the same time, so a dual-GPU SLI/CrossFire configuration will be used at only 50% of its capability.

- The 5870 is horribly fast... I'm trying not to buy one ;)
 
You have to provide a complete argument list for the command line, e.g.:

smallptgpu 0 1 rendering_kernel.cl 640 480 scenes\cornell_large.scn

You can create new arrangements of spheres:

Code:
camera 20 100 300 0 25 0
size 7
sphere 1000 0 -1000 0 0 0 0 0.75 0.75 0.75 0
sphere 10 35 15 0 0 0 0 0.9 0 0 2
sphere 15 -35 20 0 0 0 0 0 0.9 0 2
sphere 20 0 25 -35 0 0 0 0 0 0.9 2
sphere 4 35 15 0 15 15 15 0 0 0 0
sphere 8 -35 20 0 15 15 15 0 0 0 0
sphere 8 0 25 -35 100 100 100 0 0 0 0

Save it as, e.g., caustic7.scn. It has 3 light sources (the final 3 spheres), one inside each "caustic" sphere. The one inside the blue sphere is super-bright.
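For anyone generating or checking scenes like this with a script, here is a minimal Python sketch of a reader for the format. The field meanings (radius, position, emission, colour, material id for a sphere; origin and target for the camera) are inferred from the example above and are an assumption, not a documented spec. The names `Sphere`, `parse_scene` and `light_sources` are hypothetical helpers, not part of SmallptGPU.

```python
from dataclasses import dataclass


@dataclass
class Sphere:
    radius: float
    position: tuple   # x, y, z
    emission: tuple   # r, g, b; non-zero is treated as a light source
    color: tuple      # r, g, b
    material: int     # integer id; the caustic spheres above use 2


def parse_scene(text):
    """Parse 'camera', 'size' and 'sphere' lines; return (camera, spheres)."""
    camera, declared, spheres = None, None, []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        kind, args = tokens[0], tokens[1:]
        if kind == "camera":
            values = [float(a) for a in args]
            if len(values) != 6:
                raise ValueError(f"camera needs 6 numbers: {line!r}")
            camera = (tuple(values[:3]), tuple(values[3:]))
        elif kind == "size":
            declared = int(args[0])
        elif kind == "sphere":
            if len(args) != 11:
                raise ValueError(f"sphere needs 11 numbers: {line!r}")
            v = [float(a) for a in args[:10]]
            spheres.append(Sphere(v[0], tuple(v[1:4]), tuple(v[4:7]),
                                  tuple(v[7:10]), int(args[10])))
        else:
            raise ValueError(f"unknown line: {line!r}")
    # The 'size' line must match the number of sphere lines that follow.
    if declared is not None and declared != len(spheres):
        raise ValueError(f"size says {declared}, found {len(spheres)} spheres")
    return camera, spheres


def light_sources(spheres):
    """Spheres with any non-zero emission component."""
    return [s for s in spheres if any(c > 0 for c in s.emission)]
```

Running `light_sources(parse_scene(open("caustic7.scn").read())[1])` on the scene above should return the final 3 spheres, and the `size` check catches the easy mistake of adding a sphere without bumping the count.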

Jawed
 
I suggest changing the work group size calculation so that it does not use the "maximum". 64 on ATI would be better than 256. 64 is the minimum size on ATI HD5870 or HD4870. Some ATI GPUs (HD43xx HD45xx HD46xx) will work best with a lower size (32 or 16).

NVidia should be happy with 32 or 64.

Might be an idea to expose the workgroup size as a command line parameter. Or, have the program try a few different values. Or, just hard code it to 64.
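The selection logic suggested above could look roughly like this. It is a Python sketch of the host-side arithmetic only, not SmallptGPU's actual code; `pick_work_group_size` and `padded_global_size` are hypothetical names, and the default of 64 is simply the value proposed in this post.

```python
def pick_work_group_size(kernel_max, preferred=64):
    """Largest power of two that is <= both the preferred size and the
    kernel's reported maximum work group size."""
    if kernel_max < 1:
        raise ValueError("kernel_max must be at least 1")
    limit = min(preferred, kernel_max)
    size = 1
    while size * 2 <= limit:
        size *= 2
    return size


def padded_global_size(work_items, group_size):
    """Round the global size up to a multiple of the work group size,
    as OpenCL 1.0 requires when a local size is given explicitly."""
    return ((work_items + group_size - 1) // group_size) * group_size


# Kernel work group sizes taken from the two logs earlier in the thread.
print(pick_work_group_size(256))   # Cypress reports 256
print(pick_work_group_size(192))   # 8800 GT reports 192
print(padded_global_size(640 * 480, 64))
```

With the values from the logs, both devices end up at 64 instead of 256 and 192. Passing a smaller `preferred` (32 or 16) would cover the low-end ATI parts mentioned above, and the same parameter is what a command line option would feed.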

The register usage appears to be 49 vec4 registers. This is reasonable on ATI, resulting in 5 hardware threads (wavefronts). On NVidia it is a disaster (the equivalent of 196 registers if fiddling with the occupancy calculator), meaning that only 2 hardware threads (warps) can occupy each multiprocessor. As it happens 64 is better than 32 for the workgroup size in this scenario.
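The occupancy figures can be reproduced with a back-of-the-envelope calculation: divide the register file by the registers one hardware thread needs. In this Python sketch the register file sizes are assumptions (16384 vec4 registers per ATI SIMD, 16384 scalar registers per NVIDIA multiprocessor, which is a GT200-class figure), chosen because they are the commonly quoted values and they match the 5 and 2 above. It ignores every other limit (allocation granularity, hardware caps on resident threads, local memory).

```python
def resident_hw_threads(register_file, regs_per_work_item, hw_thread_width):
    """How many wavefronts/warps fit in one unit's register file."""
    return register_file // (regs_per_work_item * hw_thread_width)


VEC4_REGS = 49                  # kernel register usage quoted above
SCALAR_REGS = VEC4_REGS * 4     # 196, the NVIDIA equivalent quoted above

# Assumed register file sizes (not stated in the thread).
ATI_VEC4_REGS_PER_SIMD = 16384
NV_REGS_PER_MULTIPROCESSOR = 16384

ati = resident_hw_threads(ATI_VEC4_REGS_PER_SIMD, VEC4_REGS, 64)      # 64-wide wavefront
nv = resident_hw_threads(NV_REGS_PER_MULTIPROCESSOR, SCALAR_REGS, 32)  # 32-wide warp

print(f"ATI wavefronts per SIMD: {ati}")
print(f"NVIDIA warps per multiprocessor: {nv}")
```

Two warps is 64 work-items, which is why 64 beats 32 as the workgroup size in this scenario: a 32-wide group would leave half of what little occupancy there is unused.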

I'm not sure how NVidia handles the situation when 512 work-items are requested but the hardware can only issue 64. I don't know whether the hardware spills registers in this situation; if it does, that compounds the disaster. Hopefully making this change will improve things dramatically.

Jawed
 
Well, let's hope Mr. David takes note of this. ;)

But there's still the weird issue with the CPU being under load on NV hardware.
 