GPU Ray-tracing for OpenCL

There's a 2-element array of integer seeds that's private per work-item. This array is put into local memory. On ATI the compiler just uses registers.

I was using an int[2] array to pass around the 2 seeds used by the random number generator. I uploaded a new version, available at http://davibu.interfree.it/opencl/smallptgpu/smallptgpu-v1.5beta2.tgz

I modified the OpenCL kernel to pass around 2 arguments instead of one array of two elements. Let's see if this is the source of all problems with NVIDIA :?:
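
For anyone following along, here is a minimal plain-C sketch of the two calling conventions. The thread does not show the actual kernel, so this assumes a Marsaglia-style multiply-with-carry generator; the names `mwc_step`, `rng_from_array` and `rng_from_pair` and the constants are illustrative only. Both variants compute exactly the same sequence, and the only difference is how the seeds are handed to the function.

```c
#include <stdint.h>

/* Hypothetical multiply-with-carry step (assumed generator, not taken from the kernel). */
static uint32_t mwc_step(uint32_t s, uint32_t mul) {
    return mul * (s & 65535u) + (s >> 16);
}

/* Old style: both seeds travel together in a 2-element array. */
float rng_from_array(uint32_t seeds[2]) {
    seeds[0] = mwc_step(seeds[0], 36969u);
    seeds[1] = mwc_step(seeds[1], 18000u);
    uint32_t bits = (seeds[0] << 16) + seeds[1];
    /* Keep the top 24 bits and scale them into [0, 1). */
    return (float)(bits >> 8) * (1.0f / 16777216.0f);
}

/* New style: each seed is a separate argument, so no array indexing is involved. */
float rng_from_pair(uint32_t *seed0, uint32_t *seed1) {
    *seed0 = mwc_step(*seed0, 36969u);
    *seed1 = mwc_step(*seed1, 18000u);
    uint32_t bits = (*seed0 << 16) + *seed1;
    return (float)(bits >> 8) * (1.0f / 16777216.0f);
}
```

The idea behind the change is that two scalars may be easier for a compiler to keep in registers than an indexed array, though whether that matters on a given driver is something only a test run can show.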
 
Is there any way for the application to engage more than one OCL device, for multi-GPU systems?

Also, some mean value for the sample count would be welcome. That way the app will be a bit more benchmark friendly. ;)
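
One possible way to do that, sketched in plain C: keep a running mean of the per-pass samples/sec figure and print it next to the instantaneous value. The `RunningMean` type and its functions are hypothetical names for illustration, not anything from the SmallptGPU source.

```c
/* Running mean of a stream of values, updated one sample at a time. */
typedef struct {
    double mean;          /* mean of all values seen so far */
    unsigned long count;  /* number of values seen so far */
} RunningMean;

void running_mean_init(RunningMean *m) {
    m->mean = 0.0;
    m->count = 0;
}

/* Fold one new measurement (e.g. samples/sec of the last pass) into the mean. */
void running_mean_add(RunningMean *m, double value) {
    m->count += 1;
    m->mean += (value - m->mean) / (double)m->count;
}
```

The incremental update avoids storing every measurement and avoids summing very large totals over a long run.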
 
Sorry Dade, this one doesn't produce output for me:

D:\Install\SmallptGPU-v1.5beta2>RUN_SCENE_CORNELL_64SIZE.bat

D:\Install\SmallptGPU-v1.5beta2>smallptGPU.exe 1 64 rendering_kernel.cl 640 480 scenes\cornell.scn
Usage: smallptGPU.exe
Usage: smallptGPU.exe <use CPU/GPU device (0=CPU or 1=GPU)> <workgroup size (0=default value or anything > 0 and power of 2)> <kernel file name> <window width> <window height> <scene file>
Reading scene: scenes\cornell.scn
Scene size: 9
OpenCL Platform 0: NVIDIA Corporation
OpenCL Device 0: Type = TYPE_GPU
OpenCL Device 0: Name = GeForce GTX 280
OpenCL Device 0: Compute units = 30
OpenCL Device 0: Max. work group size = 512
Reading file 'rendering_kernel.cl' (size 3228 bytes)
OpenCL Device 0: kernel work group size = 384
OpenCL Device 0: forced kernel work group size = 64
Failed to wait the end of OpenCL execution: -5
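
For reference, -5 is the value the OpenCL header (cl.h) assigns to CL_OUT_OF_RESOURCES. A small lookup like the one below would make such logs easier to read; `cl_error_name` is a hypothetical helper, not part of SmallptGPU, and it covers only the first few codes.

```c
/* Map a few common OpenCL status codes to their cl.h names.
 * The numeric values are hard-coded here so the sketch builds without the OpenCL headers. */
const char *cl_error_name(int code) {
    switch (code) {
    case 0:  return "CL_SUCCESS";
    case -1: return "CL_DEVICE_NOT_FOUND";
    case -2: return "CL_DEVICE_NOT_AVAILABLE";
    case -3: return "CL_COMPILER_NOT_AVAILABLE";
    case -4: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5: return "CL_OUT_OF_RESOURCES";
    case -6: return "CL_OUT_OF_HOST_MEMORY";
    default: return "unknown (see cl.h)";
    }
}
```

The error name alone does not say which resource ran out, so the kernel and the work group size would still need to be looked at separately.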
 
Again, you are correct. Simple in the black w/o OC becomes 565000k/s when maximised.

I doubt you realize how shocking it is for me to see a 565000k/s number for an unbiased rendering :D

Just to give you an idea, a very simple scene usually runs at 500k/s. Complex scenes usually run at 20-30k/s on a quadcore (check http://www.luxrender.net/forum/gallery2.php for some examples of the scenes I'm talking about).

I can reach 100-150k/s on a network of 6 quadcores ... 565000k/s should be forbidden by the laws of physics ;)

P.S. Samples from SmallptGPU and "Luxrender" are not exactly the same thing, but let's dream a bit.
 
There is an IQ problem with the latest beta -- visible aliasing on some edges and intersections:

71121974.jpg
 
Is there any way for the application to engage more than one OCL device, for multi-GPU systems?

Yes, there is very good support for handling multiple devices (both CPUs and GPUs). It is the next thing I'm going to explore.
 
I can reach 100-150k/s on a network of 6 quadcores ... 565000k/s should be forbidden by the laws of physics ;)

Heh well it'd be cooler if it were actually rendering anything, but just for kicks then:

blackrender.jpg


No overclocking or anything.
 
Just for fun, this was looking up in the air, running BETA 2.
I just wanted to see how high my numbers would go. :)

GPU was set to the same speed as posted above...
fishingn.jpg
 
Thanks guys, 2900K/sec isn't a bad result. It is still a bit far from the 5400K/s of my 4870 under Windows. I assumed 1 GPU of the GTX295 and the HD4870 would run at about the same speed.

Maybe the ATI OpenCL driver/hardware is just better for this particular kernel.

Overall, it is quite impressive how good the performance of the new generation (e.g. ATI HD5870) is compared to the old one (e.g. ATI HD 48xx, NVIDIA GTX 28x/29x). Now we only have to wait for Fermi to have a complete picture ;)
 
OK, so the next thing to do to make David boggle, is to run multiple SmallLuxGPU programs. Because the GPU is only active for a fraction of the time taken to render each pass it should be possible to load up all four cores of a quad core processor by running four different instances of the program: BIGMONKEY, LOFT, LUXBALL and SPONZA :p

Use affinity to keep each one pinned to a single core.

Then take the average rays/second from each and add them all up for a grand total :LOL:

Jawed
 
Just some numbers from our other thread you might like to see... ;)
http://www.xtremesystems.org/forums/showthread.php?t=241904&page=2

I believe this is what we are looking at so far...
freeloader ---------- 5850 ------ Sample/sec -- 17,298.6K v1.5 (GPU=1007, M=1152)

freeloader ---------- 5850 ------ Sample/sec -- 13,719.6K v1.4 (GPU=1007, M=1152)
Toysoldier ---------- 5870 ------ Sample/sec -- 13,719.6K v1.4 (GPU=875, M=1300)
fellix bg ------------- 5870 ------ Sample/sec -- 13,719.6K v1.4 (GPU=900, M=1250)

safan80 ------------- 5970 ------ Sample/sec -- 11,012.8K v1.4 (Unknown)

SocketMan --------- 5770 ------ Sample/sec --- 7,535.1K v1.4 (GPU=950, M=1200)
mattkosem --------- 4890 ------ Sample/sec --- 7,520.9K v1.4 (GPU=1056, M=1000)
BeepBeep2 --------- 4850 ------ Sample/sec --- 7,172.0K v1.5 (GPU=800, M=2250)

Mechromancer ----- 4870 ------ Sample/sec --- 6,955.5K v1.5 (GPU=790, M=900)
PyrO ----------- 1/2 a 4870X2 -- Sample/sec --- 6,955.5K v1.5 (GPU=790, M=915)
redrumy3 ---------- 4870 ------- Sample/sec --- 6,375.8K v1.4 (GPU=875, M=1100)

PyrO ----------- 1/2 a 4870X2 -- Sample/sec --- 5,796.2K v1.4 (GPU=790, M=915)
NovoRei ------------ 4870 ------ Sample/sec --- 5,616.1K v1.4 (512mb, 790mhz)

Talonman -------- 1/2 a 295 ---- Sample/sec --- 2,898.1K v1.5 (C=621, SH=1512, M=1152)
Chumbucket843 -- GTX 260 ---- Sample/sec --- 2,068.7K v1.5 (C=602, SH=1369, M=1159)

Talonman -------- 1/2 a 295 ---- Sample/sec --- 1,159.2K v1.4 (C=621, SH=1512, M=1152)
Chumbucket843 -- GTX 260 ---- Sample/sec --- 1,123.2K v1.4 (C=602, SH=1369, M=1159)
DosDuoNo -------- GTX 260 ----- Sample/sec --- 1,093.2K v1.4 (C=655, SH=1125, M=1125)
 
OK, so the next thing to do to make David boggle, is to run multiple SmallLuxGPU programs. Because the GPU is only active for a fraction of the time taken to render each pass it should be possible to load up all four cores of a quad core processor by running four different instances of the program: BIGMONKEY, LOFT, LUXBALL and SPONZA :p

Use affinity to keep each one pinned to a single core.

Then take the average rays/second from each and add them all up for a grand total :LOL:

I have already started to work on SmallptGPU 2.0 in order to test/learn how the OpenCL support for multiple devices works (i.e. CPU + GPU, GPU + GPU, etc.) ;)

P.S. thanks Talonman, a lot of very interesting information, 17,298.6K on a 5850 ?!? What the hell ...


 
Overall, it is quite impressive how good the performance of the new generation (e.g. ATI HD5870) is compared to the old one (e.g. ATI HD 48xx, NVIDIA GTX 28x/29x). Now we only have to wait for Fermi to have a complete picture ;)
Fermi has cached read/write of spilled registers, so it should be much better, if spilling is the problem on NVidia. Need to know how many registers are being allocated or whether spillage is occurring.

Jawed
 