GPU Ray-tracing for OpenCL

@Silent_Buddha: yup, the AMD/ATI OpenCL CPU device spawns as many threads as there are cores available. At the moment I have the opposite problem: I would like to have some direct control over the number of threads spawned, in order not to overload the CPU (and slow down the threads dedicated to driving the GPUs).

I was just wondering about that myself; it was my next question for you: whether there is a way to limit the number of threads/cores used on a CPU. I'd imagine it could be problematic in, say, a game, if the core game required 1-2 cores for good performance but wanted anything using OpenCL to fully occupy only the remaining cores.

Regards,
SB
 
Can't you just increase the priority on the GPU rendering threads?

PS. Does calling clFinish at nearly 200 Hz (on the faster ATI cards) really have no impact on performance? (If I had a graphics card that could run it, I'd try doing clFinish once every 20 invocations or so instead, but I don't.)
 
Last edited by a moderator:
@Talonman: you could try version 2.0 with your 3-GPU setup. I uploaded a preliminary version at http://davibu.interfree.it/opencl/smallptgpu2/smallptgpu-v2.0alpha1.tgz

This version should run on all available GPUs (and CPUs on Apple/ATI). It is highly untested and I'm experiencing a problem under Windows: my PC resets after a few seconds, as if my power supply couldn't sustain the load :oops: Everything works fine under Linux ... quite strange. Well, it may crash, but you can give it a try.

Thanks! Preliminary report is in...

She works... she works!! :smile:

This was with running the 'smallptGPU' executable.
Now getting 3,880K Samples/sec:
v2alpha.jpg



GPU-z reporting workload distribution:
First 1/2 of my 295 - 73%
Second 1/2 of my 295 - 50%
280 Dedicated PhysX - 98%
Q6600 - 29%
v2alpha2.jpg


Looking good too...
v2alpha3.jpg


I will report back with my results using the various 'Size' bat files.

UPDATE:
smallptGPU - 3,880
RUN_SCENE_CORNELL_32SIZE - 2,401
RUN_SCENE_CORNELL_64SIZE - 3,880
RUN_SCENE_CORNELL_128SIZE - 3,516
RUN_SCENE_SIMPLE_64SIZE - 77,879.7
v2alpha4.jpg


 
Running the Simple scene, the workload distribution reported by GPU-z was:
First 1/2 of my 295 - 35%
Second 1/2 of my 295 - 15%
280 Dedicated PhysX - 92%
Q6600 - 30%
v2alpha5.jpg


FYI - No system reset problems of any kind. She is solid as a rock so far.

You may also be interested in knowing that SmallptGPU and the Simple scene run fine against each other...
GPU-z reporting workload distribution:
First 1/2 of my 295 - 71%
Second 1/2 of my 295 - 41%
280 Dedicated PhysX - 96%
Q6600 - 63%
Getting 3602.0K Samples/sec on both instances.
v2alpha6.jpg

I was playing with the '1' key; that is the reason you see 'Updating OpenCL Device workloads' so many times.
 
Wow, Talonman, thank you, I love the pictures with the 3 GPUs at work.

Good that it is stable ... now I'm a bit worried about the state of my power supply ;) or maybe it is just a bug in the ATI Windows XP driver, because everything works fine under Linux.
 
Wow, Talonman, thank you, I love the pictures with the 3 GPUs at work.

Good that it is stable ... now I'm a bit worried about the state of my power supply ;) or maybe it is just a bug in the ATI Windows XP driver, because everything works fine under Linux.
Thanks...

Look here for more BETA testers on the new version. (ATI users too)

http://www.xtremesystems.org/forums/showthread.php?t=241904&page=3

I think the GPU workload balancing could use a bit of an adjustment, but I love the progress you're making.

Keep up the fine job. ;)
 
I made a binary of the 2.0alpha on OSX; it needed just a few CFLAGS really (-I/opt/local/include -L/opt/local/lib -lboost_thread-mt, this is for boost 1.41 from macports, which is probably most widely used). And a minor typo on line 34 of displayfunc.cpp: __APPLE_ needs an extra underscore (__APPLE__). So nothing big, works sweet.

Looking at utilisation I reckon it's still leaving cycles on the table for now but great new development Dade :)

Here's a shot of one of the few boxes where the CPU does a sizeable part of the work:

selene20.jpg
 
Try the "-cl-mad-enable" or "-cl-fast-relaxed-math" compiler options. Or, even, "-cl-unsafe-math-optimizations".

Jawed

unsafe didn't seem to make a difference, and the current release is actually slightly faster (~16K samples/sec) w/o the other two options on OS X... curiously
 
Dade - can the load balancing between GPU and CPU be manually adjusted after the application does its automatic balancing? The problem I can see is that when someone like me changes GPU and CPU frequencies on the fly, the app will not work optimally.
Also, for some reason it's not balancing ideally for clocks above 850 MHz on my GPU (the CPU/GPU ratio stays roughly the same when it should be giving more work to the GPU). With manual adjustment I could give, for example, 96% to the GPU and only 4% to the CPU, which would probably give me the best performance in a 1 GHz GPU / 3.5 GHz quad-core CPU config.

Hope this can be implemented!

:D
 
Hyperthreading with ATI Stream SDK :LOL:

heracpustream.jpg


The i7 is almost exactly as fast as the GTX280 (ie proof positive that Nvidia's current OpenCL.dll is sh*t)
 
Copy one of the batch files to make a new one called complex.bat and make it like so:

smallptGPU.exe 1 1 64 640 480 scenes\complex.scn

This is a new scene file. The number of spheres is quite meaty...

Jawed
 
Dade - can the load balancing between GPU and CPU be manually adjusted after the application does its automatic balancing? The problem I can see is that when someone like me changes GPU and CPU frequencies on the fly, the app will not work optimally.
Also, for some reason it's not balancing ideally for clocks above 850 MHz on my GPU (the CPU/GPU ratio stays roughly the same when it should be giving more work to the GPU). With manual adjustment I could give, for example, 96% to the GPU and only 4% to the CPU, which would probably give me the best performance in a 1 GHz GPU / 3.5 GHz quad-core CPU config.

Hope this can be implemented!

It is just a matter of adding a few keybindings to hand-tune the % of workload assigned to each device; I'm going to add this feature ;)

You may have noticed that at the moment the workload % is decided after 10 secs of "profiling", with the work evenly split among all devices. I guess it is too short a period, but increasing the profiling time would just be annoying, so hand tuning is probably the best solution.
 
Copy one of the batch files to make a new one called complex.bat and make it like so:

smallptGPU.exe 1 1 64 640 480 scenes\complex.scn

This is a new scene file. The number of spheres is quite meaty...

Jawed, that scene is quite "terrible" given the current brute-force intersection algorithm used in smallptGPU.

My first idea was to keep smallptGPU very simple, but I guess it is worth adding a simple BVH to accelerate ray intersections, so we can play with thousands of spheres instead of just 5 or 6 ;)
 
For trivia: w1zzard, the programmer of GPU-z, was asked over at Xtreme Systems what he was using to read the % of workload on each GPU. He posted: "I'm using the official interface provided by nvidia in their nvapi."

I don't know if you could also use it or not... :)

It would be fun to see the actual workload % change on the screen, as we hit the keys.


http://www.evga.com/forums/tm.aspx?m=91863&mpage=3
Posted by Spongebob28:
"I want to join in on the games.

V2.0 Alpha

Geforce 8800 GT Perf. index 1.00 Workload done 34.2%
Geforce GTX 295 Perf. index 2.28 Workload done 21.8%
Geforce GTX 295 Perf. index 2.28 Workload done 44.0%

Rendering time 0.625 sec pass 447 Sample/sec 1820.8K."
 
I'm curious what the differences are between David's program and this one??

http://code.google.com/p/tokaspt/

He is claiming 185.6M 4-bounce rays per second. I would need about 13 HD 5850s to get close to that with SmallptGPU :oops:

Samples are not rays. SmallptGPU traces 12 rays (2 for each path vertex, 6 path depth max.) to generate a sample. That means it is running at 228M rays/sec in the case of the 19,000K samples/sec of a 5870.

You could look at the "spp" (i.e. samples per sec) statistic there to try to do a direct comparison.
 
I'm a 100% noob and was not sure if he was using some sort of fancy code to speed things up. Thanks for the clarification!!
 
Outstanding job again Dave!!

By manually adjusting the workload, I now can get 5173.9K Samples/sec. (A new record on my system.)
First 1/2 of my 295 - 96%
Second 1/2 of my 295 - 97%
280 Dedicated PhysX - 96%
Q6600 - 30% (Exactly the same as the first 2.0Alpha, without the manual GPU workload distribution feature.)

Apparently balancing the GPUs' workload does NOT change the CPU's utilization at all...
v2alphab1.jpg


Note that I now feel there is no need for GPU-z anymore; I simply need to get all 3 GPUs to read as close to 33.3% as I can, using your program.
The workload distribution % displayed in your program is accurate. :)

The Simple Scene also has the manual adjustment feature, and it also is working well:
simple2alphab.jpg
 