GPU Ray-tracing for OpenCL

Talonman · Jan 3, 2010

Florin said:
Is your 295 still OCed? Your sig says EVGA GTX 295 C=621 S=1512 M=1152?

Yes, I am running at: C=621, Shaders=1512, and Memory=1152

BTW - Where is the 'edit post' button?

Florin · Jan 3, 2010

Talonman said:
Yes, I am running at: C=621, Shaders=1512, and Memory=1152

BTW - Where is the 'edit post' button?

Err I think that's reserved for members over a certain post count. Kind of a pain, but I guess newbie stealth edits ruined it..

Talonman · Jan 3, 2010

Thanks.... Drag!!

Dade · Jan 3, 2010

Thanks all. Performance on NVIDIA is still quite disappointing; I will take a look at the NVIDIA OpenCL samples to see if I can find an explanation for the poor performance.

Talonman · Jan 3, 2010

If we knew what the CPU was up to, I think it would help to put our finger on the performance issue.

Florin · Jan 3, 2010

Talonman said:
If we knew what the CPU was up to, I think it would help to put our finger on the performance issue.

I've tried a mild overclock (133 bus speed to 145) on the i7, which gave a single core top speed of 3.625 GHz, and it didn't seem to affect the score.

If you run the 6600 at stock speed, do the scores change?

Jawed · Jan 3, 2010

What is the CPU load like when running simple.scn? That might be a clue.

Jawed

Talonman · Jan 3, 2010

odd...

Still high!

Talonman · Jan 3, 2010

Florin · Jan 3, 2010

Jawed said:
What is the CPU load like when running simple.scn? That might be a clue.

Jawed

12-13% of 4 cores/4 HT. About 91800k samples/sec

The Win7 scheduler bounces the process around between the 4 real cores. If I force the affinity of smallptgpu.exe to just 1 core, it just maxes that out 100% and the score doesn't change.

Hmm when overclocking the CPU to 145 (actually gives a top speed of 3760 as the max turbo multiplier seems to be 26 on this CPU instead of 25 like I thought) this score does seem to go a bit higher, like a mean of 92400k or so. Too small to be significant though I reckon.

fellix · Jan 3, 2010

That's strange, I'm getting just over 70000K in this scene with the 5870 -- nowhere near Talonman's numbers.

Looks like some specific workload is hogging NV hardware or the driver in the heavier scene?!

Sinistar · Jan 3, 2010

Could the NV OpenCL implementation be offloading some of the work to the CPU, which would explain why simple on one core of the GTX 295 is faster than an HD 5870?

Simple on my HD 5870 gets 71000K

Jawed · Jan 3, 2010

The simple scene consists of a lot of "nothing to do" rays, I guess (off to infinity and beyond). This should mean the duration of the kernel is short. The variation with CPU clock seems to suggest that CPU-side stuff is some kind of bottleneck. Finally the lower ATI performance for this scene seems to suggest that kernel launch overhead is higher on ATI.

A way to check this is simply to point the camera at "black", i.e. press cursor-up-arrow until everything disappears.

Jawed

Sinistar · Jan 3, 2010

249000K doing that.

Florin · Jan 3, 2010

~444000k with simple in the black.
~470000k with bus speed overclock / process affinity set

Jawed · Jan 3, 2010

So, that's effectively about 1500 frames per second (passes). If you maximise the application then the sample rate should increase as the CPU time should become less of a bottleneck.

Jawed
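As a rough check of the "about 1500 passes per second" figure: if each pass contributes one sample per pixel, the pass rate is just the sample rate divided by the pixel count. The 640x480 window size below is an assumption (the thread does not state it), and `passes_per_second` is a name made up for this sketch.

```python
def passes_per_second(samples_per_sec: float, width: int, height: int) -> float:
    """Pass (frame) rate, assuming one sample per pixel per pass."""
    return samples_per_sec / (width * height)

# Sample rates reported above for simple.scn pointed at black (k = thousand).
reported = {"stock": 444_000e3, "overclocked + affinity": 470_000e3}

# Assumed window size; the thread does not state it.
WIDTH, HEIGHT = 640, 480

for label, rate in reported.items():
    print(f"{label}: {passes_per_second(rate, WIDTH, HEIGHT):.0f} passes/s")
```

Under that assumption both figures land between 1400 and 1550 passes per second, which is consistent with the estimate in the post.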

Florin · Jan 3, 2010

Again, you are correct. Simple in the black w/o OC becomes 565000k/s when maximised.

This one I don't get. Why does the CPU become less of a bottleneck at full screen?

Edit: oh wait I see it actually ups the resolution when maximising
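Why a larger window helps can be sketched with a toy model: each pass pays a fixed CPU/driver cost plus GPU time proportional to the pixel count, so more pixels per pass spread the fixed cost over more samples. All the numbers below (overhead, per-sample time, both resolutions) are hypothetical and chosen only to show the trend; they are not measurements from this thread.

```python
def samples_per_second(pixels: int, overhead_s: float, per_pixel_s: float) -> float:
    """Throughput when each pass costs a fixed overhead plus per-pixel GPU time."""
    pass_time = overhead_s + pixels * per_pixel_s
    return pixels / pass_time

# Hypothetical costs, chosen only to illustrate the trend.
OVERHEAD_S = 500e-6    # fixed CPU/driver cost per pass (launch, readback, display)
PER_PIXEL_S = 0.5e-9   # GPU time per sample for rays that hit nothing

small = samples_per_second(640 * 480, OVERHEAD_S, PER_PIXEL_S)
large = samples_per_second(1920 * 1080, OVERHEAD_S, PER_PIXEL_S)
print(f"640x480:   {small / 1e6:.0f}M samples/s")
print(f"1920x1080: {large / 1e6:.0f}M samples/s")
```

In this model the sample rate rises with resolution but can never exceed `1 / PER_PIXEL_S`, the rate the GPU would reach with no per-pass overhead at all.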

pcchen · Jan 3, 2010

Dade said:
Thanks all. Performance on NVIDIA is still quite disappointing; I will take a look at the NVIDIA OpenCL samples to see if I can find an explanation for the poor performance.

Do you use any private arrays in your kernel? NVIDIA does not support indexed registers right now, so it uses global memory for them. That is going to be slow if you index private arrays in a non-predictable manner. If this is the case, and your arrays are small enough, it may be beneficial to replace them with shared memory (the local buffer in OpenCL). Of course, you'd need one array for each thread, and the local buffer is not very big (only 16KB on NVIDIA's hardware), so it's probably not going to work well if you need large arrays.
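To put the 16KB limit in perspective, here is a small sizing sketch. It assumes the local buffer is split evenly between the work-items of a single work-group and that elements are 4 bytes; the work-group sizes and function names are hypothetical. If several work-groups share a compute unit at once, the budget per work-item shrinks further.

```python
LOCAL_MEM_BYTES = 16 * 1024  # local memory size quoted above for NVIDIA hardware

def bytes_per_work_item(work_group_size: int, local_mem_bytes: int = LOCAL_MEM_BYTES) -> int:
    """Local memory available to each work-item if it is split evenly."""
    return local_mem_bytes // work_group_size

def max_array_elements(work_group_size: int, element_bytes: int = 4) -> int:
    """Largest per-work-item array (in elements) that fits in local memory."""
    return bytes_per_work_item(work_group_size) // element_bytes

for size in (64, 128, 256):  # hypothetical work-group sizes
    print(size, bytes_per_work_item(size), max_array_elements(size))
# prints:
# 64 256 64
# 128 128 32
# 256 64 16
```

So a work-group of 256 work-items leaves room for only a 16-element array of 4-byte values per work-item, which is why this trick only suits small arrays.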

Jawed · Jan 3, 2010

Dade said:
Thanks all. Performance on NVIDIA is still quite disappointing; I will take a look at the NVIDIA OpenCL samples to see if I can find an explanation for the poor performance.

On ATI the register allocation has gone down by 1 and there are slightly fewer instructions. The reduction in register allocation doesn't affect the count of hardware threads, which is approximately: floor(256/NUM_GPRS). The number of clause temporaries used can have a small impact, since these count towards register allocation.

I see the new kernel function GenerateCameraRay which looks like an attempt to control some variables' lifetime. Seems to have worked on NVidia...

Jawed
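The floor(256/NUM_GPRS) estimate explains why saving a single register often changes nothing: the thread count only moves when the register count crosses a divisor boundary. A minimal sketch follows; the register counts are hypothetical, since the thread does not give the kernel's actual allocation, and `hardware_threads` is a made-up name.

```python
def hardware_threads(num_gprs: int, register_file: int = 256) -> int:
    """Approximate hardware threads per SIMD: floor(256 / NUM_GPRS)."""
    if num_gprs <= 0:
        raise ValueError("num_gprs must be positive")
    return register_file // num_gprs

# Hypothetical register counts; the thread does not give the actual values.
for gprs in (33, 32, 30, 29):
    print(gprs, hardware_threads(gprs))
# prints:
# 33 7
# 32 8
# 30 8
# 29 8
```

Going from 33 to 32 registers gains a hardware thread, while going from 30 to 29 does not.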

Jawed · Jan 3, 2010

pcchen said:
Do you use any private array in your kernel?

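There's a 2-element array of integer seeds that's private per work-item. On NVIDIA this array is put into local memory (in the CUDA sense: per-thread storage in device memory, not OpenCL's on-chip local buffer). On ATI the compiler just uses registers.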

Jawed
