Discussion in 'Rendering Technology and APIs' started by fellix, Dec 27, 2009.
Yes, I am running at: C=621, Shaders=1512, and Memory=1152
BTW - Where is the 'edit post' button?
Err I think that's reserved for members over a certain post count. Kind of a pain, but I guess newbie stealth edits ruined it..
Thanks all. The performance on NVIDIA is still quite disappointing. I will take a look at the NVIDIA OpenCL samples to see if I can find an explanation for the poor performance.
If we knew what the CPU was up to, I think it would help to put our finger on the performance issue.
I've tried a mild overclock (133 bus speed to 145) on the i7, which gave a single core top speed of 3.625 GHz, and it didn't seem to affect the score.
If you run the 6600 at stock speed, do the scores change?
What is the CPU load like when running simple.scn? That might be a clue.
12-13% of 4 cores/4 HT. About 91800k samples/sec
The Win7 scheduler bounces the process around between the 4 real cores. If I force the affinity of smallptgpu.exe to just 1 core, it just maxes that out 100% and the score doesn't change.
Hmm, when overclocking the CPU to a 145 bus speed (which actually gives a top speed of 3760, as the max turbo multiplier seems to be 26 on this CPU instead of 25 like I thought), the score does seem to go a bit higher, to a mean of 92400k or so. Too small to be significant though, I reckon.
That's strange, I'm getting just over 70000K in this scene with the 5870 -- nowhere near Talonman's numbers.
Looks like some specific workload is hogging NV hardware or the driver in the heavier scene?!
Could the NV OpenCL be offloading some of the work to the CPU, explaining why simple on one core of the GTX 295 is faster than a HD 5870?
Simple on my HD 5870 gets 71000K
The simple scene consists of a lot of "nothing to do" rays, I guess (off to infinity and beyond). This should mean the duration of the kernel is short. The variation with CPU clock seems to suggest that CPU-side stuff is some kind of bottleneck. Finally the lower ATI performance for this scene seems to suggest that kernel launch overhead is higher on ATI.
A way to check this is simply to point the camera at "black", i.e. press cursor-up-arrow until everything disappears.
249000K doing that.
~444000k with simple in the black.
~470000k with bus speed overclock / process affinity set
So, that's effectively about 1500 frames per second (passes). If you maximise the application then the sample rate should increase as the CPU time should become less of a bottleneck.
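For anyone wanting to check that estimate, here is a rough sketch of the arithmetic. It assumes the window is 640x480 and that each pass takes one sample per pixel; neither is stated in the thread, and `passes_per_second` is just an illustrative helper name.

```python
def passes_per_second(samples_per_sec, width=640, height=480):
    """Convert a samples/sec score into full-frame passes per second.

    Assumes one sample per pixel per pass and a 640x480 window
    (both are assumptions, not figures given in the thread).
    """
    return samples_per_sec / (width * height)


if __name__ == "__main__":
    # Scores reported above for simple.scn with the camera pointed at black.
    scores = [
        ("stock", 444_000_000),
        ("bus overclock + affinity", 470_000_000),
    ]
    for label, score in scores:
        print(f"{label}: {passes_per_second(score):.0f} passes/s")
```

Under those assumptions both scores land near the 1500 passes per second figure.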
Again, you are correct. Simple in the black w/o OC becomes 565000k/s when maximised.
This one I don't get. Why does the CPU become less of a bottleneck at full screen?
Edit: oh wait I see it actually ups the resolution when maximising
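A toy model of why more pixels per pass helps, assuming each pass costs a fixed CPU-side overhead plus a per-pixel GPU cost. The overhead, per-pixel time and the larger resolution below are made-up illustrative values, not measurements from this thread.

```python
def samples_per_second(pixels, overhead_s, per_pixel_s):
    """Throughput when a pass costs a fixed overhead plus per-pixel work."""
    return pixels / (overhead_s + pixels * per_pixel_s)


if __name__ == "__main__":
    overhead_s = 0.5e-3    # hypothetical per-pass launch/CPU overhead
    per_pixel_s = 0.5e-9   # hypothetical GPU time per pixel (camera at black)
    for width, height in [(640, 480), (1680, 1050)]:  # hypothetical sizes
        rate = samples_per_second(width * height, overhead_s, per_pixel_s)
        print(f"{width}x{height}: {rate / 1e6:.0f}M samples/s")
```

The fixed overhead is spread over more pixels in the larger window, so the samples/sec figure rises even though nothing got faster.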
Do you use any private arrays in your kernel? NVIDIA does not support indexed registers right now, so it uses global memory for them. So it's going to be slow if you index private arrays in a non-predictable manner. If this is the case, and your arrays are small enough, it may be beneficial to substitute them with shared memory (the local buffer in OpenCL). Of course, you'd need one for each thread and the local buffer is not very big (only 16KB on NVIDIA's hardware), so it's probably not going to work well if you need large arrays.
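To get a feel for how tight that 16KB budget is, here is a quick sketch of the per-thread array size that fits. It assumes the whole local buffer is available to these arrays and that elements are 4 bytes; the work-group sizes are hypothetical examples.

```python
LOCAL_MEM_BYTES = 16 * 1024  # local buffer size quoted above for NVIDIA


def max_array_elems_per_work_item(work_group_size, elem_bytes=4,
                                  local_mem_bytes=LOCAL_MEM_BYTES):
    """Largest per-work-item array (in elements) that fits in local memory.

    Assumes nothing else in the kernel uses the local buffer.
    """
    return local_mem_bytes // (work_group_size * elem_bytes)


if __name__ == "__main__":
    for size in (64, 256):  # hypothetical work-group sizes
        elems = max_array_elems_per_work_item(size)
        print(f"work-group of {size}: up to {elems} elements per work-item")
```

So with 256 work-items per group there is room for only 16 floats or ints each, before counting any other local memory use.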
On ATI the register allocation has gone down by 1 and there are slightly fewer instructions. The reduction in register allocation doesn't affect the count of hardware threads, which is approximately: floor(256/NUM_GPRS). The number of clause temporaries used can have a small impact, since these count towards register allocation.
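A small sketch of that floor(256/NUM_GPRS) rule, showing why dropping a single register often changes nothing. The GPR counts below are made up for illustration; the actual allocation for this kernel isn't given here.

```python
def hardware_threads(num_gprs, register_file=256):
    """Approximate hardware thread count: floor(256 / NUM_GPRS)."""
    return register_file // num_gprs


if __name__ == "__main__":
    # Hypothetical before/after register counts, each a reduction by 1.
    for before, after in [(21, 20), (33, 32)]:
        print(f"{before} -> {after} GPRs: "
              f"{hardware_threads(before)} -> {hardware_threads(after)} threads")
```

Going from 21 to 20 registers stays at 12 threads, while 33 to 32 crosses a boundary and goes from 7 to 8, so a one-register saving only matters near those thresholds.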
I see the new kernel function GenerateCameraRay which looks like an attempt to control some variables' lifetime. Seems to have worked on NVidia...
There's a 2-element array of integer seeds that's private per work-item. This array is put into local memory. On ATI the compiler just uses registers.