Discussion in 'Rendering Technology and APIs' started by fellix, Dec 27, 2009.
Glad to help.
NVidia users normally won't have an OpenCL CPU device on their systems, as that requires the ATI driver to be installed (this driver contains both CPU and GPU support for OpenCL, and installs on systems that don't have ATI graphics). I think Vista prevents multiple IHV drivers from being installed, but XP and W7 should be OK.
Anyway, all the fun is going to be in multiple-GPU. Talonman should be happy then.
You know me well!!
Optix Libraries FTW!
My favorite multi-GPU program is Mandelbulb. Because it uses the Optix libraries, it does an outstanding job of using all of the GPUs installed in the system.
(That's both GPUs operating in SLI mode, and in Dedicated PhysX mode.)
The CPU workload will also make use of multiple cores if available.
To Download - Post 29, page 2.
My GPU workload distribution with the app running:
I wish more CUDA libraries would just handle that, as well as Optix apparently does.
End Thread Jack!
I have read this news on the NVIDIA OpenCL forum: http://forums.nvidia.com/index.php?showtopic=153438
No idea how it works but it looks interesting for CPUs.
I have also read this post: http://forums.nvidia.com/index.php?showtopic=154710
It could be a good explanation of why the SmallptGPU CPU usage is so high for NVIDIA users.
Is there a known way to disable blocking when alternating memory buffers are being used?
or is the better question...
How can we run SmallptGPU without using alternating memory buffers altogether?
Not sure I rate their chances against AMD and Intel...
So it has no meaningful impact on actual rendering performance then. Also NVidia can change the way this OpenCL event is handled so that it's treated like an event rather than a spin.
Sorry, I still can't edit. At least my post count is going up, and some day edit will be unlocked!
Is there any reason to think that 'RUN_SCENE_SIMPLE_64SIZE' would not need to use alternating memory buffers due to a lot of "nothing to do" rays?
NVidia has coded the event triggering "badly", which uses 100% of a core. Same code on ATI doesn't do this. The OpenCL is entirely sane as far as I can tell, and this is purely down to NVidia.
The other side of the coin is that NVidia appears to have ~2x higher kernel invocation rate than ATI. This would be an advantage if ray tracing needed thousands of separate invocations per second.
This invocation rate advantage might be due to the way NVidia is handling events (i.e. the high CPU usage) or it might not. I don't know.
Actually, to be sure that invocation rate is higher on NVidia we'd need to compare an "empty kernel" on both. NVidia may have higher work-item generation rate or something else (i.e. something in the GPU) that makes "rendering infinity" faster...
I don't know if blocking sync is implemented for OpenCL on NV hardware. The "fastest" way to wait on GPU event completion is to spinlock on a memory location, but that obviously pegs a CPU core. Blocking sync allows the thread to go to sleep and be awakened by the driver later. There can be significant latency penalties (it can be orders of magnitude higher than a basic spinlock because you get into the vagaries of OS thread scheduling), but this is not a bottleneck for a number of applications.
I understand the point, but it is going to be a problem for anyone more interested in combined CPU+GPU throughput (i.e. total rendering time in my case) than in GPU latency. I guess they should use some sort of adaptive strategy (i.e. spinlock for small tasks, thread suspend for larger ones).
Anyway, the ATI beta SDK was showing exactly the same behavior (i.e. high CPU usage on communication with the GPU) and it has since been fixed in the ATI SDK 2.0 final release. I hope NVIDIA will do the same.
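To make the adaptive idea concrete, here is a minimal sketch in Python. It is not how any driver actually does it: a threading.Event stands in for the GPU completion signal, and adaptive_wait, run_task and the 0.5 ms spin budget are all made-up names and values for illustration.

```python
import threading
import time

def adaptive_wait(done: threading.Event, spin_seconds: float = 0.0005) -> str:
    """Spin briefly, then block. Returns which phase observed completion."""
    deadline = time.perf_counter() + spin_seconds
    # Phase 1: busy-wait. Lowest latency, but it burns a core while it runs.
    while True:
        if done.is_set():
            return "spin"
        if time.perf_counter() >= deadline:
            break
    # Phase 2: give up the core and let the OS wake this thread on completion.
    done.wait()
    return "block"

def run_task(duration: float) -> str:
    """Simulate a task that completes after `duration` seconds, then wait on it."""
    done = threading.Event()
    timer = threading.Timer(duration, done.set)
    timer.start()
    path = adaptive_wait(done)
    timer.join()
    return path

if __name__ == "__main__":
    print("long task finished via:", run_task(0.2))
```

A task that finishes inside the spin budget never pays the thread wake-up latency, while a long one only costs a fraction of a millisecond of spinning before the core is released.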
You could periodically poll with clGetEventInfo and sleep in between ... but that's a bit fugly, it would prove the theory though.
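Roughly what I mean, sketched in Python rather than real host code. The is_complete callback is a stand-in: in an actual OpenCL program it would call clGetEventInfo with CL_EVENT_COMMAND_EXECUTION_STATUS and compare the result against CL_COMPLETE. wait_by_polling, fake_event and the 1 ms sleep are hypothetical choices of mine.

```python
import time
from typing import Callable

def wait_by_polling(is_complete: Callable[[], bool], sleep_seconds: float = 0.001) -> int:
    """Poll a status callback, sleeping between checks. Returns the poll count."""
    polls = 0
    while True:
        polls += 1
        if is_complete():
            return polls
        time.sleep(sleep_seconds)  # yield the core instead of spinning

def fake_event(duration: float) -> Callable[[], bool]:
    """Stand-in for an OpenCL event that completes `duration` seconds from now."""
    finish = time.perf_counter() + duration
    return lambda: time.perf_counter() >= finish

start_cpu = time.process_time()
polls = wait_by_polling(fake_event(0.05))
cpu_used = time.process_time() - start_cpu
print(f"polls={polls}, cpu seconds={cpu_used:.4f}")
```

If CPU usage drops with a loop like this in place of the blocking wait while samples/sec stays about the same, that would point at the spin inside the driver.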
Busy waits inside the OpenCL driver don't make sense. If the developer wants a busy wait, he can do that in his own program (same as above without the sleep); what he can't do himself is implement the driver-level code needed for a real blocking call.
I get goosebumps just reading this conversation...
After some brief trial and error I've managed to manually edit the scene files, and now I'm able to modify every parameter (colour, luminosity, position and surface type) and add more spheres to the scene. I guess it would be much easier with an editor.
Anyway, my immediate observation was that adding more lights considerably decreases the sampling rate -- four versus one light source cuts the performance nearly in half, both for GPU and CPU device selection.
Fellix, SmallptGPU doesn't include any acceleration structure, which means that if you double the number of objects, the rendering time doubles too. It is about the same for light sources.
For instance, http://davibu.interfree.it/opencl/smallluxgpu/smallluxGPU.html instead has a simple acceleration structure (a BVH) and is able to handle a lot more objects (262000 in the case of the Luxball scene and/or 62 light sources in the case of the Loft scene).
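A small Python sketch of why the cost scales linearly without an acceleration structure. This is not the SmallptGPU kernel, just a brute-force closest-hit loop over spheres; hit_sphere, closest_hit and make_scene are illustrative names, and the ray direction is assumed to be unit length.

```python
import math

def hit_sphere(origin, direction, center, radius):
    """Nearest positive hit distance along a unit-length ray, or None on a miss."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > 1e-4:
            return t
    return None

def closest_hit(origin, direction, spheres):
    """Brute force: test every sphere. Returns (distance, tests performed)."""
    best, tests = None, 0
    for center, radius in spheres:
        tests += 1
        t = hit_sphere(origin, direction, center, radius)
        if t is not None and (best is None or t < best):
            best = t
    return best, tests

def make_scene(n):
    """n unit spheres lined up along +z, the first one centred at z = 5."""
    return [((0.0, 0.0, 5.0 + 3.0 * i), 1.0) for i in range(n)]

origin, direction = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
for n in (8, 16):
    dist, tests = closest_hit(origin, direction, make_scene(n))
    print(f"{n} spheres: hit at {dist}, {tests} intersection tests")
```

The answer is the same in both scenes, but every ray pays one test per sphere, so doubling the scene doubles the work. A BVH skips whole groups of objects whose bounding box the ray misses.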
Er, it has no impact on PCIe bandwidth because it's not spinning across PCIe.
If you guys want to read more about Nvidia performance, this might be worth keeping an eye on?
It is ongoing performance testing by Steven Robertson (srobertson on Nvidia's forum.)
I realize the test is being done in CUDA, but figure some performance lessons might also be applied directly to OpenCL.
I made some more changes in the area of GPU memory management to try to improve NVIDIA performance: http://davibu.interfree.it/opencl/smallptgpu/smallptgpu-v1.6alpha.tgz
Can someone with an NVIDIA card give it a try? Thanks.
OpenCL Platform 0: NVIDIA Corporation
OpenCL Device 0: Type = TYPE_GPU
OpenCL Device 0: Name = GeForce GTX 260
OpenCL Device 0: Compute units = 27
OpenCL Device 0: Max. work group size = 512
[SELECTED] OpenCL Device 0: Type = TYPE_GPU
[SELECTED] OpenCL Device 0: Name = GeForce GTX 260
[SELECTED] OpenCL Device 0: Compute units = 27
[SELECTED] OpenCL Device 0: Max. work group size = 512
Reading file 'rendering_kernel.cl' (size 3230 bytes)
OpenCL Device 0: kernel work group size = 128
OpenCL Device 0: forced kernel work group size = 128
Samples/sec 46k (fluctuates a bit up and down, but seems to stay within 46k).
But it doesn't work for any sizes over 128:
Reading file 'rendering_kernel.cl' (size 3230 bytes)
OpenCL Device 0: kernel work group size = 128
OpenCL Device 0: forced kernel work group size = 512
Failed to enqueue OpenCL work: -5
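For reference, -5 is CL_OUT_OF_RESOURCES in the OpenCL headers. A plausible reading, though only an assumption from the log, is that the forced size of 512 exceeds the 128 reported for this kernel. A small Python sketch of decoding the status and clamping the forced size; pick_work_group_size is a hypothetical helper, not something in SmallptGPU.

```python
# Subset of OpenCL status codes, with values as defined in cl.h.
CL_ERROR_NAMES = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -54: "CL_INVALID_WORK_GROUP_SIZE",
}

def describe_cl_error(code: int) -> str:
    """Translate a numeric OpenCL status into its symbolic name."""
    return CL_ERROR_NAMES.get(code, f"unknown OpenCL status {code}")

def pick_work_group_size(requested: int, kernel_max: int, device_max: int) -> int:
    """Hypothetical helper: clamp a forced size to what kernel and device both allow."""
    return max(1, min(requested, kernel_max, device_max))

print(describe_cl_error(-5))
print(pick_work_group_size(512, kernel_max=128, device_max=512))
```

With the values from the log above, the clamp falls back to 128, the size that worked.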
On Cornell I get about 10500/10600k.
Ubuntu 9.10 x64 with Core i7-920