> NVidia users normally won't have a CPU capability on their systems as they need the ATI driver to be installed (this driver contains both CPU and GPU support for OpenCL, and installs on systems that don't have ATI graphics). I think Vista prevents multiple IHV drivers from being installed, but XP and W7 should be OK.

I have already started to work on SmallptGPU 2.0 in order to test/learn how OpenCL support for multiple devices works (i.e. CPU + GPU, GPU + GPU, etc.).
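For the curious, the host-side setup for multiple devices looks roughly like this. It's just a minimal sketch (not the actual SmallptGPU 2.0 code), it assumes both devices are exposed by the same OpenCL platform (devices from different vendors live on different platforms and can't share a context), and error checking is omitted:

```c
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    /* Grab up to two devices of any type (e.g. one CPU + one GPU). */
    cl_device_id devices[2];
    cl_uint deviceCount;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 2, devices, &deviceCount);

    /* A single context can span all the devices of one platform... */
    cl_context ctx = clCreateContext(NULL, deviceCount, devices,
                                     NULL, NULL, NULL);

    /* ...but each device needs its own command queue. */
    cl_command_queue queues[2];
    for (cl_uint i = 0; i < deviceCount; ++i)
        queues[i] = clCreateCommandQueue(ctx, devices[i], 0, NULL);

    /* The workload can then be partitioned, e.g. by enqueueing the same
     * kernel on each queue with a different slice of the image. */

    for (cl_uint i = 0; i < deviceCount; ++i)
        clReleaseCommandQueue(queues[i]);
    clReleaseContext(ctx);
    return 0;
}
```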
> Not sure I rate their chances against AMD and Intel...

I have read this news on the NVIDIA OpenCL forum: http://forums.nvidia.com/index.php?showtopic=153438
No idea how it works but it looks interesting for CPUs.
> So it has no meaningful impact on actual rendering performance then. Also, NVidia can change the way this OpenCL event is handled so that it's treated like an event rather than a spin.

I have also read this post: http://forums.nvidia.com/index.php?showtopic=154710
It could be a good explanation of why SmallptGPU's CPU usage is so high for NVIDIA users.
I don't know if blocking sync is implemented for OpenCL on NV hardware. The "fastest" way to wait on GPU event completion is to spinlock on a memory location, but that obviously pegs a CPU core. Blocking sync allows the thread to go to sleep and be awakened by the driver later. There can be significant latency penalties (it can be orders of magnitude higher than a basic spinlock because you get into the vagaries of OS thread scheduling), but this is not a bottleneck for a number of applications.
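To make the two strategies concrete, here is what they look like in plain OpenCL host code. This is purely illustrative (how a given driver actually implements clWaitForEvents internally is up to the vendor):

```c
#include <CL/cl.h>

/* Busy-wait: poll the event status in a tight loop. Lowest latency,
 * but it pegs one CPU core for the whole duration of the kernel. */
static void spin_wait(cl_event ev) {
    cl_int status;
    do {
        clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
    } while (status > CL_COMPLETE);  /* CL_COMPLETE is 0; negative = error */
}

/* Blocking wait: the calling thread can be put to sleep and woken up
 * by the driver, so CPU usage drops to ~0% at the cost of the OS
 * thread-scheduling latency described above. */
static void blocking_wait(cl_event ev) {
    clWaitForEvents(1, &ev);
}
```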
Anyway, my immediate observation was that adding more lights considerably decreases the sampling rate -- four light sources versus one cuts the performance nearly in half, for both GPU and CPU device selection.
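This scaling makes sense if the kernel traces one shadow ray per light source at every hit point, as a smallpt-style direct-lighting loop does: shadow rays are only part of the per-sample work, which would explain why 4x the lights costs roughly 2x the time rather than 4x. Something along these lines (a hypothetical OpenCL C fragment, not the actual SmallptGPU kernel; Scene, sample_light, shadow_ray_blocked and light_contribution are made-up names):

```c
/* Direct lighting at a hit point: one shadow ray per light source,
 * so the cost of this loop grows linearly with scene->lightCount. */
float3 direct_lighting(const Scene *scene, const float3 hitPoint,
                       const float3 normal) {
    float3 radiance = (float3)(0.f, 0.f, 0.f);
    for (unsigned int i = 0; i < scene->lightCount; ++i) {
        /* Pick a point on light i and trace a shadow ray towards it. */
        const float3 dir = sample_light(&scene->lights[i], hitPoint);
        if (!shadow_ray_blocked(scene, hitPoint, dir))
            radiance += light_contribution(&scene->lights[i],
                                           hitPoint, normal, dir);
    }
    return radiance;
}
```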
> Er, it has no impact on PCIe bandwidth because it's not spinning across PCIe.

I understand the point, but it is going to be a problem for anyone more interested in total CPU+GPU throughput (i.e. total rendering time, in my case) than in GPU latency. I guess they should use some sort of adaptive strategy (i.e. spinlock for small tasks, thread suspend for larger ones).
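Such an adaptive wait could be as simple as the sketch below. It is only illustrative (a real driver would do this internally, and SPIN_ITERATIONS is an arbitrary made-up threshold):

```c
#include <CL/cl.h>

#define SPIN_ITERATIONS 100000  /* made-up tuning knob */

static void adaptive_wait(cl_event ev) {
    cl_int status = CL_QUEUED;

    /* Phase 1: spin for a bounded number of polls, keeping latency
     * minimal when the enqueued task is short. */
    for (long i = 0; i < SPIN_ITERATIONS; ++i) {
        clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        if (status <= CL_COMPLETE)  /* completed (or failed) */
            return;
    }

    /* Phase 2: the task is long, so give the core back to the OS and
     * sleep until the driver signals completion. */
    clWaitForEvents(1, &ev);
}
```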
Anyway, the ATI beta SDK was showing exactly the same behavior (i.e. high CPU usage when communicating with the GPU), and it has since been fixed in the ATI SDK 2.0 final release. I hope NVIDIA will do the same.