Or a faster OpenCL implementation (I got 1200 kSamples per second with my CUDA version on an 8800GTS).
So, good news: I figured out how to speed up the OpenCL version by a lot (30x on my GPU), and the change was tiny compared to the time I spent experimenting. In "geomfunc.h" and "rendering_kernel.cl", replace all occurrences of OCL_CONSTANT_BUFFER with __constant. (All this time I thought the former was already defined as the latter.) I knew constant memory wasn't being used properly. I'm curious whether ATI users get the same speedup.
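For anyone wondering how a macro like OCL_CONSTANT_BUFFER is typically set up, here is a minimal sketch of a header that compiles both as plain host C and as OpenCL C. This is my assumption, not the actual definition in geomfunc.h: the macro name SHARED_CONST, the trimmed-down Sphere struct, and the CountLargeSpheres helper are all hypothetical. The idea is that the predefined __OPENCL_VERSION__ macro is only set by an OpenCL C compiler, so the qualifier becomes __constant in the kernel build and disappears in the host build. If the kernel-side branch expands to nothing or to __global instead, the sphere array stays in ordinary global memory, which would fit the slowdown described above.

```c
/* Hypothetical macro; the real OCL_CONSTANT_BUFFER definition may differ. */
#ifdef __OPENCL_VERSION__
#define SHARED_CONST __constant /* kernel build: read-only constant memory */
#else
#define SHARED_CONST            /* host build: qualifier expands to nothing */
#endif

/* Trimmed-down scene type (hypothetical layout). */
typedef struct {
    float rad;  /* radius */
    float p[3]; /* center */
} Sphere;

/* Stand-in for any routine that only reads the scene data,
   such as the sphere intersection loop: counts spheres with
   a radius larger than minRad. */
unsigned int CountLargeSpheres(SHARED_CONST const Sphere *spheres,
                               const unsigned int sphereCount,
                               const float minRad) {
    unsigned int n = 0;
    for (unsigned int i = 0; i < sphereCount; i++) {
        if (spheres[i].rad > minRad)
            n++;
    }
    return n;
}
```

On the kernel side the matching parameter would be declared along the lines of __constant Sphere *spheres, while the host still passes an ordinary buffer through clSetKernelArg. Constant memory is small, though: query CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE (the spec's minimum is 64 KB) before relying on it for big scenes.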
On my 8800GTS, one other change is also needed; otherwise the speedup is only 2x. On line 82 in "geomfunc.h", replace
Code:
unsigned int i = sphereCount;
for (; i--;) {
with
Code:
for (unsigned int i = 0; i < sphereCount; i++) {
It's really strange that this would make such a difference, but it did for me. Even weirder, making the same change on line 102 had barely any effect, even though that loop runs almost as often. Probably a compiler bug. ATI users: do you see a difference?
I'll try to package it all later tonight along with other changes. I wish NVIDIA's 64-bit SDK didn't make it such a pain to create 32-bit binaries. I might have to uninstall it.