Or a faster OpenCL implementation (I got 1200 kSamples per second with my CUDA version on an 8800GTS).
So good news: I figured out how to get the OpenCL version to speed up by a lot (30x on my GPU), and the change was so simple compared to the time I spent experimenting. In "geomfunc.h" and "rendering_kernel.cl", replace all occurrences of OCL_CONSTANT_BUFFER with __constant (I thought the former was already defined as the latter all this time). I knew constant memory wasn't being used properly. I'm curious if ATI users get the same speedup.
For my 8800GTS, there is also one other thing that has to be changed, or else the speedup is only 2x. On line 82 in "geomfunc.h", replace
Code:
for (unsigned int i = 0; i < sphereCount; i++) {
with
Code:
unsigned int i = sphereCount; for (; i--;) {
It's really strange that this would make such a difference, but it did for me. Even weirder is that doing the same on line 102 had barely any effect, and it's called almost as frequently. Probably a compiler bug. ATI users: do you see a difference?
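For anyone wondering whether the count-down loop changes the result: it shouldn't, since it visits the same indices in reverse order. Here is a minimal plain-C sketch of that, not the actual code from "geomfunc.h". The function names, the `dist` array, and the "smallest positive distance wins" test are my own simplified stand-ins for the sphere intersection loop.

```c
#include <float.h>

/* Hypothetical stand-in for the sphere loop: returns the index of the
 * smallest positive entry in dist[], or -1 if there is none.
 * Forward form, as in the original line. */
int nearest_forward(const float *dist, unsigned int sphereCount) {
    float t = FLT_MAX;
    int id = -1;
    for (unsigned int i = 0; i < sphereCount; i++) {
        if (dist[i] > 0.0f && dist[i] < t) {
            t = dist[i];
            id = (int)i;
        }
    }
    return id;
}

/* Same search with the count-down form. The condition `i--` tests i
 * before decrementing it, so the body sees sphereCount-1 down to 0
 * and the loop is skipped entirely when sphereCount is 0. */
int nearest_countdown(const float *dist, unsigned int sphereCount) {
    float t = FLT_MAX;
    int id = -1;
    unsigned int i = sphereCount;
    for (; i--;) {
        if (dist[i] > 0.0f && dist[i] < t) {
            t = dist[i];
            id = (int)i;
        }
    }
    return id;
}
```

The only behavioural difference in this sketch is tie-breaking: if two entries have exactly the same distance, the forward loop keeps the lower index and the count-down loop keeps the higher one. So any speed difference would come from how the compiler handles the loop, not from the loop doing less work.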
I'll try to package it all later tonight along with other changes. I wish NVidia's 64-bit SDK didn't make it such a pain to create 32-bit binaries. I might have to uninstall it.
Replacing OCL_CONSTANT_BUFFER with __constant has no effect on performance on an HD5870 with Cat. 8.12Beta.
Replacing the second bit of code gave me only a minimal improvement, from 14834Ks/s to 15000Ks/s (about 1%).
Tested on SmallPT 2.0 alpha 2 using only GPU accel. on HD5870 stock clocks.