Hi David,
After going through a few driver revisions, I finally got SmallptGPU running on my machine (Core 2 Duo 2.4GHz, 8800 GTS 640MB). I had been playing with the original Smallpt program first, to understand the algorithm better and fool around with it.
Running SmallptGPU, I was getting 370k samples/s, whereas I got 500k samples/s multithreaded on my CPU. I then realized that you had made some algorithm changes in the direct lighting section. When I restored it to match the original Smallpt code, I got 900k samples/s (and the image matched the CPU version, too).
I see how the direct lighting lowers the number of passes needed for any target image quality, but I didn't expect such a dropoff in performance (though looking back I can see how it doubles the ray tests for diffuse hits). So in that function I tried replacing "for (i = 0; i < sphereCount; i++)" with "i = 8;" (and the "continue" with a "return"), and performance nearly doubled, to 635k samples/s. This is really bizarre, because it's completely coherent branching. I also tried taking out the ray tests in SampleLights() to get a feel for the performance (this is the ideal place to do it because it doesn't affect program flow), and doing some math I get 70 ns to test a ray, or about 8000 shader cycles on my card. That's pretty damn steep for nine ~20-cycle sphere-ray tests!
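For anyone who wants to check that arithmetic, here is a minimal sketch in C. The 96 SPs are my card's count; the 1.2 GHz shader clock is my assumption for an 8800 GTS 640MB, and the function names are made up for illustration.

```c
/* SP count of the card described in this post. */
#define SP_COUNT 96
/* Assumed shader clock for an 8800 GTS 640MB (not measured). */
#define ASSUMED_SHADER_CLOCK_HZ 1.2e9

/* Shader cycles, summed over all SPs, that elapse in `seconds`. */
double shader_cycles(double seconds)
{
    return seconds * SP_COUNT * ASSUMED_SHADER_CLOCK_HZ;
}

/* Cycles one ray would need if each sphere test costs `cycles_per_test`. */
double expected_cycles(int sphere_count, double cycles_per_test)
{
    return sphere_count * cycles_per_test;
}
```

Under those assumptions, shader_cycles(70e-9) comes to roughly 8000, against expected_cycles(9, 20) = 180, so the measured cost is on the order of 45x the naive estimate.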
(FYI, in SampleLights() you need to delete the FLOAT_PI factor to get the lighting to match the regular unbiased path tracing algorithm. I worked through the math, and it makes sense.)
It's not branch incoherence, because I eliminated all divergent branching by deleting everything but the DIFF block (thus making all objects diffuse), and performance was the same. I was hoping to use NVidia's OpenCL profiler, but it crashes for me.
So I finally decided to play around with CUDA, and after a bunch of frustration I got a port working. Basing it on your original OpenCL code (370k samples/s for me, or ~5 MRays/s), I first got 10 MRays/s. A decent improvement, but still not where I wanted it to be, particularly because I hadn't gotten constant buffers working yet. When I did, I got up to 330 MRays/s! Now that's more like it! (~20 instr. * 9 spheres + ~60 instr. between rays) * 330M is about 70% of my card's peak rate, ignoring divergence and estimation errors, so it's probably about as optimal as it gets. Is it true that NVidia has no caching of memory access unless you have an image object? The sphere data is 400 freaking bytes, so it's shocking that it can be the cause of a 30x difference in performance.
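The "about 70% of peak" figure can be reproduced with a rough sketch like the one below, in C. It assumes a peak of one instruction per SP per shader clock and a 1.2 GHz shader clock for the 8800 GTS 640MB; both are assumptions rather than measurements, and the function names are hypothetical.

```c
/* SP count of the card described in this post. */
#define SP_COUNT 96
/* Assumed shader clock for an 8800 GTS 640MB (not measured). */
#define ASSUMED_SHADER_CLOCK_HZ 1.2e9

/* Rough instruction count per ray: sphere tests plus the work between rays. */
double instr_per_ray(int spheres, double instr_per_test, double instr_between)
{
    return spheres * instr_per_test + instr_between;
}

/* Fraction of peak issue rate, assuming one instruction per SP per cycle. */
double fraction_of_peak(double rays_per_sec, double instr_per_ray_est)
{
    double peak = SP_COUNT * ASSUMED_SHADER_CLOCK_HZ;
    return rays_per_sec * instr_per_ray_est / peak;
}
```

With the estimates above, instr_per_ray(9, 20, 60) is 240 and fraction_of_peak(330e6, 240) comes out just under 0.7.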
Anyway, this is so awesome. I had no idea that a brute force method like path tracing was so close to real-time. Attached below is an image of what my lowly 96 SP card can do in just three seconds, and I wouldn't be surprised if the 5870 were 20x as fast as my card. So David, if you can figure out how to properly get data into the constant buffers with OpenCL, you'll see monstrous speed improvements. The second best option is to use image objects.
I only have a 64-bit executable, but for those of you with NVidia cards, here it is:
http://www.its.caltech.edu/~nandra/SmallptCUDA.zip