The benchmark is my own work on realtime GI. The workloads are breadth-first traversals of a BVH / point hierarchy, raytracing for visibility, and building acceleration structures. It's not comparable to classic raytracing though: complexity is much higher and random access is mostly avoided. The general structure of the programs is: load from memory, heavy processing in LDS, write to memory. i rarely access memory during the processing phase, and there is a lot of integer math, scan algorithms, and bit packing to reduce LDS usage. Occupancy is good, 70-80% overall. It's compute only - no rasterization or texture sampling.
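To make the bit packing idea concrete, here is a minimal CPU-side sketch in Python. The field layout (24-bit node index, 7-bit child count, 1-bit leaf flag) and the names `pack_node` / `unpack_node` are assumptions made up for illustration, not taken from the actual shaders.

```python
# Hypothetical layout of one 32-bit LDS word (an assumption for illustration):
#   bits  0..23  node index   (24 bits)
#   bits 24..30  child count  (7 bits)
#   bit  31      leaf flag    (1 bit)
INDEX_BITS = 24
COUNT_BITS = 7
INDEX_MASK = (1 << INDEX_BITS) - 1
COUNT_MASK = (1 << COUNT_BITS) - 1


def pack_node(index, count, is_leaf):
    """Pack three fields into a single unsigned 32-bit word."""
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("index does not fit in 24 bits")
    if not 0 <= count <= COUNT_MASK:
        raise ValueError("count does not fit in 7 bits")
    return index | (count << INDEX_BITS) | (int(bool(is_leaf)) << 31)


def unpack_node(word):
    """Recover (index, count, is_leaf) from a packed 32-bit word."""
    index = word & INDEX_MASK
    count = (word >> INDEX_BITS) & COUNT_MASK
    is_leaf = bool(word >> 31)
    return index, count, is_leaf
```

If each field were otherwise stored as its own 32-bit value, packing like this would fit three times as many nodes into the same LDS budget, at the cost of a few shifts and masks per access.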
The large AMD lead was constant over many years and APIs (OpenGL, OpenCL 1.2, finally Vulkan). The factor of 5 i remember is from the latest test in Vulkan, comparing GTX 670 vs. 7950 two years ago.
Many years ago i bought a 280X to see how 'crappy' AMD performs with my stuff, and i could not believe it: out of the box it destroyed the Kepler Titan by a factor of two.
At that time i also switched from OpenGL to OpenCL, which helped a lot with NV performance but only a little with AMD. I concluded that neither AMD's hardware nor their drivers are 'crappy'.
Adding this to the disappointment of the GTX 670 not being faster than the GTX 480, i skipped the following NV generations. I also rewrote my algorithm, which i did on CPU.
Years later, after porting the results back to GPU (using OpenCL and Vulkan), i saw that even after the heavy changes the performance difference was the same. Only rarely is a shader (i have 30-40) an exception.
I also compared newer hardware: Fury X vs. GTX 1070. Thankfully it showed NV doing well. Both cards have the same performance per TFLOPS, AMD just offers more TFLOPS per dollar. So until i get my hands on Turing and RDNA i don't know how things have changed further.
Recently i learned that Kepler has no atomics to LDS and emulates them with main memory. That's certainly a factor, but it can't be that large - i always tried things like comparing a scan algorithm vs. atomic max and picked the faster one per GPU model.
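As a rough illustration of the "scan vs. atomic max" comparison, here is a CPU-side Python sketch. It only mimics the structure: a list stands in for LDS, the lanes run one after another, and all function names are hypothetical. It assumes a non-empty input.

```python
import time


def max_by_scan(values):
    """Inclusive max-scan in log2(n) steps (Hillis-Steele style).

    The list stands in for an LDS array with one element per lane.
    After the last step the final element holds the maximum.
    """
    lds = list(values)
    offset = 1
    while offset < len(lds):
        # Snapshot first, then write: this mimics the barrier between
        # the read and the write of each step on a GPU.
        prev = list(lds)
        for lane in range(offset, len(lds)):
            lds[lane] = max(prev[lane], prev[lane - offset])
        offset *= 2
    return lds[-1]


def max_by_atomic(values):
    """Every lane applies an 'atomic max' to one shared cell.

    Python runs the lanes sequentially, so the atomic is just a
    plain read-modify-write here.
    """
    shared = values[0]
    for v in values[1:]:
        shared = max(shared, v)
    return shared


def pick_faster(variants, values, repeats=100):
    """Time each variant on this machine and return the name of the fastest."""
    best_name, best_time = None, float("inf")
    for name, fn in variants.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(values)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

On a real GPU the timing would be done per kernel and per GPU model. The CPU timing here only shows the selection mechanism and says nothing about which variant wins on Kepler or GCN.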
So it remains a mystery why Kepler is so bad.
If you have an idea let me know, but it's too late to test it - seems the 670 has died recently :/
One interesting thing is that AMD benefits much more from optimization, and i tried really hard here because GI is quite heavy.
Also, NV seems much more forgiving of random access, so maybe my workload is an exception here compared to other compute benchmark workloads.