The register allocation of the PS version is 18 vec4s, which, assuming the CS version is similar, means the hardware can't run a full 16 hardware threads on each SIMD.
But our theory was that it only schedules full groups, so for instance at size 256 we will have just 3 groups running at the same time. As we have no memory access (except the final write), all we need to keep the hardware busy is 2 interleaved hardware threads. And for that 2 groups should be enough (in theory at least). And at least this shouldn't be affected by the 2D layout, only by the total number of threads.
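To put numbers on that theory, here is a rough sketch of the occupancy arithmetic. The budget of 256 vec4 registers per SIMD lane and the 64-wide hardware thread are my assumptions (they are not stated above), and the function names are made up for illustration. Whole-group scheduling is the theory from this post, not a confirmed fact.

```python
def max_hw_threads(regs_per_thread, regs_per_lane=256, hw_limit=16):
    # Hardware threads that fit on one SIMD, limited by register use.
    # regs_per_lane=256 is an assumed register budget, not a measured one.
    return min(hw_limit, regs_per_lane // regs_per_thread)

def resident_groups(group_size, regs_per_thread, wave_size=64):
    # Whole groups that fit at once, assuming only full groups are scheduled.
    waves_per_group = -(-group_size // wave_size)  # ceiling division
    return max_hw_threads(regs_per_thread) // waves_per_group

for size in (64, 128, 256, 512):
    print(size, resident_groups(size, regs_per_thread=18))
```

With 18 registers this gives 14 hardware threads instead of 16, and 3 resident groups at size 256, which matches the numbers above.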
The performance of those narrow group layouts is really puzzling. Branch coherence should definitely be best in 8x8 pixel blocks, so how can 2x128 ever be better than 8x32?
Maybe it's time to check whether the 2D index -> hardware thread index mapping is as expected.
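For reference, this is the mapping I would expect, as a small sketch: threads are flattened row-major (index = y * width + x) and cut into consecutive runs of 64. Both the row-major order and the 64-wide hardware thread are assumptions about the hardware, which is exactly what needs checking. The function name is hypothetical.

```python
def wavefront_footprints(width, height, wave_size=64):
    # Bounding box (w, h) in pixels covered by each hardware thread of a group,
    # assuming row-major flattening of the 2D thread index.
    boxes = []
    total = width * height
    for first in range(0, total, wave_size):
        ids = range(first, min(first + wave_size, total))
        xs = [i % width for i in ids]
        ys = [i // width for i in ids]
        boxes.append((max(xs) - min(xs) + 1, max(ys) - min(ys) + 1))
    return boxes

for layout in [(8, 32), (2, 128), (16, 16)]:
    print(layout, wavefront_footprints(*layout)[0])
```

Under these assumptions an 8x32 group splits into four 8x8 blocks, while 2x128 gives 2x32 strips. If 2x128 really is faster, the actual mapping is probably something else.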
Is there a way to add some sort of AA sampling to the generated frames? Pixel aliasing on the fine structures is quite pronounced.
We could always add supersampling, but that would be pretty costly. For branch granularity it would be best to have just 1 sample/thread, and then maybe add them up in LDS. A more adaptive approach is problematic because of the branching. Sphere tracing is already a lot more costly on the edges (which is where we need the additional samples), so adding more samples to the already slowest strands would be quite inefficient. Could be interesting to try append buffers to schedule the additional work for later.
Got around to writing new pixel-work-sharing versions, after the initial 16 pix/thread idea failed on branch coherency.
First, a somewhat complicated single-pass scheme that *should* have decent performance: each group is 8*32+16 threads (no 2D index here), where the initial hard work (calculating a lower bound on depth for each 4x4 pixel block) is done in the last 16 threads (8*32 pixels = 16 4x4 pixel blocks). After this we synchronize through LDS with the remaining 256 threads so they can do lighter work than usual. This performs slightly worse than the plain version, even though less work is done.
Of course the initial hardware thread is only running 16 strands, but it's only a small part of the workload (we have 4 more hardware threads in each group). And as I want at least 2 groups running I can't go for a larger group size. Still, compared to the version below it's less than expected.
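To make the thread layout concrete, here is a sketch of how a flat thread index could be split into the two roles. The row-major order of pixels and blocks inside the 8x32 tile is my assumption, and `thread_role` is a made-up helper, not the actual shader code.

```python
GROUP_W, GROUP_H, BLOCK = 8, 32, 4
PIXEL_THREADS = GROUP_W * GROUP_H              # 256 per-pixel threads
BLOCKS_X = GROUP_W // BLOCK                    # 2 blocks across
BLOCK_THREADS = BLOCKS_X * (GROUP_H // BLOCK)  # 16 block threads at the end

def thread_role(tid):
    # Last 16 threads: one per 4x4 block, computing the depth lower bound.
    if tid >= PIXEL_THREADS:
        b = tid - PIXEL_THREADS
        return ("block", b % BLOCKS_X, b // BLOCKS_X)
    # First 256 threads: one per pixel, plus the LDS slot of its block.
    x, y = tid % GROUP_W, tid // GROUP_W
    return ("pixel", x, y, (y // BLOCK) * BLOCKS_X + x // BLOCK)

print(thread_role(0), thread_role(255), thread_role(256), thread_role(271))
```

The pixel threads would wait on a group barrier and then read the bound from the LDS slot given by the last tuple element.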
So I had to write a regular 2 pass version instead, which could just as well be done in pixel shaders (ran into a lot of stupid restrictions with the new buffer types btw).
An initial pass writes out a lower bound on depth for each 4x4 block to a quarter-sized structured buffer, and a final pass starts at those positions instead of at the camera. This gives in the range of 20-40% more performance for this scene. Again, looking at the number of distance evaluations (just plotting them per pixel) it looks like my FLOPS are going down. The first pass is taking <10% of the time.
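A CPU-side sketch of the two-pass idea, for anyone who wants to play with it. How the block bound is computed is not described above, so the cone-style march of the block's centre ray here is my own assumption, as are the sphere scene, the camera, and all names and constants. It only illustrates that the bound is conservative; it does not reproduce the speedup.

```python
import math

EPS = 1e-3     # hit threshold (assumed value)
MAX_T = 20.0   # far limit (assumed value)
BLOCK = 4      # 4x4 pixel blocks, as in the post

def scene(p):
    # Hypothetical stand-in scene: a unit sphere centred at (0, 0, 5).
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + (p[2] - 5.0) ** 2) - 1.0

def ray_dir(px, py, size):
    # Camera at the origin looking down +z; px and py may be fractional.
    x = (px + 0.5) / size * 2.0 - 1.0
    y = (py + 0.5) / size * 2.0 - 1.0
    n = math.sqrt(x * x + y * y + 1.0)
    return (x / n, y / n, 1.0 / n)

def trace(d, t=0.0):
    # Plain sphere tracing from depth t; returns (depth, distance evaluations).
    evals = 0
    while t < MAX_T:
        dist = scene((d[0] * t, d[1] * t, d[2] * t))
        evals += 1
        if dist < EPS:
            break
        t += dist
    return t, evals

def block_lower_bound(bx, by, size):
    # Pass 1: march the block's centre ray, keeping a margin for the other rays.
    c = ray_dir(bx * BLOCK + (BLOCK - 1) / 2, by * BLOCK + (BLOCK - 1) / 2, size)
    # k bounds how far any ray of the block drifts from the centre ray per unit depth.
    k = max(math.dist(c, ray_dir(bx * BLOCK + i, by * BLOCK + j, size))
            for i in range(BLOCK) for j in range(BLOCK))
    t = 0.0
    while t < MAX_T:
        free = scene((c[0] * t, c[1] * t, c[2] * t)) - EPS - t * k
        if free <= EPS:
            break
        t += free / (1.0 + k)
    return min(t, MAX_T)

def render_two_pass(size):
    # Pass 2: every pixel starts at its block's lower bound instead of the camera.
    bounds = {(bx, by): block_lower_bound(bx, by, size)
              for bx in range(size // BLOCK) for by in range(size // BLOCK)}
    return {(px, py): trace(ray_dir(px, py, size), bounds[(px // BLOCK, py // BLOCK)])
            for px in range(size) for py in range(size)}
```

Summing the second tuple element over `render_two_pass(size)` and comparing it with plain `trace` per pixel gives the same kind of per-pixel evaluation count I plotted. For a scene this simple the saving is small, since the empty space in front of the sphere is skipped in one step anyway.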
btw, "a = (b>0) ? 0 : complex()" always evaluates BOTH branches, while "if (b>0) a=0; else a=complex();" does not.