Thread group issues...
I was going to apply what I thought was an obvious compute shader (CS) optimization to this, but it's not behaving as expected. What I wanted was to have each thread process a small (4x4) block of pixels, sharing most of the work between them.
However, just adding the basic 4x4 loop to the shader and reducing the group size to 4x4 as well drastically increases compilation time and reduces performance.
It seems even the [loop] attribute won't save me from the insane unrolling, so to keep program size down (although big programs don't seem to be as much of an issue as on r700?) I have to resort to the old trick of hiding the loop count from the compiler...
As for the performance, I think it has to do with the thread grouping, but the slowdown seems too large to be explained by that alone.
I haven't seen the hardware implications of thread groups documented, but from the concepts of LDS and GDS I expect it to work like this: each thread group runs on a single SIMD (of which I have 10), and this SIMD can't accept a new group until all threads in the previous one are done (although this should be possible if LDS is not used?). So with a few long-running threads, my SIMDs will be starved toward the end of a group.
So groups of 16 threads or fewer would be really bad, and generally we want as many threads as possible in each group, i.e. the maximum of 1024 (while still having a decent number of groups, but a few hundred should do). Having more threads than the hardware (register file) can hold at once shouldn't be a problem.
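To put rough numbers on the group size argument, here is a small Python sketch of lane utilization. It rests on two assumptions that the post does not confirm: that the threads of a group are packed into wavefronts of 64 lanes (the usual wavefront size on AMD GPUs of this generation), and that a wavefront is never shared between groups. `WAVEFRONT_SIZE`, `wavefronts_per_group` and `lane_utilization` are names made up for the illustration.

```python
import math

WAVEFRONT_SIZE = 64  # assumption: lanes per wavefront on this hardware


def wavefronts_per_group(group_threads, wavefront_size=WAVEFRONT_SIZE):
    """Wavefronts needed for one group (a partially filled one still counts)."""
    return math.ceil(group_threads / wavefront_size)


def lane_utilization(group_threads, wavefront_size=WAVEFRONT_SIZE):
    """Fraction of the allocated lanes that carry an actual thread."""
    lanes = wavefronts_per_group(group_threads, wavefront_size) * wavefront_size
    return group_threads / lanes


for w, h in [(4, 4), (8, 8), (16, 16), (32, 32)]:
    n = w * h
    print(f"{w}x{h}: {wavefronts_per_group(n)} wavefront(s), "
          f"{lane_utilization(n):.0%} of lanes used")
```

Under these assumptions a 4x4 group fills only a quarter of one wavefront, while 16x16 and 32x32 groups both fill their wavefronts completely, so this model alone says nothing about the difference between those two.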
However, the original 16x16 version is almost twice as fast as a 32x32 version. Why is that?
Here are the numbers for my test frame:
1 pixel/thread, 16x16 threads/group: 34fps
1 pixel/thread, 32x32 threads/group: 19fps
16 pixels/thread, 16x16 threads/group, dynamic loop: 17fps
16 pixels/thread, 16x16 threads/group, unrolled loop (grrr): 17fps (so just a compilation time issue)
16 pixels/thread, 4x4 threads/group: 6fps
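Frame rates are awkward to compare directly, so the following Python sketch converts the measurements listed above into frame times and into a slowdown factor relative to the 16x16, 1 pixel/thread version. Only the fps values come from the measurements; the dictionary keys and the `frame_time_ms` helper are illustrative names.

```python
# Measured frame rates from the list above (fps).
results = {
    "1 px/thread, 16x16 group": 34,
    "1 px/thread, 32x32 group": 19,
    "16 px/thread, 16x16 group, dynamic loop": 17,
    "16 px/thread, 16x16 group, unrolled loop": 17,
    "16 px/thread, 4x4 group": 6,
}


def frame_time_ms(fps):
    """Time spent on one frame, in milliseconds."""
    return 1000.0 / fps


baseline = frame_time_ms(results["1 px/thread, 16x16 group"])
for name, fps in results.items():
    t = frame_time_ms(fps)
    print(f"{name}: {t:.1f} ms/frame, {t / baseline:.2f}x the baseline")
```

In these terms the 32x32 version costs about 1.8 times the frame time of the 16x16 version, the 16 pixels/thread versions cost 2 times, and the 4x4 group version about 5.7 times.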
And what would be the right way to do things if the above isn't feasible? I imagine something like 1-pixel threads, with 16 or more 4x4 pixel blocks in a group, and then letting one thread from each block do the common work, store the result in LDS, and then synchronize the whole group. However, even with the maximum possible 64 blocks (1024 threads), drastically different workloads for the 64 "initial pixels" could waste a lot of cycles at the synchronization point.
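The indexing for such a scheme, and the cost of the group-wide synchronization, can be sketched on the CPU in Python. This is only a model, not shader code: it assumes a 16x16 group split into 4x4 blocks, one shared-memory slot per block, and made-up random setup costs for the leaders. All names (`block_slot`, `is_leader`, `barrier_waste`) are hypothetical.

```python
import random

GROUP_DIM = 16   # threads per group side (16x16 = 256 threads), assumed
BLOCK_DIM = 4    # pixels per block side (one shared setup per 4x4 block)
BLOCKS_PER_ROW = GROUP_DIM // BLOCK_DIM


def block_slot(tx, ty):
    """Index of the shared-memory slot for the block containing thread (tx, ty)."""
    return (ty // BLOCK_DIM) * BLOCKS_PER_ROW + (tx // BLOCK_DIM)


def is_leader(tx, ty):
    """True for the one thread per block that does the common work."""
    return tx % BLOCK_DIM == 0 and ty % BLOCK_DIM == 0


def barrier_waste(setup_costs):
    """Fraction of total leader time spent waiting at the group-wide barrier.

    Every leader waits until the slowest one has finished, so the idle
    time of each leader is max(costs) minus its own cost.
    """
    slowest = max(setup_costs)
    idle = sum(slowest - c for c in setup_costs)
    return idle / (slowest * len(setup_costs))


leaders = [(tx, ty) for ty in range(GROUP_DIM) for tx in range(GROUP_DIM)
           if is_leader(tx, ty)]
print(len(leaders), "leaders, slots", sorted(block_slot(*p) for p in leaders))

random.seed(0)  # made-up workloads, only to show the effect
costs = [random.randint(10, 100) for _ in leaders]
print(f"leader time lost at the barrier: {barrier_waste(costs):.0%}")
```

The model only counts the waiting of the leaders. The other 15 threads of each block are idle for the entire setup phase in any case, and on real hardware threads of the same wavefront run in lockstep, so the actual cost is probably dominated by the slowest leader per wavefront.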