While overall performance should see a nice increase, I don't think benchmark figures are where you'll see the benefit the most. The biggest benefit, as I see it, is that you should get a more stable framerate. When you enter a part of the world where the number of draw calls happens to shoot up to 5000, the framerate wouldn't drop to the mid-20s any more but could stay in the 50-60 range.
Sounds tasty.
Are you talking about the total frame time? I'd expect such a reduction to roughly halve the blur effect's own time. The biggest gain isn't going to come from bandwidth reduction but from reducing the texture fetch count, turning posteffects from mostly fetch bound to ALU bound.
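To put rough numbers on the fetch-count reduction, here's a back-of-the-envelope sketch (Python as a calculator, not shader code; the function names and the "ignore the apron" simplification are my own):

```python
# Fetches issued per output pixel for an N-wide blur kernel.
# Illustrative arithmetic only; ignores cache effects and the apron.

def ps_naive_fetches(n):
    """Naive 2D pixel-shader blur: one texture fetch per tap, N*N taps."""
    return n * n

def ps_separable_fetches(n):
    """Separable PS blur (two 1D passes): N fetches per pixel per pass."""
    return 2 * n

def cs_global_fetches(n):
    """LDS-based CS separable blur: each thread issues roughly one
    global fetch per pass; the remaining taps are read from LDS."""
    return 2

for n in (7, 15, 31):
    print(n, ps_naive_fetches(n), ps_separable_fetches(n), cs_global_fetches(n))
```

The point being that the CS version's global fetch count is flat in kernel width, so as the kernel grows the shader's cost shifts from fetch to ALU.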
I was thinking purely in terms of being bandwidth bound.
But, separately, the problem with the CS implementation is that it trades TEX fetches (which on ATI should be significantly cached in L1) for LDS fetches. Since, per "pixel", it only uses each sample once, it pays the full bandwidth/latency penalty for the LDS fetch, which has to be preceded by thread-group fetches that populate LDS plus a thread-group "syncthreads".
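For anyone who hasn't seen the pattern, here's a toy model of that data flow in Python (a simulation of the idea, not real shader code; GROUP, RADIUS and blur_group are names I've made up, and the box-filter weights are just for illustration):

```python
# Toy model of the CS blur pattern: each "thread" posts its texel into
# thread-group local storage (lds), then, after the barrier, reads all
# of its kernel taps from lds instead of from the texture cache.

GROUP = 8    # threads per group (deliberately tiny)
RADIUS = 1   # 3-wide kernel

def blur_group(texture, base):
    # Phase 1: populate LDS, including the apron at both ends.
    lds = [texture[base - RADIUS + i] for i in range(GROUP + 2 * RADIUS)]
    # (on real hardware a syncthreads/barrier goes here)
    # Phase 2: each thread reads its 2*RADIUS+1 taps purely from LDS.
    out = []
    for t in range(GROUP):
        taps = lds[t : t + 2 * RADIUS + 1]
        out.append(sum(taps) / len(taps))
    return out

tex = list(range(16))
print(blur_group(tex, 4))  # -> [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
```

Note the asymmetry the post describes: each texel is fetched from memory once (phase 1) but read 2*RADIUS+1 times from LDS (phase 2), so the sharing happens across pixels, not within one.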
It seems to me this is really about the percentage of L1 misses. If L1 hit 100% of the time, then PS would be faster than CS because there's no syncthreads and no L1-fetch-into-LDS step. Obviously L1 won't hit 100%, so it becomes a question of the latency margin caused by those misses...
Against this, the fetch pattern of a filtering kernel tends to fight rasterisation order (Z-order, say), since its fetches are in linear space, not rasterisation space, which in theory causes a substantial L1 miss rate. One of those things that doesn't seem to have been benchmarked very effectively, as far as I can tell.
It seems to me there's a real risk that the CS implementation won't be significantly faster, simply because of the low arithmetic intensity. But a much larger kernel size obviously changes things, both by exacerbating L1's problems with linear-space fetches and by increasing the thrashing caused by competing pixels. Though it's interesting to note that D3D11's thread local storage is only 32KB, which isn't vastly larger than the cluster's L1 size (8KB is probably too small a guess, but I don't remember L1 size ever being stated anywhere for current hardware).
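Quick capacity arithmetic on that 32KB, assuming a few common texel formats (the byte-per-texel figures are mine, the 8KB L1 is the guess from above):

```python
# How many texels fit in D3D11's 32KB thread local storage versus a
# guessed 8KB L1, for a few texel formats. Bytes-per-texel are the
# standard format sizes; the 8KB L1 figure is speculative.
LDS_BYTES = 32 * 1024
L1_BYTES = 8 * 1024

for name, bpp in (("RGBA8", 4), ("RGBA16F", 8), ("RGBA32F", 16)):
    print(name, LDS_BYTES // bpp, "texels in LDS,",
          L1_BYTES // bpp, "texels in 8KB L1")
```

So at RGBA16F, LDS holds 4096 texels, i.e. four full 1024-wide rows, which is comfortable for a 1D pass but only a 64x64 tile's worth for a 2D one.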
(CS actually has a back-door optimising effect on ATI: if a shader is not too heavy on register allocation, the PS version will result in more concurrent pixels fighting over L1, whereas in the CS version it's not possible to have more than 1024 threads sharing data amongst themselves (16 wavefronts). This optimisation doesn't really apply on NVidia (at least not currently), because NVidia's register files are too small to support such a large number of competing pixels in the PS version, i.e. there's effectively a hard limit of 1024 pixels on current hardware anyway.)
The main effect of CS could simply be that the initial fetch (each thread fetches its corresponding texel and posts it to thread local storage) is highly cache coherent, giving the best possible L1 hit rate; thereafter every read is effectively a guaranteed 100% hit (served from thread local storage).
Another parameter here is the size of the thread group. It's no good making a single thread group 1024 in size, because this creates dead time at the start and end of the kernel: either LDS is still being populated for the new thread group (i.e. until syncthreads passes), or some of the threads in the group (16 wavefronts' worth) have already completed. Thread local storage cannot be freed for the next thread group until all threads in the current group have finished.
So the programmer needs to size the thread group minimally, based on the kernel size. But rasterisation order then rears its ugly head: too-small thread groups increase the total number of repeated fetches and the incoherence (i.e. linear-space rather than rasterisation-space fetches). And overlaps ("apron" is a nice term) are required at the edges of the region being filtered.
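The apron side of that trade-off is easy to quantify. For a 1D group of size G and kernel radius r, the group fetches G + 2r texels but produces only G results, so the wasted-fetch fraction is 2r / (G + 2r) (a sketch under that simple model; apron_overhead is my own name):

```python
# Fraction of a thread group's fetches that are redundant apron fetches,
# for group size g and kernel radius r (1D pass, simple model).
def apron_overhead(g, r):
    return 2 * r / (g + 2 * r)

r = 3  # 7-wide kernel
for g in (64, 128, 256, 1024):
    print(g, round(apron_overhead(g, r), 3))
```

So for the 7-wide kernel the apron cost falls quickly with group size (about 9% at G=64, under 1% at G=1024), while the start-up/shut-down dead time grows, which is exactly the tension described above.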
It seems the example code uses a thread group size of 1024, processing an entire row (or column) of the 1024x1024 source texture at a time. So there's no apron, and I presume a fair amount of wasted ALU cycles due to thread-group start-up and shut-down intervals.
I don't know where the break-even point is in a comparison of PS and CS - maybe the 7-wide kernel is just beyond the break-even point?
Jawed