Hardware Gaussian blur?

Shifty Geezer

Given the ubiquity of blur in various image operations, and the fact that it doesn't map fabulously well to GPUs (even crazily optimised blurs need multiple taps), especially on mobile, is there any merit in, and means by which, some alternative sampling hardware could be implemented to facilitate random image sampling? I suppose passing the results to shaders could be problematic, so perhaps they'd need to function as stand-alone buffer processors. Or are GPUs as good as it's going to get, with no room for improvement?
 
facilitate random image sampling
As soon as I saw the words "random image sampling" the adjective "slow" came to mind.

It's not dedicated HW and unfortunately I can't remember where I saw it, but I think I read about methods for constructing (approximate) Gaussian filtering, possibly using either a few box filters or IIR filters. Might be worth a search.
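
(For reference, a rough host-side sketch of the iterated-box-filter idea, written for clarity rather than speed; the box width is chosen so that three passes roughly match the requested sigma. The names and structure are illustrative only.)

```cuda
// Rough sketch: approximate a Gaussian by running three box-blur passes.
// 1D only, edge-clamped. A real version would use a running sum so each
// pass is O(n) regardless of radius, which is the whole appeal of the method.
#include <algorithm>
#include <cmath>
#include <vector>

// One box-blur pass of radius r over a 1D signal (edge-clamped).
static std::vector<float> boxBlur(const std::vector<float>& src, int r)
{
    const int n = (int)src.size();
    std::vector<float> dst(n);
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int k = -r; k <= r; ++k)
            sum += src[std::min(std::max(i + k, 0), n - 1)];
        dst[i] = sum / float(2 * r + 1);
    }
    return dst;
}

// A box of width w has variance (w*w - 1) / 12, so three passes give
// roughly sigma^2 = 3 * (w*w - 1) / 12, i.e. w = sqrt(4*sigma^2 + 1).
std::vector<float> approxGaussian(std::vector<float> v, float sigma)
{
    int w = (int)std::sqrt(4.0f * sigma * sigma + 1.0f);
    int r = std::max(w / 2, 1);
    for (int pass = 0; pass < 3; ++pass)
        v = boxBlur(v, r);
    return v;
}
```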

As for 'new' filtering hardware, one recent example I can think of is the addition of bicubic Catmull-Rom to PVR GPUs.
 
Yeah, I've used the optimal, very clever solution: it uses the hardware texture sampling, placing texture coordinates between buffer samples so that each fetch returns a weighted combination of two samples. http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/
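
(For anyone curious, a minimal host-side sketch of how those linear-sampling weights and offsets fall out of the discrete Gaussian weights; the numbers are the ones used in the linked article, the helper program itself is just illustrative.)

```cuda
// Sketch of the linear-sampling trick: fold pairs of discrete Gaussian taps
// into single bilinear fetches. Two weights w1, w2 at texel offsets o1, o2
// become one fetch of weight w1 + w2 at offset (o1*w1 + o2*w2) / (w1 + w2).
#include <cstdio>

int main()
{
    // Discrete weights for the centre tap and one side of a 9-tap Gaussian
    // (the values used in the rastergrid article).
    const float w[5] = { 0.2270270270f, 0.1945945946f, 0.1216216216f,
                         0.0540540541f, 0.0162162162f };

    printf("centre tap:   offset 0.000000, weight %f\n", w[0]);
    for (int i = 1; i < 5; i += 2) {
        float weight = w[i] + w[i + 1];
        float offset = (i * w[i] + (i + 1) * w[i + 1]) / weight;
        printf("bilinear tap: offset %f, weight %f\n", offset, weight);
    }
    // 9 discrete taps collapse to 5 fetches (1 centre + 2 bilinear per side);
    // the shader samples at +/- offset and the HW filter applies the split.
    return 0;
}
```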

Problem is it's still not fast enough! For good sized blurs you need multiple passes even at quarter res and the like. At first glance there's not a lot that can be done about that because you need to sample the area of the blur. However, I'm sure something clever can be done. Just off the top of my head, how about creating a cascade of reduced sized buffers (mip maps) and using some clever sampling between them? I expect the caching of these operations is very efficient, so there wouldn't be any massive gains from having a large enough scratchpad (eDRAM) to fit the whole buffer.
 
Reduce read latency by inverting the flow: instead of reading multiple samples per pixel, read the input once and write it, weighted, to all output samples. To reduce the massive write amplification, use tiled atomic adds on LDS/GDS. Would this work? Who knows.
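
(A rough CUDA-flavoured sketch of what that scatter idea could look like, with __shared__ standing in for LDS; 1D horizontal pass only, and it deliberately ignores contributions that cross tile boundaries, which is exactly where it gets hairy.)

```cuda
// Scatter-style sketch: each thread reads ONE input pixel and atomically adds
// its weighted contribution to the output samples it touches, accumulating in
// __shared__ memory. Launch with blockDim.x == TILE; halo contributions from
// neighbouring tiles are omitted here for brevity, a real version needs them.
#define TILE 256
#define R    8

__constant__ float g_weights[2 * R + 1];   // normalised kernel, set by the host

__global__ void scatterBlurH(const float* in, float* out, int width)
{
    __shared__ float tile[TILE];

    int x = blockIdx.x * TILE + threadIdx.x;   // this thread's input pixel
    tile[threadIdx.x] = 0.0f;
    __syncthreads();

    if (x < width) {
        float v = in[x];
        // Write amplification lives here: 2*R + 1 atomic adds per input pixel.
        for (int t = -R; t <= R; ++t) {
            int dst = (int)threadIdx.x + t;    // output index inside the tile
            if (dst >= 0 && dst < TILE)
                atomicAdd(&tile[dst], v * g_weights[t + R]);
        }
    }
    __syncthreads();

    if (x < width)
        out[x] = tile[threadIdx.x];
}
```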
 
Problem is it's still not fast enough! For good sized blurs you need multiple passes even at quarter res and the like. At first glance there's not a lot that can be done about that because you need to sample the area of the blur. However, I'm sure something clever can be done. Just off the top of my head, how about creating a cascade of reduced sized buffers (mip maps) and using some clever sampling between them?
Enter Kawase's Bloom (slide 44):

http://www.daionet.gr.jp/~masa/archives/GDC2004/GDC2004_PIoHDRR_EN.ppt
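
(For reference, one Kawase pass boils down to something like the following CUDA-style sketch; it assumes a bilinear-filtered float texture with normalised coordinates, and the offset grows by a texel each pass while ping-ponging between two buffers.)

```cuda
// One Kawase pass: four bilinear taps at diagonal offsets of (pass + 0.5)
// texels, averaged. Each bilinear tap already averages 4 texels, so 4 taps
// cover 16 texels; repeated passes widen the blur cheaply.
__global__ void kawasePass(cudaTextureObject_t src, float* dst,
                           int width, int height, int pass)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float u  = (x + 0.5f) / width;
    float v  = (y + 0.5f) / height;
    float du = (pass + 0.5f) / width;    // offset in texels grows every pass
    float dv = (pass + 0.5f) / height;

    float s = tex2D<float>(src, u - du, v - dv)
            + tex2D<float>(src, u + du, v - dv)
            + tex2D<float>(src, u - du, v + dv)
            + tex2D<float>(src, u + du, v + dv);
    dst[y * width + x] = 0.25f * s;
}
```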

Unity's open sourced bloom works like that:

https://github.com/keijiro/KinoBloom

You could also spread the blur over several frames, like SotC's / ZOE2's bloom.
 
On modern GPUs, you should program blur kernels as compute shaders. Compute shaders have access to groupshared memory, a fast on-chip memory per compute unit (64 KB per CU on AMD GPUs). With groupshared memory you don't need to load/sample the blur neighborhood again and again for each pixel: you first load the neighborhood into groupshared memory, then read each pixel's taps from there. Separate X/Y as usual.
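
(In CUDA terms, where __shared__ plays the role of groupshared/LDS, the horizontal pass of such a blur looks roughly like this; block size, radius and weight setup are placeholders.)

```cuda
// Gather-style sketch: each block loads its row segment plus a halo of R
// pixels into __shared__ memory once, then every thread filters entirely
// from on-chip memory. Horizontal pass only; launch with blockDim.x == BLOCK
// and gridDim.y == height.
#define BLOCK 256
#define R     8

__constant__ float g_weights[2 * R + 1];   // normalised kernel, set by the host

__global__ void blurH(const float* in, float* out, int width)
{
    __shared__ float line[BLOCK + 2 * R];

    int y  = blockIdx.y;
    int x  = blockIdx.x * BLOCK + threadIdx.x;
    int tx = threadIdx.x + R;

    // One global load per thread for the centre sample...
    line[tx] = in[y * width + min(x, width - 1)];
    // ...plus a few extra loads for the left/right halo (edge-clamped).
    if (threadIdx.x < R) {
        int xl = blockIdx.x * BLOCK - R + (int)threadIdx.x;
        int xr = blockIdx.x * BLOCK + BLOCK + (int)threadIdx.x;
        line[threadIdx.x]             = in[y * width + max(xl, 0)];
        line[BLOCK + R + threadIdx.x] = in[y * width + min(xr, width - 1)];
    }
    __syncthreads();

    if (x >= width) return;

    // Every tap now comes from on-chip memory instead of DRAM / texture cache.
    float acc = 0.0f;
    for (int t = -R; t <= R; ++t)
        acc += g_weights[t + R] * line[tx + t];
    out[y * width + x] = acc;
}
```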

You should also do reductions directly in groupshared memory if you want multiple different-radius Gaussian filters. Doing multiple downsample & combine pixel shader passes is slow, because the GPU stalls between each pass (as there's always a dependency on the last pass's output). This is another important advantage of compute shader blur versus pixel shader blur.
 
Cool. When we get compute on mobile, we can get some decent effects! So compute really does replace the need for pretty much any specialist hardware?
 
Cool. When we get compute on mobile, we can get some decent effects! So compute really does replace the need for pretty much any specialist hardware?
Compute shaders are still limited; the programming model could be more flexible. Also, groupshared memory is mostly only good for regular (known) access patterns. If you do random sampling, your neighbors don't share much data. Dedicated texture filtering hardware isn't going away anytime soon. Compute shaders are good for post-processing (2D rect, splittable to tiles, known neighborhood), but you need to be able to sample (and filter) a texture from any UV coordinate when you are rendering polygon meshes.
 
That is, I suppose, the broader concept. The main problem is memory access patterns. If we had unlimited bandwidth with no read/write latencies, we could do whatever we wanted. There's no real solution for that, as we're limited by caches and access patterns.
 