[OpenGL 4] k-buffer using shader_image_load_store

Zeross · Jun 3, 2013

To replace an old (and slow) depth peeling solution, I'm working on a k-buffer algorithm for order-independent transparency. I was largely inspired by Cyril's demo (http://blog.icare3d.org/2010/06/fast-and-accurate-single-pass-buffer.html) and made some minor modifications to integrate it into my code.

Everything works well as long as the number of samples allocated is sufficient to hold all the overlapping transparent fragments. For example, with 16 or 32 samples per pixel, my scene renders perfectly. But if I scale it down to 4 samples to limit memory consumption, I see crawling visual artifacts wherever the limit is exceeded, which is not acceptable from a visual quality standpoint. What I'd like is something that behaves like the depth peeling algorithm: if I only have room for 4 samples per pixel, just keep the 4 nearest overlapping fragments.

My idea was to sort overlapping fragments on the fly: if my k-buffer is full, find the farthest stored fragment and replace it with the new one if the new one is nearer to the viewpoint. But this is more difficult than expected. A lot of warps/wavefronts are in flight at the same time, and I have no guarantee regarding the order of operations. So I've tried to implement a critical section in my code, using a lock buffer that is updated with atomic operations. Some people seem to have had some success with solutions like this (see for example http://stackoverflow.com/questions/11820066/glsl-spinlock-only-mostly-works/16802075#16802075), but in my case, with the same driver revision (320.18) on a Fermi card, every solution I've tried only results in a reset of my graphics driver.
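For reference, the replacement rule itself is easy to state when only one writer touches a pixel. Here is a minimal single-threaded sketch in CPU-side C++ (not shader code). The names `Fragment` and `PixelKBuffer` are hypothetical and are not taken from Cyril's demo.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-fragment record (illustrative names only).
struct Fragment {
    float depth;          // distance from the viewpoint, smaller is nearer
    std::uint32_t color;  // packed RGBA, unused by the selection logic
};

// Fixed-capacity store that keeps only the K nearest fragments of one pixel.
template <std::size_t K>
struct PixelKBuffer {
    static_assert(K > 0, "need at least one slot");

    std::array<Fragment, K> slots{};
    std::size_t count = 0;

    // Index of the farthest stored fragment; only meaningful once full.
    std::size_t farthestIndex() const {
        assert(count == K);
        std::size_t farthest = 0;
        for (std::size_t i = 1; i < K; ++i) {
            if (slots[i].depth > slots[farthest].depth) farthest = i;
        }
        return farthest;
    }

    // Returns true if the fragment was stored, false if it was discarded.
    bool insert(const Fragment& f) {
        if (count < K) {              // free slot available: just append
            slots[count++] = f;
            return true;
        }
        const std::size_t farthest = farthestIndex();
        if (f.depth >= slots[farthest].depth) {
            return false;             // new fragment is not nearer: drop it
        }
        slots[farthest] = f;          // evict the farthest fragment
        return true;
    }
};
```

The hard part is that the read (find the farthest), the compare, and the write must appear as one indivisible step per pixel, which is exactly what concurrent shader invocations do not give me.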

So how can I implement a critical section inside a shader? Is it even possible? Or maybe the problem comes from my algorithm. Intel seems to have a D3D extension that could help (Pixel Shader Ordering), but it is proprietary and, moreover, it is not exposed in OpenGL.

Andrew Lauritzen · Jun 4, 2013

Zeross said:
So how can I implement a critical section inside a shader? Is it even possible? Or maybe the problem comes from my algorithm. Intel seems to have a D3D extension that could help (Pixel Shader Ordering), but it is proprietary and, moreover, it is not exposed in OpenGL.

Short answer is you can't ever "wait" or hold a "lock" in a shader as that is unsafe. A valid DX/GL implementation is allowed to run any pixel shader invocation to any point, suspend it, then run something else arbitrarily... there is no guarantee of fairness or time slicing or anything like that, so it's not okay to do something that would block or busy wait.

What you can do is something like a "transaction"... i.e. do all the work in a separate local list (i.e. allocate a new node list or something), then atomic cmp/swap the pointer to the list at the end. If anyone else rushed in and changed it first, you have to start over and loop the whole shader. It's a similar concept to what you already do to "append" something, but it makes constructing a fully separate/updated list part of the "transaction".

The reason this is different from a "critical section" is that if it fails you throw out everything, the entire "transaction", and start over, including re-reading the current state of the list in memory in case it was updated in the meantime.
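As a rough illustration of that pattern, here is a CPU-side C++ sketch using `std::atomic` in place of the image atomics a shader would use. Everything in it is an assumption made for the example: the names `Snapshot`, `Pool` and `insertTransactional`, the fixed size of 4, and the choice to store each list version as one immutable block rather than as linked nodes.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kSlots = 4;               // fragments kept per pixel
constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;   // "no list published yet"

// One immutable version of a pixel's list. Never modified after publication.
struct Snapshot {
    std::array<float, kSlots> depth{};
    std::uint32_t count = 0;
};

// Bump allocator: space is handed out once and never reused.
struct Pool {
    std::vector<Snapshot> storage;
    std::atomic<std::uint32_t> next{0};

    explicit Pool(std::size_t n) : storage(n) {}

    std::uint32_t allocate() {
        const std::uint32_t i = next.fetch_add(1, std::memory_order_relaxed);
        assert(i < storage.size() && "pool exhausted");
        return i;
    }
};

// Build an updated copy privately, then try to swap it in. On failure the
// whole attempt is discarded and redone from the newly observed state.
void insertTransactional(Pool& pool, std::atomic<std::uint32_t>& head, float depth) {
    std::uint32_t seen = head.load(std::memory_order_acquire);
    for (;;) {
        Snapshot local = (seen == kEmpty) ? Snapshot{} : pool.storage[seen];

        if (local.count < kSlots) {
            local.depth[local.count++] = depth;
        } else {
            std::size_t farthest = 0;
            for (std::size_t i = 1; i < kSlots; ++i) {
                if (local.depth[i] > local.depth[farthest]) farthest = i;
            }
            if (depth >= local.depth[farthest]) return;  // nothing to publish
            local.depth[farthest] = depth;
        }

        const std::uint32_t fresh = pool.allocate();
        pool.storage[fresh] = local;

        // Publish only if nobody changed the head since we read it.
        if (head.compare_exchange_strong(seen, fresh,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire)) {
            return;
        }
        // CAS failed: 'seen' now holds the current head. The block written
        // above is simply wasted, and the loop rebuilds from scratch.
    }
}
```

Note that nothing here ever waits on another invocation: a failed attempt just costs the work and the pool space it used.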

Now I don't anticipate this to be high-performance, even if it does work (drivers might be buggy with stuff like this). You wouldn't be able to reuse space from lists you've already constructed, or else you could, in rare cases, screw up a cmp-swap (a recycled pointer can compare equal even though the list behind it has changed), unless you store a completely separate "update counter" for just that purpose or similar. Furthermore, GPU scheduling could still make this run really inefficiently, with lots of work being thrown out due to invocations racing on the same pixels, but theoretically it should eventually complete.
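One way to picture the "update counter" idea is to pack the counter and the list index into a single word, so the compare fails whenever any update has happened, even if the index value itself has come back around. This is a CPU-side C++ sketch with hypothetical helper names (`pack`, `versionOf`, `indexOf`, `publish`). It assumes a 64-bit compare-and-swap is available, which may not hold in a given shader environment.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// High 32 bits: update counter. Low 32 bits: index of the current list.
constexpr std::uint64_t pack(std::uint32_t version, std::uint32_t index) {
    return (static_cast<std::uint64_t>(version) << 32) | index;
}
constexpr std::uint32_t versionOf(std::uint64_t h) {
    return static_cast<std::uint32_t>(h >> 32);
}
constexpr std::uint32_t indexOf(std::uint64_t h) {
    return static_cast<std::uint32_t>(h & 0xFFFFFFFFu);
}

// Try to replace the head observed as 'seen' with a list at 'newIndex'.
// Succeeds only if both the index and the counter are unchanged.
bool publish(std::atomic<std::uint64_t>& head, std::uint64_t seen,
             std::uint32_t newIndex) {
    // Assumption of this sketch: the counter never wraps around.
    assert(versionOf(seen) != 0xFFFFFFFFu && "update counter would wrap");
    const std::uint64_t desired = pack(versionOf(seen) + 1, newIndex);
    return head.compare_exchange_strong(seen, desired);
}
```

With a plain index, a stale reader whose old index has been recycled would pass the compare by accident. With the counter attached, its `publish` call fails and it has to restart.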

But yeah, to reiterate, the short answer is no you can't do that with general DX/GL semantics.

I will note that Intel's pixel shader ordering extension on Haswell exposes precisely the semantics that you want for this sort of thing. In fact, the OIT and volumetric shadows in Grid 2 are implemented similarly, by constructing a compressed per-pixel representation of the visibility function on the fly in a pixel shader. This relies on the extension's ability to denote a sort of "critical section tail" of a pixel shader that is guaranteed not only to run by itself, but to run in primitive order (so compression and such are stable as well). Pretty awesome stuff.

Zeross · Jun 5, 2013

Thanks for your answer, Andrew, that's exactly what I was worried about. I have seen the Pixel Shader Ordering extension and was quite interested, but it is D3D only and I'm an OpenGL programmer. You should lobby Intel to propose it as an OpenGL extension.

More seriously, I'm sure a lot of people would be interested. And who knows, maybe other IHVs will follow and add support for it. OIT is a real problem, and all solutions that could help to handle it are welcome.
