To replace an old (and slow) depth peeling solution, I'm working on a k-buffer algorithm for order-independent transparency. I was largely inspired by Cyril's demo (http://blog.icare3d.org/2010/06/fast-and-accurate-single-pass-buffer.html) and made some minor modifications to integrate it into my code.
Everything works well as long as the number of samples allocated per pixel is sufficient to hold all the overlapping transparent fragments. For example, with 16 or 32 samples per pixel the rendering of my scene is perfect. But if I scale it down to 4 samples to limit memory consumption, I see crawling visual artifacts wherever the limit is exceeded. This is not acceptable from a visual quality standpoint. What I'd like is something that behaves like the depth peeling algorithm: if I only have room for 4 samples per pixel, just keep the 4 nearest overlapping fragments.
My idea was to sort overlapping fragments on the fly: if the k-buffer for a pixel is full, find the farthest stored fragment and replace it with the new one if the new one is nearer to the viewpoint. But this is more difficult than expected. Many warps/wavefronts are in flight at the same time, and I have no guarantee regarding the order of operations. So I've tried to implement a critical section in my shader, using a lock buffer that is updated with atomic operations. Some people seem to have had some success with solutions like this (see for example http://stackoverflow.com/questions/11820066/glsl-spinlock-only-mostly-works/16802075#16802075), but in my case, with the same driver revision (320.18) on a Fermi card, every solution I've tried only results in a reset of my graphics driver.
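To make the intended per-pixel update concrete, here is a minimal single-threaded C++ sketch of the "keep the k nearest" replacement rule described above. It is a CPU-side illustration only, not my shader code: the names `Fragment`, `PixelKBuffer` and `insertNearest` are hypothetical, and the sketch deliberately ignores concurrency. On the GPU this whole function body is the part that would need to run inside the critical section for a given pixel.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fragment record: depth plus a packed color.
struct Fragment {
    float depth;         // distance from the viewpoint (smaller = nearer)
    unsigned int color;  // packed RGBA, for illustration only
};

// Hypothetical per-pixel storage: up to K unsorted fragments.
template <std::size_t K>
struct PixelKBuffer {
    static_assert(K > 0, "the k-buffer needs at least one slot");
    std::array<Fragment, K> frags{};
    std::size_t count = 0;
};

// Stores f if there is a free slot, or if f is nearer than the farthest
// fragment currently stored (which it then replaces).
// Returns true if f was stored, false if it was discarded.
template <std::size_t K>
bool insertNearest(PixelKBuffer<K>& kb, const Fragment& f) {
    // Free slot available: append without any comparison.
    if (kb.count < K) {
        kb.frags[kb.count++] = f;
        return true;
    }

    // Buffer full: find the index of the farthest stored fragment.
    std::size_t farthest = 0;
    for (std::size_t i = 1; i < K; ++i) {
        if (kb.frags[i].depth > kb.frags[farthest].depth) {
            farthest = i;
        }
    }

    // Replace it only if the incoming fragment is nearer.
    if (f.depth < kb.frags[farthest].depth) {
        kb.frags[farthest] = f;
        return true;
    }
    return false;
}
```

The read of the farthest entry and the write that replaces it must not interleave with another fragment's update of the same pixel, which is why I need some form of per-pixel mutual exclusion.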
So how can I implement a critical section inside a shader? Is it even possible? Or maybe the problem comes from my algorithm. Intel seems to have a D3D extension that could help (Pixel Shader Ordering), but it is proprietary and, moreover, it is not exposed in OpenGL.