The return of the S-Buffer?

Ethatron · Apr 8, 2010

Hy;

Today I was thinking about order independent transparency and was munching the way the ATI Mecha demo handles it and I think it's amost a shift towards S-Buffers, just now on OpenCL/DirectCompute. In the early days when full Z-buffers were too bandwidth capped to be applyable (I ported the first Quake-implementation on the Amiga after it's source-leak and I spend a lot of time tweaking the S-buffer implementation, wanna see the source?

Don't throw away anything) S-buffers were real cool stuff. Basically it's some form of slope-based Z-buffer compression.

Now the OIT solution via A-buffers is almost in a similar situation as the Z-buffers of the old days, bandwidth limitation. We have fast Z-buffers but A-buffers are almost 3D-textures with a SingleLinked-depth.

I think it's quite streightforward to not order single fragments (towards Z) but fragment-strips (over Y) which then in turn exactly is like our S-buffer again.

What we'd do is a depth-prepass which does create a depthstrip-buffer, carrying startdepth/enddepth/index. Convergence is that when each triangle covers or intersects on exactly every pixel it's become an A-buffer again. Can't be worst. Of course we throw away covered strip-fragments.

Through the fragment-indices a indexed vertex-buffer can be rebuild/reordered such that ordering is back-to front exactly like a BSP-tree, without vertex- or fragment-strip-overlap. It's even possible to mask all non-transparency covered area and render 1x streight, 1x with blending.

I'm actually learning OpenSceneGraph, let's see if I can come around with a deferred renderer with depthstrip-buffer, it smooths so nicely into each other.

I'm not yet sure how the depth-pass and OpenCL are correlated, maybe I'd be forced to do depth entirely in OpenCL.

As AA-coverage sampling isn't available (I know only the theory, so bear with me), it's possible to store the fragment-strip start and stop fractions within the depthstrip-buffer, the size (in bits) of the fraction will define the AA-quality. Also quite streighforward.

And in fact as long as fragment-strips are longer than 1 pixel is a reliable z-buffer compression scheme, as any smooth vertex surface results only in X-coverage number of strips.

Seems like a nice concept to me.

MfA · Apr 8, 2010

Ethatron said:
Can't be worst.

AFAICS yes it can, because the insertion and clipping costs are horrible.

An A-buffer approach with deferred sorting has a pretty much constant insertion cost and the sorting is a regular algorithm well suited to GPUs. Also you can get away with ignoring clipping issues if you work at fragment level.

Ethatron · Apr 8, 2010

MfA said:
AFAICS yes it can, because the insertion and clipping costs are horrible.

An A-buffer approach with deferred sorting has a pretty much constant insertion cost and the sorting is a regular algorithm well suited to GPUs. Also you can get away with ignoring clipping issues if you work at fragment level.

Yes, they are [horribly costy, the splits and insertations], but it has been shown in another context (software-rasterizing) that it can be faster than the alternative.

In this case I think all falls with the possibility to NOT receive single fragments after rasterization, but to get lines (fragment-strips). Something like as if the pipeline is like this:

It's obvious I try to learn and to get this GPU state-of-mind in regard to algorithms. But as occlusion-culling was and seems like is still the biggest or maybe most complex part of rendering, I guess a little bit more (fixed) hardware-support for arbitrary OC-structures could be nice (you asked for it, mhouston :^)

Picture shameless ripped from the course-slides, hope it falls under fair use.

My main motivation in this is to remove any static restrictions like the numbers of transparency layers, it seems stencil A-Buffers are limited to the number of possible texture-targets. Dual-depth-peeling is non-restricted but multipass. In addition S-buffer structure give a bunch of additional benefits for transparency shadowing and light-filtering (the physical filter I mean).
In the extreme end an S-buffer can capture discontinued depth in such a way that deferred shadowing is amost as simple as lighting, as well as corner-correct/hidden-object SSAO. It's interesting that all those nice algorithms receive a nice resolution if one solves the OC-problem nicely (which I don't imply is solved with S-buffers, not that simple).

MfA · Apr 8, 2010

Ethatron said:
My main motivation in this is to remove any static restrictions like the numbers of transparency layers, it seems stencil A-Buffers are limited to the number of possible texture-targets.

ATI didn't use MRTs for OIT, so I don't see why it would be necessary. (They use UAVs. One for the per pixel last storage pointer into the second UAV (with counter) which stores interleaved backwards linked lists.)

With the the move towards smaller triangles, span based optimization seems swimming upstream

Ethatron · Apr 9, 2010

MfA said:
ATI didn't use MRTs for OIT, so I don't see why it would be necessary. (They use UAVs. One for the per pixel last storage pointer into the second UAV (with counter) which stores interleaved backwards linked lists.)

Unordered Access Views ...
Have to get a ppsx-reader and will comment on it later.
Is it available in OGL4 as well?

MfA said:
With the the move towards smaller triangles, span based optimization seems swimming upstream

Yes and no. If rasterization is deterministic or at least predictable (if you know your triangle ordering there must be something) you can get clever operations I hope.

After I wrote my thoughts I detected I can't leave the "traditional" pipeline for OpenCL because I would backstab the hardware tesselation capabilities (for example). I think it is very likely that S-buffers on GPU turns out costly, the question is if it's more costly than all roundtrips of the related algorithms together.
I'm quite attracted by the possibilities of the concept itself, like arbitrary AA-precision, one-pass deferred shadowing & lighting, transparency etc. And that a concept apparently seems to be a fail doesn't free you (me) from proofing it's fail, right? :^)
Maybe something else rises from its ashes.

Edit: notepook entry, related discussion: http://forum.beyond3d.com/showthread.php?t=50901
Edit: notebookentry, the stream-out-of geometry-stage does indeed exist: http://www.opengl.org/registry/specs/ARB/transform_feedback2.txt

MfA · Apr 9, 2010

Ethatron said:
Unordered Access Views ...
Have to get a ppsx-reader and will comment on it later.
Is it available in OGL4 as well?

I haven't personally read the specs, but Arjan says no :/

like arbitrary AA-precision, one-pass deferred shadowing & lighting, transparency etc.

Arbitrary AA precision on a 1D scanline, not in general (for that you need beam trees, like Tim Sweeney has suggested). Also only for primary rendering of flat shaded geometry (and potentially hard shadows) not for texturing or secondary/area lighting effects.

The problem with a full scanline scan-buffer is that the insertion costs are simply unacceptable. The scan buffer worked in quake because they used it with painter's algorithm, so they could consolidate the spans during insertion. Pointer chasing linked list with upto a thousand or so entries, per triangle per scanline covered? Forget about it.

That's not to say the idea doesn't have merit, it's just that you haven't followed it to it's logical conclusion ... someone else did though (the first thing I do when I get an idea is do a web search, the second thing I do is find someone else had it before me). You need to restrict your scan buffers to smaller regions of the image so you can keep insertion costs down ... and if you are going smaller, the pixel level kinda fits nicely with normal rendering.

Which brings us to Ronald Perry's line sampling work. He used two span buffers per pixel (horizontal and vertical) although he didn't call them that. It's an interesting alternative to a full blown per pixel beam tree, the clipping operations are a whole lot simpler.

Ethatron · Apr 9, 2010

I read the presentation. It's basically an identical approach, or let's say one could use the same approach replacing raster-pixels by raster-lines IF the rasterizer could be forced to spit out lines instead of pixels. The sheer amount of information stored per fragment made feel really unhappy about the implementation.

I thought that in the case that all vertices are in GPU memory one could possibly replace attributes by attribute-indices. To me this implementation is absolute ugly on bandwidth. If it maybe is not neccesary to switch to spans, it should get some smarter resource-management.

MfA said:
I haven't personally read the specs, but Arjan says no :/

Well, I'm munching through the spec, if it's not some obvious extension maybe some brainstorming leads to some abuse of something else to do something comparable.
The TransformFeedBack looks really interesting, I haven't grasped the implications of the Pause()/Resume() though yet.

MfA said:
Arbitrary AA precision on a 1D scanline, not in general (for that you need beam trees, like Tim Sweeney has suggested). Also only for primary rendering of flat shaded geometry (and potentially hard shadows) not for texturing or secondary/area lighting effects.

Yes ... of course [on scanline-borders]. Reminds me of some AA I saw in games where horizontal AA was absent, or very weak. Oblivion on my 9700 I think. Never knew the cause of that.

MfA said:
The problem with a full scanline scan-buffer is that the insertion costs are simply unacceptable. The scan buffer worked in quake because they used it with painter's algorithm, so they could consolidate the spans during insertion. Pointer chasing linked list with upto a thousand or so entries, per triangle per scanline covered? Forget about it.

That is just for a naive implementation, and is the same for a naive A-buffer implementation. You don't need to do direct span-clipping, you can do it deferred. You don't even need to clip at all. If you replace pixel-attributes in the ATI OIT implementation by span-indices, you gain on bandwidth and space and also maintain at least horizontal connectivity information. For analytical post-processes it's a better case, and the space-gain allows you to raise the bar on amount of transparent real-estate.

MfA said:
That's not to say the idea doesn't have merit, it's just that you haven't followed it to it's logical conclusion ... someone else did though (the first thing I do when I get an idea is do a web search, the second thing I do is find someone else had it before me). You need to restrict your scan buffers to smaller regions of the image so you can keep insertion costs down ... and if you are going smaller, the pixel level kinda fits nicely with normal rendering.

Which brings us to Ronald Perry's line sampling work. He used two span buffers per pixel (horizontal and vertical) although he didn't call them that. It's an interesting alternative to a full blown per pixel beam tree, the clipping operations are a whole lot simpler.

I'll look at that, thanks.
Google is a bitch nowadays, if you don't exactly hit the phrase you get tourist diaries from mongolia ...

Nothing like a good 3D-thesaurus/wiki ... where is it?

The return of the S-Buffer?

Ethatron

MfA

Ethatron

MfA

Ethatron

MfA

Ethatron

Similar threads