depth-of-field demo, potential use of scattered writes

vember

Newcomer
I've put together a post-processing depth-of-field demo that shows an algorithm I've developed which uses recursive filters (IIRs) to perform the blurring. This has great performance advantages, as the computational cost is unaffected by the size of the blur/CoC, and the blur is always smooth.

The demo runs in software, but the algorithms used are parallel by nature (the demo supports SMP) and should be possible to implement with great performance on graphics hardware that exposes scattered writes in shaders. Not only does the demo handle the case where an object in focus has sharp edges against a blurry background, but it also handles the case of a blurry object in front of a focused background.

It's at the top of this page, check it out:
http://www.vemberaudio.se/showoff/

- claes
 
Scattered writes can to some extent be implemented on GPUs by setting up a vertex array corresponding to the memory locations you wish to write to, then rendering the vertex array as a bunch of textured point or point-sprite primitives; are there any problems (inter-pixel dependencies, triangle setup rate, difficulties filling the vertex array fast enough, etc.) that prevent you from doing this?
 
It's not really the scattered part that's necessary; it's just multiple writes from a single shader run that are required, and these are very predictable and cache-friendly. The blurring works in two steps, horizontal and vertical. For the filter to work, it needs to process a single row/column in one shader run, as the filter state at each pixel depends on all the pixels to the left of it.

The simplest possible form of one pass of this type of filter looks like this (for one row):
Code:
// one-pole IIR low-pass over a single row; the state y carries across pixels
y = in[0];
c = 0.5;
for(x = 0; x < width; x++)
{
   y = (1-c)*y + c*in[x];
   out[x] = y;
}

Even if the filters get more complex to achieve the desired result, the memory-access pattern remains the same: each pixel is read and written only once. As the rows/columns are totally independent within each pass, parallelization is trivial. (theoretically ;) )
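
For illustration, here is a minimal C sketch (not from the demo; the image layout and names are assumed) of the two-step structure described above, applying the one-pole filter horizontally and then vertically. This is the straightforward, causal, cache-naive version; each row/column iteration is independent, which is where the parallelization opportunity lies.
Code:
/* Sketch: two-pass application of the one-pole filter above to a
   row-major float image.  Causal (left-to-right, top-to-bottom) only,
   and deliberately cache-naive in the vertical pass. */
void blur_image(float *img, int width, int height, float c)
{
    /* horizontal pass: one independent filter per row */
    for (int row = 0; row < height; row++) {
        float *p = img + row * width;
        float y = p[0];
        for (int x = 0; x < width; x++) {
            y = (1 - c) * y + c * p[x];
            p[x] = y;
        }
    }
    /* vertical pass: one independent filter per column */
    for (int col = 0; col < width; col++) {
        float y = img[col];
        for (int row = 0; row < height; row++) {
            y = (1 - c) * y + c * img[row * width + col];
            img[row * width + col] = y;
        }
    }
}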
 
vember said:
As the rows/columns are totally independent within each pass, parallelization is trivial. (theoretically ;) )
With my wavelet image compression (see thread in this forum), I've learnt one lesson the hard way: separable filters are nice (because you need to do less work) but, implemented trivially, they tend to be slow because 1) they become essentially a two-pass process, thus streaming the data twice (once for rows and once for columns), and 2) the column-wise application of the filter has fairly evil memory access patterns.

Of course there are ways around that (for example, I am essentially doing both in parallel with a lag of ~3 rows) but they are not trivial to implement and certainly not easy to debug/maintain.
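
As an aside, a simpler (and cruder) mitigation than the ~3-row-lag scheme above is to strip-mine the vertical pass: advance a block of adjacent columns one row at a time, so each row access touches a contiguous run of pixels. A rough C sketch, with the strip width as an assumed tuning parameter:
Code:
#define STRIP 64   /* tuning parameter: columns advanced together */

/* Vertical one-pole pass, strip-mined for better cache behaviour. */
void vertical_pass_strips(float *img, int width, int height, float c)
{
    for (int col0 = 0; col0 < width; col0 += STRIP) {
        int cols = (col0 + STRIP <= width) ? STRIP : (width - col0);
        float y[STRIP];                      /* per-column filter state */
        for (int i = 0; i < cols; i++)
            y[i] = img[col0 + i];
        for (int row = 0; row < height; row++) {
            float *p = img + row * width + col0;   /* contiguous run */
            for (int i = 0; i < cols; i++) {
                y[i] = (1 - c) * y[i] + c * p[i];
                p[i] = y[i];
            }
        }
    }
}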
 
vember said:
It's not really the scattered part that's necessary; it's just multiple writes from a single shader run that are required, and these are very predictable and cache-friendly. The blurring works in two steps, horizontal and vertical. For the filter to work, it needs to process a single row/column in one shader run, as the filter state at each pixel depends on all the pixels to the left of it.

The simplest possible form of one pass of this type of filter looks like this (for one row):
Code:
// one-pole IIR low-pass over a single row; the state y carries across pixels
y = in[0];
c = 0.5;
for(x = 0; x < width; x++)
{
   y = (1-c)*y + c*in[x];
   out[x] = y;
}

Even if the filters get more complex to achieve the desired result, the memory-access pattern remains the same: each pixel is read and written only once. As the rows/columns are totally independent within each pass, parallelization is trivial. (theoretically ;) )

For such a simple exponential decay filter, you can actually convert the IIR filter to a sequence of FIR filters, highly suitable for GPU implementation, thus breaking the pixel-to-pixel dependency:
Code:
for(pass = 0; pass < pass_count; pass++)
{
   dif = 1 << pass;                      // step doubles every pass: 1, 2, 4, ...
   weight_factor = pow(1-c, dif);        // (1-c)^dif
   for(x = 0; x < width; x++)
   {
      out[x] = in[x] + weight_factor * in[x-dif];
   }
   // swap input and output
}
for(x = 0; x < width; x++)               // final scaling by c
{
   out[x] *= c;
}
It gets considerably trickier if you wish to vary 'c' on a per-pixel basis, but it can still be done. Note that in[x-dif] will access outside your array, which is a problem in a CPU implementation (try clamping the array index to 0); in GPUs, you can play tricks with texture edges/borders to avoid this problem. The 'pass_count' is in theory log2 of your frame size; if c is large, you can make it smaller in practice.

EDIT: fixed bug in code
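
For reference, a self-contained C transcription of those passes (the function name, scratch-buffer argument, and the edge clamping are assumptions, not from the post above); for interior pixels it approaches the IIR result once pass_count covers the effective filter length:
Code:
#include <math.h>
#include <string.h>

/* Log-pass FIR decomposition of the one-pole IIR, clamping x - dif
   at 0 as suggested.  'tmp' is a scratch buffer of 'width' floats. */
void iir_as_fir_passes(const float *in, float *out, float *tmp,
                       int width, float c, int pass_count)
{
    float *src = out, *dst = tmp;
    memcpy(src, in, width * sizeof(float));
    for (int p = 0; p < pass_count; p++) {
        int dif = 1 << p;
        float wf = powf(1.0f - c, (float)dif);     /* (1-c)^dif */
        for (int x = 0; x < width; x++) {
            int x0 = x - dif;
            if (x0 < 0) x0 = 0;                    /* clamp at the left edge */
            dst[x] = src[x] + wf * src[x0];
        }
        float *t = src; src = dst; dst = t;        /* ping-pong buffers */
    }
    for (int x = 0; x < width; x++)
        out[x] = c * src[x];                       /* final scaling by c */
}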
 
Yes, that would be identical in this simple case, but it would also increase the workload a lot, and yes, I do want to change 'c' per-pixel. It'd be better to just use a more traditional approach if you cannot reap the benefits of this one anyway.

It may not be trivial at all to implement, but given the large amount of parallelism and the predictability of the reads/writes, I don't see any reason why a future GPU ultimately wouldn't "like" the recursive-filter approach, hiding all the latency of the reads/writes.
 
vember said:
Yes, that would be identical in this simple case, but it would also increase the workload a lot, and yes, I do want to change 'c' per-pixel. It'd be better to just use a more traditional approach if you cannot reap the benefits of this one anyway.

It may not be trivial at all to implement, but given the large amount of parallelism and the predictability of the reads/writes, I don't see any reason why a future GPU ultimately wouldn't "like" the recursive-filter approach, hiding all the latency of the reads/writes.
The main problem of introducing scattered pixel shader writes is in general one of coherency; if two pixels are written to overlapping areas, then the writes must be sorted into polygon order before being committed, or else the result is nondeterministic and not debuggable. This is already difficult with non-scattered writes; if you introduce scattered writes with pre-declared memory areas, it is not impossible, but very expensive; if you introduce fully general scattered writes with write locations decided dynamically, you need something brutal like an RB-tree *per pixel* if you wish to avoid a catastrophic breakdown of parallelism.

Even if you special-case for parallelism along one dimension, as in your case, your shader execution parallelism is limited to the dimensions of your picture; if you have e.g. a 200-step pipeline in your pixel shader (AFAIK a rather typical number) and 24 pipelines (a bit less than an actual "future GPU"), then you need 4800 pixel shader program invocations in flight at any given time to get 100% utilization. If you have a 1000x1000 pixel framebuffer, you only get 1000 parallelism opportunities, giving you a max theoretical shader utilization of ~21%, even assuming infinite bandwidth. Also, the fact that modern GPUs usually have texture/framebuffer layouts greatly different from simple linear 2D arrays is likely to work against you in this particular case.

As for the program I suggested, here is a variant with per-pixel varying 'c':
Code:
c[]; // array containing the per-pixel c

for(x = 0; x < width; x++)
{
    in_weight[x] = 1.0 - c[x];
    sum_weight[x] = 1.0;
}

for(pass = 0; pass < pass_count; pass++)
{
   dif = 1 << pass;
   for(x = 0; x < width; x++)
   {
      weight = in_weight[x];
      sum_weight[x] += weight;
      out[x] = in[x] + weight * in[x-dif];
      out_weight[x] = weight * in_weight[x-dif];   // accumulated product of (1-c)
   }
   // swap in[] and out[]
   // swap in_weight[] and out_weight[]
}
for(x = 0; x < width; x++)
{
   out[x] = in[x] / sum_weight[x];   // normalize by the accumulated weight
}
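For what it's worth, here is the same variant written out as a self-contained C routine (the names, the 0-clamp at the edge, and the internal scratch allocation are assumptions, purely for illustration):
Code:
#include <stdlib.h>

/* The per-pixel 'c' variant above, with x - dif clamped at 0.
   Scratch buffers are allocated here only to keep the sketch short. */
void blur_row_varying_c(const float *in, const float *c, float *out,
                        int width, int pass_count)
{
    float *buf  = malloc(width * sizeof(float));   /* ping-pong values   */
    float *w    = malloc(width * sizeof(float));   /* in_weight          */
    float *wbuf = malloc(width * sizeof(float));   /* ping-pong weights  */
    float *sum  = malloc(width * sizeof(float));   /* sum_weight         */

    float *src = out, *dst = buf, *src_w = w, *dst_w = wbuf;
    for (int x = 0; x < width; x++) {
        src[x]   = in[x];
        src_w[x] = 1.0f - c[x];
        sum[x]   = 1.0f;
    }
    for (int p = 0; p < pass_count; p++) {
        int dif = 1 << p;
        for (int x = 0; x < width; x++) {
            int x0 = x - dif;
            if (x0 < 0) x0 = 0;                    /* clamp at the left edge */
            float weight = src_w[x];
            sum[x]  += weight;
            dst[x]   = src[x] + weight * src[x0];
            dst_w[x] = weight * src_w[x0];         /* accumulated (1-c) product */
        }
        float *t;
        t = src;   src   = dst;   dst   = t;       /* swap in[]/out[]           */
        t = src_w; src_w = dst_w; dst_w = t;       /* swap in_weight/out_weight */
    }
    for (int x = 0; x < width; x++)
        out[x] = src[x] / sum[x];                  /* normalize */

    free(buf); free(w); free(wbuf); free(sum);
}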
 
BTW, if you wish to do a really fast and cheesy GPU-based depth-of-field, you could use the 'mipmap blur' technique: render the frame to a texture, generate a mipmap pyramid from the texture, then render the texture to the framebuffer as a single quad with a per-pixel LOD bias applied ('texldb' instruction under DirectX PS2.0+). Given the parameter 'c' in this thread, the LOD bias should be roughly log2(1/c); per-pixel 'c' could then be achieved by taking 'c' from a texture. If this gives unacceptably bad quality, it is possible to run a small blur-filter on each mipmap; the LOD bias should then be reduced, according to the size of that blur-filter.

This method does cause some problems with halos around sharp objects placed just in front of more blurred objects.
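
A tiny sketch of the bias computation (the helper name and the clamping to the pyramid size are assumptions):
Code:
#include <math.h>

/* Per-pixel mip LOD bias for the 'mipmap blur': roughly log2(1/c),
   clamped to the available mip levels. */
float mip_lod_bias(float c, int mip_levels)
{
    float bias = log2f(1.0f / c);     /* smaller c -> blurrier -> higher mip */
    if (bias < 0.0f) bias = 0.0f;
    if (bias > (float)(mip_levels - 1)) bias = (float)(mip_levels - 1);
    return bias;
}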
 
I assumed that 1000-ish pixels in flight would be enough to hide latency, but you're right, that could become a serious bottleneck. I do know the mipmap LOD-bias technique; I'm not really looking for a practical solution as such. This was merely a test of using IIR filters with variable coefficients in graphics.

Thanks for the feedback.
 