"Pure and Correct AA"

The fact that hardware uses framebuffer compression should indicate that there is enough to be gained from it, even in complex scenes. Of course for a simple downsample filter in a shader you don't save much, but reading the flag plus an if clause wouldn't cost much either.
I think that the extremely cache-friendly one-time sample fetch is quite different in nature from the complexity of accesses that render-target compression is designed for - near-randomised locations in memory and coping with overdraw, with tiles potentially going back and forth between compressed and uncompressed.

e.g. I suppose you can boil it down to a best-case difference of 1.5GB/s: 1920 x 1080 x 32-bit x 60fps is ~0.5GB/s without AA, or 2GB/s with 4xAA and all samples fetched. Somewhere in the middle (1.25GB/s consumed?) on average scenes with the conservative compression-flag test? 1.75GB/s for high-complexity scenes? - a saving of only 0.25GB/s.

Jawed
 
I'm not sure why you mention bandwidth now. Whether compression flags are exposed in the shader or not, bandwidth consumption will probably be identical (if you use the flag strictly to avoid sampling all n samples). The idea is to save shader cycles.

Code:
Texture2DMS<float4, 4> t;

// LoadFlag is hypothetical: it would return true when the tile is
// compressed, i.e. all samples of this pixel are identical.
if (t.LoadFlag(tc)) {
  // short path: one sample represents the whole pixel
} else {
  // long path: fetch and process all four samples
}
instead of either always executing the long path or doing something like:
Code:
Texture2DMS<float4, 4> t;

float4 samples[4];
for (int i = 0; i < 4; ++i)
  samples[i] = t.Load(tc, i);

// per-pixel "all samples equal" test standing in for the compression flag;
// the component-wise float4 comparisons need all() to reduce to a bool
if (all(samples[0] == samples[1]) && all(samples[1] == samples[2]) && all(samples[2] == samples[3])) {
  // short path
} else {
  // long path
}
That one uses up more temp registers and the cost for the comparison is higher. It does have per-pixel granularity, but that doesn't buy you anything if your branching granularity is higher.
 
Pardon me for interrupting a great technical discussion that is over my head with a stupid question... but it's been nagging me since the first page or two and I finally decided to just ask.

Just how impractical is an analytic coverage calculation assuming that it would provide acceptable AA results for geometry edges (ignoring all the discussion about "proper" AA and whether box coverage, sinc, or whatever gives the ideal result... I assume analytic pixel/box coverage would give comparable results to high sample MSAA anyway)?

I would think an early thing to get out of the way would be to avoid calculating coverage on a screen full of pixels that are nowhere near the primitive edge. Can that be accomplished with a lookup table using X,Y intercepts of the edge projection, or perhaps a comparison of coordinate values along the primitive edge to a central pixel coordinate lookup table, to eliminate pixels that aren't even close to being involved?

Second, how expensive is the actual calculation of coverage inside a box relative to other modern GPU operations? I'm thinking that finding two intercepts in the box region and calculating the area of the triangle or trapezoid created is pretty straightforward, but is this more calculation than, say, a typical shader operation on a single pixel?

Third, what other penalties go with that kind of approach? I'm about as layman as it gets, but I would guess this would be done early in the pipeline, after triangle setup and Z-rejection? Would this result in another value being tagged to each pixel (texel? another term? you could have multiple render-pixels for a single screen pixel at this point, each with unique color/Z values corresponding to the primitive with which it is associated). Seems this would add to bandwidth usage as you need to carry these values along with Z and color? And would that alter the way early Z-rejection is done (rejecting only pixels with full occlusion, so would coverage calculation be done before or after Z-rejection?)

Oh, and would things like normal mapped true geometry tessellation screw up any such approach?

I'm guessing the reply will be "more expensive than MSAA." But wouldn't it equal essentially infinite sample AA once the calculation is done? How does it compare to 16X or better AA? And I would think hardware acceleration would certainly be possible - a fixed unit more or less that is tagged onto any architecture.

Just curious... sorry for the multipage regression.
 
I'm not sure why you mention bandwidth now. Whether compression flags are exposed in the shader or not, bandwidth consumption will probably be identical (if you use the flag strictly to avoid sampling all n samples).
Agreed - I was pointing out how minor the bandwidth difference would prolly end up being.

The idea is to save shader cycles.
OK. My first thoughts on this were that there'd be a 1 or 2% difference in performance between testing a flag and reading all samples.

In this presentation:

http://ati.amd.com/developer/gdc/2007/Riguer-DX10_tips_and_tricks_for_print.pdf

right at the end, there's some code for doing AA with tonemapped-HDR:

Code:
Texture2DMS<float4, SAMPLES> tHDR;

// SAMPLES is a compile-time constant; exposure comes from a constant buffer
float4 main(float4 pos : SV_Position) : SV_Target
{
    int3 coord;
    coord.xy = (int2)pos.xy;
    coord.z = 0;
    // Correct exposure for individual samples and sum it up
    float4 sum = 0;
    [unroll]
    for (int i = 0; i < SAMPLES; i++)
    {
        float4 c = tHDR.Load(coord, i);
        sum.rgb += 1.0 - exp(-exposure * c.rgb);
    }
    sum *= (1.0 / SAMPLES);
    // Gamma correction
    sum.rgb = pow(sum.rgb, 1.0 / 2.2);
    return sum;
}

which got me thinking about the relative costs. With a TC-flag test as you suggest the shader might process a pixel in ~10 cycles if compressed or ~30 if uncompressed.

With a "brute-force" test, reading all the samples, I guess a compressed pixel would cost ~20 cycles, while an uncompressed pixel would cost ~40. I don't think the extra registers this would consume would be a big issue: D3D10 cards have beefier register capaibility than DX9 cards, and if the shader was more complex the registers would be "recycled".

I'm not sure what kind of per-pixel instruction-cycle budget you can get in a high-end D3D10 engine (deferred, say) - 300 cycles? Anyway, this bit of AA code takes a much bigger chunk of that than I had appreciated: 3-13% with the numbers I'm guessing (10 of ~300 cycles is ~3%; 40 is ~13%). A far bigger difference in performance than I was thinking :oops:

Apart from that, though, it seems the biggest issue is dynamic branching coherence, which makes itself felt most keenly exactly when performance is worst: when high image complexity produces lots of edges. Hmm...

Jawed
 
Most likely there wouldn't be large gains, but neither would the cost be high. But I guess no one thought it was worth the hassle of having such a rather weakly defined feature.
 
Just how impractical is an analytic coverage calculation assuming that it would provide acceptable AA results for geometry edges (ignoring all the discussion about "proper" AA and whether box coverage, sinc, or whatever gives the ideal result... I assume analytic pixel/box coverage would give comparable results to high sample MSAA anyway)?
I think this is what Humus was suggesting. Or something similar...

I would think an early thing to get out of the way would be to avoid calculating coverage on a screen full of pixels that are nowhere near the primitive edge. Can that be accomplished with a lookup table using X,Y intercepts of the edge projection, or perhaps a comparison of coordinate values along the primitive edge to a central pixel coordinate lookup table, to eliminate pixels that aren't even close to being involved?
Currently the rasteriser solves this problem with dedicated hardware - for each fragment it generates, it determines whether an edge is involved and, if so, produces mask information based upon the geometry sampling points. So, yeah, this is done "early", before the ROPs even see the fragment and have to think about AA.

Second, how expensive is the actual calculation of coverage inside a box relative to other modern GPU operations? I'm thinking that finding two intercepts in the box region and calculating the area of the triangle or trapezoid created is pretty straightforward, but is this more calculation than, say, a typical shader operation on a single pixel?
What you're talking about is analogous to viewport clipping (though that doesn't calculate area), but with the viewport being a single pixel on the screen - a process currently performed by dedicated hardware. I guess it would be phenomenally expensive because each triangle can cover hundreds of screen pixels.
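
For a flavour of the arithmetic, here's a rough sketch of exact box-filter coverage for a single edge over a unit pixel - clip the pixel square against the edge's half-plane (Sutherland-Hodgman with one plane), then take the area of whatever polygon is left. The names and structure are mine, purely illustrative:
Code:
// Coverage of the half-plane dot(n, p) + d >= 0 over the unit pixel [0,1]^2.
// Purely illustrative HLSL - nothing here corresponds to real hardware.
float EdgeCoverage(float2 n, float d)
{
    // Pixel corners, counter-clockwise.
    float2 poly[4] = { float2(0, 0), float2(1, 0), float2(1, 1), float2(0, 1) };

    // Clip the square against the half-plane.
    float2 clipped[8];
    int m = 0;
    for (int i = 0; i < 4; ++i)
    {
        float2 a = poly[i];
        float2 b = poly[(i + 1) % 4];
        float da = dot(n, a) + d;
        float db = dot(n, b) + d;
        if (da >= 0) clipped[m++] = a;
        if (da * db < 0) // the edge crosses this side of the pixel
            clipped[m++] = lerp(a, b, da / (da - db));
    }
    if (m < 3) return 0;

    // Shoelace formula: the clipped polygon's area is the coverage fraction.
    float area = 0;
    for (int j = 0; j < m; ++j)
    {
        float2 p0 = clipped[j];
        float2 p1 = clipped[(j + 1) % m];
        area += p0.x * p1.y - p0.y * p1.x;
    }
    return 0.5 * abs(area);
}
Per edge and per pixel that's only a few dozen multiply-adds, but it has to run for every pixel the triangle touches - which is the "hundreds of screen pixels" problem.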

Third, what other penalties go with that kind of approach?
Unfortunately in this thread we haven't got as far as quantifying the error of 4xMSAA or 8xMSAA on poly edges. When does MSAA get good enough? What's the margin between still and moving images like?

Anyway, you would need to come up with a way to record "triangle area" per screen pixel in the render target (i.e. in memory), collating each triangle against Z - or rather a range of Z, since if you're measuring the area within a pixel you have to take account of how depth varies across the fragment. Then the ROPs would need to be able to "average" over your arbitrary number of fragments, clipping the fragments against each other according to Z (which is a sort of viewport-clipping problem all over again). You could limit the number of triangles (fragments) per pixel - e.g. 8 - to make things less ruinous.

Overall, it sounds to me like the death of all the "efficiencies" gained with Z-buffer based rendering.
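
To put a number on "record triangle area per screen pixel", here's an entirely hypothetical per-pixel record (my invention, just to size up the cost):
Code:
// Hypothetical storage for analytic AA - nothing like this exists in
// real hardware; it's only here to make the cost concrete.
struct AnalyticFragment
{
    float  coverage; // analytic area of the triangle within this pixel, 0..1
    float  zMin;     // depth varies across the covered region,
    float  zMax;     //   so a Z range rather than a single Z
    float4 color;
};

struct AnalyticPixel
{
    uint             count;   // triangles recorded so far, capped at 8
    AnalyticFragment frag[8];
};
That's 228 bytes per pixel versus 4 for a plain colour buffer, which rather supports the bandwidth point below.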

I'm about as layman as it gets, but I would guess this would be done early in the pipeline, after triangle setup and Z-rejection? Would this result in another value being tagged to each pixel (texel? another term? you could have multiple render-pixels for a single screen pixel at this point, each with unique color/Z values corresponding to the primitive with which it is associated).
"Fragment" (teehee, I actually forgot that word in the first version of this reply, so went back and re-worded for clarity - hope it's clear).

Seems this would add to bandwidth usage as you need to carry these values along with Z and color? And would that alter the way early Z-rejection is done (rejecting only pixels with full occlusion, so would coverage calculation be done before or after Z-rejection?)
The bandwidth cost would seem to be pretty hideous. I suppose you could approximate things: if you're going to limit the number of triangles per pixel, you might also limit the clipping coordinates to a set of points around the four sides of the pixel. Say 8 per side: triangle A has its vertices at points 5, 8 and 13, with Z of 0.5, 0.5 and 0.5001; B has its vertices at 4, 7 and 21, with Z of 0.5, 0.5 and 0.5001.

Code:
32------1------2------3------4------5------6------7------8
|                            bbbbbbbbbaaaaaaaaaaaaaaaaaaa|
31                           bbbbbbbbbbbbaaaaaaaaaaaaaaaa9
|                           bbbbbbbbbbbbbbbaaaaaaaaaaaaaa|
30                          bbbbbbbbbbbbbbbbbaaaaaaaaaaa10
|                          bbbbbbbbbbbbbbbbb  aaaaaaaaaaa|
29                         bbbbbbbbbbbbbbb      aaaaaaaa11
|                         bbbbbbbbbbbbbb          aaaaaaa|
28                        bbbbbbbbbbbb              aaaa12
|                         bbbbbbbbbbb                 aaa|
27                       bbbbbbbbbb                     13
|                        bbbbbbbb                        |
26                      bbbbbbb                         14
|                       bbbbb                            |
25                     bbbb                             15
|                      bb                                |
24-----23-----22-----21-----20-----19-----18-----17-----16
Looks like fun :p
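
Mind you, packed naively a fragment in that scheme is remarkably small. A sketch under the assumptions above (32 perimeter points, so 5 bits per vertex index; the function is hypothetical):
Code:
// Three 5-bit perimeter indices plus a 16-bit quantised Z: 31 bits,
// so one such fragment fits in a single uint (hypothetical packing).
uint PackFragment(uint i0, uint i1, uint i2, float z)
{
    uint zq = (uint)(saturate(z) * 65535.0); // quantise Z to 16 bits
    return i0 | (i1 << 5) | (i2 << 10) | (zq << 15);
}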

Oh, and would things like normal mapped true geometry tessellation screw up any such approach?
Tessellation is geometry anyway, so that would just be geometry. Normal mapping is a texture-filtering/shader-antialiasing problem.

I'm guessing the reply will be "more expensive than MSAA." But wouldn't it equal essentially infinite sample AA once the calculation is done? How does it compare to 16X or better AA? And I would think hardware acceleration would certainly be possible - a fixed unit more or less that is tagged onto any architecture.
In the race to "good-enough" real-time AA, MSAA looks like it's destined always to win!

I'd really like to see AA quality quantified, for moving objects. The HQV video testing software has some "objective" tests for the visual quality of digital video replay - surely the 3D industry could stand to have some similar tools...

Jawed
 
Most likely there wouldn't be large gains, but neither would the cost be high. But I guess no one thought it was worth the hassle of having such a rather weakly defined feature.
One thing that's been bugging me is whether there would always be a varying TC-flag. e.g. if a GPU didn't perform compression for a floating-point render target.

Jawed
 
I noticed that even in the ATI sample code they perform gamma correction [after] resolve/tone mapping.
 
I noticed that even in the ATI sample code they perform gamma correction [after] resolve/tone mapping.
Isn't that merely gamma-encoding? Conversion from linear space (fp32 linear?) into gamma 2.2 space?

Is there a more complete version of the code I quoted above? Is there an SDK example of this somewhere?

In the NVidia presentation on Adrianne:

http://developer.download.nvidia.com/presentations/2007/gdc/Advanced_Skin.pdf

on pages 106 to 112 they take the same approach:

  • Our displays warp pixel values
  • When we write to the framebuffer:
    • Invert the warping
float3 finalCol = do_all_lighting_and_shading();
float pixelAlpha = compute_pixel_alpha();
return float4(pow(finalCol, 1.0 / 2.2), pixelAlpha);

// or (cheaper)
return float4(sqrt(finalCol), pixelAlpha);
On the next page the presentation introduces an sRGB render target:
  • sRGB framebuffers
    • You write linear pixels
    • Hardware does correction for you
    • Blending is done linearly
    • EXT_framebuffer_sRGB
so I guess the first approach (without sRGB render target) is for linear render targets.

Jawed
 
I noticed that even in the ATI sample code they perform gamma correction [after] resolve/tone mapping.

Yes. That's my code actually, and it contradicts my previous statement in this thread. That's my first revision of the code; later I thought about it and changed it to do the correction before resolving, and noticed much improved quality, which explains my earlier comments in this thread. Then, after thinking more about it, it should actually be done after resolve after all. Not sure why I'm seeing the opposite in practice, though. I haven't had the chance to revisit my code yet as I've been traveling.
 
Going out on a limb :p, in D3D10 it seems to me that if you intend to perform HDR tonemapping then you use a linear-space intermediate render target, but if you are using a regular, immediate, non-HDR approach, then it's simplest to use an sRGB render target format.

In the latter case, D3D10 will handle both blending and AA correctly by translating between linear and gamma 2.2 spaces in the ROPs with no programmer involvement.

In either case, bog-standard textures (which, even in DX9 or earlier, are in gamma 2.2 space) should be converted to linear space - seemingly you can get D3D10 to do this automatically if you declare the textures correctly; otherwise you have to code the conversion yourself. Presumably the declaration is the better approach, simply because the TMUs will then perform texture filtering correctly, converting from gamma 2.2 space first.
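
For illustration, the code-it-yourself fallback would be something like this (my sketch; tex, samp and uv are placeholders) - and it shows why the declaration wins: the pow runs after the TMU has already filtered in gamma space:
Code:
Texture2D tex;
SamplerState samp;

// Manual decode when the texture is NOT declared with an sRGB format.
// By this point the TMU has already filtered in gamma 2.2 space, which
// is exactly the error that declaring the format properly avoids.
float3 SampleLinear(float2 uv)
{
    return pow(tex.Sample(samp, uv).rgb, 2.2);
}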

Have I got that right?

---

Potentially stupid question: is D3D10's sRGB attribute of a render target orthogonal to the bit-depth/format (float or integer)?

Here's a useful thread:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1049712&SiteID=1

I'm wondering now if the R11G11B10_FLOAT format could be short on precision if it's used as a linear space intermediate render target for HDR. This presentation gives a bit more detail:

http://developer.download.nvidia.com/presentations/2006/develop/next-gen-dx10-games-develop06.pdf

R11G11B10_FLOAT
  • Each component has own 5 bit exponent (like fp16 numbers)
  • RGB components have 6, 6, 5 bits of mantissa each (vs. 10 bit mantissa for fp16)
  • No sign bit, all values must be positive
  • Can be used for render targets

Presumably there's an implicit leading bit in each mantissa, so it's really 7,7,6 - i.e. a worst-case rounding error of about 2^-7 (~0.8%) for red and green.

Perhaps if you build a deferred shader with a separate tonemapping/AA-resolve pass, then linear R11G11B10_FLOAT's precision isn't much of a problem when it is used as the intermediate render target: only the tonemapping/AA is subject to this precision, while all prior work on shading, lighting, shader-AA etc. is done in FP32.

Seems there's been a fair amount of confusion (and lack of documentation) on this subject.

Jawed
 
Potentially stupid question: is D3D10's sRGB attribute of a render target orthogonal to the bit-depth/format (float or integer)?

DX10 docs are online. sRGB isn't an attribute of a render target; it's an inherent part of the resource format: the R8G8B8A8_UNORM_SRGB format is distinct from the R8G8B8A8_UNORM format. The complete format list is at http://msdn2.microsoft.com/en-us/library/bb173059.aspx. The sRGB formats are:

DXGI_FORMAT_R8G8B8A8_UNORM_SRGB = 29,
DXGI_FORMAT_BC1_UNORM_SRGB = 72,
DXGI_FORMAT_BC2_UNORM_SRGB = 75,
DXGI_FORMAT_BC3_UNORM_SRGB = 78,

(BC1 = DXT1, BC2 = DXT2/3, BC3 = DXT4/5).
 
It would be expensive to force hardware support of sRGB color-space conversion for all formats - not to mention the complications in specifying the accuracy of the conversions.
 
The sRGB color space and conversion functions are only really defined within a bounded range [0,1] (where 0 corresponds to black and 1 corresponds to the brightest color the display device can actually display, "white"). Interpreting floating-point data as sRGB - data which are not naturally clamped to [0,1] and normally do not even represent a data set that can meaningfully be sent to a display device without further processing - does not strike me as very meaningful or useful.

Also, the precision loss that arises when doing lRGB->sRGB conversion on fixed-point data is not really a problem that can arise with floating-point data (assuming the float data actually are bounded to [0,1]); the main problem is quantization caused by the harsh slope of the conversion curve around zero, but in floating point the fact that absolute precision keeps improving the closer you get to zero pretty much solves this problem.
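
For reference, the published lRGB->sRGB encoding looks like this (a sketch; the function name is mine). The linear segment with slope 12.92 near zero is precisely where fixed-point formats run out of precision and floating point doesn't:
Code:
// Standard lRGB -> sRGB encoding: a linear segment with slope 12.92 below
// 0.0031308 and a power curve above. HLSL's ?: selects per component.
float3 LinearToSRGB(float3 c)
{
    float3 lo = 12.92 * c;
    float3 hi = 1.055 * pow(c, 1.0 / 2.4) - 0.055;
    return (c <= 0.0031308) ? lo : hi;
}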
 
Questions:

Should there be separate ways to tackle various types of aliasing (edge, texture, spatial) using whatever "pure and correct" method?

How feasible would this be in hardware?

If feasible, would such an approach be cheaper than the current universal AA approach?

Sorry for bringing this dead thread back up the forum... didn't realize last post was 3 days ago... stopped reading this thread at pg. 5... too much to absorb!! :)
 
Should there be separate ways to tackle various types of aliasing (edge, texture, spatial) using whatever "pure and correct" method?
Edge and texture aliasing are spatial; did you mean temporal?

We do have different methods of antialiasing for geometry edges, textures and alpha test edges (MSAA/FAA, texture filtering/mipmapping, alpha to coverage). Could you clarify what you mean by "the current universal AA approach"?
 