Alternative AA methods and their comparison with traditional MSAA

Update on FXAA, now at v3.

Timothy Lottes
Added math to FXAA v3 Console, likely bad for PS3. To any PS3 dev: looking for perf feedback, DM me and I will send FXAA v3 preview source

Timothy Lottes
@NocturnDragon FXAA v3 Console is a faster + higher quality version of FXAA v2 Console; FXAA v3 Quality is a much faster version of FXAA v1.
 
Did some dynamic branch optimizations for FXAA2 today. Now it runs at 0.9ms on Xbox with an ifAll branch.

I was searching the DX11 documentation for a way to branch depending on the result of all threads in the same branching unit (ifAll, ifAny). CUDA has __all and __any, but I can't find equivalents in DirectCompute/DX11... DirectCompute doesn't have any way to query the size of a warp/wavefront, so maybe this feature was too low level as well.
 
Did some dynamic branch optimizations for FXAA2 today. Now it runs at 0.9ms on Xbox with an ifAll branch. I was searching the DX11 documentation for a way to branch depending on the result of all threads in the same branching unit (ifAll, ifAny)...
You're probably trying to do something I don't understand, but if all threads take the same branch the hardware should actually branch rather than predicate.
 
DirectCompute doesn't have any way to query the size of a warp/wavefront, so maybe this feature was too low level as well.
It's non-portable: it lets you write code that will run differently based on the SIMD size of the underlying implementation, which is not allowed for obvious reasons. As mentioned of course typical dynamic branching should perform well enough on modern PC cards... there's fairly little overhead to adding a branch.
 
Ported FXAA2 this morning to our engine. Did some Xbox 360 microcode ASM optimizations to the tfetches (to remove some ALU instructions), but nothing else. It looks really good actually. Way better than 2xMSAA or our hacky temporal AA (I'd say it's very much comparable to 4xMSAA in quality). Textures stay sharp, and it properly antialiases all our foliage/trees (we have a lot of vegetation) and noisy specular highlights.

For 1280x672 (93% of 720p pixels) it runs at 1.2ms. It's tfetch bound, so I will likely integrate it into our huge post process shader (which is currently ALU bound). It would balance the load nicely. Total cost for AA would be around 1ms then :)

Another way to make it run faster would be to read more than one luminance value per tfetch. Sadly, a gather instruction is not available on consoles, which makes straightforward R8 luminance sampling (4 neighbour texels to RGBA in one instruction) impossible. That would bring it under 1.0ms.

Just curious, but how much memory did FXAA take on 360?
 
Andrew Lauritzen said:
It's non-portable: it lets you write code that will run differently based on the SIMD size of the underlying implementation, which is not allowed for obvious reasons. As mentioned of course typical dynamic branching should perform well enough on modern PC cards... there's fairly little overhead to adding a branch.

The point of __any and __all is not to avoid a branch, but to efficiently exchange information between work items.
 
Just curious, but how much memory did FXAA take on 360?
It doesn't use any memory at all.

G-buffers are not used after post processing, so we have two full screen buffers unused at that point of rendering. We resolve the post processed back buffer to one of our g-buffer textures. The AA samples texels from the g-buffer and outputs pixels to EDRAM. UI is drawn on top of the antialiased result.

MLAA would also cost no memory, as two g-buffers are enough to keep its temp results (MLAA needs two temp buffers). Basically any reasonable post AA filter would have zero memory usage when used in a deferred renderer.

With forward rendering you would likely get a memory hit, unless you, for example, reuse a shadowmap's memory area to store the AA temp results. On consoles you can overlap multiple textures in the same memory area, so it doesn't matter that the shadowmap uses a different format than RGBA8. And for a forward renderer, you could also use hardware MSAA. Post process AA filters are most useful for deferred renderers.
 
It doesn't use any memory at all. [...] Post process AA filters are most useful for deferred renderers.

Ah, I see. So in a sense, post process AA such as MLAA and FXAA saves a lot of memory/bandwidth compared to hardware MSAA? And from this point on, will devs that use deferred renderers simply switch to post process AA, or is there still some incentive to stick with MSAA?
 
It's non-portable: it lets you write code that will run differently based on the SIMD size of the underlying implementation, which is not allowed for obvious reasons. As mentioned of course typical dynamic branching should perform well enough on modern PC cards... there's fairly little overhead to adding a branch.
The point of an ifAny/ifAll branch is to do fewer calculations when branches diverge inside a SIMD (branching unit). The DirectCompute compiler cannot automatically optimize for this, since it changes the meaning of the code.

For example, a simple if-else branch:
if (needSlowPath) execute40Instructions(); else execute20Instructions();

With standard branching you execute 60 instructions if both sides of the branch occur inside a single SIMD (and predicate away the instructions for non-applicable threads). With an ifAny branch, if any thread inside the SIMD evaluates needSlowPath = true, all pixels in the SIMD take the if-statement, so you only execute 40 instructions. All threads jump over the 20-instruction path (including those that would have evaluated needSlowPath = false). It also guarantees that all threads inside the SIMD run exactly the same instructions (no predicates etc. needed).

Standard if:
Slow path for all pixels: 40 instructions
Mixed slow and fast path: 60 instructions
Fast path for all pixels: 20 instructions

ifAny/__any:
Slow path for all pixels: 40 instructions
Mixed slow and fast path: 40 instructions (saves 20 instructions here)
Fast path for all pixels: 20 instructions

Of course you can only use this kind of branching in cases where one of the execution paths produces correct results for both outcomes of the comparison.

You are correct that the hardware SIMD size affects how these kinds of branches get executed. But if branch A provides correct results for all threads, then it's safe to execute any number of branch B threads using branch A. The smaller the SIMD size, the more branch B threads will actually execute branch B. The programmer has to be really careful, however, since both branch sides can be optimized slightly differently, and there can be different float rounding etc., possibly making the calculation slightly nondeterministic.
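To make the pattern concrete, here is a minimal hypothetical CUDA sketch (kernel name, data and thresholds are invented for this post, and CUDA's __any() stands in for the ifAny branch). The expensive path is written so that it is also correct for the threads that didn't need it, which is exactly the precondition described above:

// Hypothetical sketch: the warp votes with __any() so that either the whole
// warp runs the expensive path once, or the whole warp skips it. The blend
// weight is exactly 0 for threads whose needSlowPath test failed, so dragging
// them along doesn't change their results; only the mixed case gets cheaper
// (the "40 instead of 60 instructions" case above).
__global__ void filterRow(const float* src, float* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i == 0 || i == n - 1) { dst[i] = src[i]; return; }   // leave borders untouched

    float c    = src[i];
    float edge = fabsf(src[i - 1] - src[i + 1]);             // cheap local contrast test
    bool needSlowPath = edge > 0.1f;

    if (__any(needSlowPath))   // warp-wide vote (pre-CUDA 9 intrinsic; newer code uses __any_sync)
    {
        // "Slow" path: every thread in the warp takes it if any thread needs it.
        float blurred = 0.25f * src[i - 1] + 0.5f * c + 0.25f * src[i + 1];
        float w = fminf(fmaxf(edge - 0.1f, 0.0f) * 4.0f, 1.0f);  // exactly 0 when needSlowPath is false
        dst[i] = c + w * (blurred - c);
    }
    else
    {
        // "Fast" path: only reached when no thread in the warp needed the slow path.
        dst[i] = c;
    }
}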
 
But if branch A provides correct results for all threads, then it's safe to execute any number of branch B threads using branch A.
Right, I understand how it works and how you can write code to take advantage of it. My point stands though: DirectX typically cannot allow instructions/features that behave differently on different hardware, else people will write code that unknowingly (or worse, intentionally) depends on the card/drivers they are testing it on. The non-determinism of floating point results is a very minor issue compared to the widely varying SIMD sizes on different GPUs :)

This is a more important optimization on CPUs anyways as predication currently costs instructions there. On GPUs predication is mostly "free" so all you're potentially saving is the other side of the branch when you have a general case and a specific case (for warps which diverge).
 
My point stands though: DirectX typically cannot allow instructions/features that behave differently on different hardware, else people will write code that unknowingly (or worse, intentionally) depends on the card/drivers they are testing it on.
Agreed. I just tried to explain it so that everyone reading this thread understands what we are talking about :)

Many of the most efficient CUDA algorithms for calculating things like prefix sums depend on intra-warp optimizations. The SIMD width can be used as a powerful synchronization tool, since you basically get a free warp-wide sync barrier after each instruction. I wonder, however, what will happen if NVIDIA chooses a warp size other than 32 in their future GPUs. Many highly optimized CUDA algorithms (also featured in popular CUDA libraries) will break completely. That will be a mess for sure :)
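The kind of intra-warp code meant here looks roughly like the classic warp-synchronous scan idiom below (a hypothetical sketch in the 2011-era style, not code from any particular library). It hard-codes the 32-wide warp and relies on lockstep execution instead of __syncthreads(), which is precisely why a different warp size would break it:

#define WARP_SIZE 32

// Hypothetical warp-synchronous inclusive scan: volatile shared memory, no
// barriers inside the warp, and the hard assumption that all 32 threads of a
// warp execute every step in lockstep (the "free warp-wide sync barrier").
__device__ float warpInclusiveScan(volatile float* data, unsigned int idx)
{
    const unsigned int lane = idx & (WARP_SIZE - 1);   // lane index within the warp

    // Hillis-Steele steps: after the last one, data[idx] holds the sum of this
    // thread's value and all lower lanes in the same warp.
    if (lane >=  1) data[idx] = data[idx -  1] + data[idx];
    if (lane >=  2) data[idx] = data[idx -  2] + data[idx];
    if (lane >=  4) data[idx] = data[idx -  4] + data[idx];
    if (lane >=  8) data[idx] = data[idx -  8] + data[idx];
    if (lane >= 16) data[idx] = data[idx - 16] + data[idx];

    return data[idx];
}

A wider or narrower warp silently changes both the lane mask and the number of steps required, so code like this has to be rewritten (or at least re-parameterized) rather than just recompiled.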
 
I wonder, however, what will happen if NVIDIA chooses a warp size other than 32 in their future GPUs. Many highly optimized CUDA algorithms (also featured in popular CUDA libraries) will break completely.
Yeah it's a difficult line... for most graphics work you can sorta see the justification for hiding the SIMD size from the user, but it becomes less clear once you stray into compute stuff. Ideally you want users to write applications that are parameterized on the SIMD size and scale appropriately to different architectures (not an easy task, mind you, for wide ranges of hardware), but how best to express that elegantly (i.e. not #define BLOCK_SIZE, etc.) is not totally clear yet.
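CUDA does at least expose the width so code can be parameterized on it rather than hard-coded, e.g. the built-in warpSize variable in device code or cudaDeviceProp on the host. A minimal sketch (not tied to any particular engine), though as noted above it still leaves the harder question of structuring the algorithm around a variable width:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the SIMD width at runtime instead of baking in 32 at compile time;
    // launch configuration and per-warp algorithms can then be tuned from it.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("warp size: %d\n", prop.warpSize);
    return 0;
}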
 
Every console game released without any kind of AA in 2012 will count as a tech disappointment for me.
Also any PC game released without an option for some real, non-uniform subsampling (MSAA, even if it has a fancier reconstruction filter than just box) will be a disappointment to me ;) I can sort of forgive it on the consoles for this generation given the limitations faced there.
 
Also any PC game released without an option for some real, non-uniform subsampling (MSAA, even if it has a fancier reconstruction filter than just box) will be a disappointment to me ;)
This is out of the question; after so many UE games [Mass Effect 1/2, I look at you!] I can't stand it either.

And about consoles: yes, that was valid earlier, but now I would prioritize FXAA/MLAA/DLAA ahead of any other post processing effect. 1ms isn't really that much, especially since most games are optimized for a 33ms frame.


@sebbbi
How can we bribe you for a PC release of Trials Revolution? :>
 
I see some (E)VSM bleeding there, but it's not bad considering :)... you're using 16-bit EVSM + SDSM z-ranges? Sure would be nice if the consoles could do 32-bit filtering.

AA looks pretty good close up, but some of the far stuff doesn't seem to be AA'd at all. I can understand the fence and wheel spokes and stuff that needs subsampling, but the pipe in the background - is it just too close to horizontal to be picked up by the filter?

How does it look in motion?
 