Timothy Lottes
Added math to FXAA v3 Console, likely bad for PS3. To any PS3 dev: looking for perf feedback, dmsg me and I will send FXAA v3 preview source
Timothy Lottes
@NocturnDragon FXAA v3 Console is a faster + higher quality version of FXAA v2 Console; FXAA v3 Quality is a much faster version of FXAA v1.
Did some dynamic branch optimizations for FXAA2 today. Now it runs at 0.9ms on Xbox with an ifAll branch. I was searching the DX11 documentation for a way to branch depending on the result of all threads in the same branching unit (ifAll, ifAny). CUDA has __all and __any, but I can't find equivalents in DirectCompute/DX11... DirectCompute doesn't have any way to query the size of a warp/wavefront either, so maybe this feature was considered too low level as well.

You're probably trying to do something I don't understand, but if all threads take the same branch the hardware should actually branch rather than predicate. It's non-portable: it lets you write code that will run differently based on the SIMD size of the underlying implementation, which is not allowed for obvious reasons. As mentioned, of course, typical dynamic branching should perform well enough on modern PC cards... there's fairly little overhead to adding a branch.
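For concreteness, here is a minimal CUDA sketch of the kind of branch being discussed (this is not FXAA source; the luma layout and the 0.05 contrast threshold are made up). The warp votes on a per-pixel predicate and only takes the cheap early-out when every lane agrees, so the branch is taken uniformly rather than predicated. Modern CUDA spells the vote intrinsic __all_sync; the 2011-era toolkits mentioned above had plain __all.

// Hypothetical "ifAll"-style early-out using a CUDA warp vote.
// Not from the FXAA source; buffer layout and threshold are invented.
__global__ void fxaaEarlyOut(const float* __restrict__ luma,
                             float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Crude local-contrast estimate against the left neighbour.
    float left = (i > 0) ? luma[i - 1] : luma[i];
    bool lowContrast = fabsf(luma[i] - left) < 0.05f;

    // Warp-wide vote: the branch is taken uniformly by the whole warp,
    // so the hardware really branches instead of predicating both sides.
    if (__all_sync(__activemask(), lowContrast)) {
        out[i] = luma[i];                 // whole warp skips the filter
    } else {
        out[i] = 0.5f * (luma[i] + left); // stand-in for the expensive filter path
    }
}

HLSL only gained comparable wave intrinsics (WaveActiveAllTrue, WaveActiveAnyTrue, WaveGetLaneCount) much later, with Shader Model 6; nothing like them existed in DX11-era DirectCompute, which is exactly the gap being discussed.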
Ported FXAA2 this morning to our engine. Did some Xbox 360 microcode ASM optimizations to the tfetches (to remove some ALU instructions), but nothing else. It looks really good actually. Way better than 2xMSAA or our hacky temporal AA (I'd say it's very much comparable to 4xMSAA in quality). Textures stay sharp, and it properly antialiases all our foliage/trees (we have a lot of vegetation) and noisy specular highlights.
For 1280x672 (93% of the 720p pixel count) it runs at 1.2ms. It's tfetch bound, so I will likely integrate it into our huge post-process shader (which is currently ALU bound). That would balance the load nicely; the total cost for AA would then be around 1ms.
Another way to make it run faster would be to read more than one luminance value per tfetch. Sadly the gather instruction is not available on consoles, so the straightforward approach of sampling an R8 luminance texture (four neighbouring texels into RGBA in one instruction) is not possible. That would bring it under 1.0ms.
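As a loose CUDA-flavoured analogue of that packing idea (purely illustrative, not console shader code; the names and layout are hypothetical): if luminance is pre-packed four samples to a uchar4, one 32-bit load returns four neighbouring values, which is roughly what a gather on an R8 target would provide.

// Illustrative only: luma pre-packed 4-to-a-uchar4 so one load fetches four
// horizontally adjacent samples, mimicking what gather4 on an R8 texture gives.
__global__ void packedLumaFetch(const uchar4* __restrict__ packedLuma,
                                float4* __restrict__ outLuma,
                                int packedWidth, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // index into the packed row
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= packedWidth || y >= height) return;

    uchar4 l = packedLuma[y * packedWidth + x];      // one fetch, four samples

    // Unpack to [0,1]; the edge-detection math would consume these values.
    outLuma[y * packedWidth + x] = make_float4(l.x * (1.0f / 255.0f),
                                               l.y * (1.0f / 255.0f),
                                               l.z * (1.0f / 255.0f),
                                               l.w * (1.0f / 255.0f));
}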
Just curious, but how much memory did FXAA take on 360?
It doesn't use any memory at all.
G-buffers are not used after post processing, so we have two full screen buffers unused at that point of rendering. We resolve the post processed back buffer to one of our g-buffer textures. The AA samples texels from the g-buffer and outputs pixels to EDRAM. UI is drawn on top of the antialiased result.
MLAA would also cost no memory, as two g-buffers are enough to hold its temp results (MLAA needs two temp buffers). Basically any reasonable post-process AA filter has zero memory cost when used in a deferred renderer.
With forward rendering you would likely take a memory hit, unless you, for example, reuse a shadow map's memory area to store the AA temp results. On consoles you can overlap multiple textures onto the same memory area, so it doesn't matter that the shadow map uses a different format than RGBA8. And with a forward renderer you could also just use hardware MSAA. Post-process AA filters are most useful for deferred renderers.
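A minimal sketch of that overlapping idea in CUDA/C++ terms, conceptual only: console graphics APIs let you place differently formatted textures over the same physical memory, which plain pointer reinterpretation here only approximates. All names and sizes are made up.

#include <cuda_runtime.h>

int main()
{
    // One allocation, sized for the larger of the two uses
    // (e.g. a 1280x672 RGBA8 target, 4 bytes per pixel either way).
    const int width = 1280, height = 672;
    const size_t bytes = size_t(width) * height * sizeof(uchar4);

    void* scratch = nullptr;
    cudaMalloc(&scratch, bytes);

    // Earlier in the frame the block is treated as 32-bit shadow/depth data...
    float*  shadowView    = static_cast<float*>(scratch);
    // ...and later, once that data is dead, as RGBA8 scratch for the AA pass.
    uchar4* aaScratchView = static_cast<uchar4*>(scratch);

    // The two views are never live at the same time, so the AA temp storage
    // costs no additional memory.
    (void)shadowView;
    (void)aaScratchView;

    cudaFree(scratch);
    return 0;
}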
Andrew Lauritzen said: It's non-portable: it lets you write code that will run differently based on the SIMD size of the underlying implementation, which is not allowed for obvious reasons. As mentioned, of course, typical dynamic branching should perform well enough on modern PC cards... there's fairly little overhead to adding a branch.

The point of an ifAny/ifAll branch is to do fewer calculations when branches diverge inside a SIMD (branching unit). The DirectCompute compiler cannot automatically optimize for this, since it changes the meaning of the code. But if branch A provides correct results for all threads, then it's safe to execute any number of branch B threads using branch A.
Right, I understand how it works and how you can write code to take advantage of it. My point stands though: DirectX typically cannot allow instructions/features that behave differently on different hardware, or else people will write code that unknowingly (or worse, intentionally) depends on the card/drivers they are testing on. The non-determinism of floating-point results is a very minor issue compared to the widely varying SIMD sizes across different GPUs.
Agreed. I just tried to explain it so that everyone reading this thread understands what we are talking about.
Many of the most efficient CUDA algorithms for calculating things like prefix sums depend on intra-warp optimizations. The SIMD width can be used as a powerful synchronization tool, since you basically get a free warp-wide sync barrier after each instruction. I wonder, however, what will happen if NVIDIA chooses a warp size other than 32 in a future GPU. Many highly optimized CUDA algorithms (including ones featured in popular CUDA libraries) will break completely. That will be a mess for sure.

Yeah, it's a difficult line... for most graphics work you can sort of see the justification for hiding the SIMD size from the user, but it becomes less clear once you stray into compute stuff. Ideally you want users to write applications that are parameterized on the SIMD size and scale appropriately to different architectures (not an easy task, mind you, for wide ranges of hardware), but how best to express that elegantly (i.e. not #define BLOCK_SIZE, etc.) is not totally clear yet.
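For concreteness, a sketch of the kind of intra-warp trick being described: an inclusive prefix sum across one warp that relies on lockstep execution and a hard-coded width of 32, which is exactly the portability hazard mentioned above. This is a generic textbook-style scan, not code from any particular library, written with the modern *_sync shuffle intrinsics rather than the 2011-era ones.

// Inclusive prefix sum across a single warp (Kogge-Stone style). No
// __syncthreads() is needed because the warp executes in lockstep; the
// hard-coded 32 is what breaks if the warp size ever changes.
__device__ int warpInclusiveScan(int value)
{
    const unsigned mask = 0xffffffffu;   // assumes a full, fully active warp
    const int lane = threadIdx.x & 31;   // lane index within the warp

    for (int offset = 1; offset < 32; offset <<= 1) {
        int n = __shfl_up_sync(mask, value, offset);  // value from 'offset' lanes below
        if (lane >= offset) value += n;
    }
    return value;   // lane i now holds the sum of lanes 0..i
}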
Every console game released without any kind of AA in 2012 will be considered a tech disappointment by me.

I can sort of forgive it on the consoles for this generation, given the limitations faced there. But any PC game released without an option for some real, non-uniform subsampling (MSAA, even if it has a fancier reconstruction filter than just box) will be a disappointment to me.
This is out of the question; after so many UE games [Mass Effect 1/2, I look at you!] I can't stand it either.
I see some (E)VSM bleeding there, but it's not bad considering ... you're using 16-bit EVSM + SDSM z-ranges? Sure would be nice if the consoles could do 32-bit filtering.