Old 19-May-2011, 18:15   #876
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,820
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
FXAA also makes it brighter! :P
That's probably an effect from the Snipping Tool (ignores the desktop gamma setting?) - the rest of the shots were captured by FRAPS.

Temporal stability is OK-ish to me, but I have only tested with this old OGL demo for now.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 19-May-2011, 22:36   #877
Xenus
Senior Member
 
Join Date: Nov 2004
Location: Ohio
Posts: 1,209
Default

Yeah, it's probably better to capture all screens using the same method for consistency.
Xenus is offline   Reply With Quote
Old 19-May-2011, 22:38   #878
Humus
Crazy coder
 
Join Date: Feb 2002
Location: Stockholm, Sweden
Posts: 3,216
Default

Quote:
Originally Posted by sebbbi View Post
We have a fully deferred renderer, so 2xMSAA would cost a lot. A naive implementation would double the lighting cost (currently our lighting is around 25% of our frame time). There are some clever tiled deferred renderers (for example in Black Rock's Split Second) that do sample frequency lighting only for tiles that have MSAA edges, but even with really small 4x4 tiles a huge number of pixels require double lighting (one edge pixel in a tile requires the whole tile to be lit at sample precision).
A stencil mask can eliminate much of the overhead. Both consoles have very effective Hi-Stencil. The main bottleneck is generating the mask.
__________________
[ Visit my site ]
I speak for myself and only myself.
Humus is offline   Reply With Quote
Old 20-May-2011, 06:06   #879
sebbbi
Member
 
Join Date: Nov 2007
Posts: 947
Default

Quote:
Originally Posted by Humus View Post
A stencil mask can eliminate much of the overhead. Both consoles have very effective Hi-Stencil. The main bottleneck is generating the mask.
That's true. The most efficient mask generation I know of uses the centroid sampling trick (subtract the centroid interpolated value from the center interpolated one to detect an edge). This however only works properly for 2xMSAA on current hardware, as 4xMSAA returns the center value for all 3/4 subsample patterns, and it doesn't detect transparency clip edges (or shader specular aliasing). Of course you need extra space in your g-buffer to store the edge bit (the extra 2 bits of a 10-10-10-2 RT are good for this). On consoles you can write directly from the pixel shader to the stencil buffer (set your color RT memory address to point to the DS-buffer), so you can copy the mask bits later to the stencil buffer pretty painlessly.

However, Hi-stencil works in 4x4 or larger blocks, so this is not any more efficient than 4x4 based tiled deferred. One edge pixel causes 16 pixels to be lit at sample precision. And you need to resolve (copy) all the g-buffers at sample precision from the EDRAM (with 4xMSAA that alone is over 1ms extra... you could do the whole FXAA2 processing in the same time as copying the samples to main memory). Tiling adds cost too, since some of the geometry needs to be drawn twice (or three times for 4xMSAA). 2xMSAA is already behind before the lighting step even begins.

And it depends entirely on the scene contents how many 4x4 blocks have edges and require sample precision lighting (a field of grass can be really bad, for example). So the extra lighting cost varies from frame to frame. The FXAA2 perf hit is always the same. I tend to prefer techniques with a constant performance hit (to achieve a good minimum frame rate).
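
To put a rough number on the 4x4 granularity point, here is a small CPU-side sketch in Python. It is only an illustration, not how a console would build the mask: the helper names and the toy 8x8 edge mask are made up for the example, and the only thing taken from the post is the 4x4 block size. It counts how many pixels really sit on an edge versus how many end up lit at sample precision once every block containing an edge is flagged.

```python
TILE = 4  # block size from the post above; everything else here is hypothetical

def tiles_with_edges(edge_mask, tile=TILE):
    """Return the set of (tile_x, tile_y) blocks that contain at least one edge pixel."""
    flagged = set()
    for y, row in enumerate(edge_mask):
        for x, is_edge in enumerate(row):
            if is_edge:
                flagged.add((x // tile, y // tile))
    return flagged

def sample_rate_pixel_counts(edge_mask, tile=TILE):
    """Return (edge pixel count, pixel count lit at sample precision after block expansion)."""
    height = len(edge_mask)
    width = len(edge_mask[0])
    edge_pixels = sum(1 for row in edge_mask for is_edge in row if is_edge)
    lit_pixels = 0
    for tx, ty in tiles_with_edges(edge_mask, tile):
        # Clamp partial blocks at the right and bottom borders.
        tile_w = min(tile, width - tx * tile)
        tile_h = min(tile, height - ty * tile)
        lit_pixels += tile_w * tile_h
    return edge_pixels, lit_pixels

# Toy 8x8 image with one vertical edge running down column 3.
mask = [[x == 3 for x in range(8)] for _ in range(8)]
edge_pixels, lit_pixels = sample_rate_pixel_counts(mask)
print(edge_pixels, lit_pixels)
```

In this toy case 8 edge pixels drag 32 pixels into the sample precision path, and a single isolated edge pixel would drag in 16.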

And then there is the tonemapping (and gamma) issue as well. You have to keep the edge blocks at sample frequency until you do tone mapping and gamma. So basically you need to do your post processing also with stencil masked sample frequency (bloom combine, low res particle buffer combine, etc, should be done before tone mapping). Doing this properly adds extra cost. So basically you cannot do 2xMSAA in under 3ms (3ms is 18% of your frame time if you aim at 60 fps). And 2xMSAA antialiasing quality isn't worth that big sacrifice. And it doesn't do anything to transparency edges and specular aliasing.
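
The 18% figure is just the cost divided by the frame budget. A tiny Python sketch of that arithmetic, with a hypothetical helper name, in case anyone wants to plug in their own numbers:

```python
def frame_share(cost_ms, target_fps):
    """Fraction of one frame's time budget consumed by a pass costing cost_ms."""
    frame_ms = 1000.0 / target_fps  # 16.67 ms per frame at 60 fps
    return cost_ms / frame_ms

print(f"{frame_share(3.0, 60):.0%}")  # prints 18%
print(f"{frame_share(1.0, 60):.0%}")  # prints 6%
```

The second line uses a 1ms pass at the same 60 fps target, for comparison.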

Quote:
Originally Posted by fellix View Post
Here is the result: NoAA - 4xMSAA - FXAA
I don't like how PC drivers apply post process AA after UI rendering. It makes text look blurry. Post AA should be applied before UI rendering.

Last edited by sebbbi; 20-May-2011 at 08:46.
sebbbi is offline   Reply With Quote
Old 20-May-2011, 11:34   #880
Ruskie
Senior Member
 
Join Date: Mar 2010
Posts: 1,283
Default

Being the stalker that I am...
Quote:
Cool, just found an optimization for FXAA II Console: 2.5x faster on PC! Should be under 1ms now for 720p on Xbox360!!!
http://twitter.com/#!/TimothyLottes/...16671413108736

Could this mean that BF3 will use MLAA?
Quote:
its short for MLAAwesome, cause it makes the ps3 version look awesome. death to jagged pixels!
http://twitter.com/#!/ChristinaCoffi...95962115686400

Last edited by Ruskie; 21-May-2011 at 01:14.
Ruskie is offline   Reply With Quote
Old 21-May-2011, 00:31   #881
AlStrong
penguins
 
Join Date: Feb 2004
Posts: 13,978
Default

Wonder how useable this is on the other console.
AlStrong is offline   Reply With Quote
Old 24-May-2011, 22:01   #882
Ruskie
Senior Member
 
Join Date: Mar 2010
Posts: 1,283
Default

Update on FXAA, now at version 3.

Quote:
Timothy Lottes
Added math to FXAA v3 Console, likely bad for PS3. To any PS3 dev: looking for perf feedback, dmsg me and I will send FXAA v3 preview source
Quote:
Timothy Lottes
@NocturnDragon FXAA v3 Console is faster + higher quality version of FXAA v2 Console, FXAA v3 Quality is a much faster version of FXAA v1.
Ruskie is offline   Reply With Quote
Old 28-May-2011, 11:10   #883
sebbbi
Member
 
Join Date: Nov 2007
Posts: 947
Default

Did some dynamic branch optimizations for FXAA2 today. Now it runs at 0.9ms on Xbox with an ifAll branch.

I was searching the DX11 documentation for a way to branch depending on the result of all threads in the same branching unit (ifAll, ifAny). CUDA has __all and __any, but I can't find equivalents in DirectCompute/DX11... DirectCompute doesn't have any way to query the size of a warp/wavefront, so maybe this feature was too low level as well.
sebbbi is offline   Reply With Quote
Old 28-May-2011, 15:07   #884
Ruskie
Senior Member
 
Join Date: Mar 2010
Posts: 1,283
Default

Well, that's really fast. I wonder if the FXAA v3 code is available now? It's supposed to have better quality and performance.
Here is DF's comparison in Enslaved (no AA vs. console FXAA):
http://img705.imageshack.us/img705/8053/fxaa001.png
Ruskie is offline   Reply With Quote
Old 29-May-2011, 04:28   #885
3dcgi
Senior Member
 
Join Date: Feb 2002
Posts: 2,021
Default

Quote:
Originally Posted by sebbbi View Post
Did some dynamic branch optimizations for FXAA2 today. Now it runs at 0.9ms on Xbox with an ifAll branch.

I was searching the DX11 documentation for a way to branch depending on the result of all threads in the same branching unit (ifAll, ifAny). CUDA has __all and __any, but I can't find equivalents in DirectCompute/DX11... DirectCompute doesn't have any way to query the size of a warp/wavefront, so maybe this feature was too low level as well.
You're probably trying to do something I don't understand, but if all threads take the same branch the hardware should actually branch rather than predicate.
3dcgi is offline   Reply With Quote
Old 29-May-2011, 06:00   #886
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
Default

Quote:
Originally Posted by sebbbi View Post
DirectCompute doesn't have any way to query the size of a warp/wavefront, so maybe this feature was too low level as well.
It's non-portable: it lets you write code that will run differently based on the SIMD size of the underlying implementation, which is not allowed for obvious reasons. As mentioned of course typical dynamic branching should perform well enough on modern PC cards... there's fairly little overhead to adding a branch.
__________________
The content of this message is my personal opinion only.
Andrew Lauritzen is offline   Reply With Quote
Old 29-May-2011, 07:01   #887
Deviousb33r
Member
 
Join Date: May 2010
Location: California
Posts: 110
Default

Quote:
Originally Posted by sebbbi View Post
Ported FXAA2 this morning to our engine. Did some Xbox 360 microcode ASM optimizations to the tfetches (to remove some ALU instructions), but nothing else. It looks really good actually. Way better than 2xMSAA or our hacky temporal AA (I'd say it's very much comparable to 4xMSAA in quality). Textures stay sharp, and it properly antialiases all our foliage/trees (we have a lot of vegetation) and noisy specular highlights.

For 1280x672 (93% of 720p pixels) it runs at 1.2ms. It's tfetch bound, so I will likely integrate it into our huge post process shader (that is currently ALU bound). It would balance the load nicely. Total cost for AA would be around 1ms then.

Another way to make it run faster is to make it read more than one luminance value with one tfetch. Sadly the gather instruction is not available on consoles, making the straightforward R8 luminance sampling (4 neighbour texels to rgba in one instruction) not possible. That would make it faster than 1.0ms.
Just curious, but how much memory did FXAA take on 360?
Deviousb33r is online now   Reply With Quote
Old 29-May-2011, 07:26   #888
RecessionCone
Member
 
Join Date: Feb 2010
Posts: 170
Default

Quote:
Originally Posted by Andrew Lauritzen
It's non-portable: it lets you write code that will run differently based on the SIMD size of the underlying implementation, which is not allowed for obvious reasons. As mentioned of course typical dynamic branching should perform well enough on modern PC cards... there's fairly little overhead to adding a branch.
The point of __any and __all is not to avoid a branch, but to efficiently interchange information between workitems.
RecessionCone is offline   Reply With Quote
Old 29-May-2011, 08:23   #889
sebbbi
Member
 
Join Date: Nov 2007
Posts: 947
Default

Quote:
Originally Posted by Deviousb33r View Post
Just curious, but how much memory did FXAA take on 360?
It doesn't use any memory at all.

G-buffers are not used after post processing, so we have two full screen buffers unused at that point of rendering. We resolve the post processed back buffer to one of our g-buffer textures. The AA samples texels from the g-buffer and outputs pixels to EDRAM. UI is drawn on top of the antialiased result.

MLAA would also cost no memory, as two g-buffers are enough to keep its temp results (MLAA needs two temp buffers). Basically any reasonable post AA filter would have zero memory usage when used in a deferred renderer.

With forward rendering you would likely get a memory hit, unless you for example reuse a shadowmap memory area to store the AA temp results. On consoles you can overlap multiple textures in the same memory areas, so it doesn't matter that the shadowmap uses a different format than RGBA8. And for a forward renderer, you could also use hardware MSAA. Post process AA filters are most useful for deferred renderers.
sebbbi is offline   Reply With Quote
Old 29-May-2011, 08:46   #890
Deviousb33r
Member
 
Join Date: May 2010
Location: California
Posts: 110
Default

Quote:
Originally Posted by sebbbi View Post
It doesn't use any memory at all.

G-buffers are not used after post processing, so we have two full screen buffers unused at that point of rendering. We resolve the post processed back buffer to one of our g-buffer textures. The AA samples texels from the g-buffer and outputs pixels to EDRAM. UI is drawn on top of the antialiased result.

MLAA would also cost no memory, as two g-buffers are enough to keep its temp results (MLAA needs two temp buffers). Basically any reasonable post AA filter would have zero memory usage when used in a deferred renderer.

With forward rendering you would likely get a memory hit, unless you for example reuse a shadowmap memory area to store the AA temp results. On consoles you can overlap multiple textures in the same memory areas, so it doesn't matter that the shadowmap uses a different format than RGBA8. And for a forward renderer, you could also use hardware MSAA. Post process AA filters are most useful for deferred renderers.
Ah, I see. So in a sense, post process AA such as MLAA and FXAA saves a lot of memory/bandwidth compared to hardware MSAA? And from this point on, will devs that use deferred renderers simply switch to post process AA, or is there still some incentive to stick with MSAA?
Deviousb33r is online now   Reply With Quote
Old 29-May-2011, 09:11   #891
sebbbi
Member
 
Join Date: Nov 2007
Posts: 947
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
It's non-portable: it lets you write code that will run differently based on the SIMD size of the underlying implementation, which is not allowed for obvious reasons. As mentioned of course typical dynamic branching should perform well enough on modern PC cards... there's fairly little overhead to adding a branch.
The point of an ifAny/ifAll branch is to do fewer calculations when branches diverge inside a SIMD (branching unit). The DirectCompute compiler cannot automatically optimize for this, since it changes the meaning of the code.

For example a simple if-else branch:
if (needSlowPath) execute40Instructions(); else execute20Instructions();

With standard branching you execute 60 instructions if both sides of the branch occur inside a single SIMD (the instructions are predicated off for the threads they don't apply to). With an ifAny branch, if any thread inside a SIMD evaluates needSlowPath = true, all pixels in the SIMD take the slow path. So you only execute 40 instructions. All threads jump over the 20 instructions (including those that would evaluate needSlowPath = false). It also guarantees that all threads inside the SIMD run exactly the same instructions (no predicates etc. needed).

Standard if:
Slow path for all pixels: 40 instructions
Mixed slow and fast path: 60 instructions
Fast path for all pixels: 20 instructions

ifAny/__any:
Slow path for all pixels: 40 instructions
Mixed slow and fast path: 40 instructions (saves 20 instructions here)
Fast path for all pixels: 20 instructions

Of course you can only use this kind of branching in cases where one of the execution paths gives correct results for both outcomes of the comparison.

You are correct that the hardware SIMD size affects how this kind of branch gets executed. But if branch A provides correct results for all threads, then it's safe to execute any number of branch B threads using branch A. The smaller the SIMD size, the more branch B threads will execute branch B. The programmer has to be really careful however, since both branch sides can be optimized slightly differently, and there can be different float rounding etc., making the calculation possibly slightly nondeterministic.
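
The instruction counts above can be reproduced with a small Python model. This is only a sketch of the accounting, not GPU code: the function names are invented for the example, the 40/20 costs come from the if-else example, and a SIMD group is modelled as a plain list of per-thread needSlowPath flags.

```python
SLOW_COST = 40  # instruction count of the general path, from the example above
FAST_COST = 20  # instruction count of the specialised path

def standard_cost(needs_slow):
    """Instructions issued for one SIMD group with an ordinary if/else."""
    cost = 0
    if any(needs_slow):
        cost += SLOW_COST  # slow side runs, fast-path lanes are masked off
    if not all(needs_slow):
        cost += FAST_COST  # fast side runs, slow-path lanes are masked off
    return cost

def if_any_cost(needs_slow):
    """Instructions issued when the whole group takes the slow path if any lane needs it."""
    return SLOW_COST if any(needs_slow) else FAST_COST

def fast_path_threads(flags, simd_width):
    """Number of threads that still run the fast path under ifAny for a given SIMD width."""
    count = 0
    for start in range(0, len(flags), simd_width):
        group = flags[start:start + simd_width]
        if not any(group):
            count += len(group)
    return count

for name, group in [("all slow", [True] * 4),
                    ("mixed", [True, False, False, False]),
                    ("all fast", [False] * 4)]:
    print(name, standard_cost(group), if_any_cost(group))

# One slow thread among 64: narrower SIMD groups keep more threads on the fast path.
flags = [True] + [False] * 63
print(fast_path_threads(flags, 64), fast_path_threads(flags, 16))
```

The last line shows the SIMD size dependence: with one slow thread in 64, a 64-wide group sends every thread down the slow path, while 16-wide groups leave 48 threads on the fast path.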
sebbbi is offline   Reply With Quote
Old 29-May-2011, 18:47   #892
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
Default

Quote:
Originally Posted by sebbbi View Post
But if branch A provides correct results for all threads, then it's safe to execute any number of branch B threads using branch A.
Right, I understand how it works and how you can write code to take advantage of it. My point stands though: DirectX typically cannot allow instructions/features that behave differently on different hardware, else people will write code that unknowingly (or worse, intentionally) depends on the card/drivers they are testing it on. The non-determinism of floating point results is a very minor issue compared to the widely varying SIMD sizes on different GPUs.

This is a more important optimization on CPUs anyways as predication currently costs instructions there. On GPUs predication is mostly "free" so all you're potentially saving is the other side of the branch when you have a general case and a specific case (for warps which diverge).
__________________
The content of this message is my personal opinion only.
Andrew Lauritzen is offline   Reply With Quote
Old 29-May-2011, 19:06   #893
sebbbi
Member
 
Join Date: Nov 2007
Posts: 947
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
My point stands though: DirectX typically cannot allow instructions/features that behave differently on different hardware, else people will write code that unknowingly (or worse, intentionally) depends on the card/drivers they are testing it on.
Agreed. I just tried to explain it so that everyone reading this thread understands what we are talking about.

Many of the most efficient CUDA algorithms for calculating stuff like prefix sum depend on intra warp optimizations. The SIMD width can be used as a powerful synchronization tool, since basically you get a free warp wide sync barrier after each instruction. I wonder however what will happen if NVIDIA chooses to use a different warp size than 32 in their future GPUs. Many highly optimized CUDA algorithms (also featured in popular CUDA libraries) will break completely. That will be a mess for sure.
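
A toy Python model of why a warp size change would hurt. Everything here is an assumption made for illustration: the function name is invented, the scan is a plain doubling-offset prefix sum with no explicit barriers, and hardware groups are assumed to run to completion one after another (real scheduling is more complicated). The only thing taken from the post is the idea that lanes in one warp get an implicit barrier after every instruction.

```python
from itertools import accumulate

def scan_assuming_lockstep(values, hw_width):
    """Barrier-free inclusive scan, run on hardware that keeps hw_width lanes in lockstep."""
    lanes = list(values)
    n = len(lanes)
    # Assumed schedule: each hardware group runs all of its steps before the next starts.
    for start in range(0, n, hw_width):
        group = range(start, min(start + hw_width, n))
        offset = 1
        while offset < n:
            # Inside a group every lane reads before any lane writes (the "free" barrier).
            reads = {i: lanes[i - offset] if i >= offset else 0 for i in group}
            for i in group:
                lanes[i] += reads[i]
            offset *= 2
    return lanes

data = [1] * 32
expected = list(accumulate(data))
print(scan_assuming_lockstep(data, 32) == expected)  # code written for 32 lanes, run on 32
print(scan_assuming_lockstep(data, 16) == expected)  # same code on 16-lane hardware
```

With 32 lanes in lockstep the scan is correct. With 16, the second group reads values the first group has already finished updating, so the result is wrong without any change to the code.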
sebbbi is offline   Reply With Quote
Old 30-May-2011, 19:41   #894
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
Default

Quote:
Originally Posted by sebbbi View Post
Many of the most efficient CUDA algorithms for calculating stuff like prefix sum depend on intra warp optimizations. The SIMD width can be used as a powerful synchronization tool, since basically you get a free warp wide sync barrier after each instruction. I wonder however what will happen if NVIDIA chooses to use a different warp size than 32 in their future GPUs. Many highly optimized CUDA algorithms (also featured in popular CUDA libraries) will break completely. That will be a mess for sure.
Yeah, it's a difficult line... for most graphics work you can sorta see the justification for hiding the SIMD size from the user, but it becomes less clear once you stray into compute stuff. Ideally you want users to write applications that are parameterized on the SIMD size and scale appropriately to different architectures (not an easy task mind you for wide ranges of hardware), but how best to express that elegantly (i.e. not #define BLOCK_SIZE, etc.) is not totally clear yet.
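
As a rough picture of what "parameterized on the SIMD size" could mean, here is a Python sketch of a prefix sum that takes the width as an argument. The function names and the chunk-plus-carry structure are assumptions chosen for the example, not any real API: each chunk is scanned in lockstep, then offset by the running total of the chunks before it, so the answer is the same for any width.

```python
from itertools import accumulate

def lockstep_scan(chunk):
    """Inclusive scan of one SIMD-wide chunk, all lanes reading before any lane writes."""
    lanes = list(chunk)
    offset = 1
    while offset < len(lanes):
        lanes = [v + (lanes[i - offset] if i >= offset else 0)
                 for i, v in enumerate(lanes)]
        offset *= 2
    return lanes

def blocked_scan(values, simd_width):
    """Inclusive scan over any input length, with the SIMD width as a parameter."""
    out = []
    carry = 0  # running total of all earlier chunks
    for start in range(0, len(values), simd_width):
        chunk = lockstep_scan(values[start:start + simd_width])
        out.extend(carry + v for v in chunk)
        carry = out[-1]
    return out

data = list(range(1, 101))
for width in (4, 8, 16, 32, 64):
    print(width, blocked_scan(data, width) == list(accumulate(data)))
```

The width only changes how the work is split up, not the result, which is the property you would want from code that has to scale across architectures.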
__________________
The content of this message is my personal opinion only.
Andrew Lauritzen is offline   Reply With Quote
Old 05-Jun-2011, 10:23   #895
OlegSH
Member
 
Join Date: Jan 2010
Posts: 117
Default

FXAAII on PS3
OlegSH is offline   Reply With Quote
Old 05-Jun-2011, 13:49   #896
KKRT
Member
 
Join Date: Aug 2009
Posts: 836
Default

Quote:
Originally Posted by OlegSH View Post
Awesome work from Timothy.

Every console game released without any kind of AA in 2012 will be considered a tech disappointment by me.
KKRT is offline   Reply With Quote
Old 06-Jun-2011, 20:00   #897
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
Default

Quote:
Originally Posted by KKRT View Post
Every console game released without any kind of AA in 2012 will be considered a tech disappointment by me.
Also any PC game released without an option for some real, non-uniform subsampling (MSAA, even if it has a fancier reconstruction filter than just box) will be a disappointment to me. I can sort of forgive it on the consoles for this generation given the limitations faced there.
__________________
The content of this message is my personal opinion only.

Last edited by Andrew Lauritzen; 06-Jun-2011 at 20:13.
Andrew Lauritzen is offline   Reply With Quote
Old 07-Jun-2011, 13:25   #898
sebbbi
Member
 
Join Date: Nov 2007
Posts: 947
Default

Some screenies of the 0.9ms optimized FXAA2 on Xbox 360 I mentioned earlier in this thread. These are unfortunately slightly upscaled. We will release better screenshots later (hopefully with FXAA3).

http://www.redlynx.com/media/files/R...ress%20Kit.zip
sebbbi is offline   Reply With Quote
Old 07-Jun-2011, 19:46   #899
KKRT
Member
 
Join Date: Aug 2009
Posts: 836
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
Also any PC game released without an option for some real, non-uniform subsampling (MSAA, even if it has a fancier reconstruction filter than just box) will be a disappointment to me
That goes without saying; after so many UE games [Mass Effect 1/2, I'm looking at you!] I can't stand it either.

And about consoles: yes, that was valid earlier, but now I'd put FXAA/MLAA/DLAA ahead of any other post-processing effect. 1ms really isn't that much, especially since most games are optimized for 33ms.


@sebbbi
How can we bribe you for a PC release of Trials Revolution? :>
KKRT is offline   Reply With Quote
Old 07-Jun-2011, 20:35   #900
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
Default

Quote:
Originally Posted by sebbbi View Post
I see some (E)VSM bleeding there, but it's not bad considering ... you're using 16-bit EVSM + SDSM z-ranges? Sure would be nice if the consoles could do 32-bit filtering.

AA looks pretty good close up but some of the far stuff doesn't seem to be AAd at all. I can understand the fence and wheel spokes and stuff that needs subsampling, but the pipe in the background - is it just too close to horizontal to be picked up by the filter?

How does it look in motion?
__________________
The content of this message is my personal opinion only.

Last edited by Andrew Lauritzen; 07-Jun-2011 at 20:41.
Andrew Lauritzen is offline   Reply With Quote
