Old 01-Apr-2012, 02:15   #1
Bryant
Forward+

I was checking out new papers submitted for Eurographics 2012 and saw one entitled "Forward+: Bringing Deferred Lighting to the Next Level".

A preview of the paper is available here: https://sites.google.com/site/takahiroharada/ and here is an excerpt:

Quote:
This paper presents Forward+, a method of rendering many lights by culling and storing only lights that contribute to the pixel. Forward+ is an extension to traditional forward rendering. Light culling, implemented using the compute capability of the GPU, is added to the pipeline to create lists of lights; that list is passed to the final rendering shader, which can access all information about the lights. Although Forward+ increases workload to the final shader, it theoretically requires less memory traffic compared to compute-based deferred lighting. Furthermore, it removes the major drawback of deferred techniques, which is a restriction of materials and lighting models. Experiments are performed to compare the performance of Forward+ and deferred lighting.
The biggest deal to me is that it allows hardware antialiasing (MSAA) with an approach similar to deferred rendering.
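If you just want the gist of the two stages, here's a minimal CPU-side sketch of the idea as I read the abstract (not the authors' code; the 16-pixel tile size and the screen-space circle-vs-tile test are my own assumptions):

Code:
// Stage 1: cull lights into per-tile index lists.
// Stage 2: forward-shade each pixel against only its tile's list.
// The 2D circle-vs-rect test is a simplification; the paper culls against
// per-tile sub-frusta in a compute shader.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Light { float x, y, radius; };          // already projected to screen space

constexpr int kTileSize = 16;

std::vector<std::vector<uint16_t>> CullLightsPerTile(const std::vector<Light>& lights,
                                                     int width, int height)
{
    const int tilesX = (width + kTileSize - 1) / kTileSize;
    const int tilesY = (height + kTileSize - 1) / kTileSize;
    std::vector<std::vector<uint16_t>> lists(tilesX * tilesY);

    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx) {
            const float x0 = float(tx * kTileSize), y0 = float(ty * kTileSize);
            const float x1 = x0 + kTileSize,        y1 = y0 + kTileSize;
            for (std::size_t i = 0; i < lights.size(); ++i) {
                // closest point on the tile rectangle to the light centre
                const float cx = std::fmax(x0, std::fmin(lights[i].x, x1));
                const float cy = std::fmax(y0, std::fmin(lights[i].y, y1));
                const float dx = lights[i].x - cx, dy = lights[i].y - cy;
                if (dx * dx + dy * dy <= lights[i].radius * lights[i].radius)
                    lists[ty * tilesX + tx].push_back(uint16_t(i));   // light touches tile
            }
        }
    return lists;
}

// Stage 2 is ordinary forward shading, except the per-pixel light loop only
// walks lists[tileOf(pixel)] instead of every light in the scene.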

Old 01-Apr-2012, 03:26   #2
rpg.314

Does that demo run @30fps on Tahiti?

Old 01-Apr-2012, 03:45   #3
Bryant

I tried out the Leo demo on my 6970 and got 20-30 fps; I don't currently have a 7000-series card to test on.

Old 01-Apr-2012, 03:45   #4
frogblast

Quote:
Originally Posted by rpg.314 View Post
Does that demo run @30fps on Tahiti?
Not sure, but it wouldn't be representative. I'm pretty sure the Leo demo was doing a whole lot more than tiled forward shading (it also included ptex and some form of indirect lighting, as I recall), which makes it hard to tell where the performance is going.

Old 01-Apr-2012, 04:46   #5
rpg.314

Quote:
Originally Posted by Bryant View Post
I tried out the Leo demo on my 6970 and got 20-30 fps; I don't currently have a 7000-series card to test on.
Damn, that is impressive.

I saw its video. It looked pretty darn close to REYES quality to me. I think we might get to real-time REYES-quality rendering within this decade.

Old 01-Apr-2012, 07:01   #6
MJP

I actually just put up a blog post with some numbers from my own test app. The 6970 seems to do really well with this technique. I wish I had a 7970 to try out.

Old 01-Apr-2012, 09:57   #7
Ryan Smith

Quote:
Originally Posted by frogblast View Post
Not sure, but it wouldn't be representative. I'm pretty sure the Leo demo was doing a whole lot more than tiled forward shading (it also included ptex and some form of indirect lighting, as I recall), which makes it hard to tell where the performance is going.
Leo doesn't use PTEX. AMD used the same art assets for their PTEX demo, but ultimately the PTEX demo is something else entirely. Unfortunately just about everyone has confused the two - even I made that mistake at AMD's editors' day in the demo room.

Old 01-Apr-2012, 11:36   #8
Lightman

Quote:
Originally Posted by Ryan Smith View Post
Leo doesn't use PTEX. AMD used the same art assets for their PTEX demo, but ultimately the PTEX demo is something else entirely. Unfortunately just about everyone has confused the two - even I made that mistake at AMD's editors' day in the demo room.
Makes sense! PTEX is not supported on the HD 69xx, and the Leo demo still runs just fine.

Old 01-Apr-2012, 20:27   #9
3dcgi

MJP, I tried it on a 7970 and, as with the 6970, deferred is slower, but I don't know what resolution you used for your results. By default the app loads in a window, and I don't know if it always loads at the same resolution. Let me know and I'll post the results.

Old 01-Apr-2012, 21:27   #10
AlNets

So what do you expert folks think about this method? Or is it too early to tell (not enough demoing/fps benchmarking yet)?

Old 01-Apr-2012, 21:40   #11
3dcgi

Quote:
Originally Posted by Lightman View Post
Makes sense! PTEX is not supported on the HD 69xx, and the Leo demo still runs just fine.
I don't think there's any reason ptex can't be supported on the HD 69xx, though if ptex is implemented with partially resident textures it would be 7000-series only.

Old 02-Apr-2012, 04:19   #12
MJP

Quote:
Originally Posted by 3dcgi View Post
MJP, I tried it on a 7970 and, as with the 6970, deferred is slower, but I don't know what resolution you used for your results. By default the app loads in a window, and I don't know if it always loads at the same resolution. Let me know and I'll post the results.
I gathered all of my results at 1920x1080. The window defaults to 1280x720.

Quote:
Originally Posted by 3dcgi View Post
I don't think there's any reason ptex can't be supported on the HD 69xx, though if ptex is implemented with partially resident textures it would be 7000-series only.
The page for the demo mentions a "Ptex and PRT Technology Preview", which must be what Ryan Smith is talking about.

Quote:
Originally Posted by AlStrong View Post
So what do you expert folks think about this method? Or is it too early to tell (not enough demoing/fps benchmarking yet)?
From early tests so far it seems pretty good for AMD hardware, and a clear winner when MSAA is involved. On Nvidia hardware it doesn't fare nearly as well, at least compared to tile-based deferred rendering implemented in a compute shader. But overall it's a practical technique if you really want to stick to forward rendering but need a lot of dynamic lights.

Old 09-Apr-2012, 00:05   #13
Bryant

Quote:
Originally Posted by MJP View Post
On Nvidia hardware it doesn't fare nearly as well
The 680 seems to do really well in the benchmarks on your blog post.

Old 09-Apr-2012, 01:08   #14
3dcgi

You have to be careful to compare the same resolutions. The only numbers posted so far that are comparable between the GTX680 and Radeon 7970 are the following.

1024 Lights on the GTX680
MSAA Level    Light Indexed Deferred    Tile-Based Deferred
No MSAA       10.2 ms                   12.6 ms
2x MSAA       11.62 ms                  15.15 ms
4x MSAA       12.65 ms                  16.39 ms

1024 Lights on the Radeon 7970
MSAA Level    Light Indexed Deferred    Tile-Based Deferred
No MSAA       6.02 ms                   4.63 ms
2x MSAA       6.85 ms                   6.58 ms
4x MSAA       7.52 ms                   8.00 ms

And one commenter speculated the 680's smaller amount of shared memory is holding it back.

Old 09-Apr-2012, 01:26   #15
Bryant

I find it odd that LID with no MSAA on the 680 is slower than LID with 4x MSAA on the 7970.

Hopefully there's more research put into this stuff. I really like MSAA.

Old 19-Apr-2012, 10:48   #16
Jawed

There's a new "educational mode" for the Leo demo.

http://www.geeks3d.com/20120322/amd-...ode/#more-8243

Quite interesting.

I wonder if they will release the source code?

Old 27-Apr-2012, 21:35   #17
Andrew Lauritzen

Quote:
Originally Posted by 3dcgi View Post
... numbers ...
These numbers don't seem right, at least in terms of the underlying techniques. Without MSAA the 7970 and 680 are typically neck and neck in tile-based deferred. With MSAA the 680 wins by a decent margin due to some unexplained (to me) bottleneck (see my SIGGRAPH presentation or BF3 benchmarks with deferred MSAA). Here's my older benchmark to play with in terms of tile-based and conventional deferred:
http://software.intel.com/en-us/arti...ing-pipelines/

That said, Sponza isn't really the best test scene for this, and the configuration of lights in the demo really just turns this into an ALU test (at least beyond 128 lights). To demonstrate this, fly up and zoom in so that you can just see the roof filling your whole screen... note how with 1024 lights it doesn't really get much faster. Beyond the point where every pixel has a bunch of lights affecting it (say 2-8), there's questionable utility in adding more lights.

That's not to say it's a totally unrealistic scene, but I'd prefer to see that many lights distributed over a wider area so that more significant culling is happening. Now of course Power Plant isn't a great scene either, but I did test this on a fair number of real game scenes and the results between GPUs were more consistent.

Old 28-Apr-2012, 01:03   #18
Mintmaster

Andrew, isn't your comparison different? Forward+ uses light tiles to avoid having to write a G-Buffer.

Judging from your slides, the MSAA perf hit with deferred - even with your fancy edge detection and pixel repacking - is greater than what you see with most forward-rendered games. I'm also sure that render time should be significantly faster without having to write/read a G-buffer. As nice as that perf boost would be, the biggest strength of Forward+ is probably shader variety.

Also, despite forward-rendering MSAA being inefficient on small interior triangle edges, it is a form of selective supersampling that can be very important for specular surfaces, so you'd have to identify those areas in the deferred path to make it truly apples to apples.

Old 28-Apr-2012, 02:44   #19
Andrew Lauritzen

Quote:
Originally Posted by Mintmaster View Post
Andrew, isn't your comparison different? Forward+ uses light tiles to avoid having to write a G-Buffer.
The "tiled deferred" implementation in that demo should be similar to mine and that's what I was comparing. As I noted, I imagine the disparity comes from just massive ALU saturation when you set it to 1024 lights in MJP's demo. With 128 lights the results are more similar to mine.

Indeed, I don't implement "forward+" itself (personally I'd still call it closer to deferred than pure forward, but that's just me), but I wasn't looking at those numbers.

Quote:
Originally Posted by Mintmaster View Post
Judging from your slides, the MSAA perf hit with deferred - even with your fancy edge detection and pixel repacking - is greater than what you see with most forward-rendered games. I'm also sure that render time should be significantly faster without having to write/read a G-buffer. As nice as that perf boost would be, the biggest strength of Forward+ is probably shader variety.
So called "shader variety" is a totally red herring. Deferred can run arbitrary shaders just as efficiently as forward (sometimes more-so due to 2x2 quad scheduling from the rasterizer). Try it And sure you avoid reading/writing the G-buffer (once), but you re-render/transform/tessellate/skin all your geometry. So it's app and scene dependent which is faster of course.

Quote:
Originally Posted by Mintmaster View Post
Also, despite forward-rendering MSAA being inefficient on small interior triangle edges, it is a form of selective supersampling that can be very important for specular surfaces, so you'd have to identify those areas in the deferred path to make it truly apples to apples.
It doesn't make a difference in practice unless you're rendering your entire mesh with that high a density, and even then it's a bad way of doing it. In fact, in cases where it was visible it would produce objectionable artifacts that reveal the mesh tessellation, so I'm not sure this should ever be considered desirable.

I'd give that one to deferred again, because you can selectively supersample wherever you like, not just at triangle edges. I refuse to be shackled by the rasterizer in terms of shader execution and evaluation.

Anyway, I should write a big blog post about this at some point, the main point being: these are all variants of similar ideas, so test them all and use the best; it's just normal code optimization. The only reason people seem to think these are fundamentally different things is the semi-bizarre way you write code in the graphics pipeline. In fact, I'd wager that's pretty much the entire reason people tend to have a conceptual bias against deferred... but try to separate how one might write the code from how it gets executed. It doesn't end up being as different as one might think, and long term the ease of writing the code is irrelevant. Frankly, any decent engine should be able to generate the shaders to swap between forward and deferred at the click of a button.

The only really important point is doing the culling with knowledge of the depth buffer, and semi-hierarchically. In fact, all of these GPU tiling variants are doing light culling *really inefficiently* (tons of redundant computation) due to the GPU programming model, so I'm more interested in seeing that addressed than in more variants of how to launch the shading work. At the moment the CPU can cull the light lists significantly faster than the GPU (!), and yet it's still a win over conventional methods, which really demonstrates how bad those methods were.

That's not to say it isn't good to test and document all this stuff, but there's really nothing interesting to talk about from a research point of view IMHO. All of the tiled variants can produce identical results with similar levels of efficiency. It's literally just constant factors we're playing with here and they vary per application.

Old 28-Apr-2012, 17:42   #20
3dcgi

Quote:
Originally Posted by Andrew Lauritzen View Post
In fact, all of these GPU tiling variants are doing light culling *really inefficiently* (tons of redundant computation) due to the GPU programming model, so I'm more interested in seeing that addressed than in more variants of how to launch the shading work. At the moment the CPU can cull the light lists significantly faster than the GPU (!), and yet it's still a win over conventional methods, which really demonstrates how bad those methods were.
In what way would you improve the programming model?

Old 28-Apr-2012, 19:35   #21
sebbbi

Light culling on GPU could likely be done quite efficiently, as it's a problem that can be parallelized. Doing it in a brute-force way (per pixel or per block, as is usually done with pixel shaders) is easy but requires lots of duplicate work. Investigating an efficient parallel algorithm that requires minimal work (and uses a minimal amount of bandwidth) is a pretty challenging problem.

I would personally like to try techniques similar to those the fastest GPU radix sorters use, and use a quadtree to split lights. Basically you could do a modified scan (prefix sum) based on culling results (four frustums). One thread processes one light, checks the four frustums, and increments the counts of the bins the light belongs to. Scan the bins, and create the next iteration's data (this can have fewer or more items, depending on how many lights intersected frustums and how many got completely culled out). The next iteration (log2(r) in total, where r is screen resolution : block resolution) would then again go through all the lights (one thread per light), but now check each light against the four frustums of the sub-frustum it belongs to. So in total 4*n*log2(r) frustum checks... log2(r) = 9 (for 720p and a 4x4 block size). So it would result in approx 18 frustum checks per light (if we disregard intersected frustums and lights completely culled out).
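Roughly like this on the CPU, as a sketch of the per-level count/scan/scatter being described (not sebbbi's code; the 2D rectangle "frustums" and circle light bounds are stand-ins for real sub-frusta and light volumes):

Code:
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rect        { float x0, y0, x1, y1; };
struct LightCircle { float x, y, r; };                 // screen-space bound of a light
struct Pair        { uint32_t light; Rect node; };     // one (light, sub-frustum) work item

static bool Overlaps(const LightCircle& l, const Rect& n) {
    const float cx = l.x < n.x0 ? n.x0 : (l.x > n.x1 ? n.x1 : l.x);
    const float cy = l.y < n.y0 ? n.y0 : (l.y > n.y1 ? n.y1 : l.y);
    const float dx = l.x - cx, dy = l.y - cy;
    return dx * dx + dy * dy <= l.r * l.r;
}

// One breadth-first refinement level: test every (light, node) pair against the
// node's four children, prefix-sum the survivor counts to get output offsets
// (as a GPU scan would), then scatter the surviving pairs for the next level.
std::vector<Pair> RefineOneLevel(const std::vector<LightCircle>& lights,
                                 const std::vector<Pair>& in)
{
    std::vector<uint32_t> count(in.size());

    for (std::size_t i = 0; i < in.size(); ++i) {       // pass 1: count survivors
        const Rect& n = in[i].node;
        const float mx = 0.5f * (n.x0 + n.x1), my = 0.5f * (n.y0 + n.y1);
        const Rect kids[4] = { {n.x0, n.y0, mx, my}, {mx, n.y0, n.x1, my},
                               {n.x0, my, mx, n.y1}, {mx, my, n.x1, n.y1} };
        uint32_t c = 0;
        for (int k = 0; k < 4; ++k)
            if (Overlaps(lights[in[i].light], kids[k])) ++c;
        count[i] = c;
    }

    std::vector<uint32_t> offset(in.size());            // pass 2: exclusive scan
    uint32_t total = 0;
    for (std::size_t i = 0; i < in.size(); ++i) { offset[i] = total; total += count[i]; }

    std::vector<Pair> out(total);                        // pass 3: scatter
    for (std::size_t i = 0; i < in.size(); ++i) {
        const Rect& n = in[i].node;
        const float mx = 0.5f * (n.x0 + n.x1), my = 0.5f * (n.y0 + n.y1);
        const Rect kids[4] = { {n.x0, n.y0, mx, my}, {mx, n.y0, n.x1, my},
                               {n.x0, my, mx, n.y1}, {mx, my, n.x1, n.y1} };
        uint32_t o = offset[i];
        for (int k = 0; k < 4; ++k)
            if (Overlaps(lights[in[i].light], kids[k]))
                out[o++] = { in[i].light, kids[k] };
    }
    return out;   // seed with one (light, full-screen) pair per light; iterate until tile-sized
}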

Old 29-Apr-2012, 21:41   #22
Andrew Lauritzen

Quote:
Originally Posted by sebbbi View Post
I would personally like to try similar techniques that the fastest GPU radix sorters use, and use a quadtree to split lights.
Yeah, precisely, that's how I do it on the CPU and it's vastly faster due to doing something like 10-100x fewer total tests (!). But you don't want to do it in "passes", because it's incredibly inefficient bandwidth-wise to dump your stack to DRAM every frame. What you need is the ability to fork/join with a work-stealing scheduler, i.e. depth-first traversal (most cache-friendly), breadth-first stealing (best load balancing).

On the GPU right now you can kind of do this with persistent threads (this is what OptiX and others do), but not efficiently, elegantly or portably. The programming model and hardware need to expand to not lock in a static register count/stack/shared memory size, else it won't be able to do anything more interesting than what it does today. Turns out hierarchical data structures and recursion are kind of important.
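For reference, the depth-first version is just a recursive cull that narrows the candidate list as it descends. A plain single-threaded C++ sketch under the same toy 2D assumptions as the sketch above (not Andrew's ISPC code), with the fork/join point marked where a work-stealing scheduler such as TBB or Cilk would come in:

Code:
#include <cstdint>
#include <utility>
#include <vector>

struct Rect        { float x0, y0, x1, y1; };
struct LightCircle { float x, y, r; };                 // screen-space bound of a light

static bool Overlaps(const LightCircle& l, const Rect& n) {
    const float cx = l.x < n.x0 ? n.x0 : (l.x > n.x1 ? n.x1 : l.x);
    const float cy = l.y < n.y0 ? n.y0 : (l.y > n.y1 ? n.y1 : l.y);
    const float dx = l.x - cx, dy = l.y - cy;
    return dx * dx + dy * dy <= l.r * l.r;
}

// Depth-first quadtree cull: each node only re-tests its parent's survivors,
// whole subtrees drop out early, and per-tile lists are emitted at the leaves.
void CullRecursive(const std::vector<LightCircle>& lights,
                   const std::vector<uint32_t>& candidates,
                   const Rect& node, float tileSize,
                   std::vector<std::vector<uint32_t>>& leafLists)
{
    std::vector<uint32_t> survivors;
    for (uint32_t i : candidates)
        if (Overlaps(lights[i], node)) survivors.push_back(i);
    if (survivors.empty()) return;                     // entire subtree culled

    if (node.x1 - node.x0 <= tileSize) {               // leaf = one screen tile
        leafLists.push_back(std::move(survivors));
        return;
    }

    const float mx = 0.5f * (node.x0 + node.x1), my = 0.5f * (node.y0 + node.y1);
    const Rect children[4] = { {node.x0, node.y0, mx, my}, {mx, node.y0, node.x1, my},
                               {node.x0, my, mx, node.y1}, {mx, my, node.x1, node.y1} };
    // Fork/join point: with TBB/Cilk these four calls become spawned tasks, so the
    // traversal stays depth-first per worker while idle workers steal breadth-first.
    for (const Rect& child : children)
        CullRecursive(lights, survivors, child, tileSize, leafLists);
}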

Old 29-Apr-2012, 23:35   #23
Mintmaster

Quote:
Originally Posted by Andrew Lauritzen View Post
The "tiled deferred" implementation in that demo should be similar to mine and that's what I was comparing.
Ahh, I see.

Quote:
So called "shader variety" is a totally red herring. Deferred can run arbitrary shaders just as efficiently as forward (sometimes more-so due to 2x2 quad scheduling from the rasterizer). Try it
At first I was going to rebut with an example of large amounts of data interacting with the light sources, but I guess you could defer that with a mega-texture type of technique, where every piece of data passed to the shader during forward texturing can be accessed in the deferred path. It does complicate fixed-function amenities like LOD or cylindrical texture mapping, but I suppose I'm being a bit picky now.

Quote:
Originally Posted by sebbbi View Post
Light culling on GPU could likely be done quite efficiently, as it's a problem that can be parallelized. Doing it in a brute-force way (per pixel or per block, as is usually done with pixel shaders) is easy but requires lots of duplicate work.
Does it really? I may not be understanding what's required, but the best way I can think of is downscaling the Z-buffer (tracking min/max) so that you have one pixel per tile, and then rendering lights to an equal-sized buffer with a DX11-style linked list. I don't see a lot of redundancy or inefficiency going on there, except the usual quad-based underutilization, which shouldn't bring it down to CPU speeds.
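The bookkeeping for the linked-list half looks roughly like this (a CPU translation of the DX11 pattern, with an atomic counter standing in for the UAV counter and one head index per tile; my own structures, not code from any of the demos):

Code:
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct LightNode { uint16_t lightIndex; int32_t next; };   // next == -1 terminates a list
constexpr int32_t kEndOfList = -1;

struct TileLightLists {
    std::vector<std::atomic<int32_t>> heads;   // one head index per tile of the downscaled buffer
    std::vector<LightNode>            pool;    // preallocated node pool ("structured buffer")
    std::atomic<uint32_t>             allocated{0};

    TileLightLists(std::size_t tileCount, std::size_t maxNodes)
        : heads(tileCount), pool(maxNodes)
    { for (auto& h : heads) h.store(kEndOfList); }

    // Called once per (light, covered tile) pair while "rasterizing" light volumes
    // over the tile grid. Lists are only read in a later shading pass.
    void Append(uint32_t tile, uint16_t light) {
        const uint32_t n = allocated.fetch_add(1);                    // counter "UAV"
        pool[n].lightIndex = light;
        pool[n].next = heads[tile].exchange(static_cast<int32_t>(n)); // push-front onto the tile's list
    }
};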

Old 30-Apr-2012, 04:57   #24
Andrew Lauritzen

Quote:
Originally Posted by Mintmaster View Post
It does complicate fixed-function amenities like LOD or cylindrical texture mapping, but I suppose I'm being a bit picky now.
Sure, and that's precisely why you tend to sample the diffuse texture in the first pass (i.e. not defer it): you really have 6 floats as input (UVs and derivatives, although this could be compressed somewhat), whereas the output is normally 4 unorms or similar. That said, as soon as you have a lot of textures all using the same coordinates it may make sense to defer that and store the derivatives... Also, in other cases you can compute the derivatives analytically, like I do with shadows for instance. Storing Z derivatives is enough to recover any position-related derivatives in any space, and in fact you can use the shading normal as a reasonable approximation if G-buffer space is a problem.
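(To spell out why that works, under the usual reconstruction from linear view-space depth: if the view-space position is rebuilt as P(u,v) = z(u,v) * r(u,v), where r(u,v) is the per-pixel view ray known analytically from the projection, then the product rule gives dP/du = (dz/du) * r + z * (dr/du). Since r and dr/du come from the projection alone, storing dz/du and dz/dv really is enough to recover the position derivatives.)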

Quote:
Originally Posted by Mintmaster View Post
Does it really? I may not be understanding what's required, but the best way I can think of is downscaling the Z-buffer (tracking min/max) so that you have one pixel per tile, and then rendering lights to an equal-sized buffer with a DX11-style linked list. I don't see a lot of redundancy or inefficiency going on there, except the usual quad-based underutilization, which shouldn't bring it down to CPU speeds.
Yeah, you need to create a min/max mip tree, although you shouldn't go down to 1:1 (GPUs are really inefficient at the last few levels). After that, though, given the very low resolution (in tiles), it's normally not worth rasterizing, particularly since it's mostly spheres and cones, which are more efficient to "rasterize" in software. The biggest issue with this approach is that all your light lists go out to DRAM instead of sitting in caches, etc. As I mentioned, you really want depth-first traversal...

The ability I described lets you implement this pretty close to as efficiently as possible in compute (i.e. a hierarchical rasterizer). You just need a stack and work stealing, really, both of which GPU hardware can do decently (see OptiX and other work)... it's trivial to implement in something like Cilk or TBB, for instance. In fact, if you download the ISPC compiler package, I wrote a CPU implementation of both the culling and the shading, which is actually not as much slower than a modern GPU (iso-power) as you might think. With culling only, it's actually faster.

Old 30-Apr-2012, 12:37   #25
Mintmaster

Quote:
Originally Posted by Andrew Lauritzen View Post
The biggest issue with this approach is that all your light lists go out to DRAM instead of sitting in caches, etc.
I don't see why that's an issue. For each light in each tile, you'd need a few orders of magnitude more bandwidth to render it deferred without tiling: a 2-byte light index plus linked-list overhead vs. reading a G-buffer for a few hundred pixels. I make this comparison because, according to your paper, deferred without tiling is 5-7x more expensive per light than tiled. Bandwidth really should be negligible, unless you're implying that latency from the linked lists is the issue. That would be weird given the AMD Mecha demo.
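(As a rough worked example, assuming a 16x16 tile and something like 16 bytes of G-buffer data per pixel: non-tiled deferred reads about 256 x 16 = 4 KB of G-buffer per light per tile, versus a handful of bytes for that tile's list entry, i.e. roughly three orders of magnitude, which is the gap being described.)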

The way I see it, this method should be almost 100x faster than quad-based non-tiled deferred culling (due to a render target that's several hundred times smaller), subject to triangle setup overhead from the lights (a tiny fraction of Z-pass setup). Your tile frustum tests are taken care of in 2D by the rasterizer, and the shader does the Z test along with the linked-list append.

I don't see why you need a full Z mip tree. Just sample all the pixels in a tile and output the min/max.
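The downscale itself is about as simple as it sounds. A single-pass CPU sketch (a compute-shader version would use one thread group per tile and a shared-memory reduction; the [0,1] depth range and linear buffer layout are assumptions):

Code:
#include <algorithm>
#include <cstddef>
#include <vector>

struct TileZBounds { float zMin, zMax; };

// One pass over the depth buffer, emitting min/max depth per tile; no
// intermediate mip levels needed, exactly as described above.
std::vector<TileZBounds> ReduceDepthToTiles(const std::vector<float>& depth,
                                            int width, int height, int tileSize)
{
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<TileZBounds> out(std::size_t(tilesX) * tilesY, {1.0f, 0.0f}); // depth assumed in [0,1]

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            TileZBounds& b = out[(y / tileSize) * tilesX + (x / tileSize)];
            const float z = depth[std::size_t(y) * width + x];
            b.zMin = std::min(b.zMin, z);
            b.zMax = std::max(b.zMax, z);
        }
    return out;
}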