Forward+

I was checking out new papers submitted for Eurographics 2012 and saw this paper, entitled "Forward+: Bringing Deferred Lighting to the Next Level".

A preview of the paper is available here: https://sites.google.com/site/takahiroharada/. Here is an excerpt:

This paper presents Forward+, a method of rendering many lights by culling and storing only lights that contribute to the pixel. Forward+ is an extension to traditional forward rendering. Light culling, implemented using the compute capability of the GPU, is added to the pipeline to create lists of lights; that list is passed to the final rendering shader, which can access all information about the lights. Although Forward+ increases workload to the final shader, it theoretically requires less memory traffic compared to compute-based deferred lighting. Furthermore, it removes the major drawback of deferred techniques, which is a restriction of materials and lighting models. Experiments are performed to compare the performance of Forward+ and deferred lighting.

The biggest deal to me is the fact that it allows hardware antialiasing with an approach similar to deferred rendering.
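
To make the abstract a little more concrete, here is a rough CPU-side sketch of the idea in C++ (the actual technique runs the culling in a compute shader and consumes the lists in the forward pixel shader; every type and function name below is made up for illustration, and the culling test is deliberately simplified): one pass builds a small light-index list per 16x16 screen tile, and the shading step for a pixel only loops over its own tile's list, so per-pixel cost stops scaling with the total light count.

```cpp
// Illustrative CPU sketch of the Forward+ idea: cull lights into per-tile
// index lists, then shade each pixel using only its own tile's list.
// All names/types here are hypothetical.
#include <cstdint>
#include <vector>
#include <cmath>

struct Vec3 { float x, y, z; };

struct Light {
    Vec3  position;   // view-space position
    float radius;     // falloff radius used for culling
};

struct Tile {
    std::vector<uint32_t> lightIndices;   // indices into the global light array
};

constexpr int kTileSize = 16;             // 16x16 pixel tiles

// Very rough light-vs-tile test: project the light's bounding sphere to the
// screen and intersect it with the tile's pixel rectangle. A real
// implementation tests against the tile's view-space frustum, including the
// tile's min/max depth from the depth buffer.
bool lightTouchesTile(const Light& l, int tileX, int tileY,
                      int width, int height, float focalLength)
{
    if (l.position.z <= 0.0f) return false;                 // behind the camera
    float sx = l.position.x / l.position.z * focalLength;   // crude projection to NDC
    float sy = l.position.y / l.position.z * focalLength;
    float sr = l.radius     / l.position.z * focalLength;   // projected radius
    float px = (sx * 0.5f + 0.5f) * float(width);
    float py = (sy * 0.5f + 0.5f) * float(height);
    float pr = sr * 0.5f * float(width);
    float minX = float(tileX * kTileSize), maxX = minX + kTileSize;
    float minY = float(tileY * kTileSize), maxY = minY + kTileSize;
    float cx = std::fmax(minX, std::fmin(px, maxX));        // closest point on the tile
    float cy = std::fmax(minY, std::fmin(py, maxY));
    float dx = px - cx, dy = py - cy;
    return dx * dx + dy * dy <= pr * pr;
}

// "Light culling" pass: build one light-index list per screen tile.
std::vector<Tile> cullLights(const std::vector<Light>& lights,
                             int width, int height, float focalLength)
{
    int tilesX = (width  + kTileSize - 1) / kTileSize;
    int tilesY = (height + kTileSize - 1) / kTileSize;
    std::vector<Tile> tiles(size_t(tilesX) * size_t(tilesY));
    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx)
            for (uint32_t i = 0; i < uint32_t(lights.size()); ++i)
                if (lightTouchesTile(lights[i], tx, ty, width, height, focalLength))
                    tiles[size_t(ty) * tilesX + tx].lightIndices.push_back(i);
    return tiles;
}

// "Final shading" step for one pixel: only the tile's lights are visited,
// so per-pixel cost no longer scales with the total number of lights.
float shadePixel(int px, int py, int width,
                 const std::vector<Tile>& tiles,
                 const std::vector<Light>& lights)
{
    int tilesX = (width + kTileSize - 1) / kTileSize;
    const Tile& tile = tiles[size_t(py / kTileSize) * tilesX + size_t(px / kTileSize)];
    float lit = 0.0f;
    for (uint32_t idx : tile.lightIndices)
        lit += 1.0f / (1.0f + lights[idx].radius);   // placeholder term; a real shader
                                                     // evaluates the full material/light model
    return lit;
}
```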
 
I tried out the Leo demo on my 6970 and got 20-30 fps; I don't currently have a 7000 series card to test on.
 
Does that demo run @30fps on Tahiti?

Not sure, but it wouldn't be representative. I'm pretty sure the Leo demo was doing a whole lot more than tiled forward shading (it also included ptex and some form of indirect lighting, as I recall), which makes it hard to tell where the performance is going.
 
I tried out the Leo demo on my 6970 and got 20-30 fps; I don't currently have a 7000 series card to test on.

Damn, that is impressive. ;)

I saw its video. Looked pretty darn close to REYES quality to me. I think we might get to real-time REYES-quality rendering within this decade.
 
I actually just put up a blog post with some numbers from my own test app. 6970 seems to do really well with this technique. I wish I had a 7970 to try out.
 
Not sure, but it wouldn't be representative. I'm pretty sure the Leo demo was doing a whole lot more than tiled forward shading (it also included ptex and some form of indirect lighting, as I recall), which makes it hard to tell where the performance is going.
Leo doesn't use PTEX. AMD used the same art assets for their PTEX demo, but ultimately the PTEX demo is entirely something else. Unfortunately just about everyone has confused the two - even I made that mistake at AMD's editor's day in the demo room.
 
Leo doesn't use PTEX. AMD used the same art assets for their PTEX demo, but ultimately the PTEX demo is entirely something else. Unfortunately just about everyone has confused the two - even I made that mistake at AMD's editor's day in the demo room.

Makes sense! PTEX is not supported on HD69xx and the Leo demo still runs just fine.
 
MJP, I tried it on a 7970 and deferred is slower, just as with the 6970, but I don't know what resolution you used for your results. By default the app loaded in a window, and I don't know if it always loads at the same resolution. Let me know and I'll post the results.
 
So what do you expert folks think about this method? :p Or is it too early to tell (not enough demos/benchmarks yet)?
 
Makes sense! PTEX is not supported on HD69xx and the Leo demo still runs just fine.
I don't think there's any reason PTEX can't be supported on HD69xx, though if PTEX is implemented with partially resident textures then it would be 7000 series only.
 
MJP, I tried it on a 7970 and deferred is slower, just as with the 6970, but I don't know what resolution you used for your results. By default the app loaded in a window, and I don't know if it always loads at the same resolution. Let me know and I'll post the results.

I gathered all of my results at 1920x1080. The window defaults to 1280x720.

I don't think there's any reason PTEX can't be supported on HD69xx, though if PTEX is implemented with partially resident textures then it would be 7000 series only.

The page for the demo mentions a "Ptex and PRT Technology Preview", which must be what Ryan Smith is talking about.

So what do you expert folks think about this method? :p Or is it too early to tell (not enough demos/benchmarks yet)?

From early tests so far it seems pretty good on AMD hardware, and a clear winner when MSAA is involved. On Nvidia hardware it doesn't fare nearly as well, at least compared to tile-based deferred rendering implemented in a compute shader. Overall, though, it's a practical technique if you really want to stick to forward rendering but still want a lot of dynamic lights.
 
You have to be careful to compare the same resolutions. The only numbers posted so far that are comparable between the GTX680 and Radeon 7970 are the following.

1024 Lights on the GTX680

MSAA Level | Light Indexed Deferred | Tile-Based Deferred
No MSAA    | 10.2 ms                | 12.6 ms
2x MSAA    | 11.62 ms               | 15.15 ms
4x MSAA    | 12.65 ms               | 16.39 ms

1024 Lights on the Radeon 7970

MSAA Level | Light Indexed Deferred | Tile-Based Deferred
No MSAA    | 6.02 ms                | 4.63 ms
2x MSAA    | 6.85 ms                | 6.58 ms
4x MSAA    | 7.52 ms                | 8.00 ms

And one commenter speculated the 680's smaller amount of shared memory is holding it back.
 
I find it odd that LID with 0xMSAA on the 680 is slower than LID on the 7970 with 4xMSAA.

Hopefully there's more research put into this stuff. I really like MSAA. :)
 
... numbers ...
These numbers don't seem right, at least in terms of the underlying techniques. Without MSAA the 7970 and 680 are typically neck and neck in tile-based deferred. With MSAA the 680 wins by a decent margin due to some unexplained (to me) bottleneck (see my SIGGRAPH presentation or BF3 benchmarks with deferred MSAA). Here's my older benchmark to play with in terms of tile-based and conventional deferred:
http://software.intel.com/en-us/art...g-for-current-and-future-rendering-pipelines/

That said, Sponza isn't really the best test scene for this, and the configuration of lights in the demo really just turns this into an ALU test (at least beyond 128 lights). To demonstrate this, fly up and zoom in so that you can just see the roof filling your whole screen... note how with 1024 lights it doesn't really get much faster. Beyond the point where every pixel has a bunch of lights affecting it (say 2-8), the utility of adding more lights is arguable.

That's not to say it's a totally unrealistic scene, but I'd prefer to see that many lights distributed over a wider area so that more significant culling is happening. Now of course Power Plant isn't a great scene either, but I did test this on a fair number of real game scenes and the results between GPUs were more consistent.
 
Andrew, isn't your comparison different? Forward+ uses light tiles to avoid having to write a G-Buffer.

Judging from your slides, the MSAA perf hit with deferred - even with your fancy edge detection and pixel repacking - is greater than what you see with most forward rendered games. I'm also sure that the render time should be significantly faster without having to write/read a G-buffer. As nice as that perf boost would be, the biggest strength of Forward+ is probably shader variety.

Also, despite forward-rendering MSAA being inefficient on small interior triangle edges, it is a form of selective supersampling that can be very important for specular surfaces, so you'd have to identify those areas in the deferred renderer to make it truly apples to apples.
 
Andrew, isn't your comparison different? Forward+ uses light tiles to avoid having to write a G-Buffer.
The "tiled deferred" implementation in that demo should be similar to mine and that's what I was comparing. As I noted, I imagine the disparity comes from just massive ALU saturation when you set it to 1024 lights in MJP's demo. With 128 lights the results are more similar to mine.

Indeed I don't implement the "forward+" (personally I'd still call this closer to deferred than pure forward, but that's just me :)), but I wasn't looking at those numbers.

Judging from your slides, the MSAA perf hit with deferred - even with your fancy edge detection and pixel repacking - is greater than what you see with most forward rendered games. I'm also sure that the render time should be significantly faster without having to write/read a G-buffer. As nice as that perf boost would be, the biggest strength of Forward+ is probably shader variety.
So-called "shader variety" is a total red herring. Deferred can run arbitrary shaders just as efficiently as forward (sometimes more so, due to 2x2 quad scheduling from the rasterizer). Try it :) And sure, you avoid reading/writing the G-buffer (once), but you re-render/transform/tessellate/skin all your geometry. So it's app- and scene-dependent which is faster, of course.

Also, despite forward-rendering MSAA being inefficient on small interior triangle edges, it is a form of selective supersampling that can be very important for specular surfaces, so you'd have to identify those areas in the deferred renderer to make it truly apples to apples.
It doesn't make a difference in practice unless you're rendering your entire mesh with that high density, and even then it's a bad way of doing it. In fact, in cases where it was visible it would produce objectionable problems that would reveal the mesh tessellation, so I'm not sure this should ever be considered desirable.

I'd give that one to deferred again, because you can selectively super-sample wherever you like, not just at triangle edges. I refuse to be shackled by the rasterizer in terms of shader execution and evaluation :)
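
Here's a minimal sketch of that classification idea in C++ (made-up types; not the actual implementation from the presentation): shade per sample only where a pixel's MSAA G-buffer samples actually differ, otherwise shade once and reuse the result for every sample.

```cpp
// Illustrative sketch (hypothetical types) of per-pixel classification for
// MSAA deferred shading: shade all samples only where the G-buffer samples
// disagree within a pixel, otherwise shade once and broadcast.
#include <array>
#include <cmath>

constexpr int kSamples = 4;   // 4x MSAA

struct GBufferSample {
    float depth;
    float normalX, normalY, normalZ;
};

using PixelSamples = std::array<GBufferSample, kSamples>;

// "Edge" here means the samples within one pixel differ enough that shading
// once per pixel would be wrong; the thresholds are arbitrary for the sketch.
bool needsPerSampleShading(const PixelSamples& s)
{
    for (int i = 1; i < kSamples; ++i) {
        float dDepth = std::fabs(s[i].depth - s[0].depth);
        float dot = s[i].normalX * s[0].normalX +
                    s[i].normalY * s[0].normalY +
                    s[i].normalZ * s[0].normalZ;
        if (dDepth > 0.01f || dot < 0.99f)
            return true;
    }
    return false;
}

float shadeSample(const GBufferSample& s)
{
    // Placeholder lighting: N.L against a fixed light direction (0, 0, 1).
    return std::fmax(0.0f, s.normalZ);
}

std::array<float, kSamples> shadeMSAAPixel(const PixelSamples& s)
{
    std::array<float, kSamples> out{};
    if (needsPerSampleShading(s)) {
        for (int i = 0; i < kSamples; ++i)   // "edge" pixel: shade every sample
            out[i] = shadeSample(s[i]);
    } else {
        out.fill(shadeSample(s[0]));         // interior pixel: shade once, broadcast
    }
    return out;
}
```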

Anyways I should write a big blog post about this at some point, the main point being "these are all variants of similar ideas - test them all and use the best. It's just normal code optimization." The only reason people seem to think these are fundamentally different things is because of the semi-bizarre way that you write code in the graphics pipeline. In fact I would wager that's pretty much the entire reason for people tending to have a bias against deferred from a conceptual point of view... but try to separate out thinking of how one might write the code from how it gets executed. It doesn't end up being as different as one might think, and long term the ease of writing code is irrelevant. Frankly any decent engine should be able to generate the shaders to swap between forward and deferred with the click of a button.

The only really important point is doing culling with knowledge of the depth buffer and semi-hierarchically. In fact, all of these GPU tiling variants are doing light culling *really inefficiently* (tons of redundant computation) due to the GPU programming model, so I'm more interested in seeing that addressed than more variants of how to launch the shading work. At the moment the CPU can cull the light lists significantly faster than the GPU (!), and yet it's still a win over conventional methods which really demonstrates how bad those methods were :)
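
As a small illustration of the depth-aware part of the culling, here is a hedged CPU sketch (again with made-up types, not anyone's actual implementation): each tile stores the min/max depth of its pixels, and a light only makes it into the tile's list if its bounding sphere overlaps that depth range as well as the tile's screen-space footprint.

```cpp
// CPU sketch of depth-aware per-tile light culling (hypothetical names/types).
// A light is rejected first by the tile's depth range, then by a coarse
// lateral test against the tile's frustum cross-section at the light's depth.
#include <cstdint>
#include <vector>

struct PointLight {
    float viewX, viewY, viewZ;    // view-space position
    float radius;
};

struct TileBounds {
    float minDepth, maxDepth;     // per-tile min/max from the depth buffer
    float minX, maxX, minY, maxY; // tile's view-space extents at unit depth (precomputed)
};

std::vector<std::vector<uint32_t>>
cullLightsCPU(const std::vector<PointLight>& lights,
              const std::vector<TileBounds>& tiles)
{
    std::vector<std::vector<uint32_t>> lists(tiles.size());
    for (size_t t = 0; t < tiles.size(); ++t) {
        const TileBounds& tile = tiles[t];
        for (uint32_t i = 0; i < uint32_t(lights.size()); ++i) {
            const PointLight& l = lights[i];
            // 1) Depth-range rejection: the step a depth-unaware screen-space
            //    test misses, and the one that removes the most lights in
            //    scenes with real depth complexity.
            if (l.viewZ + l.radius < tile.minDepth ||
                l.viewZ - l.radius > tile.maxDepth)
                continue;
            // 2) Coarse lateral test: the tile's extents scale linearly with
            //    depth, so compare against its cross-section at the light's
            //    depth, expanded by the light radius (conservative).
            if (l.viewX < tile.minX * l.viewZ - l.radius ||
                l.viewX > tile.maxX * l.viewZ + l.radius ||
                l.viewY < tile.minY * l.viewZ - l.radius ||
                l.viewY > tile.maxY * l.viewZ + l.radius)
                continue;
            lists[t].push_back(i);
        }
    }
    return lists;
}
```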

That's not to say it isn't good to test and document all this stuff, but there's really nothing interesting to talk about from a research point of view IMHO. All of the tiled variants can produce identical results with similar levels of efficiency. It's literally just constant factors we're playing with here and they vary per application.
 
In fact, all of these GPU tiling variants are doing light culling *really inefficiently* (tons of redundant computation) due to the GPU programming model, so I'm more interested in seeing that addressed than more variants of how to launch the shading work. At the moment the CPU can cull the light lists significantly faster than the GPU (!), and yet it's still a win over conventional methods which really demonstrates how bad those methods were :)
In what way would you improve the programming model?
 