Forward+

Discussion in 'Rendering Technology and APIs' started by Bryant, Apr 1, 2012.

  1. Bryant

    Newcomer

    Joined:
    Dec 16, 2006
    Messages:
    31
    I was checking out new papers submitted for Eurographics 2012 and I saw this paper entitled "Forward+: Bringing Deferred Lighting to the Next Level".

    A preview of the paper is available here: https://sites.google.com/site/takahiroharada/ and here is an excerpt:

    The biggest deal to me is that it allows hardware antialiasing while taking an approach similar to deferred rendering.
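
    For anyone skimming, the gist as I understand it: lights are culled into screen-space tiles first (a compute shader in the paper), then a normal forward pass shades each pixel using only its tile's light list, which is why hardware MSAA just works. A rough CPU-side C++ sketch of that idea (illustrative only; the light layout, 16x16 tile size, and function names are my assumptions, not the paper's code):

        #include <algorithm>
        #include <cmath>
        #include <cstdint>
        #include <vector>

        struct Light { float x, y, radiusPx; };   // screen-space position and radius (hypothetical layout)

        constexpr int kTile = 16;                 // 16x16-pixel tiles, as in most tiled schemes

        // Pass 1: build a light index list per tile ("light culling").
        std::vector<std::vector<uint32_t>> cullLights(const std::vector<Light>& lights,
                                                      int width, int height)
        {
            const int tilesX = (width  + kTile - 1) / kTile;
            const int tilesY = (height + kTile - 1) / kTile;
            std::vector<std::vector<uint32_t>> tileLists(tilesX * tilesY);

            for (uint32_t i = 0; i < lights.size(); ++i) {
                const Light& l = lights[i];
                // Conservative overlap test: light's bounding square vs. the tile grid.
                int x0 = std::max(0, int((l.x - l.radiusPx) / kTile));
                int x1 = std::min(tilesX - 1, int((l.x + l.radiusPx) / kTile));
                int y0 = std::max(0, int((l.y - l.radiusPx) / kTile));
                int y1 = std::min(tilesY - 1, int((l.y + l.radiusPx) / kTile));
                for (int ty = y0; ty <= y1; ++ty)
                    for (int tx = x0; tx <= x1; ++tx)
                        tileLists[ty * tilesX + tx].push_back(i);
            }
            return tileLists;
        }

        // Pass 2 (conceptually the forward pixel shader): each pixel walks only
        // the light list of the tile it falls in, not every light in the scene.
        float shadePixel(int px, int py, int tilesX,
                         const std::vector<Light>& lights,
                         const std::vector<std::vector<uint32_t>>& tileLists)
        {
            float result = 0.0f;
            for (uint32_t idx : tileLists[(py / kTile) * tilesX + (px / kTile)]) {
                const Light& l = lights[idx];
                float dx = px - l.x, dy = py - l.y;
                float d  = std::sqrt(dx * dx + dy * dy);
                if (d < l.radiusPx)
                    result += 1.0f - d / l.radiusPx;   // stand-in for a real BRDF term
            }
            return result;
        }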
     
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Does that demo run @30fps on Tahiti?
     
  3. Bryant

    Newcomer

    Joined:
    Dec 16, 2006
    Messages:
    31
    I tried out the Leo demo on my 6970 and got 20-30 fps. I don't currently have a 7000 series card to test on.
     
  4. frogblast

    Newcomer

    Joined:
    Apr 1, 2008
    Messages:
    78
    Not sure, but it wouldn't be representative. I'm pretty sure the Leo demo was doing a whole lot more than tiled forward shading (it also included ptex and some form of indirect lighting, as I recall), which makes it hard to tell where the performance is going.
     
  5. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Damn, that is impressive. :wink:

    I saw its video. Looked pretty darn close to REYES quality to me. I think we might get to real-time REYES-quality rendering within this decade.
     
  6. MJP

    MJP
    Regular

    Joined:
    Feb 21, 2007
    Messages:
    562
    Location:
    Irvine, CA
    I actually just put up a blog post with some numbers from my own test app. 6970 seems to do really well with this technique. I wish I had a 7970 to try out.
     
  7. Ryan Smith

    Regular Subscriber

    Joined:
    Mar 26, 2010
    Messages:
    441
    Location:
    PCIe x16_1
    Leo doesn't use PTEX. AMD used the same art assets for their PTEX demo, but ultimately the PTEX demo is something else entirely. Unfortunately just about everyone has confused the two - even I made that mistake at AMD's editor's day in the demo room.
     
  8. Lightman

    Veteran

    Joined:
    Jun 9, 2008
    Messages:
    1,573
    Location:
    Torquay, UK
    Makes sense! PTEX is not supported on HD69xx and the Leo demo still runs just fine.
     
  9. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,340
    MJP, I tried it on a 7970 and deferred is slower, just as with the 6970, but I don't know what resolution you used for your results. By default the app loads in a window, and I don't know if it always uses the same resolution. Let me know and I'll post the results.
     
  10. AlNets

    AlNets ¯\_(ツ)_/¯
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    17,933
    Location:
    Polaris
    So what do you expert folks think about this method? :p Or is it too early to tell (not enough demoing/fps benchmarking)?
     
  11. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,340
    I don't think there's any reason ptex can't be supported on HD69xx, though if ptex is implemented with partially resident textures then it would be 7000 series only.
     
  12. MJP

    MJP
    Regular

    Joined:
    Feb 21, 2007
    Messages:
    562
    Location:
    Irvine, CA
    I gathered all of my results at 1920x1080. The window defaults to 1280x720.

    The page for the demo mentions a "Ptex and PRT Technology Preview", which must be what Ryan Smith is talking about.

    From early tests so far it seems pretty good for AMD hardware, and a clear winner when MSAA is involved. On Nvidia hardware it doesn't fare nearly as well, at least compared to tile-based deferred rendering implemented in a compute shader. Overall it's a practical technique if you want to stick to forward rendering but still want a lot of dynamic lights.
     
    #12 MJP, Apr 2, 2012
    Last edited by a moderator: Apr 2, 2012
  13. Bryant

    Newcomer

    Joined:
    Dec 16, 2006
    Messages:
    31
    The 680 seems to do really well in the benchmarks on your blog post.
     
  14. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,340
    You have to be careful to compare the same resolutions. The only numbers posted so far that are comparable between the GTX680 and Radeon 7970 are the following.

    1024 Lights on the GTX680
    MSAA Level    Light Indexed Deferred    Tile-Based Deferred
    No MSAA       10.2 ms                   12.6 ms
    2x MSAA       11.62 ms                  15.15 ms
    4x MSAA       12.65 ms                  16.39 ms

    1024 Lights on the Radeon 7970
    MSAA Level    Light Indexed Deferred    Tile-Based Deferred
    No MSAA       6.02 ms                   4.63 ms
    2x MSAA       6.85 ms                   6.58 ms
    4x MSAA       7.52 ms                   8.00 ms

    And one commenter speculated the 680's smaller amount of shared memory is holding it back.
     
  15. Bryant

    Newcomer

    Joined:
    Dec 16, 2006
    Messages:
    31
    I find it odd that LID with no MSAA on the 680 is slower than LID on the 7970 with 4x MSAA.

    Hopefully there's more research put into this stuff. I really like MSAA. :)
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
  17. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,517
    Location:
    British Columbia, Canada
    These numbers don't seem right, at least in terms of the underlying techniques. Without MSAA the 7970 and 680 are typically neck and neck in tile-based deferred. With MSAA the 680 wins by a decent margin due to some unexplained (to me) bottleneck (see my SIGGRAPH presentation or BF3 benchmarks with deferred MSAA). Here's my older benchmark to play with in terms of tile-based and conventional deferred:
    http://software.intel.com/en-us/art...g-for-current-and-future-rendering-pipelines/

    That said, Sponza isn't really the best test scene for this, and the configuration of lights in the demo really just turns this into an ALU test (at least beyond 128 lights). To demonstrate this, fly up and zoom in so that you can just see the roof filling your whole screen... note how with 1024 lights it doesn't really get much faster. Beyond the point where every pixel has a bunch of lights affecting it (say 2-8), there's questionable utility in adding more lights.

    That's not to say it's a totally unrealistic scene, but I'd prefer to see that many lights distributed over a wider area so that more significant culling is happening. Now of course Power Plant isn't a great scene either, but I did test this on a fair number of real game scenes and the results between GPUs were more consistent.
     
    #17 Andrew Lauritzen, Apr 27, 2012
    Last edited by a moderator: Apr 27, 2012
  18. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Andrew, isn't your comparison different? Forward+ uses light tiles to avoid having to write a G-Buffer.

    Judging from your slides, the MSAA perf hit with deferred - even with your fancy edge detection and pixel repacking - is greater than what you see with most forward rendered games. I'm also sure that the render time should be significantly faster without having to write/read a G-buffer. As nice as that perf boost would be, the biggest strength of Forward+ is probably shader variety.

    Also, despite forward-rendering MSAA being inefficient on small interior triangle edges, it is a form of selective supersampling that can be very important for specular surfaces, so you'd have to identify those areas in the deferred renderer to make it truly apples to apples.
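
    To make the structural difference concrete, here is a rough frame outline of the two approaches being compared (my own paraphrase; the pass names are hypothetical stubs, not from the paper or Andrew's demo):

        // Hypothetical pass names with stub bodies, just to make the contrast concrete.
        void renderDepthPrepass() { /* depth-only rasterization of the scene */ }
        void renderGBuffer() { /* rasterize scene, write normals/albedo/depth */ }
        void cullLightsPerTile() { /* same per-tile light culling step in both paths */ }
        void shadeFromGBuffer() { /* full-screen/compute pass reads the G-buffer and applies each tile's lights */ }
        void forwardShadeWithTileLists() { /* ordinary pixel shaders loop over their tile's light list */ }

        // Tiled deferred: geometry is rasterized once, but every shaded pixel goes
        // through a fat G-buffer write and read, and MSAA needs per-sample handling.
        void tiledDeferredFrame()
        {
            renderGBuffer();
            cullLightsPerTile();
            shadeFromGBuffer();
        }

        // Forward+: no G-buffer, but the scene geometry is rasterized twice
        // (depth prepass + shading pass); hardware MSAA and per-material shaders
        // work as in any forward renderer.
        void forwardPlusFrame()
        {
            renderDepthPrepass();
            cullLightsPerTile();
            forwardShadeWithTileLists();
        }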
     
  19. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,517
    Location:
    British Columbia, Canada
    The "tiled deferred" implementation in that demo should be similar to mine and that's what I was comparing. As I noted, I imagine the disparity comes from just massive ALU saturation when you set it to 1024 lights in MJP's demo. With 128 lights the results are more similar to mine.

    Indeed I don't implement the "forward+" (personally I'd still call this closer to deferred than pure forward, but that's just me :)), but I wasn't looking at those numbers.

    So-called "shader variety" is a total red herring. Deferred can run arbitrary shaders just as efficiently as forward (sometimes more so, due to 2x2 quad scheduling from the rasterizer). Try it :) And sure, you avoid reading/writing the G-buffer (once), but you re-render/transform/tessellate/skin all your geometry. So it's app and scene dependent which is faster, of course.

    It doesn't make a difference in practice unless you're rendering your entire mesh with that high a density, and even then it's a bad way of doing it. In fact, in cases where it was visible it would produce objectionable problems that would reveal the mesh tessellation, so I'm not sure this should ever be considered desirable.

    I'd give that one to deferred again, because you can selectively super-sample wherever you like, not just at triangle edges. I refuse to be shackled by the rasterizer in terms of shader execution and evaluation :)

    Anyways I should write a big blog post about this at some point, the main point being "these are all variants of similar ideas - test them all and use the best. It's just normal code optimization." The only reason people seem to think these are fundamentally different things is because of the semi-bizarre way that you write code in the graphics pipeline. In fact I would wager that's pretty much the entire reason for people tending to have a bias against deferred from a conceptual point of view... but try to separate out thinking of how one might write the code from how it gets executed. It doesn't end up being as different as one might think, and long term the ease of writing code is irrelevant. Frankly any decent engine should be able to generate the shaders to swap between forward and deferred with the click of a button.

    The only really important point is doing culling with knowledge of the depth buffer and semi-hierarchically. In fact, all of these GPU tiling variants are doing light culling *really inefficiently* (tons of redundant computation) due to the GPU programming model, so I'm more interested in seeing that addressed than more variants of how to launch the shading work. At the moment the CPU can cull the light lists significantly faster than the GPU (!), and yet it's still a win over conventional methods which really demonstrates how bad those methods were :)

    That's not to say it isn't good to test and document all this stuff, but there's really nothing interesting to talk about from a research point of view IMHO. All of the tiled variants can produce identical results with similar levels of efficiency. It's literally just constant factors we're playing with here and they vary per application.
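
    For reference, the depth-aware per-tile test described above looks roughly like this (a minimal C++ illustration, not anyone's shipping code; the struct layouts and names are made up):

        // Per-tile light culling with depth-buffer knowledge: reject a point light
        // if its sphere of influence misses either the tile's mini-frustum or the
        // tile's min/max view-space depth range. The depth-range rejection is the
        // part that needs the depth buffer and is where most of the culling win
        // comes from.
        struct PointLight { float x, y, z, radius; };   // view-space position and radius

        struct Plane { float nx, ny, nz, d; };          // n.p + d >= 0 means "inside"

        struct TileFrustum {
            Plane sides[4];     // left/right/top/bottom planes built from the tile corners
            float minZ, maxZ;   // min/max view-space depth read back from the depth buffer
        };

        // Returns true if the light can affect any visible surface in the tile.
        bool lightIntersectsTile(const PointLight& l, const TileFrustum& t)
        {
            // Depth-range rejection first.
            if (l.z + l.radius < t.minZ || l.z - l.radius > t.maxZ)
                return false;

            // Then test against the four side planes of the tile's mini-frustum.
            for (const Plane& p : t.sides) {
                float dist = p.nx * l.x + p.ny * l.y + p.nz * l.z + p.d;
                if (dist < -l.radius)
                    return false;   // sphere completely outside this plane
            }
            return true;
        }

    Run once per light per tile, either in a compute shader or (as noted above) on the CPU; the per-tile output is just a list of light indices for the shading pass to consume.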
     
    #19 Andrew Lauritzen, Apr 28, 2012
    Last edited by a moderator: Apr 28, 2012
  20. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,340
    In what way would you improve the programming model?
     
