#1 | |
|
Junior Member
Join Date: Dec 2006
Posts: 28
|
I was checking out the new papers submitted to Eurographics 2012 and saw one entitled "Forward+: Bringing Deferred Lighting to the Next Level".
A preview of the paper is available here: https://sites.google.com/site/takahiroharada/ and here is an excerpt: Quote:
|
|
|
|
|
|
|
#2 |
|
Senior Member
|
Does that demo run @30fps on Tahiti?
|
|
|
|
|
|
#3 |
|
Junior Member
Join Date: Dec 2006
Posts: 28
|
I tried out the Leo demo on my 6970 and got 20-30 fps. I don't currently have a 7000-series card to test on.
|
|
|
|
|
|
#4 |
|
Junior Member
Join Date: Apr 2008
Posts: 76
|
Not sure, but it wouldn't be representative. I'm pretty sure the Leo demo was doing a whole lot more than tiled forward shading (it also included ptex and some form of indirect lighting as I recall), which makes it hard to tell where the performance is going.
|
|
|
|
|
|
#5 | |
|
Senior Member
|
Quote:
I saw its video. It looked pretty darn close to REYES quality to me. I think we might get to real-time REYES-quality rendering within this decade.
|
|
|
|
|
|
#6 |
|
Member
Join Date: Feb 2007
Location: Irvine, CA
Posts: 426
|
I actually just put up a blog post with some numbers from my own test app. 6970 seems to do really well with this technique. I wish I had a 7970 to try out.
|
|
|
|
|
|
#7 |
|
Junior Member
Join Date: Mar 2010
Posts: 70
|
Leo doesn't use PTEX. AMD used the same art assets for their PTEX demo, but ultimately the PTEX demo is entirely something else. Unfortunately just about everyone has confused the two - even I made that mistake at AMD's editor's day in the demo room.
|
|
|
|
|
|
#8 |
|
Member
Join Date: Jun 2008
Location: Torquay, UK
Posts: 913
|
Makes sense! PTEX is not supported on HD69xx, and the Leo demo still runs just fine.
|
|
|
|
|
|
#9 |
|
Senior Member
Join Date: Feb 2002
Posts: 2,020
|
MJP, I tried it on a 7970 and deferred is slower, as with the 6970, but I don't know what resolution you used for your results. By default the app loaded in a window, and I don't know if it always loads at the same resolution. Let me know and I'll post the results.
|
|
|
|
|
|
#10 |
|
penguins
Join Date: Feb 2004
Posts: 13,978
|
So what do you expert folks think about this method?
|
|
|
|
|
|
#11 |
|
Senior Member
Join Date: Feb 2002
Posts: 2,020
|
|
|
|
|
|
|
#12 | ||
|
Member
Join Date: Feb 2007
Location: Irvine, CA
Posts: 426
|
Quote:
Quote:
From early tests so far it seems pretty good on AMD hardware, and a clear winner when MSAA is involved. On Nvidia hardware it doesn't fare nearly as well, at least compared to tile-based deferred rendering implemented in a compute shader. Overall, though, it's a practical technique if you really want to stick with forward rendering but also want a lot of dynamic lights.
Last edited by MJP; 02-Apr-2012 at 04:32.
||
|
|
|
|
|
#13 |
|
Junior Member
Join Date: Dec 2006
Posts: 28
|
|
|
|
|
|
|
#14 |
|
Senior Member
Join Date: Feb 2002
Posts: 2,020
|
You have to be careful to compare the same resolutions. The only numbers posted so far that are comparable between the GTX680 and Radeon 7970 are the following.
1024 Lights on the GTX680
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      10.2ms                   12.6ms
2x MSAA      11.62ms                  15.15ms
4x MSAA      12.65ms                  16.39ms

1024 Lights on the Radeon 7970
MSAA Level   Light Indexed Deferred   Tile-Based Deferred
No MSAA      6.02ms                   4.63ms
2x MSAA      6.85ms                   6.58ms
4x MSAA      7.52ms                   8.00ms

And one commenter speculated the 680's smaller amount of shared memory is holding it back.
|
|
|
|
|
#15 |
|
Junior Member
Join Date: Dec 2006
Posts: 28
|
I find it odd that LID with 0xMSAA on the 680 is slower than LID on the 7970 with 4xMSAA.
Hopefully there's more research put into this stuff. I really like MSAA. |
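For anyone who wants to poke at those timings, here is a small Python sketch that works out what 4x MSAA costs for each GPU and technique. The numbers are copied from the table posted above; the dictionary layout and the helper names (`msaa_cost_ms`, `msaa_cost_ratio`) are just made up for illustration.

```python
# Frame times in milliseconds, copied from the table posted earlier in the thread.
# Keys: (gpu, technique) -> times for [no MSAA, 2x MSAA, 4x MSAA].
TIMINGS_MS = {
    ("GTX680", "light_indexed"): [10.2, 11.62, 12.65],
    ("GTX680", "tile_deferred"): [12.6, 15.15, 16.39],
    ("7970", "light_indexed"): [6.02, 6.85, 7.52],
    ("7970", "tile_deferred"): [4.63, 6.58, 8.00],
}


def msaa_cost_ms(times):
    """Extra milliseconds that 4x MSAA adds on top of the no-MSAA time."""
    return times[2] - times[0]


def msaa_cost_ratio(times):
    """4x MSAA frame time as a multiple of the no-MSAA frame time."""
    return times[2] / times[0]


if __name__ == "__main__":
    for (gpu, technique), times in TIMINGS_MS.items():
        print(f"{gpu:7s} {technique:14s} "
              f"+{msaa_cost_ms(times):.2f} ms, x{msaa_cost_ratio(times):.2f}")
```

Read this way, the table shows tile-based deferred on the 7970 going from the fastest option without MSAA to slower than light indexed deferred at 4x MSAA, and the 680 without MSAA still sitting behind the 7970 at 4x in the light indexed column.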
|
|
|
|
|
#16 |
|
Regular
|
There's a new "educational mode" for the Leo demo.
http://www.geeks3d.com/20120322/amd-...ode/#more-8243
Quite interesting. I wonder if they will release the source code?
__________________
Can it play WoW? |
|
|
|
|
|
#17 |
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
|
These numbers don't seem right, at least in terms of the underlying techniques. Without MSAA the 7970 and 680 are typically neck and neck in tile-based deferred. With MSAA the 680 wins by a decent margin due to some unexplained (to me) bottleneck (see my SIGGRAPH presentation or BF3 benchmarks with deferred MSAA). Here's my older benchmark to play with in terms of tile-based and conventional deferred:
http://software.intel.com/en-us/arti...ing-pipelines/
That said, Sponza isn't really the best test scene for this, and the configuration of lights in the demo really just turns this into an ALU test (at least beyond 128 lights). To demonstrate this, fly up and zoom in so that you can just see the roof filling your whole screen... note how with 1024 lights it doesn't really get much faster. Beyond the point where every pixel has a bunch of lights affecting it (say 2-8), the utility of adding more lights is debatable. That's not to say it's a totally unrealistic scene, but I'd prefer to see that many lights distributed over a wider area so that more significant culling is happening.
Now of course Power Plant isn't a great scene either, but I did test this on a fair number of real game scenes and the results between GPUs were more consistent.
__________________
The content of this message is my personal opinion only. Last edited by Andrew Lauritzen; 27-Apr-2012 at 22:55. |
|
|
|
|
|
#18 |
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Andrew, isn't your comparison different? Forward+ uses light tiles to avoid having to write a G-Buffer.
Judging from your slides, the MSAA perf hit with deferred - even with your fancy edge detection and pixel repacking - is greater than what you see with most forward rendered games. I'm also sure that the render time should be significantly faster without having to write/read a G-buffer. As nice as that perf boost would be, the biggest strength of Forward+ is probably shader variety. Also, despite forward rendering MSAA being inefficient on small interior triangle edges, it is a form of selective supersampling that can be very important for specular surfaces, so you'd have to identify those areas in the deferred to make it truly apples to apples. |
|
|
|
|
|
#19 | |||
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
|
Quote:
Indeed I don't implement the "forward+" (personally I'd still call this closer to deferred than pure forward, but that's just me Quote:
Quote:
I'd give that one to deferred again, because you can selectively super-sample wherever you like, not just at triangle edges. I refuse to be shackled by the rasterizer in terms of shader execution and evaluation.
Anyways I should write a big blog post about this at some point, the main point being "these are all variants of similar ideas - test them all and use the best. It's just normal code optimization." The only reason people seem to think these are fundamentally different things is because of the semi-bizarre way that you write code in the graphics pipeline. In fact I would wager that's pretty much the entire reason for people tending to have a bias against deferred from a conceptual point of view... but try to separate out thinking of how one might write the code from how it gets executed. It doesn't end up being as different as one might think, and long term the ease of writing code is irrelevant. Frankly any decent engine should be able to generate the shaders to swap between forward and deferred with the click of a button.
The only really important point is doing culling with knowledge of the depth buffer and semi-hierarchically. In fact, all of these GPU tiling variants are doing light culling *really inefficiently* (tons of redundant computation) due to the GPU programming model, so I'm more interested in seeing that addressed than more variants of how to launch the shading work. At the moment the CPU can cull the light lists significantly faster than the GPU (!), and yet it's still a win over conventional methods, which really demonstrates how bad those methods were.
That's not to say it isn't good to test and document all this stuff, but there's really nothing interesting to talk about from a research point of view IMHO. All of the tiled variants can produce identical results with similar levels of efficiency. It's literally just constant factors we're playing with here and they vary per application.
__________________
The content of this message is my personal opinion only. Last edited by Andrew Lauritzen; 28-Apr-2012 at 02:57. |
|||
|
|
|
|
|
#20 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,020
|
Quote:
|
|
|
|
|
|
|
#21 |
|
Member
Join Date: Nov 2007
Posts: 945
|
Light culling on the GPU could likely be done quite efficiently, as it's a problem that can be parallelized. Doing it the brute-force way (per pixel or per block, as is usually done with pixel shaders) is easy, but requires lots of duplicate work. Finding an efficient parallel algorithm that requires minimal work (and uses a minimal amount of bandwidth) is a pretty challenging problem.
I would personally like to try techniques similar to those the fastest GPU radix sorters use, with a quadtree to split the lights. Basically you could do a modified scan (prefix sum) based on culling results (four frustums). One thread processes one light, checks the four frustums, and increments the counts of the bins the light belongs to. Scan the bins and create the next iteration's data (this can have fewer or more items, depending on how many lights intersected multiple frustums and how many got completely culled out). The next iteration (log2(r) in total, where r is the ratio of screen resolution to block resolution) would again go through all the lights (one thread per light), but now check each light against the four sub-frustums of the frustum it belongs to. So in total 4*n*log2(r) frustum checks... log2(r) = 9 (for 720p and 4x4 block size). So it would result in approx 18 frustum checks per light (if we disregard intersected frustums and lights completely culled out).
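To make the idea above a bit more concrete, here is a rough CPU-side sketch in Python. It is not a GPU implementation and makes several simplifying assumptions that are not in the post: lights are 2D circles `(cx, cy, radius)` in tile units, the tile grid is a power-of-two square, and a circle-vs-square overlap test stands in for the real frustum tests. The function names are hypothetical. Each pass over `items` plays the role of "one thread per work item", and the prefix sum over hit counts is what gives every item its slot in the next iteration's compacted list.

```python
from itertools import accumulate


def circle_hits_square(cx, cy, r, x0, y0, size):
    """True if a circle overlaps the square with corner (x0, y0) and side `size` (edges included)."""
    nx = min(max(cx, x0), x0 + size)  # closest point on the square to the circle centre
    ny = min(max(cy, y0), y0 + size)
    return (cx - nx) ** 2 + (cy - ny) ** 2 <= r * r


def bin_lights_hierarchical(lights, grid_size):
    """Return ({(tile_x, tile_y): [light indices]}, number of overlap tests performed)."""
    assert grid_size >= 2 and (grid_size & (grid_size - 1)) == 0, "sketch assumes a power-of-two grid"
    # One work item per light, all starting at the root node that covers the whole grid.
    items = [(i, 0, 0) for i in range(len(lights))]
    tests = 0
    size = grid_size
    while size > 1:
        half = size // 2
        # Pass 1: every item tests the four children of its current node.
        hits = []
        for light, x0, y0 in items:
            cx, cy, r = lights[light]
            children = [(x0 + dx, y0 + dy) for dy in (0, half) for dx in (0, half)]
            tests += 4
            hits.append([c for c in children if circle_hits_square(cx, cy, r, c[0], c[1], half)])
        # Pass 2: exclusive prefix sum of hit counts gives each item its output offset.
        offsets = [0] + list(accumulate(len(h) for h in hits))
        # Pass 3: scatter survivors into the compacted work list for the next level.
        next_items = [None] * offsets[-1]
        for (light, _, _), h, off in zip(items, hits, offsets):
            for k, (x, y) in enumerate(h):
                next_items[off + k] = (light, x, y)
        items = next_items
        size = half
    tiles = {}
    for light, x, y in items:
        tiles.setdefault((x, y), []).append(light)
    return tiles, tests


def bin_lights_brute_force(lights, grid_size):
    """Reference version: test every light against every tile."""
    tiles, tests = {}, 0
    for ty in range(grid_size):
        for tx in range(grid_size):
            for i, (cx, cy, r) in enumerate(lights):
                tests += 1
                if circle_hits_square(cx, cy, r, tx, ty, 1):
                    tiles.setdefault((tx, ty), []).append(i)
    return tiles, tests
```

Both functions should produce the same per-tile lists; the hierarchical one just gets there with fewer overlap tests when lights are small relative to the screen, since a light that misses a quadrant is never tested against anything inside it.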
|
|
|
|
|
#22 | |
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
|
Quote:
On the GPU right now you can kind of do this with persistent threads (this is what OptiX and others do), but not efficiently, elegantly or portably. The programming model and hardware need to expand to not lock in a static register count/stack/shared memory size, else they won't be able to do anything more interesting than what they do today. Turns out hierarchical data structures and recursion are kind of important.
|
|
|
|
|
|
#23 | ||
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
Quote:
Does it really? I may not be understanding what's required, but the best way I can think of is downscaling the Z-buffer (tracking min/max) so that you have one pixel per tile and then render lights to an equal-sized buffer with a DX11-style linked list. I don't see a lot of redundancy or inefficiency going on there except the usual quad-based underutilization which shouldn't bring it down to CPU speeds. Last edited by Mintmaster; 30-Apr-2012 at 01:16. Reason: clarity |
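As a rough illustration of the approach described in this post, here is a CPU-side Python sketch. It is only a sketch under stated assumptions: the depth buffer is a plain list of rows, each light is given as an inclusive rectangle of tiles plus a depth range `(tx0, ty0, tx1, ty1, z_near, z_far)`, the loop over covered tiles stands in for the rasterizer, and a per-tile Python list stands in for the DX11-style linked list. All names are hypothetical.

```python
def tile_depth_bounds(depth, tile):
    """Downscale a depth buffer (list of rows) to one (min, max) pair per tile."""
    height, width = len(depth), len(depth[0])
    bounds = []
    for ty in range(0, height, tile):
        row = []
        for tx in range(0, width, tile):
            block = [depth[y][x]
                     for y in range(ty, min(ty + tile, height))
                     for x in range(tx, min(tx + tile, width))]
            row.append((min(block), max(block)))
        bounds.append(row)
    return bounds


def build_tile_light_lists(bounds, lights):
    """Append each light's index to every covered tile whose depth range it overlaps."""
    rows, cols = len(bounds), len(bounds[0])
    lists = [[[] for _ in range(cols)] for _ in range(rows)]
    for index, (tx0, ty0, tx1, ty1, z_near, z_far) in enumerate(lights):
        # Stand-in for rasterizing the light's bounds into the tile-sized target.
        for ty in range(max(ty0, 0), min(ty1, rows - 1) + 1):
            for tx in range(max(tx0, 0), min(tx1, cols - 1) + 1):
                z_min, z_max = bounds[ty][tx]
                # Stand-in for the Z test done in the shader: keep the light
                # only if its depth range overlaps the tile's min/max depth.
                if z_near <= z_max and z_far >= z_min:
                    lists[ty][tx].append(index)
    return lists
```

The point of the sketch is just that each light touches only the tiles it covers, so the work scales with tile coverage and not with pixel count.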
||
|
|
|
|
|
#24 | ||
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
|
Quote:
Quote:
The ability that I described lets you implement this pretty close to as efficiently (i.e. hierarchical rasterizer) as possible in compute. You just need a stack and work stealing really, both of which GPU hardware can do decently (see OptiX and other work)... it's trivial to implement in something like Cilk or TBB for instance. In fact if you download the ISPC compiler package, I wrote a CPU implementation of both the culling and shading, which is actually not as much slower than a modern GPU (iso-power) as you might think.
||
|
|
|
|
|
#25 | |
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
The way I see it, this method should be almost 100x faster than quad-based non-tiled deferred culling (due to several hundred times smaller render target), subject to triangle setup overhead from the lights (tiny fraction of Z-pass setup). Your tile frustum tests are taken care of in 2D by the rasterizer, and the shader does the Z test along with the LL. I don't see why you need a full Z mip tree. Just sample all the pixels in a tile and output the min/max. |
|
|
|
|