Deferred rendering, virtual texturing, IMR vs TB(D)R, and more...

Andrew Lauritzen

NOTE: This thread was split from the Microsoft Surface thread in the Handheld Forums.


As for the "free" thing, today's desktop GPUs have fillrate to spare; AF doesn't consume any worthwhile bandwidth or memory footprint, last time I checked, in contrast to anti-aliasing.
AF definitely consumes bandwidth - it effectively bumps up the mip levels that you sample. It also is slower in the texture samplers, so the only case in which it is "free" is when you have samplers and bandwidth to spare, which is less and less the case (as these are typically the real bottlenecks). And it's never "free" in terms of power.

Since when does a simple higher resolution equal supersampling?
Since the resolution is high enough that you can't resolve the individual pixels? Your eye effectively integrates/resolves the high resolution image...
 
AF definitely consumes bandwidth - it effectively bumps up the mip levels that you sample. It also is slower in the texture samplers, so the only case in which it is "free" is when you have samplers and bandwidth to spare, which is less and less the case (as these are typically the real bottlenecks). And it's never "free" in terms of power.

Nothing is completely free in 3D anyway, unless the system is vastly CPU-bound, for example. Those types of GPUs need more TMUs, and with those you get more samplers anyway. As for bandwidth overall: when a small form factor GPU usually doesn't lose more than 10% for 4x multisampling, then with the right amount of TMUs there shouldn't be any worthwhile cost for AF either, especially since AF algorithms have been adaptive for eons now.

A couple of days ago I cleaned up a friend's PC, which was a mess, and ran it through a couple of hurdles. The GT210 it carries has 4 TMUs and merely 8GB/s of bandwidth over a 64-bit bus. Even at the highest resolution, AF didn't cost more than a fraction of performance, which wasn't even noticeable (a couple of fps), quite contrary to any multisampling amount (down to ~1/3rd of the 1xAA performance with 4xAA enabled).

Further to that, since we're in a Surface tablet thread: I haven't seen any benchmarks showing how much performance the ULP GF in Tegra 3 loses with AF enabled, but it's at least capable of it. It is, on the other hand, not capable of MSAA due to the lack of tiling. The one isn't related to the other, but it's not as if Tegra 3 as an SoC has any bandwidth to spare, rather the contrary. If AF should cost more performance on it than on even the lowest desktop GPU, it would more likely be due to the lack of TMUs; that thing shouldn't have more than 2 TMUs anyway. Yes, loops cost bandwidth, but it's an indirect issue and not the primary bottleneck.

Since the resolution is high enough that you can't resolve the individual pixels? Your eye effectively integrates/resolves the high resolution image...

It's not supersampling in the strict sense, whichever way you twist it. I understood your initial point, but the above hairsplitting gets to the point where it's rather ridiculous. As if the eye could resolve pixels at 1280 on a sub-10" display unless you glue your nose onto it.
 
Even at the highest resolution, AF didn't cost more than a fraction of performance, which wasn't even noticeable (a couple of fps), quite contrary to any multisampling amount (down to ~1/3rd of the 1xAA performance with 4xAA enabled)
You're trivializing something that is data dependent and far more complex than you seem to be taking into account. For instance, higher resolution will incur *less* AF (lightening the load). It's similarly dependent on available texture resolution - it will only do anything if you're min filtering textures (i.e. texture resolution exceeds projected screen shading rate). Certainly if you have a high screen resolution and/or low texture resolutions, AF will do absolutely nothing, and thus be "free" :p

MSAA is similarly dependent on the scene, but more so on geometric frequencies. For simple geometry, there's no reason it has to cost much either, since MSAA compression should handle the bandwidth usage.

Yes, loops cost bandwidth, but it's an indirect issue and not the primary bottleneck.
How is it indirect? AF *directly* affects the MIP calculation (hence the colored tunnel tests) by using the minor axis and making more pixels use higher mip levels. Certainly the line integration and additional samples are costly too, but it's a directly related issue (more taps is expensive *because* of bandwidth).
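
To make the "minor axis drives the mip level" point concrete, here's a rough Python sketch of the LOD/tap-count selection in the spirit of the EXT_texture_filter_anisotropic spec (not any particular piece of hardware); the function name and the derivative values are just illustrative:

Code:
import math

def aniso_lod_and_taps(ddx, ddy, max_aniso):
    # ddx, ddy: texel-space derivatives of the texture coordinates along
    # screen x and y, e.g. (du/dx, dv/dx) and (du/dy, dv/dy).
    px = math.hypot(*ddx)
    py = math.hypot(*ddy)
    p_max, p_min = max(px, py), min(px, py)
    taps = min(math.ceil(p_max / max(p_min, 1e-8)), max_aniso)  # samples along the major axis
    lod = math.log2(max(p_max / taps, 1e-8))                    # minor axis drives the mip level
    return lod, taps

# A surface viewed at a grazing angle: isotropic vs 16x anisotropic.
print(aniso_lod_and_taps((8.0, 0.0), (0.0, 1.0), max_aniso=1))   # (3.0, 1)  -> blurrier mip
print(aniso_lod_and_taps((8.0, 0.0), (0.0, 1.0), max_aniso=16))  # (0.0, 8)  -> sharper mip, 8 taps

At a grazing angle the anisotropic path picks a more detailed mip level and pays for it with more taps, which is exactly the extra sampler time and bandwidth being discussed.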

It's not supersampling in the strict sense, whichever way you twist it. I understood your initial point, but the above hairsplitting gets to the point where it's rather ridiculous. As if the eye could resolve pixels at 1280 on a sub-10" display unless you glue your nose onto it.
But that's the point... my point was that you don't need to render at high-dpi resolutions to make images look good. Better sampling (more AA, better filtering, etc) is really what you want, and brute force pixel shading at the higher frequency is a poor use of hardware resources to that end.
 
Indirect under the reasoning that, by the time you need to loop for any of the AF-related calculations (due to the absence of quad TMUs in the lowest common denominators), the bandwidth requirements increase by far more.
Uhh, the TMUs (which by the way is not exactly a well-defined concept that exists in the same way on every architecture) pretty much always have to loop for AF. I see no compelling reason to massively complicate the design by distributing the integration of a single sample when there's plenty of parallelism in different samples anyways. Thus all you're saying is "if texture throughput is sufficient, texturing will not be a bottleneck"... ok? Doesn't that go for pretty much anything?

The degree of anisotropy at a sample point is not dependent on resolution (assuming constant aspect ratio).
Sure the ratio does not change, but the number of samples you take does, simply because of hitting the most detailed mip level. In mobile where texture resolutions are extremely low, this is an especially pronounced effect.

Of course the downside being that you can run demanding games only for two hours or so, and the battery is dead :)
But that's the rub in general :) Power efficiency of computing a frame is a separate metric than raw performance, and one that people are normally ill-equipped to measure meaningfully. It's also not a simple "which is better" situation because you need to reach a minimum level of performance before a solution is interesting at all. Running serial algorithms (parallelism always has power overhead) on a low power single core CPU and finishing your frame in a few minutes might be the most power-efficient, but it's hardly interesting ;)

Problem with HD4000 is it's an immediate-mode renderer, which spends a LOT of its fillrate and power budget drawing and shading invisible pixels.
Meh, while I admit tiled renderers have advantages in framebuffer bandwidth and power, I'm fairly certain that Tegra is not a tiled renderer either, so it's hardly the only way to play the game. Also I don't think modern IMG stuff sorts or does other hidden surface removal in the tile... I believe they are tiled, but pretty much just run like an IMR inside each tile (like Larrabee). There have been API changes over the years that make it infeasible to run any other way (and still be spec compliant), especially when you get into DX10 and 11.

I'm also not convinced framebuffer bandwidth is the real limitation in the long run. Certainly while we're still in the space of rendering with basic shaders, simple and extremely low resolution texturing and the like it's a win, but it's not clear that framebuffer bandwidth is a significant factor in desktop games for instance. So unless you believe that the mobile graphics world will evolve significantly differently (and so far it really has just mirrored the evolution of desktop graphics with a few omissions), I'm not sure we can necessarily pronounce IMR dead in power-constrained environments. For my part I expect the graphics pipeline portion of rendering a frame to be increasingly small as we move forward, with more and more work being done in generic compute/software.

For my part I'm actually becoming less and less interested in pure tablets, or even tablets + keyboard "covers". After having played with a few "convertibles", I'm much more leaning towards a good ULV big core (17W or lower), with a detachable tablet portion + keyboard (ideally with more batteries, transformer style), and a nice digitizer. Mobile hardware just hasn't scaled up in performance as quickly as desktop hardware seems to be scaling down in power usage.
 
Uhh, the TMUs (which by the way is not exactly a well-defined concept that exists in the same way on every architecture) pretty much always have to loop for AF.

Loop a little or loop a lot?

Meh, while I admit tiled renderers have advantages in framebuffer bandwidth and power, I'm fairly certain that Tegra is not a tiled renderer either, so it's hardly the only way to play the game.
Of course the ULP GF doesn't use any tiling; it uses large caches to solve the bandwidth problem. That doesn't work for multisampling, however, which is probably the primary reason why MSAA isn't supported but only coverage AA.

Also I don't think modern IMG stuff sorts or does other hidden surface removal in the tile... I believe they are tiled, but pretty much just run like an IMR inside each tile (like Larrabee).
I'd love to see how things would look on a theoretical LRB with very thin, diagonally placed triangles.

I'm also not convinced framebuffer bandwidth is the real limitation in the long run. Certainly while we're still in the space of rendering with basic shaders, simple and extremely low resolution texturing and the like it's a win, but it's not clear that framebuffer bandwidth is a significant factor in desktop games for instance. So unless you believe that the mobile graphics world will evolve significantly differently (and so far it really has just mirrored the evolution of desktop graphics with a few omissions), I'm not sure we can necessarily pronounce IMR dead in power-constrained environments. For my part I expect the graphics pipeline portion of rendering a frame to be increasingly small as we move forward, with more and more work being done in generic compute/software.
On a pure GPU-integration statistics level it's a 50-50 ballgame between TBDRs and IMRs in the small form factor markets right now. If you turn the perspective slightly and skip the deferred part (which is limited to IMG only) and concentrate on tile-based small form factor GPUs, they're the wide majority. And it's not that there aren't any bandwidth savings from the tiling and early-Z combinations in those IMRs, rather the contrary.

Mobile hardware just hasn't scaled up in performance as quickly as desktop hardware seems to be scaling down in power usage.
I tend to disagree. The performance and efficiency jumps for smartphone/mW platforms have been huge over the years, and the bump will only get significantly larger with the coming generation of small form factor GPUs. How long ago was it that, for example, OMAP3 GPUs had barely 1 GFLOP of arithmetic throughput, while the Adreno 320 in Qualcomm S4 smartphones should exceed the 50 GFLOPs mark?
 
This advantage hardly matters in a tablet, because almost nobody runs computing-intensive apps on such a device. It kills the battery for starters, and second, even Intel's portable chips are no damn good at computing anyway, not compared to a desktop processor.

Problem with HD4000 is it's an immediate-mode renderer, which spends a LOT of its fillrate and power budget drawing and shading invisible pixels. This is no good in a portable device - arguably no good in a desktop setting either really, but neither AMD nor Nvidia (nor Intel for that matter) seem interested in doing anything real about this issue... yet, anyway.

Time and power constraints may change that eventually.
It's an Early Z IMR. Obscured pixels matter less than you think.

A true TBDR paired with a renderer that plays to its strengths is another matter of course. Not sure how many iOS games render-to-TBDR, so to speak.
 
Meh, while I admit tiled renderers have advantages in framebuffer bandwidth and power, I'm fairly certain that Tegra is not a tiled renderer either, so it's hardly the only way to play the game. Also I don't think modern IMG stuff sorts or does other hidden surface removal in the tile... I believe they are tiled, but pretty much just run like an IMR inside each tile (like Larrabee). There have been API changes over the years that make it infeasible to run any other way (and still be spec compliant), especially when you get into DX10 and 11.
Modern IMG stuff does HSR within a tile. Promise. ;)

I am curious about the API bits you mentioned that are inconvenient for a TBDR. I had no idea about this and I'd like to know more. Care to share?

I'm also not convinced framebuffer bandwidth is the real limitation in the long run. Certainly while we're still in the space of rendering with basic shaders, simple and extremely low resolution texturing and the like it's a win, but it's not clear that framebuffer bandwidth is a significant factor in desktop games for instance. So unless you believe that the mobile graphics world will evolve significantly differently (and so far it really has just mirrored the evolution of desktop graphics with a few omissions), I'm not sure we can necessarily pronounce IMR dead in power-constrained environments. For my part I expect the graphics pipeline portion of rendering a frame to be increasingly small as we move forward, with more and more work being done in generic compute/software.
With tessellation, the cost of doing an early-Z pass can become quite a bit higher. Also, in one of your presentations, I remember you had mentioned that tiled deferred and tiled forward had pretty much the same bandwidth on an early-Z IMR. Well, tiled forward would have much better GPU bandwidth numbers on a TBDR. Not to mention all the savings on the CPU side.

When we get to ~10MB cache on die, around 14 nm or 10 nm for sure, then we *could* have the entire framebuffer on die, or at least the Z-buffer. I think that could shift the paradigm quite a bit.
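
For a rough sense of scale (assuming plain 32-bit color and 32-bit depth per pixel, no compression, purely back-of-envelope):

Code:
def mib(w, h, bytes_per_pixel, samples=1):
    # size of one full-screen buffer in MiB
    return w * h * bytes_per_pixel * samples / 2**20

for w, h in [(1280, 720), (1920, 1080), (2560, 1600)]:
    print(f"{w}x{h}: depth {mib(w, h, 4):.1f} MiB, "
          f"color+depth {mib(w, h, 8):.1f} MiB, "
          f"4xMSAA color+depth {mib(w, h, 8, 4):.1f} MiB")

So ~10MB would roughly cover a 1080p Z-buffer, but color plus depth (let alone MSAA) still wouldn't fit.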

Also, since UI is such an important job for mobile GPUs and the systems are so bandwidth constrained, that alone can be quite useful.

For my part I'm actually becoming less and less interested in pure tablets, or even tablets + keyboard "covers". After having played with a few "convertibles", I'm much more leaning towards a good ULV big core (17W or lower), with a detachable tablet portion + keyboard (ideally with more batteries, transformer style), and a nice digitizer. Mobile hardware just hasn't scaled up in performance as quickly as desktop hardware seems to be scaling down in power usage.
That was my belief too. It's good to hear confirmation from someone with experience. ;) Is there anything in particular that attracted you towards a transformer + stylus device? Anything that pushed you away from tablet + covers?
 
When we get to ~10MB cache on die, around 14 nm or 10 nm for sure, then we *could* have the entire framebuffer on die, or at least the Z-buffer. I think that could shift the paradigm quite a bit.
Except that some developers have moved to SOFTWARE tiling with deferred shading to avoid doing multiple geometry passes ... so not really.
 
Modern IMG stuff does HSR within a tile. Promise. ;)
IMRs do HSR as well through early Z. The only thing they may do extra is go through the tile's polys twice (Z-pass first), but that often has questionable value, and can be a loss in a well optimized game with object-level sorting.

Where tiled renderers have a bandwidth advantage is primarily with alpha rendering. They also have a small advantage with the efficiency of large block writes to the framebuffer. The penalty is the bandwidth cost of binning (vertices pass through the chip multiple times).
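
As a toy illustration of the alpha point (made-up numbers; this ignores caches, framebuffer compression, and the binning traffic itself):

Code:
def blend_bw_mib(width, height, layers, bytes_per_pixel=4):
    # An IMR reads and writes the color buffer in DRAM once per blended layer;
    # a tiler blends in on-chip tile memory and writes the final color out once.
    imr   = width * height * bytes_per_pixel * layers * 2
    tiler = width * height * bytes_per_pixel
    return imr / 2**20, tiler / 2**20

imr, tiler = blend_bw_mib(1280, 800, layers=8)
print(f"IMR ~{imr:.0f} MiB/frame vs tiler ~{tiler:.0f} MiB/frame (binning traffic not counted)")

The gap is exactly the per-layer read-modify-write that a tiler keeps on chip.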
That was my belief too. It's good to hear confirmation from someone with experience. ;) Is there anything in particular that attracted you towards a transformer + stylus device? Anything that pushed you away from tablet + covers?
Touch+stylus is simply superior to touch-only. A stylus is superior to a mouse in every way except in ease of switching from the keyboard (few tenths of a second extra) and cost (minimal now). Fingers are inferior to the mouse in every way except for some multitouch gestures (and, of course, the convenience of being permanently attached to the human body).

The stylus lets you run any desktop software comfortably, as it doesn't need a low density UI that also considers how the finger blocks the view of whatever is under it.
 
I'd love to see how things would look on a theoretical LRB with very thin, diagonally placed triangles.
Well no GPUs like thin/skinny triangles, but obviously tiled ones don't like any triangles that span too many tiles :)

I tend to disagree. The performance and efficiency jumps for smartphone/mW platforms have been huge over the years, and the bump will only get significantly larger with the coming generation of small form factor GPUs. How long ago was it that, for example, OMAP3 GPUs had barely 1 GFLOP of arithmetic throughput, while the Adreno 320 in Qualcomm S4 smartphones should exceed the 50 GFLOPs mark?
Well but that level of performance I'm simply not interested in. Ultimately for tablet stuff I'm looking at something more in the range of 5-15W, and I fully expect the ULV desktop parts to be in that range with similar performance to their current 17W offerings in a year or two. Nothing I've seen from mobile makes me think that their perf is going to scale high enough in that time to get near the performance of such a chip, CPU *or* GPU-wise. Love to be proven wrong, as I'd love to play with some more exotic hardware than what we currently have on PC :)

Phone CPU/GPU performance is basically uninteresting to me beyond the threshold where it can run the basic OS, e-mail, etc. I have zero desire to ever play a game on a tiny phone screen that has barely enough space to see let alone interact.

I'd be interested in a tablet that can handle Photoshop.
Samsung Series 7 Slate has been out for a while. It's awesome, and the stylus obviously works great with Photoshop. Surface Pro looks to be basically the same thing one generation newer...

I am curious about the API bits you mentioned that are inconvenient for a TBDR. I had no idea about this and I'd like to know more. Care to share?
Sure; two of the big ones are predicated rendering and the semantics of UAV access from pixel shaders. Both of them let you set up dependencies between arbitrary pixels and subsequent draw calls. For UAV accesses they are allowed to be unordered within a single draw call, but all must complete before the next draw call takes place (!). This isn't too bad for an IMR (just means you need to put a barrier after each draw call that has a UAV bound), but for a TBR this requires flushing all bins between every draw call. That is a disaster...
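
A toy model of why that's painful for a binning renderer (purely illustrative, not how any driver is actually written): any draw that writes a UAV forces every queued bin to be rasterized before the next draw can proceed, because that next draw is allowed to depend on those writes.

Code:
# Toy model: count full bin flushes forced by UAV-writing draws.
def count_bin_flushes(draws):
    queued, flushes = [], 0
    for name, writes_uav in draws:
        queued.append(name)
        if writes_uav:
            flushes += 1      # rasterize all queued bins mid-frame
            queued.clear()
    return flushes

frame = [("opaque", False), ("decals", True), ("particles", True), ("ui", False)]
print(count_bin_flushes(frame))   # 2 full flushes in a 4-draw frame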

With tessellation, the cost of doing an early-Z pass can become quite a bit higher. Also, in one of your presentations, I remember you had mentioned that tiled deferred and tiled forward had pretty much the same bandwidth on an early-Z IMR. Well, tiled forward would have much better GPU bandwidth numbers on a TBDR.
Not totally true, because on a TBR you would implement tiled deferred using framebuffer reads and discard the G-buffer (the G-buffer is effectively just per-tile scratch space), so it will be similar there too. But agreed that a naive implementation would work that way.

Where tiled renderers have a bandwidth advantage is primarily with alpha rendering.
Indeed, and that's pretty significant for the majority of stuff that you see on mobile. So much so that I expect people to start considering implementing binning in software (even though GPUs are fairly poor at that sort of data structure creation at the moment) for particles, etc. to avoid such a massive waste of bandwidth on IMR blending.
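
Something along these lines, conceptually: a CPU-side Python sketch of binning screen-space particles into tiles so each tile's list can be blended in on-chip/shared memory instead of read-modify-writing DRAM. The tile size and particle format here are just placeholders.

Code:
TILE = 16

def bin_particles(particles, width, height):
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    bins = [[] for _ in range(tiles_x * tiles_y)]
    for i, (x, y, radius) in enumerate(particles):
        # append the particle index to every tile its screen-space quad overlaps
        x0 = max(0, int(x - radius) // TILE)
        x1 = min(tiles_x - 1, int(x + radius) // TILE)
        y0 = max(0, int(y - radius) // TILE)
        y1 = min(tiles_y - 1, int(y + radius) // TILE)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[ty * tiles_x + tx].append(i)
    return bins

bins = bin_particles([(100, 50, 12), (103, 60, 4)], 1280, 720)
print(sum(len(b) for b in bins), "tile entries")   # each tile is then blended on chip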

Is there anything in particular that attracted you towards a transformer + stylus device? Anything that pushed you away from tablet + covers?
A few things actually.

1) Typing without tactile feedback is pretty awful... so much so that I don't really care to have a keyboard at all unless it's a decent tactile one. The tactile surface one might suffice, but we'll see.

2) If you're going to have a keyboard, might as well fit some more battery in the enclosure for it :)

3) Stylus is great. I don't use it exclusively or anything, but for anything more precise, or drawing or even writing to some extent, it's quite pleasant. Personally I use it for rough work, math, etc. in OneNote. OneNote even can convert my math scrawling to symbols (!!) and it works very well.

Thus I basically see no reason why I can't have it all... touch, keyboard, stylus, good performance and ability to run anything I want. The convertible aspect means that I can use just the tablet portion when it's more convenient to do that (on a bus, etc) but turn it into a laptop when I want to get some work done. After I've seen some of the convertible systems, they seem like a strict superset of what you get in other mobile devices.

I won't claim that these are the primary concerns for everyone, but it is hard to argue that the "strict" tablets have any advantages over convertibles going forward other than perhaps price, which gets muddy if you still buy a laptop in addition...
 
Well no GPUs like thin/skinny triangles, but obviously tiled ones don't like any triangles that span too many tiles :)

The point was that IMG's tiling method is unique, otherwise they wouldn't have a number of patents for it.

Well but that level of performance I'm simply not interested in. Ultimately for tablet stuff I'm looking at something more in the range of 5-15W, and I fully expect the ULV desktop parts to be in that range with similar performance to their current 17W offerings in a year or two. Nothing I've seen from mobile makes me think that their perf is going to scale high enough in that time to get near the performance of such a chip, CPU *or* GPU-wise. Love to be proven wrong, as I'd love to play with some more exotic hardware than what we currently have on PC :)
Tablets obviously have a much higher power envelope than smartphones. Apple has been doubling GPU performance on a yearly cadence since the iPad 2 (well, for the 4th generation iPad it was less than a year, but that could very well be some sort of mid-life kicker until their true next generation tablet arrives). We'll see how all the next generation small form factor GPUs perform in real time once they're released, but I wouldn't be as quick to underestimate IMG's Rogue or even the ULP GF in NV's Wayne. Scalability for the former doesn't end at 16 clusters, nor at just 1 TFLOP of fp throughput (even though FLOPs are as meaningless a metric as triangle throughput once was); it'll come down to how perf/W looks in order to make any sort of comparison in the first place, or better, how much GPU performance anyone could squeeze into that 5-15W tablet power envelope, all other factors included.
 
Except that some developers have moved to SOFTWARE tiling with deferred shading to avoid doing multiple geometry passes ... so not really.

Users of tiled forward rendering might not agree. MSAA is quite useful.

Analytic AO needs 2 geometry passes anyway.

Even if you are doing deferred shading in software, if the entire G buffer can be had in high speed memory, which seems possible with interposers, that would change sweet spots considerably.

Even if you can't fit the full G buffer on die, if you can just fit the ID buffer, that is still a big win.

ID buffer = what a hw TBDR generates to decide which tri/pixel combination to shade. Essentially Frame based Deferred Rendering in hw without any cost of binning.
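
A toy illustration of that ID-buffer idea (a fake 1D "rasterizer", nothing like IMG's actual hardware): resolve one triangle ID per pixel first, then shade each pixel exactly once from its ID, regardless of how much overdraw the draws had.

Code:
def resolve_ids(draws, num_pixels):
    # Each draw covers a pixel span [first, last] at a constant depth z.
    depth = [float("inf")] * num_pixels
    ids = [None] * num_pixels
    for tri_id, (first, last, z) in enumerate(draws):
        for p in range(first, last + 1):
            if z < depth[p]:                 # closest surface wins
                depth[p], ids[p] = z, tri_id
    return ids

def shade(ids, colors):
    # Shading runs once per pixel, driven purely by the resolved IDs.
    return [colors[i] if i is not None else (0, 0, 0) for i in ids]

ids = resolve_ids([(0, 7, 0.8), (2, 5, 0.3)], num_pixels=8)
print(shade(ids, colors=[(255, 0, 0), (0, 255, 0)]))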
 
IMRs do HSR as well through early Z. The only thing they may do extra is go through the tile's polys twice (Z-pass first), but that often has questionable value, and can be a loss in a well optimized game with object-level sorting.

Where tiled renderers have a bandwidth advantage is primarily with alpha rendering. They also have a small advantage with the efficiency of large block writes to the framebuffer. The penalty is the bandwidth cost of binning (vertices pass through the chip multiple times).

1) That analysis is correct but would certainly change in view of large amounts of tessellation. You would not want to render geometry twice.

2) Not having to shade quads is a win.

3) And there's this

http://www.google.com/patents/US20110254852

It's a new IMG patent that describes how you can use a TBDR to save both pixel and texel fill rate with shadow mapping and the like.

Basically, don't rasterize shadow maps immediately after binning is complete. Wait until the next render wants to lookup the texels. Then rasterize just one tile of the shadow map opportunistically and then immediately use it to shade the fragments.

This way both the z testing and the subsequent texture filtering can be done out of on chip buffers. Since the final render is going to have fairly large spatial coherence, it could save quite a bit of lookups.

I don't think this technique can be copied by an IMR.

4) On a handheld we have a much higher resolution, and since the screen size is small, the physical size of objects (in mm across the screen) is small. I am just thinking aloud here, but I think the triangle counts needed to hide the curvature would be lower too, which could tip the balance in a TBR's favor for this market.
 
Sure; two of the big ones are predicated rendering and the semantics of UAV access from pixel shaders. Both of them let you set up dependencies between arbitrary pixels and subsequent draw calls. For UAV accesses they are allowed to be unordered within a single draw call, but all must complete before the next draw call takes place (!). This isn't too bad for an IMR (just means you need to put a barrier after each draw call that has a UAV bound), but for a TBR this requires flushing all bins between every draw call. That is a disaster...
Predicated rendering on a TBDR is no worse than rendering on IMR.

Wouldn't using a UAV immediately after rendering to it stall an IMR as well? A TBR should do no worse than an IMR in such a case.
Not totally true, because on a TBR you would implement tiled deferred using framebuffer reads and discard the G-buffer (the G-buffer is effectively just per-tile scratch space), so it will be similar there too. But agreed that a naive implementation would work that way.
I had not thought of that. That would definitely work. Wouldn't the fb read involved be beyond current APIs though?

Indeed, and that's pretty significant for the majority of stuff that you see on mobile. So much so that I expect people to start considering implementing binning in software (even though GPUs are fairly poor at that sort of data structure creation at the moment) for particles, etc. to avoid such a massive waste of bandwidth on IMR blending.


A few things actually.

1) Typing without tactile feedback is pretty awful... so much so that I don't really care to have a keyboard at all unless it's a decent tactile one. The tactile surface one might suffice, but we'll see.

2) If you're going to have a keyboard, might as well fit some more battery in the enclosure for it :)

3) Stylus is great. I don't use it exclusively or anything, but for anything more precise, or drawing or even writing to some extent, it's quite pleasant. Personally I use it for rough work, math, etc. in OneNote. OneNote even can convert my math scrawling to symbols (!!) and it works very well.

Thus I basically see no reason why I can't have it all... touch, keyboard, stylus, good performance and ability to run anything I want. The convertible aspect means that I can use just the tablet portion when it's more convenient to do that (on a bus, etc) but turn it into a laptop when I want to get some work done. After I've seen some of the convertible systems, they seem like a strict superset of what you get in other mobile devices.

I won't claim that these are the primary concerns for everyone, but it is hard to argue that the "strict" tablets have any advantages over convertibles going forward other than perhaps price, which gets muddy if you still buy a laptop in addition...
Thanks for this. I am looking for something that has handwriting recognition and math formula -> LaTeX recognition/conversion. Is there anything out there that does that?
 
1) That analysis is correct but would certainly change in view of large amounts of tessellation.

2) Not having to shade quads is a win.
My take on deferred rendering (and tile based techniques)...

I personally try to avoid all techniques that require rendering geometry twice, because geometry transform/rasterization is the step that has by far the most fluctuating running time. Draw call count, vertex/triangle count, quad efficiency, overdraw/fillrate, etc. all change radically depending on the rendered scene. Screen pixel count is always the same (720p = 922k pixels). All algorithms you process just once for each pixel on the screen incur a constant cost. That's why I like deferred techniques (= processing after all geometry is rasterized). A constant, stable frame rate is the most important thing for games. Worst case performance is what matters in algorithm selection; average performance is meaningless (unless it's guaranteed to amortize over the frame).

I am not a particular fan of LiDR and its descendants (including Forward+). A depth pre-pass doubles the most fluctuating part of the frame rendering (draw calls / geometry processing). It is also a waste of GPU resources. All the 2000+ programmable shader "cores" of modern GPUs are basically idling while the GPU crunches through all the scene draw calls and renders them to the Z-buffer (depth testing, filling, triangle setup, etc. fixed function work). Memory bandwidth is also underutilized (just vertex fetching and only depth writes, no texture reads or color writes at all). For good GPU utilization you have to have a balanced load at every stage of your graphics rendering pipeline. A depth pre-pass isn't balanced at all.

Various displacement mapping techniques will be used more and more in future games, and these make the extra geometry pass even more expensive. DX11 supports vertex tessellation and conservative depth output. Tessellation will promote usage of vertex based displacement mapping techniques, and conservative depth is very useful for pixel based displacement mapping techniques (allows early-z and hi-z to be enabled with pixel shader programmable depth output). A side note: The programmable depth output and pixel discard isn't a good thing for TBDRs (making pixel shader based displacement quite inefficient). Vertex tessellation also adds some extra burden (how bad that is remains to be seen in the future).

Brute force deferred rendering with fat g-buffers isn't the best choice in the long run either. Basically all source textures are compressed (DXT variants; DX11 even adds an HDR format). A forward renderer simply reads each DXT texture once per pixel. A deferred renderer reads the compressed texture, outputs it to an uncompressed render target and later reads the uncompressed texture from the render target. DXT5 is 1 byte per pixel, while uncompressed (8888 or 11f-11f-10f) is 4 bytes per pixel. Forward reads 1 byte per texture layer used, deferred reads 5 bytes and writes 4 bytes (9x more BW used). This isn't a big problem yet, because most games don't have more than two textures per object (8 channels for example can fit: rgb color, xy normal, roughness, specular, opacity). But in the future the materials will become more complex and the g-buffers will become fatter (as we need to store all the texture data to the g-buffer for later stages).
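
The arithmetic behind that 9x figure, spelled out (assuming DXT5 at 1 byte/pixel and a 4 byte/pixel uncompressed G-buffer channel, as in the paragraph above, and ignoring caches):

Code:
# Per pixel, per texture layer:
dxt_read      = 1   # forward: read the compressed texture once
gbuffer_write = 4   # deferred: write the decompressed value to the G-buffer...
gbuffer_read  = 4   # ...and read it back in the shading pass
forward  = dxt_read
deferred = dxt_read + gbuffer_write + gbuffer_read
print(f"forward: {forward} B, deferred: {deferred} B ({deferred // forward}x)")  # 9x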

I personally like to keep the geometry rendering pass as cheap as possible. Rendering to three or four render targets and reading three or four textures isn't cheap. Overdraw gets very expensive and quad efficiency and texture cache efficiency play a big (unpredictable) role in the performance. It's better just to store the (interpolated) texture coordinates to the g-buffer. This way you get a very fast pixel shader (with no texture cache stalls), quad efficiency and/or overdraw doesn't matter much, full fill rate (no MRTs), low BW requirement, etc. All the heavy lifting is done later, once a pixel, in a compute shader. Compressed textures are read only once, and no uncompressed texture data is written/read from the g-buffers. This kind of system minimizes the fluctuating cost from geometry processing/rasterization and it compares very well to a TBDR in scenes that have high overdraw. IMR still has more overdraw than TBDR, but the overdraw is dirt cheap. (**)

What matters in the future isn't the geometry rasterization performance. Geometry rasterization is only around 10%-20% of the whole frame rendering cost if you use advanced deferred rendering techniques. TBDR/IMR aren't that different if 80%+ of frame rendering time is spent running compute shaders.

(**) The biggest downside of the technique described above is that the "texture coordinate" (= texture address) must contain enough data to distinguish all the texture pixels that might be visible in the frame (and bilinear combinations of those). Basically, with current architectures this means you need a big texture atlas, and you need to store all your textures there. This is not a viable strategy for games that have a gigabyte worth of textures loaded in memory at once. Virtual texturing, however, only tries to keep in memory the texture data that is required to render the current frame. The whole data set fits into a single 8192x8192 atlas (virtual texture page cache). With this kind of single atlas, the "texture coordinate" problem becomes trivial: just store a 32 bit (normalized int 16-16) texture coordinate to the g-buffer.
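
For what it's worth, the pack/unpack of that normalized 16-16 coordinate is trivial; a quick sketch (8192x8192 atlas assumed, as above) shows the quantization error stays well below a texel:

Code:
def pack_uv(u, v):                       # u, v in [0, 1]
    # round each coordinate to 16-bit normalized int and pack into 32 bits
    return (int(u * 65535.0 + 0.5) << 16) | int(v * 65535.0 + 0.5)

def unpack_uv(packed):
    return (packed >> 16) / 65535.0, (packed & 0xFFFF) / 65535.0

u, v = 1234.5 / 8192.0, 7000.25 / 8192.0          # some texel position in the atlas
p = pack_uv(u, v)
print(hex(p), [round(c * 8192.0, 2) for c in unpack_uv(p)])   # off by well under a texel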
Basically, don't rasterize shadow maps immediately after binning is complete. Wait until the next render wants to lookup the texels. Then rasterize just one tile of the shadow map opportunistically and then immediately use it to shade the fragments.

[...]

I don't think this technique can be copied by an IMR.
This technique is very similar to virtual shadow mapping. Virtual shadow mapping works pretty much like virtual texturing, except that you use projected shadow map texture coordinates instead of the mesh texture coordinates. By using the depth buffer and the shadow map matrix you can calculate all the visible pages. Each page (frustum) is rendered separately (we can of course combine neighboring pages into single frustums to speed up the processing). Shadow map fetching uses the same indirection texture approach as virtual texturing (cuckoo hashing is also a pretty good fit for the GPU). The best thing about this technique is that it always renders shadow maps at the correct 1:1 screen resolution. Oversampling/undersampling is much reduced compared to techniques such as cascaded shadow mapping.

Page visibility determination (from the depth buffer) of course takes some extra time, but you can combine it with some other full screen pass to minimize its impact. Rendering several smaller shadow frustums (pages) of course increases the draw call count (and vertex overhead), but techniques such as merge-instancing can basically eliminate that problem (single draw call per page / subobject culling for reduced vertex overhead). With some DrawInstancedIndirect/DispatchIndirect trickery that's doable, but dynamic kernel dispatching by other kernels would make things much better (GK110 will be the first GPU to support this).
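
A rough numpy sketch of that page-visibility step (the matrices, resolutions and the 128-texel page size are made-up placeholders): reconstruct each screen pixel from the depth buffer, project it with the shadow map matrix, and mark the page it lands in; only the marked pages ever need to be rasterized.

Code:
import numpy as np

PAGE = 128          # shadow-map page size in texels (placeholder)
SHADOW_RES = 4096   # virtual shadow map resolution (placeholder)

def visible_pages(depth, inv_view_proj, shadow_matrix):
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # screen pixel + depth -> clip space -> world space
    ndc = np.stack([(xs + 0.5) / w * 2 - 1,
                    1 - (ys + 0.5) / h * 2,
                    depth,
                    np.ones_like(depth)], axis=-1)
    world = ndc @ inv_view_proj.T
    world = world / world[..., 3:4]
    # world space -> shadow map UV in [0, 1]
    shadow = world @ shadow_matrix.T
    uv = shadow[..., :2] / shadow[..., 3:4] * 0.5 + 0.5
    # mark the page each pixel falls into
    pages = np.zeros((SHADOW_RES // PAGE, SHADOW_RES // PAGE), dtype=bool)
    texels = np.clip((uv * SHADOW_RES).astype(int), 0, SHADOW_RES - 1)
    pages[texels[..., 1] // PAGE, texels[..., 0] // PAGE] = True
    return pages

# Toy call with identity matrices and a flat depth buffer, just to show the shape:
print(visible_pages(np.full((90, 160), 0.5), np.eye(4), np.eye(4)).sum(), "pages visible")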
 
A side note: The programmable depth output and pixel discard isn't a good thing for TBDRs
If the depth/visibility was finalised near the start of the pixel shader, then with the right extra hardware you wouldn't have to compute the rest of the program if you knew some subsequent object would overwrite that pixel (which a TBDR can know about unlike an IMR). That would help significantly for some uses, although probably not for others.

It's better just to store the (interpolated) texture coordinates to the g-buffer. This way you get a very fast pixel shader (with no texture cache stalls), quad efficiency and/or overdraw doesn't matter much, full fill rate (no MRTs), low BW requirement, etc. All the heavy lifting is done later, once a pixel, in a compute shader.
Maybe I'm missing something, but how would the ddx/ddy calculation work in that compute shader? How do you know the neighboring pixel is part of the same object and what happens if the neighboring pixels are all of different objects? The only safe way I can think of implementing this is to also store ddx/ddy, and then you aren't really saving much bandwidth and your texture instructions go at 1/4th the normal speed because they work on individual pixels instead of quads...

IMO, the most bandwidth efficient way to do deferred rendering is to do it on a TB(D)R with the right extensions (programmable blending, being able to use tile memory as scratch not being output, etc.). Even in a worst-case scenario where you don't benefit from the deferred rendering, you're still not really using more memory bandwidth than a forward renderer. I'd say that's pretty cool! :)

This technique is very similar to virtual shadow mapping. Virtual shadow mapping works pretty much like virtual texturing, except that you use projected shadow map texture coordinates instead of the mesh texture coordinates. By using the depth buffer and the shadow map matrix you can calculate all the visible pages. Each page (frustum) is rendered separately (we can of course combine neighboring pages into single frustums to speed up the processing). Shadow map fetching uses the same indirection texture approach as virtual texturing (cuckoo hashing is also a pretty good fit for the GPU). The best thing about this technique is that it always renders shadow maps at the correct 1:1 screen resolution. Oversampling/undersampling is much reduced compared to techniques such as cascaded shadow mapping.
Agreed there are some similarities. However the architecture described in the patent would have a lower performance overhead and save most of the read bandwidth. And the bandwidth saving also applies to many post-processing and/or downsampling passes, in some cases it could even save the write bandwidth. Obviously all at the cost of some extra hardware...

TBDR/IMR aren't that different if 80%+ of frame rendering time is spent running compute shaders.
Agreed. And whether that's what the workload looks like or not, shader core efficiency is key.
 
IMO, the most bandwidth efficient way to do deferred rendering is to do it on a TB(D)R with the right extensions (programmable blending, being able to use tile memory as scratch not being output, etc.). Even in a worst-case scenario where you don't benefit from the deferred rendering, you're still not really using more memory bandwidth than a forward renderer. I'd say that's pretty cool!
Not cool if you don't have enough scratch to handle deferred lighting at full resolution (I'm talking about Uncharted: Golden Abyss), not to mention MSAA scratch requirements. I don't believe in a bright future for deferred rendering on a TB(D)R because of developers' complaints about the 360's EDRAM limitations for deferred shading, and after the sub-res Uncharted: Golden Abyss fail.
 
I don't believe in a bright future for deferred rendering on a TB(D)R because of developers' complaints about the 360's EDRAM limitations for deferred shading, and after the sub-res Uncharted: Golden Abyss fail.

I'm afraid I've completely lost connection with the above.
 