Deferred rendering, virtual texturing, IMR vs TB(D)R, and more...

Not cool if you don't have enough scratch to handle deferred lighting at full resolution (I'm thinking of Uncharted: Golden Abyss), never mind the MSAA scratch requirements. I don't believe in a bright future for deferred rendering on a TB(D)R, given developers' complaints about the 360's EDRAM limitations for deferred shading and after the sub-resolution Uncharted: Golden Abyss failure.
I don't think it makes sense to look at specific examples on specific hardware and draw general conclusions from them. Anyway Uncharted: Golden Abyss kept the light accumulation buffer on-chip but the G-Buffers were still off-chip because they didn't have enough space on SGX. While still limited, you'd obviously expect next-generation hardware to have more space...

As for MSAA, it will reduce the tile size rather than reduce tile memory per pixel. Otherwise, since SGX only has 64bpp of tile storage, you wouldn't even have enough space to do 4xMSAA for 32bpp framebuffers ;)
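Rough numbers, just to illustrate the trade-off (the 32 KiB tile budget and 64 bits per sample below are assumptions for the sketch, not real SGX figures):

```cpp
#include <cmath>
#include <cstdio>
#include <initializer_list>

// Given a fixed on-chip tile budget, MSAA shrinks the tile footprint in
// pixels rather than the per-sample storage. Figures are illustrative.
int main() {
    const int tileBytes      = 32 * 1024; // hypothetical on-chip tile memory
    const int bytesPerSample = 8;         // 64 bits per sample (e.g. 32bpp colour + 32bpp depth)

    for (int samples : {1, 2, 4}) {
        int pixelsPerTile = tileBytes / (bytesPerSample * samples);
        int side = static_cast<int>(std::sqrt(static_cast<double>(pixelsPerTile)));
        std::printf("%dx MSAA: %d pixels per tile (~%dx%d)\n",
                    samples, pixelsPerTile, side, side);
    }
}
```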
 
Predicated rendering on a TBDR is no worse than rendering on IMR.
Not true - what if I set up a predicate and render a pixel at 0,0 and then predicate a draw call that renders a triangle on the opposite corner of the screen? I've now set up an arbitrary spatial dependency, which defeats any ability to bin/re-sort the incoming geometry.

Wouldn't using a UAV immediately after rendering to it stall an IMR as well? A TBR should do no worse than an IMR in such case.
Well "stall" in that the draw call has to finish through the pipeline before the next one starts, yes. But there's a large order of magnitude difference... IMRs apply the draw calls "wide" across the GPU for the most part and only pipeline multiple together when necessary. TBRs bin geometry spatially first, and really want to bin *all* geometry for a given render target before starting to shade tiles. This is simply not possible with the above UAV/predication semantics (really the same issue in both cases) - every draw call must flush *all* tiles before starting binning the next. That's extremely bad.

I had not thought of that. That would definitely work. Wouldn't the fb read involved be beyond current APIs though?
Tegra - while not tiled per se - has a framebuffer read extension and IIRC Apple actually just added one to iOS as well. That plus "discard" should be all you really need.

Thanks for this. I am looking for something that has handwriting recognition and math formula -> LaTeX recognition/conversion. Is there anything out there that does that?
Right, so OneNote will get you the handwriting -> math symbols part, you just need to convert the resulting symbols to LaTeX I guess.

Users of tiled forward rendering might not agree. MSAA is quite useful.
... and totally usable with tiled deferred. In fact with sufficient MSAA compression hardware and the ability for samplers to read these compressed buffers (not always ubiquitous right now, I will admit), the overhead is actually not dissimilar to forward rendering. And if we exposed some bits of the compression format to user space it could be even cheaper, as I've discussed in my talks in the past.

I personally try to avoid all techniques that require rendering geometry twice, because geometry transform/rasterization is the step that has by far the most fluctuating running time.
I completely agree, and I extend that to not liking to run expensive pixels shaders for the same reason - it's too variable how long they take due to hardware scheduling decisions and features. Spikes are bad... I'd rather predictably run at the "worst case" speed all the time than randomly spiking up and down in performance.

But in the future the materials will become more complex and the g-buffers will become fatter (as we need to store all the texture data to the g-buffer for later stages).
Right, but as you note, there's no reason to store that texture data to the g-buffer ultimately, only interpolants (and even then you could store the sources and barycentrics if you had a ton of them, but that's not exactly common). The only reason people tend to store the texture data itself these days is because typically it's smaller than uv + gradients (6 floats, although you should be able to get away with 5, or obviously 3 for isotropic sampling). If you have >2-3 textures using the same coordinates though, it becomes cheaper to defer the texture lookup too. All these things are fairly straightforward choices though - there's no need to make a big deal out of it; just test which is faster for a specific case.
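A quick sketch of that break-even point (the byte counts are assumed typical values, not measurements from any engine):

```cpp
#include <cstdio>

// Rough G-buffer cost comparison: storing sampled texel data directly vs.
// deferring the lookups and storing only uv + gradients. Sizes are assumed
// typical values, not taken from any particular engine.
int main() {
    const int bytesPerSampledTexel  = 4;      // e.g. one RGBA8 result per texture
    const int bytesUvGradientsAniso = 6 * 4;  // uv + ddx/ddy pairs (6 floats)
    const int bytesUvGradientsIso   = 3 * 4;  // uv + single lod (isotropic)

    for (int textures = 1; textures <= 8; ++textures) {
        int storeResults = textures * bytesPerSampledTexel;
        std::printf("%d textures sharing uvs: store results = %2d B, "
                    "defer (aniso) = %2d B, defer (iso) = %2d B\n",
                    textures, storeResults,
                    bytesUvGradientsAniso, bytesUvGradientsIso);
    }
}
```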

Basically with current architectures this means you need a big texture atlas, and you need to store all your textures there.
Sure, but ultimately this will be solved by bindless textures/resources referenced in constant buffers or similar. i.e. you just store a pointer/offset to the material data in the G-buffer and look up into it in the deferred pass. There may be an interim time when people use virtual texturing for this, but in the long run there's no need for atlases/continuous address spaces for this kind of work. This will be far more efficient than redundantly dumping all this data into the G-buffer itself (much of it is constant), and basically reduces the role of the rendering pipeline to just rasterization (and perhaps displacement mapping) and some basic attribute interpolation.
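A minimal sketch of what I mean, with made-up names and packing; the point is just that the G-buffer carries a handle rather than the material data itself:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical layout: the G-buffer carries a material handle instead of the
// material data itself, and the deferred pass dereferences it. On real
// hardware the table would live in a constant/structured buffer reached via
// bindless handles; this is just the CPU-side shape of the idea.
struct MaterialRecord {
    float    baseColor[3];
    float    roughness;
    uint32_t albedoTexture;   // bindless texture handle (assumed)
    uint32_t normalTexture;
};

struct GBufferPixel {
    float    normal[2];       // packed normal
    float    uv[2];           // interpolated texture coordinate
    uint32_t materialIndex;   // offset into the material table
};

// Deferred pass: look the material up per pixel instead of storing it per pixel.
MaterialRecord fetchMaterial(const std::vector<MaterialRecord>& table,
                             const GBufferPixel& px) {
    return table[px.materialIndex];
}
```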

Anyways, this is an interesting conversation but we're pretty off-topic for this thread... might be worth someone splitting this?
 
Maybe I'm missing something, but how would the ddx/ddy calculation work in that compute shader? How do you know the neighboring pixel is part of the same object and what happens if the neighboring pixels are all of different objects? The only safe way I can think of implementing this is to also store ddx/ddy, and then you aren't really saving much bandwidth and your texture instructions go at 1/4th the normal speed because they work on individual pixels instead of quads...
Didn't explain it, because I tried to keep my post as short as possible (this is off topic discussion after all)... but I failed miserably :)

Let's first go through the easy case of bilinear filtering from a virtual texture. In this case, the texture coordinate already implicitly contains the mip level, as the indirection texture lookup transforms the texture coordinate to the correct 128x128 pixel page depending on the mip level (gradients). Basically the x,y texture coordinate pair contains all the info you need. Bilinear filtering isn't exactly a hot technique itself, but if you use the anisotropic mip hardware to calculate the lod level based on gradients (min of x,y clamped to max+1 instead of max) you will get higher detail on slopes. I call this "bilinear anisotropic", and we used it in Trials Evolution (hacks like this are required for 60 fps on current-gen consoles).
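In rough pseudocode (written as C++ for clarity; this is one plausible reading of the min/max clamp described above, so the exact details may differ):

```cpp
#include <algorithm>
#include <cmath>

// One plausible reading of the "bilinear anisotropic" lod selection described
// above: take the lod implied by the smaller gradient (sharper on slopes),
// but don't let it get more than one level sharper than the standard
// max-gradient lod. The exact clamp is an assumption for this sketch.
float bilinearAnisoLod(float ddxLen, float ddyLen, float textureSize) {
    float lodMax = std::log2(std::max(ddxLen, ddyLen) * textureSize); // standard isotropic lod
    float lodMin = std::log2(std::min(ddxLen, ddyLen) * textureSize); // sharpest axis
    return std::max(lodMin, lodMax - 1.0f); // allow at most one level of extra detail
}
```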

Trilinear isn't much harder. The virtual texture indirection lookup basically truncates the mip value to floor(mip). That data is implicitly stored in the x,y texture coordinate pair for free. The only extra data you need to store is the frac(mip) portion. A 4-bit normalized [0,1] integer is enough for this purpose. We are blending between two adjacent mip levels after all, not two completely different images (**), so using 8 bits or even more (floats :cry:) is pure overkill for this purpose. However, if you have a traditional texture atlas (and not a virtual textured atlas), the texture coordinate doesn't implicitly contain any extra information about the mip level, and you need extra bits to store the mip level.
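In code, the packing is trivial (a sketch, with an assumed 4-bit encoding):

```cpp
#include <cmath>
#include <cstdint>

// Sketch of packing only the fractional mip level into 4 bits, as described
// above: the integer part is already implied by which page the indirection
// lookup landed on, so all the trilinear blend needs is frac(mip) in [0,1).
uint8_t packMipFraction(float mip) {
    float f = mip - std::floor(mip);                     // frac(mip)
    return static_cast<uint8_t>(f * 15.0f + 0.5f) & 0xF; // 4-bit normalized value
}

float unpackMipFraction(uint8_t packed) {
    return static_cast<float>(packed & 0xF) / 15.0f;     // blend weight between the two mips
}
```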

Anisotropic filtering with virtual texturing is still a topic that hasn't been researched a lot ("Carmack's Hack" being the state of the art for performance :) ). Anisotropic filtering can be approximated just by using the trilinear version above, and adjusting the mip calculation based on the gradients (just like we did for our "bilinear anisotropic" for Trials Evolution). This doesn't require any extra g-buffer storage, but can sometimes result in slight oversampling (FXAA in Trials Evolution did take care of that). Of course this isn't a perfect solution, and we absolutely need to do better in the future. Good texture filtering quality is as important as good antialiasing quality.

If you want to do proper anisotropic filtering, you obviously need to store both gradients. Virtual texture indirection lookup points you to a location that stores the most detailed data you need (minimum of the gradients). Both gradients are increments to this value. The smaller gradient increment is always in range of [0,1] (when measured in mip levels), the larger can be more than that (but it's always positive). Again 4 bits should be enough for the first one, and if we share a 16 bit value for them, we have 12 bits remaining. That's more than enough for the second gradient bias.
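A sketch of that 4+12 bit split (the fixed-point scale chosen for the larger increment is an assumption):

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the 16-bit split described above: the smaller gradient increment
// (always in [0,1] mip levels relative to the page the indirection lookup
// selected) goes in 4 bits, the larger increment gets the remaining 12 bits.
uint16_t packGradientBias(float smallBias, float largeBias) {
    uint16_t s = static_cast<uint16_t>(std::clamp(smallBias, 0.0f, 1.0f) * 15.0f + 0.5f);
    // assume the larger increment never exceeds 16 mip levels (illustrative range)
    uint16_t l = static_cast<uint16_t>(std::clamp(largeBias / 16.0f, 0.0f, 1.0f) * 4095.0f + 0.5f);
    return static_cast<uint16_t>((l << 4) | s);
}

void unpackGradientBias(uint16_t packed, float& smallBias, float& largeBias) {
    smallBias = static_cast<float>(packed & 0xF) / 15.0f;
    largeBias = (static_cast<float>(packed >> 4) / 4095.0f) * 16.0f;
}
```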

Another way to approach this problem is to prefilter 128x128 tiles as 128x64 and 64x128 anisotropic tiles. Now we also use the gradient values to adjust the indirection lookup before storing the texture coordinate to the g-buffer. We can store these tiles adjacent to the original 128x128 (basically splitting the cache into 256x128 tiles) if we do not want to increase the indirection texture size (as the coordinate bias to the anisotropic pages is easy to calculate). Alternatively we can use a hash instead of the indirection texture (a cuckoo hash is guaranteed O(1), can be easily coded with no branching/flow control, has no dependency chains and benefits nicely from GPU latency hiding). As an extra bonus, this technique saves bandwidth compared to standard anisotropic filtering, but it doubles the virtual texture cache atlas size.
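For reference, a cuckoo lookup is really just two fetches and a select, roughly like this sketch (hash constants and the entry layout are placeholders):

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of a branch-free cuckoo-hash lookup: two hash functions, two
// tables, each key lives in exactly one of its two candidate slots, so a
// lookup is two loads and a select. Constants and layout are arbitrary.
struct PageEntry {
    uint64_t key;    // e.g. packed virtual page coordinate
    uint32_t value;  // e.g. physical page location in the cache atlas
};

uint32_t hashA(uint64_t k, uint32_t size) { return static_cast<uint32_t>((k * 0x9E3779B97F4A7C15ull) >> 40) % size; }
uint32_t hashB(uint64_t k, uint32_t size) { return static_cast<uint32_t>((k * 0xC2B2AE3D27D4EB4Full) >> 40) % size; }

// Returns the value for `key`, or `miss` if it is in neither slot.
uint32_t cuckooLookup(const std::vector<PageEntry>& t0,
                      const std::vector<PageEntry>& t1,
                      uint64_t key, uint32_t miss) {
    const PageEntry& a = t0[hashA(key, static_cast<uint32_t>(t0.size()))];
    const PageEntry& b = t1[hashB(key, static_cast<uint32_t>(t1.size()))];
    // Both candidate slots are fetched unconditionally; the "branches" below
    // compile to selects, which is what makes this GPU-friendly.
    uint32_t result = miss;
    result = (a.key == key) ? a.value : result;
    result = (b.key == key) ? b.value : result;
    return result;
}
```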

The last, and the most ambitious, way is to store no texture coordinate data at all. Use rasterization only for a depth pre-pass. The depth value is translated to a 3D coordinate in the lighting shader (all deferred renderers do this already). If you have unique mapping in the virtual texture (***), you can do a hash lookup using this world coordinate to get the virtual texture coordinate. The naive approach would be to add all virtual texture pixels to a hash based on their 3D world coordinates (and update the hash whenever a page is loaded). A better way would be to have a sparse multilayer volume texture where the texture coordinates could be queried (this is basically a hash as well, but the hash nodes are 8x8x8 volumes instead of single pixels, and it would be easy to query if the GPU has paged virtual memory, AMD's PRT OpenGL extension for example). It would contain only the surfaces visible on the screen (or in the virtual texture cache, because that's a superset of the screen pixels). This kind of structure wouldn't need to be super high resolution, because texture coordinates are linearly interpolated along polygons (linear filtering from the volume texture would work just fine).
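A sketch of how the reconstructed world position could be turned into a lookup key for such a structure (cell size and bit allocation are illustrative assumptions):

```cpp
#include <cmath>
#include <cstdint>

// Sketch: quantize a reconstructed world position to the resolution of the
// sparse (8x8x8) volume bricks described above and pack it into a 64-bit key
// for the hash lookup. Cell size and the 21-bit-per-axis split are assumptions.
uint64_t worldPositionKey(float x, float y, float z, float cellSize) {
    auto q = [cellSize](float v) {
        // offset so negative coordinates stay positive; 21 bits per axis
        return static_cast<uint64_t>(static_cast<int64_t>(std::floor(v / cellSize)) + (1 << 20)) & 0x1FFFFF;
    };
    return (q(x) << 42) | (q(y) << 21) | q(z);
}
```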

(**) When using trilinear filtering, the virtual texture atlas stores a single extra mip level per page. This allows you to use hardware trilinear filtering to blend between the current level and the one below it. It increases virtual texture atlas memory consumption by 25%. That's usually not a big deal.

(***) You would want to have unique mapping for other purposes as well. It allows you to have unique decals on all your objects in the world, and it allows you to precalculate object-based texture transformations into the virtual texture cache (for example colorization). Unique virtual mapping shouldn't be confused with unique physical mapping. You don't need to store all versions of the pages on the hard drive (like Rage does); you can burn the decals (and colorizations, etc.) into the pages during page loading.
Sure, but ultimately this will be solved by bindless textures/resources referenced in constant buffers or similar. i.e. you just store a pointer/offset to the material data in the G-buffer and look up into it in the deferred pass. There may be an interim time when people use virtual texturing for this, but in the long run there's no need for atlases/continuous address spaces for this kind of work. This will be far more efficient than redundantly dumping all this data into the G-buffer itself (much of it is constant), and basically reduces the role of the rendering pipeline to just rasterization (and perhaps displacement mapping) and some basic attribute interpolation.
Absolutely. Fully featured GPU virtual memory and data addressing is the future. AMD is touting it with HSA, Nvidia is touting it with Kepler, and even ARM's Mali-T604 papers talk about GPU virtual memory. AMD's PRT OpenGL extensions are the first developer-controllable virtual memory API for GPUs. It's currently only available for texturing (and has pretty big 64 kB pages), but it's a very good first step. I hope we will soon have a unified 64-bit address space between CPU and GPU with same-sized pages (preferably small 4 kB ones) and total developer control over handling page faults and virtual mappings. That would allow us to do all kinds of crazy deferred rendering implementations :)
Anyways, this is an interesting conversation but we're pretty off-topic for this thread... might be worth someone splitting this?
Agreed. This is indeed an interesting topic, and unfortunately something that's not been discussed enough.

--> Please someone move this discussion to its own thread. Thank you! :)
 
Users of tiled forward rendering might not agree. MSAA is quite useful.
I'm not saying there is no trade off, I'm saying that however you dice it the bandwidth needed either for an early Z pass or for some form of through framebuffer deferred shading is always going to stay significant. The raison d'etre for hardware tilers is not going away (although if the geometry load keeps increasing at some point they will need object level binning to compete).
 
I hope we will soon have a unified 64-bit address space between CPU and GPU with same-sized pages (preferably small 4 kB ones) and total developer control over handling page faults and virtual mappings. That would allow us to do all kinds of crazy deferred rendering implementations :)
Amen to that, I've been asking for that feature for a few years already, empowering developers is the right way to go.
That, along with a standard texture layout, would allow for immense worlds streamed into memory, requiring less memory but likely more bandwidth.
(That shouldn't be too much of a problem with SSDs becoming mainstream, but with memory already being bandwidth limited today I'm not too sure how that would go. Didn't do any estimate of the required bandwidth recently either ;p)
 
Amen to that, I've been asking for that feature for a few years already, empowering developers is the right way to go.
That, along with a standard texture layout, would allow for immense worlds streamed into memory, requiring less memory but likely more bandwidth.
(That shouldn't be too much of a problem with SSDs becoming mainstream, but with memory already being bandwidth limited today I'm not too sure how that would go. Didn't do any estimate of the required bandwidth recently either ;p)


It's the direction Nvidia and AMD have taken; it was already described in August 2011 by AMD when they presented GCN. Now let's hope it will be fully implemented and used soon.

Normally we should see completely unified addressing between the CPU and GPU on AMD's next HD 8000 series (well, as planned in the roadmap). GCN today just has access to CPU memory; it's not unified.

Just found an old roadmap.
 
Another way to approach this problem is to prefilter 128x128 tiles as 128x64 and 64x128 anisotropic tiles.
Right, "rip-map" style. It's worth noting that if you're willing to go the virtual texture route, you can also cache textures warped at the relevant angles and ratios and "lazily" create them as different geometry and textures come into view.

Fully featured GPU virtual memory and data addressing is the future.
Sure, this is on everyone's roadmap but like I said, it isn't even required for this purpose. There's no reason why all of your texture data has to be in a continuous virtual address space - that's more a hack we do today because GPUs have static binding tables. Bindless completely eliminates that.

Virtual memory you only really need when you want sparse (or redundant) mappings to storage. That's certainly useful too, but unnecessary for the simple situation of accessing arbitrary textures without binding.

I hope we will soon have a unified 64-bit address space between CPU and GPU with same-sized pages (preferably small 4 kB ones) and total developer control over handling page faults and virtual mappings.
I'm skeptical that 4 kB will be practical for GPUs... that's a lot of thrash on the TLBs for a GPU. If you want them coherent with the CPU ones, that's even more overhead. 64 kB is already much more usable than 2MB, so count your blessings ;) We'll see though, I just wouldn't count on it.
 
Virtual memory you only really need when you want sparse (or redundant) mappings to storage. That's certainly useful too, but unnecessary for the simple situation of accessing arbitrary textures without binding.
Of course it would be also nice to be able to read texture/buffer descriptors directly from memory (from a shader-calculated address). But aren't GPUs reading the whole SIMD (warp/wavefront) from the same texture/buffer? So 32 or 64 threads need to access the same texture (and thus you are required to bin processing according to resource access patterns). A simple pointer-per-pixel approach doesn't have to worry about SIMD lanes.
I'm skeptical that 4 kB will be practical for GPUs... that's a lot of thrash on the TLBs for a GPU. If you want them coherent with the CPU ones, that's even more overhead. 64 kB is already much more usable than 2MB, so count your blessings ;) We'll see though, I just wouldn't count on it.
The smaller the better for sparse structures (and for exploiting the huge 64 bit address space for example for hashing).
 
Various displacement mapping techniques will be used more and more in future games, and these make the extra geometry pass even more expensive. DX11 supports vertex tessellation and conservative depth output. Tessellation will promote usage of vertex based displacement mapping techniques, and conservative depth is very useful for pixel based displacement mapping techniques (allows early-z and hi-z to be enabled with pixel shader programmable depth output).
How does pixel shader programmable depth NOT disable Early Z and/or HiZ? You can't know the depth unless you run the shader, and hence you can't do the depth test before running the shader.
 
It would be so nice if hardware tilers were given the necessary information to do this as well ... all this time and the APIs are still brain dead.

I'd prefer to simply have a HiZ unit that reduces the geometry stream out in a TBDR.
 
Not true - what if I set up a predicate and render a pixel at 0,0 and then predicate a draw call that renders a triangle on the opposite corner of the screen? I've now set up an arbitrary spatial dependency, which defeats any ability to bin/re-sort the incoming geometry.
I didn't get the "need to re-bin the incoming geometry" bit at all.

http://www.google.com/patents/US20090058848

Render the predicate (say a cube, even if it is somewhere else) and then the object. First the cube is binned and a marker for the object is stored in the geometry stream.
While rasterizing, count the fragments that survive and if non-zero, interrupt the rasterization to bin the object. Then proceed as per usual.

Did you have some other implementation in mind?

Well "stall" in that the draw call has to finish through the pipeline before the next one starts, yes. But there's a large order of magnitude difference... IMRs apply the draw calls "wide" across the GPU for the most part and only pipeline multiple together when necessary. TBRs bin geometry spatially first, and really want to bin *all* geometry for a given render target before starting to shade tiles. This is simply not possible with the above UAV/predication semantics (really the same issue in both cases) - every draw call must flush *all* tiles before starting binning the next. That's extremely bad.
I don't see why it is that bad. You lose deferral somewhat but a TBR can still achieve close to full utilization.


... and totally usable with tiled deferred. In fact with sufficient MSAA compression hardware and the ability for samplers to read these compressed buffers (not always ubiquitous right now, I will admit), the overhead is actually not dissimilar to forward rendering. And if we exposed some bits of the compression format to user space it could be even cheaper, as I've discussed in my talks in the past.
You mean use MSAA textures for G buffer?
 
I'm not saying there is no trade off, I'm saying that however you dice it the bandwidth needed either for an early Z pass or for some form of through framebuffer deferred shading is always going to stay significant. The raison d'etre for hardware tilers is not going away (although if the geometry load keeps increasing at some point they will need object level binning to compete).

As I have said elsewhere, a HiZ unit before geometry stream out might be simpler solution to that.
 
Sure, this is on everyone's roadmap but like I said, it isn't even required for this purpose. There's no reason why all of your texture data has to be in a continuous virtual address space - that's more a hack we do today because GPUs have static binding tables. Bindless completely eliminates that.
I would advocate doing away with binding tables for every GPU resource, not just textures. For stuff like vertex/index/uniform/texture buffers etc.
 
Of course it would be also nice to be able to read texture/buffer descriptors directly from memory (from a shader-calculated address). But aren't GPUs reading the whole SIMD (warp/wavefront) from the same texture/buffer?
"It depends", but ultimately I expect you will be able to provide a handle per-lane. Otherwise the SPMD model is kind of broken (and I cringe whenever people get lazy and try to define some operation as requiring to be "uniform" across lanes, but yet still try and use abstract SIMD widths). Of course it may be slower if there is handle divergence, but that is already the case with general purpose memory accesses (scatter/gather), control flow, function calls, etc.

The smaller the better for sparse structures (and for exploiting the huge 64 bit address space for example for hashing).
Certainly, but it's naive to treat it like a magical "free" indirection just because it's in hardware. I'm just saying that the cost of 4kb pages may very well be too high to be reasonable. Also note that - just like on the CPU - you're likely to only get a subset of those 64-bit addresses (CPU can do ~48-bits or something IIRC?) :)

That's not to dampen enthusiasm, but think about how much TLB traffic 4kb pages implies for the usages that are being envisioned here. Hardware isn't magical - it's still important to optimize from the algorithm/data structure level downwards, not just rely on hardware page table indirections for performance.
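Some back-of-the-envelope page counts to make the point concrete (the resource sizes are illustrative, not from any specific title):

```cpp
#include <cstdio>

// Back-of-the-envelope page counts for a few common GPU resources, comparing
// 4 kB and 64 kB pages. Resource sizes are illustrative assumptions.
int main() {
    struct { const char* name; long long bytes; } resources[] = {
        {"1080p RGBA16F G-buffer layer", 1920LL * 1080 * 8},
        {"4x 1080p G-buffer + depth",    1920LL * 1080 * (4 * 8 + 4)},
        {"256 MB virtual texture cache", 256LL * 1024 * 1024},
    };
    for (const auto& r : resources) {
        std::printf("%-30s %8lld pages @4kB, %6lld pages @64kB\n",
                    r.name, r.bytes / 4096, r.bytes / 65536);
    }
}
```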
 
How does pixel shader programmable depth NOT disable Early Z and/or HiZ? You can't know the depth unless you run the shader, and hence you can't do the depth test before running the shader.
That's what DX11 conservative depth output is all about. It enables you to use early-z / hi-z with programmable depth output. You have to promise that your depth output will be greater (or less) than the rasterized depth (it will be clamped to that if you don't keep your promise). Usually pixel displacement mapping techniques are only increasing the depth value from the rasterized value (or discarding the pixel completely if ray misses the object silhouette). Conservative depth output is perfect for this use case.
 
I don't see why it is that bad. You lose deferral somewhat but a TBR can still achieve close to full utilization.
How is it not bad to spill and reload your framebuffer tiles? Especially if they include (uncompressed) MSAA data that you were otherwise hoping to resolve in local memory?

You mean use MSAA textures for G buffer?
Yeah, but compressed by the hardware as always, and subsequently shaded at variable rates according to the unique samples.
 
Certainly, but it's naive to treat it like a magical "free" indirection just because it's in hardware. I'm just saying that the cost of 4kb pages may very well be too high to be reasonable. Also note that - just like on the CPU - you're likely to only get a subset of those 64-bit addresses (CPU can do ~48-bits or something IIRC?) :)

That's not to dampen enthusiasm, but think about how much TLB traffic 4kb pages implies for the usages that are being envisioned here. Hardware isn't magical - it's still important to optimize from the algorithm/data structure level downwards, not just rely on hardware page table indirections for performance.
I know that 4 kB pages can be a performance limiting factor (in Trials Evolution we use solely 64 kB pages for performance reasons). But that's on a 7-year-old console, so it's forgivable.

The real deal breaker is that "wintel" doesn't support anything other than 4 kB and 2 MB pages (PPC supports 64 kB as well; read here for Linus Torvalds' opinion about that: http://yarchive.net/comp/linux/page_sizes.html :) ). It would be sad if we had to extensively use 2 MB pages just to have interoperation between CPU and GPU in the same memory regions. It would be a reasonable first step, but Intel/AMD/Nvidia need to settle on something better in the long run to make their CPUs and GPUs really able to process the same memory regions seamlessly. HSA also touts complete cache coherency between CPU and GPU as the long-term goal. Wouldn't that also dictate that the CPU and GPU eventually need equal-sized cache lines as well?
Yeah, but compressed by the hardware as always, and subsequently shaded at variable rates according to the unique samples.
I am hoping DX12 gives proper API support for that. It likely wouldn't require many hardware changes (if any).

---

Had to include this Linus comment:
"Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB
pages are insane, and 16kB pages are borderline. I've told people so.

The ppc people run databases, and they don't care about sane people
telling them the big pages suck. It's made worse by the fact that they
also have horribly bad TLB fills on their broken CPU's, and years and
years of telling people that the MMU on ppc's are sh*t has only been
reacted to with "talk to the hand, we know better"."
 
That's what DX11 conservative depth output is all about. It enables you to use early-z / hi-z with programmable depth output. You have to promise that your depth output will be greater (or less) than the rasterized depth (it will be clamped to that if you don't keep your promise). Usually pixel displacement mapping techniques are only increasing the depth value from the rasterized value (or discarding the pixel completely if ray misses the object silhouette). Conservative depth output is perfect for this use case.
Why can't a TBDR use programmable z with depth bounds? You have an on-chip z tile, you have an interval for z for a fragment...
 
How is it not bad to spill and reload your framebuffer tiles? Especially if they include (uncompressed) MSAA data that you were otherwise hoping to resolve in local memory?
An IMR, with its tiny ROP caches, is not going to cover itself in glory with such a use case either. Tricks like z/color/msaa compression can easily be used by a TBR designed for such a use case.
Yeah, but compressed by the hardware as always, and subsequently shaded at variable rates according to the unique samples.
Ok, thanks for this bit of clarification.
 
Why can't a TBDR use programmable z with depth bounds? You have an on-chip z tile, you have an interval for z for a fragment...
I don't know enough about TBDR internals to say exactly how much extra bookkeeping work conservative z-bounds would produce. Not much extra I assume.
An IMR, with its tiny ROP caches, is not going to cover itself in glory with such a use case either. Tricks like z/color/msaa compression can easily be used by a TBR designed for such a use case.
One of a TBDR's strengths is that it doesn't need to spend lots of transistors on these IMR compression tricks, though it does need lots of dedicated logic for its own specific needs. The ROPs' internal caches in IMR architectures might be small, but current-generation cards have pretty big general-purpose read/write L2 caches in their memory controllers (512 kB for Kepler, 768 kB for GCN).
 