Early Z IMR vs TBDR - tessellation and everything else - 2012 edition

rpg.314

Veteran
I myself am in the TBDR camp, for now. And I am approaching this issue from the POV of practically unlimited memory capacity (since you have the system memory to lean on in a unified system), with memory bandwidth being the primary constraint on performance.

The usual argument against TBDR is that geometry binning is its Achilles' heel and that tessellation would just kill it.

Here's a patent describing how it might be handled.

As I understood it, it proposes running the hull shader, the tessellator and the part of the domain shader which calculates the final position in the first phase. The patch attributes and tessFactors are dumped to memory. Since the positions are now known, the overlapping tiles are computed, and into those tile lists only compressed indices representing the triangles are written. The patch attributes should not be much more than the attribute data that was read by the vertex/hull shader in the first place, and the indices should be quite small. All in all, the extra memory bandwidth used should be quite small.

In the second phase, the per-tile indices and the patch attributes are read, the position part of the domain shader is re-run, HSR is performed, the rest of the domain shader runs, and from then on it's business as usual.
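
For concreteness, here is a toy, runnable model of that two-phase flow as I read it. Everything in it is illustrative rather than taken from the patent: patches are plain quads, "tessellation" is a uniform grid, the position-only domain shader is just bilinear interpolation, and binning uses triangle bounding boxes against a tile grid.

Code:
# Toy model of the two-phase flow described above; all names and structures
# are illustrative placeholders, not the patent's.
TILE = 32  # tile size in pixels (illustrative)

def domain_position(corners, u, v):
    """Position-only 'domain shader': bilinear interpolation of the quad corners."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = corners
    top = ((1 - u) * x0 + u * x1, (1 - u) * y0 + u * y1)
    bot = ((1 - u) * x3 + u * x2, (1 - u) * y3 + u * y2)
    return ((1 - v) * top[0] + v * bot[0], (1 - v) * top[1] + v * bot[1])

def tessellate(tess_factor):
    """Yield (triangle index, domain coordinates) for a uniform n x n grid."""
    n = tess_factor
    for j in range(n):
        for i in range(n):
            u0, u1, v0, v1 = i / n, (i + 1) / n, j / n, (j + 1) / n
            base = 2 * (j * n + i)
            yield base, [(u0, v0), (u1, v0), (u1, v1)]
            yield base + 1, [(u0, v0), (u1, v1), (u0, v1)]

def binning_phase(patches, width, height):
    """Phase 1: positions only; each tile list stores just (patch id, triangle index)."""
    tiles = {(tx, ty): [] for tx in range(width // TILE) for ty in range(height // TILE)}
    for pid, (corners, tess_factor) in enumerate(patches):
        for tri, uvs in tessellate(tess_factor):
            pts = [domain_position(corners, u, v) for u, v in uvs]
            xs, ys = zip(*pts)
            for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
                for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
                    if (tx, ty) in tiles:
                        tiles[(tx, ty)].append((pid, tri))  # the "compressed index"
    return tiles

def rendering_phase(tiles, patches):
    """Phase 2: per tile, re-run the position evaluation for the referenced triangles;
    HSR and the rest of the domain shader would follow."""
    for (tx, ty), refs in tiles.items():
        for pid, wanted in refs:
            corners, tess_factor = patches[pid]  # patch attributes re-read per reference
            for tri, uvs in tessellate(tess_factor):
                if tri == wanted:
                    pts = [domain_position(corners, u, v) for u, v in uvs]
                    # HSR against the on-chip depth, the remaining domain shader work
                    # and shading would happen here; omitted in this toy model.

# One patch covering part of a 720p screen, tessFactor 8.
patches = [([(10, 10), (300, 20), (310, 400), (5, 390)], 8)]
tiles = binning_phase(patches, 1280, 720)
rendering_phase(tiles, patches)
print(sum(len(refs) for refs in tiles.values()), "triangle references binned")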

The way I see it, it all comes down to which operation is more bandwidth efficient or has better locality. For an IMR, this would be the hardware-managed ROP cache. For a TBDR, this would be the object list. Without tessellation, I would argue that the two are probably close, but intuitively it appears that there is more locality in object space. With tessellation, especially with very large tessellation factors, an IMR will have to juggle lots of fragment traffic, while this implementation of a TBDR will have to deal with patch attributes (which would be small in comparison to fragment traffic, as this data doesn't scale with the tessFactors) and compressed indices, which should be very tiny.
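
A quick back-of-the-envelope on how the binned data scales with tessFactor, using byte counts that are my own assumptions rather than anything from the patent (16 fp32 vec4 control points plus 6 tessFactors per patch, 4 bytes per binned triangle reference):

Code:
# Rough scaling of per-patch binning traffic with tessFactor.
# All constants are assumptions for illustration, not measurements.
PATCH_ATTR_BYTES = 16 * 4 * 4 + 6 * 4  # 16 fp32 vec4 control points + 6 tessFactors
INDEX_BYTES      = 4                   # per binned triangle reference (less once compressed)
TILE_OVERLAP     = 1.3                 # assumed average number of tiles a small triangle touches

def binned_kib_per_patch(tess_factor):
    triangles = 2 * tess_factor * tess_factor  # uniform quad-domain tessellation
    return (PATCH_ATTR_BYTES + triangles * TILE_OVERLAP * INDEX_BYTES) / 1024

for tf in (1, 4, 16, 64):
    print(f"tessFactor {tf:3d}: {2 * tf * tf:5d} tris, ~{binned_kib_per_patch(tf):7.1f} KiB binned per patch")

The patch-attribute term stays fixed while only the small index term grows with triangle count, whereas an IMR's fragment traffic grows with the pixels those triangles cover.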

The position computation has to be done twice, but the evaluation itself would be very cheap; hence the real cost would be in the displacement map lookups. One could argue that these will have very good locality, and with a good texture cache this wouldn't scale with tessFactor.

Reference Threads (Good ones, IMO)

http://forum.beyond3d.com/showthread.php?t=37290
http://forum.beyond3d.com/showthread.php?t=11554
 
I don't think there's a need to pick sides. Future hardware should be flexible enough to support both techniques (and more).

PowerVR Rogue is future embedded GPU IP and I doubt it'll last less than 4-5 years by my estimate. In that market the majority of GPUs are tile-based early-Z IMRs anyway, and it's rather a question of whether exceptions like NVIDIA Tegra will go tile-based within that timeframe with future GPU generations or not.

As for the less foreseeable future beyond roughly half a decade, I doubt that IMG intends to go a more software-oriented route, nor that Intel in the meantime won't utilize the aforementioned IMG GPU IP.
 
PowerVR Rogue is future embedded GPU IP and I doubt it'll last less than 4-5 years by my estimate. In that market the majority of GPUs are tile-based early-Z IMRs anyway, and it's rather a question of whether exceptions like NVIDIA Tegra will go tile-based within that timeframe with future GPU generations or not.

As for the less foreseeable future beyond roughly half a decade, I doubt that IMG intends to go a more software-oriented route, nor that Intel in the meantime won't utilize the aforementioned IMG GPU IP.

At 14 nm, 12MB of SRAM will be pretty cheap. In that space, you can pack 32-bit depth + two fp16 x 4 rendertargets at smartphone resolution (~a million pixels @ 1280x720). So it is an open question whether someone will bother to make a TB(D?)R in that time frame.

I don't think there's a need to pick sides. Future hardware should be flexible enough to support both techniques (and more).

Just because future hardware will be flexible enough to do both techniques doesn't mean that it will do both techniques equally efficiently. So yes, picking sides matters.
 
At 14 nm, 12MB of SRAM will be pretty cheap. In that space, you can pack 32-bit depth + two fp16 x 4 rendertargets at smartphone resolution (~a million pixels @ 1280x720). So it is an open question whether someone will bother to make a TB(D?)R in that time frame.

What about tablets with 3-4x that number of pixels? What about offscreen render targets (e.g. for deferred rendering)? What about UAVs (e.g. for transparency sorting)? What about antialiasing?

With such a limited amount of memory you would quickly need to spill to external memory, creating performance cliff edges or serious limitations on what you can do.

Nick said:
I don't think there's a need to pick sides. Future hardware should be flexible enough to support both techniques (and more).

Not in a power constrained environment any time soon.

John.
 
What about tablets with 3-4x that number of pixels? What about offscreen render targets (e.g. for deferred rendering)? What about UAVs (e.g. for transparency sorting)? What about antialiasing?

Smartphones are a much bigger market, so it may be worthwhile to make a chip specifically for this market. For UAVs, configure this RAM as a usual cache. For AA and MRT, I think a better solution exists out there, but I don't know what it is. But of course, it is possible that none of it will work. Maybe you can tell us what will work in that time frame. ;)

With such a limited amount of memory you would quickly need to spill to external memory, creating performance cliff edges or serious limitations on what you can do.
Tegra has a far smaller color/depth cache and it seems to work fine.


Not in a power constrained environment any time soon.
That is basically everything from a mobile phone to a supercomputer.
 
Smartphones are a much bigger market, so it may be worthwhile to make a chip specifically for this market. For UAVs, configure this RAM as a usual cache. For AA and MRT, I think a better solution exists out there, but I don't know what it is. But of course, it is possible that none of it will work. Maybe you can tell us what will work in that time frame. ;)

Do you think smartphone resolutions will stop at 720p? What about smartphones driving external displays? New form factors or display technologies will always push up the memory requirement for the underlying display surfaces.

Configuring the RAM as a cache won't help unless you have enough memory to encompass the full expanse and layers of the pixels that use the UAV.

Obviously there are post-processing hacks for AA which aren't too bad, and you could argue they make a reasonable replacement for brute-force multi-sampling (I would argue that they're not good enough). For MRTs I'm not seeing any practical replacement, so you still have to accommodate their footprint somewhere. There are also environment maps and shadow maps to consider, the latter of which need even more memory.

Tegra has a far smaller color/depth cache and it seems to work fine.
Yes, they help, generally improving burst utilisation, but they are very dependent on spatial ordering and break as soon as you push a polygon or mesh down the pipeline that covers more underlying memory than the size of the cache.
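
To put a number on that (with my own illustrative figures, not anything specific to Tegra): geometry covering even a modest fraction of a 1280x720 colour+depth surface touches more data than a ROP-style cache of a few hundred KiB can hold, so lines get evicted before they are reused.

Code:
# Illustrative working-set arithmetic; the cache size and per-pixel formats
# are assumptions, not figures for any particular GPU.
WIDTH, HEIGHT   = 1280, 720
BYTES_PER_PIXEL = 4 + 4          # 32-bit colour + 32-bit depth
CACHE_BYTES     = 256 * 1024     # assumed ROP-style colour/depth cache

def working_set_bytes(coverage):
    """Colour+depth bytes touched by geometry covering this fraction of the screen."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL * coverage

for coverage in (0.01, 0.1, 1.0):
    ws = working_set_bytes(coverage)
    verdict = "fits" if ws <= CACHE_BYTES else "thrashes"
    print(f"{coverage:4.0%} of screen -> {ws / 1024:7.0f} KiB touched ({verdict} a {CACHE_BYTES // 1024} KiB cache)")
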
That is basically everything from a mobile phone to a supercomputer.

Yes, power constraints are coming in across the board. However, smartphones and tablets have thermal-related power limits that are ~3 orders of magnitude lower than those of high-end desktop systems; it's unlikely that there will ever be space for generalised programmability, at acceptable performance, within these limits.

John.
 
How about constructing the TAG buffer for the entire frame on chip in one go? That should work.

Not sure what that buys you other than IMG-style deferred rendering; you still need to construct a G Buffer from the underlying geometry that the tags point to, which means you still need the memory for it.

John.
 
But then, it isn't TBDR anymore, is it?

Eh? That's a moot point; even though doing a full-frame tag buffer would mean you're not a tiler, it still doesn't buy you anything relative to an IMR, i.e. it still suffers from needing large amounts of on-chip memory in order to efficiently support things like G Buffers.
 
I've got to agree with JohnH here. However I do believe that there is one very good use case for a large block of SRAM on an IMR: keep the current Z-Buffer completely on-chip! The coolest part is that neither ridiculous resolutions nor MSAA are a fundamental obstacle because you want to support Z Compression anyway (think shadowmaps).

So you could have a very simple scheme where you have 4MB of SRAM on chip (enough for 1280x720 0xMSAA without compression!) and reserve the full framebuffer size in external memory anyway. If the compression ratio for a tile is good enough, depth-related bandwidth is zero. If the compression ratio isn't good enough, you write part of the tile to your on-chip SRAM and the remaining part to DRAM. So if you had a moderately complex tile, you might still save 50% bandwidth, and even for very complex tiles you might save 10% (for example) on both reads and writes. If the depth buffer is required afterwards (e.g. shadowmaps) you write the data from the on-chip SRAM to the already reserved DRAM memory locations, nothing more and nothing less.
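
A minimal sketch of that per-tile depth write path, with zlib standing in for a hardware depth compressor and an arbitrary per-tile SRAM budget (think of it as the case where MSAA or resolution means a raw tile no longer fits its on-chip slot):

Code:
import os
import zlib

# Toy model of the scheme above; tile size, the per-tile SRAM budget and the
# compressor are all stand-ins for illustration.
TILE_PIXELS      = 8 * 8
RAW_TILE_BYTES   = TILE_PIXELS * 4      # 32-bit depth per sample
SRAM_TILE_BUDGET = RAW_TILE_BYTES // 2  # assumed on-chip slot per tile

sram = {}
dram_bytes_written = 0

def write_depth_tile(tile_id, raw_tile):
    """Keep the compressed tile on chip if it fits its budget; otherwise keep what fits
    and spill the remainder to the DRAM space already reserved for the full buffer."""
    global dram_bytes_written
    compressed = zlib.compress(raw_tile)  # stand-in for hardware depth compression
    sram[tile_id] = compressed[:SRAM_TILE_BUDGET]
    dram_bytes_written += max(0, len(compressed) - SRAM_TILE_BUDGET)

def resolve_depth_to_dram():
    """Only if the depth buffer is consumed later (e.g. as a shadow map): flush
    whatever still lives in SRAM out to its reserved DRAM locations."""
    global dram_bytes_written
    for tile in sram.values():
        dram_bytes_written += len(tile)

flat  = bytes(RAW_TILE_BYTES)       # constant-depth tile: compresses well, nothing spills
noisy = os.urandom(RAW_TILE_BYTES)  # incompressible tile: part of it spills to DRAM
write_depth_tile(0, flat)
write_depth_tile(1, noisy)
print("DRAM bytes written on the depth path:", dram_bytes_written)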

If you had a 2D GUI without a Z Buffer, you could reuse that SRAM as a gigantic cache (blending, textures, etc.) but I'm honestly not sure how beneficial that would be compared to the Z-Buffer case (it could be nice for GPGPU though). You wouldn't get most of the benefits of a TBDR but you wouldn't get the binning overhead either.

This gets us back to the original topic of this thread, which is ways to minimise the binning overhead. Tessellation is a very interesting and important corner case where specific optimisations can help a lot, but there certainly are things you can do to improve the general case as well. This kind of discussion is obviously (and sadly) very sensitive for legal reasons - I don't think it's a coincidence that John isn't replying to the topic's original subject here, and I certainly can't blame him for it! :)
 
Honestly I expected a worthier analysis from you Arun on the patent itself. Not really an Uttargram from hell (God help!) but you know what I mean :devilish:
 
So you could have a very simple scheme where you have 4MB of SRAM on chip (enough for 1280x720 0xMSAA without compression!) and reserve the full framebuffer size in external memory anyway. If the compression ratio for a tile is good enough, depth-related bandwidth is zero. If the compression ratio isn't good enough, you write part of the tile to your on-chip SRAM and the remaining part to DRAM. So if you had a moderately complex tile, you might still save 50% bandwidth, and even for very complex tiles you might save 10% (for example) on both reads and writes. If the depth buffer is required afterwards (e.g. shadowmaps) you write the data from the on-chip SRAM to the already reserved DRAM memory locations, nothing more and nothing less.

It's a reasonable use case, but there are a couple of problems. The first is that it's quite hard to do lossless compression that both reduces memory footprint and allows random access to that data; it's not impossible to solve, but there are reasons why the compression mechanisms currently in use tend to target only bandwidth reduction. The second problem is that the rendering sequences you see in many apps require multiple concurrent Z buffers, e.g. a sequence of render to main scene -> render to texture -> render to main scene -> render to texture -> render to main scene -> etc. requires a number of Z buffer context switches, which could take significant time and BW if you need to flush your current Z buffer to memory.
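
For scale, a rough number on that second point, using assumed figures (720p, 32-bit Z, a handful of render-to-texture bounces per frame at 60 fps, uncompressed flushes):

Code:
# Illustrative cost of flushing and reloading an on-chip Z buffer on every
# main-scene <-> render-to-texture switch. All numbers are assumptions.
WIDTH, HEIGHT      = 1280, 720
Z_BYTES_PER_PIXEL  = 4
SWITCHES_PER_FRAME = 4   # render-to-texture bounces per frame
FPS                = 60

z_buffer_bytes = WIDTH * HEIGHT * Z_BYTES_PER_PIXEL
per_frame = SWITCHES_PER_FRAME * 2 * z_buffer_bytes  # flush one Z context, reload the other
print(f"~{per_frame / 2**20:.1f} MiB per frame, "
      f"~{per_frame * FPS / 2**30:.2f} GiB/s just on Z context switches")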

Not saying that these sorts of problems are insurmountable, just pointing out practicalities.
 