Early Z IMR vs TBDR - tessellation and everything else - 2012 edition

Discussion in 'Architecture and Products' started by rpg.314, Dec 21, 2011.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    I myself, am in the TBDR camp, for now. And I am approaching this issue from the POV of practically unlimited memory capacity (since you have the system memory to lean on in a unified system) and memory bandwidth being the primary constraint on performance.

    The usual argument against TBDR is that geometry binning is its Achilles' heel and that tessellation would simply kill it.

    Here's a patent describing how it might be handled.

    As I understood it, it proposes running the hull shader, the tessellator and the part of the domain shader which calculates the final position in a first phase. The patch attributes and tessFactors are dumped to memory. Since the positions are now known, the overlapping tiles are computed, and into those tile lists only the compressed indices representing the triangles are written. The patch attributes should not be much more than the attribute data that was read by the vertex/hull shader in the first place, and the indices should be quite small. All in all, the extra memory bandwidth used should be quite small.

    In the second phase, the per-tile indices and the patch attributes are read, the position part of the domain shader is re-run, HSR is performed, the rest of the domain shader runs, and from then on it's business as usual.
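    To make the first phase concrete, here's a rough Python sketch of what I have in mind. All names (phase1_bin_patches, TILE, etc.) are illustrative, not from the patent, and the tessellator/domain shader are passed in as stand-in callables.

```python
TILE = 32  # tile size in pixels (assumed)

def tiles_overlapped(verts):
    """Conservative tile coverage via the triangle's bounding box."""
    xs = [v[0] for v in verts]
    ys = [v[1] for v in verts]
    for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
        for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
            yield (tx, ty)

def phase1_bin_patches(patches, tessellate, position_only_ds):
    """Phase 1: hull shading / tessellation / position-only domain
    shading, then binning. Per tile we store only compressed triangle
    indices; patch attributes are dumped to memory once per patch."""
    tile_lists = {}    # (tx, ty) -> [(patch_id, tri_id), ...]
    patch_store = {}   # patch_id -> attributes written to memory
    for pid, patch in enumerate(patches):
        patch_store[pid] = patch["attributes"]
        for tid, tri_uvs in enumerate(tessellate(patch["tess_factor"])):
            # position-only domain shader gives us screen positions
            verts = [position_only_ds(patch, uv) for uv in tri_uvs]
            for t in tiles_overlapped(verts):
                tile_lists.setdefault(t, []).append((pid, tid))
    return tile_lists, patch_store
```

    Phase 2 would then walk tile_lists tile by tile: re-run the position part of the domain shader for the referenced triangles, do HSR, and run the rest of the domain shader on the survivors.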

    The way I see it, it all comes down to which operation is more bandwidth efficient or has better locality. For an IMR, this would be the hw-managed ROP cache. For a TBDR, this would be the object list. Without tessellation, I would argue that the two are probably close, though intuitively it appears that there is more locality in object space. With tessellation, especially with very large tessellation factors, an IMR will have to juggle lots of fragment traffic, while this implementation of TBDR will have to deal with patch attributes (which would be small in comparison to fragment traffic, as this data doesn't scale with tessFactors) and compressed indices, which should be very tiny.

    The position computation has to be done twice, but the evaluation itself would be very cheap; the real cost would be in displacement-map lookups. One could argue that these will have very good locality, and with a good texture cache this wouldn't scale with tessFactor.
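    A back-of-the-envelope model of the binning traffic per patch, assuming a quad patch yields roughly 2*tf^2 triangles; the byte counts (512 B of patch attributes, 2 B per compressed triangle index) are illustrative guesses, not measured numbers.

```python
def tbdr_bin_bytes(tf, attr_bytes=512, index_bytes_per_tri=2):
    """Extra memory traffic the binning phase writes per patch:
    patch attributes (independent of tf) plus compressed indices."""
    tris = 2 * tf * tf  # rough triangle count for a quad patch
    return attr_bytes + tris * index_bytes_per_tri

for tf in (1, 16, 64):
    print(tf, tbdr_bin_bytes(tf))
```

    Even at tf=64 this comes to ~16 KB per patch, and the attribute term stays constant in tf; meanwhile the IMR's per-fragment traffic for the same patch gets progressively less ROP-cache friendly as triangles shrink toward pixel size.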

    Reference Threads (Good ones, IMO)

    http://forum.beyond3d.com/showthread.php?t=37290
    http://forum.beyond3d.com/showthread.php?t=11554
     
  2. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    I don't think there's a need to pick sides. Future hardware should be flexible enough to support both techniques (and more).
     
  3. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,726
    Hopefully flexible enough not to tie up multipliers for MSAA Z-comparisons too.
     
  4. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,222
    Location:
    Chania
    ***delete
     
  5. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,222
    Location:
    Chania
    PowerVR Rogue is future embedded GPU IP, and I doubt it'll last less than 4-5 years by my estimate. In that market the majority of GPUs are tile-based early-Z IMRs anyway, and it's rather a question whether exceptions like NVIDIA Tegra will go tile-based within that timeframe with future GPU generations or not.

    As for the less foreseeable future beyond roughly half a decade, I doubt that IMG intends to go a more sw-oriented route, nor that Intel in the meantime won't utilize the aforementioned IMG GPU IP.
     
  6. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    At 14 nm, 12MB of SRAM will be pretty cheap. In that space, you can pack 32-bit depth + two fp16 x 4 rendertargets at smartphone resolution (~a million pixels at 1280x720). So it is an open question whether anyone will bother to make a TB(D?)R in that time frame.
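    A quick footprint check under those formats (32-bit depth, fp16x4 = 8 bytes per RT pixel, both assumed uncompressed), to see what the raw numbers look like:

```python
def on_chip_bytes(width, height, depth_bytes=4, rt_bytes=8, num_rts=2):
    """Uncompressed on-chip footprint for depth plus render targets."""
    return width * height * (depth_bytes + rt_bytes * num_rts)

mib = on_chip_bytes(1280, 720) / 2**20
print(round(mib, 1))  # ~17.6 MiB raw, so fitting in 12MB implies some
                      # framebuffer compression or fewer/smaller targets
```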

    Just because future hardware will be flexible enough to do both techniques doesn't mean that it will do both equally efficiently. So yes, picking sides matters.
     
  7. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Location:
    UK
    What about tablets with 3-4x that number of pixels? What about offscreen render targets (e.g. for deferred rendering)? What about UAVs (e.g. for transparency sorting)? What about antialiasing?

    With such a limited amount of memory you would quickly need to spill to external memory, creating performance cliff edges or serious limitations on what you can do.

    Not in a power constrained environment any time soon.

    John.
     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Smartphones are a much bigger market, so it may be worthwhile to make a chip specifically for them. For UAVs, configure this RAM as an ordinary cache. For AA and MRT, I think a better solution exists out there, but I don't know what it is. Of course, it's possible that none of this will work. Maybe you can tell us what will work in that time frame. :wink:

    Tegra has a far smaller color/depth cache and it seems to work fine.

    That is basically everything from a mobile phone to a supercomputer.
     
  9. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Location:
    UK
    Do you think smartphone resolutions will stop at 720p? What about smartphones driving external displays? New form factors or display technologies will always push up the memory requirements for the underlying display surfaces.

    Configuring the RAM as a cache won't help unless you have enough memory to cover the full extent, and all layers, of the pixels that use the UAV.

    Obviously there are post-processing hacks for AA which aren't too bad, and you could argue they make a reasonable replacement for brute-force multisampling (I would argue that they're not good enough). For MRTs I'm not seeing any practical replacement, so you still have to accommodate their footprint somewhere. There are also environment maps and shadow maps to consider, the latter of which need even more memory.

    Yes, they help, generally improving burst utilisation, but they are very dependent on spatial ordering and break as soon as you push a polygon or mesh down the pipeline that covers more underlying memory than the size of the cache.
    Yes, power constraints are coming in across the board; however, smartphones and tablets have thermally related power limits that are ~3 orders of magnitude lower than high-end desktop systems. It's unlikely that there will ever be space for generalised programmability, at acceptable performance, within these limits.

    John.
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    How about constructing the TAG buffer for the entire frame on chip in one go? That should work.
     
  11. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Location:
    UK
    Not sure what that buys you other than IMG-style deferred rendering; you still need to construct a G-buffer from the underlying geometry that the tags point to, which means you still need the memory for it.

    John.
     
  12. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    But then, it isn't TBDR anymore, is it?
     
  13. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Location:
    UK
    Eh? That's a moot point: even though doing a full-frame tag buffer would mean you're not a tiler, it still doesn't buy you anything relative to an IMR, i.e. it still suffers from needing large amounts of on-chip memory in order to efficiently support things like G-buffers.
     
  14. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    I've got to agree with JohnH here. However I do believe that there is one very good use case for a large block of SRAM on an IMR: keep the current Z-Buffer completely on-chip! The coolest part is that neither ridiculous resolutions nor MSAA are a fundamental obstacle because you want to support Z Compression anyway (think shadowmaps).

    So you could have a very simple scheme where you have 4MB of SRAM on chip (enough for 1280x720 with no MSAA, uncompressed!) and reserve the full framebuffer size in external memory anyway. If the compression ratio for a tile is good enough, depth-related bandwidth is zero. If the compression ratio isn't good enough, you write part of the tile to your on-chip SRAM and the remaining part to DRAM. So for a moderately complex tile you might still save 50% bandwidth, and even for very complex tiles you might save 10% (for example) on both reads and writes. If the depth buffer is required afterwards (e.g. shadowmaps) you write the data from the on-chip SRAM to the already-reserved DRAM memory locations, nothing more and nothing less.
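    A minimal sketch of that spill scheme, with made-up sizes: each depth tile gets a fixed SRAM slot, and whatever the compressor cannot fit there spills to the DRAM space that was reserved for the full buffer anyway.

```python
SRAM_SLOT = 128  # on-chip bytes per tile (assumed, purely illustrative)

def store_depth_tile(compressed_bytes):
    """Split a compressed depth tile between on-chip SRAM and DRAM.
    Returns (bytes kept in SRAM, bytes spilled to DRAM)."""
    sram = min(compressed_bytes, SRAM_SLOT)
    return sram, compressed_bytes - sram

def dram_write_saved(compressed_bytes, raw_bytes):
    """Fraction of the tile's raw write bandwidth that never hits DRAM."""
    _, spilled = store_depth_tile(compressed_bytes)
    return 1 - spilled / raw_bytes
```

    A tile that compresses to 64 bytes stays entirely on chip (zero depth bandwidth); one that only compresses to 192 of 256 bytes still keeps 128 bytes on chip and saves 75% of the raw write traffic. Resolving for a later shadow-map read is then just copying each tile's SRAM part out to its reserved DRAM location.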

    If you had a 2D GUI without a Z Buffer, you could reuse that SRAM as a gigantic cache (blending, textures, etc.) but I'm honestly not sure how beneficial that would be compared to the Z-Buffer case (it could be nice for GPGPU though). You wouldn't get most of the benefits of a TBDR but you wouldn't get the binning overhead either.

    This gets us back to the original topic of this thread, which is ways to minimise the binning overhead. Tessellation is a very interesting and important corner case where specific optimisations can help a lot, but there certainly are things you can do to improve the general case as well. This kind of discussion is obviously (and sadly) very sensitive for legal reasons - I don't think it's a coincidence John isn't replying to the topic's original subject here, and I certainly can't blame him for it! :)
     
  15. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,222
    Location:
    Chania
    Honestly I expected a worthier analysis from you, Arun, on the patent itself. Not really an Uttargram from hell (God help!) but you know what I mean :twisted:
     
  16. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Location:
    UK
    It's a reasonable use case, but there are a couple of problems. The first is that it's quite hard to do lossless compression that both reduces memory footprint and allows random access to the data; it's not impossible to solve, but there are reasons why current compression mechanisms tend to target only bandwidth reduction. The second problem is that the rendering sequences you see in many apps require multiple concurrent Z buffers, e.g. a sequence of render to main scene -> render to texture -> render to main scene -> render to texture -> render to main scene -> etc. requires a number of Z-buffer context switches, which could take significant time and BW if you need to flush your current Z buffer to memory.

    Not saying that these sorts of problems are insurmountable, just pointing out practicalities.
     
