Old 21-Dec-2011, 06:10   #1
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,230
Early Z IMR vs TBDR - tessellation and everything else - 2012 edition

I myself am in the TBDR camp, for now. I am approaching this issue from the POV of practically unlimited memory capacity (since you have the system memory to lean on in a unified system), with memory bandwidth being the primary constraint on performance.

The usual argument against TBDR is that geometry binning is its Achilles' heel and that tessellation would just kill it.

Here's a patent describing how it might be handled.

As I understand it, it proposes running the hull shader, the tessellator, and the part of the domain shader that calculates the final position in the first phase. The patch attributes and tessFactors are dumped to memory. Since you now know the positions, the overlapping tiles are computed, and only the compressed indices representing the triangles are written to those tile lists. The patch attributes should not be much more than the attribute data that was read by the vertex/hull shader in the first place, and the indices should be quite small. All in all, the extra memory bw used should be quite small.

In the second phase, the per-tile indices and the patch attributes are read, the position part of the domain shader is re-run, HSR is performed, the rest of the domain shader runs, and from then on it's business as usual.
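
To make the two phases concrete, here is a minimal sketch of the idea as I read it, not the patent's actual method. Everything in it is an assumption for illustration: the bilinear quad patch standing in for the position part of the domain shader, the uniform n x n tessellation, the bounding-box binning, the 32-pixel tile size, and all function names. Displacement lookups, index compression and HSR are left out.

```python
from collections import defaultdict

TILE = 32  # tile edge in pixels (assumed value)

def domain_position(patch, u, v):
    """Position-only part of a hypothetical domain shader: bilinear quad patch."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = patch["corners"]
    top = (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
    bot = (x3 + (x2 - x3) * u, y3 + (y2 - y3) * u)
    return (top[0] + (bot[0] - top[0]) * v, top[1] + (bot[1] - top[1]) * v)

def triangle_uvs(tri_index, n):
    """Map a triangle index back to its (u, v) corners on a uniform n x n grid."""
    cell, half = divmod(tri_index, 2)
    j, i = divmod(cell, n)
    u0, u1, v0, v1 = i / n, (i + 1) / n, j / n, (j + 1) / n
    if half == 0:
        return ((u0, v0), (u1, v0), (u0, v1))
    return ((u1, v0), (u1, v1), (u0, v1))

def triangle_positions(patch, tri_index):
    """Evaluate one triangle's screen positions from the patch data alone."""
    uvs = triangle_uvs(tri_index, patch["tess_factor"])
    return [domain_position(patch, u, v) for u, v in uvs]

def phase1_bin(patches):
    """Phase 1: compute positions, then store only (patch id, triangle index) per tile."""
    tile_lists = defaultdict(list)
    for pid, patch in enumerate(patches):
        n = patch["tess_factor"]
        for t in range(2 * n * n):
            pts = triangle_positions(patch, t)
            xs = [p[0] for p in pts]
            ys = [p[1] for p in pts]
            # Conservative binning by bounding box; assumes non-negative coordinates.
            for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
                for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
                    tile_lists[(tx, ty)].append((pid, t))
    return tile_lists

def phase2_tile(patches, tile_lists, tile):
    """Phase 2: re-run the position evaluation for the triangles listed in one tile."""
    return [triangle_positions(patches[pid], t) for pid, t in tile_lists.get(tile, [])]

if __name__ == "__main__":
    patches = [{"corners": [(0, 0), (64, 0), (64, 64), (0, 64)], "tess_factor": 4}]
    lists = phase1_bin(patches)
    print(len(lists[(0, 0)]), "triangles binned into tile (0, 0)")
```

The point of the sketch is that the tile lists hold only small (patch, index) pairs, while the patch data is stored once regardless of tessFactor; positions are recomputed per tile rather than written out.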

The way I see it, it all comes down to which operation is more bandwidth efficient or has better locality. For an IMR, this would be the hw-managed ROP cache. For a TBDR, this would be the object list. Without tessellation, I would argue that the two are probably close, but intuitively it appears that there is more locality in object space. With tessellation, especially with very large tessellation factors, an IMR will have to juggle lots of fragment traffic, while this implementation of TBDR will have to deal with patch attributes (which would be small in comparison to fragment traffic, as this data doesn't scale with tessFactor) and compressed indices, which should be very tiny.

The position computation has to be done twice, but the evaluation itself would be very cheap, so the real cost would be in displacement map lookups. One could argue that these will have very good locality and that, with a good texture cache, their cost wouldn't scale with tessFactor.

Reference Threads (Good ones, IMO)

http://forum.beyond3d.com/showthread.php?t=37290
http://forum.beyond3d.com/showthread.php?t=11554
rpg.314 is offline   Reply With Quote
Old 27-Dec-2011, 14:22   #2
Nick
Senior Member
 
Join Date: Jan 2003
Location: Montreal, Quebec
Posts: 1,859
Default

Quote:
Originally Posted by rpg.314 View Post
I myself am in the TBDR camp, for now.
I don't think there's a need to pick sides. Future hardware should be flexible enough to support both techniques (and more).
Nick is offline   Reply With Quote
Old 27-Dec-2011, 19:07   #3
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,543
Default

Hopefully flexible enough not to tie up multipliers for MSAA Z-comparisons too.
__________________
Cinematic is the new streamlined.
MfA is offline   Reply With Quote
Old 27-Dec-2011, 23:22   #4
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 8,463
Default

***delete
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs.
Ailuros is offline   Reply With Quote
Old 27-Dec-2011, 23:42   #5
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 8,463
Default

Quote:
Originally Posted by Nick View Post
I don't think there's a need to pick sides. Future hardware should be flexible enough to support both techniques (and more).
PowerVR Rogue is future embedded GPU IP and I doubt it'll last less than 4-5 years on estimate. In that market the majority of GPUs are tile based earlyZ IMRs anyway and it's rather a question if exceptions like NVIDIA Tegra will go tile based within that timeframe with future GPU generations or not.

As for the less foreseeable future beyond roughly half a decade, I doubt that IMG intends to go a more sw oriented route, nor that Intel in the meantime won't utilize the aforementioned IMG GPU IP.
Ailuros is offline   Reply With Quote
Old 05-Jan-2012, 12:43   #6
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,230
Default

Quote:
Originally Posted by Ailuros View Post
PowerVR Rogue is future embedded GPU IP and I doubt it'll last less than 4-5 years on estimate. In that market the majority of GPUs are tile based earlyZ IMRs anyway and it's rather a question if exceptions like NVIDIA Tegra will go tile based within that timeframe with future GPU generations or not.

As for the less foreseeable future beyond roughly half a decade, I doubt that IMG intends to go a more sw oriented route, nor that Intel in the meantime won't utilize the aforementioned IMG GPU IP.
At 14 nm, 12MB SRAM will be pretty cheap. In that space, you can pack 32 bit depth + two fp16 x 4 rendertargets in smartphone resolution (~million pixels @ 1280x720). So it is open to question whether someone will bother to make a TB(D?)R in that time frame.
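
A quick back-of-the-envelope check of that budget, as a sketch only. It assumes "12MB" means 12 MiB, uncompressed surfaces, no MSAA, and 8 bytes per pixel for an fp16 x 4 target; the function names are made up for illustration.

```python
SRAM_BYTES = 12 * 1024 * 1024   # "12MB" read as 12 MiB (assumption)
DEPTH32_BYTES = 4               # 32-bit depth per pixel
FP16X4_BYTES = 8                # four fp16 channels per pixel

def framebuffer_bytes(width, height, bytes_per_pixel, samples=1):
    """Uncompressed footprint of all per-pixel surfaces, in bytes."""
    return width * height * bytes_per_pixel * samples

def fits_on_chip(width, height, num_fp16x4_targets, samples=1, sram_bytes=SRAM_BYTES):
    """True if depth plus the given number of fp16 x 4 targets fits in the SRAM."""
    bpp = DEPTH32_BYTES + num_fp16x4_targets * FP16X4_BYTES
    return framebuffer_bytes(width, height, bpp, samples) <= sram_bytes

if __name__ == "__main__":
    for targets in (1, 2):
        bpp = DEPTH32_BYTES + targets * FP16X4_BYTES
        total = framebuffer_bytes(1280, 720, bpp)
        print(targets, "target(s):", total, "bytes, fits:", fits_on_chip(1280, 720, targets))
```

Under these assumptions, depth plus one fp16 x 4 target at 1280x720 comes to 11,059,200 bytes and fits, while depth plus two such targets comes to 18,432,000 bytes (about 17.6 MiB) and does not. So the two-target figure only works with narrower targets (for example 8 bits per channel) or some compression.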

Quote:
Originally Posted by Nick View Post
I don't think there's a need to pick sides. Future hardware should be flexible enough to support both techniques (and more).
Just because future hardware will be flexible enough to do both techniques doesn't mean that it will do both techniques equally efficiently. So yes, picking sides matters.
rpg.314 is offline   Reply With Quote
Old 09-Jan-2012, 17:33   #7
JohnH
Member
 
Join Date: Mar 2002
Location: UK
Posts: 580
Default

Quote:
Originally Posted by rpg.314 View Post
At 14 nm, 12MB SRAM will be pretty cheap. In that space, you can pack 32 bit depth + two fp16 x 4 rendertargets in smartphone resolution (~million pixels @ 1280x720). So it is open to question whether someone will bother to make a TB(D?)R in that time frame.
What about tablets with 3-4x that number of pixels? What about offscreen render targets (e.g. for deferred rendering)? What about UAVs (e.g. for trans sorting)? What about antialiasing?

With such a limited amount of memory you would quickly need to spill to external memory, creating performance cliff edges or serious limitations on what you can do.

Quote:
Originally Posted by Nick
I don't think there's a need to pick sides. Future hardware should be flexible enough to support both techniques (and more).
Not in a power constrained environment any time soon.

John.
JohnH is offline   Reply With Quote
Old 10-Jan-2012, 10:25   #8
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,230
Default

Quote:
Originally Posted by JohnH View Post
What about tablets with 3-4x that number of pixels? What about offscreen render targets (e.g. for deferred rendering)? What about UAVs (e.g. for trans sorting)? What about antialiasing?
Smartphones are a much bigger market, so it may be worthwhile to make a chip specifically for this market. For UAVs, configure this RAM as a usual cache. For AA and MRT, I think a better solution exists out there, but I don't know what it is. But of course, it is possible that none of it will work. Maybe you can tell us what will work in that time frame.

Quote:
With such a limited amount of memory you would quickly need to spill to external memory, creating performance cliff edges or serious limitations on what you can do.
Tegra has a far smaller color depth cache and it seems to work fine.


Quote:
Not in a power constrained environment any time soon.
That is basically everything from a mobile phone to a supercomputer.
rpg.314 is offline   Reply With Quote
Old 10-Jan-2012, 14:02   #9
JohnH
Member
 
Join Date: Mar 2002
Location: UK
Posts: 580
Default

Quote:
Originally Posted by rpg.314 View Post
Smartphones are a much bigger market, so it may be worthwhile to make a chip specifically for this market. For UAVs, configure this RAM as a usual cache. For AA and MRT, I think a better solution exists out there, but I don't know what it is. But of course, it is possible that none of it will work. Maybe you can tell us what will work in that time frame.
Do you think smart phone res will stop at 720p? What about smart phones driving external displays? New form factors or display technologies will always push up the memory requirement for underlying display surfaces.

Configuring the RAM as a cache won't help unless you have enough memory to encompass the full expanse and layers of the pixels that use the UAV.

Obviously there are post-processing hacks for AA which aren't too bad, and you could argue they make a reasonable replacement for brute force multi-sampling (I would argue that they're not good enough). For MRTs I'm not seeing any practical replacement, so you still have to accommodate their footprint somewhere. There are also environment maps and shadow maps to consider, the latter of which need even more memory.

Quote:
Tegra has a far smaller color depth cache and it seems to work fine.
Yes they help, generally improving burst utilisation, but they are very dependent on spatial ordering and break as soon as you push a polygon or mesh down the pipeline that covers more underlying memory than the size of the cache.
Quote:
That is basically everything from a mobile phone to a supercomputer.
Yes, power constraints are coming in across the board. However, smart phones and tablets have thermal related power limits that are ~3 orders of magnitude lower than high end desktop systems, so it's unlikely that there will ever be space for generalised programmability, at acceptable performance, within these limits.

John.
JohnH is offline   Reply With Quote
Old 10-Jan-2012, 16:17   #10
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,230
Default

How about constructing the TAG buffer for the entire frame on chip in one go? That should work.
rpg.314 is offline   Reply With Quote
Old 11-Jan-2012, 20:39   #11
JohnH
Member
 
Join Date: Mar 2002
Location: UK
Posts: 580
Default

Quote:
Originally Posted by rpg.314 View Post
How about constructing the TAG buffer for the entire frame on chip in one go? That should work.
Not sure what that buys you other than IMG style deferred rendering. You still need to construct a G Buffer from the underlying geometry that the tags point to, which means you still need the memory for it.

John.
JohnH is offline   Reply With Quote
Old 12-Jan-2012, 10:43   #12
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,230
Default

But then, it isn't TBDR anymore, is it?
rpg.314 is offline   Reply With Quote
Old 12-Jan-2012, 13:42   #13
JohnH
Member
 
Join Date: Mar 2002
Location: UK
Posts: 580
Default

Quote:
Originally Posted by rpg.314 View Post
But then, it isn't TBDR anymore, is it?
Eh? That's a moot point: even though doing a full frame tag buffer would mean you're not a tiler, it still doesn't buy you anything relative to an IMR, i.e. it still suffers from needing large amounts of on chip memory in order to efficiently support things like G Buffers.
JohnH is offline   Reply With Quote
Old 13-Jan-2012, 16:55   #14
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,912
Default

I've got to agree with JohnH here. However I do believe that there is one very good use case for a large block of SRAM on an IMR: keep the current Z-Buffer completely on-chip! The coolest part is that neither ridiculous resolutions nor MSAA are a fundamental obstacle because you want to support Z Compression anyway (think shadowmaps).

So you could have a very simple scheme where you have 4MB of SRAM on chip (enough for 1280x720 0xMSAA without compression!) and reserve the full framebuffer size in external memory anyway. If the compression ratio for a tile is good enough, depth-related bandwidth is zero. If the compression ratio isn't good enough, you write part of the tile to your on-chip SRAM and the remaining part to DRAM. So if you had a moderately complex tile, you might still save 50% bandwidth, and even for very complex tiles you might save 10% (for example) on both reads and writes. If the depth buffer is required afterwards (e.g. shadowmaps) you write the data from the on-chip SRAM to the already reserved DRAM memory locations, nothing more and nothing less.
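
A rough sketch of the split described above, under assumptions of my own: each screen tile gets a fixed slot in the on-chip SRAM, and whatever part of the compressed tile doesn't fit in that slot goes to its reserved DRAM location. The slot size (256 bytes), the tile sizes in the example, and the function names are all hypothetical.

```python
def split_depth_tile(compressed_tile_bytes, sram_slot_bytes):
    """Return (bytes kept on chip, bytes spilled to DRAM) for one tile."""
    on_chip = min(compressed_tile_bytes, sram_slot_bytes)
    return on_chip, compressed_tile_bytes - on_chip

def dram_saving(compressed_tile_bytes, sram_slot_bytes):
    """Fraction of the tile's depth traffic that never touches DRAM."""
    if compressed_tile_bytes == 0:
        return 1.0
    on_chip, _ = split_depth_tile(compressed_tile_bytes, sram_slot_bytes)
    return on_chip / compressed_tile_bytes

if __name__ == "__main__":
    SLOT = 256  # bytes of SRAM per tile (assumed)
    for size in (200, 512, 1024):  # compressed tile sizes in bytes (assumed)
        print(size, split_depth_tile(size, SLOT), dram_saving(size, SLOT))
```

With these made-up numbers, a tile that compresses to 200 bytes generates no DRAM depth traffic, a 512-byte tile saves 50%, and a 1024-byte tile still saves 25%. How the SRAM would really be partitioned between tiles is the open design question.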

If you had a 2D GUI without a Z Buffer, you could reuse that SRAM as a gigantic cache (blending, textures, etc.) but I'm honestly not sure how beneficial that would be compared to the Z-Buffer case (it could be nice for GPGPU though). You wouldn't get most of the benefits of a TBDR but you wouldn't get the binning overhead either.

This gets us back to the original topic of this thread, which is ways to minimise the binning overhead. Tessellation is a very interesting and important corner case where specific optimisations can help a lot, but there certainly are things you can do to improve the general case as well. This kind of discussion is obviously (and sadly) very sensitive for legal reasons - I don't think it's a coincidence John isn't replying to the topic's original subject here, and I certainly can't blame him for it!
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles)
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Arun is offline   Reply With Quote
Old 13-Jan-2012, 17:50   #15
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 8,463
Default

Honestly, I expected a worthier analysis from you, Arun, on the patent itself. Not really an Uttargram from hell (God help!) but you know what I mean
Ailuros is offline   Reply With Quote
Old 14-Jan-2012, 08:53   #16
JohnH
Member
 
Join Date: Mar 2002
Location: UK
Posts: 580
Default

Quote:
Originally Posted by Arun View Post
So you could have a very simple scheme where you have 4MB of SRAM on chip (enough for 1280x720 0xMSAA without compression!) and reserve the full framebuffer size in external memory anyway. If the compression ratio for a tile is good enough, depth-related bandwidth is zero. If the compression ratio isn't good enough, you write part of the tile to your on-chip SRAM and the remaining part to DRAM. So if you had a moderately complex tile, you might still save 50% bandwidth, and even for very complex tiles you might save 10% (for example) on both reads and writes. If the depth buffer is required afterwards (e.g. shadowmaps) you write the data from the on-chip SRAM to the already reserved DRAM memory locations, nothing more and nothing less.
It's a reasonable use case, but there are a couple of problems. The first is that it's quite hard to do lossless compression that both reduces memory footprint and allows random access to that data. It's not impossible to solve this, but there are reasons why the compression mechanisms used currently tend to only target bandwidth reduction. The second problem is that the rendering sequences you see with many apps require multiple concurrent Z buffers, e.g. a render sequence of render to main scene->render to texture->render to main scene->render to texture->render to main scene-> etc. requires a number of Z buffer context switches, which could take significant time and BW if you need to flush your current Z buffer to memory.

Not saying that these sort of problems are insurmountable, just pointing out practicalities.
JohnH is offline   Reply With Quote

Tags
early z, imr, memory bandwidth, tbdr, tessellation
