Why isn't TBDR used anymore?

That's what you do today when using a Z-prepass on IMRs, so it's clearly doable.

Cheers
Sure, but not all games use a pre-pass. Some sub-object level sorting would net you much of the benefit of a Z-prepass without doubling the geometry load, so it's not a given that you want to do it for high-poly games.

For a game that doesn't do a pre-pass, an equal vertex speed TBDR would need around 1.5x the setup rate to work with this index-only method (assuming half the polys are culled/clipped). But as I said earlier, you need to disallow changes to the vertex buffer and also separate the position and iterator shaders to guarantee space savings, so it doesn't fit the current programming model.
 
Ever since the DX9 era became mainstream, the data flow in the 3D graphics workload has changed significantly. Before that we could get away with one or two UV coordinates per vertex. Nowadays we have the vertex shader feeding all kinds of data to the PS, like a transformed basis, light positions and attenuation factors, more texture coordinates, eye position, and any number of other things that 3D graphics innovators want to feed the PS for unique effects. This is all in addition to the growth in triangle count.

With DX9, there was an explosion in the data flow right where it hurt TBDR most: Post-transformed vertices. The DX8/DX9 programming model also unified position and iterator calculations into one program stream, making it a bit harder for a TBDR to ameliorate the problem (and the presence of dynamic VBs doesn't help either).

The Neon 250...
NAOMI2...
Lazy8s, all your comparisons to past hardware are meaningless for this reason. Things have changed. Some of your claims are dubious too. A usable IMR fallback mode would need a lot of additions/changes (ROPs, memory controller, compression, etc.), and saying vertex count is leveling off is silly, especially if you compare its scaling to that of system memory capacity.

Why, then, should we think that PowerVR hasn't also advanced to overcome some of the inherent TBDR problems?
IMRs don't really have problems; TBDRs just had a few advantages back in the day. Since then IMRs have taken away some of those advantages through compression and high-speed early-Z culling, and even the effective BW limitation has been reduced circumstantially due to less blending than in the multipass days and the low incremental cost of a 256-bit bus. The TBDR problem of higher memory usage, however, is fundamental: no deferred renderer can avoid it.
 
Well, this is only one of the possible "workarounds", but I fail to see the huge drawbacks here. You wouldn't have to transform shadow/simple polygons twice since they don't need much space anyway. 10 bytes should be enough for screen space XYZ, and shadow maps are a separate render in any case. Shadow polygons already make up quite a significant portion of the overall polygon count.

And if you consider culling you could end up doing even less vertex work than performing the full transform on all vertices. Plus there are other benefits to be had from having the whole scene data, like the possibility of order-independent transparency.
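To make the 10-byte screen-space position above concrete, here is a minimal sketch of one packing that lands on 10 bytes; the field sizes are assumptions for illustration, not anything a particular TBDR is known to use:

```cpp
#include <cstdint>

#pragma pack(push, 1)
struct PackedScreenPos {    // 10 bytes per binned vertex position (assumed layout)
    uint16_t x;             // 12.4 fixed point: 4096-pixel range, 1/16-pixel precision
    uint16_t y;             // 12.4 fixed point
    uint32_t z;             // 32-bit depth
    uint16_t state;         // spare: e.g. a renderstate index (assumption)
};
#pragma pack(pop)
static_assert(sizeof(PackedScreenPos) == 10, "10 bytes per binned position");
```

Binning positions at roughly 10 bytes each is an order of magnitude cheaper than the ~100-byte "fat" post-VS vertices discussed further down.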

Sure, huge amounts of embedded memory are nice, but certainly not free either.
You're right, I shouldn't have said "twice for every vertex". However, remember that I'm saying this method is the way I think TBDR could be feasible, not impossible.

If you're going to store screen space position in addition to the renderstate index, you also need either indices or lots of duplicates, and even with the former, not only are triangle lists sometimes used, but triangle strips will get heavily chopped at the tile boundaries. 25 bytes per vertex seems like a reasonable average to me, and that sits right in the iffy area when you consider future scaling.

Oh yeah, when I said "smaller price to pay than the alternative", I was comparing it to alternative ways of doing TBDR, not comparing to Xenos in that particular paragraph.

* How do you come up with a figure of 100 bytes per vertex?
100 bytes are almost 50% more than 68 bytes (and often you don't even need vertex color).
As nAo and SMM have attested to, it's really not that hard. Remember that the reason for not using ps2.0 instead of ps3.0 for that Farcry patch basically boiled down to the two extra iterators in the latter. 100 bytes average for iterators plus position, polygon indices, and renderstate index is very reasonable.
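For a sense of where a ~100-byte average could come from, here is one assumed mix of PS inputs of the kind listed earlier (transformed basis, light vectors, eye position, texture coordinates); the exact fields are illustrative, not figures from the thread:

```cpp
// One plausible post-VS "fat" vertex a binner would have to store (assumed mix).
struct BinnedVertex {
    float clipPos[4];          // 16 bytes: position for setup/clipping
    float tangentBasis[3][3];  // 36 bytes: transformed basis for normal mapping
    float lightVec[4];         // 16 bytes: light vector + attenuation factor
    float eyeVec[3];           // 12 bytes: eye/view vector
    float uv0[2], uv1[2];      // 16 bytes: two texture coordinate sets
};                             // 96 bytes, ~100 with indices and renderstate index amortised
```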

Having to render between 0.5M and 1M 'fat' vertices per frame would correspond to writing to and reading back from memory 50MB-100MB of data per frame.. a few gigabytes per second.
It does not seem undoable to me, from a BW standpoint. Unfortunately it consumes a lot of memory, but if you design a new console from scratch based on TBDR you can easily address the problem imho.

Marco
100MB * 60fps * 2 (read/write) = 12 GB/s! For a lower cost console like XB360 (which is still fairly pricey), that's a huge chunk of BW, the very thing TBDR is claimed to save. How exactly do you "easily address the problem" without increasing cost, especially the memory usage? Binning only indices and/or position is the only thing that makes sense.
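To spell that out, a rough back-of-envelope in code; the double-buffering of the bin (building frame N+1 while rendering frame N) is an assumption about how a pipelined TBDR would work, not something stated above:

```cpp
#include <cstdio>

int main() {
    const double verts      = 1e6;    // binned post-VS vertices per frame (upper figure above)
    const double bytesPer   = 100.0;  // 'fat' vertex size discussed above
    const double frameMB    = verts * bytesPer / 1e6;    // ~100 MB written per frame
    const double bwGBs      = frameMB * 60 * 2 / 1e3;    // write + read back at 60 fps -> ~12 GB/s
    const double residentMB = frameMB * 2;               // assumed double-buffered bin -> ~200 MB
    printf("%.0f MB/frame, %.0f GB/s, ~%.0f MB resident\n", frameMB, bwGBs, residentMB);
    return 0;
}
```

The resident footprint is the part that is hard to hide on a console with 512 MB of total memory.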

I think for this generation TBDR might have worked this way (with appropriate programming model changes), but under an equal cost scenario, it is far from certain you'd get an improvement in overall performance. Xenos is about as TBDR-esque as you could go given today's reality (porting to IMRs, IMR based engines, shader programming model, etc).
 
regarding the space required to store binned vertex data, wouldn't it be plausible to compress this data, given that we already use compression in most other space and bandwidth limited areas of graphics hardware?
 
As nAo and SMM have attested to, it's really not that hard. Remember that the reason for not using ps2.0 instead of ps3.0 for that Farcry patch basically boiled down to the two extra iterators in the latter. 100 bytes average for iterators plus position, polygon indices, and renderstate index is very reasonable.
Perhaps I was being too subtle for this board. If there are a lot of values per vertex, it should imply a large amount of work per pixel. If the pixel is obscured....
 
Perhaps I was being too subtle for this board. If there are a lot of values per vertex, it should imply a large amount of work per pixel. If the pixel is obscured....
With modern hw, roughly sorting things front to back already helps a lot; often (even with complex pixel shaders) a Z-prepass does not really give you an overall speed-up.
In fact, in our case it can sometimes slow things down a bit.. but we are doing it anyway because it accelerates our deferred shadows as well..
 
But as I said earlier, you need to disallow changes to the vertex buffer and also separate the position and iterator shaders to guarantee space savings, so it doesn't fit the current programming model.
You don't have to disallow vertex buffer changes, you have to buffer them. But IMRs have to solve the same problem since they buffer render commands as well. So a vertex buffer can still be "in use" when you want to change it.
Separating the position calculation from the rest in the vertex shader isn't exactly rocket science either. It comes down to dead code elimination. Of course it can result in some duplicate code, so you have to hope not having to run the full vertex shader on all vertices makes up for that.
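A toy CPU-side sketch of that split, purely to illustrate the idea; the structures and the simple transform here are assumptions, not any real shader compiler's output:

```cpp
struct VertexIn  { float pos[3], normal[3], uv[2]; };
struct VertexOut { float clipPos[4], worldNormal[3], uv[2]; };

static void mulVec4(const float m[16], const float v[4], float out[4]) {
    for (int r = 0; r < 4; ++r)
        out[r] = m[r*4+0]*v[0] + m[r*4+1]*v[1] + m[r*4+2]*v[2] + m[r*4+3]*v[3];
}

// Position-only variant: everything not feeding clipPos is dead code and gets stripped.
// This is what would run up front for binning and culling.
void shadePosition(const float mvp[16], const VertexIn& in, float clipPos[4]) {
    const float p[4] = { in.pos[0], in.pos[1], in.pos[2], 1.0f };
    mulVec4(mvp, p, clipPos);
}

// Full variant, run only on vertices that survive culling. Note the position
// transform is duplicated work, which is the trade-off mentioned above.
void shadeFull(const float mvp[16], const float world[16],
               const VertexIn& in, VertexOut& out) {
    shadePosition(mvp, in, out.clipPos);
    const float n[4] = { in.normal[0], in.normal[1], in.normal[2], 0.0f };
    float wn[4];
    mulVec4(world, n, wn);
    out.worldNormal[0] = wn[0]; out.worldNormal[1] = wn[1]; out.worldNormal[2] = wn[2];
    out.uv[0] = in.uv[0]; out.uv[1] = in.uv[1];
}
```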
 
IMRs don't really have problems
Framebuffer bandwidth wasn't going to be an issue anymore because shaders were becoming more complex ... then came multisampling. Framebuffer bandwidth wasn't going to be an issue anymore because shaders were becoming more complex ... then came HDR and large amounts of shadow maps. For the moment, it's still a bottleneck.
 
regarding the space required to store binned vertex data, wouldn't it be plausible to compress this data, given that we already use compression in most other space and bandwidth limited areas of graphics hardware?


Would compression be possible? As the transformed vertices come in from the vertex shaders, they get marked for which tiles they fall into, and then get stored in memory, so wouldn't you only be able to compress very small groups at a time?
 
Would compression be possible? As the transformed vertices come in from the vertex shaders, they get marked for which tiles they fall into, and then get stored in memory, so wouldn't you only be able to compress very small groups at a time?

i have no idea. but i (and many here, apparently) considered compressing color (or framebuffer) data to be undoable in the near future not too long ago. funny, looking back, that thread was in march of 2002, and the radeon 9700 was released august of that year with... lossless color compression in the framebuffer. it just seems logical that someone (much smarter than me, of course) would have come up with the same idea, that there would either need to be a massive buffer to store the vertex data, or some sort of compression to save space.
 
Framebuffer bandwidth wasn't going to be an issue anymore because shaders were becoming more complex ... then came multisampling. Framebuffer bandwidth wasn't going to be an issue anymore because shaders were becoming more complex ... then came HDR and large amounts of shadow maps. For the moment, it's still a bottleneck.
MfA, don't selectively quote me like that. I fully acknowledged that TBDR has advantages in the very same damn sentence.

My point is that the advantages of TBDR are heavily reduced now, and the costs have gone up. And your points are pretty weak anyway. Multisampling doesn't have as high a bandwidth cost as you think. Just look at AA scaling for 128-bit and 256-bit versions of the same/similar core (e.g. 9700 vs. 9500 Pro, 7600 vs. 6800, etc). The difference is quite small. And as long as we're discussing alternative hardware, a shared-exponent 32 bpp format would be fine. Even XB360's FP10 format has an 8000:1 dynamic range, which is plenty as I've argued before. No camera has a tone-map curve even close to that, yet we're years away from approaching photorealism. I think the only reason a shared-exponent format hasn't been introduced is that devs have latched onto FP16, and the higher-end hardware needed for it keeps NVidia and ATI in business.
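For readers unfamiliar with the idea, here is a minimal sketch of what a 32 bpp shared-exponent HDR encoding looks like. It is loosely modelled on the later RGB9E5 layout, not on Xenos's FP10 (which is a per-channel 7e3 format), and the simplified encoder below just clamps mantissas instead of handling the rounding-overflow edge case:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Pack three HDR channels into 32 bits: 9-bit mantissas plus one shared 5-bit exponent.
uint32_t encodeSharedExp(float r, float g, float b) {
    const int   N = 9, B = 15, Emax = 31;   // mantissa bits, exponent bias, max exponent
    const float maxVal = (511.0f / 512.0f) * std::ldexp(1.0f, Emax - B);
    auto clampc = [&](float c) { return std::min(std::max(c, 0.0f), maxVal); };
    const float rc = clampc(r), gc = clampc(g), bc = clampc(b);
    const float maxc = std::max(rc, std::max(gc, bc));
    if (maxc == 0.0f) return 0;
    const int   e     = std::max(-B - 1, (int)std::floor(std::log2(maxc))) + 1 + B;
    const float scale = std::ldexp(1.0f, e - B - N);      // one scale shared by R, G and B
    auto quant = [&](float c) { return std::min(511u, (uint32_t)(c / scale + 0.5f)); };
    return quant(rc) | (quant(gc) << 9) | (quant(bc) << 18) | ((uint32_t)e << 27);
}
```

The point is simply that you get a wide dynamic range at the same 4 bytes per pixel as RGBA8, at the cost of the three channels sharing one exponent.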
 
i have no idea. but i (and many here, apparently) considered compressing color (or framebuffer) data to be undoable in the near future not too long ago. funny, looking back, that thread was in march of 2002, and the radeon 9700 was released august of that year with... lossless color compression in the framebuffer. it just seems logical that someone (much smarter than me, of course) would have come up with the same idea, that there would either need to be a massive buffer to store the vertex data, or some sort of compression to save space.
Compressing the color buffer is 'easy', especially when you're using multisampling as the vast majority of samples belonging to a pixel will share the same color ('trivial' compression..)
Compressing post-transform attributes is certainly doable but it's not trivial at all. Lossless compression in this case is difficult, and lossy compression can severely impact quality in many cases. Since the hw has no a priori knowledge about the 'meaning' of each post-transform attribute, it's not clear what kind of lossy algorithm should be employed; it also seems to me that different attributes would require different techniques..
Maybe devs could 'tag' those attributes that don't need a full-precision representation (such as colors..)
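A minimal sketch of what acting on such a tag could look like: if the app marks an interpolant as low precision (a colour, say), the binner could store it as 4 bytes instead of 16. This is purely illustrative; no real hardware path is implied:

```cpp
#include <algorithm>
#include <cstdint>

// Quantise a float4 colour interpolant to 8 bits per channel (16 bytes -> 4 bytes).
uint32_t quantizeUnorm8x4(const float c[4]) {
    uint32_t packed = 0;
    for (int i = 0; i < 4; ++i) {
        const float clamped = std::min(std::max(c[i], 0.0f), 1.0f);
        packed |= (uint32_t)(clamped * 255.0f + 0.5f) << (8 * i);
    }
    return packed;
}
```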

Marco
 
You don't have to disallow vertex buffer changes, you have to buffer them. But IMRs have to solve the same problem since they buffer render commands as well. So a vertex buffer can still be "in use" when you want to change it.
Separating the position calculation from the rest in the vertex shader isn't exactly rocket science either. It comes down to dead code elimination. Of course it can result in some duplicate code, so you have to hope not having to run the full vertex shader on all vertices makes up for that.
Buffering dynamic VBs is common on the PC, but on consoles it really seems like a waste of memory, especially if you're doing some fancy stuff on lots of vertices with, say, Cell. IMRs at least have the option of not buffering. Moreover, efficient TBDR bins one frame while rendering another, so you need to keep the copies around longer. But this isn't one of my main points, so it's not really worth discussing much longer.

I know the separation of the position calculation is pretty simple, but I was simplifying things. You have scattered, almost random access patterns to the input streams when you don't bin post-VS verts. <EDIT>Forgot something: you also have to worry about clipped triangles (guard band, clip planes) in screen space, which means having to run the VS on the original vertices and reinterpolating.</EDIT> Maybe you're right and that hope does come to fruition most of the time, but I still don't think it's a given you'll have less vertex work. I do know that current IMR hardware similarly tries to remove excess VS calculations by using a driver-aided mark of the last instruction involved in position calcs, so it's not unique to TBDR.
 
Even XB360's FP10 format has an 8000:1 dynamic range, which is plenty as I've argued before.
I agree more or less with everything you wrote, just want to underline the fact that dynamic range alone is not a good metric to judge a color buffer format; we could easily construct an 8 bits per pixel format with an 8000:1 dynamic range ;)

Marco
 
I agree more or less with everything you wrote, just want to underline the fact that dynamic range alone is not a good metric to judge a color buffer format; we could easily construct an 8 bits per pixel format with an 8000:1 dynamic range ;)

Marco
Okay, you're right, but as long as you can find inputs that hit most of the 16.8 million colours on the 24-bit output of your tone map, you have enough resolution to go with your dynamic range. I think FP10 and the various shared-exponent proposals will satisfy that requirement for most tone maps.

Good catch, though. Something I haven't thought about much.
 
IMRs don't really have problems; TBDRs just had a few advantages back in the day. Since then IMRs have taken away some of those advantages through compression and high-speed early-Z culling, and even the effective BW limitation has been reduced circumstantially due to less blending than in the multipass days and the low incremental cost of a 256-bit bus. The TBDR problem of higher memory usage, however, is fundamental: no deferred renderer can avoid it.

With poly sizes approaching pixel size in this high-polygon scenario, your Z-buffer compression ratios deteriorate rapidly. There's a pain threshold here for both IMRs and TBDRs.

The reason nVidia decided coverage sampling is a good idea now (they clearly thought it was pants back in the Matrox FAA days) is bandwidth. The reason limited-precision backbuffers (FP10, FP16, nAo32 etc) are a good idea is bandwidth. And if the cost of a 256-bit bus is so low, how come all low and mid-range cards have 128-bit buses?

How much memory is needed for a reasonable number of 1920x1080 128-bit rendertargets with 8x MSAA? How much bandwidth does that equate to on an IMR? Would a TBDR targeted at 2 million (shaded, full-parameter) tris/scene have much lower bandwidth requirements? Hell yeah!
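As a rough back-of-envelope for the question above (assuming an uncompressed layout and a 32-bit depth/stencil sample; real IMRs compress, so treat these as upper bounds):

```cpp
#include <cstdio>

int main() {
    const double samples = 1920.0 * 1080.0 * 8;            // ~16.6M samples per MSAA target
    const double colorMB = samples * 16 / (1024 * 1024);   // 128-bit colour: ~253 MB
    const double depthMB = samples * 4  / (1024 * 1024);   // 32-bit Z/stencil: ~63 MB
    printf("~%.0f MB colour + ~%.0f MB depth per 8x MSAA target\n", colorMB, depthMB);
    return 0;
}
```

A TBDR would keep the multisampled colour and depth for one tile on chip and write out only resolved pixels, which is where the bandwidth saving being pointed at here comes from.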

Cheers
 
Maybe you guys do some of the number crunching?

e.g. Take a 1920x1080 framebuffer with 2M triangles, 8x MSAA, and FP16 and calculate the memory footprint and bandwidth requirements for an IMR and TBDR.

This would be helpful to see where the "cost" and "savings" are for each design, and how they may scale in the future.

Kudos for the interesting thread guys :)
 
Also, how well do current IMRs handle overdraw with early Z-buffer checks (typically)? If we consider a TBDR to be 100% efficient, would IMRs be about 75%? If the game engine was doing a perfect job rendering from front to back, would it be 100%?
 
Mat3, early Z rejection rates on current hardware are stupendously high. R4xx, R5xx, and Xenos are 256 pixels per clock, not sure about G80 or G7x/RSX but it's up there. Basically you can add hidden triangles to a scene and your cost will barely be any more than the setup cycles.

As an example, if you had a scene that ran at 60fps at 720p, you could add one hundred full screens worth of hidden pixels every frame and your framerate will only drop theoretically to 57 fps. In actuality it'll be a little bigger drop, but you can see that the pixel cost is very little. The polygon cost is there for both TBDR and IMR.
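Checking that figure with some assumed numbers (720p, the 256 rejected pixels per clock stated above, and a ~500 MHz core clock, which is an assumption):

```cpp
#include <cstdio>

int main() {
    const double pixelsPerScreen = 1280.0 * 720.0;              // ~0.92M pixels at 720p
    const double rejectClocks    = 100 * pixelsPerScreen / 256; // ~360k clocks for 100 hidden screens
    const double extraMs         = rejectClocks / 500e6 * 1000; // ~0.72 ms at 500 MHz
    const double newFps          = 1000.0 / (1000.0 / 60.0 + extraMs);
    printf("%.2f ms extra -> %.1f fps\n", extraMs, newFps);     // ~57.5 fps
    return 0;
}
```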

The bigger issue for efficiency is how effective rough front to back sorting is (if you don't do a Z pre-pass), and that really depends on the scene and how well you can chunk up your objects to do this rough sorting.
 
Maybe you guys do some of the number crunching?

e.g. Take a 1920x1080 framebuffer with 2M triangles, 8x MSAA, and FP16 and calculate the memory footprint and bandwidth requirements for an IMR and TBDR.
By the time we're doing 1080p with 8xMSAA and FP16, I hope we're doing more than 2M triangles per frame :) I personally think 4xMSAA @ 1080p with FP10/RGBE is plenty even for next gen. Software and rendering techniques are so much more important than a barely perceptible increase in resolution and AA.

Anyway, as you can see in this thread, the numbers depend on the method of TBDR and specific workload of a game.
 