Why isn't TBDR used anymore?

TBDR might always be more effective overall for the lifetime of rasterization because of its framebuffer RAM savings, especially considering AA and HDR, and because of the die area savings: the higher efficiency needs fewer execution units/ROPs, and the on-chip tile buffer is small.
Framebuffer savings are negligible compared to binning cost. 100 bytes per transformed vertex, sometimes 150, for 5 million polys in a scene (easily doable on G80), equals over half a gig of RAM. All workarounds have other drawbacks, as we discussed in that thread I pointed out earlier. On consoles it makes even less sense. The only reason it worked in the past was that poly counts were low and vertex size was small.
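For concreteness, here's that back-of-the-envelope math as a throwaway Python sketch; the per-vertex sizes and the roughly-one-stored-vertex-per-polygon assumption are just the ballpark figures above, not measurements.

Code:
# Rough sketch of the binning-memory argument; figures are ballpark
# assumptions (roughly one stored post-transform vertex per polygon).
def binning_memory_mb(polys, bytes_per_vertex):
    return polys * bytes_per_vertex / (1024 ** 2)

print(binning_memory_mb(5_000_000, 100))  # ~477 MB
print(binning_memory_mb(5_000_000, 150))  # ~715 MB, i.e. "over half a gig"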

Saying you need fewer execution units and ROPs is BS also. The early Z-culling of current chips is ridiculously fast, so TBDR has no advantage here. You can compare the Kyro to the GF2, but the latter was very inefficient (it was designed for speed in 16-bit rendering), and neither NV15 nor R100 had Z-culling abilities comparable to today's GPUs.

There's really only one big advantage of TBDR: bandwidth savings. It comes at the cost of vertex processing ability and increased vertex devoted silicon.

TBDR is not the panacea you think it is. There's a reason why neither ATI nor NVidia went this route.
 
You think if PC CPU clocks hadn't hit a scaling wall, we would have still seen the demise of Netburst? nVidia and ATI are the 800-ton gorillas, so it should go without saying that they will never "jump ship" until some punctuation occurs that says the status quo is failing.
ATI and NVidia are in vicious competition that is very closely related to perf/price ratio. Video card buyers are far more educated than CPU buyers on the whole, and the situation is entirely different from AMD vs. Intel, where the latter has huge marketing and monopoly advantages to boot. If TBDR gave even a mere 20% advantage (long term) for the same cost, ATI/NV would have gone for it.
 
ATI and NVidia are in vicious competition that is very closely related to perf/price ratio. Video card buyers are far more educated than CPU buyers on the whole, and the situation is entirely different from AMD vs. Intel, where the latter has huge marketing and monopoly advantages to boot. If TBDR gave even a mere 20% advantage (long term) for the same cost, ATI/NV would have gone for it.

^^ What he said.
 
ATI and NVidia are in vicious competition that is very closely related to perf/price ratio. Video card buyers are far more educated than CPU buyers on the whole, and the situation is entirely different from AMD vs. Intel, where the latter has huge marketing and monopoly advantages to boot. If TBDR gave even a mere 20% advantage (long term) for the same cost, ATI/NV would have gone for it.
Problem is that last part will never be known unless either of them tries it or some 3rd party comes in and actually shows it to be viable. Not one example has been on hardware that ever came within a hundred miles of nVidia or ATI spec-wise or cost-wise. They've all been low-end, cheap stuff, which only really shows that it's a win in otherwise constrained situations where performance would be far below abysmal otherwise, and we'd prefer to just keep it a little better than total suckage. There's not even an isolated example of how it works out with high-end hardware.

Granted, most GPUs generally work around the problems or have some internal hard limits which prevent a lot of otherwise powerful components from showing their mettle. Not that they are arbitrary or anything, but Murphy's Law always seems to dictate that those hurt you first.
 
One giant issue on consoles is the memory use of TBDR.

nAo (or DeanoC) mentioned he's pushing 2M polygons per frame in some parts of HS which would have to be binned for deferred rendering. Post-transform vertex size can easily be over 100 bytes, and I don't think Ninja Theory would be happy if they had 100-200MB less RAM to work with.
You design a system according to its requirements. So if it indeed were true that a TBDR required that much more memory for the binning, you'd add 128 or 256 MiB, maybe try to offset the cost with slower memory, and look again at what you've got. Memory isn't that expensive.

If you take a PS3 game for comparison, please subtract the substantial amount of memory required by an IMR for multisampling and the depth buffer. Or the shader instructions for NAO32 conversion. Also, given the massive amount of polygons that usually go into shadow rendering today, I'm not at all convinced 100 bytes per post-transform vertex is a realistic average.

The eDRAM of Xenos is not enough for rendering with AA, so you need to process vertices multiple times.
 
I don't think memory is that cheap, either. Adding another 128-256 MB is very significant, even if it's slower memory; after a year or so, using slower memory won't matter much anyway.

If you're comparing to the PS3, yes, I think TBDR looks a little more appealing because RSX is based on G70, which in turn was designed for a market where TBDR doesn't make sense. Even so, multisampling and the depth buffer are not very big for 720p. It's what, 25MB extra for 4x? And even at 50 bytes average, 2.5M polys is a lot of space! (BTW, I realized that many verts could be small, and that's why I said 100-200MB for 2.5M verts.)


As for Xenos, processing all vertices multiple times is a very silly way to do tiling unless you're completely strapped for time. It makes far more sense to do coarse object or sub-object based bounds testing, especially given the rapid speed at which this can be done on Xenon. This is only a tiny fraction of the work that Cell-based culling of individual polygons would entail. The huge size of the tiles means the number of objects straddling tile boundaries should not be high. Because API overhead is much lower on consoles, splitting an object's IB into locally confined (say, by a bounding sphere) sub-IBs should be both easy and low cost.
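A minimal sketch of that coarse test, assuming the object's screen-space bounding circle has already been computed and that the render target splits into a few big horizontal tiles (the three-band 720p/4xAA split is just an example layout):

Code:
# Toy version of coarse per-object tile testing: resubmit an object's sub-IB
# only to the tiles its projected bounding circle overlaps. The tile layout
# and the precomputed circle are illustrative assumptions.
def tiles_touched(cx, cy, radius, tiles):
    """tiles: list of (x0, y0, x1, y1) screen-space rectangles."""
    hit = []
    for i, (x0, y0, x1, y1) in enumerate(tiles):
        dx = max(x0 - cx, 0.0, cx - x1)   # distance from circle centre to rect
        dy = max(y0 - cy, 0.0, cy - y1)
        if dx * dx + dy * dy <= radius * radius:
            hit.append(i)
    return hit

# e.g. the three horizontal bands a 1280x720 4xAA target splits into on Xenos
bands = [(0, 0, 1280, 240), (0, 240, 1280, 480), (0, 480, 1280, 720)]
print(tiles_touched(640.0, 250.0, 30.0, bands))  # -> [0, 1]: straddles two tiles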

Xenos is a good compromise for this reason. You can stick with immediate-mode rendering if you don't want AA and still get TBDR bandwidth efficiency, or you can be more clever and get both AA and TBDR bandwidth efficiency with less vertex overhead. (Or, of course, you can get AA the lazy way by just sending everything multiple times.)
 
You design a system according to its requirements. So if it indeed were true that a TBDR required that much more memory for the binning, you'd add 128 or 256 MiB, maybe try to offset the cost with slower memory, and look again at what you've got. Memory isn't that expensive.

Bandwidth could be a problem. 2M vertices/scene @ 150 bytes/vertex @ 60Hz yields 36 GB/s (store and later reload) - and that's just the vertex data; then there are the indices for the vertices of each individual tile.
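Spelling that out (the per-vertex size and counts are the assumptions above; per-tile index traffic comes on top):

Code:
verts, bytes_per_vert, fps = 2_000_000, 150, 60
traffic = verts * bytes_per_vert * fps * 2   # written once, read back once
print(traffic / 1e9, "GB/s")                 # 36.0 GB/s for vertex data alone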

I'm still not convinced it's a bad idea though.

You save a ton of bandwidth on backbuffers. The reason we only have 8x MSAA today is bandwidth limitations.

Of course patents are another good reason for ATI^D^DMD and Nvidia to stay away from TBDRs.

Cheers
 
Why would they spend hundreds of millions of dollars to change something on which they've already spent billions to ensure it never changes?


You think if PC CPU clocks hadn't hit a scaling wall, we would have still seen the demise of Netburst? nVidia and ATI are the 800-ton gorillas, so it should go without saying that they will never "jump ship" until some punctuation occurs that says the status quo is failing.

It depends how much competition we're truly seeing in the market. If ATI and nvidia are both just content to one-up each other by 10% every couple of months, then maybe there is no incentive. But if there are truly gains to be had, I think you'd have seen at least one company pursuing the technology to differentiate their product. (Unless we are seeing a duopoly, but judging by ATI's loss of market share and margins over the past few years, you could argue against that.) Shame some form of TBDR didn't make it into the Wii; that would have been a good application of the technology, I think. (Anyone wanna argue the benefits of Flipper versus a theoretical Series 3-, 4-, or 5-based chip? Then again, we do have Naomi 2 w/ Elan, and I think GameCube clearly demonstrated an advantage over that... and even in embedded markets there's serious competition in the form of nvidia and ati's offerings, as well as the PSP's graphics chip.)

Problem is that last part will never be known unless either of them tries it or some 3rd party comes in and actually shows it to be viable. Not one example has been on hardware that ever came within a hundred miles of nVidia or ATI spec-wise or cost-wise. They've all been low-end, cheap stuff, which only really shows that it's a win in otherwise constrained situations where performance would be far below abysmal otherwise, and we'd prefer to just keep it a little better than total suckage. There's not even an isolated example of how it works out with high-end hardware.

Kyro 2 was probably the closest example, and sometimes it went toe to toe with a GeForce 2 GTS (or maybe even an Ultra), but there were other times it couldn't even keep up with a GeForce 2 MX. It was very game dependent, favoring the narrow corridor shooters of the time (many of which had software rendering modes and were poly constrained anyhow) over the newer emerging games with broad, wide-open expanses.
It also had rendering issues, even under Direct3D, and I don't know whether those were just driver bugs or unfixable aspects of the underlying tech short of developing the software specifically for the chip.

You could point to the CLX in the Dreamcast as a high end example, but it performed worse and looked worse in every game that was on both Dreamcast and PC. Not only that, but I wouldn't say there were any Dreamcast exclusives that really blew away anything on PC either. Shenmue had some horrible IQ problems, along with a bad framerate and some serious draw-in problems, and was more interesting for its scene and texture variety than for any isolated graphical element. IMO, Shenmue showed what a big budget could do for a game, not what the Dreamcast technology could enable over PC hardware of the time. (Sure, there are quoted specs for games that say they blow away any PC game of the time and would even have trouble running on PCs now - didn't Test Drive Le Mans claim something like 20 million polygons per second, at least on the PS2 version, along with 4x anisotropic filtering and such? The game looked good, but it's no GT4.)

That said, there is something sexy about TBDR. In general, I tend to favor solutions that add more onto the main chip silicon to reduce the load on external memory. Detaching from the external memory subsystem as much as possible just seems appealing, since you aren't reliant on fast and expensive RAM for optimal performance, and it seems like it would broaden the range of areas where the same chip could be used without castrating its performance. Why has Imgtech stayed out of the PC, console, and arcade markets? Are they really making that much money off of cell phones?
 
The Kyro 2 didn't have hardware T&L, which held it back in games that pushed a lot of polys.

Cheers
 
The Kyro 2 didn't have hardware T&L, which held it back in games that pushed a lot of polys.

Cheers

It also had trouble in flight sims that didn't support hardware T&L. A game like a flight sim would be pretty much the worst possible thing you could run on a TBDR, right? Almost no overdraw. There are a lot more games now with wide, open expanses than there were back in 1999/2000, mainly due to hardware T&L allowing more polys for such things, but it would seem such designs also limit overdraw.

How did the Kyro do on sales, btw? Anyone know? I strongly considered purchasing a Kyro 2, but its fairly low-end market target (oh, how I waited for a Kyro 2 Turbo) and lack of Glide (which was still very important to me at the time) made me hold off. Good thing, too; it gave me more money for the rapid advances in tech nvidia and ati would make over the next couple of years. Really, it seems like there was only a brief pause during the R300/FX cards, and everything else has progressed at a pretty breakneck pace.
 
You could point to the CLX in the Dreamcast as a high end example, but it performed worse and looked worse in every game that was on both Dreamcast and PC. Not only that, but I wouldn't say there were any Dreamcast exclusives that really blew away anything on PC either. Shenmue had some horrible IQ problems, along with a bad framerate and some serious draw-in problems, and was more interesting for its scene and texture variety than for any isolated graphical element. IMO, Shenmue showed what a big budget could do for a game, not what the Dreamcast technology could enable over PC hardware of the time. (Sure, there are quoted specs for games that say they blow away any PC game of the time and would even have trouble running on PCs now - didn't Test Drive Le Mans claim something like 20 million polygons per second, at least on the PS2 version, along with 4x anisotropic filtering and such? The game looked good, but it's no GT4.)

Those problems in Shenmue, like the draw distance, were a limitation of how many polygons the SH-4 could transform and send to the GPU. I doubt there were many games, if any, where the PowerVR chip was the bottleneck.
 
Those problems in Shenmue, like the draw distance, were a limitation of how many polygons the SH-4 could transform and send to the GPU. I doubt there were many games, if any, where the PowerVR chip was the bottleneck.

Was the SH-4 that much of a limitation? It had a high FLOPS rating, especially for the time.
OK, then there isn't really a single example of a high-end TBDR solution. Well, I guess there's Naomi 2, and though it had Elan, it was still paired with the same SH-4 (x2) as the Dreamcast, wasn't it? Even then, it's not like it saw extensive development. There was Virtua Fighter 4, 18 Wheeler, and a handful of others, and graphics ranged from Dreamcast level to... well, Virtua Fighter 4. Checking out System16.com, apparently there was eventually a Virtua Fighter 4: Final Tuned, which looks like it may have received a notable graphical upgrade over the original Virtua Fighter 4 release. Hard to tell from the screenshots, though, how it compares graphically to anything, and certainly Naomi 2 couldn't be considered high end by that point.
Comparing spec for spec, though, even assuming 3x overdraw puts it at least slightly below (if not more so) GameCube in everything except the amount of RAM. The general graphics quality of its games seems to support this as well, though VF4 Final Tuned might be a large outlier (or it may look almost identical to VF4). Not sure it could be considered high end anyway; even at release it was 1998 hardware, doubled, and then Elan added on.
 
As mentioned by other posters, the main problem with TBDR at the high end is the amount of vertex data that would have to be stored. Does anyone think it would be conceivably possible to store that data in a large cache, with the idea being to constantly eliminate all overdraw on the fly so that only the polygons in the final scene are stored (i.e. all polygons not seen - because of overdraw, because they were outside the view, or because they were backfacing - would not be stored)?

Oh, and I'm not suggesting that would be an efficient use of transistors, just whether or not it might be possible...
 
Mat3, a cache is a bad idea because we're talking about tens if not hundreds of MB needed for the binned polygon data, and it only needs to be written and read once per frame so it's a waste of the bandwidth that you get from a cache.

I think given the large space needed for binning, the only way to feasibly do TBDR today and in the future is if you don't store the vertices. Disallow changing vertex buffers for the duration of a frame, and store a list of indices instead. Then you could get away with 4 bytes per vertex. By the time we can handle 100M polygons per frame, 400MB will be meaningless. Unfortunately, you have to transform each vertex twice this way, but double the setup silicon seems like a lot smaller price to pay than the alternative.
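Something like this is what I have in mind - a sketch only, where the data structures, the triangle-to-tile mapping, and the memory estimate (which follows the rough 4-bytes-per-binned-vertex figure above) are all illustrative assumptions:

Code:
from collections import defaultdict

def bin_indices(triangles, tiles_of):
    """triangles: iterable of (i0, i1, i2) into a frozen vertex buffer;
    tiles_of: maps a triangle to the tile ids it overlaps (assumed given)."""
    bins = defaultdict(list)          # tile id -> flat list of 32-bit indices
    for tri in triangles:
        for t in tiles_of(tri):
            bins[t].extend(tri)       # only indices hit memory, not vertices
    return bins

# Memory at roughly 4 bytes per binned vertex:
print(100_000_000 * 4 / 1e6, "MB")    # 400.0 MB for 100M polys/frame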

For consoles, I really don't see any need for more than 4xAA at 1080p (console buyers aren't that picky), and a shared exponent format should give all the dynamic range you need in 32bpp. IMO, by next gen a full frame should fit in eDRAM or Z-RAM (64MB) without taking up excessive die space. Then TBDR will no longer be necessary, and all the remaining BW can be used for textures and the CPU.
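As a quick check of that figure, under one assumed layout (4xAA at 1080p, 32bpp shared-exponent colour plus 32-bit Z per sample) a full frame lands right around 64MB:

Code:
width, height, samples = 1920, 1080, 4
bytes_per_sample = 4 + 4   # 32bpp shared-exponent colour + 32-bit Z (assumed layout)
print(width * height * samples * bytes_per_sample / 2**20, "MiB")
# ~63.3 MiB (about 66 MB decimal), right around the 64 MB ballpark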
 
I think it's probably possible to make things a little more transparent than that by just treating a small tile cache as a cache line of sorts. You can still render in immediate mode, but because you only really write back on "eviction," you can still save bandwidth (or at least leave more available for texture bandwidth) and probably get away with saving some space by storing AA samples in the caches only and resolving on writeback (won't be ideal, I know, but not unspeakably awful). The worst-case scenario is when you are trying to render large polygons that end up spanning many tiles, which causes a lot of tiles to get evicted after only one write.
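A toy model of that flow, purely for illustration; tile granularity, the LRU policy, scalar "colours" and the averaging resolve are all simplifying assumptions of mine:

Code:
from collections import OrderedDict

class TileCache:
    """Immediate-mode rendering into a handful of on-chip tiles; AA samples
    live only in the cache and are resolved on eviction/write-back."""
    def __init__(self, max_tiles=8):
        self.max_tiles = max_tiles
        self.tiles = OrderedDict()            # tile id -> {pixel: [samples]}

    def write_sample(self, tile_id, pixel, colour, backing_store):
        if tile_id not in self.tiles and len(self.tiles) >= self.max_tiles:
            self._evict(backing_store)        # bandwidth cost is paid here only
        self.tiles.setdefault(tile_id, {}).setdefault(pixel, []).append(colour)
        self.tiles.move_to_end(tile_id)       # keep LRU order

    def _evict(self, backing_store):
        tile_id, pixels = self.tiles.popitem(last=False)   # least recently used
        # Resolve on write-back: one averaged colour per pixel leaves the chip.
        backing_store[tile_id] = {p: sum(s) / len(s) for p, s in pixels.items()}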

I do think that having a full framebuffer's worth of space in eDRAM is probably the best option overall, but given that we may be talking about things like 8xAA with 128-bit HDR at 2048x1536 or something, you're moving into ranges of hundreds of MB (384 MB in this case), and I don't see that being reasonable for eDRAM sizes within the next decade. Yeah, that sounds absurd, but the buying public is capable of the absurd and stupid -- they are consumers after all.
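For what it's worth, the 384 MB figure checks out for the colour samples alone at those sizes (Z would add more on top):

Code:
width, height, samples, bytes_per_sample = 2048, 1536, 8, 16   # 128-bit HDR
print(width * height * samples * bytes_per_sample / 2**20, "MB")  # 384.0 MB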
 
I think given the large space needed for binning, the only way to feasibly do TBDR today and in the future is if you don't store the vertices. Disallow changing vertex buffers for the duration of a frame, and store a list of indices instead. Then you could get away with 4 bytes per vertex. By the time we can handle 100M polygons per frame, 400MB will be meaningless. Unfortunately, you have to transform each vertex twice this way, but double the setup silicon seems like a lot smaller price to pay than the alternative.

That's what you do today when using a Z-prepass on IMRs, so it's clearly doable.

Cheers
 
Bandwidth could be a problem. 2M vertices/scene @ 150 bytes/vertex @ 60Hz yields 36 GB/s (store and later reload) - and that's just the vertex data; then there are the indices for the vertices of each individual tile.
Although your example is a little extreme, you bring up a good point.

For XBox 360, Microsoft clearly had a design goal to use a single 128-bit bus so that costs would be low and scaling would be easy. Even if the binning buffer was limited to 83MB, reading and writing it 60 times per second is 10 GB/s, which is nearly half the total bandwidth, and this doesn't even count the input vertex bandwidth. It doesn't look like it would be easy to get a TBDR design to be the same speed with the same silicon.
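Roughly (the 83MB bin size is hypothetical; 22.4 GB/s is the 360's GDDR3 peak, for scale):

Code:
bin_mb, fps, total_gb_s = 83, 60, 22.4
traffic_gb_s = bin_mb * 2 * fps / 1000          # written and read back every frame
print(traffic_gb_s, "of", total_gb_s, "GB/s")   # ~10 GB/s, nearly half the bus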
 
How did the Kyro do on sales, btw? Anyone know? I strongly considered purchasing a Kyro 2, but its fairly low-end market target (oh, how I waited for a Kyro 2 Turbo) and lack of Glide (which was still very important to me at the time) made me hold off. Good thing, too; it gave me more money for the rapid advances in tech nvidia and ati would make over the next couple of years. Really, it seems like there was only a brief pause during the R300/FX cards, and everything else has progressed at a pretty breakneck pace.

I had one. IMO it wasn't for the faint of heart; constant driver problems in the beginning.

Problem was that to get good performance you absolutely had to enable texture compression. Trouble was that a ton of games used two textures per poly: one for the base texture and one for the lightmap. Forcing compression caused the lightmaps to be compressed too, and compressing grayscale lightmaps with S3TC resulted in all kinds of colourful artifacts. On top of that, the compression was of course done on the fly by the CPU, netting you choppy gameplay.

They fixed all these issues in later driver revisions, and from that point on it absolutely rocked when paired with a high-performance (and overclocked) CPU.

I really thought it was pretty neat. Cheap, fast and low power. It was faster than the GF2s of the day in 80% of the games I played, and had it had a T&L engine it would have rocked even more.

But as you said, they didn't keep up.

I still think the technology has a lot of potential. The fact that every VLSI chip today is limited by power means that low-power approaches automatically look much more favourable.

Yes, there are costs associated with TBDRs, but they also enable things that are considered luxuries on IMRs, such as massive numbers of HDR (128-bit) render targets, massive amounts of MSAA, hardware sorting of transparent polys, and more.

Cheers
 
The most obvious point to consider is that the last mainstream PC card to use TBDR was released over 5 years ago, when its most immediate contemporary was the R200. Think how IMR technology has advanced in leaps and bounds since then, overcoming many of the problems inherent to IMRs.

Why, then, should we think that PowerVR hasn't also advanced to overcome some of the inherent TBDR problems? Admittedly, ImgTech haven't released anything for the PC market since 2001, but this doesn't necessarily mean that they have just been marking time. After all, SGX supports PS3.0, which is somewhat more than a small advance over Kyro, and we know that they are obviously aware of TBDR's difficulties. It seems perfectly logical to me that techniques to help ameliorate the geometry problems of TBDRs will have been researched.

Whether or not we'll ever see a TBDR back in the PC or console market remains to be seen, however!
 
You could point to the CLX in the Dreamcast as a high end example, but it performed worse and looked worse in every game that was on both Dreamcast and PC. Not only that, but I wouldn't say there were any Dreamcast exclusives that really blew away anything on PC either. Shenmue had some horrible IQ problems, along with a bad framerate and some serious draw-in problems, and was more interesting for its scene and texture variety than for any isolated graphical element. IMO, Shenmue showed what a big budget could do for a game, not what the Dreamcast technology could enable over PC hardware of the time. (Sure, there are quoted specs for games that say they blow away any PC game of the time and would even have trouble running on PCs now - didn't Test Drive Le Mans claim something like 20 million polygons per second, at least on the PS2 version, along with 4x anisotropic filtering and such? The game looked good, but it's no GT4.)

CLX was high end, very high end, when it came out. Higher end than any PC, arcade, or console part. Expecting PC ports (there were none to speak of in its early days) to outperform the PC (despite using very different hardware) means you can't help but judge the chip unfairly.

I had a Pentium 2 and a Voodoo 2 back in 1998 - it was the peak of what PCs had to offer. The DC, frankly, shit on it in terms of performance, features and IQ. Sonic Adventure back in 1998 and Soul Calibur in early 1999 were way beyond any games that I could buy for my PC. These certainly weren't PC ports.

Despite having mostly been developed in 1997 and 1998 (predating finalised hardware), when Shenmue arrived in 1999 it was again a leap beyond anything I'd seen on my PC (or running on my friends' Voodoo 3s, or on anything else).

Most of Shenmue's slowdown was caused by overloading the CPU (overclocking the CPU reportedly eliminates it), the character fade-in was caused by "streaming" characters in to save memory (don't confuse this with the game's actual "draw distance", which could be really quite large), and the IQ problems were due to not using mipmaps, something Sega had a tendency to do on games developed during 1997/1998 (House of the Dead, Bass Fishing).

CLX was way beyond Voodoo 2 and Voodoo 3, and a good match for the TNT2. I don't want to sound narky or protective (though I'm possibly both), but I don't really think this retrospective dismissal of the hardware - using rather unfair comparisons and picking random negative elements of games and pinning them unfairly on the graphics chip - is deserved.

Enough people dance on "high end TBDR" and the DC's graves without them forming a tag team. ;)
 