Why isn't TBDR used anymore?

I think given the large space needed for binning, the only way to feasibly do TBDR today and in the future is if you don't store the vertices. Disallow changing vertex buffers for the duration of a frame, and store a list of indices instead. Then you could get away with 4 bytes per vertex. By the time we can handle 100M polygons per frame, 400MB will be meaningless. Unfortunately, you have to transform each vertex twice this way, but doubling the setup silicon seems like a much smaller price to pay than the alternative.
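As a rough sanity check on the figures above, here is a back-of-the-envelope sketch (the polygon count and byte sizes are taken from the post as assumptions; the ~one-vertex-per-polygon ratio for indexed meshes is an illustrative simplification):

```python
# Rough memory math for the index-only binning scheme sketched above.
# All figures are illustrative assumptions, not measurements.

def binning_memory_bytes(entries, bytes_per_entry):
    """Total binning memory if each binned vertex costs bytes_per_entry."""
    return entries * bytes_per_entry

# 100M polygons with ~one 4-byte index per vertex -> the 400MB figure
index_scheme = binning_memory_bytes(100_000_000, 4)

# Same scene binned with full ~100-byte post-transform vertices instead
fat_scheme = binning_memory_bytes(100_000_000, 100)

print(index_scheme // 10**6, "MB vs", fat_scheme // 10**6, "MB")  # 400 MB vs 10000 MB
```

The 25x gap between the two schemes is the whole argument: indices stay cheap even at polygon counts where storing fat post-transform vertices becomes absurd.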
Well, this is only one of the possible "workarounds", but I fail to see the huge drawbacks here. You wouldn't have to transform shadow/simple polygons twice since they don't need much space anyway. 10 bytes should be enough for screen-space XYZ, and shadow maps are a separate render even. Shadow polygons already make up quite a significant portion of the overall polygon count.

And if you consider culling you could end up doing even less vertex work than performing the full transform on all vertices. Plus there are other benefits to be had from having the whole scene data, like the possibility of order-independent transparency.

Sure, huge amounts of embedded memory are nice, but certainly not free either.
 
CLX was way beyond Voodoo 2 and Voodoo 3, and a good match for the TNT2. I don't want to sound narky or protective (though I'm possibly both), but I don't really think this retrospective dismissal of the hardware, by using rather unfair comparisons and by picking randomised negative elements of games and pinning them unfairly on the graphics chip, is deserved.

CLX was a match for TNT2 in what way? Voodoo 3 was about equal to TNT2 (non-Ultra) performance-wise, though it fell way behind in features. (Shenmue wouldn't have been possible on a V3; I do remember at least a few high-res textures used in that game.)

Looking at old benchmarks online, the Kyro 1 (not 2) fell slightly above a Voodoo 4 and slightly below a Rage Fury MAXX in 16-bit performance. In 32-bit performance, it seemed to fall behind only the Voodoo 5 and GeForce 2 GTS, so it seems the Kyro line of cards was more image-quality focused than speed focused, contrary to pretty much all other card designs at the time (or at least speed with image quality maxed). Oh wait, the picture changes in Unreal engine games, where it matches a GeForce 2 GTS in a game that seems decidedly CPU limited. Still seems to be about GeForce 2 MX level overall.

Looking at the Neon 250 (higher clocked than CLX, right?), it seemed to have an isolated advantage in Quake 3, but even then it couldn't keep up with the cards of 1999 at anything above 640x480. It seems to take unusual performance hits when raising resolution; perhaps the card was severely memory-bandwidth or fillrate limited? (Since the same trends continue in 16-bit color, perhaps fillrate, or just driver limited.)
http://www.tomshardware.com/1999/09/29/videologic_neon_250_review/index.html
Tom's Hardware seems to show it as more of a TNT competitor than a TNT2 competitor, though I wouldn't rule out poor drivers. Isn't the Neon 250 about equivalent to the Kyro 1 spec-wise?
 
A TBDR might find a better balance of rendering approaches by being able to switch to immediate mode to adapt in extreme workload scenarios.

More savings can be afforded by freeing binning space in a hierarchy of tiles and by PowerVR's on-chip MRTs.

With the growth of polygon rates leveling off in graphics, pixel workload won't be left behind by vertex load.

The Neon 250 was a generation behind Kyro, Series 2 compared to Series 3. While the Neon 250 was clocked higher than the CLX2, changes made to reduce the die area left it with only about 2/3rds of the performance.

The chipset of NAOMI2 used only one SH-4.

The money Imgtec makes in the cellphone business comes from chip design wins that can each put them into over 500 million devices for a large win like a generation of OMAP, and PowerVR has been selected for both OMAP 2 and 3. Though royalties per unit are under $1, only a few mid-sized contracts would add up.

The reason Imgtec has stuck to cellphones and other constrained devices is that those are the markets in which their licensees operated, with the exception of Intel, which is now readying a PowerVR UMPC/laptop-level solution.

The claim of having used anisotropic filtering was made for Test Drive Le Mans on DC, but the claim of twenty million polygons per second was for Melbourne House's second PS2 racer, an F1 game.

The portable graphics offerings of nVidia, ATi, and Sony have taken only a few design wins from MBX. Even Sony uses MBX for its Ericsson phones instead of trying to keep to its mandate of sourcing chips from in-house and cutting down its large PSP technology for the constraints of the cellphone environment.

GameCube compares to the year-older NAOMI2 much like Xbox does: comparable with a different set of advantages. T&L speed would favor NAOMI2.
 
I thought the Naomi 2 only used one SH-4, but System 16 lists two.

And according to Wikipedia's article on PowerVR, there's a huge number of devices with MBX in them now, or some other PowerVR tech.

Additionally, I didn't realize at the time that the Neon 250/CLX is a single-pipeline design while the Kyro was two. A Kyro 2 alone should have more power than the Naomi 2 minus Elan, but on the plus side I'd imagine an Athlon or a Pentium 3 is a bit beefier than the SH-4.

Would T&L speed favor Naomi 2? IIRC, it had a fully programmable T&L unit, so while its performance can't be directly compared to a more fixed-function design, I don't recall it having very high numbers. System 16 says 10 million with 6 light sources; didn't Factor 5 say they achieved that on GameCube with the original Rogue Leader?
 
ELAN had a flexible fixed-function design where lights were just one type of modifier it supported among a more general range of possible modifiers.

The six simultaneous lights at 10M pps were full-complexity spot lights and were a conservative measurement, representing a level of speed that is said to have been unmatched all the way through the GeForce 4 generation.
 
Now that I have a little bit of time on my hands....
-Harder to make fast tbdr hardware.
Pardon? I presume you are referring to clock rate? Well, it is harder to make hardware with a higher clock rate than a lower one, but that does not necessarily give you good returns when you factor in the silicon costs. It is probably true that a TBDR may be a bit more complex than a standard renderer, but standard renderers are also jumping through hoops to lower overdraw.
-TBDR hardware doesn't give much of a benefit to pixel shaders.
How did you arrive at that conclusion? I would say you have that completely back to front.
-Vertex shaders are so fast now that the vertex load of a game isn't really a limiting function anymore.
I don't see how that is relevant. Besides, it's unlikely that vertex load has ever really been the major bottleneck in rendering. It's nearly always been fillrate.
Current hardware does tile, not in the same way but it still helps with memory bandwidth usage.
It does lower page breaks etc, yes.
The primary advantage of making TBDR hardware today would be lowering fillrate requirements, but you'd make a vastly weaker chip to do so.
Again, how do you come to these conclusions?


Perhaps.
I remember the add-in boards they made using those chips. I was tempted to get one myself because I'd seen Virtua Fighter, or maybe it was Soul Calibur or some such fighting game, on DC and thought it was pretty much the coolest, smoothest thing I ever saw.
However, I got discouraged by reports from all over that those boards were more trouble than not in many games.
Is it so wrong then to assume where there's smoke there's also fire?
There were a few items that caused problems from time to time.
  • Pre-Kyro, the chips did not provide a way to save out the Z-buffer, and some games absolutely insisted on reading back the odd pixel from the Z-buffer. This was almost always detrimental to performance on any architecture.
  • Some games insisted that there had to be hardware T&L, which was just ridiculous - it turned out that the x86 CPU was nearly always more than fast enough to do those calculations and run the game logic as well. The workaround was simply to lie to the application and push the vertices through the CPU.
  • Some games insisted on a particular (optional) texture format which might not have been supported.
There may have been some others but I can't recall what they were. They usually stemmed from not checking the DX caps flags and coding accordingly. <sigh>

One giant issue on consoles is the memory use of TBDR.

nAo (or DeanoC) mentioned he's pushing 2M polygons per frame in some parts of HS which would have to be binned for deferred rendering. Post-transform vertex size can easily be over 100 bytes, and I don't think Ninja Theory would be happy if they had 100-200MB less RAM to work with.
I have two issues with this:
* How do you come up with a figure of 100 bytes per vertex? If we assume that maps entirely to IEEE floats, that's ~25 values. If we assume 4 are for the position data, we end up with ~20 for colour and texture data. That seems to imply quite a number of texture layers. If this is the case, then the cost of vertex data is going to be insignificant compared to the time spent shading!

* The second problem I have with this is that you are assuming that you do have to keep all the data before rendering.
There are some ways to reduce this, like trying to separate the position and iterator parts of the vertex shader, or doing two passes over the geometry and storing a bitmask the first time, but it gets messy and either reduces vertex throughput or requires much more vertex-related silicon.
I'm sorry, but I really have no idea what you are trying to say here. :cry:
 
I have two issues with this:
* How do you come up with a figure of 100 bytes per vertex? If we assume that maps entirely to IEEE floats, that's ~25 values. If we assume 4 are for the position data, we end up with ~20 for colour and texture data. That seems to imply quite a number of texture layers. If this is the case, then the cost of vertex data is going to be insignificant compared to the time spent shading!
Well, 100 bytes per vertex doesn't seem too outlandish to me because of the way people might organize things; things particularly blow up when skinning meshes (though that's not the most common type of vertex, obviously). For example, Position (4) + Normal (4) + Tangent (4) + Binormal (4) + Blendweights (4) + Color (4) + Texture (16 for 4 UV sets) all adds up to 40 floats or 160 bytes per vertex. Bear in mind that a lot of people will use 4 components even if unnecessary in order to maintain byte alignment. Of course, you can shrink this down by various methods, but I've seen worse. At the last place I worked, they stored all the bone-pivot-relative positions for a vert, so that meant having to move up to 4 positions (with a blendweight in the 4th component of each) per vertex -- they also went through the trouble of having multiple formats in case a vert was only influenced by 3 or 2 bones, but that's up to 64 bytes in position alone.

If you consider most of the environment to consist of 1 position, 3 basis vectors, 1 color and 4 UVs, and you pack things down as much as possible (dropping basis vector components and getting it back by normalizing, assuming all UVs to be 2d, bitpacked 8888 color), you still get Position (4) + basis vectors (4) + Color (1) + UVs (8) = 17 floats or 68 bytes... 100 is not too far away. Hell, if you stored color as 4 floats rather than bitpacked, you'd get to 80 bytes just like that.
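The two layouts above can be added up mechanically. A minimal sketch (the attribute names and float counts come from the post; the dictionary representation is purely illustrative):

```python
# Sketch: summing vertex sizes from attribute layouts, mirroring the two
# examples above. Layouts are assumptions for illustration, not any
# particular engine's formats.

FLOAT = 4  # bytes per IEEE-754 single-precision float

def vertex_size(layout):
    """Sum the byte size of a vertex given {attribute: float_count}."""
    return sum(count * FLOAT for count in layout.values())

# The "blown up" skinned layout: 40 floats = 160 bytes
skinned = {"position": 4, "normal": 4, "tangent": 4, "binormal": 4,
           "blendweights": 4, "color": 4, "uv_4_sets": 16}

# The tightly packed environment layout: 17 floats = 68 bytes
# (color bitpacked 8888 into one 32-bit word, 2D UVs, trimmed basis)
packed = {"position": 4, "basis_vectors": 4, "color": 1, "uv_4_sets": 8}

print(vertex_size(skinned), vertex_size(packed))  # 160 68
```

Swapping the packed color for 4 floats bumps `packed` to 80 bytes, which is the "just like that" jump mentioned above.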

EDIT : Since they were mentioning interpolated values, then 100 bytes is actually quite small. However, I do agree that there isn't necessarily a need to keep backing things up because a tile isn't ready to commit.
 
* How do you come up with a figure of 100 bytes per vertex? If we assume that maps entirely to IEEE floats, that's ~25 values. If we assume 4 are for the position data, we end up with ~20 for colour and texture data. That seems to imply quite a number of texture layers. If this is the case, then the cost of vertex data is going to be insignificant compared to the time spent shading!
AFAIK, all interpolated values are 4D vectors, so that would only be about 5 values. Given today's pixel shaders, that's not out of the question at all.

* The second problem I have with this is that you are assuming that you do have to keep all the data before rendering.
Interesting. The most obvious thing to do is to save all interpolants. You're suggesting a workaround?
 
Interesting. The most obvious thing to do is to save all interpolants. You're suggesting a workaround?

You could recalculate vertices on demand, store them in a nice chunk of cache.

Render tiles in a way that maximizes spatial coherence and it won't be much worse than today's 2-pass rendering with initial Z-buffer priming, where you do vertex re-evaluation anyway.

Edit: You could pipeline the bejesus out of this and re-calculate vertices well ahead of rendering a tile.

Cheers
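A minimal sketch of the recompute-on-demand idea above, assuming a simple LRU cache keyed by vertex index (the `transform` callback stands in for the vertex shader; the class and its policy are illustrative, not any real hardware's design):

```python
# Sketch: instead of binning full post-transform vertices, keep only
# indices and re-run the vertex transform on a cache miss. Rendering
# tiles in a spatially coherent order keeps the hit rate high.

from collections import OrderedDict

class VertexCache:
    def __init__(self, transform, capacity=1024):
        self.transform = transform   # stand-in for the vertex shader
        self.capacity = capacity
        self.cache = OrderedDict()   # index -> transformed vertex (LRU)
        self.misses = 0

    def fetch(self, index, source_vertex):
        if index in self.cache:
            self.cache.move_to_end(index)   # refresh LRU position
            return self.cache[index]
        self.misses += 1
        v = self.transform(source_vertex)   # re-evaluate on demand
        self.cache[index] = v
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return v
```

With good tile ordering most fetches hit the cache, so the "transform twice" cost discussed earlier collapses toward one transform per vertex plus a small amount of re-evaluation at tile boundaries.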
 
Well, 100 bytes per vertex doesn't seem too outlandish to me because of the way people might organize things; things particularly blow up when skinning meshes (though that's not the most common type of vertex, obviously). For example, Position (4) + Normal (4) + Tangent (4) + Binormal (4) + Blendweights (4) + Color (4) + Texture (16 for 4 UV sets) all adds up to 40 floats or 160 bytes per vertex. Bear in mind that a lot of people will use 4 components even if unnecessary in order to maintain byte alignment.
But Mintmaster was talking about transformed vertices, where the driver can remove all unused components (and you don't need bone weights).

If you consider most of the environment to consist of 1 position, 3 basis vectors, 1 color and 4 UVs, and you pack things down as much as possible (dropping basis vector components and getting it back by normalizing, assuming all UVs to be 2d, bitpacked 8888 color), you still get Position (4) + basis vectors (4) + Color (1) + UVs (8) = 17 floats or 68 bytes... 100 is not too far away. Hell, if you stored color as 4 floats rather than bitpacked, you'd get to 80 bytes just like that.

EDIT : Since they were mentioning interpolated values, then 100 bytes is actually quite small. However, I do agree that there isn't necessarily a need to keep backing things up because a tile isn't ready to commit.
100 bytes is almost 50% more than 68 bytes (and often you don't even need vertex color). Count shadow vertices and other simple stuff, and your average should go down quite fast. I don't think transformed vertices are really significantly larger than untransformed ones.

AFAIK, all interpolated values are 4D vectors, so that would only be about 5 values.
They don't have to be. You're probably referring to assembly level shaders where the PS input registers are 4-component vectors. But that doesn't mean they can't be packed and unused components removed at a lower level.
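As a concrete example of this kind of packing, here's the "drop a component and recover it from unit length" trick mentioned earlier in the thread, sketched in Python (function names are made up for illustration; real hardware would do this in fixed function or shader code):

```python
# Sketch: store a unit vector (e.g. a normal) as two components plus a
# sign, then reconstruct the third from x^2 + y^2 + z^2 = 1.

import math

def pack_unit_vector(x, y, z):
    """Keep only x, y, and the sign of z (the sign fits in a spare bit)."""
    return (x, y, 1.0 if z >= 0 else -1.0)

def unpack_unit_vector(x, y, sign_z):
    """Recover z from the unit-length constraint; clamp guards rounding."""
    z = sign_z * math.sqrt(max(0.0, 1.0 - x * x - y * y))
    return (x, y, z)
```

This is why the 68-byte environment layout earlier could drop basis-vector components: the discarded value is implied by the other two up to sign.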
 
But Mintmaster was talking about transformed vertices, where the driver can remove all unused components (and you don't need bone weights).
I realized that after I posted originally [hence the edit part added].

100 bytes are almost 50% more than 68 bytes (and often you don't even need vertex color). Count shadow vertices and other simple stuff, and your average should go down quite fast. I don't think transformed vertices are really significantly larger than untransformed ones.
As a rule, I tend to find that I end up creating a lot more data in the end than I originally started with, though it definitely depends on the sort of approach you take and what sort of features you need to support. For instance, you're going to need to move world-space position through as an interpolated value as long as you want to support local light sources or point lights. And yeah, shadow verts and verts moved through in the Z prepass are all well and good, but a pass is a pass, and I would think that's the same whether it's a deferred or an immediate-mode renderer. You're not going to be rendering shadow maps in the same passes as your main render passes, so there's no point in worrying about the memory footprint of vertices in an earlier pass at that point -- they would be meaningless anyway. The *overall* average storage cost is not as important as the *current* average.

BTW, You'd be surprised what people use vertex color for. We have a need for 2 vertex color channels on the typical environment object. But they're somewhat... special... let's just say. And they're kind of used in an augmentative way that is independent of texture or screen space. But necessary nonetheless. But I do tend to draw a lot of examples from prior or current experience.
 
And yeah, shadow verts and verts moved through in the Z-Prepass are all well and good, but a pass is a pass, and I would think that's the same whether it's a deferred or an immediate mode renderer. You're not going to be rendering shadow maps in the same passes as your main render passes, so there's no point in worrying about the memory footprint of vertices in an earlier pass at that point -- they would anyway be meaningless. The *overall* average storage cost is not as important as the *current* average.
Yes, and then "2M polygons/frame" doesn't mean you have to actually store 2M polygons at any given time (and many of them can be culled, anyway).
A Z prepass just for efficiency is obviously a waste on TBDRs. If it's required for stencil, it and shadow volumes belong to the same render, though.

BTW, You'd be surprised what people use vertex color for. We have a need for 2 vertex color channels on the typical environment object. But they're somewhat... special... let's just say. And they're kind of used in an augmentative way that is independent of texture or screen space. But necessary nonetheless.
I think I've seen too many shaders to be surprised by anything like that. :D
Sure, some shaders might even use the full 10 interpolants + position that SM3 allows, that could be 176 bytes, and SM4 allows even more. But these are rather rare cases for some time to come.
 
Do TBDRs have a higher CPU load than IMRs? I'm wondering how much the CPU is involved in the binning process.
 
Yes, and then "2M polygons/frame" doesn't mean you have to actually store 2M polygons at any given time (and many of them can be culled, anyway).
That's partially what both myself and Mintmaster were trying to say, but the vast majority of people seem to be arguing otherwise.

A Z prepass just for efficiency is obviously a waste on TBDRs. If it's required for stencil, it and shadow volumes belong to the same render, though.
I wouldn't say that a Z-Prepass would be useless. I think it's just a matter of assumptions on the part of most people as to what "TBDR" must necessarily entail. And to add to the "simple verts" part, I was also including the likelihood of shadow map renders. Shadow volumes are a different story, but they aren't my cup of tea. I can think of many cases where shadow volumes are the only way to go, but I would prefer shadow maps any day of the week if they could be afforded.

I think I've seen too many shaders to be surprised by anything like that. :D
Sorry, but you did make it sound like vertex color is inherently not very useful, which I don't really think is the case.

Sure, some shaders might even use the full 10 interpolants + position that SM3 allows, that could be 176 bytes, and SM4 allows even more. But these are rather rare cases for some time to come.
I'll assume you mean "requires to meet spec" as opposed to "allows", as there exists SM3 hardware supporting more than 10 interpolants. But... uh... some? Rare? For the most part, I've never hit a case (in actual production titles) where the number of interpolants for a given material was more than enough. This is, of course, excluding the trivial cases, which I don't consider to be of concern in the first place because you do hardly any work on them anyway. I'd probably not be satisfied with fewer than 32 interpolants myself.
 
So a question: what does ATI/AMD plan to use for Fusion? System memory bandwidth doesn't seem very adequate, eDRAM is currently too small, and tiling would not play nice on the PC.

This seems to be the reason TBDRs are being used/licensed by Intel. If ATI stays the IMR course, how do they get around the fillrate issue?

(Yes, the initial Fusion processors will most likely be low-end GPUs not intended for serious gameplay and such. But 6-12GB/s shared with a multicore CPU seems paltry.)
 
You could recalculate vertices on demand, store them in a nice chunk of cache.

Render tiles in a way that maximizes spatial coherence and it won't be much worse than today's 2-pass rendering with initial Z-buffer priming, where you do vertex re-evaluation anyway.
For that matter, you might as well only project bounding volumes onto the tiles and associate volumes with draw calls.

Most any workaround I see, though, seems just as fast on an IMR (or could be made that way with a cache).

They don't have to be. You're probably referring to assembly level shaders where the PS input registers are 4-component vectors. But that doesn't mean they can't be packed and unused components removed at a lower level.
I s'pose if you laid down a formatting header at the beginning of the values, that would work.

Still, 20 values is not an overly large amount. It would be quite easy to hit that amount.
 
The numbers quoted about HS are slightly wrong... a few days ago I realized there's a bug in the code that computes the triangle count: in some cases it's closer to the 3M triangles per frame mark than 2M.
That said, I should also specify that, having to render (on average) 3 shadow maps and a Z prepass per frame, about 2/3 of those triangles are rendered via very simple vertices.
Vertex streams are simply compressed and split in two: one stream contains position (and bone indices and weights for skinned meshes), the other contains all the rest (normal, tangent, UV sets, etc.), so that in simple rendering passes we only fetch the first stream (8 bytes per vertex) and generate only one or two interpolated values (position is needed at least).
The story completely changes when it comes down to using both streams with complex vertex and pixel shaders; in this case it's not uncommon to generate 100 bytes worth of data per post-transformed vertex.
Having to render between 0.5M and 1M 'fat' vertices per frame would correspond to writing to and reading back from memory 50MB-100MB of data per frame... a few gigabytes per second.
It does not seem undoable to me from a bandwidth standpoint. Unfortunately it consumes a lot of memory, but if you design a new console from scratch based on a TBDR you can easily address the problem, IMHO.

Marco
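The "few gigabytes per second" claim above checks out with a quick sketch (the 30fps frame rate is an assumption; the post only gives per-frame figures, and write-once/read-once binning traffic is assumed):

```python
# Back-of-the-envelope check of the bandwidth numbers in the post above.

def binned_traffic_gb_per_s(verts_per_frame, bytes_per_vert, fps):
    """Bandwidth to write binned vertices once and read them back once."""
    per_frame = verts_per_frame * bytes_per_vert * 2  # write + read
    return per_frame * fps / 1e9

# 0.5M-1M "fat" 100-byte vertices per frame, assuming 30fps
low = binned_traffic_gb_per_s(500_000, 100, 30)
high = binned_traffic_gb_per_s(1_000_000, 100, 30)
print(low, "to", high, "GB/s")  # 3.0 to 6.0 GB/s
```

So the bandwidth cost is indeed modest by console standards; as the post says, it's the 50-100MB resident footprint, not the traffic, that hurts.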
 
That's partially what both myself and Mintmaster were trying to say, but the vast majority of people seem to be arguing otherwise.
Well, Mintmaster was the one saying storing all transformed vertices would need too much memory, so a TBDR needs to do more vertex processing/process vertices twice. I think neither is true.

I wouldn't say that a Z-Prepass would be useless. I think it's just a matter of assumptions on the part of most people as to what "TBDR" must necessarily entail. And to add to the "simple verts" part, I was also including the likelihood of shadow map renders. Shadow volumes are a different story, but they aren't my cup of tea. I can think of many cases where shadow volumes are the only way to go, but I would prefer shadow maps any day of the week if they could be afforded.
Ok, more precisely ;): you don't need a Z prepass on a TBDR that defers rendering to determine visibility. Except if the Z values are required for stencil ops.
I prefer shadow maps, too.

Sorry, but you did make it sound like vertex color is inherently not very useful, which I don't really think it is.
Maybe my wording wasn't that good, sorry. I don't think vertex color is useless, only that using interpolated colors is becoming less popular now that per-pixel lighting is dominant. Other interpolated attributes than color are becoming more important, though.

I'll assume you mean "requires to meet spec" as opposed to "allows", as there exists SM3 hardware supporting more than 10 interpolants. But... uh... some? Rare? For the most part, though, I've never hit a case (in actual production titles) where the number of interpolants for a given material were more than enough. This is of course, excluding the trivial cases which I don't consider to be of concern in the first place because you do hardly any work on them anyway. I'd probably not be satisfied with fewer than 32 interpolants myself.
The amount of work that needs to be done is not important when considering the amount of memory required by stored transformed vertices on average.
 
Well, Mintmaster was the one saying storing all transformed vertices would need too much memory, so a TBDR needs to do more vertex processing/process vertices twice. I think neither is true.
D'oh... not Mintmaster... I meant Simon. I guess Mintmaster came up in the context, so that was the name that popped up.

Ok, more precisely ;): you don't need a Z prepass on a TBDR that defers rendering to determine visibility. Except if the Z values are required for stencil ops.
Yeah, well... I guess I consider vertex throughput to be a weak point of all GPUs I've ever seen, so a Z-Prepass so that you never move and process that many "full-size" verts is generally a good thing.

Maybe my wording wasn't that good, sorry. I don't think vertex color is useless, only that using interpolated colors is becoming less popular now that per-pixel lighting is dominant. Other interpolated attributes than color are becoming more important, though.
I guess in my case, I just look at a color interpolator as 4 values in which I can store anything with range from 0..1.

The amount of work that needs to be done is not important when considering the amount of memory required by stored transformed vertices on average.
That's not quite what I was trying to say. I was saying that there are just so many trivial cases for specific passes (like shadow map passes) where you don't do much, and in turn don't need much data to work with. But in the actual *render* passes, I've never seen a case where the number of interpolants you can have is actually enough -- instead you tend more often to try and make do and squeeze in what you can.
 