Why isn't TBDR used anymore ?

Panajev2001a · Jan 15, 2007

Mintmaster said:
Mat3, early Z rejection rates on current hardware are stupendously high. R4xx, R5xx, and Xenos are 256 pixels per clock, not sure about G80 or G7x/RSX but it's up there. Basically you can add hidden triangles to a scene and your cost will barely be any more than the setup cycles.

The rejection rate is important, but shouldn't we also think about the efficiency of the rejection test itself ?

Pixel Shader ALU's efficiency is not only affected by how many pixels can be early-z rejected per clock, but also how large is the granularity (batch size) of these rejection tests (if the batch size is too large there will be a higher probability than you will have to process that batch further down the rendering pipeline either relying on branching in the pixel shader code, if you so programmed, or relying on the hardware just discarding those occluded pixel later in the pipeline and after the pixel shader program has been run on them).

Mintmaster · Jan 15, 2007

Gubbi said:
With poly size approaching pixel sizes in this high polygon scenario, your Z-buffer compression ratios deteriorates rapidly. There's a pain threshold here for both IMRs and TBDRs.

When you have a clump of tiny polygons, I doubt memory bandwidth is your limiting factor. It'll be setup rate for sure. Besides, I was just pointing out that comparing a 2 pipeline Kyro to a 4 pipeline GF2 really overestimates the advantage of a TBDR.

And if the cost of a 256bit bus is so low, how come all low and mid-range cards have 128bit busses ?

You're missing my point. Even 128-bit gives lots of bandwidth per pipe per clock for the low-end cards. What I'm saying is that die size is increasing, so not only do we have more pads/pins available, but the relative cost of going from 128-bit to 256-bit is not so big now.

How much memory is needed for a reasonable number of 1920x1080 128bit rendertargets and 8 x MSAA? How much bandwidth does that equate to on an IMR ? Would a TBDR targetted at 2 million (shaded, full parameter) tris/scene have much lower bandwidth requirements? Hell yeah!

Cheers

1080p 128bit with 8xMSAA? WTF is that for?!? And we're still stuck at 2M tris in this whack scenario?

Mintmaster · Jan 15, 2007

Panajev2001a said:
The rejection rate is important, but shouldn't we also think about the efficiency of the rejection test itself ?

Hence my next two paragraphs. Even 10% efficiency (and I doubt it's that low) makes the rejection time negligible, as we're rarely going to see 10x overdraw, let alone 100x. Moreover, reduced rejection efficiency tends to happen when there's smaller triangles, and you have to set those triangles up anyway.

As for granularity, PC cards have quad-level early-Z to handle the pixels that evade the coarser hyper-fast rejection. It'll probably be only 16 or 32 pix/clk, but it won't matter at this stage.

Panajev2001a · Jan 15, 2007

Mintmaster said:
Hence my next two paragraphs. Even 10% efficiency makes the rejection time negligible, as we're rarely going to see 10x overdraw, let alone 100x. Moreover, reduced rejection efficiency tends to happen when there's smaller triangles, and you have to set those triangles up anyway.

True, but it is not the only potential problem (and reduced efficiency also depends on the size of the batch of pixels we test for rejection): can we say for sure that there are not probable and possible scenarios where Vertex processing speed and Triangle set-up speed are enough, but very long pixel shader programs might be causing a problem with all the pixels we have to throw away at the end of the pipeline ?

I would love to see a company with bright minds such as IMG. Tech. to have the resources to try to push again for a very high-end part and manufacturing partners making it possible: like x86's history has taught us, tons of man hours from many very bright circuit engineers/cpu architects/compiler writers/etc... can take even something with more than one potential disadvantages/hard problems to solve (compared to advanced RISC designs, such as Alpha, POWER, SPARC, etc...) and turn it into a performance : price ratio colossus (can we really compare a 486/Pentium MMX with a Core 2 Duo or with an Athlon 64 X2?), why can't an already neat idea such as Deferred Renderers evolve too ?

IMR's have evolved a lot since the TNT1/Voodoo days (can we really compare a GeForce 8800 with one of those GPU's ?) thanks to having the advantage of being the dominant architecture in which HUGE companies have invested HUGE amount of R&D dollars, until TBDR's can really get some traction besides portable devices, we might see an IMR only future ahead in the high performance, non-embedded space.

Jawed · Jan 15, 2007

Panajev2001a said:
The rejection rate is important, but shouldn't we also think about the efficiency of the rejection test itself ?

Pixel Shader ALU's efficiency is not only affected by how many pixels can be early-z rejected per clock, but also how large is the granularity (batch size) of these rejection tests (if the batch size is too large there will be a higher probability than you will have to process that batch further down the rendering pipeline either relying on branching in the pixel shader code, if you so programmed, or relying on the hardware just discarding those occluded pixel later in the pipeline and after the pixel shader program has been run on them).

Yeah, there's several ATI patent applications on the subject of improving efficiency and scaling of early Z rejection:

http://www.beyond3d.com/forum/showthread.php?t=32653

Clearly, NVidia revisited this topic in G80 too, to make seemingly dramatic improvements.

Jawed

blakjedi · Jan 15, 2007

Mintmaster said:
For consoles, I really don't see any need for more than 4xAA at 1080p (console buyers aren't that picky), and a shared exponent format should give all the dynamic range you need in 32bpp. IMO, by next gen a full frame should fit in eDRAM or Z-RAM (64MB) without taking up excessive die space. Then TBDR will no longer be necessary, and all the remaining BW can be used for textures and the CPU.

One of the best answers to what devs will want/need in a next gen console that I have read. Thanks.

Gubbi · Jan 16, 2007

Mintmaster said:
You're missing my point. Even 128-bit gives lots of bandwidth per pipe per clock for the low-end cards. What I'm saying is that die size is increasing, so not only do we have more pads/pins available, but the relative cost of going from 128-bit to 256-bit is not so big now.

Every single card, from the lowest low to the highest high end is limited by bandwidth in all but corner cases.

The number of pads on the die is what enables the wide busses. It does not drive cost down. Boutique memory on many-layer PCBs is what defines cost at the high end. And it is not getting any cheaper!.

Bandwidth is what defines price points on GPUs today. You build a memory system, and slap a chunk of silicon on it that will saturate it. Look how similar X1800s and X1900s are priced, yet X1900 has three times the shading power.

Mintmaster said:
1080p 128bit with 8xMSAA? WTF is that for?!? And we're still stuck at 2M tris in this whack scenario?

It's only whack because of the massive performance penalty that is associated with it because of bandwidth limitations. It's a IMR mindset.

Cheers

nAo · Jan 16, 2007

Gubbi said:
Every single card, from the lowest low to the highest high end is limited by bandwidth in all but corner cases.

I respectfully disagree, in my experience this is not true, not in the general case for sure.
There are rendering phases that are usually bw limited (think about drawing particles) but often you get limited by something else before hitting bw problems.

Blazkowicz · Jan 16, 2007

Joshua Luna said:
So a question: What does ATI/AMD plan to use for Fusion? System memory bandwidth doesn't seem very adequate and eDRAM is currently too small and tiling would not play nice on the PC.

This seems to be the reasons the TBDR are being used/licensed by Intel. If ATI stays the IMR course, how do the get around the fillrate issue?

(Yes, the initual Fusion processors will most likely be low end GPUs not intended for serious gameplay and such. But 6-12GB/s shared with a multicore CPU seems paltry).

definitely system memory in my opinion. that should be an AM3 chip, so on a AM3 motherboard with DDR3 it would get more like 16-20GB/s. and 8-12GB/s when plugged on an AM2 socket (if you consider most PC come with DDR2 533, I don't see much PC2-3200 around).
There would also be a high speed HT 3.0 interconnect between the CPU and GPU parts (or maybe a ring bus controller?) instead of the lowly 2GB/s (4GB/s?) full-duplex we have right now.

Mintmaster · Jan 19, 2007

Gubbi said:
Every single card, from the lowest low to the highest high end is limited by bandwidth in all but corner cases.

At ATI I did performance analysis (albeit a while ago). Not simple benching, but probing the data gathered by the chip about where stalls occur etc., just like the XB360 lets the devs see.

I guarantee you that you are very wrong in this statement. In fact, you can very easily test it out by down-clocking memory and/or overclocking the core of any chip. If you were right, you'd get 1:1 scaling with the former and no scaling with the latter.

Granted, doing things like halving the per-clock bandwidth of RSX compared to G71 will make a substantial difference in the percent of the workload that's BW limited, but it still won't be "limited by bandwidth in all but corner cases."

Gubbi said:
The number of pads on the die is what enables the wide busses. It does not drive cost down. Boutique memory on many-layer PCBs is what defines cost at the high end. And it is not getting any cheaper!.

That's exactly my point. There wasn't enough room on a GF2/3/4 for a 256-bit bus without increasing die size, but there was on later chips. High speed memory is expensive, but ATI and NVidia aren't putting it on their high-end cards to maximize performance/cost ratio at launch. It loses its "boutique" status quite quickly. Moreover, much of the board complexity and chip pin-count comes from power delivery. The incremental cost of going to 256-bit is not that big on the PC, especially for the high end.

A perfect example is how the 7600GT is notably faster than the 6800U with much lower bandwidth. I'm very sure that a 20-pipe 6800U with a 128-bit bus would faster than the existing 6800U. I doubt it would be cheaper, though. Eight 32MB chips would barely be more expensive than four 64MB chips, and an extra PCB layer is pocket change for a high end card.

Bandwidth is what defines price points on GPUs today. You build a memory system, and slap a chunk of silicon on it that will saturate it. Look how similar X1800s and X1900s are priced, yet X1900 has three times the shading power.

First of all, the X1900's have only 20% more silicon than the X1800's (despite 3x the arithmetic shading power), so your logic is severly flawed. Secondly, there are plenty of examples that prove you wrong. The Geforce FX5800 cost about the same as the FX5900. The 256-bit 6800GS cost the same as the 128-bit 7600GT for a while.

On consoles you're right because it hampers scaling down the road. Available memory, however, is a huge cost also. I think MS said going from 256MB to 512MB will cost them 1B, so you don't want to waste a huge chunk of that on a tile buffer.

It's only whack because of the massive performance penalty that is associated with it because of bandwidth limitations. It's a IMR mindset.

Cheers

No, it's whack because it's totally unnecessary. You can see the difference between 1M polys and 10M polys a LOT more than you can see the difference between 32-bpp and 128-bpp, or 8xMSAA and 4xMSAA. Your proposed framebuffer is 5x larger for next to no benefit simply to make a TBDR look better. It's 20x larger than most games today (2xAA, 720p), yet your poly count is the same.

Mat3 · Jan 20, 2007

I was looking around for other similar topics and found this old one.

http://www.beyond3d.com/forum/showthread.php?t=494

The main topic was just about only needing one buffer to handle both the current scene and the incoming one (instead of a buffer for each frame) but it's the bolded part below that I'm interested in.

Kristof said:
Anyway thats really irrelavant to this patent. The patent has to do with the memory storage of the geometry binning and how this memory is freed. Rather than work with one huge screen buffer that gets tiles, the screen first get tiled into macro (big) tiles which are then tiled into micro tiles (small). The trick is now that macro tiles memory can be de-allocated as soon as its rendered. This is better than the old system where you had to render the whole screen before the buffers could be freed. Its even possible to render a macro tile even though it does not yet contain all the data, this means that Z data needs to be stored for that subsection of the screen and read back on the second pass (when you add the rest of the geometry). All in all its just a way to store the tiled geometry efficiently. I guess this would be interesting to read for most here given the common claims that storage of the scene data is the problem with tilers. Also note the "compression" blocks in the patent.

So about rendering part of the frame (but not all of it) before all the polygons have been delivered (this would probably kick when it starts getting too full).. On one hand, all that polygon data could be cleared, and you'd need to store just the Z values for the rendered pixels, I guess, but how much would that reduce the advantage of deferred rendering? There's the possibility of having done work on overdraw, but what about the frame buffer bandwidth with FSAA? (I hope that makes sense.. if not, I'll just ask simply, is that a solution to the problem?)

Thanks

Mintmaster · Jan 20, 2007

Yup, I assumed that only one frame's geometry needs to be binned at a time in all my calculations.

I was going to bring up the point that you need multiple frames to be binned at a time for max efficiency, but then I realized that you could start filling the deallocated space as the renderer cleared space one tile at a time, much in the way that this patent describes. My idea was a bit different, simply using a queue of pointers to small blocks of memory. Pop one off when more bin space is needed for any tile, and when you've finished reading the deferred data contained in a block, push its pointer back on.

As for rendering a tile before it's completely binned, that would sort of defeat the purpose of a TBDR. A reasonable fallback, I suppose, but you're going to lose to an IMR when this happens, so I don't think Kristof is correct in saying it eliminates the storage issue.

Mat3 · Mar 4, 2008

Sorry to dig up an old thread, although this seemed like the best place to ask instead starting a new one.
Why couldn't a TBDR just calculate and store only vertex positions, then after the scene information is entirely collected, run the geometry through the vertex shaders again to calculate texture coordinates (so they wouldn't have to be stored), and then from there through the rest of the pipeline?
While I'm sure this would there's something not feasible about it, I don't quite have the necessary understanding of 3D graphics to know why.

Simon F · Mar 4, 2008

It's not impossible and has certainly been considered. . One problem is whether the vertex shaders supplied by the applications are easily separable (or not) into geometry and shading parts.

TapamN · Mar 4, 2008

Mat3 said:
Sorry to dig up an old thread, although this seemed like the best place to ask instead starting a new one.
Why couldn't a TBDR just calculate and store only vertex positions, then after the scene information is entirely collected, run the geometry through the vertex shaders again to calculate texture coordinates (so they wouldn't have to be stored), and then from there through the rest of the pipeline?
While I'm sure this would there's something not feasible about it, I don't quite have the necessary understanding of 3D graphics to know why.

Hehe. I like my Dreamcast a lot, and I have used far too much time idly thinking about things like this.

With this, you'd have to resend the information like texture data and normals in a second pass, since, unless you'd storing that in a texture or use some weird algorithmic method of calculating them in the shader, it won't really be able to easily figure out what they are. But it would probably work pretty well as is for deferred shading, if UV coords are stored with the position information.

But on a console, you could have unified memory and a store a pointer to the original model data with the GPU's vertex data. With flexible vert/geo shaders, that would make the "store position only" method a lot easier to work with.

Another simple space saving enhancement (compared to what the Dreamcast does) would be to use vertex, uh, grids to store vertex data, rather than strips. (i.e. on a vertex strip, the ideal polygon to vertex ratio is 1:1. If you make a grid of quads with vertex strips you end up duplicating about half the vertices. But with the right storage format, there's no need to keep the repeated vertices, cutting down the vertices needed to be stored by about half versus vertex strips.) A 4*4 grid of quads would take 40 vertices with vertex strips, but 25 vertices when not storing duplicates. But DX and OGL don't support this natively, so it'd be another console specific enhancement...

The best PC-method-of-rendering compatible space saving technique, I think, would be to try to identify repeated information or simple constants... (base and normal textures would likely have identical UV, and constants like 0.0 and 1.0 could be marked by a few bits.) I suppose it might be possible to get the vertex grid to work through a method like this for some games...

Why isn't TBDR used anymore ?

Panajev2001a

Mintmaster

Mintmaster

Panajev2001a

Jawed

blakjedi

Gubbi

nAo

Nutella Nutellae

Blazkowicz

Mintmaster

Mat3

Mintmaster

Mat3

Simon F

Tea maker

TapamN

Similar threads