Future solution to memory bandwidth

ddes said:
Didn't TBDR die a long time ago? I thought TBDR would not scale well with increasing triangle counts.

No, it didn't; it's in a niche market right now (at least PowerVR is, IMO).
There's no issue with high triangle count either.
 
Ingenu said:
There's no issue with high triangle count either.
I don't buy that. The only question is how high a triangle count you need before it's really a problem. It's simple math, really: each vertex requires quite a bit more storage than one pixel in the z-buffer. So once you start approaching a number of vertices close to the number of pixels in the z-buffer, you start having worse bandwidth characteristics for a TBDR.

That's not to say there's nothing good about deferred rendering. I think it has great potential for more closed systems (like where it is now, with the MBX). But I don't think it has much potential, in the way TBDR is currently done, for the high-end PC segment.
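A rough back-of-the-envelope illustration of the vertex-versus-Z-pixel argument above; the 32-byte binned vertex record is an assumed figure for illustration, not a number from any real chip:

```python
# Compare an assumed per-vertex scene-buffer record against the Z-buffer size
# at 1280x1024. All per-vertex sizes here are illustrative assumptions.

MiB = 1024 * 1024
pixels = 1280 * 1024                 # ~1.3M pixels
z_bytes_per_pixel = 4                # 32-bit Z/stencil
scene_bytes_per_vertex = 32          # assumed binned vertex record (position + a few parameters)

z_buffer = pixels * z_bytes_per_pixel
for vertices in (100_000, 500_000, 1_000_000, 2_000_000):
    scene_buffer = vertices * scene_bytes_per_vertex
    print(f"{vertices:>9} verts: scene buffer {scene_buffer / MiB:5.1f} MiB "
          f"vs Z buffer {z_buffer / MiB:5.1f} MiB")
# With these assumptions the scene buffer already exceeds the Z-buffer size
# long before the vertex count reaches the pixel count.
```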
 
Okay, add a "yet" at the end of my sentence.
(Not really an expert on this issue with TBDR, so I can't give you any solution, but I bet PowerVR have quite a few ideas about it...)
 
Chalnoth said:
I don't buy that. The only question is how high a triangle count you need before it's really a problem. It's simple math, really: each vertex requires quite a bit more storage than one pixel in the z-buffer. So once you start approaching a number of vertices close to the number of pixels in the z-buffer, you start having worse bandwidth characteristics for a TBDR.
Before you reach that point you already have other, bigger problems (not TBDR related).

That's not to say there's nothing good about deferred rendering. I think it has great potential for more closed systems (like where it is now, with the MBX). But I don't think it has much potential, in the way TBDR is currently done, for the high-end PC segment.
Why do you think so, apart from "high triangle count" which really isn't a problem with today's games?
 
Xmas said:
Why do you think so, apart from "high triangle count" which really isn't a problem with today's games?
My main concern is with the scene buffer. You've got this buffer whose size is unknowable before the frame is binned. If you're running with anti-aliasing, an unexpected buffer overflow could be catastrophic for performance (sure, it'd only be for a frame or two, if it's handled well, but it still wouldn't be pleasant).

So, you have one of two situations: you're always running with a large amount of memory space reserved for the scene buffer, the majority of which is never used, or you run the risk of a scene buffer overflow causing a drastic performance drop.
 
Expanding the scene buffer dynamically as needed isn't THAT hard. If you have a virtual memory subsystem, then it's trivial; otherwise, keeping track of multiple blocks of scene buffer data is fairly easy too. Running out of physical memory is a risk, but with the capability of AGP/PCIe to access system memory, the polygon count needed to cause serious trouble on a PC is astronomical.
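A minimal sketch of the block-chaining bookkeeping described above, assuming fixed-size blocks and records smaller than one block; the class and its fields are made up purely for illustration:

```python
# Toy 'grow the scene buffer in blocks' bookkeeping. Not actual driver code.

class SceneBuffer:
    BLOCK_SIZE = 4 * 1024 * 1024          # assumed 4 MiB blocks

    def __init__(self):
        self.blocks = [bytearray(self.BLOCK_SIZE)]
        self.used_in_last = 0             # bytes consumed in the newest block

    def append(self, record: bytes) -> None:
        # Assumes a single record always fits inside one block.
        if self.used_in_last + len(record) > self.BLOCK_SIZE:
            # Out of room: chain another block instead of failing the frame.
            # On real hardware this allocation could also spill to
            # AGP/PCIe-accessible system memory if local memory runs out.
            self.blocks.append(bytearray(self.BLOCK_SIZE))
            self.used_in_last = 0
        block = self.blocks[-1]
        block[self.used_in_last:self.used_in_last + len(record)] = record
        self.used_in_last += len(record)
```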
 
Chalnoth said:
My main concern is with the scene buffer. You've got this buffer whose size is unknowable before the frame is binned. If you're running with anti-aliasing, an unexpected buffer overflow could be catastrophic for performance (sure, it'd only be for a frame or two, if it's handled well, but it still wouldn't be pleasant).

So, you have one of two situations: you're always running with a large amount of memory space reserved for the scene buffer, the majority of which is never used, or you run the risk of a scene buffer overflow causing a drastic performance drop.
Well, let's see...

Imagine you're playing at 1280x1024x32bit with 4xAA and double buffering. An IMR with downsample per frame will need 4*5 MiB for sample color data, 4*5 MiB for sample Z/stencil data, a downsample and a front buffer with 5 MiB each (well, it might be possible to downsample to the front buffer directly, but then you have to synchronize with scanout or you get tearing). A TBDR can do with two 5 MiB buffers.
40 MiB saved that could be used for the scene buffer instead. Even more for higher resolutions or HDR rendering.

And what happens usually when a graphics card runs out of texture memory? Right, it swaps textures to main memory and takes a noticeable, but not necessarily catastrophic performance hit. How do you solve that? Use lower-res textures or buy a card with more memory. Slower memory is cheaper, so a chip that needs less bandwidth can be equipped with more memory at the same price.
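A quick check of the buffer arithmetic a couple of paragraphs up (1280x1024, 32-bit colour, 32-bit Z/stencil, 4xAA, double buffering); the buffer breakdown simply follows the post, and real drivers may allocate things differently:

```python
# IMR vs. TBDR framebuffer footprint for the scenario described above.

MiB = 1024 * 1024
width, height = 1280, 1024
bytes_per_pixel = 4          # 32-bit colour, and 32-bit Z/stencil
aa_samples = 4

plane = width * height * bytes_per_pixel          # one full-resolution buffer
print(plane / MiB)                                # 5.0 MiB

imr = (aa_samples * plane      # multisampled colour
       + aa_samples * plane    # multisampled Z/stencil
       + plane                 # downsample (resolve) buffer
       + plane)                # front buffer
tbdr = 2 * plane               # back + front buffer only; samples stay on chip

print(imr / MiB, tbdr / MiB, (imr - tbdr) / MiB)  # 50.0 10.0 40.0
```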
 
Xmas said:
Well, let's see...

Imagine you're playing at 1280x1024x32bit with 4xAA and double buffering. An IMR with downsample per frame will need 4*5 MiB for sample color data, 4*5 MiB for sample Z/stencil data, a downsample and a front buffer with 5 MiB each (well, it might be possible to downsample to the front buffer directly, but then you have to synchronize with scanout or you get tearing). A TBDR can do with two 5 MiB buffers.
40 MiB saved that could be used for the scene buffer instead. Even more for higher resolutions or HDR rendering.
If your scene buffer is static/non-expandable, then the standard ImgTec scene manager procedure is:
  • Render the entire frame to a temporary buffer with the set of polygons currently in the scene buffer
  • Clear the scene buffer, thus preparing it for additional polygons
  • Keep the temporary buffer around; when you do the render pass for the remaining polygons, you fetch the temporary buffer for each tile so that you are rendering on top of 'correct' data.
What's nasty here is that if you're doing AA, you CANNOT downsample that temporary buffer, so it will take up just as much space as the full IMR multisample buffer would.
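To make the flow of those three steps concrete, here is a tiny, purely illustrative Python sketch (toy data structures, an absurdly small made-up capacity, nothing to do with the real scene manager):

```python
# Fallback for a fixed-size scene buffer: render the partial bin, clear it,
# keep binning, and render the remainder on top of the kept partial result.

SCENE_BUFFER_CAPACITY = 4      # tiny on purpose, just to force an overflow below

def render_pass(scene_buffer, partial):
    # 'Render' the binned triangles on top of whatever partial result exists.
    # This intermediate result may still be needed, so it has to keep full
    # per-sample data (it cannot be downsampled yet).
    return (partial or []) + list(scene_buffer)

def submit_frame(triangles):
    scene_buffer, partial = [], None
    for tri in triangles:
        if len(scene_buffer) == SCENE_BUFFER_CAPACITY:
            # Scene buffer full: flush what we have to a temporary buffer,
            # clear the scene buffer, and keep binning the rest of the frame.
            partial = render_pass(scene_buffer, partial)
            scene_buffer = []
        scene_buffer.append(tri)
    return render_pass(scene_buffer, partial)   # final pass over the remainder

print(submit_frame([f"tri{i}" for i in range(10)]))
```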
 
I know that, but this is a big if in more than one way:
arjan de lumens said:
If your scene buffer is static/non-expandable, then the standard ImgTec scene manager procedure is:
 
Can someone explain in a bit more detail how TBDR's work? Obviously you'll need a scene buffer to be deferred, but beyond that I see two obvious ways of implementing it:

1) Process all geometry twice (or more). In the first pass, store a draw call index and primitive index for each triangle in each bin, and exit the vertex shader as soon as the output position is determined. You can then make a table of the renderstate for each draw call in the frame, and access the triangles needed during the second pass.

The big pro seems to be lower memory needed, but the con is possibly horrible memory efficiency in accessing the triangles during the second pass due to very random VB access patterns. Moreover you need to run each vertex shader twice or more.

2) Process the geometry once. Store the entire output of the vertex shader along with a renderstate index as above for each triangle in each bin.

While you avoid the need to reprocess the geometry, here I could see massive storage space needed. After a vertex shader you could easily need over 150 bytes per vertex including iterator storage (normal, tangent, light, halfway, and eye vectors, etc), and you have data multiplication for each edge-tile intersection.
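For what it's worth, here's how I picture the per-bin records under those two schemes; field names and sizes are my own illustrative assumptions, not anything from a real design:

```python
# Sketch of what a per-tile bin entry might hold under each scheme.

from dataclasses import dataclass, field

@dataclass
class BinEntrySchemeA:          # scheme 1: re-run vertex shading in the second pass
    draw_call_index: int        # index into a table of per-draw-call render state
    primitive_index: int        # which triangle within that draw call
    # ~8 bytes per binned triangle; vertices are re-fetched and re-shaded later.

@dataclass
class BinEntrySchemeB:          # scheme 2: store shaded output with the bin
    render_state_index: int
    positions: list = field(default_factory=list)   # 3 x clip-space positions
    iterators: list = field(default_factory=list)   # 3 x N interpolants from the VS
    # easily 100+ bytes per triangle, duplicated for every tile the triangle touches
```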



Am I wrong in assuming it's one of these two? Also, for both schemes, isn't renderstate change a big issue? I guess if your tile is big enough (say 100k pixels) then it won't be bad, but in that case your cache eats a lot of die space.

Anyway, I don't think TBDR's have much advantage over traditional renderers. These are the only advantages I can think of:
- They save overdraw on opaque objects, but if you do either a Z-pass or rough sorting, that advantage is nearly gone. The 16-pipe Radeons cull 256 hidden pixels per clock.
- They save all Z-buffer bandwidth, which can't be too much with compression. Min-max hierarchical Z saves you a read as well. I wouldn't be surprised if a TBDR didn't store Z on chip for this reason.
- MSAA costs no bandwidth, but ATI's newest processors make that advantage moot.
- Alpha blending costs zero bandwidth

Yes, the last point is important, but 3DMark shows we already get 5 GPixels/s fillrate anyway under those circumstances. That's 2500fps at 1600x1200. I don't think bandwidth is the problem that it used to be for ordinary 3D rendering.

I think we're more starved for bandwidth during post-processing (esp. HDR), because we're accessing an uncompressed rendered texture many times per pixel. TBDR's don't help us there at all.

I just don't see these relatively minor advantages outweighing the cons, especially in the PC market.
 
Sorry, can't go into too much detail, but TBDR as a concept does not have a single, "best" implementation. There are lots of different ways to do things, just like with IMRs. And it's not static, there's a lot of research going on (ask Simon ;)).

Mintmaster said:
After a vertex shader you could easily need over 150 bytes per vertex including iterator storage (normal, tangent, light, halfway, and eye vectors, etc), and you have data multiplication for each edge-tile intersection.
Why?
 
Mintmaster said:
Can someone explain in a bit more detail how TBDR's work? Obviously you'll need a scene buffer to be deferred, but beyond that I see two obvious ways of implementing it:

1) Process all geometry twice (or more). In the first pass, store a draw call index and primitive index for each triangle in each bin, and exit the vertex shader as soon as the output position is determined. You can then make a table of the renderstate for each draw call in the frame, and access the triangles needed during the second pass.

The big pro seems to be lower memory needed, but the con is possibly horrible memory efficiency in accessing the triangles during the second pass due to very random VB access patterns. Moreover you need to run each vertex shader twice or more.
Actually, with this method, vertices that belong to backface-culled or frustum-culled polygons are never actually run through the 2nd vertex shading pass, so the total vertex shading load should be much less than that of running the entire shader twice on everything. Scattered memory accesses shouldn't be too big a problem; with an R520-style memory controller, you only run into efficiency problems when the data you need to fetch are much smaller than 16 bytes; a more serious problem is that the vertices for a tile are likely to have different vertex shader render states.
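A toy cost model of that claim, with completely made-up relative costs, just to show the shape of the saving against running the entire shader twice on everything:

```python
# Second (full) vertex-shading pass only touches vertices whose polygons
# survive backface/frustum culling. Every number below is an assumption.

vertices       = 1_000_000
survival_rate  = 0.4     # assumed fraction of vertices left after culling
pos_only_cost  = 1.0     # assumed relative cost of the position-only pass
full_cost      = 4.0     # assumed relative cost of the full vertex shader

everything_twice = 2 * vertices * full_cost
two_pass         = vertices * pos_only_cost + survival_rate * vertices * full_cost

print(two_pass / everything_twice)   # ~0.325 with these made-up numbers
```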
2) Process the geometry once. Store the entire output of the vertex shader along with a renderstate index as above for each triangle in each bin.

While you avoid the need to reprocess the geometry, here I could see massive storage space needed. After a vertex shader you could easily need over 150 bytes per vertex including iterator storage (normal, tangent, light, halfway, and eye vectors, etc), and you have data multiplication for each edge-tile intersection.
150 bytes corresponds to roughly 37 FP32 iterator components - which is around the upper limit of what OpenGL 2/DirectX 9 are actually able to support and much more than I would expect a developer to actually use most of the time. The data multiplication you speak of may happen on pointers to iterator data, but if it happens on the iterator data themselves, then the design is brain-damaged.
Am I wrong in assuming it's one of these two? Also, for both schemes, isn't renderstate change a big issue? I guess if your tile is big enough (say 100k pixels) then it won't be bad, but in that case your cache eats a lot of die space.
Those are the main choices you have, that is correct.

Typical tile sizes for TBDRs are, AFAIK, 16x16 to 32x32 pixels. Not one TBDR that I have ever worked on (Falanx) or heard about (PowerVR, GigaPixel) has actually had any problems with render-state-change performance, even though these architectures presumably have render state change rates ~100X what an IMR ever sees.
Anyway, I don't think TBDR's have much advantage over traditional renderers. These are the only advantages I can think of:
- They save overdraw on opaque objects, but if you do either a Z-pass or rough sorting, that advantage is nearly gone. The 16-pipe Radeons cull 256 hidden pixels per clock.
The Radeon is generally much more dependent than TBDRs on large objects to achieve that kind of cull efficiency. If the cull mechanism is hierarchical, then pumping out numbers like 256 pixels per clock is easy - IIRC, some 3dlabs cards can do 1024 while still having shitty real-world gaming performance.
- They save all Z-buffer bandwidth, which can't be too much with compression. Min-max hierarchical Z saves you a read as well. I wouldn't be surprised if a TBDR didn't store Z on chip for this reason.
- MSAA costs no bandwidth, but ATI's newest processors make that advantage moot.
Z-buffer data get much less compressible as geometry gets more complex or if you have disturbances such as Alpha-test, and if you do non-ordered-grid multisampling with Z computed per sample (as opposed to per pixel), that too considerably increases the complexity of compressing the Z data. Block-based Z compression also has a tendency to produce read-modify-write passes for pixels that would otherwise not be read. (Something similar applies to block-based color compression as well).

Also worth mentioning here is that with AA, the external memory footprint of a TBDR's framebuffers doesn't increase with increasing levels of AA (except for the scene manager issue above, which is avoidable.)
- Alpha blending costs zero bandwidth

Yes, the last point is important, but 3DMark shows we already get 5 GPixels/s fillrate anyway under those circumstances. That's 2500fps at 1600x1200. I don't think bandwidth is the problem that it used to be for ordinary 3D rendering.

I think we're more starved for bandwidth during post-processing (esp. HDR), because we're accessing an uncompressed rendered texture many times per pixel. TBDR's don't help us there at all.

I just don't see these relatively minor advantages outweighing the cons, especially in the PC market.
The main cons I see are problems with framebuffer locking/readback, occlusion queries and an inherently more complex data flow - any others you have in mind?
 
What about using embedded N-RAM, which is likely to be released in a couple of years (next year, according to Nantero) and is as fast as SRAM? That would sure solve a lot of bottlenecks... :)
 
Alejux said:
What about using embedded N-RAM, which is likely to be released in a couple of years (next year, according to Nantero) and is as fast as SRAM? That would sure solve a lot of bottlenecks... :)
I'll put that one squarely in the category of "I'll believe it when I see it", along with MRAM, holographic storage, optical circuits, quantum computers, superconductor chips, DNA computers, etc. (all of these seem to exist in the form of functional prototypes, but none of them are ready for mass-market introduction at this point).

With N-RAM, it would seem to me that making the actual RAM cell small isn't the problem; rather, it's an issue of getting the interconnect layers small enough to take advantage of the small RAM cell. If this issue isn't solved, N-RAM isn't likely to get much smaller than DRAM (which AFAIK suffers from the same problem).
 
arjan de lumens said:
I'll put that one squarely in the category of "I'll believe it when I see it", along with MRAM, holographic storage, optical circuits, quantum computers, superconductor chips, DNA computers, etc. (all of these seem to exist in the form of functional prototypes, but none of them are ready for mass-market introduction at this point).

With N-RAM, it would seem to me that making the actual RAM cell small isn't the problem; rather, it's an issue of getting the interconnect layers small enough to take advantage of the small RAM cell. If this issue isn't solved, N-RAM isn't likely to get much smaller than DRAM (which AFAIK suffers from the same problem).


Well, unless these statements from the Nantero people are lies just to attract investors, I think they are well past the lab-semi-working-prototype phase. If what they're projecting is true, it wouldn't be so far out to imagine this kind of technology being used 5 years from now on things like GPUs.


http://www.tgdaily.com/2006/02/03/nantero_cnt_memory/


I agree with you that all this is still in the see-it-to-believe-it phase, but still, some revolutionary technologies will have to come along soon if we expect to see the same level of progress in the years to come that we've seen in the past.
 
Alejux said:
If those claims are true, then N-RAM should become a highly attractive replacement for SRAM and NOR flash in the near future, but given that they don't seem to have developed new interconnect technology to go along with the new memory cell technology, better-than-DRAM densities sound like a pipe dream at this point. They do appear to have low power consumption, though, which should make them much friendlier to stacked/3D chips than current DRAMs - so 'as-good-as-DRAM' density may well be enough for them to achieve market dominance. As for low-latency operation, that should help CPUs tremendously, but much less so for GPUs.
 
Quick question: How much memory bandwidth would it save if we didn't have to refresh the DRAM? Or does GDDR do that internally?
 
Thanks, arjan. You make some good points.

Regarding the first method, it's impossible to see less vertex load than on an IMR. Compilers for current IMRs already exit early for culled vertices. You're also assuming the position shader can be separated from the one calculating the iterators, which is not always the case. But it's probably less frequent than I first thought. This definitely makes TBDRs seem more practical than storing all the post-VS data.

For the iterator storage in the second method, remember that plain old bump mapping requires three vec3s for the normal/tangent basis, one for the light vector, one for the eye vector, and a vec2 for texcoords. That's the bare minimum, and once you throw in multiple lights, halfway vectors for specular lighting, attenuation factors, and more textures like lightmaps and shadowmaps, then the DX9 limit is not enough. If you remember the PS3.0 patch that FarCry had, one of the big reasons for it was the 10 iterators that helped to avoid multipassing (though that's a pretty lame reason for calling it a PS3.0 shader). Even though PS2.0 already had 8, they could use those extra 2. So even without going into extravagant algorithms, developers will easily need >100 bytes per vertex between the VS and PS.
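Tallying that up (the exact component counts for the "extras" are just my illustrative assumptions):

```python
# Interpolant storage for a fairly ordinary per-pixel lighting setup.

fp32 = 4  # bytes per FP32 component

base = {
    "tangent-space basis (3 x vec3)": 9,
    "light vector (vec3)":            3,
    "eye vector (vec3)":              3,
    "base texcoord (vec2)":           2,
}
extras = {
    "second light + halfway vectors (2 x vec3 + 2 x vec3)": 12,
    "attenuation factors (2 x float)":                       2,
    "lightmap texcoord (vec2)":                              2,
    "shadow-map coords (vec4)":                              4,
}

minimum = sum(base.values()) * fp32
loaded  = (sum(base.values()) + sum(extras.values())) * fp32
print(minimum, loaded)   # 68 bytes bare minimum, 148 bytes once the extras pile up
```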

100x the renderstate changes isn't an issue?!? Wow. I guess that con gets scratched off the list. I doubt 32x32 would be enough for a modern tile-based accelerator, but I'm sure there's room in a 300M transistor chip for a 256x256 tile.
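Out of curiosity, here's what the on-chip sample storage alone would cost at those tile sizes, assuming 4x multisampling with 32-bit colour and 32-bit Z/stencil per sample (my assumption, not a figure from any actual chip):

```python
# On-chip tile buffer cost for the tile sizes mentioned above.

KiB = 1024
samples = 4
bytes_per_sample = 4 + 4     # colour + Z/stencil

for side in (16, 32, 256):
    tile_bytes = side * side * samples * bytes_per_sample
    print(f"{side}x{side} tile: {tile_bytes / KiB:7.1f} KiB on chip")
# 16x16 -> 8.0 KiB, 32x32 -> 32.0 KiB, 256x256 -> 2048.0 KiB
```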

Regarding compression, you're going to need block-based access to keep your memory efficiency high anyway, so it's not really a con for compression. That being said, a TBDR avoids this problem altogether. About early Z culling efficiency, the point was that even with a quarter of that rate or worse, overdraw is not an issue for opaque objects.

Okay, I guess you've convinced me that TBDR is more feasible than I originally thought. If you have a fast enough vertex shader, then the first method should minimize the memory costs, which were my biggest concern. I don't think the cons you mentioned are that big. At worst you could resolve the final framebuffer early for lock/readback, and occlusion queries could be solved by keeping the Z-buffer in external memory and/or using a Hi-Z buffer (say, 1/16th resolution) like ATI.

So what exactly is keeping ImgTec from releasing a high-end TBDR? Resources? Risk? Graphics enthusiasts are a well-informed bunch, and if you put out a superior product for a lower cost, then it will sell.

EDIT: I meant culled vertices, not objects.
 
Mintmaster said:
Compilers for current IMRs already exit early for culled objects.
How can you early-exit the Vertex Shader for culled primitives when you don't store the primitives?
 