ddes said: Didn't TBDR die a long time ago? I thought TBDR would not scale well with increasing triangle counts.
No, it didn't; it's in a niche market right now (at least PowerVR is, IMO).
There's no issue with high triangle count either.
Ingenu said: There's no issue with high triangle count either.
I don't buy that. The only question is how high a triangle count you need before it's really a problem. It's simple math, really: each vertex requires quite a bit more storage than one pixel in the z-buffer. So once the number of vertices starts approaching the number of pixels in the z-buffer, you start having worse bandwidth characteristics for a TBDR.
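To put rough numbers on the storage tradeoff Chalnoth describes, here is a quick back-of-envelope sketch in Python; the per-vertex footprint and the resolution are assumed values for illustration, not figures from any actual part:

    # Compare scene-buffer storage against the z-buffer it stands in for.
    # All sizes are assumed, illustrative values.
    PIXELS           = 1280 * 1024          # assumed render resolution
    ZBUFFER_MIB      = PIXELS * 4 / 2**20   # 32-bit Z per pixel -> 5 MiB
    BYTES_PER_VERTEX = 32                   # assumed post-binning vertex footprint

    for verts in (100_000, 500_000, 1_000_000, 2_000_000):
        scene_mib = verts * BYTES_PER_VERTEX / 2**20
        print(f"{verts:>9} vertices -> {scene_mib:6.1f} MiB scene buffer "
              f"(vs. {ZBUFFER_MIB:.0f} MiB z-buffer)")

With these assumed sizes the scene buffer overtakes the z-buffer somewhere in the hundreds of thousands of vertices; where exactly depends entirely on the per-vertex footprint you assume.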
Chalnoth said: I don't buy that. The only question is how high a triangle count you need before it's really a problem. It's simple math, really: each vertex requires quite a bit more storage than one pixel in the z-buffer. So once the number of vertices starts approaching the number of pixels in the z-buffer, you start having worse bandwidth characteristics for a TBDR.
Before you reach that point you already have other, bigger problems (not TBDR related).
Chalnoth said: That's not to say there's nothing good about deferred rendering. I think it has great potential for more closed systems (like where it is now, with the MBX). But I don't think it has much potential, in the way TBDR is currently done, for the high-end PC segment.
Why do you think so, apart from "high triangle count", which really isn't a problem with today's games?
Xmas said: Why do you think so, apart from "high triangle count", which really isn't a problem with today's games?
My main concern is with the scene buffer. You've got this buffer whose size is unknowable before the frame is binned. If you're running with anti-aliasing, an unexpected buffer overflow could be catastrophic for performance (sure, it'd only be for a frame or two if it's handled well, but it still wouldn't be pleasant).
Chalnoth said: My main concern is with the scene buffer. You've got this buffer whose size is unknowable before the frame is binned. If you're running with anti-aliasing, an unexpected buffer overflow could be catastrophic for performance (sure, it'd only be for a frame or two if it's handled well, but it still wouldn't be pleasant). So you have one of two situations: either you're always running with a large amount of memory space reserved for the scene buffer, the majority of which is never used, or you run the risk of a scene buffer overflow causing a drastic performance drop.
Well, let's see...
Imagine you're playing at 1280x1024x32bit with 4xAA and double buffering. An IMR that downsamples once per frame will need 4*5 MiB for sample color data, 4*5 MiB for sample Z/stencil data, plus a downsample buffer and a front buffer of 5 MiB each (well, it might be possible to downsample to the front buffer directly, but then you have to synchronize with scanout or you get tearing). A TBDR can do with two 5 MiB buffers.
That's 40 MiB saved that could be used for the scene buffer instead - even more at higher resolutions or with HDR rendering.
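Xmas's arithmetic, spelled out with the same assumptions he uses (4 bytes per pixel for both color and Z/stencil):

    # Buffer-size sums for 1280x1024, 32 bpp, 4xAA, double buffering.
    MIB   = 2**20
    plain = 1280 * 1024 * 4 / MIB        # one full-resolution 32-bit buffer = 5 MiB

    imr  = 4 * plain + 4 * plain + plain + plain  # 4x color samples, 4x Z/stencil,
                                                  # downsample buffer, front buffer
    tbdr = 2 * plain                              # back buffer + front buffer
    print(f"IMR: {imr:.0f} MiB, TBDR: {tbdr:.0f} MiB, saved: {imr - tbdr:.0f} MiB")
    # -> IMR: 50 MiB, TBDR: 10 MiB, saved: 40 MiB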
Xmas said: Well, let's see...
If your scene buffer is static/non-expandable, then the standard ImgTec scene manager procedure is:
arjan de lumens said: If your scene buffer is static/non-expandable, then the standard ImgTec scene manager procedure is:
Why?
Mintmaster said: After a vertex shader you could easily need over 150 bytes per vertex including iterator storage (normal, tangent, light, halfway and eye vectors, etc.), and you have data multiplication for each edge-tile intersection.
Mintmaster said: Can someone explain in a bit more detail how TBDRs work? Obviously you'll need a scene buffer to be deferred, but beyond that I see two obvious ways of implementing it:
1) Process all geometry twice (or more). In the first pass, store a draw call index and primitive index for each triangle in each bin, and exit the vertex shader as soon as the output position is determined. You can then make a table of the renderstate for each draw call in the frame, and access the triangles needed during the second pass.
The big pro seems to be lower memory needed, but the con is possibly horrible memory efficiency in accessing the triangles during the second pass due to very random VB access patterns. Moreover, you need to run each vertex shader twice or more.
Actually, with this method, vertices that belong to backface-culled or frustum-culled polygons are never actually run through the second vertex shading pass, so the total vertex shading load should be much less than that of running the entire shader twice on everything. Scattered memory accesses shouldn't be too big a problem; with an R520-style memory controller, you only run into efficiency problems when the data you need to fetch are much smaller than 16 bytes. A more serious problem is that the vertices for a tile are likely to have different vertex shader render states.
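For concreteness, here is a minimal sketch of the bookkeeping option 1 implies: bin only (draw call, primitive) indices during a position-only first pass, then refetch and fully shade per tile in the second pass. The structure and names are invented for illustration and are not how PowerVR or any real chip is documented to work:

    # Sketch of a two-pass binned renderer (Mintmaster's option 1). Illustrative only.
    from collections import defaultdict

    TILE = 32  # assumed tile size in pixels

    def tiles_touched(tri, tile=TILE):
        # Conservative bounding-box overlap test; real hardware would be exact.
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        for ty in range(int(min(ys)) // tile, int(max(ys)) // tile + 1):
            for tx in range(int(min(xs)) // tile, int(max(xs)) // tile + 1):
                yield tx, ty

    def bin_scene(draw_calls, position_only_vs):
        # Pass 1: run just enough vertex shading to get screen positions,
        # and record only (draw_call, primitive) index pairs per tile.
        bins = defaultdict(list)
        for dc, call in enumerate(draw_calls):
            for prim, tri in enumerate(call["tris"]):
                screen = [position_only_vs(v, call["state"]) for v in tri]
                for tile in tiles_touched(screen):
                    bins[tile].append((dc, prim))
        return bins

    def render(bins, draw_calls, full_vs, shade):
        # Pass 2: walk each tile and refetch the referenced triangles (this is
        # the scattered vertex-buffer traffic the post above worries about).
        for tile, refs in bins.items():
            for dc, prim in refs:
                call = draw_calls[dc]
                tri  = [full_vs(v, call["state"]) for v in call["tris"][prim]]
                shade(tile, tri, call["state"])

The per-triangle record in the bins is just two small indices, which is where the low memory footprint of this option comes from.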
Mintmaster said: 2) Process the geometry once. Store the entire output of the vertex shader, along with a renderstate index as above, for each triangle in each bin.
While you avoid the need to reprocess the geometry, here I could see massive storage space being needed. After a vertex shader you could easily need over 150 bytes per vertex including iterator storage (normal, tangent, light, halfway and eye vectors, etc.), and you have data multiplication for each edge-tile intersection.
150 bytes corresponds to about 37 FP32 iterator components - which is around the upper limit of what OpenGL 2/DirectX 9 are actually able to support, and much more than I would expect a developer to actually use most of the time. The data multiplication you speak of may happen on pointers to iterator data, but if it happens on the iterator data themselves, then the design is brain-damaged.
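For scale, here is how a per-vertex figure in that range can add up if every interpolant is stored at full FP32 precision; the particular attribute set below is just an example:

    # Post-vertex-shader storage per vertex, assuming FP32 for every component.
    attrs = {
        "clip-space position": 4,   # x, y, z, w
        "texture coordinates": 2,
        "normal":              3,
        "tangent":             3,
        "light vector":        3,
        "halfway vector":      3,
        "eye vector":          3,
        "vertex color":        4,
    }
    components = sum(attrs.values())
    print(components, "FP32 components =", components * 4, "bytes per vertex")
    # -> 25 FP32 components = 100 bytes per vertex; 37-38 components would give
    #    the ~150 bytes discussed above.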
Mintmaster said: Am I wrong in assuming it's one of these two? Also, for both schemes, isn't renderstate change a big issue? I guess if your tile is big enough (say 100k pixels) then it won't be bad, but in that case your cache eats a lot of die space.
Those are the main choices you have, that is correct.
Mintmaster said: Anyway, I don't think TBDRs have much advantage over traditional renderers. These are the only advantages I can think of:
- They save overdraw on opaque objects, but do either a z-pass or rough sorting and that advantage is nearly gone. The 16-pipe Radeons cull 256 hidden pixels per clock.
The Radeon is generally much more dependent than TBDRs on large objects to achieve that kind of cull efficiency. If the cull mechanism is hierarchical, then pumping out numbers like 256 pixels per clock is easy - IIRC, some 3dlabs cards can do 1024 while still having shitty real-world gaming performance.
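A tiny sketch of why hierarchical cull rates favour large occluded objects, as the reply above suggests: a whole block of pixels can only be rejected in a single test when the incoming geometry is behind everything stored for that block. The block size and depth values are assumptions for illustration:

    # Coarse (hierarchical) Z rejection at 8x8-pixel block granularity,
    # assuming a "less than" depth test (smaller Z = closer).
    BLOCK = 8

    def coarse_reject(block_max_z, tri_min_z):
        # Reject all 64 pixels at once if the triangle cannot possibly be in
        # front of anything already stored in this block.
        return tri_min_z >= block_max_z

    print(coarse_reject(block_max_z=0.40, tri_min_z=0.75))  # True: 64 pixels culled in one test
    print(coarse_reject(block_max_z=0.40, tri_min_z=0.10))  # False: falls through to per-pixel Z

A triangle only hits high cull rates when it fully covers many such already-occluded blocks, which is why small or partially covering triangles see far less benefit.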
Mintmaster said: - They save all Z-buffer bandwidth, which can't be too much with compression. Min-max hierarchical Z saves you a read as well. I wouldn't be surprised if a TBDR didn't store Z on chip for this reason.
- MSAA costs no bandwidth, but ATI's newest processors make that advantage moot.
Z-buffer data get much less compressible as geometry gets more complex or if you have disturbances such as alpha test, and if you do non-ordered-grid multisampling with Z computed per sample (as opposed to per pixel), that too considerably increases the complexity of compressing the Z data. Block-based Z compression also has a tendency to produce read-modify-write passes for pixels that would otherwise not be read (something similar applies to block-based color compression as well).
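For a rough sense of the Z traffic being argued over here: the resolution, overdraw, frame rate and compression ratio below are all assumed numbers, just to show the order of magnitude.

    # Ballpark Z-buffer bandwidth for an IMR. Every figure here is an assumption.
    pixels   = 1600 * 1200
    overdraw = 3          # assumed average depth complexity
    fps      = 60
    z_bytes  = 4          # 32-bit Z/stencil per pixel
    raw      = pixels * overdraw * z_bytes * 2 * fps   # read + write per depth test

    for ratio in (1, 2, 4):  # 1:1 = uncompressed, 4:1 ~ optimistic block compression
        print(f"{ratio}:1 compression -> {raw / ratio / 2**30:4.1f} GiB/s of Z traffic")

A couple of GiB/s against the tens of GiB/s a 2005 high-end card has supports the "can't be too much" reading, while the reply above points out that the compression ratio itself degrades with dense geometry, alpha test and per-sample Z.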
Mintmaster said: - Alpha blending costs zero bandwidth.
Yes, the last point is important, but 3DMark shows we already get 5 GPixels/s of fillrate anyway under those circumstances. That's 2500 fps at 1600x1200. I don't think bandwidth is the problem that it used to be for ordinary 3D rendering.
I think we're more starved for bandwidth during post-processing (especially HDR), because we're accessing an uncompressed rendered texture many times per pixel. TBDRs don't help us there at all.
I just don't see these relatively minor advantages outweighing the cons, especially in the PC market.
The main cons I see are problems with framebuffer locking/readback, occlusion queries and an inherently more complex data flow - any others you have in mind?
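The two figures behind the fillrate and post-processing point above, worked out; the tap count and buffer format in the post-processing half are assumptions:

    # Alpha-blend fillrate vs. HDR post-processing bandwidth (assumed workload).
    pixels = 1600 * 1200
    fill   = 5e9                        # 5 GPixels/s of blended fill
    print(f"{fill / pixels:.0f} fps worth of blended fill at 1600x1200")   # ~2600

    taps, fp16_rgba, fps = 16, 8, 60    # assumed: 16 reads/pixel of an FP16 RGBA target
    post = pixels * taps * fp16_rgba * fps / 2**30
    print(f"~{post:.1f} GiB/s just reading the HDR buffer for post-processing at {fps} fps")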
Alejux said: What about using embedded N-RAM, which is likely to be released in a couple of years (next year according to Nantero) and is as fast as SRAM? That would sure solve a lot of bottlenecks...
I'll put that one squarely in the category of "I'll believe it when I see it", along with MRAM, holographic storage, optical circuits, quantum computers, superconductor chips, DNA computers, etc. (all of these seem to exist in the form of functional prototypes, but none of them are ready for mass-market introduction at this point).
arjan de lumens said: I'll put that one squarely in the category of "I'll believe it when I see it", along with MRAM, holographic storage, optical circuits, quantum computers, superconductor chips, DNA computers, etc. (all of these seem to exist in the form of functional prototypes, but none of them are ready for mass-market introduction at this point).
With N-RAM, it would seem to me that making the actual RAM cell small isn't the problem; rather, it's an issue of getting the interconnect layers small enough to take advantage of the small RAM cell. If this issue isn't solved, N-RAM isn't likely to get much smaller than DRAM (which AFAIK suffers from the same problem).
If those claims are true, then N-RAM should become a highly attractive replacement for SRAM and NOR flash in the near future, but given that they don't seem to have developed new interconnect technology to go along with the new memory-cell technology, better-than-DRAM densities sound like a pipe dream at this point. They do appear to have low power consumption, though, which should make them much friendlier to stacked/3D chips than current DRAMs - so 'as-good-as-DRAM' density may well be enough for them to achieve market dominance. As for low-latency operation, that should help CPUs tremendously, but much less so for GPUs.
Alejux said:
Mintmaster said: Compilers for current IMRs already exit early for culled objects.
How can you early-exit the vertex shader for culled primitives when you don't store the primitives?