Future solution to memory bandwidth

arjan de lumens said:
My point of the evil-pixel example was that compression would seemingly force me to do a full block read of the color buffer whereas without compression I would not need to do any color reads at all, only a write. While I have been told a couple of times that this is not the case, I have to ask: how do you lay out the compressed data in memory so that I can safely write the evil-pixel into it without needing to examine/decompress the compressed data?
Well, if you store a bit mask elsewhere in memory, you could query that to ask whether or not a particular pixel's block is compressed. Then you only need to do a read of the pixel block if it is compressed and you don't want to write to all the subsamples.
 
arjan de lumens said:
Huh? How can the block be laid out in memory so that when I write that pixel (with NO knowledge of the prior contents of the block; I want to avoid reading it), it is guaranteed not to damage the rest of the block?
There are two ways you can store whether a block is compressed or not. Either you store a flag at the start of the block, fetch the first bytes, check the flag, and read more data if required; this has obvious drawbacks regarding latency, and you have to steal a bit somewhere to store that flag. Or you store the compression flags on chip, in a large table. That obviously requires more transistors, but depending on block size and maximum resolution it's not that bad (24 KiB for 2048x1536 with 4x4 blocks). I am pretty sure that at least some chips use the latter method.

And with simple on/off compression, I meant that either you store one color for every pixel in the block (compressed block), or one color per sample (uncompressed block). In combination with the compression flag, you then always know where the color value(s) for a pixel are stored.
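As a concrete sketch of the on-chip table variant (one flag bit per 4x4 block, sized for the 2048x1536 case mentioned above; all names are illustrative, not any actual chip's design):

```cpp
#include <cstdint>
#include <vector>

constexpr int kBlockDim  = 4;                        // 4x4 pixel blocks
constexpr int kMaxWidth  = 2048;
constexpr int kMaxHeight = 1536;
constexpr int kBlocksX   = kMaxWidth / kBlockDim;    // 512
constexpr int kBlocksY   = kMaxHeight / kBlockDim;   // 384

// 512 * 384 bits = 24576 bytes = 24 KiB, matching the figure quoted above.
std::vector<uint8_t> flagTable((kBlocksX * kBlocksY + 7) / 8);

bool blockIsCompressed(int px, int py) {
    int block = (py / kBlockDim) * kBlocksX + (px / kBlockDim);
    return (flagTable[block >> 3] >> (block & 7)) & 1;
}
```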


Mintmaster said:
I think you're mixing some things up. The two pass method, and arjan's proposed weeding out of culled triangles during the second pass, only applies for the first method (as per my post above). That method doesn't store anything, but I guess you've just created a new proposal that is sort of a hybrid.
I'm not sure if I understand you correctly. You're referring to this, right?
Mintmaster said:
1) Process all geometry twice (or more). In the first pass, store a draw call index and primitive index for each triangle in each bin, and exit the vertex shader as soon as the output position is determined. You can then make a table of the renderstate for each draw call in the frame, and access the triangles needed during the second pass.
I don't see the purpose of running the vertex shader until the output position is determined if you're not going to store that position. If you don't store position, then there's no point in calculating it, because, as pointed out before, the position of a single vertex does not tell you anything at all about culled primitives. And if you're going to store the position calculated in a first pass, you might as well store all live temp registers along with it.

Mintmaster said:
I always figured it would be just too hard to get hardware vertex processing into the architecture, given its conspicuous absence in the Kyro at a time when that was all the rage. Discussing it with you, though, makes me less sure.
There's nothing that precludes hardware vertex processing in a TBDR architecture. MBX is available in combination with VGP, which offers about VS2.0 level functionality, IIRC.
 
Xmas - (regarding compression without read-modify-write) how does your scheme eliminate read-modify-write when the write destination is a compressed block? The only way I can think of to make this work is if a compressed block is one in which each pixel has identical color for each of its subsamples. If that's the scheme in use, I would expect blocks to almost never compress unless the scene being rendered really contains extremely simple geometry.
 
psurge said:
Xmas - (regarding compression without read-modify-write) how does your scheme eliminate read-modify-write when the write destination is a compressed block? The only way I can think of to make this work is if a compressed block is one in which each pixel has identical color for each of its subsamples. If that's the scheme in use, I would expect blocks to almost never compress unless the scene being rendered really contains extremely simple geometry.
Yes, as I said, a compressed block is one that stores one color per pixel, while an uncompressed block stores one color per sample. The efficiency of such a scheme depends on the size of the blocks, resolution, and the geometry used. For current games in 1280x1024 and above, I would expect >80% of reads/writes to 4x4 blocks to be compressed.

You can approximate the number of compressed and uncompressed blocks if you take two wireframe renders of a typical game scene: one that is only backface-culled, and another that is backface- and Z-culled (meaning you effectively fill the triangles black or do a Z-first pass). Then you scale both images down by the block size. The number of resulting black pixels tells you how many blocks could be compressed; the first picture gives you a theoretical lower bound, the second one a practical upper bound.
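The same estimate can be phrased as a tiny program: a block counts as compressible when no visible triangle edge crosses it, i.e. when its downscaled wireframe pixel stays black (the edge mask standing in for the wireframe render is an assumed input):

```cpp
#include <vector>

// 'edgeMask' has one bool per pixel, true where a wireframe edge was drawn.
int countCompressibleBlocks(const std::vector<bool>& edgeMask,
                            int width, int height, int blockDim = 4) {
    int compressible = 0;
    for (int by = 0; by < height / blockDim; ++by)
        for (int bx = 0; bx < width / blockDim; ++bx) {
            bool touched = false;
            for (int y = 0; y < blockDim && !touched; ++y)
                for (int x = 0; x < blockDim && !touched; ++x)
                    touched = edgeMask[(by * blockDim + y) * width
                                       + bx * blockDim + x];
            if (!touched) ++compressible;  // a "black pixel" after downscaling
        }
    return compressible;
}
```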
 
Xmas said:
I don't see the purpose of running the vertex shader until the output position is determined if you're not going to store that position. If you don't store position, then there's no point in calculating it, because, as pointed out before, the position of a single vertex does not tell you anything at all about culled primitives. And if you're going to store the position calculated in a first pass, you might as well store all live temp registers along with it.
The goal of that method was to minimize storage space during binning. This was my biggest concern, because if you use the second method, which is how the Kyro worked AFAIK, you could need hundreds of megs to bin the scene. You calculate position to figure out which bins it's part of, and you only need 2 indices to later figure out how to render a polygon in a bin: one to select a renderstate from a table, and one to select a primitive. I think you could get away with 2+4=6 bytes for this, so even a 10 million polygon scene (very possible in a year) would "only" need 60MB. I'm not sure how much accuracy you need for position after transformation, but let's say 8 bytes can store x, y, and z, with the latter at higher precision. You also need 3 pointers to the location of the transformed vertices for each primitive. Now we're over 20 bytes per primitive, which IMHO is well out of the comfort zone. As I described earlier, it sort of lies between the two schemes I envisioned.
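Written out as structs, the two per-triangle records being compared look roughly like this (field widths follow the byte counts quoted in this thread; they are illustrative assumptions, not a real hardware layout):

```cpp
#include <cstdint>

#pragma pack(push, 1)
struct BinRecordMinimal {          // method 1: re-run the vertex shader later
    uint16_t renderStateIndex;     // 2 bytes: row in the per-frame state table
    uint32_t primitiveIndex;       // 4 bytes: which primitive of the draw call
};                                 // 6 bytes -> ~60MB for 10M triangles

struct BinRecordHybrid {           // hybrid: keep transformed positions around
    uint16_t renderStateIndex;     // 2 bytes
    uint32_t primitiveIndex;       // 4 bytes: to fetch the non-position inputs
    uint32_t vertexPtr[3];         // 12 bytes: where the transformed verts live
};                                 // 18 bytes, plus the 8-byte positions stored
#pragma pack(pop)                  // once per vertex (~0.5 vertices/triangle),
                                   // i.e. ~4 bytes/triangle: the "over 20"

static_assert(sizeof(BinRecordMinimal) == 6, "6 bytes per binned triangle");
static_assert(sizeof(BinRecordHybrid) == 18, "18 + ~4 amortized = 22 bytes");
```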

There's nothing that precludes hardware vertex processing in a TBDR architecture. MBX is available in combination with VGP, which offers about VS2.0 level functionality, IIRC.
I know that now, but back when the Kyro line just vanished it seemed like a good possibility to me. MBX is also in a different environment where there isn't an established way of doing things like in the PC space. Limitations and corner cases of TBDR with T&L, as we're discussing in this thread, are less meaningful for the handheld market. Arcade machines are similar, and even a semi-TBDR like Xenos only seems possible for a console.
 
Xmas,
Thanks for the reply - I had missed the part where you explained the compression scheme earlier. I must admit that I'm surprised such a simple scheme is so effective. Is this what current HW actually implements?
 
Mintmaster said:
The goal of that method was to minimize storage space during binning. This was my biggest concern, because if you use the second method, which is how the Kyro worked AFAIK, you could need hundreds of megs to bin the scene. You calculate position to figure out which bins it's part of, and you only need 2 indices to later figure out how to render a polygon in a bin: one to select a renderstate from a table, and one to select a primitive. I think you could get away with 2+4=6 bytes for this, so even a 10 million polygon scene (very possible in a year) would "only" need 60MB. I'm not sure how much accuracy you need for position after transformation, but let's say 8 bytes can store x, y, and z, with the latter at higher precision. You also need 3 pointers to the location of the transformed vertices for each primitive. Now we're over 20 bytes per primitive, which IMHO is well out of the comfort zone. As I described earlier, it sort of lies between the two schemes I envisioned.
First of all, I don't think you will see a 10 million polygon (more precisely, triangle) scene in any game a year from now, though maybe in other applications (even today).

Since the binning stage needs the coordinates of the triangle, it is also the obvious stage to perform culling of backfacing polygons and of degenerate and pseudo-degenerate triangles (those that become degenerate when rounded to subpixel precision). The latter become more likely the more complex the scene is (especially when you have more polygons than pixels/AA samples). Typically, this will discard more than half of the triangles. Exceptions are cases where you need the backfaces, e.g. for shadow volumes.

And you might want to think about this: you usually have more triangles than vertices (1.5 times for a cube, which has 8 vertices and 12 triangles; approaching 2 times for a rectangular grid, where m×n vertices give 2(m-1)(n-1) triangles), and triangles are rarely detached.


psurge said:
Xmas,
Thanks for the reply - I had missed the part where you explained the compression scheme earlier. I must admit that I'm surprised such a simple scheme is so effective. Is this what current HW actually implements?
I don't know for sure, but I think it's pretty likely. The logic involved is very simple, and you only need a few KiB of on-chip storage for the compression flags.

However, there's one thing (and maybe I misunderstood arjan there and that's what he meant): when the pixel block you're writing contains semi-covered pixels, and the block already in memory is compressed, you really need to read that block first, uncompress it, replace the covered samples with the newly rendered ones, and store the uncompressed block back to memory.

With very small polygons, you'd better disable any block-based framebuffer compression.
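Put as code, the write path described here might look like the following simulation (4x4 blocks and 4 samples per pixel assumed; the data layout is a stand-in for illustration, not real hardware):

```cpp
#include <cstdint>
#include <cstring>

constexpr int kPixels  = 16;   // a 4x4 block
constexpr int kSamples = 4;    // assumed 4x multisampling

struct BlockSim {
    uint32_t storage[kPixels * kSamples]; // compressed: first 16 entries used
    bool     compressed;                  // (one colour per pixel); otherwise
};                                        // one colour per sample

// coverage[p] is the sample mask for pixel p (0xF = fully covered).
void writeBlock(BlockSim& dst, const uint32_t color[kPixels],
                const uint8_t coverage[kPixels]) {
    bool allFull = true;
    for (int p = 0; p < kPixels; ++p) allFull &= (coverage[p] == 0xF);

    if (allFull) {                        // no read needed: overwrite and
        std::memcpy(dst.storage, color, kPixels * sizeof(uint32_t));
        dst.compressed = true;            // mark the block compressed
        return;
    }
    if (dst.compressed) {                 // the read-modify-write case:
        uint32_t expanded[kPixels * kSamples];
        for (int p = 0; p < kPixels; ++p)        // decompress by replicating
            for (int s = 0; s < kSamples; ++s)   // each pixel's colour
                expanded[p * kSamples + s] = dst.storage[p];
        std::memcpy(dst.storage, expanded, sizeof(expanded));
        dst.compressed = false;
    }
    for (int p = 0; p < kPixels; ++p)     // merge only the covered samples
        for (int s = 0; s < kSamples; ++s)
            if (coverage[p] & (1u << s))
                dst.storage[p * kSamples + s] = color[p];
}
```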
 
Xmas, I'm fully aware of all those factors.

I know the binning stage is where the culling occurs. (Duh! That's what we've been talking about for a page or two!) I'm skeptical about culling more than half in a good engine, because any alpha blended/tested objects (e.g. bushes, trees, grass) don't get culled, nor are there many backfaces in the background/terrain/room/whatever. I've seen 10:1 ratios for clipped to visible tris in a 3DMark test, but that was poorly/lazily written. Pseudo-degenerate tris are not common enough to make a big impact, especially with AA enabled, nor do I see an easy and robust way to detect these without actually rasterizing them, which is a waste of time during the binning stage.

I know about vertices per tri, and did consider it. I said "over 20 bytes" for a reason: 3 pointers, renderstate index, primitive index (for the rest of the VS), and vertex data divided by two = 3*4+2+4+8/2 = 22. If you're fed an indexed draw primitive (AFAIK the most popular type), then I don't see how you can get away with less than 3 pointers per triangle. Unless, I suppose, you do some sort of dynamic strip assembly, and even then the chopping effect of tiles will minimize your savings. 10 million was purposely a slightly extreme example, simply because 1 million has been a piece of cake since R300. I don't think an engineer should design for any less, if not more.

Look, arjan is clearly very knowledgeable, and said this is one of the main choices. There's merit in this method because the memory needed for the scene buffer and binning is not trivial. Even Xenos sort of works this way, though I admit it's a bit of a special case.
 
Mintmaster said:
Xmas, I'm fully aware of all those factors.
Yet I seem to see much more optimization potential there.

I know the binning stage is where the culling occurs. (Duh! That's what we've been talking about for a page or two!) I'm skeptical about culling more than half in a good engine, because any alpha blended/tested objects (e.g. bushes, trees, grass) don't get culled, nor are there many backfaces in the background/terrain/room/whatever. I've seen 10:1 ratios for clipped to visible tris in a 3DMark test, but that was poorly/lazily written.
A detailed character model can easily have a polygon count similar to the environment. You're right about the alpha stuff, though. Backface culling is not going to remove half of the triangles, but maybe a third.

Pseudo-degenerate tris are not common enough to make a big impact, especially with AA enabled, nor do I see an easy and robust way to detect these without actually rasterizing them, which is a waste of time during the binning stage.
Pseudo-degenerate triangles can become quite common in a scene with 10 million triangles. They're actually quite easy to detect: either compare the screen space vertex positions to each other and kill the triangle if two are identical (this won't catch collinear vertices, except maybe if they're horizontal or vertical), or do it during backface culling using the (missing) winding information.
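For illustration, both tests fall out of one computation: the snapped screen-space signed area gives the winding for backface culling, and a zero area flags (pseudo-)degenerates, including the collinear case the identical-vertex test misses (a sketch; the 1/16-pixel snap is an assumed subpixel precision):

```cpp
#include <cmath>
#include <cstdint>

struct Snapped { int32_t x, y; };  // screen coords in assumed 1/16-pixel units

Snapped snap(float x, float y) {
    return { (int32_t)std::lround(x * 16.0f), (int32_t)std::lround(y * 16.0f) };
}

// Returns true if the triangle survives culling. Twice the signed area gives
// the winding; zero area after snapping means (pseudo-)degenerate.
bool keepTriangle(Snapped a, Snapped b, Snapped c, bool cullBackfaces) {
    int64_t area2 = (int64_t)(b.x - a.x) * (c.y - a.y)
                  - (int64_t)(b.y - a.y) * (c.x - a.x);
    if (area2 == 0) return false;                  // degenerate
    if (cullBackfaces && area2 < 0) return false;  // back-facing winding
    return true;
}
```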

I know about vertices per tri, and did consider it. I said "over 20 bytes" for a reason: 3 pointers, renderstate index, primitive index (for the rest of the VS), and vertex data divided by two = 3*4+2+4+8/2 = 22. If you're fed an indexed draw primitive (AFAIK the most popular type), then I don't see how you can get away with less than 3 pointers per triangle. Unless, I suppose, you do some sort of dynamic strip assembly, and even then the chopping effect of tiles will minimize your savings. 10 million was purposely a slightly extreme example, simply because 1 million has been a piece of cake since R300. I don't think an engineer should design for any less, if not more.
An application can very well render an indexed triangle strip or fan when it makes sense (which it does most of the time); there's no need for the hardware to assemble them dynamically from an indexed triangle list. There's no need for 32-bit pointers either (though they might be better for alignment reasons). And if you want to save even more space, what else is an index buffer if not a list of pointers to vertices?

Look, arjan is clearly very knowledgeable, and said this is one of the main choices. There's merit in this method because the memory needed for the scene buffer and binning is not trivial. Even Xenos sort of works this way, though I admit it's a bit of a special case.
I'm not saying that this is necessarily a bad method of doing things. I just think there is room for optimizations.
 
Okay, regarding culling, all I was trying to say is that it doesn't change the calculations a lot. You're not going to reduce the binned triangles by an order of magnitude. 6xAA at 1600x1200 is about 12 million samples on the screen, so easily detectable pseudo-degenerate tris will not be an appreciable percentage (I assume you'll have enough precision in screen space to take advantage of the AA, thus reducing the chance of identical coordinates). Of course you can put these optimizations in (even with an IMR), but the memory savings are low.

Triangle strips and fans are very restricted in their patterns. Fine for a simple perturbed grid terrain, for example, but for arbitrary geometry you either have to split up the object into multiple draw calls, which is very inefficient, or submit degenerate tris in some funky way. I doubt anyone does that today because regular indexed form is how the modelling programs export objects, and current accelerators give no reason for converting to strips. The patterns will also be disturbed by the tiles, further reducing the savings from the few strip/fan primitives that will be seen. I'm pretty sure the vast majority of vertices in most games are regular indexed triangle lists, so the point is that you have to design for this case.

I already thought about using the existing index buffer, but the problem is that you want to point to the newly generated optimized list of transformed vertices. Your original indices will still be needed to access non-position streams to complete the vertex shader in the second pass (and that's why a primitive index is stored), but they're no good for accessing the transformed vertices. There could be big gaps in the original index numbers from an app using an IHV-encouraged "mega VB" to store every character. I'll agree that 24-bit indices would work 99.9% of the time, and 16-bit might work given that the renderstate index could narrow down the location of transformed vertices.
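A sketch of the compaction this implies during binning: build a remap from original indices to a dense range, so transformed positions can be stored without gaps (illustrative code, not any chip's actual scheme):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Old (possibly sparse) VB indices are remapped to a dense 0..N-1 range so
// the per-pass-two transformed positions can be stored contiguously.
std::vector<uint32_t> compactIndices(const std::vector<uint32_t>& origIndices,
                                     uint32_t& uniqueCountOut) {
    std::unordered_map<uint32_t, uint32_t> remap;  // old index -> dense index
    std::vector<uint32_t> newIndices;
    newIndices.reserve(origIndices.size());
    for (uint32_t oldIdx : origIndices) {
        auto it = remap.find(oldIdx);
        if (it == remap.end())
            it = remap.emplace(oldIdx, (uint32_t)remap.size()).first;
        newIndices.push_back(it->second);          // gaps in the "mega VB"
    }                                              // numbering disappear
    uniqueCountOut = (uint32_t)remap.size();
    return newIndices;
}
```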

Only after thinking very thoroughly about the details does it seem like storing the post-transformed vertices (without iterator data) is feasible, and even then the memory requirements can be very hefty. With 512MB becoming the norm, though, it could work.
 
Now that I've read all this stuff, I think it would be trivial with the simulator to count how much per-frame data is required to store the whole scene, even taking into account backface culling and degenerates, for some 'modern' games (Doom3 or Quake4). I think I remember that the Quake4 trdemo (I don't remember the number) had peaks of around 700K vertices per frame while Doom3 was way lower. In any case I don't have the time right now...

On the same topic of TBDR problems: do you think that having simultaneous rendering processes would hurt them too much compared with IMRs?
 
Mintmaster said:
Triangle strips and fans are very restricted in their patterns. Fine for a simple perturbed grid terrain, for example, but for arbitrary geometry you either have to split up the object into multiple draw calls, which is very inefficient, or submit degenerate tris in some funky way. I doubt anyone does that today because regular indexed form is how the modelling programs export objects, and current accelerators give no reason for converting to strips. The patterns will also be disturbed by the tiles, further reducing the savings from the few strip/fan primitives that will be seen. I'm pretty sure the vast majority of vertices in most games are regular indexed triangle lists, so the point is that you have to design for this case.
I'm pretty sure the majority are triangle strips. They usually need fewer indices, which alone is enough reason to use them. And there's nothing funky about degenerate triangles. But this point is moot if you can reuse the index buffers.

I already thought about using the existing index buffer, but the problem is that you want to point to the newly generated optimized list of transformed vertices. Your original indices will still be needed to access non-position streams to complete the vertex shader in the second pass (and that's why a primitive index is stored), but they're no good for accessing the transformed vertices. There could be big gaps in the original index numbers from an app using an IHV-encouraged "mega VB" to store every character.
What do you consider an optimized list of transformed vertices?

An interesting detail from the documentation of D3D9 DrawIndexedPrimitive:
The MinIndex and NumVertices parameters specify the range of vertex indices used for each DrawIndexedPrimitive call. These are used to optimize vertex processing of indexed primitives by processing a sequential range of vertices prior to indexing into these vertices. It is invalid for any indices used during this call to reference any vertices outside of this range.
The required data to bridge any gaps – should they occur – is there.
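Under the thread's assumption of an 8-byte transformed position, that contract would let a binner size a dense per-draw scratch area and rebase the original indices into it (hypothetical helper names, just to make the arithmetic concrete):

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kPosBytes = 8;  // x, y, z packed in 8 bytes, z at higher
                                 // precision, per the discussion above

size_t scratchBytesForDraw(uint32_t numVertices) {  // NumVertices = max-min+1
    return (size_t)numVertices * kPosBytes;         // wasteful if the range
}                                                   // is sparsely used

uint32_t rebaseIndex(uint32_t originalIndex, uint32_t minIndex) {
    return originalIndex - minIndex;  // index into this draw's scratch area
}
```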

Only after a thinking very thoroughly about the details does it seem like storing the post-transformed vertices (without iterator data) is feasible, and even then the memory requirements can be very hefty. With 512MB becoming the norm, though, it could work.
That is another point. If you have a scene with 5 million distinct vertices, your vertex buffers are likely eating up hundreds of MiB. In this light, 40MB for the transformed positions of all vertices (8 bytes per vertex) doesn't sound that bad anymore.
 
Xmas said:
I'm pretty sure the majority are triangle strips. They usually need fewer indices, which alone is enough reason to use them. And there's nothing funky about degenerate triangles. But this point is moot if you can reuse the index buffers.

That is what I would have also expected but for whatever reasons Doom3 and Quake4 use mostly indexed triangle lists. They may be a special case though.
 
Mintmaster said:
Triangle strips and fans are very restricted in their patterns. Fine for a simple perturbed grid terrain, for example, but for arbitrary geometry you either have to split up the object into multiple draw calls, which is very inefficient, or submit degenerate tris in some funky way.
Why wouldn't you just submit multiple strips in a single draw call? DX allows that.
 
3dcgi said:
Why wouldn't you just submit multiple strips in a single draw call? DX allows that.
Xmas said:
I'm pretty sure the majority are triangle strips. They usually need fewer indices, which alone is enough reason to use them. And there's nothing funky about degenerate triangles. But this point is moot if you can reuse the index buffers.
I never said there's anything funky about degenerate triangles. I was referring to how one generates these strips from lists, i.e. how to walk across your mesh with one strip. Looking in the DXSDK, though, I see that Microsoft has done the work for us, so I guess I was wrong. I didn't think much about how to connect strips together or bridge islands, but now I see it's pretty easy. My bad.

I still have my doubts about whether devs do this, and RoOoBo's evidence supports those doubts. If you're indeed right about the majority being strips, then that could significantly reduce the space needed (assuming your point isn't moot and I'm right about not being able to reuse index buffers - see below). Even so, I think the tiles would chop the strips enough to keep you from getting near the limit of one index per triangle.

What do you consider an optimized list of transformed vertices?

An interesting detail from the documentation of D3D9 DrawIndexedPrimitive:

The required data to bridge any gaps – should they occur – is there.
If you want to use the original indices, then you must have a fixed stride between your transformed vertex data, and you also need filler space wherever you culled vertices. When I said "optimized list", I was referring to the reduced list of vertices needed in the second pass. Maybe I misunderstood you. Were you suggesting to store all the transformed vertices, regardless of whether they are used by any primitive? I thought you were mentioning all that culling stuff because you were making a point about less storage being needed.

I did consider those parameters of the DrawIndexedPrimitive command, but I'm not sure if developers use them properly, since it doesn't help today's ubiquitous IMRs. I think the description there about transforming prior to indexing was written in reference to software processing (i.e. the IDirect3DDevice9::ProcessVertices method). More importantly, there's model LOD to think about. Don't you usually use the same VB but with different indices for lower LOD models? The indices only touch a fraction of the total number of vertices in the [min,max] range specified in the draw call. The DXSDK has built-in support for progressive mesh creation, and I think it works in this way.

Imagine if you have 1000 creatures in the scene all using the same vertex buffer of 100,000 vertices. You have different index buffers so that the far-away models only need, say, 1000 of those vertices, but the ones near the camera use an IB touching all of them. Each of the creatures obviously uses different transformation matrices. There's no way to handle this without creating a new IB unless the developer goes through the trouble of reordering the VB so that all the verts used by the low-detail IB are together, but that makes for scattered access when rendering the high-detail model.

In any case, this is only one example of why you may only use a fraction of the verts in the [min,max] range. It's another edge case that would have disastrous consequences. If max-min is too large, what do you do? If you reuse the original IB, you must allocate a space of (max-min)*stride for each draw call, where stride = 8 bytes assuming no "live registers".

I guess we could run an experiment. We'd have to fiddle with Colourless' D3D wrapper performance tool, but basically we could sum (max-min) for each draw call over a whole frame in current games, and compare it to the primitive count.
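For what it's worth, the counting part of that experiment is tiny. A hypothetical proxy around IDirect3DDevice9 (only the counting logic shown; class and member names invented) would just accumulate NumVertices, i.e. max-min+1, and the primitive count:

```cpp
#include <d3d9.h>
#include <cstdint>

struct FrameStats { uint64_t rangeVerts = 0, prims = 0; };

// Hypothetical wrapper around the real device, in the spirit of the D3D
// wrapper tool mentioned above.
class CounterDevice {
    IDirect3DDevice9* m_real;
public:
    FrameStats stats;
    explicit CounterDevice(IDirect3DDevice9* real) : m_real(real) {}

    HRESULT DrawIndexedPrimitive(D3DPRIMITIVETYPE type, INT baseVertexIndex,
                                 UINT minVertexIndex, UINT numVertices,
                                 UINT startIndex, UINT primCount) {
        stats.rangeVerts += numVertices;  // sum of (max-min+1) over the frame
        stats.prims      += primCount;    // compare the two totals per frame
        return m_real->DrawIndexedPrimitive(type, baseVertexIndex,
                                            minVertexIndex, numVertices,
                                            startIndex, primCount);
    }
};
```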

That is another point. If you have a scene with 5 million different vertices, your vertex buffers are likely eating up hundreds of MiB. In this light 40MB for the transformed positions of all vertices (8 byte per vertex) doesn't sound that bad any more.
You don't have unique data for every object you draw. 100 soldiers/monsters/F1 cars/whatever using the same model with 100K tris doesn't need hundreds of megs for the VB, but still processes 10M tris per scene. In fact, this is probably how you'd reach peaks in the 5-10M range.
 
Using counters that I already had in place for a frame in Quake4 trdemo4:

Indices: 376239
Vertices shaded: 125404
Triangles assembled: 126460
Triangles clipped by frustum: 37353
Triangles after clipping: 89107
Back Facing triangles: 46883
Front Facing triangles: 41199
Triangles Rasterized: 49164

Right now I don't output the input and output attribute counters in a format I can use to easily count the total amount of input and output vertex data. But just by looking at the numbers, I would say that on average there are 4 input attributes and 6 output attributes.

Clipping only discards triangles completely outside the frustum and doesn't break triangles that are partially outside. I think primitive assembly is counting the extra empty triangle I use to mark the end of a batch. Some triangles may be culled at the face culling stage just because their areas are too small.

I'm pretty sure that there are frames from the same demo with double that number of vertices per frame but I don't have the data at hand.
 
RoOoBo said:
Right now I don't output the input and output attribute counters in a format I can use to easily count the total amount of input and output vertex data. But just by looking at the numbers, I would say that on average there are 4 input attributes and 6 output attributes.
Quick question on this: are these attributes all FP32 vec4's?
 
Mintmaster, in the case of geometry LOD it is better to store every level in its own block, because even with an index buffer you only get the best performance if you use all the vertices in the block in sequential order.

GPUs don't need the min/max parameters in the draw calls; they will only process the vertices that are indexed. But as memory fetches work in larger blocks, you will read many unnecessary vertices if you have "holes" in your index list.
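A back-of-the-envelope model of that waste (the burst size and vertex stride are made-up figures, chosen only to illustrate the effect):

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Vertices are fetched in fixed-size memory chunks, so unused vertices inside
// a touched chunk still cost bandwidth.
uint64_t bytesFetched(const std::vector<uint32_t>& indices,
                      uint32_t vertexSize = 32,    // bytes per vertex (assumed)
                      uint32_t fetchChunk = 64) {  // memory burst size (assumed)
    std::set<uint64_t> chunks;                     // distinct bursts touched
    for (uint32_t i : indices)
        chunks.insert((uint64_t)i * vertexSize / fetchChunk);
    // Compare against uniqueVertexCount * vertexSize to see the waste.
    return (uint64_t)chunks.size() * fetchChunk;
}
```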

Maybe this is an interesting plugin for my Direct3D command stream analyzer.
 
Would a D3D10 capable GPU with the ability to hide vertex fetch latency care much about the sparsity of vertex data?

Jawed
 
Jawed said:
Would a D3D10 capable GPU with the ability to hide vertex fetch latency care much about the sparsity of vertex data?

Jawed

This is less a latency problem; today's GPUs already have a pre-transform vertex cache to compensate for that. It is more a problem of wasted bandwidth: because you have to read whole chunks, you will read unneeded data if you have holes in your index list. This cannot be changed with a D3D10 GPU.

What a D3D10 GPU can do is generate dynamic LOD and store the result via stream output in another vertex buffer for later use.
 