3dcgi said:
Why wouldn't you just submit multiple strips in a single draw call? DX allows that.
Xmas said:
I'm pretty sure the majority are triangle strips. They usually need fewer indices, which alone is enough reason to use them. And there's nothing funky about degenerate triangles. But this point is moot if you can reuse the index buffers.
I never said there's anything funky about degenerate triangles. I was referring to how one generates these strips from lists, i.e. how to walk across your mesh with one strip. Looking in the DXSDK, though, I see that Microsoft has done the work for us, so I guess I was wrong. I didn't think much about how to connect strips together or bridge islands, but now I see it's pretty easy. My bad.
I still have my doubts about whether devs actually do this, and RoOoBo's evidence supports those doubts. If you're indeed right about the majority being strips, then that could significantly reduce the space needed (assuming your point isn't moot and I'm right about not being able to re-use index buffers - see below). Even so, I think the tiles would chop the strips up enough to keep you from getting near the limit of one index per triangle.
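For anyone following along, here is a minimal sketch of how separate strips can be bridged into one with degenerate triangles. This is just one common way to do it, not necessarily what the DXSDK does internally, and the function names `stitch_strips` and `real_triangles` are made up for illustration.

```python
def stitch_strips(strips):
    """Join triangle strips into one index list using degenerate triangles."""
    out = []
    for strip in strips:
        if len(strip) < 3:
            continue  # too short to contain a triangle
        if out:
            # Bridge: repeat the last index of the previous strip and the
            # first index of the next one. The triangles this creates have
            # zero area, so they contribute nothing to the image.
            out.append(out[-1])
            out.append(strip[0])
            # A strip flips winding on every odd triangle, so the next strip
            # has to start at an even position to keep its original facing.
            if len(out) % 2 == 1:
                out.append(strip[0])
        out.extend(strip)
    return out


def real_triangles(indices):
    """Count the non-degenerate triangles in a strip."""
    return sum(
        1
        for i in range(len(indices) - 2)
        if len(set(indices[i:i + 3])) == 3
    )


merged = stitch_strips([[0, 1, 2, 3], [4, 5, 6, 7]])
print(merged, real_triangles(merged))
```

In this toy case two quads take 10 strip indices, against 12 as a triangle list, so the bridge eats part of the saving. That is the effect I'd expect tiling to make worse, since every cut in a strip costs another bridge.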
What do you consider an optimized list of transformed vertices?
An interesting detail from the documentation of D3D9 DrawIndexedPrimitive:
The required data to bridge any gaps – should they occur – is there.
If you want to use the original indices, then you must have a fixed stride between your transformed vertex data, and you also need filler space wherever you culled vertices. When I said "optimized list", I was referring to the reduced list of vertices needed in the second pass. Maybe I misunderstood you. Were you suggesting to store all the transformed vertices, regardless of whether they are used by any primitive? I thought you were mentioning all that culling stuff because you were making a point about less storage being needed.
I did consider those parameters in the DrawIndexedPrimitive command, but I'm not sure developers use them properly, since they don't help today's ubiquitous IMR. I think the description there about transforming prior to indexing was written in reference to software processing (i.e. the IDirect3DDevice9::ProcessVertices method). More importantly, there's model LOD to think about. Don't you usually use the same VB but with different indices for lower LOD models? The indices only touch a fraction of the total number of vertices in the [min,max] range specified in the Draw call. The DXSDK has built-in support for progressive mesh creation, and I think it works in this way.
Imagine you have 1000 creatures in the scene, all using the same vertex buffer of 100,000 vertices. You have different index buffers so that the far away models only need, say, 1000 of those vertices, but the ones near the camera use an IB touching all of them. Each creature obviously uses a different transformation matrix. There's no way to handle this without creating a new IB unless the developer goes to the trouble of reordering the VB so that all the verts used by the low detail IB are together, but that makes for scattered access when rendering the high detail model.
In any case, this is only one example of why you may only use a fraction of the verts in the [min,max] range. It's another edge case that would have disastrous consequences. If max-min is too large, what do you do? If you reuse the original IB, you must allocate a space of (max-min)*stride for each draw call, where stride = 8 bytes assuming no "live registers".
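To put numbers on the creature example, here is a rough sketch of the per-draw-call cost using the (max-min)*stride formula and the 8-byte stride. The helper names are hypothetical, and the low LOD draw call is assumed to span the whole 100,000-vertex buffer while touching only 1000 vertices.

```python
STRIDE = 8  # bytes per transformed vertex, assuming no "live registers"


def reuse_ib_bytes(min_index, max_index, stride=STRIDE):
    """Space needed when the original IB is reused: the whole [min,max] range."""
    return (max_index - min_index) * stride


def compact_bytes(vertices_touched, stride=STRIDE):
    """Space needed when only the referenced vertices are stored (needs a new IB)."""
    return vertices_touched * stride


# One low LOD creature: range covers 100,000 verts, indices touch 1000 of them.
low_lod_reuse = reuse_ib_bytes(0, 100_000)
low_lod_compact = compact_bytes(1_000)
print(low_lod_reuse, low_lod_compact)
```

Under those assumptions a single far-away creature costs 800,000 bytes if the original indices are kept, against 8,000 bytes for a compacted list, and the scene has up to 1000 such draw calls.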
I guess we could run an experiment. We'd have to fiddle with Colourless' D3D wrapper performance tool, but basically we could sum (max-min) for each draw call over a whole frame in current games, and compare it to the primitive count.
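The bookkeeping for that experiment would be tiny. Here is a sketch that assumes the wrapper can dump one record per draw call; the `DrawCall` record and the sample frame are invented for illustration, and NumVertices is taken as the size of the [min,max] range.

```python
from collections import namedtuple

# Hypothetical log record; the field names mirror the MinIndex, NumVertices
# and PrimitiveCount arguments of DrawIndexedPrimitive.
DrawCall = namedtuple("DrawCall", "min_index num_vertices prim_count")


def frame_summary(draw_calls):
    """Return (summed index range, total primitives, range per primitive)."""
    total_range = sum(dc.num_vertices for dc in draw_calls)
    total_prims = sum(dc.prim_count for dc in draw_calls)
    ratio = total_range / total_prims if total_prims else 0.0
    return total_range, total_prims, ratio


# Made-up frame: one low LOD and one high LOD draw from the same 100,000-vertex VB.
frame = [
    DrawCall(min_index=0, num_vertices=100_000, prim_count=2_000),
    DrawCall(min_index=0, num_vertices=100_000, prim_count=100_000),
]
print(frame_summary(frame))
```

A ratio near or below one vertex slot per primitive would suggest that reusing the original indices is affordable; a ratio far above that would point to the LOD problem described above.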
That is another point. If you have a scene with 5 million different vertices, your vertex buffers are likely eating up hundreds of MB. In this light, 40MB for the transformed positions of all vertices (8 bytes per vertex) doesn't sound that bad any more.
You don't have unique data for every object you draw. 100 soldiers/monsters/F1 cars/whatever using the same model with 100K tris doesn't need hundreds of megs for the VB, but still processes 10M tris per scene. In fact, this is probably how you'd reach peaks in the 5-10M range.