Future solution to memory bandwidth

arjan de lumens · Feb 10, 2006

Jawed said:
Would a D3D10 capable GPU with the ability to hide vertex fetch latency care much about the sparsity of vertex data?

Jawed

The main reason that vertex data sparsity is a problem is not fetch latency, but fetch throughput. Small scattered memory accesses tend to be unable to take advantage of full data bursts (important, since all DRAM traffic these days is arranged as bursts with a fairly large minimum burst length), and they tend to suffer large numbers of page breaks. This is independent of whether the GPU is 'D3D10' capable or not.

In the case of a TBDR that does 2-pass vertex shading, sparse updates to a large vertex buffer object are very problematic too.
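To put rough numbers on the burst argument, here is a minimal sketch of a toy model. The 64-byte burst length and the 16-byte vertex size are assumptions picked for illustration, not figures for any particular memory system, and the function names are hypothetical. It only counts whole bursts touched; page breaks are not modelled.

```python
def bursts_touched(offsets, size, burst_bytes):
    """Return the set of burst-aligned blocks covered by the given accesses."""
    touched = set()
    for off in offsets:
        first = off // burst_bytes
        last = (off + size - 1) // burst_bytes
        touched.update(range(first, last + 1))
    return touched


def fetch_efficiency(offsets, size, burst_bytes=64):
    """Useful bytes divided by bytes actually moved in whole bursts.

    burst_bytes=64 is an assumed minimum burst length, not a real spec value.
    """
    useful = len(set(offsets)) * size
    transferred = len(bursts_touched(offsets, size, burst_bytes)) * burst_bytes
    return useful / transferred


# 16 vertices of 16 bytes each, packed back to back.
dense = [i * 16 for i in range(16)]
# The same 16 vertices, but one every 256 bytes in a larger buffer.
sparse = [i * 256 for i in range(16)]

print(fetch_efficiency(dense, 16))
print(fetch_efficiency(sparse, 16))
```

In this model the packed layout uses every byte of every burst, while the scattered layout pays for a full burst per vertex and uses only a quarter of it.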

Jawed · Feb 10, 2006

One of the things I wonder about is the performance of streamout - naturally it should be performed in blocks to optimise RAM access patterns - but as far as I can tell streamout actually allows a true "scatter" in addition to supporting a literal stream write to sequential locations in memory. Whenever scattered writes are performed, streamout could cause problems, presumably.

Jawed

RoOoBo · Feb 10, 2006

Chalnoth said:
Quick question on this: are these attributes all FP32 vec4's?

All the attributes are set as type GL_FLOAT from what I can see with a simple search on the trace. But not all attributes use 4 components. The vertex buffers seem to be arranged with position, color, texture and other attributes interleaved with a stride from vertex to vertex of 64 bytes (in those buffers associated with calls that seem shader heavy) or 16 bytes (likely the buffers for stencil shadows). The average number of components should be between 2 and 3. I think only the buffers for stencil use 4 components for the vertex position (the ones stored with stride 16).
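As a sanity check on those strides, here is a small sketch using a hypothetical attribute layout. The attribute names and per-attribute component counts below are my own guess, chosen only so that they add up to the observed 64 and 16 bytes; the trace itself only tells us the type (GL_FLOAT) and the strides.

```python
import struct

# Hypothetical layouts: names and component counts are assumptions,
# chosen to match the observed strides of 64 and 16 bytes.
SHADED_LAYOUT = [
    ("position", 3),
    ("normal", 3),
    ("tangent", 3),
    ("color", 3),
    ("texcoord0", 2),
    ("texcoord1", 2),
]
SHADOW_LAYOUT = [("position", 4)]


def stride_bytes(layout):
    """Size of one interleaved vertex with every component stored as FP32."""
    fmt = "<" + "".join(f"{count}f" for _, count in layout)
    return struct.calcsize(fmt)


def mean_components(layout):
    """Average number of components per attribute."""
    return sum(count for _, count in layout) / len(layout)


def bytes_per_indexed_vertex(layout, index_bytes=4):
    """Vertex data plus one 32-bit index, ignoring any vertex cache reuse."""
    return stride_bytes(layout) + index_bytes


print(stride_bytes(SHADED_LAYOUT), mean_components(SHADED_LAYOUT))
print(stride_bytes(SHADOW_LAYOUT), bytes_per_indexed_vertex(SHADOW_LAYOUT))
```

With a layout like this the average lands between 2 and 3 components per attribute, which matches the estimate above.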

All the indices are 32 bits.

Demirug · Feb 10, 2006

Jawed said:
One of the things I wonder about is the performance of streamout - naturally it should be performed in blocks to optimise RAM access patterns - but as far as I can tell streamout actually allows a true "scatter" in addition to supporting a literal stream write to sequential locations in memory. Whenever scattered writes are performed, streamout could cause problems, presumably.

Jawed

If I understand the D3D10 documentation right, stream output is strictly sequential. There are only two HLSL methods to control output: Append and RestartStrip.

Mintmaster · Feb 10, 2006

Demirug said:
Mintmaster, in the case of geometry LOD it is better to store every level in one block because even with an Indexbuffer you get only the best performance if you use all vertices in the block in a sequential order.

I think you're thinking of a different scheme than I am. I'm talking about progressive meshes, where each LOD level shares the same set of vertices from the full model. You can have a quasi-continuous model in this way.

If you put all the vertices involved in the lowest LOD together in a block, then higher LOD levels would have scattered access. You may be right that lower LOD should take priority over higher LOD, but I'm not sure.

Anyway, the point is not what's the best way for a developer to do this, because IHVs don't write the games. The point is that there are situations where a program doesn't touch a contiguous block of vertices, so using the original index buffer to access transformed vertices is a risky idea.
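A rough way to see the risk is to measure how much of the index range a draw call actually references. This is only a sketch under my own assumptions: the 1024-vertex model and the "every 8th vertex" low LOD are made up for illustration, and `referenced_fraction` is a hypothetical helper, not anything from a real driver.

```python
def referenced_fraction(indices):
    """Fraction of the [min, max] index range that the indices really use."""
    used = set(indices)
    span = max(used) - min(used) + 1
    return len(used) / span


# Full-detail mesh: every vertex of an assumed 1024-vertex model is referenced.
full_lod = list(range(1024))
# Assumed progressive-mesh low LOD: only every 8th vertex survives,
# but the survivors keep their original positions in the shared buffer.
low_lod = list(range(0, 1024, 8))

print(referenced_fraction(full_lod))
print(referenced_fraction(low_lod))
```

If transformed vertices are stored at their original index positions, the low LOD case here spreads its writes and reads over the whole range while using only about an eighth of it.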

Demirug · Feb 10, 2006

Yes, IHVs don't write games, but they tell you what you should do to get good performance from their hardware. They don't like progressive meshes. If you need some LOD, you should store the different levels as separate meshes.
