Nvidia: pre-T&L or post-T&L cache?

The post-T&L cache stores transformed vertices and is only available when indexing is used. (That's very clear from the nVidia papers.)

Without indexing, this makes triangle strips an obvious improvement over triangle lists, as vertices can be reused by the connecting triangles in the strip. (The post-T&L cache is not needed for this.)

When indexing is used, triangle strips don't provide any real advantage over triangle lists, as vertex reuse is available in the list case as well (and the index bandwidth is very small).

On nVidia cards the post-T&L cache is a FIFO, not an LRU, which means that reusing a vertex doesn't bring it to the front of the queue. (I guess they saved hardware this way and/or didn't think LRU would improve performance much.)
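
To make the FIFO/LRU difference concrete, here is a minimal Python sketch that counts how many vertices would have to be transformed (cache misses) for a given index stream under each policy. The function names and the cache size are my own assumptions for illustration. This is not how any particular chip is implemented.

```python
from collections import OrderedDict, deque

def fifo_misses(indices, cache_size):
    """Count misses for a FIFO cache: a hit does NOT move the entry."""
    cache = deque()
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)
            if len(cache) > cache_size:
                cache.popleft()  # evict the oldest inserted vertex
    return misses

def lru_misses(indices, cache_size):
    """Count misses for an LRU cache: a hit refreshes the entry."""
    cache = OrderedDict()
    misses = 0
    for idx in indices:
        if idx in cache:
            cache.move_to_end(idx)  # reuse brings the vertex to the front
        else:
            misses += 1
            cache[idx] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return misses

# Hypothetical index stream and a tiny 3-entry cache (size is an assumption).
stream = [0, 1, 2, 0, 3, 0]
print(fifo_misses(stream, 3), lru_misses(stream, 3))
```

With this stream the FIFO evicts vertex 0 even though it was just reused, so it has to be transformed again. The LRU keeps it. Which policy wins on real meshes depends on the index order, which is the point of the optimizers discussed in this thread.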

To make the post-T&L cache efficient, a complicated optimization algorithm is required (one is included in D3DX as well as in NvTriStrip).

The pre-T&L cache is a plain data cache with 32-byte lines. Details are not known, but I'd assume it's a 2-way data cache. The cache organization supports efficient random reading of vertices when the vertex size is a multiple of 32 bytes. 32 bytes is a practical vertex size (X Y Z Nx Ny Nz U V, i.e. eight 4-byte floats).

When the vertex size is not a multiple of 32 bytes, the vertices should be sorted by first appearance in the (optimized) index buffer to allow the pre-T&L cache to be efficient.
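
A minimal sketch of that reordering, in Python for brevity. The function name is hypothetical, and it assumes the index buffer has already been optimized for the post-T&L cache. Note that vertices never referenced by the index buffer are dropped.

```python
def reorder_by_first_use(vertices, indices):
    """Sort vertices by first appearance in the index buffer.

    Returns (new_vertices, new_indices) describing the same triangles,
    but with the vertex buffer laid out in order of first reference,
    so that memory reads are mostly sequential.
    """
    remap = {}          # old vertex index -> new vertex index
    new_vertices = []
    new_indices = []
    for old in indices:
        if old not in remap:
            remap[old] = len(new_vertices)
            new_vertices.append(vertices[old])
        new_indices.append(remap[old])
    return new_vertices, new_indices

# Toy data: four "vertices" and two triangles as an indexed list.
verts = ["a", "b", "c", "d"]
idx = [2, 0, 3, 0, 3, 1]
print(reorder_by_first_use(verts, idx))
# (['c', 'a', 'd', 'b'], [0, 1, 2, 1, 2, 3])
```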

The pre-T&L cache has nothing to improve in non-indexed modes, since in these modes vertices are always read sequentially.
 
Hyp-X said:
When indexing is used, triangle strips don't provide any real advantage over triangle lists, as vertex reuse is available in the list case as well (and the index bandwidth is very small).

I'm not so sure that must be true. Since it is possible to build geometry with almost twice as many triangles as vertices, but triangle strips only approach a 1:1 ratio, a post-triangle cache combined with intelligently-rendered and indexed strips might just prove to be even better than plain strips (and by a good margin...depending on hardware and drivers, of course).
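
To put numbers on those ratios, here is a small Python sketch using a regular grid as the example mesh. The grid and the function names are my own assumptions, chosen only because the counts are easy to verify by hand.

```python
def grid_counts(n):
    """Vertices and triangles in a regular grid of n x n vertices."""
    vertices = n * n
    triangles = 2 * (n - 1) * (n - 1)  # two triangles per grid cell
    return vertices, triangles

def strip_vertices_sent(num_triangles):
    """Vertices submitted by one non-indexed strip of that many triangles."""
    return num_triangles + 2

v, t = grid_counts(101)
print(v, t, t / v)                  # triangles per unique vertex
print(t / strip_vertices_sent(t))   # triangles per vertex sent in a strip
```

For a 101 x 101 grid this gives 20000 triangles over 10201 unique vertices (a ratio of about 1.96), while a single non-indexed strip of 20000 triangles would still submit 20002 vertices (a ratio just under 1).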
 
Hyp-X said:
On nVidia cards the post-T&L cache is a FIFO, not an LRU, which means that reusing a vertex doesn't bring it to the front of the queue. (I guess they saved hardware this way and/or didn't think LRU would improve performance much.)

I think you'll find the reasoning behind this in Hugues Hoppe's paper: Optimisation of Mesh locality..etc..
Around the 4th or 5th page he says he tried LRU but found that a FIFO works better.

I mentioned this in another thread, but FWIW I think the re-ordering/optimisation algorithm Hoppe uses needs to know the size of the 'cache' for it to work efficiently. A better scheme is presented by Gotsman: Universal Rendering sequences...etc ...
 
Chalnoth said:
Hyp-X said:
When indexing is used, triangle strips don't provide any real advantage over triangle lists, as vertex reuse is available in the list case as well (and the index bandwidth is very small).

I'm not so sure that must be true. Since it is possible to build geometry with almost twice as many triangles as vertices, but triangle strips only approach a 1:1 ratio, a post-triangle cache combined with intelligently-rendered and indexed strips might just prove to be even better than plain strips (and by a good margin...depending on hardware and drivers, of course).

Maybe I wasn't clear.

I'm not saying that indexed triangle strips aren't better than plain triangle strips. (They obviously are.)

I'm saying that indexed triangle strips aren't better than indexed triangle lists.
 
Hyp-X said:
I'm saying that indexed triangle strips aren't better than indexed triangle lists.

Assuming your triangles maintain the same vertex 'locality' as a strip would, you'd be close. (But then they'd be nearly strips, no?) Plus, you'd be sending 3x as much index information over the AGP bus.
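
The 3x figure follows directly from the index counts. A quick Python sketch, where 16-bit indices are an assumption on my part and the function name is hypothetical:

```python
BYTES_PER_INDEX = 2  # assumption: 16-bit indices

def index_bytes(num_triangles):
    """Index data for one long strip vs. a list covering the same triangles."""
    strip_indices = num_triangles + 2   # each new triangle adds one index
    list_indices = 3 * num_triangles    # each triangle has three indices
    return strip_indices * BYTES_PER_INDEX, list_indices * BYTES_PER_INDEX

strip_b, list_b = index_bytes(1000)
print(strip_b, list_b, list_b / strip_b)
```

For 1000 triangles that is 2004 bytes against 6000 bytes. The ratio approaches 3 for long strips, and shrinks once the strip has to be broken up or stitched.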
 
Hyp-X said:
Maybe I wasn't clear.

I'm not saying that indexed triangle strips aren't better than plain triangle strips. (They obviously are.)

I'm saying that indexed triangle strips aren't better than indexed triangle lists.

Yes, I suppose that's true, though you'd have to be absolutely certain that your rendering algorithm (well, the triangle order could be built into the model) is optimized for a post-TnL cache.

Of course, perhaps it is actually easier to increase cache coherency with triangle lists because you have more freedom in selecting the rendering order, but, at the same time, strips enforce a minimum cache coherency.
 
GPUs have both post-T&L caches and pre-T&L caches.

nVidia's vertex caches are post-T&L caches, and they are only available when using indexed data.

The pre-T&L caches are "just" memory caches.
 
Chalnoth said:
Yes, I suppose that's true, though you'd have to be absolutely certain that your rendering algorithm (well, the triangle order could be built into the model) is optimized for a post-TnL cache.

Yes, of course I assumed the data is optimized.
It's relatively straightforward to extract geometry data from 3D software such as 3DS Max in the form of indexed triangle lists, but that data needs optimization.

Of course, perhaps it is actually easier to increase cache coherency with triangle lists because you have more freedom in selecting the rendering order, but, at the same time, strips enforce a minimum cache coherency.

Yes, I'm just thinking about playing with such algorithms. The advantage of lists is that there's no need to create degenerate triangles.
Degenerate triangles might be free, but that depends a lot on the speed of the triangle setup, the shader complexity, and the "bloat" ratio of the optimizer.

If I understand correctly, NV30 will be the first card to support restart markers in the index stream, allowing indexed triangle strips to be generated without degenerate triangles.
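
For illustration, here is a Python sketch of both ways to join several strips into one index stream: stitching with degenerate triangles, and using a restart marker. The marker value 0xFFFF and all function names are assumptions for the example, not a statement about what any hardware or API uses.

```python
RESTART = 0xFFFF  # hypothetical restart marker value

def stitch_with_degenerates(strips):
    """Join strips by repeating indices, which creates degenerate triangles."""
    out = []
    for strip in strips:
        if out:
            out.append(out[-1])     # repeat last index of the previous strip
            out.append(strip[0])    # repeat first index of the next strip
            if len(out) % 2 == 1:   # next strip must start at an even
                out.append(strip[0])  # position to keep its winding
        out.extend(strip)
    return out

def join_with_restart(strips, restart=RESTART):
    """Join strips with a restart marker: no extra triangles at all."""
    out = []
    for strip in strips:
        if out:
            out.append(restart)
        out.extend(strip)
    return out

def strip_to_triangles(indices):
    """Decode a strip, dropping degenerates and fixing alternate winding."""
    tris = []
    for i in range(len(indices) - 2):
        a, b, c = indices[i], indices[i + 1], indices[i + 2]
        if a == b or b == c or a == c:
            continue
        tris.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return tris

strips = [[0, 1, 2], [3, 4, 5]]
print(stitch_with_degenerates(strips))  # [0, 1, 2, 2, 3, 3, 3, 4, 5]
print(join_with_restart(strips))        # [0, 1, 2, 65535, 3, 4, 5]
```

In this toy case stitching costs three extra indices and five degenerate triangles to join two one-triangle strips, while the restart marker costs a single index. That is the "bloat" the optimizer has to keep under control.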
 
Hyp-X said:
Yes, I'm just thinking about playing with such algorithms. The advantage of lists is that there's no need to create degenerate triangles.
Degenerate triangles might be free, but that depends a lot on the speed of the triangle setup, the shader complexity, and the "bloat" ratio of the optimizer.

If I understand correctly, NV30 will be the first card to support restart markers in the index stream, allowing indexed triangle strips to be generated without degenerate triangles.

Well, as for the triangle setup of a degenerate triangle, you would think it would be exceedingly simple, or possibly bypassed entirely with a smart TnL engine.

As for shader complexity, I suppose it would matter what data the shaders were dependent upon for processing. If the only data used in the vertex program that varies from triangle to triangle is based entirely upon the vertex positions, then you would think that any vertex program would be entirely skipped by just using the post-TnL cache.

But, anyway, I suppose this is just one of many situations where a smart architecture can do a lot better than a basic one. It is for this reason that I have an aversion to certain websites' apparent claims that all differences between architectures running at the same clock speed and with similar specs are based upon drivers (See xbitlabs' recent article benchmarking a Quadro, GF4, and R9700 in professional apps...).
 
Chalnoth said:
Well, as for the triangle setup of a degenerate triangle, you would think it would be exceedingly simple, or possibly bypassed entirely with a smart TnL engine.

I think the smartest thing is to filter degenerate triangles by index. If at least two indices of a triangle are equal, then it's guaranteed to be degenerate.
It could even be implemented for strips only.
The algorithm is simple: in an indexed strip, if the current index (the one completing a triangle) is equal to the previous one, then this triangle and the following one are skipped (not output to the triangle setup).

AFAIK none of the available hardware does this.
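
As a software model of the index-based filter described above, here is a minimal Python sketch. The function name is hypothetical, and it only models the "current index equals previous index" rule, so a triangle whose first and third indices match would not be caught.

```python
def filter_strip(strip):
    """Walk an indexed strip and return only the triangles that would be
    passed on to triangle setup, using the 'current == previous' rule."""
    tris = []
    skip = 0
    for i in range(1, len(strip)):
        if strip[i] == strip[i - 1]:
            skip = 2  # skip this triangle and the following one
        if skip > 0:
            skip -= 1
            continue
        if i >= 2:  # the first two indices don't complete a triangle
            a, b, c = strip[i - 2], strip[i - 1], strip[i]
            # odd-positioned triangles have flipped winding in a strip
            tris.append((a, b, c) if (i - 2) % 2 == 0 else (b, a, c))
    return tris

# Two strips stitched together with repeated indices.
print(filter_strip([0, 1, 2, 3, 3, 4, 4, 5, 6]))
# [(0, 1, 2), (2, 1, 3), (4, 5, 6)]
```

Only index comparisons are needed, so no vertex data has to be fetched or transformed to reject the stitching triangles.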

As for shader complexity, I suppose it would matter what data the shaders were dependent upon for processing. If the only data used in the vertex program that varies from triangle to triangle is based entirely upon the vertex positions, then you would think that any vertex program would be entirely skipped by just using the post-TnL cache.

The shader doesn't treat vertex position any differently from any other data contained within a vertex.
You assume the post-TnL cache could be looked up by input data.
I know nVidia doesn't do this, but I'm not aware of the details of other cards.
(The amount of information you can get on ATI's site on how to optimize for their cards is 0.)

But, anyway, I suppose this is just one of many situations where a smart architecture can do a lot better than a basic one. It is for this reason that I have an aversion to certain websites' apparent claims that all differences between architectures running at the same clock speed and with similar specs are based upon drivers (See xbitlabs' recent article benchmarking a Quadro, GF4, and R9700 in professional apps...).

Well, there are always some tricks you can do in the drivers.

When a game becomes a popular benchmark, it's up to the IHVs to try to do the optimizations that the original author forgot to do.
Actually, that's true for 3DMark too.
 