Simon F said:
use degenerate triangles to fill post-T&L vertex cache
That sounds grossly inefficient to me.
- I would suspect that any decent graphics accelerator (when drawing triangles) would detect that you've passed in a degenerate triangle and throw it away well before it even got around to fetching the vertices.
- The idea of the FIFO approach is that you can reuse vertices of previous triangles efficiently and that as vertices get "older" they are less likely to be re-used.
IMHO you should at least be using an ordering like "0,1,15; 1,16,15; 1,2,16; ..." with, say, N/2 (-1?) unique vertices per row (where N is the size of the FIFO), and then come back for the second layer, e.g. "15,16,30; 16,31,30; 16,17,31; ..." etc. (or possibly doing that from right to left instead of left to right).
That way many of the earlier points will still be in the FIFO when you get back to them. There are no doubt much better orders to use, but this would probably still be good for systems that just use strips. I'll leave it up to you to express those as strips.
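To make that ordering concrete, here is a rough Python sketch that generates it and counts misses in a simulated FIFO. The helper names (`fifo_misses`, `layered_trilist`), the strict-FIFO behaviour (a hit does not refresh an entry) and the cache sizes are my assumptions for illustration, not a model of any particular chip.

```python
from collections import deque

def fifo_misses(indices, cache_size):
    """Count misses for an index stream under a plain FIFO policy (assumed model)."""
    cache = deque()
    misses = 0
    for v in indices:
        if v not in cache:
            misses += 1
            cache.append(v)
            if len(cache) > cache_size:
                cache.popleft()  # evict the oldest entry
    return misses

def layered_trilist(cols, rows):
    """Trilist for a grid of cols x rows vertices, one layer of quads at a time."""
    tris = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            a = r * cols + c  # lower-left vertex of this quad
            tris += [a, a + 1, a + cols, a + 1, a + cols + 1, a + cols]
    return tris

order = layered_trilist(15, 3)        # starts 0,1,15; 1,16,15; 1,2,16; ...
misses_big = fifo_misses(order, 32)   # FIFO about twice the row width
misses_small = fifo_misses(order, 16) # FIFO about one row wide
print(misses_big, misses_small)
```

With 45 unique vertices in that grid, the 32-entry run should miss exactly 45 times in this model, while the 16-entry run has to refetch some of the middle row.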
Of course, it is dangerous to optimise for just one size of FIFO since different HW systems are likely to have different size caches. I remember reading a great paper which described a method for optimising the triangle order so that it was optimal for any cache/FIFO size. I posted details of it on B3D a while back, so it's probably worth a search.
Notes on the post-T&L cache
It's worth noting that tristrips automatically organize vertices so as to get a large part of the benefit of the post-T&L cache. That is, each vertex can be used up to six times on average, and a tristrip automatically uses each vertex three times in the space of three triangles, so with tristrips each vertex at worst gets fetched and shaded only twice, rather than the trilist worst-case of six times. In truth, tristrips use the primitive-assembly cache, discussed next, rather than the post-T&L cache, to accomplish this, but that doesn't change the basic point that tristrips eliminate a lot of the post-T&L-oriented mesh optimization that has to be done with trilists.
It also doesn't change the fact that it's still important with tristrips to optimize meshes so that the post-T&L cache can eliminate that one remaining potential refetch and reshade case for as many vertices as possible. The priming example below shows how it's possible to use the post-T&L cache to eliminate every single redundant fetch and shade in a mesh that's 30 triangles across; in other words, it's possible even in the case of a 30-wide, infinitely-long mesh to fetch and shade each vertex exactly once—the theoretical minimum. Most meshes won't be quite so amenable to optimization, but it's possible to come surprisingly close to ideal vertex processing by deft use of the post-T&L cache.
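The two-versus-six figure quoted above can be checked by simply counting how often an interior vertex index is submitted in each representation. This is only a sketch: the grid size and the helper names (`row_strips`, `trilist`) are made up for illustration, and it counts index submissions with no cache at all, which is the worst case the notes describe.

```python
from collections import Counter

def row_strips(cols, rows):
    """One tristrip per row of quads: bottom vertex, top vertex, alternating."""
    strips = []
    for r in range(rows - 1):
        strip = []
        for c in range(cols):
            strip += [r * cols + c, (r + 1) * cols + c]
        strips.append(strip)
    return strips

def trilist(cols, rows):
    """The same grid as independent triangles, two per quad."""
    tris = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            a = r * cols + c
            tris += [a, a + 1, a + cols, a + 1, a + cols + 1, a + cols]
    return tris

cols, rows = 5, 5
interior = 2 * cols + 2  # vertex at row 2, column 2, away from every edge

strip_uses = Counter(v for s in row_strips(cols, rows) for v in s)
list_uses = Counter(trilist(cols, rows))
print(strip_uses[interior], list_uses[interior])  # prints "2 6"
```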
One handy feature of the post-T&L cache is that it persists between primitives. In other words, if you draw a tristrip, and then immediately, without changing any renderstates, draw another primitive, for example a trilist, any vertices left in the post-T&L cache at the end of the first primitive are available to the second primitive.
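A toy model of that persistence, assuming a strict 16-entry FIFO and two made-up index streams; the class name `FifoVertexCache` is hypothetical and only stands in for the hardware cache:

```python
from collections import deque

class FifoVertexCache:
    """Toy post-T&L cache: strict FIFO, state kept between submit() calls."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()

    def submit(self, indices):
        """Feed one primitive's indices through the cache; return the miss count."""
        misses = 0
        for v in indices:
            if v not in self.entries:
                misses += 1
                self.entries.append(v)
                if len(self.entries) > self.size:
                    self.entries.popleft()
        return misses

strip = [0, 4, 1, 5, 2, 6, 3, 7]      # first primitive: a tristrip
tris = [4, 5, 8, 5, 9, 8, 5, 6, 9]    # second primitive: a trilist reusing 4, 5, 6

shared = FifoVertexCache(16)
shared.submit(strip)
misses_shared = shared.submit(tris)   # cache contents carried over from the strip

fresh = FifoVertexCache(16)
misses_fresh = fresh.submit(tris)     # same trilist against an empty cache
print(misses_shared, misses_fresh)    # prints "2 5"
```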
Another interesting aspect of the post-T&L cache is that it is possible to preload it (and the pre-T&L cache as well, as a side effect), via PrimeVertexCache(). This function issues vertices outside the scope of any primitive, and has no effect other than to cause those vertices to be loaded into the pre-T&L cache, processed by the vertex shader, and loaded into the post-T&L cache. This is useful for initializing the contents of the post-T&L cache in order to take best advantage of the post-T&L cache's FIFO eviction order. For example, consider the following tristrip:
Normal tristrip drawing won't result in optimal post-T&L caching in this case. Consider the drawing order:
0, 16, 1, 17, 2, 18, 3, 19, …, 14, 30, 15, 31
At this point, degenerate triangles (as discussed below) can be used to move up to the next row of triangles, but there's no way that all the vertices along the middle will still be in the post-T&L cache as the top row is drawn, because at this point vertices 16 through 23 have already been evicted. Thus many vertices will have to be refetched and retransformed, at a considerable cost in performance.
The problem here is that vertices from the bottom edge are taking up space in the post-T&L cache, even though they'll never be used again; here's the state of the cache at this point:
8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31
(This assumes the cache really is 16 entries in size; it may actually be effectively anywhere between 14 and 18, as mentioned above, but 16 is fine for illustration.)
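The unprimed case can be replayed with a small simulation. This is a sketch under the same assumption as the text (a strict FIFO of exactly 16 entries, where a hit does not refresh an entry); `draw` and `strip_row` are illustrative helpers, not part of any real API.

```python
from collections import deque

CACHE_SIZE = 16  # assumed; the effective size may differ on real hardware

def draw(cache, indices):
    """Run indices through a FIFO cache in place; return the indices that missed."""
    missed = []
    for v in indices:
        if v not in cache:
            missed.append(v)
            cache.append(v)
            if len(cache) > CACHE_SIZE:
                cache.popleft()
    return missed

def strip_row(first_bottom):
    """Strip indices for one 16-vertex-wide row: bottom, top, bottom, top, ..."""
    out = []
    for c in range(16):
        out += [first_bottom + c, first_bottom + 16 + c]
    return out

cache = deque()
draw(cache, strip_row(0))                # bottom row: 0, 16, 1, 17, ..., 15, 31
cache_after_bottom = list(cache)
top_missed = draw(cache, strip_row(16))  # top row: 16, 32, 17, 33, ..., 31, 47
refetched = [v for v in top_missed if v < 32]  # middle vertices shaded again
print(cache_after_bottom)
print(refetched)
```

In this model the cache after the bottom row matches the listing above, and the top row ends up refetching every middle vertex, because each miss pushes out a middle vertex that is needed a little later.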
If we could somehow get the vertices from the bottom edge to be evicted from the post-T&L cache before any of the middle vertices, we'd be a lot better off—and that's what PrimeVertexCache() lets us do. Suppose we first prime the post-T&L cache with the following vertices:
0, 1, 2, 3, …, 13, 14, 15
The primed vertices—the entire bottom edge—will be the first ones evicted from the cache. Now, after we draw the bottom row of triangles with a normal tristrip:
0, 16, 1, 17, 2, 18, 3, 19, …, 14, 30, 15, 31
the contents of the vertex cache, from oldest to newest, will be:
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
The key here is that all the middle vertices are in the cache, and all the bottom vertices have either been evicted or are next in line to be tossed out. Now when we draw the top row of triangles as follows:
16, 32, 17, 33, 18, 34, 19, 35, …, 28, 44, 29, 45, 30, 46, 31, 47
all the middle vertices are in the post-T&L cache, no refetching or reshading is required, and maximum performance is achieved.
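The primed sequence can be checked the same way. Again this is only a model: a strict 16-entry FIFO is assumed, and priming is imitated by pushing the bottom-edge indices through the simulated cache before any drawing; it does not call or reproduce the real PrimeVertexCache().

```python
from collections import deque

CACHE_SIZE = 16  # assumed; the effective size may differ on real hardware

def draw(cache, indices):
    """Run indices through a FIFO cache in place; return the indices that missed."""
    missed = []
    for v in indices:
        if v not in cache:
            missed.append(v)
            cache.append(v)
            if len(cache) > CACHE_SIZE:
                cache.popleft()
    return missed

def strip_row(first_bottom):
    """Strip indices for one 16-vertex-wide row: bottom, top, bottom, top, ..."""
    out = []
    for c in range(16):
        out += [first_bottom + c, first_bottom + 16 + c]
    return out

cache = deque()
primed = draw(cache, range(16))           # stand-in for priming with 0..15
bottom_missed = draw(cache, strip_row(0)) # only 16..31 should miss
cache_after_bottom = list(cache)
top_missed = draw(cache, strip_row(16))   # only 32..47 should miss
total = len(primed) + len(bottom_missed) + len(top_missed)
print(cache_after_bottom)
print(total)  # prints 48, one fetch per unique vertex
```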