optimized by post-T&L vertex cache

use degenerate triangles to fill post-T&L vertex cache
That sounds grossly inefficient to me.

  1. I would suspect that any decent graphics accelerator (when drawing triangles) would detect that you've passed in a degenerate triangle and throw it away well before it even got around to fetching the vertices.
  2. The idea of the FIFO approach is that you can reuse vertices of previous triangles efficiently and that as vertices get "older" they are less likely to be re-used.

IMHO you should at least be using an ordering like: "0,1,15; 1, 16, 15; 1, 2, 16;..." say with N/2(-1?) unique vertices (where N is the size of the FIFO), and then come back for the second layer, eg. 15, 16, 30; 16,31,30; 16, 17, 31... etc (or possibly doing that from right to left instead of left-to-right)
That way many of the earlier points will still be in the FIFO when you get back to them. There are no doubt much better orders to use, but this would probably still be good for systems that just use strips. I'll leave it up to you to express those as strips. :)

Of course, it is dangerous to optimise for just one size of FIFO since different HW systems are likely to have different size caches. I remember reading a great paper which described a method for optimising the triangle order so that it was optimal for any cache/FIFO size. I posted details to it on B3D a while back - it's probably worth a search.
 
Dio said:
Yep. It uses degenerate triangles to pre-fill a FIFO style cache. Then the second row does the actual drawing, and pre-fills the second line of the cache.

My uncertainty in it's efficiency is that I'm not sure that many degenerate triangles are at all a good idea.

render degenerate triangles,resulting in no pixels(in degenerate triangles) actually being drawn.The cost that save process shading vertex by use degenerate triangles is more index.
 
Simon F said:
use degenerate triangles to fill post-T&L vertex cache
That sounds grossly inefficient to me.

  1. I would suspect that any decent graphics accelerator (when drawing triangles) would detect that you've passed in a degenerate triangle and throw it away well before it even got around to fetching the vertices.
  2. The idea of the FIFO approach is that you can reuse vertices of previous triangles efficiently and that as vertices get "older" they are less likely to be re-used.

IMHO you should at least be using an ordering like: "0,1,15; 1, 16, 15; 1, 2, 16;..." say with N/2(-1?) unique vertices (where N is the size of the FIFO), and then come back for the second layer, eg. 15, 16, 30; 16,31,30; 16, 17, 31... etc (or possibly doing that from right to left instead of left-to-right)
That way many of the earlier points will still be in the FIFO when you get back to them. There are no doubt much better orders to use, but this would probably still be good for systems that just use strips. I'll leave it up to you to express those as strips. :)

Of course, it is dangerous to optimise for just one size of FIFO since different HW systems are likely to have different size caches. I remember reading a great paper which described a method for optimising the triangle order so that it was optimal for any cache/FIFO size. I posted details to it on B3D a while back - it's probably worth a search.

Notes on the post-T&L cache
It's worth noting that tristrips automatically organize vertices so as to get a large part of the benefit of the post-T&L cache. That is, each vertex can be used up to six times on average, and a tristrip automatically uses each vertex three times in the space of three triangles, so with tristrips each vertex at worst gets fetched and shaded only twice, rather than the trilist worst-case of six times. In truth, tristrips use the primitive-assembly cache, discussed next, rather than the post-T&L cache, to accomplish this, but that doesn't change the basic point that tristrips eliminate a lot of the post-T&L-oriented mesh optimization that has to be done with trilists.

It also doesn't change the fact that it's still important with tristrips to optimize meshes so that the post-T&L cache can eliminate that one remaining potential refetch and reshade case for as many vertices as possible. The priming example below shows how it's possible to use the post-T&L cache to eliminate every single redundant fetch and shade in a mesh that's 30 triangles across; in other words, it's possible even in the case of a 30-wide, infinitely-long mesh to fetch and shade each vertex exactly once—the theoretical minimum. Most meshes won't be quite so amenable to optimization, but it's possible to come surprisingly close to ideal vertex processing by deft use of the post-T&L cache.

One handy feature of the post-T&L cache is that it persists between primitives. In other words, if you draw a tristrip, and then immediately, without changing any renderstates, draw another primitive, for example a trilist, any vertices left in the post-T&L cache at the end of the first primitive are available to the second tristrip.

Another interesting aspect of the post-T&L cache is that it is possible to preload it (and the pre-T&L cache as well, as a side effect), via PrimeVertexCache(). This function issues vertices outside the scope of any primitive, and has no effect other than to cause those vertices to be loaded into the pre-T&L cache, processed by the vertex shader, and loaded into the post-T&L cache. This is useful for initializing the contents of the post-T&L cache in order to take best advantage of the post-T&L cache's FIFO eviction order. For example, consider the following tristrip:

2003221171640742.jpg


Normal tristrip drawing won't result in optimal post-T&L caching in this case. Consider the drawing order:

0, 16, 1, 17, 2, 18, 3, 19, …, 14, 30, 15, 31

At this point, degenerate triangles (as discussed below) can be used to move up to the next row of triangles, but there's no way that all the vertices along the middle will still be in the post-T&L cache as the top row is drawn, because at this point vertices 16 through 23 have already been evicted. Thus many vertices will have to be refetched and retransformed, at a considerable cost in performance.

The problem here is that vertices from the bottom edge are taking up space in the post-T&L cache, even though they'll never be used again; here's the state of the cache at this point:

8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31

(This assumes the cache really is 16 entries in size; it's may actually be effectively anywhere between 14 and 18, as mentioned above, but 16 is fine for illustration.)

If we could somehow get the vertices from the bottom edge to be evicted from the post-T&L cache before any of the middle vertices, we'd be a lot better off—and that's what PrimeVertexCache() lets us do. Suppose we first prime the post-T&L cache with the following vertices:

0, 1, 2, 3, …, 13, 14, 15

The primed vertices—the entire bottom edge—will be the first ones evicted from the cache. Now, after we draw the bottom row of triangles with a normal tristrip:

0, 16, 1, 17, 2, 18, 3, 19, …, 14, 30, 15, 31

the contents of the vertex cache, from oldest to newest, will be:

16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31

The key here is that all the middle vertices are in the cache, and all the bottom vertices have either been evicted or are next in line to be tossed out. Now when we draw the top row of triangles as follows:

16, 32, 17, 33, 18, 34, 19, 35, …, 28, 44, 29, 45, 30, 46, 31, 47

all the middle vertices are in the post-T&L cache, no refetching or reshading is required, and maximum performance is achieved.
 
ultrafly said:
Notes on the post-T&L cache...
Whose notes are those?

PrimeVertexCache()
Ahh I see there is a special function to pre-load the FIFO. I presume this allows you to 'load' vertices without the associated triangles - that would be a way of avoiding the 'throw away degenerates' situation. Are you using such a function?
 
Err there doesn't seem to be a "prime vertex cache" in the DX docs. Is this some special function in some library by Abrash?
 
Simon F said:
Err there doesn't seem to be a "prime vertex cache" in the DX docs. Is this some special function in some library by Abrash?

use
0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14,15,15,0 (degenerate triangles to prime the vertex cache)
 
I'm definitely with Simon on this one. I think that some hardware might bin off the degenerates before issuing the fetches. I suspect that this is Mike A describing a quirk of the X-box design that it doesn't bin off degenerates until post-shading, and it's not generally portable to other hardware.

As Simon says (ha!), if you assume a particular VC size, you can get pretty close to the max by doing your strips in runs of just under half the VC size. This brings your fetches/vertex well under 1.5, but only issues one degenerate triangle per run.

Another point: remember that classic memory caches are usually used pre-shader and these make a big difference in the fetch bandwidth and latency, so it can be important to have good locality of reference on your indexing, particularly if your vertex buffers don't have exact 32 strides or you are using a lot of streams. Going (0, 100, 200), (100, 200, 563), (200, 563, 1021) may be inefficient.
 
ultrafly said:
Simon F said:
Err there doesn't seem to be a "prime vertex cache" in the DX docs. Is this some special function in some library by Abrash?

use
0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14,15,15,0 (degenerate triangles to prime the vertex cache)

That's the point, though. You're passing in dud triangles and (IMHO as SW/HW graphics designer) any piece of HW is probably going to cull those as early as possible in the graphics pipeline. (You wouldn't believe the amount of redundant rubbish that some "supposedly efficient" applications pass to the driver/HW which, unless filtered out, would waste time).

Now if DX or OGL had a specific function to intialise the FIFO then that would be a different matter, but AFAICS all you are doing is eating up cycles.
 
One thing particularly interests me. Why do you assume that what is optimal for the X-box is optimal for all hardware?
 
Dio said:
One thing particularly interests me. Why do you assume that what is optimal for the X-box is optimal for all hardware?

The post-T&L vertex cache of XBOX GPU is FIFO,the ATI And NVIDIA is too.And the theory of optimized is all-purpose,i think. :LOL: :LOL: :LOL:
 
ultrafly said:
The post-T&L vertex cache of XBOX GPU is FIFO,the ATI And NVIDIA is too.
Again, can I ask where you got this information about ATI's hardware?

ultrafly said:
And the theory of optimized is all-purpose,i think.
Unfortunately, I think you are wrong.... :(
 
ultrafly said:
The post-T&L vertex cache of XBOX GPU is FIFO,the ATI And NVIDIA is too.And the theory of optimized is all-purpose,i think.
I'm afraid you're being a bit naive in assuming just because one HW solution (i.e. XBOX) does (or in this case doesn't do!) something that all systems behave the same way.
Obviously all will render the same triangles but they do it in different ways.
 
Simon F said:
ultrafly said:
The post-T&L vertex cache of XBOX GPU is FIFO,the ATI And NVIDIA is too.And the theory of optimized is all-purpose,i think.
I'm afraid you're being a bit naive in assuming just because one HW solution (i.e. XBOX) does (or in this case doesn't do!) something that all systems behave the same way.
Obviously all will render the same triangles but they do it in different ways.

First ,after using 20,000 triangles,I got two results,optimized code is 255 fps, and no optimized code is 177 fps.
If the method of optimized is no useful,why the optimized code is faster than no optimized code?

I am sorry for my pool english.
 
ultrafly said:
First ,after using 20,000 triangles,I got two results,optimized code is 255 fps, and no optimized code is 177 fps.
If the method of optimized is no useful,why the optimized code is faster than no optimized code?
You will be getting some benefit because once you start definining (say) the second row of real triangles they'll start to re-use the previous row's vertices. This allows you to get >1+ triangles per vertex.

The unoptimised example you gave just keeps running down the long strip which means that, at best, you get, at best, slightly less than 1 t/v.

I am sorry for my pool english.
Gosh! No need to apologise. I'm always impressed by people who can speak multiple languages. I can only do English and C :)
 
Simon F said:
ultrafly said:
First ,after using 20,000 triangles,I got two results,optimized code is 255 fps, and no optimized code is 177 fps.
If the method of optimized is no useful,why the optimized code is faster than no optimized code?
You will be getting some benefit because once you start definining (say) the second row of real triangles they'll start to re-use the previous row's vertices. This allows you to get >1+ triangles per vertex.

The unoptimised example you gave just keeps running down the long strip which means that, at best, you get, at best, slightly less than 1 t/v.

I am sorry for my pool english.
Gosh! No need to apologise. I'm always impressed by people who can speak multiple languages. I can only do English and C :)

First,the optimised code is faster.The strange thing is when i reduce the size of triangles,the unoptimised code is faster.
What is wrong?I don't understand.
 
Ultrafly:
How's the mesh organized?
Do you do one PrimeVertexCache, and the rest of the 20K triangles are 2tri/vertex. Or do you have several meshes and do a PrimeVertexCache to start each of them? In that case, how large are the individual meshes?
 
ultrafly said:
First,the optimised code is faster.The strange thing is when i reduce the size of triangles,the unoptimised code is faster.
What is wrong?I don't understand.
How small are the triangles in this second case? 10s of pixels, 1 pixel, smaller than a pixel?
 
Basic said:
Ultrafly:
How's the mesh organized?
Do you do one PrimeVertexCache, and the rest of the 20K triangles are 2tri/vertex. Or do you have several meshes and do a PrimeVertexCache to start each of them? In that case, how large are the individual meshes?

my optimize method is:

suppose the post-T&L wertex should be 15 vertexs,
the triangles:
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
| \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ |
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
| \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ |
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14


the index of the optimized code is:
0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14 ,0, 0 ,15, 1, 16, 2, 17, 3, 18, 4, 19, 5, 20, 6, 21, 7, 22, 8, 23, 9, 24, 10, 25, 11, 26, 12, 27, 13, 28, 14, 29,29, 15, 15 ,30, 16...

Simon F said:
How small are the triangles in this second case? 10s of pixels, 1 pixel, smaller than a pixel?

Bigger than a pixel.
 
Back
Top