optimized by post-T&L vertex cache

ultrafly

Newcomer
Hi.

I am using the method described by Mike Abrash ("Xbox Vertex Performance"), but I have found some strange things.

I assume the post-T&L vertex cache holds 11 vertices on my R9500 Pro 64M.

First, with 20,000 triangles I got two results: the optimized code runs at 255 fps and the unoptimized code at 177 fps. But when I reduce the size of the triangles (without changing the number of triangles), the unoptimized code is faster than the optimized code. Why?

Second, I found that D3DFILL_SOLID is faster than D3DFILL_WIREFRAME. Why?

I am sorry for my poor English.
 
ultrafly said:
Second, I found that D3DFILL_SOLID is faster than D3DFILL_WIREFRAME. Why?
Solid generates fewer primitives than wireframe. For every triangle sent, in wireframe you get 3 lines.
 
My optimization method is as follows.

As an example, suppose the post-T&L vertex cache holds 15 vertices.
The triangles:
30  31  32  33  34  35  36  37  38  39  40  41  42  43  44
| \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ |
15  16  17  18  19  20  21  22  23  24  25  26  27  28  29
| \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ |
0   1   2   3   4   5   6   7   8   9   10  11  12  13  14


The index list of the optimized code is:
0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 0, 0, 15, 1, 16, 2, 17, 3, 18, 4, 19, 5, 20, 6, 21, 7, 22, 8, 23, 9, 24, 10, 25, 11, 26, 12, 27, 13, 28, 14, 29, 29, 15, 15, 30, 16...

The index list of the unoptimized code is:
0, 15, 1, 16, 2, 17, 3, 18, 4, 19, 5, 20, 6, 21, 7, 22, 8, 23, 9, 24, 10, 25, 11, 26, 12, 27, 13, 28, 14, 29, 29, 15, 15, 30, 16...
 
I'm no software guru, but it looks like you're reusing a LOT of vertices. Why?

The unoptimised is using half as many entries for the same data...
 
The no-optimised order certainly isn't _bad_.

I presume what you're trying to do with the optimal order is to fill in cache entries per line, then use them, then make sure the next line's LRU is updated so they don't get evicted. Problem is your line is too large; you've got twice as many vertices in your cache (if you check, you will see you need 30 vertices for the size you have).

I'm not sure I'd recommend sending that many degenerate triangles (more than 50%). I'd stick with the 'no-optimised order' myself, with a shorter line repeat rate maybe.
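
One rough way to check the degenerate-triangle share is to count it directly. The sketch below is not from the thread: `count_strip_triangles` is a hypothetical helper, and it assumes the index lists are drawn as a single triangle strip, so that every three consecutive indices form one triangle. It rebuilds only the pre-fill pass and the first drawn row of the 15-wide sample.

```python
def count_strip_triangles(indices):
    """Return (degenerate, total) triangle counts for a triangle-strip index list."""
    degenerate = 0
    total = 0
    # Every window of three consecutive indices is one strip triangle.
    for a, b, c in zip(indices, indices[1:], indices[2:]):
        total += 1
        if a == b or b == c or a == c:
            degenerate += 1  # a repeated index gives a zero-area triangle
    return degenerate, total


WIDTH = 15  # vertices per row, as in the sample grid

# Pre-fill pass from the sample: 0, 1, 1, 2, 2, ..., 14, 14, 0
prefill = [0] + [c for c in range(1, WIDTH) for _ in range(2)] + [0]
# First drawn row from the sample: 0, 15, 1, 16, ..., 14, 29
first_row = [v for c in range(WIDTH) for v in (c, WIDTH + c)]

print("with pre-fill:   ", count_strip_triangles(prefill + first_row))
print("without pre-fill:", count_strip_triangles(first_row))
```

For this single row the count with the pre-fill is 30 degenerate triangles out of 58, against none out of 28 without it. If the pre-fill pass is only needed once for a mesh with many rows, as the posted index list suggests, the overall share would be lower.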
 
The cache is FIFO, not LRU.

Optimized code:

After
0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 0

the cache (15 vertices) is:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14

After
0, 15, 1, 16, 2, 17, 3, 18, 4, 19, 5, 20, 6, 21, 7, 22, 8, 23, 9, 24, 10, 25, 11, 26, 12, 27, 13, 28, 14, 29

the cache (15 vertices) is:
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29

After
15, 30, 16, 31, 17, 32, 18, 33, 19, 34, 20, 35, 21, 36, 22, 37, 23, 38, 24, 39, 25, 40, 26, 41, 27, 42, 28, 43, 29, 44

the cache (15 vertices) is:
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44

...

Every vertex is processed only once.

Unoptimized code:

After
0, 15, 1, 16, 2, 17, 3, 18, 4, 19, 5, 20, 6, 21, 7, 22, 8, 23, 9, 24, 10, 25, 11, 26, 12, 27, 13, 28, 14, 29

the cache (15 vertices) is:
22, 8, 23, 9, 24, 10, 25, 11, 26, 12, 27, 13, 28, 14, 29

When the second row is processed:
15, 30, 16, 31, 17, 32, 18, 33, 19, 34, 20, 35, 21, 36, 22, 37, 23, 38, 24, 39, 25, 40, 26, 41, 27, 42, 28, 43, 29, 44

15: not in the cache, so it is reloaded and processed again
30: not in the cache, so it is loaded and processed
...

Vertices shared between rows are processed more than once.
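
The trace above can be replayed with a small simulation. This is only a sketch under the same assumption the post makes, namely a 15-entry FIFO cache in which a hit does not reorder entries; the real R9500 Pro behaviour is not confirmed anywhere in this thread. The names `fifo_misses`, `plain_strip` and `prefilled_strip` are hypothetical helpers, not part of any API.

```python
from collections import deque


def fifo_misses(indices, cache_size):
    """Count cache misses (vertices that must be transformed) for a FIFO cache."""
    fifo = deque()      # insertion order, oldest entry on the left
    resident = set()    # same contents as fifo, for fast membership tests
    misses = 0
    for v in indices:
        if v in resident:
            continue    # hit: a FIFO cache does not reorder on a hit
        misses += 1
        if len(fifo) == cache_size:
            resident.discard(fifo.popleft())  # evict the oldest entry
        fifo.append(v)
        resident.add(v)
    return misses


def plain_strip(width, vertex_rows):
    """Unoptimized order: 0, 15, 1, 16, ... with rows stitched by repeated indices."""
    out = []
    for r in range(vertex_rows - 1):
        lo, hi = r * width, (r + 1) * width
        if out:
            out += [out[-1], lo]  # degenerate stitch, e.g. ..., 29, 29, 15, 15, ...
        for c in range(width):
            out += [lo + c, hi + c]
    return out


def prefilled_strip(width, vertex_rows):
    """Optimized order: pre-fill 0, 1, 1, 2, 2, ..., then the plain strip."""
    prefill = [0]
    for c in range(1, width):
        prefill += [c, c]
    prefill.append(0)
    return prefill + plain_strip(width, vertex_rows)


# The 45-vertex sample grid: 3 rows of 15 vertices, 15-entry cache.
print("optimized misses:  ", fifo_misses(prefilled_strip(15, 3), 15))
print("unoptimized misses:", fifo_misses(plain_strip(15, 3), 15))
```

Under these assumptions the optimized order transforms each of the 45 vertices once, while the unoptimized order transforms the shared middle row twice. Passing a different `cache_size` shows how quickly the benefit disappears when the row width does not match the cache.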
 
Dio said:
Where do you get the information that the R9500 Pro cache is FIFO?

It is only my assumption. I couldn't find any information about ATI's vertex cache.
The vertex cache on NVIDIA GPUs is FIFO, so I assumed that ATI's vertex cache is FIFO as well.
 
I wouldn't assume one way or the other... I must admit I don't know myself, maybe one of the other ATI chaps on here might?
 
Probably a FIFO, as that should actually perform better than an LRU (at least according to Hoppe).

ultrafly said:
the index of the optimized code is:
0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, ...
I was a bit confused by this. I'm assuming that you are forming triangles, but why aren't these ordered more like 0, 1, 15; 1, 16, 15; 1, 2, 16; ...?
 
Why is the unoptimized code faster than the optimized code after I changed the triangle size in my project?

Thanks.
 
Simon F said:
I was a bit confused by this. I'm assuming that you are forming triangles, but why aren't these ordered more like 0, 1, 15; 1, 16, 15; 1, 2, 16; ...?
Tristrips.

I think the algorithm's valid, but requires a lot of assumptions about how the hardware works. Personally, I wouldn't use an algorithm of this kind - I'd take a few % inefficiency in exchange for portability, but what do I know? :)
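
To illustrate how much the scheme leans on those hardware assumptions, here is a rough sketch that replays the pre-fill pass and first row of the 15-wide sample against two replacement policies. Nothing here describes real hardware: `count_misses` is a hypothetical helper, and the "lru" policy is a simple textbook model used only for comparison.

```python
from collections import OrderedDict


def count_misses(indices, cache_size, policy):
    """Count misses for a 'fifo' or 'lru' cache model of the given size."""
    cache = OrderedDict()  # oldest (or least recently used) entry first
    misses = 0
    for v in indices:
        if v in cache:
            if policy == "lru":
                cache.move_to_end(v)  # LRU refreshes an entry on a hit
            continue
        misses += 1
        if len(cache) >= cache_size:
            cache.popitem(last=False)  # evict the entry at the front
        cache[v] = True
    return misses


WIDTH = 15
# Pre-fill 0, 1, 1, 2, 2, ..., 14, 14, 0 followed by 0, 15, 1, 16, ..., 14, 29
prefill = [0] + [c for c in range(1, WIDTH) for _ in range(2)] + [0]
first_row = [v for c in range(WIDTH) for v in (c, WIDTH + c)]
optimized = prefill + first_row

for policy, size in (("fifo", 15), ("lru", 15), ("fifo", 14)):
    print(policy, size, count_misses(optimized, size, policy))
```

With a 15-entry FIFO the pre-filled order reaches the ideal of one miss per vertex. With the LRU model, or with a FIFO that is one entry smaller, the same index list misses on almost every access, which is worse than simply drawing the row without the pre-fill.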
 
I think the method should be efficient.
But I don't understand why I get the opposite result when I reduce the size of the triangles.
 
Dio said:
Tristrips.
But surely they don't even form tristrips. The vals given would make triangles (0,1,1) (1,1,2) (1,2,2)... etc!
 
Yep. It uses degenerate triangles to pre-fill a FIFO style cache. Then the second row does the actual drawing, and pre-fills the second line of the cache.

My uncertainty about its efficiency is that I'm not sure that many degenerate triangles are a good idea at all.
 
It uses degenerate triangles to fill the post-T&L vertex cache by sending the vertices in a particular order.
 