optimized by post-T&L vertex cache

ultrafly said:
....
the index of the optimized code is:
0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 0, 0, 15, 1, 16, 2, 17, 3, 18, 4, 19, 5, 20, 6, 21, 7, 22, 8, 23, 9, 24, 10, 25, 11, 26, 12, 27, 13, 28, 14, 29, 29, 15, 15, 30, 16...
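The index stream quoted above appears to follow a regular pattern: a degenerate "priming" strip walks the first row of vertices into the post-T&L cache, then interleaved strips reuse those cached vertices, with repeated indices bridging between rows. A sketch reconstructing it (the row width of 15 and the `cache_primed_strip` name are inferred, not from the original code):

```python
def cache_primed_strip(rows, cols):
    idx = []
    # Priming pass: degenerate strip 0,1 1,2 ... loads row 0 into the cache.
    for i in range(cols - 1):
        idx += [i, i + 1]
    idx += [cols - 1, 0]               # bridge back to the start of row 0
    # Body: interleave each row with the next, bridging between rows
    # with repeated indices (degenerate triangles).
    for r in range(rows - 1):
        a, b = r * cols, (r + 1) * cols
        for i in range(cols):
            idx += [a + i, b + i]
        if r < rows - 2:
            idx += [b + cols - 1, b]   # degenerate bridge to the next row
    return idx
```

For `rows=3, cols=15` this reproduces the quoted prefix 0, 1, 1, 2, ..., 14, 0, 0, 15, 1, 16, ..., 14, 29, 29, 15, 15, 30, 16, ...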


Simon F said:
How small are the triangles in this second case? 10s of pixels, 1 pixel, smaller than a pixel?

Bigger than a pixel.

Sorry I meant to also ask how large are the triangles with the "large" case. Are any being clipped or off-screen culled?
I've been trying to think whether it might be due to fill-rate behaviour, but I need a bit more information. Do you know the frame rates for the four situations? What's the resolution of the image?

This probably won't have much effect, but could you try removing your "FIFO pre-load" of the vertices or, if you still want it to have some effect on XBox (or perhaps all Nvidia chips???), just preload, say, vertices 1 to 4? I suspect it's not helping you with the ATI HW.
 
Do you just do two strips next to each other (+ the first cache filling "strip")?

What I'm interested in is what happens after vertex 44 in your case. New PrimeVertexCache or a new strip like 44, 30, 30, 45, 31, 46, ... .
If it's the second case, how many parallel strips do you do between each PrimeVertex? (I would call the example you gave 1+2 strips.)

Btw, what's the framerates in the two small-poly cases?

I have a theory, but don't know if it holds water.
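One way to reason about the four cases discussed above is to count re-transformed vertices directly. A minimal FIFO cache model (a sketch; whether ATI hardware actually uses FIFO replacement is exactly the open question in this thread) counts how many indices miss the post-T&L cache:

```python
from collections import deque

def fifo_cache_misses(indices, cache_size):
    """Count post-T&L vertex cache misses for an index stream,
    assuming simple FIFO replacement (the NV2X-style model)."""
    cache = deque()
    misses = 0
    for i in indices:
        if i not in cache:
            misses += 1          # vertex must be (re-)transformed
            cache.append(i)
            if len(cache) > cache_size:
                cache.popleft()  # FIFO: evict the oldest entry
    return misses
```

Running this over the optimized and unoptimized index streams with a few candidate cache sizes would show whether the pre-load pattern actually reduces transforms on the hardware in question.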
 
Simon F said:
ultrafly said:
....
the index of the optimized code is:
0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 0, 0, 15, 1, 16, 2, 17, 3, 18, 4, 19, 5, 20, 6, 21, 7, 22, 8, 23, 9, 24, 10, 25, 11, 26, 12, 27, 13, 28, 14, 29, 29, 15, 15, 30, 16...


Simon F said:
How small are the triangles in this second case? 10s of pixels, 1 pixel, smaller than a pixel?

Bigger than a pixel.

Sorry I meant to also ask how large are the triangles with the "large" case. Are any being clipped or off-screen culled?
I've been trying to think whether it might be due to fill-rate behaviour, but I need a bit more information. Do you know the frame rates for the four situations? What's the resolution of the image?

This probably won't have much effect, but could you try removing your "FIFO pre-load" of the vertices or, if you still want it to have some effect on XBox (or perhaps all Nvidia chips???), just preload, say, vertices 1 to 4? I suspect it's not helping you with the ATI HW.

Both before and after reducing the size of the triangles, all triangles are on screen. Nothing is clipped and nothing is culled.
I think the fill-rate of the optimized code is the same as that of the unoptimized code, and both change at the same time.
 
Basic said:
Do you just do two strips next to each other (+ the first cache filling "strip")?

What I'm interested in is what happens after vertex 44 in your case. New PrimeVertexCache or a new strip like 44, 30, 30, 45, 31, 46, ... .
If it's the second case, how many parallel strips do you do between each PrimeVertex? (I would call the example you gave 1+2 strips.)

Btw, what's the framerates in the two small-poly cases?

I have a theory, but don't know if it holds water.

That was just a sample.
In my actual code there are 100 parallel strips between each PrimeVertexCache.

What do you mean by "two small-poly cases"?

I am very interested in your theory. Please share it, thanks.
 
First, the optimised code is faster. The strange thing is that when I reduce the size of the triangles, the unoptimised code is faster.
What is wrong? I don't understand.

OK I'm guessing here.....

Your performance probably becomes partially setup bound as you reduce the tri size, and the optimised version has more degenerate tris in it than the non-optimised version. Abrash's suggestion is specifically for NV2X, which has a special case for degenerate setup that will in most cases prevent this; R9XXX may or may not.
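This theory can be made concrete by counting how many triangles in a strip are degenerate. A small sketch (the helper name is illustrative, not from anyone's actual code):

```python
def count_strip_triangles(indices):
    """Split a triangle strip into real and degenerate triangles.
    A strip triangle is degenerate when any two of its three indices
    repeat -- exactly what cache-priming patterns insert on purpose."""
    real = degenerate = 0
    for k in range(len(indices) - 2):
        a, b, c = indices[k], indices[k + 1], indices[k + 2]
        if a == b or b == c or a == c:
            degenerate += 1
        else:
            real += 1
    return real, degenerate
```

The priming sequence 0, 1, 1, 2, 2, 3 alone, for example, produces four strip triangles, every one of them degenerate; if the hardware pays full setup cost for each, the "optimised" stream is doing strictly more setup work than the plain one.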
 
ERP said:
First, the optimised code is faster. The strange thing is that when I reduce the size of the triangles, the unoptimised code is faster.
What is wrong? I don't understand.

OK I'm guessing here.....

Your performance probably becomes partially setup bound as you reduce the tri size, and the optimised version has more degenerate tris in it than the non-optimised version. Abrash's suggestion is specifically for NV2X, which has a special case for degenerate setup that will in most cases prevent this; R9XXX may or may not.

When I reduce the tri size, the only thing that changes is the fill-rate load; the T&L throughput is unchanged. The optimised code should still be faster after the size change, but the result is the reverse.
 
ultrafly said:
When I reduce the tri size, the only thing that changes is the fill-rate load; the T&L throughput is unchanged. The optimised code should still be faster after the size change, but the result is the reverse.
Do you mean the optimized code with reduced tri size is slower than the unoptimized code with small tris, or slower than the optimized code with large tris?
 
Xmas said:
ultrafly said:
When I reduce the tri size, the only thing that changes is the fill-rate load; the T&L throughput is unchanged. The optimised code should still be faster after the size change, but the result is the reverse.
Do you mean the optimized code with reduced tri size is slower than the unoptimized code with small tris, or slower than the optimized code with large tris?

The result of my test is: the optimized code with big tris is faster than the unoptimized code with big tris, and the optimized code with small tris is slower than the unoptimized code with small tris. All tris are visible on screen; nothing is clipped or culled.
 
ultrafly said:
The result of my test is: the optimized code with big tris is faster than the unoptimized code with big tris, and the optimized code with small tris is slower than the unoptimized code with small tris. All tris are visible on screen; nothing is clipped or culled.
Then ERP's explanation is perfectly possible.
 
Dio said:
Again, can I ask where you got this information about ATI's hardware?

[...]

Unfortunately, I think you are wrong.... :(

Ok.

Would you be so kind as to tell us how the post-T&L vertex cache of the R300 operates?

1. How many vertices can the cache hold? (I know that the DX9 optimization docs say to query the driver but the result is "not-supported".)

2. Does the cache have a FIFO or LRU organization, or something else?

3. Does the R300 benefit from using optimized triangle strips instead of optimized triangle lists?

4. Are degenerate triangles rejected at the triangle setup rate, or are they rejected much faster by detecting repeated indices?

5. Is the content of the cache preserved between DIP calls if there's no render state / VB changes between them?

If you cannot give the answers (or get them), please say so, I'll try dev-support next.
(But I think it would be better to publish the info - it's in ATI's interest!)
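Question 2 matters because the same index stream can miss differently under the two policies: a vertex that a strip keeps coming back to stays resident under LRU but eventually falls out of a FIFO. A sketch comparing the two (hypothetical model, not a claim about R300 behaviour):

```python
from collections import deque

def cache_misses(indices, cache_size, lru=False):
    """Cache misses for an index stream under FIFO or LRU replacement."""
    cache = deque()
    n = 0
    for i in indices:
        if i in cache:
            if lru:                  # LRU: a hit refreshes the entry
                cache.remove(i)
                cache.append(i)
        else:
            n += 1
            cache.append(i)
            if len(cache) > cache_size:
                cache.popleft()
    return n

# A triangle fan around vertex 0 separates the two policies:
# FIFO eventually evicts the hub vertex, LRU keeps it resident.
fan = [0, 1, 0, 2, 0, 3, 0, 4]
```

With a cache of two entries, the fan stream misses six times under FIFO but only five under LRU, so an index order tuned for one policy can be mildly pessimal for the other.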
 
Hyp-X said:
Dio said:
Again, can I ask where you got this information about ATI's hardware?

[...]

Unfortunately, I think you are wrong.... :(

Ok.

Would you be so kind as to tell us how the post-T&L vertex cache of the R300 operates?

1. How many vertices can the cache hold? (I know that the DX9 optimization docs say to query the driver but the result is "not-supported".)

2. Does the cache have a FIFO or LRU organization, or something else?

3. Does the R300 benefit from using optimized triangle strips instead of optimized triangle lists?

4. Are degenerate triangles rejected at the triangle setup rate, or are they rejected much faster by detecting repeated indices?

5. Is the content of the cache preserved between DIP calls if there's no render state / VB changes between them?

If you cannot give the answers (or get them), please say so, I'll try dev-support next.
(But I think it would be better to publish the info - it's in ATI's interest!)

I tested the efficiency of degenerate triangles on the R300; the results are:

first case: 20,000 normal triangles + 396 degenerate triangles: 515 fps
second case: 2 normal triangles + 20,394 degenerate triangles: 578 fps

The result is disillusioning.
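A back-of-envelope check of these numbers (assuming both runs are setup-bound, which the fps figures don't prove by themselves):

```python
# Both cases submit the same total primitive count per frame,
# so triangles/second scales directly with fps.
tris_case1 = 20_000 + 396      # mostly real triangles
tris_case2 = 2 + 20_394        # mostly degenerate triangles
assert tris_case1 == tris_case2 == 20_396

speedup = 578 / 515            # ~1.12x from going almost all-degenerate
```

Replacing essentially every real triangle with a degenerate one buys only about 12% more throughput, which suggests the R300 pays close to a full setup cycle per degenerate rather than rejecting them early by index comparison. That would fit the "disillusioning" verdict above.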
 
When it comes to ATI hardware I only comment on things I have read on the public internet, I sleep better that way, so I'm afraid you'll have to try dev support unless one of the other ATI guys here can answer.

I would observe that if a DX9 query says 'not supported' then you cannot assume anything.
 