optimized by post-T&L vertex cache

ultrafly said:
....
the index of the optimized code is:
0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 0, 0, 15, 1, 16, 2, 17, 3, 18, 4, 19, 5, 20, 6, 21, 7, 22, 8, 23, 9, 24, 10, 25, 11, 26, 12, 27, 13, 28, 14, 29, 29, 15, 15, 30, 16...
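The index stream quoted above appears to follow a regular pattern: a degenerate "priming" strip walks the first row of vertices into the post-T&L cache, then interleaved strips reuse those cached vertices, with repeated indices bridging between rows. A sketch reconstructing it (the row width of 15 and the `cache_primed_strip` name are inferred, not from the original code):

```python
def cache_primed_strip(rows, cols):
    idx = []
    # Priming pass: degenerate strip 0,1 1,2 ... loads row 0 into the cache.
    for i in range(cols - 1):
        idx += [i, i + 1]
    idx += [cols - 1, 0]               # bridge back to the start of row 0
    # Body: interleave each row with the next, bridging between rows
    # with repeated indices (degenerate triangles).
    for r in range(rows - 1):
        a, b = r * cols, (r + 1) * cols
        for i in range(cols):
            idx += [a + i, b + i]
        if r < rows - 2:
            idx += [b + cols - 1, b]   # degenerate bridge to the next row
    return idx
```

For `rows=3, cols=15` this reproduces the quoted prefix 0, 1, 1, 2, ..., 14, 0, 0, 15, 1, 16, ..., 14, 29, 29, 15, 15, 30, 16, ...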


Simon F said:
How small are the triangles in this second case? 10s of pixels, 1 pixel, smaller than a pixel?

Bigger than a pixel.

Sorry I meant to also ask how large are the triangles with the "large" case. Are any being clipped or off-screen culled?
I've been trying to think whether it might be due to fill-rate behaviour, but I need a bit more information. Do you know the frame rates for the four situations? What's the resolution of the image?

This probably won't have much effect, but could you try removing your "FIFO pre-load" of the vertices or, if you still want it to have some effect on XBox (or perhaps all Nvidia chips???), just preload, say, vertices 1 to 4? I suspect it's not helping you with the ATI HW.
 
Do you just do two strips next to each other (+ the first cache filling "strip")?

What I'm interested in is what happens after vertex 44 in your case. New PrimeVertexCache or a new strip like 44, 30, 30, 45, 31, 46, ... .
If it's the second case, how many parallel strips do you do between each PrimeVertex? (I would call the example you gave 1+2 strips.)

Btw, what's the framerates in the two small-poly cases?

I have a theory, but don't know if it holds water.
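One way to reason about the four cases discussed above is to count re-transformed vertices directly. A minimal FIFO cache model (a sketch; whether ATI hardware actually uses FIFO replacement is exactly the open question in this thread) counts how many indices miss the post-T&L cache:

```python
from collections import deque

def fifo_cache_misses(indices, cache_size):
    """Count post-T&L vertex cache misses for an index stream,
    assuming simple FIFO replacement (the NV2X-style model)."""
    cache = deque()
    misses = 0
    for i in indices:
        if i not in cache:
            misses += 1          # vertex must be (re-)transformed
            cache.append(i)
            if len(cache) > cache_size:
                cache.popleft()  # FIFO: evict the oldest entry
    return misses
```

Running this over the optimized and unoptimized index streams with a few candidate cache sizes would show whether the pre-load pattern actually reduces transforms on the hardware in question.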
 
Simon F said:
ultrafly said:
....
the index of the optimized code is:
0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 0, 0, 15, 1, 16, 2, 17, 3, 18, 4, 19, 5, 20, 6, 21, 7, 22, 8, 23, 9, 24, 10, 25, 11, 26, 12, 27, 13, 28, 14, 29, 29, 15, 15, 30, 16...


Simon F said:
How small are the triangles in this second case? 10s of pixels, 1 pixel, smaller than a pixel?

Bigger than a pixel.

Sorry I meant to also ask how large are the triangles with the "large" case. Are any being clipped or off-screen culled?
I've been trying to think whether it might be due to fill-rate behaviour, but I need a bit more information. Do you know the frame rates for the four situations? What's the resolution of the image?

This probably won't have much effect, but could you try removing your "FIFO pre-load" of the vertices or, if you still want it to have some effect on XBox (or perhaps all Nvidia chips???), just preload, say, vertices 1 to 4? I suspect it's not helping you with the ATI HW.

Both before and after reducing the size of the triangles, all triangles are on screen. Nothing is clipped and nothing is culled.
I think the fill-rate of the optimized code is the same as that of the unoptimized code, and both change at the same time.
 
Basic said:
Do you just do two strips next to each other (+ the first cache filling "strip")?

What I'm interested in is what happens after vertex 44 in your case. New PrimeVertexCache or a new strip like 44, 30, 30, 45, 31, 46, ... .
If it's the second case, how many parallel strips do you do between each PrimeVertex? (I would call the example you gave 1+2 strips.)

Btw, what's the framerates in the two small-poly cases?

I have a theory, but don't know if it holds water.

That was just a sample.
In my actual code there are 100 parallel strips between each PrimeVertexCache.

What do you mean by "two small-poly cases"?

I am very interested in your theory. Please share it, thanks.
 
First, the optimised code is faster. The strange thing is that when I reduce the size of the triangles, the unoptimised code is faster.
What is wrong? I don't understand.

OK I'm guessing here.....

Your performance probably becomes partially setup bound as you reduce the tri size, and the optimised version has more degenerate tris in it than the non-optimised version. Abrash's suggestion is specifically for NV2X, which has a special case for degenerate setup that will in most cases prevent this; R9XXX may or may not.
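This theory can be made concrete by counting how many triangles in a strip are degenerate. A small sketch (the helper name is illustrative, not from anyone's actual code):

```python
def count_strip_triangles(indices):
    """Split a triangle strip into real and degenerate triangles.
    A strip triangle is degenerate when any two of its three indices
    repeat -- exactly what cache-priming patterns insert on purpose."""
    real = degenerate = 0
    for k in range(len(indices) - 2):
        a, b, c = indices[k], indices[k + 1], indices[k + 2]
        if a == b or b == c or a == c:
            degenerate += 1
        else:
            real += 1
    return real, degenerate
```

The priming sequence 0, 1, 1, 2, 2, 3 alone, for example, produces four strip triangles, every one of them degenerate; if the hardware pays full setup cost for each, the "optimised" stream is doing strictly more setup work than the plain one.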
 
ERP said:
First, the optimised code is faster. The strange thing is that when I reduce the size of the triangles, the unoptimised code is faster.
What is wrong? I don't understand.

OK I'm guessing here.....

Your performance probably becomes partially setup bound as you reduce the tri size, and the optimised version has more degenerate tris in it than the non-optimised version. Abrash's suggestion is specifically for NV2X, which has a special case for degenerate setup that will in most cases prevent this; R9XXX may or may not.

When I reduce the tri size, the only thing that changes is the fill-rate load; the T&L throughput is unchanged. The optimised code should still be faster after the size change, but the result is the reverse.
 
ultrafly said:
When I reduce the tri size, the only thing that changes is the fill-rate load; the T&L throughput is unchanged. The optimised code should still be faster after the size change, but the result is the reverse.
Do you mean the optimized code with reduced tri size is slower than the unoptimized code with small tris, or slower than the optimized code with large tris?
 
Xmas said:
ultrafly said:
When I reduce the tri size, the only thing that changes is the fill-rate load; the T&L throughput is unchanged. The optimised code should still be faster after the size change, but the result is the reverse.
Do you mean the optimized code with reduced tri size is slower than the unoptimized code with small tris, or slower than the optimized code with large tris?

The result of my test is: the optimized code with big tris is faster than the unoptimized code with big tris, and the optimized code with small tris is slower than the unoptimized code with small tris. All tris are visible on screen; nothing is clipped or culled.
 
ultrafly said:
The result of my test is: the optimized code with big tris is faster than the unoptimized code with big tris, and the optimized code with small tris is slower than the unoptimized code with small tris. All tris are visible on screen; nothing is clipped or culled.
Then ERP's explanation is perfectly possible.
 
Dio said:
Again, can I ask where you got this information about ATI's hardware?

[...]

Unfortunately, I think you are wrong.... :(

Ok.

Would you be so kind as to tell us how the post-T&L vertex cache of the R300 operates?

1. How many vertices can the cache hold? (I know that the DX9 optimization docs say to query the driver but the result is "not-supported".)

2. Does the cache have a FIFO or LRU organization, or something else?

3. Does the R300 benefit from using optimized triangle strips instead of optimized triangle lists?

4. Are degenerate triangles rejected at the triangle setup rate, or are they rejected much faster by detecting repeated indices?

5. Is the content of the cache preserved between DIP calls if there's no render state / VB changes between them?

If you cannot give the answers (or get them), please say so, I'll try dev-support next.
(But I think it would be better to publish the info - it's in ATI's interest!)
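Question 2 matters because the same index stream can miss differently under the two policies: a vertex that a strip keeps coming back to stays resident under LRU but eventually falls out of a FIFO. A sketch comparing the two (hypothetical model, not a claim about R300 behaviour):

```python
from collections import deque

def cache_misses(indices, cache_size, lru=False):
    """Cache misses for an index stream under FIFO or LRU replacement."""
    cache = deque()
    n = 0
    for i in indices:
        if i in cache:
            if lru:                  # LRU: a hit refreshes the entry
                cache.remove(i)
                cache.append(i)
        else:
            n += 1
            cache.append(i)
            if len(cache) > cache_size:
                cache.popleft()
    return n

# A triangle fan around vertex 0 separates the two policies:
# FIFO eventually evicts the hub vertex, LRU keeps it resident.
fan = [0, 1, 0, 2, 0, 3, 0, 4]
```

With a cache of two entries, the fan stream misses six times under FIFO but only five under LRU, so an index order tuned for one policy can be mildly pessimal for the other.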
 
Hyp-X said:
Dio said:
Again, can I ask where you got this information about ATI's hardware?

[...]

Unfortunately, I think you are wrong.... :(

Ok.

Would you be so kind as to tell us how the post-T&L vertex cache of the R300 operates?

1. How many vertices can the cache hold? (I know that the DX9 optimization docs say to query the driver but the result is "not-supported".)

2. Does the cache have a FIFO or LRU organization, or something else?

3. Does the R300 benefit from using optimized triangle strips instead of optimized triangle lists?

4. Are degenerate triangles rejected at the triangle setup rate, or are they rejected much faster by detecting repeated indices?

5. Is the content of the cache preserved between DIP calls if there's no render state / VB changes between them?

If you cannot give the answers (or get them), please say so, I'll try dev-support next.
(But I think it would be better to publish the info - it's in ATI's interest!)

I tested the efficiency of degenerate triangles on the R300; the results are:

first case: 20,000 normal triangles + 396 degenerate triangles: 515 fps
second case: 2 normal triangles + 20,394 degenerate triangles: 578 fps

The result is disillusioning.
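A back-of-envelope check of these numbers (assuming both runs are setup-bound, which the fps figures don't prove by themselves):

```python
# Both cases submit the same total primitive count per frame,
# so triangles/second scales directly with fps.
tris_case1 = 20_000 + 396      # mostly real triangles
tris_case2 = 2 + 20_394        # mostly degenerate triangles
assert tris_case1 == tris_case2 == 20_396

speedup = 578 / 515            # ~1.12x from going almost all-degenerate
```

Replacing essentially every real triangle with a degenerate one buys only about 12% more throughput, which suggests the R300 pays close to a full setup cycle per degenerate rather than rejecting them early by index comparison. That would fit the "disillusioning" verdict above.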
 
When it comes to ATI hardware I only comment on things I have read on the public internet, I sleep better that way, so I'm afraid you'll have to try dev support unless one of the other ATI guys here can answer.

I would observe that if a DX9 query says 'not supported' then you cannot assume anything.
 