Can R420 perform more z-only/stencil ops/clk than NV40 w/AA?

Pete

Looking through nV's latest GPU Programming Guide pdf, I came across this:

3.6.1. Double-Speed Z-Only and Stencil Rendering

The GeForce FX and GeForce 6 Series GPUs render at double speed when rendering only depth or stencil values. To enable this special rendering mode, you must follow the following rules:

o Color writes are disabled
o 2x or 4x antialiasing is not enabled
o Texkill has not been applied to any fragments
o Depth replace has not been applied to any fragments
o Alpha test is disabled
o No color key is used in any of the active textures
o No user clip planes are enabled
o No floating-point render targets are in use
o Pixel shaders are disabled
o Render to a non-power-of-2 texture

Now, I remember that, according to ATi, R420 can achieve double z-only/stencil ops per clock only with AA, and here we see nV can only achieve that without AA. Is this true? Is this ability new on the R420, or is it also in RV3x0? What effect do you think this will have in Doom 3 (i.e., will the game be otherwise bottlenecked so that double stencil ops per clock won't matter much with AA)?
 
Re: Can R420 perform more z-only/stencil ops/clk than NV40 w

Pete said:
Looking through nV's latest GPU Programming Guide pdf, I came across this:

3.6.1. Double-Speed Z-Only and Stencil Rendering

The GeForce FX and GeForce 6 Series GPUs render at double speed when rendering only depth or stencil values. To enable this special rendering mode, you must follow the following rules:

o Color writes are disabled
o 2x or 4x antialiasing is not enabled
o Texkill has not been applied to any fragments
o Depth replace has not been applied to any fragments
o Alpha test is disabled
o No color key is used in any of the active textures
o No user clip planes are enabled
o No floating-point render targets are in use
o Pixel shaders are disabled
o Render to a non-power-of-2 texture

Now, I remember that, according to ATi, R420 can achieve double z-only/stencil ops per clock only with AA, and here we see nV can only achieve that without AA. Is this true? Is this ability new on the R420, or is it also in RV3x0? What effect do you think this will have in Doom 3 (i.e., will the game be otherwise bottlenecked so that double stencil ops per clock won't matter much with AA)?

I'm thinking that it just means that in the rendering pass where you want to use the double z-only/stencil ops per clock, you need to disable color writes, 2X/4X AA, and so on.
 
Multisampling requires multiple Z/stencil operations per pixel.
Both nVidia and ATI cards (NV30+ & R300+) can handle 2 Z operations per clock per pixel pipeline.
The difference is that ATI cards can only use 1 Z operation per pixel pipeline when there's no AA.

Theoretical fillrate numbers with Z/stencil ops only (Mpixels/s):
Code:
       X800XT    G6800U
NoAA   8400      12800 
2xAA   8400       6400
4xAA   4200       3200
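
A rough sketch of how the figures in the table above can be reproduced. The 16 pipelines per chip and the 525 MHz / 400 MHz core clocks are assumptions back-solved from the table itself, not stated in the post, and the function names are made up for illustration.

```python
def z_only_fillrate(pipes, clock_mhz, z_ops_per_clk, samples):
    """Theoretical Z/stencil-only fillrate in Mpixels/s.

    Every pixel needs one Z op per AA sample, so the pixel rate is the
    total Z-op rate divided by the sample count.
    """
    return pipes * clock_mhz * z_ops_per_clk / samples


def x800xt(samples):
    # Per the post: the second Z op per pipe is only usable with AA on.
    z_ops = 1 if samples == 1 else 2
    return z_only_fillrate(16, 525, z_ops, samples)  # 16 pipes, 525 MHz: assumed


def g6800u(samples):
    # Per the post: 2 Z ops per pipe per clock in every mode.
    return z_only_fillrate(16, 400, 2, samples)      # 16 pipes, 400 MHz: assumed


if __name__ == "__main__":
    print("       X800XT    G6800U")
    for samples, label in ((1, "NoAA"), (2, "2xAA"), (4, "4xAA")):
        print(f"{label}   {x800xt(samples):6.0f}    {g6800u(samples):6.0f}")
```

Under these assumptions the only mode where the X800XT trails is NoAA, which is exactly where its second Z op goes unused.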
 
Stencil Buffers

A stencil buffer is created and stored together with a depth buffer, so it should not be treated any differently from a depth buffer. Make sure to clear stencil buffer together with a depth buffer and avoid partial surface clears. Check section on HYPER Z to see how stencil operations can affect graphics performance.
what's the reason?
 
Hyp-X said:
Multisampling requires multiple Z/stencil operations per pixel.
Both nVidia and ATI cards (NV30+ & R300+) can handle 2 Z operations per clock per pixel pipeline.
The difference is that ATI cards can only use 1 Z operation per pixel pipeline when there's no AA.

Theoretical fillrate numbers with Z/stencil ops only (Mpixels/s):
Code:
       X800XT    G6800U
NoAA   8400      12800 
2xAA   8400       6400
4xAA   4200       3200

Those AA + Z numbers just don't seem to be reflected (yet?) in synthetic applications so far, with or w/o MSAA, even though the NV40 has a significantly lower clockspeed than the R420.
 
Elminster said:
Stencil Buffers

A stencil buffer is created and stored together with a depth buffer, so it should not be treated any differently from a depth buffer. Make sure to clear stencil buffer together with a depth buffer and avoid partial surface clears. Check section on HYPER Z to see how stencil operations can affect graphics performance.
what's the reason?
Well, if you decide to keep stencil values during a clear, then you can't just say "OK, the z-buffer is cleared now", since the z-buffer is the z-buffer AND the stencil buffer in one large buffer. Kinda like only clearing red and green, but leaving blue intact.
 
christoph said:
Whoops, I forgot about that, even though I participated in it--sorry. Still, IIRC, the relevance of front-to-back rendering performance WRT 3DM03 GT2/3 was never made fully clear to me. I'm re-reading the thread as I type.

Bjorn, is it possible to disable AA per pass? Will the ultimate output be affected by disabling AA during the "zixel" pass?

991060, are you sure those nV results aren't because of the lower bandwidth requirements of zixel ops, sort of like running 3DM fillrate tests in 16-bit mode? MikeC's zixel numbers don't seem to exceed the 6800U's theoretical ones as posted by Hyp-X (shown after the slash below), though I don't know how to calculate theoretical color values with MSAA:

No AA:
Z Fill : 11790.19 M-Pixel/s /12800
Color + Z Fill : 4326.005 M-Pixel/s

2X AA:
Z Fill : 6173.177 M-Pixel/s /6400
Color + Z Fill : 4512.232 M-Pixel/s

4X AA:
Z Fill : 3178.444 M-Pixel/s /3200
Color + Z Fill : 2370.62 M-Pixel/s

Ailuros, I wonder if the memory controller is the culprit to that extent? Given the theoretical differences, the benchmarked ones seem overly large. F-B rendering may be the clue to the puzzle, though it still eludes me.
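
As a quick check on the numbers quoted above, here is a small sketch that compares the measured Z fill against the theoretical figures and shows how Color + Z fill changes with AA. The values are copied from the post; the dictionary and function names are only illustrative.

```python
# Measured 6800U results and theoretical Z-only figures quoted in the post
# (Mpixels/s), keyed by AA sample count.
measured_z = {1: 11790.19, 2: 6173.177, 4: 3178.444}
theoretical_z = {1: 12800, 2: 6400, 4: 3200}
measured_color_z = {1: 4326.005, 2: 4512.232, 4: 2370.62}


def z_efficiency(samples):
    """Fraction of the theoretical Z-only fillrate actually reached."""
    return measured_z[samples] / theoretical_z[samples]


def color_z_vs_no_aa(samples):
    """Color + Z fillrate relative to the no-AA result."""
    return measured_color_z[samples] / measured_color_z[1]


if __name__ == "__main__":
    for samples in (1, 2, 4):
        print(f"{samples}x: Z fill reaches {z_efficiency(samples):.1%} of theoretical, "
              f"Color + Z is {color_z_vs_no_aa(samples):.2f}x the no-AA rate")
```

The Z fill stays below the theoretical ceiling in every mode, while Color + Z does not drop at all going from no AA to 2X.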
 
No AA:
Z Fill : 11790.19 M-Pixel/s /12800
Color + Z Fill : 4326.005 M-Pixel/s

2X AA:
Z Fill : 6173.177 M-Pixel/s /6400
Color + Z Fill : 4512.232 M-Pixel/s

4X AA:
Z Fill : 3178.444 M-Pixel/s /3200
Color + Z Fill : 2370.62 M-Pixel/s


I have a layman's question: does that mean 2X AA is practically free on NV40?
 
Yep, that's the same question I had. I think there are other factors at play with multi-sample AA (apparently it's not a straightforward doubling of work) that make theoretical pixel fillrate harder to determine than theoretical zixel fillrate: memory bandwidth (not enough to support the theoretical GPU numbers) and color compression (which may account for the better relative performance).

But NV30 was also described as offering "free" 2x AA, so maybe there are architectural tweaks beyond what I mentioned that make 2x AA more efficient (in synthetic benchmarks) than none at all, if not necessarily faster (in real-world usage).
 
Elminster said:
Stencil Buffers

A stencil buffer is created and stored together with a depth buffer, so it should not be treated any differently from a depth buffer. Make sure to clear stencil buffer together with a depth buffer and avoid partial surface clears. Check section on HYPER Z to see how stencil operations can affect graphics performance.
what's the reason?
To expand a little on MDolenc's response: depth buffers are typically split between 24 bits of Z and 8 bits of stencil.

Also, in response to phenix, 2X AA is free on recent GeForce and Radeon cards if memory bandwidth is not the bottleneck.
 
I think the answer to your question is in Wavey's NV40 and R420 (p-)reviews under the ROP category.
 
Elminster said:
Stencil Buffers

A stencil buffer is created and stored together with a depth buffer, so it should not be treated any differently from a depth buffer. Make sure to clear stencil buffer together with a depth buffer and avoid partial surface clears. Check section on HYPER Z to see how stencil operations can affect graphics performance.
what's the reason?

Many reasons:
- clearing Z only when you've got interleaved Z and Stencil is like doing blending. Need a read then modify then write.
- the one cited above that you can't just consider it "cleared" if it's only partially cleared.
- probably some other internal representation problems like hierarchical Z and early rejection stuff.
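
A minimal sketch of the first point, assuming a 32-bit word with Z in the upper 24 bits and stencil in the lower 8. The exact bit layout and all names here are assumptions for illustration; real hardware may lay the buffer out differently.

```python
Z_BITS, S_BITS = 24, 8
Z_MASK = (1 << Z_BITS) - 1
S_MASK = (1 << S_BITS) - 1


def pack(z, s):
    """Pack a 24-bit depth value and an 8-bit stencil value into one word."""
    return ((z & Z_MASK) << S_BITS) | (s & S_MASK)


def unpack(word):
    """Return (z, stencil) from a packed word."""
    return word >> S_BITS, word & S_MASK


def clear_both(buf, z=Z_MASK, s=0):
    """Full clear: a pure write, the old contents never need to be read."""
    word = pack(z, s)
    for i in range(len(buf)):
        buf[i] = word


def clear_z_only(buf, z=Z_MASK):
    """Partial clear: each word must be read, modified, then written back."""
    for i in range(len(buf)):
        old_stencil = buf[i] & S_MASK   # read
        buf[i] = pack(z, old_stencil)   # modify + write


if __name__ == "__main__":
    buf = [pack(100, 5), pack(200, 7)]
    clear_z_only(buf)
    print([unpack(w) for w in buf])     # stencil values survive
    clear_both(buf)
    print([unpack(w) for w in buf])     # everything reset
```

The Z-only path touches every word twice (a read and a write), whereas the combined clear is write-only, which is also what lets a whole surface simply be flagged as "cleared".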
 
Ailuros said:
I think the answer to your question is in Wavey's NV40 and R420 (p-)reviews under the ROP category.
... and I overlook something I've read yet again. I'm batting 1,000! :oops:
This also signifies that NV4x is only capable of 2 FSAA Multi-Sample samples per clock cycle, and indeed David Kirk confirmed this to be the case - as it has, in fact, been since NV20. To achieve 4X Multi-Sampling FSAA a second loop must be completed through the ROP over two cycles – memory bandwidth makes it prohibitive to output more samples in a cycle anyway.
 
phenix said:
No AA:
Z Fill : 11790.19 M-Pixel/s /12800
Color + Z Fill : 4326.005 M-Pixel/s

2X AA:
Z Fill : 6173.177 M-Pixel/s /6400
Color + Z Fill : 4512.232 M-Pixel/s

4X AA:
Z Fill : 3178.444 M-Pixel/s /3200
Color + Z Fill : 2370.62 M-Pixel/s
I have a layman's question: does that mean 2X AA is practically free on NV40?
No. You can see the Z-only pixel rate is halved when going to 2x AA.

-FUDie
 
Memory bandwidth aside, I think NV40's ROP in each pipeline can do:
no AA: 1 color(+ 1 depth)ops or 2 depth ops/clock
2x AA: 2 color(+ 2 depth)ops or 2 depth ops/clock
4x AA: 2 color(+ 2 depth)ops or 2 depth ops/clock

Only in no-AA mode can the ROP run depth ops faster than color ops; that's why the document says "double speed" is only achievable in no-AA mode. Note that the raw speed of the ROP running depth ops doesn't change between modes.

And for R420, its ROP can do:
no AA: 1 color(+1 depth) ops or 1 depth ops/clock
2x AA: 1 color(+1 depth)(?) ops or 2 depth ops/clock
4x AA: 2 color(+2 depth) ops or 2 depth ops/clock
6x AA: 2 color(+2 depth) ops or 2 depth ops/clock

the 2x AA number may not be very accurate though.
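
To make the "double speed" argument concrete, here is a small sketch that turns the per-ROP figures guessed above into a Z-only speedup over color rendering. The op counts are this post's estimates (the R420 2x entry is explicitly uncertain), and the table and function names are illustrative only.

```python
# (color ops/clock, depth ops/clock) per ROP, keyed by AA sample count.
# These are the estimates from the post above, not vendor figures.
NV40 = {1: (1, 2), 2: (2, 2), 4: (2, 2)}
R420 = {1: (1, 1), 2: (1, 2), 4: (2, 2), 6: (2, 2)}


def z_only_speedup(rop, samples):
    """How much faster Z-only rendering is than color + Z, in pixels/clock.

    Each pixel needs `samples` ops, so pixels/clock = ops/clock / samples.
    """
    color_ops, depth_ops = rop[samples]
    color_pixels_per_clk = color_ops / samples
    z_pixels_per_clk = depth_ops / samples
    return z_pixels_per_clk / color_pixels_per_clk


if __name__ == "__main__":
    for name, rop in (("NV40", NV40), ("R420", R420)):
        for samples in sorted(rop):
            print(f"{name} {samples}x: Z-only is "
                  f"{z_only_speedup(rop, samples):.1f}x color rate")
```

With these estimates the 2x speedup shows up only without AA on NV40 and only at 2x AA on R420, matching the observation that the two vendors get their "double" rate in different modes.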
 
It's been about a year since I played with GL_MULTISAMPLE on both NVidia and ATi cards, but at the time you could toggle it mid-pass on a GF4, but not on a 9700 Pro.

I think I was trying to disable it for text.
 
991060 said:
Memory bandwidth aside, I think NV40's ROP in each pipeline can do:
no AA: 1 color(+ 1 depth)ops or 2 depth ops/clock
2x AA: 2 color(+ 2 depth)ops or 2 depth ops/clock
4x AA: 2 color(+ 2 depth)ops or 2 depth ops/clock

Only in no-AA mode can the ROP run depth ops faster than color ops; that's why the document says "double speed" is only achievable in no-AA mode. Note that the raw speed of the ROP running depth ops doesn't change between modes.

And for R420, its ROP can do:
no AA: 1 color(+1 depth) ops or 1 depth ops/clock
2x AA: 1 color(+1 depth)(?) ops or 2 depth ops/clock
4x AA: 2 color(+2 depth) ops or 2 depth ops/clock
6x AA: 2 color(+2 depth) ops or 2 depth ops/clock

the 2x AA number may not be very accurate though.
These don't seem correct to me. Your comments seem correct though. I think Nvidia and ATi are identical for 2x and 4x AA modes. Also, where did you get your color op numbers from? Shouldn't they all be 1 color op?
 
R420 is able to do up to 2 Z/Stencil compares per cycle per pixel, in 2xAA or above, regardless of the color operations.

R420 can perform as many color operations as there are fragments in flight (that's the whole point of MSAA), so it can do up to "6 fragments" per pixel per cycle in 6xAA, or 2 frags in 2xAA.
 