R350/NV35 Z Fillrate with FSAA

Dave Baumann

Here are some interesting numbers using Marko's fillrate tester application.

First up, NV35:

Code:
                              1X       2X       4X       4xS  
FFP - Pure fillrate         1772.39  1743.24  1686.74   878.64
FFP - Z pixel rate          3373.76  3148.66  1652.25  1571.46
FFP - Single texture        1666.31  1203.08  1097.91   734.83
FFP - Dual texture          1428.69  1110.55  1020.72   705.65
FFP - Triple texture         755.90   683.86   628.10   365.22
FFP - Quad texture           516.99   481.82   452.06   260.82
PS 1.1 - Simple              892.08   885.75   878.28   443.93
PS 1.4 - Simple              839.77   833.81   827.33   417.83
PS 2.0 - Simple              339.45   338.20   337.08   169.32
PS 2.0 PP - Simple           339.45   338.26   337.15   169.32
PS 2.0 - Longer              154.27   153.89   153.65    77.01
PS 2.0 PP - Longer           188.71   188.20   187.86    94.19
PS 2.0 - Longer 4 Reg        148.66   148.31   148.11    74.22
PS 2.0 PP - Longer 4 Reg     212.22   211.64   211.28   105.94
PS 2.0 - Per Pixel Light      68.99    68.23    67.05    45.14
PS 2.0 PP - Per Pixel Light   88.84    88.24    86.63    44.22

Note the Z fillrates - without FSAA the Z rate is roughly twice the pixel fillrate, as we would expect from their optimised Z/Stencil pipeline. The same is the case with 2X FSAA. With 4X FSAA enabled the Z sampling rate is consistent with the pixel fillrate, indicating that all the Z sampling units are being utilised. With 4xS (2X MSAA + 2X SSAA) the Z fillrate is still (roughly) at the standard pixel fillrate, whereas the actual pixel fillrate is halved because supersampling is being used.

Contrast that with R350:

Code:
                              1X       2X       4X       6X
FFP - Pure fillrate         2742.52  1844.94  1652.42  1379.90
FFP - Z pixel rate          2536.93  2401.23  1449.13   742.06
FFP - Single texture        2605.49  1631.09  1434.26  1231.84
FFP - Dual texture          1365.43  1158.88  1148.74  1115.08
FFP - Triple texture         734.46   688.75   685.56   690.39
FFP - Quad texture           598.83   580.48   558.93   563.07
PS 1.1 - Simple             1490.28  1468.88  1452.69  1368.10
PS 1.4 - Simple             1490.28  1468.89  1452.98  1367.33
PS 2.0 - Simple             1490.29  1468.89  1452.63  1367.37
PS 2.0 PP - Simple          1490.30  1468.89  1452.76  1366.10
PS 2.0 - Longer              749.96   744.26   740.97   738.57
PS 2.0 PP - Longer           749.96   744.26   740.97   738.56
PS 2.0 - Longer 4 Reg        749.96   744.25   740.95   738.57
PS 2.0 PP - Longer 4 Reg     749.96   744.25   741.01   738.56
PS 2.0 - Per Pixel Light     111.58   111.28   111.08   110.92
PS 2.0 PP - Per Pixel Light  111.58   111.28   111.08   110.92

Note that at 2X FSAA the Z fillrate is still at the standard pixel fillrate; however, at 4X FSAA it drops to roughly half the theoretical fillrate of the 9800 PRO, and at 6X FSAA it drops to a quarter.
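For anyone who wants to check the ratios rather than eyeball them, here is a quick sketch in Python that works from the Z and pure fillrate rows of the two tables. The 3040 Mpixels/s theoretical figure for the 9800 PRO is my assumption (8 pipes x 380 MHz) and is not stated in the post; the function names are made up for illustration.

```python
# Measured rates copied from the "Pure fillrate" and "Z pixel rate" rows above.
nv35 = {
    "pixel": {"1X": 1772.39, "2X": 1743.24, "4X": 1686.74, "4xS": 878.64},
    "z":     {"1X": 3373.76, "2X": 3148.66, "4X": 1652.25, "4xS": 1571.46},
}
r350 = {
    "pixel": {"1X": 2742.52, "2X": 1844.94, "4X": 1652.42, "6X": 1379.90},
    "z":     {"1X": 2536.93, "2X": 2401.23, "4X": 1449.13, "6X": 742.06},
}

# ASSUMPTION (not from the post): 9800 PRO peak taken as 8 pipes x 380 MHz.
R350_THEORETICAL = 8 * 380.0


def z_to_pixel_ratio(chip):
    """Z rate divided by pure pixel fillrate, per AA mode."""
    return {mode: round(chip["z"][mode] / chip["pixel"][mode], 2)
            for mode in chip["z"]}


def z_vs_theoretical(chip, theoretical):
    """Z rate as a fraction of an assumed theoretical pixel fillrate."""
    return {mode: round(rate / theoretical, 2)
            for mode, rate in chip["z"].items()}


if __name__ == "__main__":
    print("NV35 Z/pixel:", z_to_pixel_ratio(nv35))
    print("R350 Z/theoretical:", z_vs_theoretical(r350, R350_THEORETICAL))
```

On NV35 the Z/pixel ratio comes out close to 2 at 1X and 2X and close to 1 at 4X; on R350 the Z rate lands near a half of the assumed peak at 4X and near a quarter at 6X.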
 
A little help for the less educated here? :) What's the significance of this? Is AA impinging on memory bandwidth or the physical "Z units" (for lack of a better term/understanding)? Or is this related to ATi's HyperZ efficiency? Or none of the above? :)
 
Some of your results are strange. I mean that I don't understand them.


Note the Z fillrates - without FSAA there is twice the pixel fillrate, as we would expect from their optimised Z/Stencil pipeline.

In fact the Z fillrate is always twice the pixel fillrate, since the FX5800/5900 has twice as many Z units as pixel pipelines.


With 4X FSAA enabled the Z sampling rate is consistent with the fillrate, indicating that all the Z sampling units are being utilised.

I think you've made an error. I would say: 2X FSAA is fillrate-free, indicating that all the Z sampling units are being utilised?

Each additional sample needs just one Z operation, so the 4 extra Z units are always used with FSAA on.


Your results seem to show that R350 and NV35 have 16 Z units. It should be 8. I don't understand :( Maybe the Z-test is not done on all samples?
 
Damien, what I've always thought for NV30/NV35, and these results bear it out in some sense, is that the optimised Z/Stencil pipeline uses the MSAA pipeline, i.e. it seems that for NV30/35 there are 4 Z units per pixel pipeline, and these have been optimised such that under normal rendering conditions (without AA) two of these can be used for Z and two for stencil. When 4X FSAA is applied the optimised Z/Stencil pipeline is lost.

With R3x0 the results are showing something different. Because it has a 6X MSAA option the assumption was that it had 6 Z samplers per pipe - this doesn't appear to be the case. What does appear to be happening is that there are 2 Z units per pipe for MSAA, but if 2X FSAA is exceeded then it goes through multiple cycles for the other MSAA samples - at 4X FSAA the Z rendering is cut in half because it's now taking two cycles per pixel.
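Written out as arithmetic, the multi-cycle reading looks something like the sketch below. It assumes, purely for illustration, that each pipe has a fixed number of Z units and loops until every MSAA sample of a pixel has been tested; the function names are hypothetical and this is not a description of the actual hardware.

```python
import math


def z_cycles_per_pixel(samples, z_units_per_pipe):
    """Cycles a pipe needs to Z-test all MSAA samples of one pixel,
    assuming it loops over the samples z_units_per_pipe at a time."""
    return math.ceil(samples / z_units_per_pipe)


def fraction_of_peak_z_rate(samples, z_units_per_pipe):
    """Z pixel rate relative to the one-cycle-per-pixel case."""
    return 1.0 / z_cycles_per_pixel(samples, z_units_per_pipe)


if __name__ == "__main__":
    for samples in (1, 2, 4, 6):
        print(samples, "samples:",
              z_cycles_per_pixel(samples, 2), "cycle(s) with 2 Z units per pipe")
```

With 2 Z units per pipe this gives one cycle up to 2X and two cycles at 4X, which matches the halving in the table. At 6X it predicts three cycles, i.e. a third of the peak rate, whereas the measured 6X figure is closer to a quarter, so the simple model is only a rough fit there.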

Furthering that, you can see that this mechanism isn't harming R350's shader fillrate, so the Z units appear to be decoupled from the rest of the pipeline - in long shader type situations (or even with trilinear enabled!) 4X FSAA can still be produced with no loss of performance, but without the extra silicon demands of 2 or 3 times the Z sampling units. It might also mean that R3x0 has an arbitrarily high number of samples that can be taken per pixel, but is just limited to 6X for performance reasons.
 
DaveBaumann said:
Damien, what I've always thought for NV30/NV35, and these results bear it out in some sense, is that the optimised Z/Stencil pipeline uses the MSAA pipeline, i.e. it seems that for NV30/35 there are 4 Z units per pixel pipeline, and these have been optimised such that under normal rendering conditions (without AA) two of these can be used for Z and two for stencil.
I've always thought it was 2 Z units per pixel pipeline. But you're right, 4 makes sense.

DaveBaumann said:
When 4X FSAA is applied the optimised Z/Stencil pipeline is lost.
It makes sense too. So the NV35 has 16 Z pipelines and 4 pixel pipelines.

DaveBaumann said:
With R3x0 the results are showing something different. Because it has a 6X MSAA option the assumption was that it had 6 Z samplers per pipe - this doesn't appear to be the case. What does appear to be happening is that there are 2 Z units per pipe for MSAA, but if 2X FSAA is exceeded then it goes through multiple cycles for the other MSAA samples - at 4X FSAA the Z rendering is cut in half because it's now taking two cycles per pixel.
I thought the R3x0 had 1 Z unit per pipe. But 2 Z units would explain your results (and mine). So R300 has 16 Z pipelines and 8 pixel pipelines.

DaveBaumann said:
Furthering that, you can see that this mechanism isn't harming R350's shader fillrate, so the Z units appear to be decoupled from the rest of the pipeline - in long shader type situations (or even with trilinear enabled!) 4X FSAA can still be produced with no loss of performance, but without the extra silicon demands of 2 or 3 times the Z sampling units. It might also mean that R3x0 has an arbitrarily high number of samples that can be taken per pixel, but is just limited to 6X for performance reasons.

Here I don't fully agree with you. I mean that there is another possible explanation. As shader complexity increases, the influence of MSAA decreases and tends towards nil. That's why MSAA is not harming shader fillrate. This is also the case with one Z unit 'per pipeline'. I.e. if a shader needs 30 passes through the pipeline (-> 30 instructions), the extra samples each need just 1 more pass through the pipeline.
-> with 8 Z and 8 pixel : 30 cycles without AA, 31 cycles with 2X MSAA, 33 cycles with 4X MSAA (and maybe 30 cycles with 2X and 4X MSAA if Z pipelines are decoupled)
-> with 16 Z and 8 pixel : 30 cycles without AA, 30 cycles with 2X MSAA, 31 cycles with 4X MSAA (and maybe 30 cycles with 4X MSAA if Z pipelines are decoupled)
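The cycle counts in the two lines above can be reproduced with a small model. This is only a sketch of the reasoning in this post, under the assumption that a pipe Z-tests its samples in batches of "Z units per pipe", that the first batch rides along with the normal pass, and that a decoupled Z stage fully overlaps with shading; the function name and parameters are made up.

```python
import math


def total_cycles(shader_cycles, samples, z_units_per_pipe, decoupled=False):
    """Cycles per pixel for a shader of the given length with MSAA."""
    z_passes = math.ceil(samples / z_units_per_pipe)
    if decoupled:
        # Z work overlaps with shading of the next pixels,
        # so the slower of the two stages sets the pace.
        return max(shader_cycles, z_passes)
    # Coupled: the first Z pass is part of the normal pass,
    # every further Z pass costs one extra cycle.
    return shader_cycles + (z_passes - 1)


if __name__ == "__main__":
    for z_units, label in ((1, "8 Z / 8 pixel"), (2, "16 Z / 8 pixel")):
        cycles = [total_cycles(30, s, z_units) for s in (1, 2, 4)]
        print(label, "-> no AA, 2X, 4X:", cycles)
```

With a 30-instruction shader the coupled case gives 30/31/33 cycles for one Z unit per pipe and 30/30/31 for two, and the decoupled case stays at 30 throughout, so the relative cost of MSAA shrinks as the shader gets longer either way.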

Your results seem to say that the Z units are decoupled, but if I remember correctly, I have results that seem to say they are not. Maybe they are decoupled with fixed function and not with pixel shaders?
 
Dave,

What's the case when stencil ops and MSAA get combined? NV3x seems to lose quite a bit less performance than the R3xx in such cases.
 
It makes sense too. So the NV35 has 16 Z pipelines and 4 pixel pipelines.

I'd say 16 ROPs. And I'm not so sure the number of pixel pipes has any relevance to the first number.
 
Ailuros said:
Dave,

What's the case when stencil ops and MSAA get combined? NV3x seems to lose quite a bit less performance than the R3xx in such cases.

Why?

NV3x : 8 z ops and 8 stencil ops (without AA and/or with stencil) or 16 z ops (with MSAA and without stencil)
R3x0 : 8 z ops and 8 stencil ops (without AA and/or with stencil) or 16 z ops (with MSAA and without stencil)

No difference in theory ;)

But it could be interesting to check this.
 
With that oversimplification it should be theoretically the same, but it doesn't seem to be in reality. Besides, who says that both chips operate in the exact same way under conditions like these?

***edit:

Fablemark (which is fillrate limited) 1024*768

R300:

0xAA -->2xAA = -23,6%
0xAA -->4xAA = -53,2%
0xAA -->6xAA = -69,4%

NV35:

0xAA -->2xAA = -31,2%
0xAA -->4xAA = -40,5%

(warning: tests were not conducted on the same system)
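To see where each chip loses its performance, the cumulative drops above can be turned into step-by-step drops between adjacent AA modes. A quick sketch, using only the percentages quoted in this post (the helper name is made up):

```python
# Cumulative percentage drops relative to no AA, copied from the post above.
drops = {
    "R300": {"2x": 23.6, "4x": 53.2, "6x": 69.4},
    "NV35": {"2x": 31.2, "4x": 40.5},
}


def step_drop(chip, frm, to):
    """Percentage of performance lost going from AA mode `frm` to `to`."""
    remaining_from = 100.0 - drops[chip][frm]
    remaining_to = 100.0 - drops[chip][to]
    return round(100.0 * (1.0 - remaining_to / remaining_from), 1)


if __name__ == "__main__":
    print("R300 2x -> 4x:", step_drop("R300", "2x", "4x"), "%")
    print("NV35 2x -> 4x:", step_drop("NV35", "2x", "4x"), "%")
```

Going from 2x to 4x, R300 gives up a bit under 39% of its 2x performance while NV35 gives up about 13.5%, which is where the gap between the two mostly opens up. The same caveat applies: the two sets of numbers come from different systems.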
 
Ailuros said:
With that oversimplification it should be theoretically the same, but it doesn't seem to be in reality. Besides, who says that both chips operate in the exact same way under conditions like these?

Nobody... it's just some quick speculations at 3:00 AM :p

But I can't see why NV3x would lose less?
 
DaveBaumann said:
Furthering that, you can see that this mechanism isn't harming R350's shader fillrate, so the Z units appear to be decoupled from the rest of the pipeline - in long shader type situations (or even with trilinear enabled!) 4X FSAA can still be produced with no loss of performance, but without the extra silicon demands of 2 or 3 times the Z sampling units. It might also mean that R3x0 has an arbitrarily high number of samples that can be taken per pixel, but is just limited to 6X for performance reasons.

.....!!!!!!!!!!

This is basically what 3dfx intended for Rampage's AA! Except that it relied on texture load to stall the pipeline while the Z units resampled.
 
Tridam said:
Ailuros said:
With that oversimplification it should be theoretically the same, but it doesn't seem to be in reality. Besides, who says that both chips operate in the exact same way under conditions like these?

Nobody... it's just some quick speculations at 3:00 AM :p

But I can't see why NV3x would lose less?
R300 can't burst its stencil buffer access (when depth is masked off). NV30 can, I believe even NV20 can.
An appropriate benchmark for this kind of stuff is ... er ... in the works :p
 
Tagrineth said:
.....!!!!!!!!!!

This is basically what 3dfx intended for Rampage's AA! Except that it relied on texture load to stall the pipeline while the Z units resampled.

If it's any relief, the very same idea is present (with some evolutionary changes I believe) in NV3x too.
 
zeckensack said:
R300 can't burst its stencil buffer access (when depth is masked off). NV30 can, I believe even NV20 can.
An appropriate benchmark for this kind of stuff is ... er ... in the works :p
This was one of the first things I remembered when checking out the Radeon 9700 Pro. Neverwinter Nights and Tenebrae both seemed to perform more poorly than I would have expected with FSAA enabled, compared to my GeForce4 (doesn't mean that it performed worse in these games than the GF4, just not anywhere close to as much better as it did in other games).
 
Ailuros said:
I'd say 16 ROPs. And I'm not so sure the number of pixel pipes has any relevance to the first number.

I'd also say 16 ROPs (or 4 ROPs which handle up to 4 z-samples), same as NV2x. IMO it's both the ROPs and the samplers that limit the z/4x AA case here. NV3x can neither sample more than 16 z-(sub)samples per clock nor write more than 16 z-(sub)samples per clock.

So the enhancement from NV2x -> NV30/35 seems pretty small to me. But the question remains - what exactly did they enhance, and why isn't it possible to write 16 z-samples in 0xAA mode?
 
A few things, some of them IMO, some of them AFAIK:

- NVIDIA's 4 "empty" pipelines trick for MSAA was invented by ex-3DFX employees AFAIK.

- The trick Rampage was to use is certainly going to be used more and more in the future. I expect that in two generations' time (NV50 and R600), all parts will support 16x MSAA+ with a maximum of 4 units per pipeline, maybe even only 1 for mainstream parts. The R300 already using 6x MSAA with 2 units is certainly a sign of the times, and good proof that ATI got some Rampage engineers in the process, as was rumored before.

- Considering the heavy focus on FSAA for the NV40, it is nearly impossible for NVIDIA to keep their "double Z/Stencil" system. This NV40 will most likely not be capable of 16 zixels. Expect a focus on at least 8x MSAA considering they got 40GB+ of bandwidth and new algorithms. I expect nearly twice the AA performance of the NV35.

- I have not yet had the opportunity to verify 16x MSAA on the NV30GL and NV35GL. I wish I had one of these, hehe ;) I wouldn't be surprised if only the GL core had the ability to loop back like that, like the Rampage/R300, but I frankly have no idea about it. Although:
http://www.nvnews.net/vbulletin/showthread.php?s=&threadid=15327

Those performance numbers certainly make me think 8x MSAA does not have half the fillrate of 4x MSAA, considering it's still at 66% of the performance. Although, as he says, his config is atypical - more serious investigation is certainly required on that front.

-Dave:
It might also mean that R3x0 has an arbitarily high number of samples that can be taken per pixel, but is just limited to 6X for performance reasons.
I'd bet on the fact that they could easily expand it for a transistor fee, but that the R3xx can't support more than 6x MSAA. Remember their AA algorithm is quite complex - all positions of all samples are 100% programmable through drivers. Crazy stuff, that was certainly an amazing design choice! So I think the limit here is the number of samples that can be stored & used.

- BTW, I say we all call this the "Rampage Trick" just to make ATI and nVidia angry like hell at us attributing that thing to 3DFX, hehe ;) I know I will! :)


Uttar
 
Tridam said:
Here I don't fully agree with you. I mean that there is another possible explanation. As shader complexity increases, the influence of MSAA decreases and tends towards nil. That's why MSAA is not harming shader fillrate. This is also the case with one Z unit 'per pipeline'. I.e. if a shader needs 30 passes through the pipeline (-> 30 instructions), the extra samples each need just 1 more pass through the pipeline.

That's essentially what I was saying; however, in the examples you give my assumption would be that the rest of the pipeline is working on the next set of pixels, while the MSAA samples for the "current" set are worked on at the back end.

Tridam said:
Your results seem to say that the Z units are decoupled, but if I remember correctly, I have results that seem to say they are not. Maybe they are decoupled with fixed function and not with pixel shaders?

Well, this is actually why I looked at this initially, since I've seen some in-game shader tests where ATI's performance seems to vary much more with FSAA, and I wasn't sure if that was just down to bandwidth overhead or not. The tests in the first post are mostly PS tests though, and none of them are showing any apparent variance in performance with FSAA enabled.

Uttar said:
The trick Rampage was to use is certainly going to be used more and more in the future. I expect that in two generations' time (NV50 and R600), all parts will support 16x MSAA+ with a maximum of 4 units per pipeline, maybe even only 1 for mainstream parts.

Rampage was fundamentally different in that the MSAA pipelines were also the pixel pipelines under certain conditions; it was just the case that if MSAA was used then it effectively only had one pixel pipe per chip.

Uttar said:
I'd bet on the fact that they could easily expand it for a transistor fee, but that the R3xx can't support more than 6x MSAA. Remember their AA algorithm is quite complex - all positions of all samples are 100% programmable through drivers. Crazy stuff, that was certainly an amazing design choice! So I think the limit here is the number of samples that can be stored & used.

It's sparse sampling, meaning there is a maximum number of samples within a grid it could do, but the fact that none of the sample positions at 6X are in line with each other indicates that at 6X they are nowhere near reaching that limit. Judging from the sample pattern of 6X I'd guess they could probably go to 8 samples without any sample point being in line, but beyond that they are probably reaching the point of diminishing returns as samples will start to be in line.
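The "no two samples in line" property is easy to check for any given pattern. A minimal sketch follows, reading "in line" as two samples sharing the same row or column of the sub-pixel grid (my interpretation). The two patterns are hypothetical examples on a 4x4 grid and are NOT the actual R3x0 sample positions.

```python
def has_inline_samples(samples):
    """True if any two sample positions share a row or a column."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    return len(set(xs)) < len(xs) or len(set(ys)) < len(ys)


# HYPOTHETICAL patterns for illustration only (grid coordinates as x, y).
ordered_4x = [(0, 0), (1, 0), (0, 1), (1, 1)]  # ordered grid: rows and columns repeat
sparse_4x = [(1, 0), (3, 1), (0, 2), (2, 3)]   # every row and column used once

if __name__ == "__main__":
    print("ordered grid in line:", has_inline_samples(ordered_4x))
    print("sparse grid in line:", has_inline_samples(sparse_4x))
```

On an N x N grid at most N samples can be placed without any two sharing a row or column, which is one way to read the "maximum number of samples within a grid" remark above.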
 
DaveBaumann said:
Uttar said:
The trick Rampage was to use is certainly going to be used more and more in the future. I expect that in two generations' time (NV50 and R600), all parts will support 16x MSAA+ with a maximum of 4 units per pipeline, maybe even only 1 for mainstream parts.

Rampage was fundamentally different in that the MSAA pipelines were also the pixel pipelines under certain conditions; it was just the case that if MSAA was used then it effectively only had one pixel pipe per chip.

Not exactly... but close. It's more like the pixel pipelines were also MSAA pipelines under certain conditions. While looping back to fetch and apply new textures, the TMUs would do their job, and instead of staying idle, the Z checker in each pipe would go back and work some more. Not exactly down to one pixel pipe per chip... in theory you could get near-free 2x AA with 2 texture layers, and pretty-close-to-free 4x AA with 4 texture layers... and possibly also 'performance' AF.

DaveBaumann said:
Uttar said:
I'd bet on the fact that they could easily expand it for a transistor fee, but that the R3xx can't support more than 6x MSAA. Remember their AA algorithm is quite complex - all positions of all samples are 100% programmable through drivers. Crazy stuff, that was certainly an amazing design choice! So I think the limit here is the number of samples that can be stored & used.

It's sparse sampling, meaning there is a maximum number of samples within a grid it could do, but the fact that none of the sample positions at 6X are in line with each other indicates that at 6X they are nowhere near reaching that limit. Judging from the sample pattern of 6X I'd guess they could probably go to 8 samples without any sample point being in line, but beyond that they are probably reaching the point of diminishing returns as samples will start to be in line.

He was referring to on-chip cache storage for the sample positions, not space on a sample grid. You could just expand the sample grid and use a recursive method to determine any number of sparse samples on a really huge grid.
 
Not exactly down to one pixel pipe per chip... in theory you could get near-free 2x AA with 2 texture layers, and pretty-close-to-free 4x AA with 4 texture layers... and possibly also 'performance' AF.

Not to sound picky, but it would be more accurate to say fillrate-free for those conditions.

The only thing we get for free, even on today's accelerators, is bilinear filtering, contrary to what nifty PR yadda yadda will state.
 