R350/NV35 Z Fillrate with FSAA

Ailuros said:
What I meant was something along the lines of, once you get over the hit from multitexturing, the added loss is barely noticeable. Hence "Effectively free with XXX in use".

Fillrate free under conditionals, yet the performance penalty would have still been there. Want me to crank out old 3dfx performance claims about the VSA-100's FSAA? The minute you can define the difference between marketing hype and reality, we might even start to get at least some common ground here.

Actually, Ailuros, I think Tag's right there.
Remember: Rampage was a 4x1, so if you had 2 textures, it took 2 clocks (yet another example of how the NV20 had some serious advantages over Rampage, even if Rampage had some over the NV20 too); that's +100%.
So if you use 2x MSAA, and you assume the 200MHz DDR memory rumor to be roughly accurate, I'd say the hit of 2x MSAA most likely wouldn't be +100% memory bandwidth, because you wouldn't have more texture fetches, and a few other things wouldn't be doubled.
So, you could say that the performance hit for 2x MSAA when using dual texturing, and the real one, not the fillrate hit, is quite small indeed!
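To put rough numbers on the multitexturing part of that argument, here is a minimal sketch. It assumes each TMU applies one texture per clock, so a pipe needs ceil(textures / TMUs) clocks per pixel. The 4x1 layout for Rampage is from the post; the 4x2 layout for NV20 is the commonly cited figure; the function name is made up for illustration.

```python
import math

def clocks_per_pixel(textures: int, tmus_per_pipe: int) -> int:
    """Clocks one pipeline spends on a pixel, assuming each TMU
    applies one texture per clock (a deliberate simplification)."""
    return max(1, math.ceil(textures / tmus_per_pipe))

# Rampage as a 4x1 design: one TMU per pipe.
rampage_single = clocks_per_pixel(textures=1, tmus_per_pipe=1)
rampage_dual = clocks_per_pixel(textures=2, tmus_per_pipe=1)

# NV20 as a 4x2 design (assumed): two TMUs per pipe.
nv20_dual = clocks_per_pixel(textures=2, tmus_per_pipe=2)

# Extra fill cost of the second texture, in percent.
rampage_hit = (rampage_dual - rampage_single) / rampage_single * 100

print(f"Rampage dual-texturing: {rampage_dual} clocks (+{rampage_hit:.0f}%)")
print(f"NV20 dual-texturing:    {nv20_dual} clock(s)")
```

Once a chip is already paying that +100% for the second texture, any additional MSAA cost is measured against a baseline that is already halved, which is the sense in which it looks small.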

Sure, it probably wouldn't be nonexistent. But saying it's 'free' actually makes semi-sense, considering that term probably came from marketing...

Rev: I agree fully with you :) For me, Rampage is an interesting part. I don't care about whether it'd have crushed the NV20 or not, and whether it'd have been released on time, and so on. That's history. Some of the techs in it, even if often outdated or already implemented/modified in current products, are interesting, though.


Uttar
 
Ailuros said:
Only Parhelia and P10 have true 256bit busses. On the other hand do you need an extra analysis for the differences in NV35/R3xx and their memory controllers?

OK, I'm game.

R300 uses two 128-bit busses, and NV3x uses four 64-bit busses to achieve 256-bit-ness.

NV2x uses four 32-bit busses, by the way. 8)

Textures are actually repeated less than you may expect. SLI evolved beyond the old even, odd scanline V2 days... even V5 is able to use bands up to 64 pixels thick, and Rampage increased that to 128. Textures would in fact very often fall within those bands, allowing them to be stored only in one 'repository'.
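A rough sketch of the band idea, for illustration only: it assumes the screen is cut into horizontal bands of a fixed height that alternate round-robin between chips, which is a guess at the scheme and not a description of the real hardware. The function names are hypothetical.

```python
def chip_for_scanline(y: int, band_height: int, num_chips: int) -> int:
    """Chip that owns scanline y, assuming bands alternate round-robin."""
    return (y // band_height) % num_chips

def chips_touched(y_top: int, y_bottom: int, band_height: int, num_chips: int) -> set:
    """Chips that must hold a texture used on scanlines y_top..y_bottom."""
    first_band = y_top // band_height
    last_band = y_bottom // band_height
    return {band % num_chips for band in range(first_band, last_band + 1)}

# A surface covering scanlines 10..50 on a two-chip board:
print(chips_touched(10, 50, band_height=1, num_chips=2))   # V2-style even/odd lines
print(chips_touched(10, 50, band_height=64, num_chips=2))  # 64-pixel bands
```

With single-scanline interleaving almost every texture is needed by both chips; with thick bands a small surface often stays inside one band and so only needs to live in one chip's memory.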

Sure who would even dare to doubt you guys in the first place? Did you get an idea in the meantime what Fearless stood for?

Nobody I know has any idea what you're talking about. 8)

Fillrate free under conditionals, yet the performance penalty would have still been there. Want me to crank out old 3dfx performance claims about the VSA-100's FSAA? The minute you can define the difference between marketing hype and reality, we might even start to get at least some common ground here.

See also Uttar's post, probably right above this one.

GP-1 was a low-specced TBDR. Again, TBDRs handle multisampling with no fillrate and bandwidth costs, and those graphs actually show that. Would you guess a minimal performance penalty in that case? No. Now give me a fair explanation why Spectre would have been more efficient with MSAA on than a TBDR. Yes, that thing has only a 100MHz clockspeed, but noAA performance should still be significantly higher than 26fps, don't you think? I'm talking about pure percentages here.

Again, see Uttar's post. Plus you said yourself that GP-1 is low-specced - it's something like one pixel pipeline at 100MHz, sorta like Neon-250. And whaddya know, it's actually a bit faster than N250. Cool!

Spectre would have had a performance penalty with MSAA on, period.

See Uttar's post 8)

On its own, though, yes, Spectre would've taken a pretty good hit with MSAA...

Whether or not it was a joke, and sure, it probably was, it COULD WORK in CONTROLLED conditions.

No serious IHV would use a flawed software HSR implementation officially in its drivers. The thing was hidden in the last official driver sets.

No serious IHV would ever use a flawed, obviously bad quality AF implementation officially in its drivers... nor would they not allow you to use trilinear filtering no matter how hard you try... OH, WAIT! BUT THAT IS DONE! :eek:

The visual error can be totally unnoticeable if you control the settings carefully enough. People on their Voodoo5's have gained 20-30 fps (up FROM ~40fps) with careful tweaking of the unlocked HSR settings. What makes you think that would be impossible to improve with some professional tweaking from people who actually KNOW what the settings mean, rather than blindly experimenting?
 
Tagrineth said:
No serious IHV would use a flawed software HSR implementation officially in its drivers. The thing was hidden in the last official driver sets.
No serious IHV would ever use a flawed, obviously bad quality AF implementation officially in its drivers... nor would they not allow you to use trilinear filtering no matter how hard you try... OH, WAIT! BUT THAT IS DONE! :eek:
A flawed HSR technique can result in much more massive problems than a flawed anisotropic filtering technique. Specifically, if the algorithm screws up on the HSR, triangles will go a-missing.
 
Textures are actually repeated less than you may expect. SLI evolved beyond the old even, odd scanline V2 days... even V5 is able to use bands up to 64 pixels thick, and Rampage increased that to 128. Textures would in fact very often fall within those bands, allowing them to be stored only in one 'repository'.

Triangle setup is still not completely shared between chips.

Nobody I know has any idea what you're talking about.

Typical answer I'd expect from any of the ex-3dhq wizards.

See also Uttar's post, probably right above this one.

What about it? NV20 had the same multitexturing fillrate as dual chip Spectre and the only other significant advantage over it would have been the filter on scanout trick, which by the way works under conditionals too even on NV25. It saves bandwidth only AFAIK when the average performance rate doesn't drop below 2/3rd of the resolution refresh rate.

Rampage made it barely into OGL debug mode, so other than theory there's no real performance numbers to refer to (even off the record), let alone a Rampage+Sage combination.

The peak numbers touted from either/or was Sage 50MVPS vs 40MVPS on NV20. Where each of the two would have been able to sustain a higher number in reality is open to everyone's imagination, yet the NV20's "texture shader units" weren't particularly weak either, rather the contrary.

Again, see Uttar's post. Plus you said yourself that GP-1 is low-specced - it's something like one pixel pipeline at 100MHz, sorta like Neon-250. And whaddya know, it's actually a bit faster than N250. Cool!

Series2 used Supersampling. Go and recheck your facts on up to which resolution SSAA was possible on Series2 and 3 (locked in drivers), and maybe it might ring a bell why SSAA can only be bandwidth free on a TBDR. Again, MSAA is fillrate and bandwidth free on a TBDR, else explain how the hell it managed to even get a decent framerate for its specs in 1280*1024 with AA on with a meager 200MTexels/sec fillrate.

In terms of Supersampling how many fps did the K2 yield with 4xSSAA on in the same game in 32bpp and how many a dual chip V5? 350MTexels/sec for the K2 and 667MTexels/sec for the V5.


No serious IHV would ever use a flawed, obviously bad quality AF implementation officially in its drivers... nor would they not allow you to use trilinear filtering no matter how hard you try... OH, WAIT! BUT THAT IS DONE!

3d-FX ---> GF-FX :p

Seriously though, NV3x have different anisotropic filtering modes, where it's left to the user's choice which mode he prefers, instead of being restricted to a mediocre solution that operates only in conjunction with MSAA.

The visual error can be totally unnoticeable if you control the settings carefully enough. People on their Voodoo5's have gained 20-30 fps (up FROM ~40fps) with careful tweaking of the unlocked HSR settings. What makes you think that would be impossible to improve with some professional tweaking from people who actually KNOW what the settings mean, rather than blindly experimenting?

They dropped geometry when the rendering took too long. Anyone with a basic understanding how the q3a engine's BSP operates and what true hardware HSR can or cannot do would know the difference.
 
Ailuros said:
Triangle setup is still not completely shared between chips.

True. How much space does that burn up, then? Can't be *that* bad... or am I wrong?

Typical answer I'd expect from any of the ex-3dhq wizards.

'kay. =P

What about it? NV20 had the same multitexturing fillrate as dual chip Spectre and the only other significant advantage over it would have been the filter on scanout trick, which by the way works under conditionals too even on NV25. It saves bandwidth only AFAIK when the average performance rate doesn't drop below 2/3rd of the resolution refresh rate.

Recursive texturing and massive way-ahead-of-its-time bandwidth not an advantage?

The peak numbers touted from either/or was Sage 50MVPS vs 40MVPS on NV20. Where each of the two would have been able to sustain a higher number in reality is open to everyone's imagination, yet the NV20's "texture shader units" weren't particularly weak either, rather the contrary.

SAGE was pretty well known for being efficient with added lights (much like Naomi2's Elan) at the time. I wouldn't be surprised if some of its more creative optimisations are already in NV3x's stunningly fast fixed-function TCL... but it would have to be, considering it was relying on the AGP bus for most of its bandwidth (I think 3dfx put it at a realistic, sustained 10 MPolys/sec, but I could be a few million off).

Again, see Uttar's post. Plus you said yourself that GP-1 is low-specced - it's something like one pixel pipeline at 100MHz, sorta like Neon-250. And whaddya know, it's actually a bit faster than N250. Cool!

Series2 used Supersampling. Go and recheck your facts on up to which resolution SSAA was possible on Series2 and 3 (locked in drivers), and maybe it might ring a bell why SSAA can only be bandwidth free on a TBDR. Again, MSAA is fillrate and bandwidth free on a TBDR, else explain how the hell it managed to even get a decent framerate for its specs in 1280*1024 with AA on with a meager 200MTexels/sec fillrate.

That wasn't what I was referring to. I was referring to any performance at all.

In terms of Supersampling how many fps did the K2 yield with 4xSSAA on in the same game in 32bpp and how many a dual chip V5? 350MTexels/sec for the K2 and 667MTexels/sec for the V5.

Both were playable with 4x AA at 800x600. K2 doesn't let you go any higher without dropping to 2x... but K2 and V5's frame rates are never that different, with or without AA. K2 is generally slightly faster, though not by much.

No serious IHV would ever use a flawed, obviously bad quality AF implementation officially in its drivers... nor would they not allow you to use trilinear filtering no matter how hard you try... OH, WAIT! BUT THAT IS DONE!

3d-FX ---> GF-FX :p

Seriously though, NV3x have different anisotropic filtering modes, where it's left to the user's choice which mode he prefers, instead of being restricted to a mediocre solution that operates only in conjunction with MSAA.

Recursive texturing and an intelligent cache make Rampage do AF with a much lower bandwidth hit than usual, plus the TMU's are capable of trilinear per tick, unlike NV2x.

They dropped geometry when the rendering took too long. Anyone with a basic understanding how the q3a engine's BSP operates and what true hardware HSR can or cannot do would know the difference.

You could do the same thing in many strictly front-to-back games. And AFAIR Rampage(/SAGE) has some kind of ordering assist, not sure if it's in the driver or hardware level, but I'm 90% sure it's possible.


I don't know if she is sure, but I do know that she is wrong: R3x0 is 4*64bit.

I pulled that from memory, oh well, my point was that I know that they aren't truly 256-bit either.
 
Tagrineth said:
I pulled that from memory, oh well, my point was that I know that they aren't truly 256-bit either.

You seem to be missing a major point here....
They ARE truly 256bit - Rampage wasn't. Look, in one case, you have a 256bit bus to one chip (undeniable) that is broken into 4 64bit channels simply for efficiency - they can be used (IIRC) to transfer either 4 64bit datums, or 2 128bit datums, or one 256bit datum, or some combo of the above, in each cycle.
In the other case, you are trying to say that two entirely separate 128bit busses to two separate chips are a 256bit bus. Please tell me you get the difference.

In one case, it's like you have 4 train tracks going to one station - each can carry a load, and they can work together to carry a bigger load. In the other, you've got one train going to one station, and the other going to another station....
 
ram said:
I'd say 16 ROPs. And I'm not so sure the... So the enhancement from NV2x -> NV30/35 seems pretty small to me. But the question remains - what exactly did they enhance, and why isn't it possible to write 16 z-samples in 0xAA mode?
Maybe better to say "1xAA" ;) because it calculates just one subpixel.

NV30 can support 8 texel or "zexel" per clock, not 16 "zexel". The table shows that NV30 can write 16 z-samples, of course only with 2x AA, because with 2x AA the 16 z-samples belong to 8 "zexel". To process a Z-pass without textures, more work is needed than just the Z-test.
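The sample arithmetic here can be sketched as follows. This is only a toy model built from the numbers in this thread (8 zixels per clock from the pipeline, 16 z-samples per clock at the ROPs); the function names and the hard cap are assumptions for illustration, not a description of the actual hardware.

```python
def effective_zixels(pipeline_zixels: int, rop_samples: int, aa_samples: int) -> int:
    """Zixels finished per clock: limited either by the pipeline
    or by how many AA samples the ROPs can write per clock."""
    return min(pipeline_zixels, rop_samples // aa_samples)

def z_samples_written(pipeline_zixels: int, rop_samples: int, aa_samples: int) -> int:
    """Z-samples written per clock for the given AA mode."""
    return effective_zixels(pipeline_zixels, rop_samples, aa_samples) * aa_samples

# Thread figures for NV30: 8 zixels/clock, 16 z-samples/clock.
for aa in (1, 2, 4):
    print(aa, effective_zixels(8, 16, aa), z_samples_written(8, 16, aa))
```

Under these assumptions 1x AA writes only 8 samples even though 16 could be written, 2x AA uses all 16 at the full 8 zixels per clock, and 4x AA drops to 4 zixels per clock.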


Xmas said:
NVidia went the other way with GF3, implementing Z-compression and putting 16 (iirc, maybe 8 ) ROPs in the chip. But they also found another task for them: early Z. A GF3 can discard 16 pixels per clock when running without AA.
According to Demirug, the GeForces probably only discard complete 2x2 tiles with their Early Z.
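What "only complete 2x2 tiles" would mean in practice, as a minimal sketch: it assumes a less-than depth test and a per-pixel stored Z, and the function name is hypothetical. How the real chips decide this is not stated in the thread.

```python
def quad_can_be_discarded(incoming_z, stored_z) -> bool:
    """True only if all four pixels of a 2x2 quad fail a less-than
    depth test, i.e. none of them would be visible."""
    assert len(incoming_z) == 4 and len(stored_z) == 4
    return all(z_new >= z_old for z_new, z_old in zip(incoming_z, stored_z))

# All four pixels are behind what is already in the Z-buffer: quad is dropped.
print(quad_can_be_discarded([0.9, 0.9, 0.8, 0.9], [0.5, 0.5, 0.5, 0.5]))

# One pixel is in front: the whole quad has to go down the pipeline.
print(quad_can_be_discarded([0.9, 0.4, 0.8, 0.9], [0.5, 0.5, 0.5, 0.5]))
```

The consequence is that a quad straddling an occluder's edge gets no early rejection at all, so the "16 pixels per clock" discard rate would only apply to fully hidden quads.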


Uttar said:
I expect that in two generations' time ( NV50 and R600 ), all parts will support 16x MSAA+
I honestly doubt that. The additional gain in edge smoothing gets significantly lower after 4x RG, 6x sparse or even 8x sparse. I think 8x is the maximum for the foreseeable future.
 
aths said:
NV30 can support 8 texel or "zexel" per clock, not 16 "zexel".

Aths, I think you missed my point. The question was why NV30/35 can write 8 independent zixels, while NV2x can only write 4 of them. FX can sample and write 16 zixel per clock, but the latter only with MSAA. Assuming the ROPs are decoupled from the rest of the pipeline, the question remains what part of the chip restricts the chip of handling only 16 >independent< zixels.

Among other things, the purpose of a ROP is to do framebuffer blending operations and z-/stencil-tests. Obviously, there is enough logic to handle 16 color, z-/stencil values / clock, as 4X MSAA is possible without a fillrate drop if there is no bandwidth bottleneck.
To process a Z-pass without textures, more work is needed than just the Z-test.

Obviously. The question is what exactly has been enhanced, if not the ROPs. And what exactly limits the ROPs from handling 16 independent zixels. The limitation to 8 independent zixels could still be just a limitation in the ROPs, which might have something to do with the number of read ports at the ROPs. If the ROPs are organized as 4x4, there might be only read ports for four x/y-values (+offset/aa mode), which have been enhanced to eight with GFFX. Another place would be the samplers, which can only output 8 x/y values. Or even a limitation with z-buffer compression logic.
 
True. How much space does that burn up, then? Can't be *that* bad... or am I wrong?

Are you familiar with the term redundancy at all?

Recursive texturing and massive way-ahead-of-its-time bandwidth not an advantage?

As I said Tilers had loopback functions way before that and nowadays chips have advanced even beyond that. As for bandwidth I'd rather have a very effective bandwidth saving technique (if not a full TBDR str8 away) with 10GB/sec raw bandwidth, than 30GB/sec raw bandwidth with no advanced bandwidth saving techniques at all.

SAGE was pretty well known for being efficient with added lights (much like Naomi2's Elan) at the time. I wouldn't be surprised if some of its more creative optimisations are already in NV3x's stunningly fast fixed-function TCL... but it would have to be, considering it was relying on the AGP bus for most of its bandwidth (I think 3dfx put it at a realistic, sustained 10 MPolys/sec, but I could be a few million off).

Fixed function what on NV3x's? LOL :D What on God's green earth do we have Vertex Shaders for nowadays?


That wasn't what I was referring to. I was referring to any performance at all.

I love it when people try to elegantly flip out of it......

Both were playable with 4x AA at 800x600. K2 doesn't let you go any higher without dropping to 2x... but K2 and V5's frame rates are never that different, with or without AA. K2 is generally slightly faster, though not by much.

With the only other difference that you needed two chips to achieve what a single chip TBDR was able to. KYRO2 was able for 4xOGSS up to 1024*768 and 2x Vertical up to 1280*1024; where the K2 supposedly flips back to 2x in 1024 is beyond me. Shall I scratch another one on the scoreboard milady ;)

Recursive texturing and an intelligent cache make Rampage do AF with a much lower bandwidth hit than usual, plus the TMU's are capable of trilinear per tick, unlike NV2x.

With the only other difference that it hardly deserves the term AF as a starter. Did you know that Xabre has antialiasing for free? 1x AA over the whole scene which is effectively what I could call in that sense AA for free too.

You could do the same thing in many strictly front-to-back games. And AFAIR Rampage(/SAGE) has some kind of ordering assist, not sure if it's in the driver or hardware level, but I'm 90% sure it's possible.

In wild imaginations anything is possible.

I pulled that from memory, oh well, my point was that I know that they aren't truly 256-bit either.

Gee my oh my.... :rolleyes:
 
ram said:
Aths, I think you missed my point. The question was why NV30/35 can write 8 independent zixels, while NV2x only can write 4 of them.
"Because" NV30 can work as 4x2 or 8x0. Afaik, you need some additional interpolators to write more Z-values, but I have no in-depth knowledge of actual Z-logic.

ram said:
FX can sample and write 16 zixel per clock, but the latter only with MSAA.
So NV25 also can :) but only with 4x MSAA, and without textures.

ram said:
Assuming the ROPs are decoupled from the rest of the pipeline, the question remains what part of the chip restricts the chip of handling only 16 >independent< zixels.
Some simplifications in the control logic, I assume. (Or maybe the Triangle Setup cannot handle 16 "pixel" or "zixel" per clock anyway?)

ram said:
Obviously. The question is what exactly has been enhanced, if not the ROPs. And what exactly limits the ROPs from handling 16 independent zixels. The limitation to 8 independent zixels could still be just a limitation in the ROPs, which might have something to do with the number of read ports at the ROPs. If the ROPs are organized as 4x4, there might be only read ports for four x/y-values (+offset/aa mode), which have been enhanced to eight with GFFX. Another place would be the samplers, which can only output 8 x/y values. Or even a limitation with z-buffer compression logic.
Who still plays with AA disabled? Your question is interesting indeed, but I think this is not a serious limitation of NV30.
 
ram said:
So NV25 also can :) but only with 4x MSAA, and without textures.

AFAIK, assuming no bandwidth bottleneck, NV25's ability to sample and write 16 (partly dependent) zixels/clock in the 4x MSAA case isn't limited by the usage of textures. Basically, there could be up to two textures per pixel. It's the same with the NV30/35.
 
ram said:
ram said:
So NV25 also can :) but only with 4x MSAA, and without textures.

AFAIK, assuming no bandwidth bottleneck, NV25's ability to sample and write 16 (partly dependent) zixels/clock in the 4x MSAA case isn't limited by the usage of textures. Basically, there could be up to two textures per pixel. It's the same with the NV30/35.
Yeah, but there is a serious bandwidth bottleneck. 4x MSAA with 32-bit rendering and textures means 16 32-bit subpixels (512 bits) alone. In fact, NV25 doesn't need its 16 ROPs.
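The 512-bit figure can be checked with a quick calculation. The sketch below also compares it against a memory bus, assuming a 128-bit DDR interface (two transfers per clock) running at the same clock as the core; that bus configuration and the function names are assumptions for illustration, and compression, Z traffic and texture fetches are ignored.

```python
def bits_needed_per_clock(samples_per_clock: int, bits_per_sample: int) -> int:
    """Raw color bits that would have to be written each clock."""
    return samples_per_clock * bits_per_sample

def bus_bits_per_clock(bus_width_bits: int, transfers_per_clock: int) -> int:
    """Peak bits the memory bus can move per clock."""
    return bus_width_bits * transfers_per_clock

# 4 pixels x 4 samples = 16 subpixels at 32 bits each (from the post).
needed = bits_needed_per_clock(samples_per_clock=16, bits_per_sample=32)

# Assumed bus: 128 bits wide, DDR, same clock as the core.
available = bus_bits_per_clock(bus_width_bits=128, transfers_per_clock=2)

print(f"needed: {needed} bits/clock, available: {available} bits/clock")
print(f"shortfall factor: {needed / available:.1f}x")
```

Under those assumptions the color writes alone want twice what the bus can deliver at peak, before any Z or texture traffic is counted.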
 
aths said:
Yeah, but there is a serious bandwidth bottleneck. 4x MSAA with 32-bit rendering and textures means 16 32-bit subpixels (512 bits) alone. In fact, NV25 doesn't need its 16 ROPs.

Right. I'd say it needs at least parts of them to do the 16 z-compares/clock. As 16 z-writes aren't needed in practice, the write ability is probably limited to 8 or even just 4 zixels/clock with the NV25. It should be possible to check that with one of the benchmarks doing z-only passes, 4xAA, B2F rendering @ halved core clock.
 
ram said:
aths said:
Yeah, but there is a serious bandwidth bottleneck. 4x MSAA with 32-bit rendering and textures means 16 32-bit subpixels (512 bits) alone. In fact, NV25 doesn't need its 16 ROPs.
Right. I'd say it needs at least parts of them to do the 16 z-compares/clock. As 16 z-writes aren't needed in practice, the write ability is probably limited to 8 or even just 4 zixels/clock with the NV25. It should be possible to check that with one of the benchmarks doing z-only passes, 4xAA, B2F rendering @ halved core clock.
There's no reason that NV25 would be designed to do 16 Z-compares per clock because there's not enough bandwidth.
 
OpenGL guy said:
There's no reason that NV25 would be designed to do 16 Z-compares per clock because there's not enough bandwidth.
Nvidia states that NV25 can calculate 16 AA-Samples per clock. With MSAA, every subpixel has to be z-tested, so I deduce that NV25 has 16 ROPs. Indeed, there is not enough bandwidth to practically utilize this.

But Integer logic like a ROP seems to be small, on the other hand, and NV can advertise with insane AA-Fillrate. NV cuts out the possibility to calculate 2 AF-Samples per Pipe on NV25. So bilinear AF costs the performance of trilinear AF (which means, to lower the fillrate by 75%.)
 
aths said:
OpenGL guy said:
There's no reason that NV25 would be designed to do 16 Z-compares per clock because there's not enough bandwidth.
Nvidia states that NV25 can calculate 16 AA-Samples per clock. With MSAA, every subpixel has to be z-tested, so I deduce that NV25 has 16 ROPs. Indeed, there is not enough bandwidth to practically utilize this.
They also claimed that 4x AA was free on the NV3x, and this hasn't been shown to be the case. In fact, the results shown in this thread show that only 2x AA is "free" on the NV3x.
But Integer logic like a ROP seems to be small, on the other hand, and NV can advertise with insane AA-Fillrate. NV cuts out the possibility to calculate 2 AF-Samples per Pipe on NV25. So bilinear AF costs the performance of trilinear AF (which means, to lower the fillrate by 75%.)
It doesn't matter how large you think it is: You don't over engineer things by principle.
 