Why do we still have fixed function AA and filtering?

cho said:
hmm, the NVIDIA 16XAA = [4X(4X OGMS)] AA

You mean 16X True MSAA through a dirty trick, in the NV3xGL?
Just making sure that's what you meant - I still haven't had the opportunity to verify it.

Ailuros: Well, practically, it's exactly what I was thinking for the NV3x: 4 Z/Stencil units per pipeline, but capable of looping back.

This trick first appeared in Rampage ( or rather, was first set to appear in Rampage ) - in Rampage's case, it was extreme, since you had 1 Z/Stencil unit/pipeline, and up to 8x MSAA.

The idea is that you loopback, and the loopback is free if you need to loopback anyway for multitexturing or complex shading programs. So, with time, this will become significantly cheaper.

There is, however, one area where, AFAIK, it will always create a dramatic performance hit: Games like Doom3, where you're fillrate limited in cases of 0 textures and other stuff. Because here, there's no required loopback, so you take the exact same performance hit as SSAA ( in that specific situation of course, not the whole frame - it's still faster than SSAA, of course ).

I'm sure you could do even more clever things to reduce that though. What about using idle FP or Texturing units to calculate additional Z values? Although unless each pipeline is absolutely godly in terms of FP power, it is unlikely you'll cancel the performance hit - but you'd still reduce it.

I'm unsure as to whether this method could cause any transistor count increase, but I doubt it would create a significant one anyway.


Uttar
 
But maybe, with all the bad press they got, they're focusing on beating ATI in this category, not just being equal...

Is this the R3** AA they're trying to beat (which is the minimum I'd expect from them), or whatever tech ATI is planning for their next-gen part?
 
Uttar said:
There is, however, one area where, AFAIK, it will always create a dramatic performance hit: Games like Doom3, where you're fillrate limited in cases of 0 textures and other stuff.
At some point, even in rather exceptional cases like the z-only pass in DOOM3, add too many z-test units for MSAA and they will become limited by memory bandwidth.
 
Ailuros: Well, practically, it's exactly what I was thinking for the NV3x: 4 Z/Stencil units per pipeline, but capable of looping back.

Unless, let's say, a future chip reconfigures its Z test units dynamically, independent of the AA mode, looping each Z unit 4 times in order to get 16x sample AA also means 4x the processing effort (please correct me if I'm wrong).

This trick first appeared in Rampage ( or rather, was first set to appear in Rampage ) - in Rampage's case, it was extreme, since you had 1 Z/Stencil unit/pipeline, and up to 8x MSAA.

There's not a single whitepaper or PowerPoint presentation, not even a hint, that even the high-end Spectre was to support more than 4xRGMS in the drivers, although it would have been possible with no loops on a dual-chip Spectre.

What's so extreme there? 2 loops are half of 4 loops, and half the processing time, I figure (in the case of 8xRGMS on a single chip).


The idea is that you loopback, and the loopback is free if you need to loopback anyway for multitexturing or complex shading programs. So, with time, this will become significantly cheaper.

In that case, unless we're talking about dynamic calculations (which should be child's play for a TBDR to implement :p), it should cost processing time.

There is, however, one area where, AFAIK, it will always create a dramatic performance hit: Games like Doom3, where you're fillrate limited in cases of 0 textures and other stuff. Because here, there's no required loopback, so you take the exact same performance hit as SSAA ( in that specific situation of course, not the whole frame - it's still faster than SSAA, of course ).

I doubt chip designers ignore that fact and the likelihood of heavy stenciling appearing more and more in games. Are we talking about a usable sampling pattern here or a marketing fluff number?

I'm sure you could do even more clever things to reduce that though. What about using idle FP or Texturing units to calculate additional Z values? Although unless each pipeline is absolutely godly in terms of FP power, it is unlikely you'll cancel the performance hit - but you'd still reduce it.

Don't know. I have my own Pandora's box of NV40 tidbits, but they don't yet help me go that far.

By the way, the NV25 (for those who forget all the possible rumour mill exaggerations) was supposed to have 10x sample AA prior to its release.

I'm not excluding anything yet, but I think the real usable modes will have quite a few fewer samples than 16x.
 
Oops, I forgot: at a supposed 8 pipelines and 4 Z units/pipe, it's only 2 loops for 16x samples. Shoot me, but I'd still prefer 8x sparse dozens of times over blah 16xOGMS.
 
Ailuros said:
Ailuros: Well, practically, it's exactly what I was thinking for the NV3x: 4 Z/Stencil units per pipeline, but capable of looping back.
Unless, let's say, a future chip reconfigures its Z test units dynamically, independent of the AA mode, looping each Z unit 4 times in order to get 16x sample AA also means 4x the processing effort (please correct me if I'm wrong).
If the z-test units were decoupled from the pixel pipelines and placed before them, with a small buffer between, then there wouldn't need to be 4x the processing effort (Btw, current designs likely already have z-test units a fair bit before the pixel pipelines, to throw out pixels early...). As long as each pixel took 4 clock cycles in the pixel shader unit, the 4-sample per clock z-test unit would have that much time to perform 16 z-tests before it held up the pipeline.
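A back-of-the-envelope cycle count makes the argument above concrete. This is only a toy model, not a description of any real chip: the function names are made up for illustration, and it assumes the buffer between the z-test stage and the pixel shader hides all latency, so throughput is set by whichever stage is slower.

```python
import math


def z_test_cycles(samples_per_pixel, z_tests_per_clock):
    """Clocks the z-test unit needs to test every sample of one pixel."""
    return math.ceil(samples_per_pixel / z_tests_per_clock)


def cycles_per_pixel(shader_cycles, samples_per_pixel, z_tests_per_clock):
    """Steady-state clocks per pixel when z-test and shading overlap.

    Assumption: the two stages run in parallel behind a buffer, so the
    slower stage alone determines throughput.
    """
    return max(shader_cycles, z_test_cycles(samples_per_pixel, z_tests_per_clock))


def z_stall_cycles(shader_cycles, samples_per_pixel, z_tests_per_clock):
    """Extra clocks per pixel spent waiting on z-tests (0 means 'free')."""
    return max(0, z_test_cycles(samples_per_pixel, z_tests_per_clock) - shader_cycles)


if __name__ == "__main__":
    # 16 samples, 4 z-tests per clock, 4-cycle pixel shader: no stall.
    print(cycles_per_pixel(4, 16, 4), z_stall_cycles(4, 16, 4))
    # Same AA mode on a z-only pass (1 shader cycle): the z loop dominates.
    print(cycles_per_pixel(1, 16, 4), z_stall_cycles(1, 16, 4))
```

Under these assumptions the 16 z-tests hide completely behind a 4-cycle shader, while a 1-cycle z-only pass pays for all four z loops, which is the Doom3-style case raised earlier in the thread.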
 
You're assuming that all chips at present actually have the number of Z units per pipe equal to that of their (presently exposed) maximum native MSAA mode.
 
That lit a lightbulb. I totally forgot about that one, and it truly originates from the 3dfx technology books.
 
Ailuros said:
Oops, I forgot: at a supposed 8 pipelines and 4 Z units/pipe, it's only 2 loops for 16x samples. Shoot me, but I'd still prefer 8x sparse dozens of times over blah 16xOGMS.

Any day of the week and twice on Sunday, as the old saying goes.
 
John Reynolds said:
Ailuros said:
Oops, I forgot: at a supposed 8 pipelines and 4 Z units/pipe, it's only 2 loops for 16x samples. Shoot me, but I'd still prefer 8x sparse dozens of times over blah 16xOGMS.

Any day of the week and twice on Sunday, as the old saying goes.

Never On Sunday (no I don't expect you to get the hint if you haven't seen the movie) :LOL:
 
Chalnoth said:
If the z-test units were decoupled from the pixel pipelines and placed before them, with a small buffer between, then there wouldn't need to be 4x the processing effort (Btw, current designs likely already have z-test units a fair bit before the pixel pipelines, to throw out pixels early...). As long as each pixel took 4 clock cycles in the pixel shader unit, the 4-sample per clock z-test unit would have that much time to perform 16 z-tests before it held up the pipeline.

Thank you Chalnoth, that was precisely what I was talking about :)
Rampage did this originally as I said, and it's certainly a very nifty trick for future generations of hardware.

Dave: Well, I was actually thinking that NVIDIA was very unclear about all this stuff, because they once said there were 4 Z units per pipeline. They said that while pretending there were 8 pipelines, as if there were 32 units. And yet, as you proved yourself, when you got 4x FSAA, the 8 zixels trick doesn't work anymore, so we clearly got 16 units here.

So unless you're referring to that, which I doubt since it seems incorrect to me, you would be talking about the R300. Hmm, 4 Z units/pipeline for the R300 & R350, so you get potential loopback for 6x MSAA? That would seem odd to me. Although I must admit I have no idea about the number of Z units in the R300.

What about the RV350 then? The 4x MSAA performance is fairly bad in it. While there are many potential explanations, having only 2 Z units/pipeline and potential loopback certainly seems like an attractive, although unlikely, explanation to me.


Uttar
 
Chalnoth said:
(Btw, current designs likely already have z-test units a fair bit before the pixel pipelines, to throw out pixels early...)
I've only heard ATI claim with the R300 that it supports early z tests per pixel. This does require a greater number of transistors because it lengthens the pipeline. I'm wondering if this is why they can do centroid sampling in the texture units and Nvidia can't.
 
3dcgi said:
Chalnoth said:
(Btw, current designs likely already have z-test units a fair bit before the pixel pipelines, to throw out pixels early...)
I've only heard ATI claim with the R300 that it supports early z tests per pixel. This does require a greater number of transistors because it lengthens the pipeline. I'm wondering if this is why they can do centroid sampling in the texture units and Nvidia can't.
What's centroid sampling?
 
Chalnoth said:
3dcgi said:
Chalnoth said:
(Btw, current designs likely already have z-test units a fair bit before the pixel pipelines, to throw out pixels early...)
I've only heard ATI claim with the R300 that it supports early z tests per pixel. This does require a greater number of transistors because it lengthens the pipeline. I'm wondering if this is why they can do centroid sampling in the texture units and Nvidia can't.
What's centroid sampling?
I guess you missed this thread.

Centroid sampling is used in conjunction with MSAA to ensure that texture samples are not taken outside of the current triangle being rasterized.
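For anyone who wants to see the idea in code, here is a minimal sketch. It is an illustration under stated assumptions, not how any particular chip does it: the 4-sample pattern and the function name are hypothetical, and real hardware may choose the location differently (for example, by snapping to one of the covered samples). Averaging the covered sample positions works because a triangle is convex, so the average of points inside it is also inside it.

```python
# Hypothetical 4-sample pattern; offsets are within a unit pixel.
SAMPLE_OFFSETS = [
    (0.375, 0.125),
    (0.875, 0.375),
    (0.125, 0.625),
    (0.625, 0.875),
]

PIXEL_CENTER = (0.5, 0.5)


def centroid_sample_position(coverage_mask, sample_offsets=SAMPLE_OFFSETS):
    """Pick where to evaluate texture coordinates for one pixel.

    coverage_mask has bit i set if the triangle covers sample i.
    Returns None when no sample is covered (nothing gets shaded).
    """
    covered = [
        pos for i, pos in enumerate(sample_offsets) if (coverage_mask >> i) & 1
    ]
    if not covered:
        return None
    if len(covered) == len(sample_offsets):
        # Fully covered pixel: treat the ordinary pixel center as safe
        # (an assumption of this sketch).
        return PIXEL_CENTER
    # Partially covered: average the covered sample positions. Each one
    # lies inside the triangle, so their average does too.
    x = sum(p[0] for p in covered) / len(covered)
    y = sum(p[1] for p in covered) / len(covered)
    return (x, y)


if __name__ == "__main__":
    print(centroid_sample_position(0b1111))  # fully covered pixel
    print(centroid_sample_position(0b0011))  # edge pixel, two samples covered
```

Without this, an edge pixel whose center lies outside the triangle would have its texture coordinates extrapolated past the triangle's edge, which is where the artifacts come from.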
 
Rampage did this originally as I said, and it's certainly a very nifty trick for future generations of hardware.

He was referring to ROPs, to be exact, and he mentioned specifically under which conditions it works best. Texel fillrate remains intact for those conditions, while pixel fillrate gets reduced by 1/4th, and that's the reason why on Spectre you got, with X layers of multitexturing, Y samples of MSAA essentially for "free" etc etc.

It is a clever trick, but it has as many conditionals as the filter-on-scanout trick apparently has. If my memory doesn't betray me again, it works only on NV3x when the minimum framerate doesn't drop below 3/5th of the resolution's refresh rate, or something like that.
 