cho said: hmm, the NVIDIA 16XAA = [4X(4X OGMS)] AA
We're not talking about current AA in this context.
But maybe, with all the bad press they got, they're focusing on beating ATI in this category, not just being equal...
Uttar said: There is, however, one area where, AFAIK, it will always create a dramatic performance hit: Games like Doom3, where you're fillrate limited in cases of 0 textures and other stuff.
At some point, even in rather exceptional cases like the z-only pass in DOOM3, add too many z-test units for MSAA and they will become limited by memory bandwidth.
Ailuros: Well, practically, it's exactly what I was thinking for the NV3x: 4 Z/Stencil units per pipeline, but capable of looping back.
This trick first appeared in Rampage ( or rather, was first set to appear in Rampage ) - in Rampage's case, it was extreme, since you had 1 Z/Stencil unit/pipeline, and up to 8x MSAA.
The idea is that you loopback, and the loopback is free if you need to loopback anyway for multitexturing or complex shading programs. So, with time, this will become significantly cheaper.
There is, however, one area where, AFAIK, it will always create a dramatic performance hit: Games like Doom3, where you're fillrate limited in cases of 0 textures and other stuff. Because here, there's no required loopback, so you take the exact same performance hit as SSAA (in that specific situation only, not over the whole frame, where it's still faster than SSAA).
I'm sure you could do even more clever things to reduce that though. What about using idle FP or texturing units to calculate additional Z values? Unless each pipeline is absolutely godly in terms of FP power, it's unlikely you'd cancel the performance hit, but you'd still reduce it.
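To put rough numbers on the loopback argument, here's a minimal sketch. It assumes a pixel's cost is simply the larger of the Z/stencil loops and the loops it needs for shading anyway. The function name and the model are illustrative only, not a description of how any actual chip schedules work.

```python
import math

def loops_per_pixel(aa_samples, z_units_per_pipe, shading_loops):
    """Passes one pixel makes through a pipeline under the loopback model.

    Assumption: each pass tests z_units_per_pipe samples, and a pass that
    is already needed for shading carries its Z/stencil tests for free.
    """
    z_loops = math.ceil(aa_samples / z_units_per_pipe)
    return max(z_loops, max(1, shading_loops))

# Rampage-style figures from the post: 1 Z/stencil unit per pipeline, 8x MSAA.
print(loops_per_pixel(8, 1, shading_loops=1))   # z-only pass: 8
print(loops_per_pixel(8, 1, shading_loops=8))   # heavy shading: 8, the AA loops are hidden
# The NV3x guess from the post: 4 Z/stencil units per pipeline, 16 samples.
print(loops_per_pixel(16, 4, shading_loops=1))  # 4
```

In the z-only case nothing hides the extra loops, which is the Doom3 situation described above. Once the shading itself needs that many passes, the AA costs nothing extra in this model.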
Ailuros said: Unless, let's say, a future chip reconfigures its Z test units dynamically, independent of the AA mode, in the case of looping each Z unit 4 times in order to get 16x sample AA, it also means 4x the processing effort (please correct me if I'm wrong).
If the z-test units were decoupled from the pixel pipelines and placed before them, with a small buffer between, then there wouldn't need to be 4x the processing effort (Btw, current designs likely already have z-test units a fair bit before the pixel pipelines, to throw out pixels early...). As long as each pixel took 4 clock cycles in the pixel shader unit, the 4-sample per clock z-test unit would have that much time to perform 16 z-tests before it held up the pipeline.
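The decoupled z-test idea (a z-test stage in front of the pixel shader, with a small buffer between) can be checked with a toy queue model. This is only a sketch under stated assumptions: both stages handle one pixel at a time, the z stage blocks while the buffer is full, and `total_clocks` and `fifo_slots` are made-up names, not anything from real hardware.

```python
def total_clocks(num_pixels, z_clocks, shader_clocks, fifo_slots=2):
    """Clocks to push num_pixels through a z-test stage, a FIFO, and a shader stage.

    Toy model, not any real chip: both stages handle one pixel at a time,
    and the z stage blocks while the FIFO is full.
    """
    z_out = []         # clock at which each pixel leaves the z-test stage
    shader_start = []  # clock at which each pixel enters the shader stage
    shader_free = 0    # clock at which the shader stage is next idle
    for i in range(num_pixels):
        start = z_out[i - 1] if i else 0
        done = start + z_clocks
        if i >= fifo_slots:
            # Wait until an older pixel has left the FIFO for the shader.
            done = max(done, shader_start[i - fifo_slots])
        z_out.append(done)
        shader_start.append(max(done, shader_free))
        shader_free = shader_start[i] + shader_clocks
    return shader_free

# 16 samples at 4 z-tests per clock means 4 clocks per pixel in the z stage.
print(total_clocks(100, z_clocks=4, shader_clocks=4))  # 404: the z-tests hide behind shading
print(total_clocks(100, z_clocks=4, shader_clocks=1))  # 401: z-only pass, the z stage is the bottleneck
print(total_clocks(100, z_clocks=1, shader_clocks=1))  # 101: no AA, for comparison
```

With a 4-clock shader the 16 z-tests cost only the initial 4 clocks of latency. With a 1-clock shader the same z stage makes the pass roughly 4x slower than the no-AA case.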
Ailuros said: Oops, I forgot: at a supposed 8 pipelines and 4 Z units/pipe, it's only 2 loops for 16x sample. Shoot me, but I'd still prefer 8x sparse dozens of times more than blah 16x OGMS.
John Reynolds said:
Any day of the week and twice on Sunday, as the old saying goes.
Chalnoth said: If the z-test units were decoupled from the pixel pipelines and placed before them, with a small buffer between, then there wouldn't need to be 4x the processing effort (Btw, current designs likely already have z-test units a fair bit before the pixel pipelines, to throw out pixels early...). As long as each pixel took 4 clock cycles in the pixel shader unit, the 4-sample per clock z-test unit would have that much time to perform 16 z-tests before it held up the pipeline.
Chalnoth said: (Btw, current designs likely already have z-test units a fair bit before the pixel pipelines, to throw out pixels early...)
I've only heard ATI claim with the R300 that it supports early z tests per pixel. This does require a greater number of transistors because it lengthens the pipeline. I'm wondering if this is why they can do centroid sampling in the texture units and Nvidia can't.
3dcgi said: I've only heard ATI claim with the R300 that it supports early z tests per pixel. This does require a greater number of transistors because it lengthens the pipeline. I'm wondering if this is why they can do centroid sampling in the texture units and Nvidia can't.
What's centroid sampling?
Chalnoth said: What's centroid sampling?
I guess you missed this thread.
Rampage did this originally as I said, and it's certainly a very nifty trick for future generations of hardware.