Z3 re-visited


Reverend
14-Mar-2003, 11:57
Okay, so I was wrong about expecting this to be implemented in the NV30.

Are we likely to see this in next-gen hardware AA algorithms from the various IHVs? How expensive would it be in terms of gates?

Ailuros
14-Mar-2003, 12:22
Rev,

Dumb question: why Z3 (or a similar algorithm) and not a Fragment AA algorithm instead (or even a combination of both with modifications - like SA had mentioned once)?

Arun
14-Mar-2003, 12:57
Don't worry about being wrong on it being implemented on the NV30.

I supposed the NV30 would support Wu antialiasing, combined with traditional antialiasing, to determine, based on analytical coverage, whether a nearby subpixel should be filled or not, even if the traditional algorithm says the opposite.
So there's no Z problem, nothing. It probably has few disadvantages besides the high transistor count required (I suppose; maybe there could be miraculous solutions to make it cheap, but that's unlikely).

It could, if implemented properly, give quality as high as the optimal sampling pattern for each pixel. So it's as if you had dynamic sampling patterns, even though they're really ordered.


Back on topic...
Z3 is IMO a very nice algorithm, but in worst-case scenarios it ain't that great. The problem may be that companies like nVidia & ATI like to derive workstation products from their standard products, and Z3's disadvantages are "out of the question" for a film studio.
And removing Z3 from an architecture would be a fairly substantial modification.

nVidia is very fond of their workstation strategy: they think that since it's developed on fundamentally the same architecture, it's also optimized for that architecture, which gives them a performance advantage...

So, you'd have to implement both a Z3 & Traditional path. It's wasted silicon, so I fear GPUs truly supporting Z3 might be quite rare, even in the future.


Uttar

3dcgi
14-Mar-2003, 16:14
I don't expect we'll ever see Z3 exactly in hardware, but something similar is possible. As Ailuros mentioned, we could see a combination of FAA and Z3. Because both are based on coverage masks, the fundamental pipelines are similar. I don't think Matrox has the stomach to risk it anymore, and I think the disadvantages will keep ATI away. I think ATI is generally happy with the combination of MSAA and compression. Nvidia, I'm not so sure of. Maybe 3dlabs could modify Superscene for a gaming chip where true 16x quality isn't necessary for micro-polygons.

Frankly, I think Z3 has some flaws that haven't been found yet because not enough tests were run, but I don't have any proof of that. Fewer flaws than FAA had, however.

As far as the gate cost goes, it's not too bad. There is some extra logic above the requirements for MSAA. Manpower/design time is the main issue there. The main gate cost will come from the data structure, i.e. whether a separate cache is needed.

SA
15-Mar-2003, 03:02
There are better algorithms than Z3 for hardware AA that solve the same problems. I prefer to think of Z3 as an approach to AA that involves using sorted fragment AA with bit masks and an upper limit on the mask depth. As to when these will actually be available in hardware, well that's another issue.
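
As an illustration only, the "sorted fragments with bit masks and an upper limit" idea can be sketched in Python as below. This is not the Z3 paper's exact method: the 16-sample mask, the cap of 3 fragments, the single colour channel, the opaque-only occlusion, and the merge rule (fold the two farthest fragments together, weighting colour by covered samples) are all assumptions made for the sketch.

```python
from dataclasses import dataclass

SAMPLES = 16        # assumed samples per pixel, one mask bit each
MAX_FRAGMENTS = 3   # assumed upper limit on fragments kept per pixel


@dataclass
class Fragment:
    z: float        # depth, smaller is nearer
    mask: int       # coverage bit mask, bit i set = sample i covered
    color: float    # single channel for brevity


def popcount(mask):
    """Number of samples covered by a mask."""
    return bin(mask).count("1")


def insert_fragment(fragments, new):
    """Return a new front-to-back list with `new` added (opaque only)."""
    ordered = sorted(fragments + [new], key=lambda f: f.z)
    covered = 0
    visible = []
    for f in ordered:
        mask = f.mask & ~covered      # drop samples hidden by nearer fragments
        if mask:
            visible.append(Fragment(f.z, mask, f.color))
        covered |= f.mask
    # Enforce the cap: merge the two farthest fragments. This is lossy,
    # because the merged fragment keeps only the nearer depth.
    while len(visible) > MAX_FRAGMENTS:
        b = visible.pop()
        a = visible.pop()
        na, nb = popcount(a.mask), popcount(b.mask)
        merged_color = (a.color * na + b.color * nb) / (na + nb)
        visible.append(Fragment(a.z, a.mask | b.mask, merged_color))
    return visible


def resolve(fragments, background=0.0):
    """Final pixel colour: each fragment weighted by its covered samples."""
    total = 0.0
    covered = 0
    for f in fragments:
        n = popcount(f.mask)
        total += f.color * n
        covered += n
    return (total + background * (SAMPLES - covered)) / SAMPLES
```

Typical use would be `frags = insert_fragment(frags, Fragment(z, mask, color))` per incoming fragment, then `resolve(frags)` once at the end of the frame. The merge step is where quality can be lost: a later fragment that lands between the two merged depths is tested against the nearer depth only.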

psurge
16-Mar-2003, 17:03
SA: care to provide a link or two for these better algorithms (if they are public that is)?

Regards,
Serge

3dcgi
17-Mar-2003, 16:26
What's everyone's opinion on the need for sort-independent transparency? Obviously this is a Z3 feature, but the same quality antialiasing can be done without supporting it. I couldn't find the paper, but Nvidia has developed a method called depth peeling that does sort-independent transparency with current cards.
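
For anyone unfamiliar with depth peeling, here is a rough CPU-side sketch in Python of the general idea for a single pixel: each pass keeps the nearest fragment that lies strictly behind the layer found by the previous pass, and the peeled layers are then blended in depth order. The function names, the `(z, color, alpha)` tuple layout, and the front-to-back blend are assumptions for illustration, not code from the Nvidia paper; on real hardware each pass is a full re-render of the scene.

```python
def depth_peel(fragments, max_layers):
    """Peel up to max_layers depth layers from one pixel's fragments.

    fragments: list of (z, color, alpha) in arbitrary submission order,
    smaller z is nearer. Fragments at exactly equal depth are not
    separated by this simple sketch.
    """
    layers = []
    prev_z = float("-inf")
    for _ in range(max_layers):
        # One "pass": ordinary nearest-wins depth test, restricted to
        # fragments strictly behind the previously peeled layer.
        nearest = None
        for frag in fragments:
            if frag[0] > prev_z and (nearest is None or frag[0] < nearest[0]):
                nearest = frag
        if nearest is None:
            break               # nothing left to peel
        layers.append(nearest)
        prev_z = nearest[0]
    return layers


def composite_front_to_back(layers, background):
    """Blend peeled layers, nearest first, over a background colour."""
    color = 0.0
    transmittance = 1.0
    for _z, c, a in layers:
        color += transmittance * a * c
        transmittance *= 1.0 - a
    return color + transmittance * background
```

The result does not depend on the order in which the fragments were submitted, but the cost is one pass per layer, and `max_layers` has to be chosen by the application.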

demalion
17-Mar-2003, 18:44
What about an AA FIFO buffer...except storing sample state instead of shader state? Recalling some of my speculations from the R400 guessing thread regarding occlusion culling, it seems to me now that proportional area weighting could be stored as part of the data in this buffer. The sample count would not be based on position, but on primitive association, and would vary for overdraw...I think as few as 4 discrete values could serve if exposure area sorting allowed samples to be displaced, though the trade-off of overdraw error versus index (z buffer) checking for this rejection might call for as few as 2. Hmm...actually, I think some of the things I was thinking of in that thread about the implications of the occlusion culling calculations and a unified shader model facilitate this when this type of buffer is considered. Seems this is a natural fit for an architecture with something like the F Buffer already being considered.

In any case, by tracking things like:

x0 intersection : x value of intersection with "top"
y0 intersection : y value of intersection with "left"
x1 intersection : x value of intersection with "bottom"
y1 intersection : y value of intersection with "right"
bias : which way the polygon extends from this edge line
xc : x value of corner
yc : y value of corner
w_trans : transparency weighting for the color data, to determine how much weighting is given the portion of "behind" color data that is occluded

for each buffer color, couldn't an "infinite resolution" blend occur at the end due to the coherency of the color data for the pixel pipeline the fifo buffer allows? With 4 bit accuracy for the x/y values, the equivalent of 256 sample OGMS would occur, wouldn't it? And that should be less expensive than a 4xMSAA/2xSSAA method in both memory and bandwidth usage, since the bit value total for that to occur would just have to be <= 64 (assuming a cap of 4 discrete values), and opportunities for compression would exist. Another question is if this would be feasible as an addition to pre-existing AA methods rather than a replacement...depends on the flexibility of the execution of the more traditional methods whether that would make sense, I think.

The trick would then be to optimize the evaluation of coverage interactions...and it seems to me that the latency for the evaluations for the data to be stored for final sampling could be hidden in waits for texture fetches and pixel shading calculations. Also, I think some opportunities exist for "threshold blocking" factors based on which x/y 0/1/c values are 0 above to speed this up...all in all, I think it might even be hidden by the basic pipelining latency.

As is often the case, I get the distinct feeling we've discussed these details, or very similar ones, before. I'm sure if I could search for "Z3" I'd find something addressing at least parts of this idea, but in the absence of that I apologize for any thoughts, and errors, I repeat from the past. Also, sorry in advance for any "Monday Math" type errors in my assumptions.

micron
17-Mar-2003, 23:50
<looks at above post>"I am simply not smart enough to hang out here"

SA
18-Mar-2003, 01:20
Concerning transparency sorting, I much prefer a method like Z3, where it comes for free, as opposed to a method like depth peeling that is expensive and must be explicitly coded for.

However, one important rendering aspect that Z3 does not address is providing the programmer precise control over what the rendering order of transparent surfaces should be, since it always renders them in z order.

PurplePigeon
18-Mar-2003, 02:21
I couldn't find the paper, but Nvidia has developed a method called depth peeling that does sort-independent transparency with current cards.

This looks like it:

http://developer.nvidia.com/docs/IO/1316/ATT/order_independent_transparency.pdf

demalion
18-Mar-2003, 02:57
<looks at above post>"I am simply not smart enough to hang out here"

That was most definitely not the simplest way to describe my thoughts, just a way that provides a lot of details and speculations for errors to be pointed out by others.

Just because you can't bridge the gap from your understanding to what was said in it right now doesn't mean you aren't capable of doing so. Just hang around and keep an open mind and learn what you can. My $0.02

micron
18-Mar-2003, 04:10
Thank you Demalion.

Simon F
18-Mar-2003, 08:20
I couldn't find the paper, but Nvidia has developed a method called depth peeling that does sort-independent transparency with current cards.

This looks like it:

http://developer.nvidia.com/docs/IO/1316/ATT/order_independent_transparency.pdf
Gosh I'd forgotten how complicated that was. The Dreamcast method was so much easier to use.

MfA
18-Mar-2003, 18:33
Complicated and slow, useless.

psurge
18-Mar-2003, 21:08
SA, what kind of drawing orders (other than Z) are necessary?

Would the addition of a primary sort key to transparent fragments address this? (i.e. each transparent object generates fragments with some specifiable "layer id", non-interpolated). Sort transparent fragments by id, then sort fragments with identical id in z...
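
A small Python sketch of that two-level ordering, purely as an illustration. The `TransparentFragment` type and its field names are made up, and two conventions are assumed that the post does not specify: lower layer ids are blended first, and within a layer fragments go back to front with larger z meaning farther away.

```python
from typing import NamedTuple


class TransparentFragment(NamedTuple):
    layer_id: int   # per-object sort key, constant across the object
    z: float        # depth, larger is farther (assumed convention)
    name: str       # stand-in for the fragment's colour/alpha payload


def order_for_blending(fragments):
    """Sort by layer id first, then back to front in z within a layer."""
    return sorted(fragments, key=lambda f: (f.layer_id, -f.z))
```

If every object gets the same layer id, this reduces to plain z ordering. Giving each draw call its own increasing id would reproduce submission order between objects.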

SA
19-Mar-2003, 13:14
The issue of render order is more one of programmatic control of the rendering, letting the programmer decide what special effects they want to generate, whether physically plausible or not. It is also a question of compatibility, since older applications may have rendered transparency in a particular order for a special effect, and this would be lost. It is important, therefore, when the hardware provides a means of ordering the rendering of transparency, that it be done under programmatic control, with the default being (input) sequential ordering.

A more important consideration is being able to defer calculations, since the calculations for a transparent surface may need to be deferred until all the transparent fragments are present. Since the calculations may vary from surface to surface for different surface types, you may need to defer several sets of calculations and associate each with its corresponding surface. With Z3, this is a bit of a problem: since it must merge fragments on the fly, it must perform the calculations for the merged fragments before all the fragments are present. However, the final results are generally good enough.

Fred
19-Mar-2003, 14:45
If quality is a problem for Z3, I wonder how useful it would be to take it to second order by storing additional 2nd derivatives.

arjan de lumens
19-Mar-2003, 17:43
Given that 2nd derivatives of Z tend to give very small values, I wouldn't expect that including 2nd derivatives makes much of a difference quality-wise. AFAIK, the problem with Z3 is that if you get too many fragments affecting the color of a pixel (the case with very small polygons and/or very many transparent layers), you get a buffer overflow and need to combine or remove fragments. While this can be made to work satisfactorily most (>99%?) of the time, working with a renderer that is known to glitch, even if only in unusual corner cases, just doesn't feel quite right for professional/cinematic use.

psurge
19-Mar-2003, 21:57
Well, thinking about it, it seems like the fb should be optimized such that it has fixed storage for say 1-3(?) fragments per pixel, with a fallback to an arbitrary-sized fragment list for those pixels with very high transparent overdraw or which contain lots of opaque fragments. I.e. basically, buffer overflow spills to an A-buffer-like structure instead of triggering a fragment merge operation.
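
A minimal Python sketch of that layout, as an illustration only. The class name, the 3-slot size, and the `(z, mask, color)` tuple are assumptions, and a plain Python list stands in for the A-buffer-like spill structure.

```python
FIXED_SLOTS = 3   # assumed number of in-framebuffer fragment slots per pixel


class PixelFragments:
    """Per-pixel storage: a few fixed slots plus an unbounded spill list."""

    def __init__(self):
        self.fixed = []      # models the fixed framebuffer storage
        self.overflow = []   # models the A-buffer-like fallback list

    def add(self, z, mask, color):
        frag = (z, mask, color)
        if len(self.fixed) < FIXED_SLOTS:
            self.fixed.append(frag)       # common case: no extra memory
        else:
            self.overflow.append(frag)    # rare case: spill, never merge

    def spilled(self):
        return bool(self.overflow)

    def sorted_fragments(self):
        """Complete fragment list for the pixel, nearest first."""
        return sorted(self.fixed + self.overflow, key=lambda f: f[0])
```

Because nothing is merged, the complete list is still available at resolve time, which is what per-pixel programmatic control of ordering and blending would need.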

Even if overflow is pretty rare, being able to operate on a complete fragment list for a pixel (for programmatic control of fragment ordering, blending in the PS) seems really expensive. This (once again) sounds like something a tiler would be able to handle rather well compared to an IMR.

3dcgi
20-Mar-2003, 16:30
I believe 3dlabs' 16x Superscene antialiasing has an overflow buffer like you mention. They apparently don't keep track of full coverage masks to allow for sort-independent transparency, though. If they did, that feature would be advertised. I think one of the B3D articles made mention of this overflow buffer.

arjan de lumens
20-Mar-2003, 18:02
The problem with such overflow buffers is, of course, that you have to dimension them for worst-case behaviour (=when the maximum # of samples needed is reached for too many pixels) or else add the infrastructure needed to do the 'stall rendering->allocate/swap out memory->resume rendering' roundtrip whenever the overflow buffer overflows. If you do neither, you get unpredictable glitches during rendering.
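
The two options can be contrasted with a toy Python model of a shared overflow pool. Everything here is hypothetical (the class, the doubling policy, the counters); it only shows the behaviour, not how any real chip manages memory.

```python
class OverflowPool:
    """Toy model of a shared pool of overflow fragment slots."""

    def __init__(self, capacity, grow_on_full=False):
        self.capacity = capacity
        self.grow_on_full = grow_on_full
        self.used = 0
        self.stalls = 0    # times rendering would stall to get more memory
        self.dropped = 0   # fragments that could not be stored (glitches)

    def allocate(self):
        """Return a slot index, or None if the fragment had to be dropped."""
        if self.used == self.capacity:
            if not self.grow_on_full:
                self.dropped += 1    # neither oversized nor growable
                return None
            # 'stall rendering -> allocate memory -> resume rendering'
            self.stalls += 1
            self.capacity = max(1, self.capacity * 2)
        slot = self.used
        self.used += 1
        return slot
```

A fixed pool sized for the worst case never drops or stalls but wastes memory on typical frames. A growable pool stays small but pays for a stall each time it fills up.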