ATI Hierarchical-Z issue with Doom 3

Mintmaster said:
For your card, we get over 25 Gpix/s for all AA settings, a few percent shy of 64 pixels per clock. That means Xmas is right. With 4xAA, NVidia can reject 256 samples per clock. I doubt they have this many Z-units, so does it mean NVidia has a form of HiZ as well? We saw a hint of this in another thread where pixels could only be rejected at a rapid rate when there is enough difference in Z value.

R420 can do 256 pixels per clock when Hi-Z is enabled, and thus 1024 samples when 4xAA is enabled, but only 32 samples per clock when HiZ is disabled, I think. It's possible that ATI's early Z can do all samples in a quad at once as well. If not, this could definitely be a big reason NV40 does so well in Quake3 even with AA.

Has anyone done benchmarks without shadows enabled? We could probably get a good idea of shading speed from that.
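As a sanity check, the quoted figures line up arithmetically. (The ~400 MHz core clock below is an assumption for illustration, not a measured value.)

```python
# Back-of-the-envelope check of the quoted rejection figures.
# The ~400 MHz core clock is an assumption for illustration.
clock_hz = 400e6
pixels_per_clock = 64                  # quoted rejection width
gpix_per_s = clock_hz * pixels_per_clock / 1e9
print(gpix_per_s)                      # 25.6 -> "over 25 Gpix/s"

samples_per_clock_4xaa = pixels_per_clock * 4
print(samples_per_clock_4xaa)          # 256 samples rejected per clock at 4xAA
```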
NVidia has a form of early Z, but no hierarchical Z AFAIK. It can reject up to 16 quads per clock.

I'm not sure about ATI's hierZ.

The big difference between the two (apart from the hierarchical buffer) is that ATI only stores one value, while NVidia uses both min and max, which, btw, is another very much Doom3-centric feature.
 
The big difference between the two (apart from the hierarchical buffer) is that ATI only stores one value, while NVidia uses both min and max, which, btw, is another very much Doom3-centric feature.

AFAIK this isn't exactly accurate, although it may have changed since I last saw low-level docs on an NVidia chip. It's true that NVidia can determine the min and max of the samples in the quad, but that's a side effect rather than by design.

It used to be that the early Z was implemented the obvious way, relying on the Z Compression (which has some of the Z buffer on chip) to save the bandwidth. It's possible that this has changed in later chips.
 
So, is the rest of HyperZ working? Z clear and Z compression? It would be interesting to compare an 8500 to a 9000. The 9000 doesn't have HierZ, unlike the 8500... see what it does in Doom3 (if you can even notice, considering the general performance of those cards here anyway).
 
Chalnoth said:
I think it does, but I believe the thread showed that nVidia's optimizations gave more performance when surfaces were closer to one another, which seems a strange result indeed.
I missed the end of that thread. Interesting.

Chalnoth said:
Anyway, it's probably not so much that they can reject 256 samples per clock, but that they can reject 16 quads per clock, such that their Z optimization either rejects a quad or doesn't, with nothing in between. I would expect that to be the only way they could keep the rejection rate nearly independent of FSAA.
I know, I was just comparing with ATI. For ATI, HiZ doesn't depend on FSAA due to its use of min/max, but early Z does, AFAIK. They can only test 2 individual samples per pipe per clock.

It's too bad ATI didn't opt for a min/max HiZ implementation for R420. Maybe they just underestimated NVidia. The X800XT PE's 133 Gpix/s Z-rejection rate is not applicable in Doom3. I'll try to remain hopeful that they can decouple the compare sense of early Z and HiZ, so that the former can be used during shading (which takes many cycles, so even 16 pix/cycle rejection is good enough) and HiZ can be used for stencil rendering by storing the max Z instead.
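For reference, the quoted rates are consistent with each other (the 520 MHz core clock below is an assumption for illustration):

```python
# Checking the quoted X800XT PE figure (520 MHz core clock assumed):
clock_hz = 520e6
hiz_pixels_per_clock = 256
print(clock_hz * hiz_pixels_per_clock / 1e9)   # ~133.1 Gpix/s

# Without HiZ, early Z is said to test 2 samples per pipe per clock:
pipes = 16
print(pipes * 2)                               # 32 samples per clock
```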
 
Uttar said:
Mintmaster: I just did that, thanks for the idea :) I'm on a 6800GT at 410/1000 btw. The tests were run at 1280x1024 60hz, V-Sync disabled, 32bpp.
www.notforidiots.com/0AA.txt
www.notforidiots.com/2AA.txt
www.notforidiots.com/4AA.txt

Key numbers for Doom 3 probably are:
0x AA: Overdraw factor 3, front to back: 1902.61 fps
2x AA: Overdraw factor 3, front to back: 1749.25 fps
4x AA: Overdraw factor 3, front to back: 1567.55 fps
Seems like that could be quite an advantage...

Uttar

Same settings

Overdraw/HSR:
-------------
Overdraw factor 3, back to front: 789.74 fps
Overdraw factor 3, front to back: 2513.51 fps
Overdraw factor 3, random order: 1315.10 fps

Overdraw factor 8, back to front: 286.83 fps
Overdraw factor 8, front to back: 2216.37 fps
Overdraw factor 8, random order: 848.26 fps

2x


Overdraw/HSR:
-------------
Overdraw factor 3, back to front: 597.02 fps
Overdraw factor 3, front to back: 1286.57 fps
Overdraw factor 3, random order: 869.00 fps

Overdraw factor 8, back to front: 251.77 fps
Overdraw factor 8, front to back: 1206.48 fps
Overdraw factor 8, random order: 633.02 fps
 
http://www.3dcenter.org/artikel/2004/07-30_english.php

Let's look at it a bit more closely: at the beginning, Doom 3 fills the Z-buffer, where the Z-test is considered successful if the value being compared is smaller than the value already in the Z-buffer. In the first pass that works out the shading, the test is "successful" if the new value is higher than the old one.

I don't understand. The z-test mode doesn't have to be changed even if z-fail shadow volume algorithm is used, right?

The Z-fail algorithm increments and decrements the stencil values when the Z-test fails, but in both passes (the 1st Z-only pass and the 2nd stencil shadow pass) the same LT (less than) Z-test mode can be used.

How could it break ATI's Hierarchical Z?
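The point of the question can be sketched per pixel. This is a minimal illustration (function and variable names are mine, not from Doom 3's source) showing that the same LESS depth test serves both stencil methods:

```python
# Minimal per-pixel sketch: z-pass and z-fail stenciling both run with
# the same LESS depth test, differing only in when they touch stencil.
def stencil_update(frag_z, buf_z, stencil, face, method):
    z_passes = frag_z < buf_z            # the same LESS test in every pass
    if method == "zpass":
        if z_passes:
            stencil += 1 if face == "front" else -1
    else:                                # "zfail" (Carmack's reverse)
        if not z_passes:
            stencil += 1 if face == "back" else -1
    return stencil

# Geometry at depth 0.5 sits inside a shadow volume with faces at 0.3/0.7:
buf_z = 0.5
for method in ("zpass", "zfail"):
    s = 0
    s = stencil_update(0.3, buf_z, s, "front", method)
    s = stencil_update(0.7, buf_z, s, "back", method)
    print(method, s)   # both methods count the pixel as shadowed (s == 1)
```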
 
It may be that ATI doesn't implement a depth fail routine directly, but switches the depth function instead (and always assumes that a depth fail means do nothing).
 
tcchiu said:
I don't understand. The z-test mode doesn't have to be changed even if z-fail shadow volume algorithm is used, right?

The Z-fail algorithm increments and decrements the stencil values when the Z-test fails, but in both passes (the 1st Z-only pass and the 2nd stencil shadow pass) the same LT (less than) Z-test mode can be used.

How could it break ATI's Hierarchical Z?
With hierarchical Z you store one Z value for a tile of pixels (maybe 8x8). That value is the Z of the one pixel in the tile that is closest to the camera. If the test passes against that pixel, it passes against all of them. If you want to do a depth fail test, then you can't use that same 'closest Z' value: even if that value passes the 'depth fail' test (iow, that pixel is closer than the new fragment you're testing against), there might be other pixels in the tile that do not pass the 'depth fail' test (iow, those pixels are further away than the new fragment you're testing against). To do a hierarchical Z depth fail test, you also need to store an additional value for each tile, the depth of the pixel furthest away. That would require a larger hierarchical Z buffer, and since ATI has that buffer on chip, they probably don't want to waste space on something so rarely used.
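The argument above can be sketched concretely (tile size and depth values are illustrative): a tile that stores only its closest depth can trivially accept a LESS test, but cannot decide a per-tile "depth fail" outcome on its own.

```python
# A tile storing only its closest (minimum) depth can trivially ACCEPT
# a LESS test, but cannot classify a "depth fail" case by itself.
tile = [0.4, 0.5, 0.6, 0.7]        # per-pixel depths in one HiZ tile
tile_min = min(tile)               # the single value stored per tile

# Trivial accept: a fragment closer than the closest pixel passes everywhere.
assert 0.3 < tile_min

# Depth-fail case: a fragment at 0.5 fails against the stored value
# (0.4 is closer), yet it does not fail against 0.6 or 0.7, so the tile
# cannot be classified from tile_min alone.
frag_z = 0.5
fails_vs_min = not (frag_z < tile_min)                  # True
fails_everywhere = all(not (frag_z < z) for z in tile)  # False
print(fails_vs_min, fails_everywhere)

# Also storing the farthest depth (a min/max pair) resolves this:
tile_max = max(tile)
assert (frag_z >= tile_max) == fails_everywhere
```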
 
To do a hierarchical Z depth fail test, you also need to store an additional value for each tile, the depth of the pixel furthest away. That would require a larger hierarchical Z buffer, and since ATI has that buffer on chip, they probably don't want to waste space on something so rarely used.

In that case, it may have been better to change the depth function and keep the stencil function on zpass.
I wonder how this will affect ATi and NV cards.

Edit: With a simple test in my 3d engine on my Radeon 9600Pro I see no difference in performance between using zfail stenciling and a zless compare or zpass and a zgreaterequal compare.
So either the hierarchical buffer works in both cases, or it works in neither.
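The equivalence behind that test can be sketched directly: "fail" under a LESS compare is exactly "pass" under GREATEREQUAL, so zfail+LESS and zpass+GE update the stencil identically given the same inc-on-back-face / dec-on-front-face assignment. (Names below are illustrative.)

```python
# zfail+LESS and zpass+GREATEREQUAL fire on complementary outcomes of
# the same comparison, so their stencil deltas always agree.
import random

def stencil_delta(frag_z, buf_z, face, mode, compare):
    passed = frag_z < buf_z if compare == "LESS" else frag_z >= buf_z
    fires = passed if mode == "zpass" else not passed
    if not fires:
        return 0
    return 1 if face == "back" else -1

random.seed(1)
for _ in range(1000):
    fz, bz = random.random(), random.random()
    face = random.choice(["front", "back"])
    assert (stencil_delta(fz, bz, face, "zfail", "LESS")
            == stencil_delta(fz, bz, face, "zpass", "GREATEREQUAL"))
print("zfail+LESS and zpass+GREATEREQUAL always agree")
```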
 
Scali said:
Edit: With a simple test in my 3d engine on my Radeon 9600Pro I see no difference in performance between using zfail stenciling and a zless compare or zpass and a zgreaterequal compare.
So either the hierarchical buffer works in both cases, or it works in neither.
I think I was somewhat wrong in my previous post. Even if you only store one value per tile, it should help no matter what Z test you're doing, and storing both the largest and smallest Z would help for normal rendering too, not only the unusual cases. Never mind...
 
ATI stores the farthest Z value. So if an incoming tile is completely behind this value, you can trivially reject it. Trivial rejection is much more useful than trivial accept. (note that this also means that if a triangle partly covers a tile marked as cleared, that tile is effectively "lost")


The Z-fail method now means that, if (the stencil test passes and) the Z test fails, you increase/decrease the stencil value. So even if hierZ knows that all pixels fail the Z test, it can't reject pixels because there's still work to be done for those pixels.
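Both points can be illustrated with made-up depths: a max-based tile value gives trivial rejection for plain rendering, but z-fail stenciling attaches work to exactly those failing pixels.

```python
# Max-based trivial reject, and why z-fail stenciling defeats it.
tile = [0.4, 0.5, 0.6, 0.7]
tile_max = max(tile)            # farthest depth: the value kept per tile

frag_z = 0.9                    # fragment behind the entire tile
all_fail = frag_z >= tile_max   # the LESS test fails for every pixel
print(all_fail)                 # True: normal rendering can skip the tile

# Under z-fail stenciling, a failed depth test still carries a side
# effect (a stencil inc/dec), so "all pixels fail" no longer means
# "nothing to do", and the tile cannot be discarded early.
stencil_updates_needed = len(tile) if all_fail else 0
print(stencil_updates_needed)   # 4
```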
 
Scali said:
To do a hierarchical Z depth fail test, you also need to store an additional value for each tile, the depth of the pixel furthest away. That would require a larger hierarchical Z buffer, and since ATI has that buffer on chip, they probably don't want to waste space on something so rarely used.

In that case, it may have been better to change the depth function and keep the stencil function on zpass.
I wonder how this will affect ATi and NV cards.

Edit: With a simple test in my 3d engine on my Radeon 9600Pro I see no difference in performance between using zfail stenciling and a zless compare or zpass and a zgreaterequal compare.
So either the hierarchical buffer works in both cases, or it works in neither.

I heard that H-Z was removed from RV350/RV360...
 