
HSR discussion (ATI vs. NVidia vs. PVR)


Mintmaster
03-May-2002, 20:08
I thought it might be nice if we had a discussion on what we thought was the best method of hidden surface removal for a graphics card. I currently know of three methods:


1. ATI - Hierarchical Z
This method uses progressively smaller Z-buffers, each one quarter or one sixteenth the size of the previous (I'll use the example of one quarter in this explanation). Each value in a smaller Z-buffer contains the minimum of the 4 corresponding pixels in the larger Z-buffer. For example, location (3,6) holds the min. value of (6,12), (6,13), (7,12), (7,13) in the larger buffer. Then, by checking (3,6) in the smaller buffer, you have a good idea of 4 Z values. Continuing the hierarchy from the top, you get one value for the min of the entire screen, then one for each quadrant, one for each sixteenth, etc. for each progressively smaller tile.
If the Z values of the vertices of a polygon are all less than the min stored in the hierarchical Z-buffer for one region, that polygon can be eliminated in that region (this takes a larger Z to be nearer the viewer; with the opposite convention, swap min and max). A polygon can also be subdivided to fit it into a tile, and then that portion can be eliminated. Another optimization is to store the max as well: if the polygon is nearer than everything in the region, the Z-buffer read can be skipped and each pixel simply written.
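A minimal sketch of the scheme described above, assuming a square power-of-two Z-buffer held as a list of rows and the same convention as the text (a polygon whose vertex Z values are all below the stored min is hidden). The names `build_z_pyramid` and `polygon_hidden` are made up for illustration and say nothing about how ATI's hardware is actually organised.

```python
def build_z_pyramid(zbuf):
    """Return [full, quarter, sixteenth, ...]; each coarser value is the
    min of the 2x2 block beneath it. Assumes a square power-of-two buffer."""
    levels = [zbuf]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        n = len(prev) // 2
        levels.append([
            [min(prev[2 * y][2 * x], prev[2 * y][2 * x + 1],
                 prev[2 * y + 1][2 * x], prev[2 * y + 1][2 * x + 1])
             for x in range(n)]
            for y in range(n)
        ])
    return levels


def polygon_hidden(levels, level, x, y, vertex_zs):
    """True if every vertex Z is below the min stored for location (x, y)
    at the given level. Z is interpolated linearly, so the largest Z over
    the polygon is always found at a vertex."""
    return max(vertex_zs) < levels[level][y][x]


zbuf = [[5, 6, 1, 1],
        [7, 8, 1, 1],
        [9, 9, 2, 3],
        [9, 9, 4, 5]]
levels = build_z_pyramid(zbuf)
hidden_top_left = polygon_hidden(levels, 1, 0, 0, [2, 3, 4])
hidden_top_right = polygon_hidden(levels, 1, 1, 0, [2, 3, 4])
```

One coarse lookup answers the question for a whole block of pixels; only when it fails does the test have to descend to a finer level.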

ADVANTAGES:
-entire polygons can be eliminated
-currently ATI claims 64 pixels removal per clock (this seems to be due to polygon splitting as I mentioned), or 16 per pipe
-no reason why it can't be expanded further

DISADVANTAGES:
-ascending and descending the hierarchy during both reads and writes to the Z-buffer requires bandwidth, so low overdraw may hinder performance


2. NVidia's method - Early Z-check
I didn't see how this worked until I read in a post that NVidia has 4 Z-checks per pipe. NVidia just has the ordinary Z-buffer, but can check larger areas at a time. With 4 Z-checks per pipe, it checks a block of 16 locations each clock beforehand, and then discards the pixels that fail. This can be done because the GF3/4 (and Radeon 8500 for that matter) have around 70 bits of bandwidth per pipe, per clock. Assuming 2:1 compression (I think NVidia's quote of 4:1 is best case, as it must be lossless) and a 32-bit Z-buffer, these 4 checks require 64 bits of bandwidth per pipe per clock, so it fits nicely.
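The bandwidth arithmetic above can be written out as a short sketch. The figures (70 bits per pipe per clock, 2:1 compression, 32-bit Z, 4 checks) are the ones quoted in the post; the function names are made up for illustration.

```python
def z_read_bits(checks, z_bits, compression_ratio):
    """Bits of Z data read per pipe per clock for the given number of checks."""
    return checks * z_bits / compression_ratio


def max_checks(budget_bits, z_bits, compression_ratio):
    """How many Z-checks fit into a per-pipe, per-clock bandwidth budget."""
    return int(budget_bits // (z_bits / compression_ratio))


needed_32 = z_read_bits(4, 32, 2.0)      # 4 checks on a 32-bit Z-buffer
fits_32 = max_checks(70, 32, 2.0)        # checks that fit in the 70-bit budget
fits_64 = max_checks(70, 64, 2.0)        # same budget with a 64-bit Z-buffer
```

Under these assumptions the 4 checks need 64 bits, which fits inside 70, while a 64-bit Z-buffer would leave room for only 2.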

ADVANTAGES:
-Simple, no worst case scenario like HiZ, just like normal Z-check but 4 times faster for hidden pixels
-The extra Z-check hardware can be used for Multisampling (although I personally don't really think Multisampling is the right way to go about FSAA)

DISADVANTAGES:
-not expandable, assuming bandwidth continues to be 70 bits per pipe, per clock, which is very likely. Therefore pixel rejection is limited to 4 per clock. If we ever move to 64-bit Z, it will be reduced further.


3. PowerVR - Deferred Rendering through Tiling
I am not sure what is done by the CPU and what is done by the graphics chip, but what Kyro does is divide each polygon so that each part fits into a tile on the screen (I think they are 32x16 in size on the Kyro). When all the polygons in a scene are given, each tile is sorted from front to back and then each tile is rendered. Because each tile has its polygons presorted, only the polygons in the front are rendered, and so very few clocks are needed beyond those required to render the visible pixels in each polygon, and nearly all hidden surfaces are removed.
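A rough sketch of the binning step, assuming the 32x16 tile size guessed at above and a conservative bounding-box test (a triangle is listed in every tile its bounding box touches). Real hardware may bin more exactly; `bin_triangles` is a hypothetical name.

```python
TILE_W, TILE_H = 32, 16  # tile size as guessed in the post


def bin_triangles(triangles, screen_w, screen_h):
    """Map (tile_x, tile_y) -> list of triangle indices whose screen-space
    bounding box overlaps that tile. Each triangle is three (x, y) points."""
    bins = {}
    last_tx = (screen_w - 1) // TILE_W
    last_ty = (screen_h - 1) // TILE_H
    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        tx0 = max(int(min(xs)) // TILE_W, 0)
        tx1 = min(int(max(xs)) // TILE_W, last_tx)
        ty0 = max(int(min(ys)) // TILE_H, 0)
        ty1 = min(int(max(ys)) // TILE_H, last_ty)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


triangles = [
    [(0, 0), (40, 0), (0, 20)],      # spans four tiles on a 64x32 screen
    [(50, 20), (60, 20), (50, 30)],  # sits inside the bottom-right tile
]
bins = bin_triangles(triangles, 64, 32)
```

Each bin is the per-tile polygon list that has to be stored before rendering can start, which is where the storage cost listed under the disadvantages comes from.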

ADVANTAGES:
-Extremely efficient HSR that is independent of render order
-Essentially infinite hidden pixel rejection capabilities

DISADVANTAGES:
-Each polygon must be split and placed in a bin beforehand, and its render information (textures, coordinates, colors, filtering info, blending info, etc) must also be stored. Even if this is done by indexing, there is a lot of information to be stored beforehand
-As polygon counts increase, the overhead for the presorting gets very high. Furthermore, the sorting and storing of the polygons are likely to take quite a bit of time, and the vertex memory bandwidth and space requirements will get very high.



My thoughts:
I think deferred rendering is not going to be very practical in the future. Increasing polygon counts make the disadvantages outweigh the advantages for this style of rendering.
NVidia's method seems fairly elegant in that it is simple and works with the ordinary Z-buffer. The only problem is that this simple solution can only go so far, and is a bit short-sighted.
ATI seems to have the best of both worlds, offering very good overdraw removal that could get even better if they wanted.

The question is how much silicon does each of these methods take up? It's hard to judge, but it seems like NVidia's would take the least, and their method seems to be quite sufficient for the overdraw we are seeing right now. With an overdraw of about 3 or 4, I could see NVidia's method being very good, ATI's method having the possibility of being better, though not by a whole lot, and PowerVR's method running into limitations with more polygons. For these reasons I think ATI's method is best, NVidia's next, and finally PVR in last. Looking at benchmarks, it seems there may be limitations with ATI's current implementation, but I'm not certain. It has much more room for improvement than NVidia's, however.

On the other hand, it may be that with very high polygon counts and complex vertex shaders the value of HSR will be diminished, as rendering time will be very small compared to vertex time. However, with increasingly complex pixel shaders in DX9 and if using many textures, as is likely for realistic 3D graphics, eliminating pixels could be very valuable indeed.

Your thoughts and expertise are very welcome.

arjan de lumens
03-May-2002, 22:22
A few comments:

AFAIK, the ATI hierarchical Z method only uses 2 levels of Z buffers, one with the full framebuffer resolution and another one at 1/16 the resolution (in the Radeon 8500, so that a 1024x768 framebuffer gets a 1024x768 level 1 z-buffer and a 256x192 level 2 z-buffer). Hierarchical Z can be rather effective for scenes with high overdraw and front-to-back ordering; Radeons have traditionally beaten Geforces at benchmarks like Villagemark (which has intentionally large overdraw). Hierarchical Z tends to just get in the way when a scene is rendered in back-to-front order, though.

Future implementations may have a full pyramid of Z-buffer levels. Since the higher levels would be rather small, it would cost little to cache them on-chip, thereby avoiding most of the bandwidth cost of ascending/descending the full pyramid. With the pyramid in place, it would be entirely possible to Z-reject polygons even prior to triangle setup.
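To put a number on "rather small", here is a sketch that lists the level sizes of a full pyramid for the 1024x768 example, assuming each level is 1/16 the size of the previous one (a factor of 4 per axis, rounded up). The function name is made up for illustration.

```python
def pyramid_levels(width, height, factor=4):
    """Return the (width, height) of each pyramid level, from the full
    resolution buffer down to a single entry. Sizes are rounded up."""
    levels = [(width, height)]
    while levels[-1] != (1, 1):
        w, h = levels[-1]
        levels.append((-(-w // factor), -(-h // factor)))  # ceiling division
    return levels


levels = pyramid_levels(1024, 768)
# Entries in everything above the 256x192 level that exists today.
extra_entries = sum(w * h for w, h in levels[2:])
```

Under these assumptions the levels above 256x192 add up to only a few thousand entries, which is why caching them on-chip looks cheap.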

Both ATI and Nvidia currently use simple Z-buffer compression as well, since Radeon256 and Geforce3, typically reaching about 2:1 compression most of the time. With Multisampling becoming common, there will be even more opportunities for compression as well - about 5:1 should be attainable at 4xMSAA.

AFAIK, Kyro-class designs do not do any per-pixel sorting directly; rather, what they do is something like this:

First, for each polygon belonging to the tile, do Z (actually W) test of every polygon, for each pixel building a list of polygons affecting the final color of the pixel. (This list follows API order in Kyro chips; some other PowerVR designs, like the one in Dreamcast will depth sort this list as it is built)
Then actually render (apply color/texture/etc) to each pixel.

In this architecture, the W testing can, at very little hardware cost, be massively parallelized; Kyro has 32 W-test units. With a big enough tile size, there is no reason why this number could not be increased almost indefinitely.
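The two phases described above can be sketched for a single tile. This is a heavy simplification: polygons are opaque axis-aligned rectangles with one constant depth each (smaller is nearer), so the per-pixel list collapses to a single visible polygon. None of this reflects actual Kyro data structures; the names are hypothetical.

```python
def resolve_tile(polys, tile_w, tile_h):
    """Phase 1: depth-test every polygon of the tile, in submission order,
    and record which polygon ends up visible at each pixel.
    Each polygon is (x0, y0, x1, y1, depth) with exclusive x1/y1."""
    depth = [[float("inf")] * tile_w for _ in range(tile_h)]
    owner = [[None] * tile_w for _ in range(tile_h)]
    for pid, (x0, y0, x1, y1, z) in enumerate(polys):
        for y in range(max(y0, 0), min(y1, tile_h)):
            for x in range(max(x0, 0), min(x1, tile_w)):
                if z < depth[y][x]:
                    depth[y][x] = z
                    owner[y][x] = pid
    return owner


def shade_tile(owner, shade):
    """Phase 2: run the (expensive) shading function once per covered pixel."""
    return [[shade(pid) if pid is not None else None for pid in row]
            for row in owner]


polys = [(0, 0, 4, 4, 0.9), (0, 0, 2, 4, 0.5), (1, 0, 3, 4, 0.2)]
owner = resolve_tile(polys, 4, 4)
shade_calls = []
colors = shade_tile(owner, lambda pid: shade_calls.append(pid) or pid)
```

The three polygons cover 32 fragments between them, but shading runs only once per pixel of the tile. The inner depth comparisons in phase 1 are independent per pixel, which is the part that parallelizes so cheaply.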

The problems of high polygon counts (>70 MPolys/sec as produced by e.g. a HOS engine, like ATI's TruForm) still loom as a spectre over this architecture, though, and will do so until someone figures out a clean way to bin pre-tessellation HOSes.

Dave Baumann
03-May-2002, 22:28
1. ATI - Hierarchical Z
This method uses progressively smaller Z-buffers, each one one quarter or one sixteenth the size of the previous (I'll use the example of one quarter in this explanation).

Hrm...

ATI - Hierarchical Z (http://216.12.218.25/domain/www.beyond3d.com/reviews/ati/radeon8500p2/index4.php)

Future implementations may have a full pyramid of Z-buffer levels.

Another one that can be introduced is a reverse Hierarchy; i.e. as well as having buffers that store the highest values have one that stores the lowest value - this can save multiple Z read-check-write cycles on the full frame Z-buffer just reducing it to a write.
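A small sketch of what keeping both extremes per tile buys, assuming smaller Z is nearer (swap the comparisons for the opposite convention) and assuming the polygon covers the whole tile. `classify` and its return labels are made up for illustration.

```python
def classify(poly_zmin, poly_zmax, tile_zmin, tile_zmax):
    """Decide what to do with a polygon's fragments in one tile, given the
    polygon's Z range over the tile and the tile's stored nearest/farthest Z."""
    if poly_zmin > tile_zmax:
        return "reject"   # behind everything already drawn: no Z traffic at all
    if poly_zmax < tile_zmin:
        return "accept"   # in front of everything: write Z without reading it
    return "test"         # overlapping ranges: normal per-pixel read-check-write


behind = classify(0.8, 0.9, 0.2, 0.5)
in_front = classify(0.05, 0.1, 0.2, 0.5)
overlapping = classify(0.3, 0.6, 0.2, 0.5)
```

Only the last case needs the full read-check-write cycle on the full-resolution Z-buffer.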

Basic
03-May-2002, 23:59
I believe that MS only affects the color buffer and not the z-buffer in all MS-capable cards today. If z was treated in the same way as color, you wouldn't detect polygon intersections. So I don't see any reason why it would get a better compression ratio.

But if you don't mind if poly intersections aren't antialiased, you could do a "S3TC-inspired" framebuffer compression. 3x32bit colors + 16x2bit indices = 128bit. Then add 3x(32bit Z + 8bit stencil) = 120bit ~= 128bit, and use the same indices. That would give you 16xMS with a maximum of 3 fragments per pixel, a kind of primitive Z3. And a compression of 4 compared to a straight implementation. (You could get max 4 fragments per pixel with a 24bit z-buffer at a cost of 32bit more per pixel.)
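The bit counts above can be checked with a short sketch. The per-fragment sizes come from the post; the "straight implementation" is assumed here to cost 64 bits per sample (32-bit color plus 32 bits of Z/stencil), which is the assumption that reproduces the quoted factor of 4. Function names are made up for illustration.

```python
SAMPLES = 16      # 16x multisampling
INDEX_BITS = 2    # one 2-bit fragment index per sample


def block_bits(fragments, color_bits=32, z_bits=32, stencil_bits=8):
    """Return (color block bits, depth/stencil block bits) per pixel."""
    color = fragments * color_bits + SAMPLES * INDEX_BITS
    depth = fragments * (z_bits + stencil_bits)
    return color, depth


def compression_ratio(straight_bits_per_sample=64, compressed_bits=256):
    """Uncompressed bits per pixel divided by two padded 128-bit blocks."""
    return SAMPLES * straight_bits_per_sample / compressed_bits


three_fragments = block_bits(3)             # the main proposal
four_fragments = block_bits(4, z_bits=24)   # the 24-bit Z variant
ratio = compression_ratio()
```

With 4 fragments and 24-bit Z the depth block is exactly 128 bits and the color block grows by 32 bits, matching the parenthetical remark.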

Tagrineth
04-May-2002, 01:39
In terms of transistor cost, I'd say PowerVR has it hands down.

Compare Kyro II to VSA-100.

K2 has 15M transistors, with two pixels per clock and one TCU per pipe. It supports 2k*2k textures, 32-bit colour even in 16-bit mode <g>, DOT3, EMBM, et al.

VSA-100 has what, 18M transistors? with two PPC, one TCU per pipe, and so on - however, VSA-100 has fewer extra features: No DOT3, no EMBM, no 3D Textures...

In other words, TBR requires fewer transistors than IMR :)

Oh, by the way, PowerVR does it via 'raytracing'. Once the triangles are sorted, the core sends a 'ray' through the sample points, detecting solid and transparent pixels to determine occlusion. That means it gets free per-pixel transparency accuracy and per-pixel global effects (i.e. fog). :D

arjan de lumens
04-May-2002, 02:34
In terms of transistor cost, I'd say PowerVR has it hands down.

Compare Kyro II to VSA-100.

K2 has 15M transistors, with two pixels per clock and one TCU per pipe. It supports 2k*2k textures, 32-bit colour even in 16-bit mode <g>, DOT3, EMBM, et al.

VSA-100 has what, 18M transistors? with two PPC, one TCU per pipe, and so on - however, VSA-100 has fewer extra features: No DOT3, no EMBM, no 3D Textures...

In other words, TBR requires fewer transistors than IMR :)

IIRC, the VSA100 has 14M transistors and T-buffering. I don't know why 3dfx didn't add EMBM and DOT3 to that design, given that they are both features that are very cheap and easy to implement. And VSA100 seems rather bloated to me for the given feature set. Compared to, say, the Geforce2, with T&L and 4 times as many texture units weighing in at "only" 25 M transistors.

SA
04-May-2002, 03:21
About hierarchical z buffering. It provides efficient z checking and hidden surface removal under most circumstances. However, there are some cases where it does not work well.

Problems

There are 3 major problems with hierarchical z buffering based z checking:

1. The parallel plane problem.
2. The pin hole problem.
3. The z ordering problem.

The parallel plane problem refers to multiple parallel planes in close proximity to each other and at a slant angle to the viewer. Examples of this are a deck of cards, the pages of a book, some sheets of paper on a desk, paintings hung on a wall, etc. The close proximity of the planes and their slant means the znear of the hidden planes is closer than the zfar of the visible plane down to a few pixels. Thus the z check fails at the higher levels of the hierarchy even though the triangles are processed front to back and the further triangles are all hidden.

The pin hole problem refers to small random holes in the scene preventing larger areas from filling up. This is a common problem in natural scenes such as trees and forests. This problem is worse than might be expected since even for holes that are eventually filled by distant geometry, all the geometry that occurs before the hole is filled fails the z check at the higher levels of the hierarchy. Note that like the problem above, this problem happens even though the geometry is processed front to back.

The z ordering problem is much more apparent. Most scenes are not ordered front to back since sorting by render state is usually more important.
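The pin hole problem in particular is easy to reproduce in a toy model. The sketch assumes smaller z is nearer and a coarse level that stores the farthest z of each tile; the names are hypothetical.

```python
def coarse_far(tile):
    """Farthest z in the tile: the value a coarse hierarchy level would hold."""
    return max(max(row) for row in tile)


def coarse_rejects(tile, poly_near_z):
    """The coarse test can discard a polygon only if it lies behind the
    farthest z stored for the tile."""
    return poly_near_z > coarse_far(tile)


def visible_pixels(tile, poly_z):
    """How many pixels of a tile-covering polygon at depth poly_z would
    actually pass the per-pixel test."""
    return sum(poly_z < z for row in tile for z in row)


solid = [[0.2] * 8 for _ in range(8)]   # tile fully covered by near geometry
holed = [row[:] for row in solid]
holed[3][5] = 1.0                       # one pixel still at the far plane

rejected_solid = coarse_rejects(solid, 0.5)
rejected_holed = coarse_rejects(holed, 0.5)
leak = visible_pixels(holed, 0.5)
```

A single open pixel out of 64 keeps the coarse value at the far plane, so every polygon behind the near surface has to fall through to the finer level.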


Solutions for Hierarchical Z Buffers

1. The parallel plane problem can be solved by z checking multiple cells/pixels (say 4 or more) in parallel per pipe for each level of the z hierarchy. That is, the solution to the first problem is to use a combination of hierarchical z and parallel z checks (a combination of ATI's and NVidia's approaches).

2. The pin hole problem is a much more general problem that all occlusion culling algorithms face. One solution is to use something like the HOM algorithm that ignores small holes and treats them as occluded. However, this causes artifacts in the scene and I don't recommend it. The best solution is to use faster z checking hardware with more parallel z checking to solve the problem. So the best solution here is the same as the solution to 1.

3. The z ordering problem can be solved by performing application driven deferred rendering. The geometry is processed once with all shading turned off, z writing turned on, and no lighting done in the vertex shader. The second time it is processed with shading turned on, z writing turned off (but z checking turned on in both cases), and lighting processed. The first pass only processes z's, which are compressed and so it consumes little bandwidth. Since the render states do not matter on the first pass the geometry can be processed strictly front to back. On the second pass, the z buffer is fully set and only visible surfaces are pixel shaded and rendered and the geometry is ordered by render state. This creates a mechanism for deferred shading using immediate mode rendering.
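Solution 3 can be sketched on a one-row software framebuffer. The sketch assumes smaller z is nearer, fragments given as (x, z, material) tuples, and a second-pass test that lets fragments with equal z through; all names are made up for illustration.

```python
def depth_prepass(fragments, width):
    """Pass 1: z test and z write on, no shading, drawn front to back."""
    zbuf = [float("inf")] * width
    for x, z, _ in sorted(fragments, key=lambda f: f[1]):
        if z < zbuf[x]:
            zbuf[x] = z
    return zbuf


def shading_pass(fragments, zbuf):
    """Pass 2: z test on, z write off, drawn in render-state order.
    Only fragments matching the stored depth get shaded."""
    shaded = []
    for x, z, material in sorted(fragments, key=lambda f: f[2]):
        if z <= zbuf[x]:
            shaded.append((x, material))
    return shaded


def single_pass_shaded(fragments, width):
    """Ordinary one-pass rendering in submission order, for comparison."""
    zbuf = [float("inf")] * width
    count = 0
    for x, z, _ in fragments:
        if z < zbuf[x]:
            zbuf[x] = z
            count += 1
    return count


# Worst case for one-pass rendering: submitted back to front.
fragments = [(0, 0.9, "a"), (0, 0.5, "b"), (0, 0.1, "a"),
             (1, 0.7, "b"), (1, 0.3, "a")]
zbuf = depth_prepass(fragments, 2)
shaded = shading_pass(fragments, zbuf)
naive = single_pass_shaded(fragments, 2)
```

In this example the two-pass version shades one fragment per pixel, while the single pass shades all five.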


Tilers

Deferred rendering tilers are also two pass mechanisms similar to solution 3 above. The major differences from solution 3 are that they perform the two passes in the driver without any application intervention; that the first pass is essentially a very low resolution scan keeping the geometry in the (tile) buffer, instead of a high res scan that keeps z values in the (z) buffer; and that the second pass does not need to transform the geometry. On the first pass, solution 3 must store the first z (compressed and also in the hierarchy) while a tiler must store the triangle in the tile buffer. As triangle rates increase (and thereby scene complexity) the memory bandwidth tilts more in favor of solution 3 above; however, solution 3 requires more vertex shader performance.

The other major occlusion problem that all the above HSR methods face is eliminating T&L processing for hidden triangles. This is best accomplished by using a hierarchical visibility query of the bounding boxes in the scene graph in the first pass (before the lighting is done) and culling the geometry in those bounding boxes that are hidden. This method can be used with either a tiler or a hierarchical z buffer/fast z check.
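A sketch of such a hierarchical visibility query, assuming the z-buffer from the first pass is available as a list of rows (smaller z is nearer) and each scene graph node carries a screen-space bounding rectangle plus its nearest z. The node layout and function names are hypothetical.

```python
def box_visible(zbuf, box):
    """True if any pixel of the box's screen rectangle could still be seen,
    i.e. the box's nearest z is in front of the stored depth somewhere.
    box is (x0, y0, x1, y1, near_z) with exclusive x1/y1."""
    x0, y0, x1, y1, near_z = box
    return any(near_z < zbuf[y][x]
               for y in range(y0, y1) for x in range(x0, x1))


def collect_visible(node, zbuf, out=None):
    """Walk the scene graph top down, skipping whole subtrees whose bounding
    box is hidden; return the objects that still need T&L."""
    if out is None:
        out = []
    if not box_visible(zbuf, node["box"]):
        return out
    out.extend(node.get("objects", []))
    for child in node.get("children", []):
        collect_visible(child, zbuf, out)
    return out


# Left half of a 4x4 screen is covered by a near occluder, right half is empty.
zbuf = [[0.2, 0.2, 1.0, 1.0] for _ in range(4)]
scene = {
    "box": (0, 0, 4, 4, 0.5),
    "children": [
        {"box": (0, 0, 2, 4, 0.5), "objects": ["A"]},
        {"box": (2, 0, 4, 4, 0.5), "objects": ["B"]},
    ],
}
to_transform = collect_visible(scene, zbuf)
```

Object A is culled before any lighting is done for it; in hardware the per-box test would be the visibility query itself rather than a loop over pixels.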

3dcgi
04-May-2002, 04:04
I believe that MS only affect the color buffer and not the z-buffer in all MS-capable cards today. If z was treated in the same way as color, you wouldn't detect polygon intersections. So I don't see any reason why it would get a better compression ratio.


I think you've got this backwards. You need multiple z values per pixel to determine polygon intersections.

3. The z ordering problem can be solved by performing application driven deferred rendering. The geometry is processed once with all shading turned off, z writing turned on, and no lighting done in the vertex shader. The second time it is processed with shading turned on, z writing turned off (but z checking turned on in both cases), and lighting processed. The first pass only processes z's, which are compressed and so it consumes little bandwidth. Since the render states do not matter on the first pass the geometry can be processed strictly front to back. On the second pass, the z buffer is fully set and only visible surfaces are pixel shaded and rendered and the geometry is ordered by render state. This creates a mechanism for deferred shading using immediate mode rendering.

I agree with your logic that this could improve performance with chips that perform HSR, however I want to draw attention to this being an application solution. The hardware doesn't need to do anything other than efficient HSR.

Ailuros
04-May-2002, 04:30
In terms of transistor cost, I'd say PowerVR has it hands down.

Compare Kyro II to VSA-100.

K2 has 15M transistors, with two pixels per clock and one TCU per pipe. It supports 2k*2k textures, 32-bit colour even in 16-bit mode <g>, DOT3, EMBM, et al.

VSA-100 has what, 18M transistors? with two PPC, one TCU per pipe, and so on - however, VSA-100 has fewer extra features: No DOT3, no EMBM, no 3D Textures...

In other words, TBR requires fewer transistors than IMR :)

IIRC, the VSA100 has 14M transistors and T-buffering. I don't know why 3dfx didn't add EMBM and DOT3 to that design, given that they are both features that are very cheap and easy to implement. And VSA100 seems rather bloated to me for the given feature set. Compared to, say, the Geforce2, with T&L and 4 times as many texture units weighing in at "only" 25 M transistors.

True for the v4 4k5 only. Once chips are parallelized in SLI (v5 5k5), transistor count doubles immediately (2x14M/chip). Apart from the obvious quality differences between OGSS and RGSS, single chip VSA-100 was capable of only 2x sample AA (2 samples per chip).

Kristof
04-May-2002, 08:22
PowerVR's design is not all that big, just remember MBX... it's the full PowerVR system and is very very small...

And I guess at some time we'll just have to prove that storage of scene information is not that "big" a deal ;)

Basic
04-May-2002, 10:06
I believe that MS only affect the color buffer and not the z-buffer in all MS-capable cards today. If z was treated in the same way as color, you wouldn't detect polygon intersections. So I don't see any reason why it would get a better compression ratio.


I think you've got this backwards. You need multiple z values per pixel to determine polygon intersections.


Agreed that it was unclear. What I meant was that with MS you treat the z-buffer the same way as with SS. Thus, MS won't get better z-buffer compression ratio than SS.

Dave B(TotalVR)
04-May-2002, 12:12
And I guess at some time we'll just have to prove that storage of scene information is not that "big" a deal ;)

*COUGH* Like coming out with a new card that proves it.... :lol:

3dcgi
05-May-2002, 21:32
Agreed that it was unclear. What I meant was that with MS you treat the z-buffer the same way as with SS. Thus, MS won't get better z-buffer compression ratio than SS.

Ahh. That I agree with.