Xenos Z-only pass when tiling

assen

Veteran
In the article on Xenos it is stated that the expected way to perform tiling on Xenos is to do a Z-only pass first, during which to determine the screen-space extents of each command buffer entry, then use them to guide the predicated rendering during the next passes. However, a 1280x720 4xMSAA 32-bit Z-buffer still occupies 14 MB, and still doesn't fit in the eDRAM.

How come?
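
For reference, the arithmetic behind the 14 MB figure can be checked with a few lines of Python. This is only a sketch: the 10 MB eDRAM capacity is the commonly quoted figure for Xenos rather than something stated in this post, and `z_buffer_bytes` is a made-up helper name.

```python
def z_buffer_bytes(width, height, samples, bytes_per_sample):
    """Size of a depth buffer that stores one value per MSAA sample."""
    return width * height * samples * bytes_per_sample

EDRAM_BYTES = 10 * 1024 * 1024  # assumption: 10 MB of eDRAM on Xenos

# 1280x720, 4 samples per pixel, 32 bits (4 bytes) per sample
size = z_buffer_bytes(1280, 720, samples=4, bytes_per_sample=4)
print(f"{size} bytes = {size / 2**20:.2f} MiB")
print("fits in eDRAM:", size <= EDRAM_BYTES)
```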
 
32-bit Z-buffer??

Anyway, the primary reason is to populate the HiZ and reject many, many pixels very fast right after triangle setup.
 
Well, yes, PC cards typically use a 24-bit Z-buffer with 32 bits per pixel (or MSAA sample, in this case), with the additional 8 bits used as stencil.

16-bit Z is not trivial to use without artefacts, at least in games with both near and far objects.

I understand what the idea of a Z-only pass is; what I don't get is how it is possible at all. Where do you fit the rendered Z-buffer? Do you mean that only the "coarse" or "hierarchical" representation of the Z-buffer is preserved/built at all? This will be useful to some degree, of course, although it won't achieve the desired "only one fully shaded fragment per pixel" property of the full-resolution Z-only pass.
 
IIRC 16-bit Z is still most commonly used.

The savings from using fast rejects from HiZ/ZCULL pre-population are high.

Bear in mind that a single-pixel Z reject is a bandwidth-saving-only operation: this test is done right at the end of the pipeline, at the pixel output stage, so the pixel shader operations on the pixel have already been carried out. The coarser HiZ/ZCULL routines are done on-chip, so they can occur prior to shading. On Xenos, saving per-pixel bandwidth isn't that much of an issue.
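
To make the difference concrete, here is a minimal sketch of a coarse (hierarchical) Z reject in Python. It is not how Xenos implements HiZ. It assumes a "less-than" depth test, one stored maximum depth per 8x8 block, and made-up names (`draw`, `hiz_max`, `PIXELS_PER_BLOCK`).

```python
PIXELS_PER_BLOCK = 64  # assumption: one coarse entry per 8x8 pixel block

def draw(hiz_max, block, prim_min_z, prim_max_z, covers_block):
    """Return an upper bound on the pixels sent to the pixel shader."""
    if prim_min_z >= hiz_max[block]:
        return 0  # coarse reject: nothing in this block can pass the Z test
    if covers_block:
        # The block is fully covered, so nothing farther than prim_max_z
        # can remain visible in it; tighten the stored bound.
        hiz_max[block] = min(hiz_max[block], prim_max_z)
    return PIXELS_PER_BLOCK  # shaded first, per-pixel Z test comes afterwards

# Near wall first, then a far object: the far object is never shaded.
hiz = {0: 1.0}
front_to_back = draw(hiz, 0, 0.2, 0.3, True) + draw(hiz, 0, 0.6, 0.7, True)

# Far object first, then the near wall: both are shaded (overdraw).
hiz = {0: 1.0}
back_to_front = draw(hiz, 0, 0.6, 0.7, True) + draw(hiz, 0, 0.2, 0.3, True)

print(front_to_back, back_to_front)
```

With the coarse buffer already populated by the near surface, the far object costs no shader work at all, which is the saving a Z-only pre-pass is after.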
 
I presume that Xenos does the following:
  1. 1280x720 0xAA extents-pass: draw triangles for the entire scene to determine which tile each triangle occupies and thus mark-up the command buffer for each triangle to specify which tile(s) the triangle occupies. This pass also fills the hierarchical-Z buffer on the parent die (which is retained throughout all succeeding tiled passes), which will reduce over-draw.
  2. Tile-1 z-only pre-pass: writes full-resolution 4xAA Z-only data into the backbuffer (the colour part of the backbuffer is untouched), using the triangles that have been marked-up for tile 1.
  3. Tile-1 rendering: The actual rendering pass, with colour and Z-writing. The only remaining overdraw is for portions of triangles that need anti-aliasing and those triangles that are alpha-blended.
Steps 2 and 3 are then repeated for each successive tile.

Step 2 isn't actually needed - but it's a good way of speeding up rendering (by reducing overdraw) whether you're rendering on Xenos or another GPU.
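
The mark-up in step 1 can be sketched as follows. This is a guess at the idea, not the actual Xenos mechanism: the tiles are assumed to be three horizontal 1280x240 strips, the command list and its extents are invented, and `tiles_touched` is a hypothetical helper.

```python
def tiles_touched(extent, tile_height, num_tiles):
    """Indices of the horizontal strips that a draw's y-extent overlaps."""
    y_min, y_max = extent  # inclusive pixel rows found in the extents pass
    first = max(0, y_min // tile_height)
    last = min(num_tiles - 1, y_max // tile_height)
    return list(range(first, last + 1))

TILE_HEIGHT, NUM_TILES = 240, 3  # assumption: three 1280x240 strips

# Hypothetical command buffer entries: (name, (y_min, y_max))
commands = [("sky", (0, 200)), ("character", (220, 500)), ("ground", (400, 719))]

# Extents pass: mark each entry with the tiles it touches.
marked = {name: tiles_touched(ext, TILE_HEIGHT, NUM_TILES) for name, ext in commands}

# Tiled passes: each tile only replays the entries marked for it.
for tile in range(NUM_TILES):
    drawn = [name for name, tiles in marked.items() if tile in tiles]
    print(f"tile {tile}: {drawn}")
```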

Jawed
 
Jawed, I agree with your description minus step 2. I don't see how that could benefit Xenos unless HiZ information isn't preserved when switching from 0xAA to 4xAA, which I doubt. At that final Z test stage the ROPs have 256GB/sec of bandwidth to eDRAM, so an extra pass would seem more expensive than the extra writes.
 
Yeah, I think you're right. I was forgetting that pixels can only be culled by hierarchical-Z and the precision of Zs in the hierarchical-Z buffer can't be improved by a z-only prepass with MSAA turned on (step 2).

Jawed
 
Dave, your comment is confusing...especially when your article uses 32-bit Z-buffer in the example calculation! :???: :?:
 
Dave Baumann said:
IIRC 16-bit Z is still most commonly used.
Wow. That's the first I've ever heard that. I think most use a split 24/8 Z/stencil buffer on the PC. Not sure about consoles.
 
Jawed said:
I presume that Xenos does the following:
  1. 1280x720 0xAA extents-pass: draw triangles for the entire scene to determine which tile each triangle occupies and thus mark-up the command buffer for each triangle to specify which tile(s) the triangle occupies. This pass also fills the hierarchical-Z buffer on the parent die (which is retained throughout all succeeding tiled passes), which will reduce over-draw.
  2. Tile-1 z-only pre-pass: writes full-resolution 4xAA Z-only data into the backbuffer (the colour part of the backbuffer is untouched), using the triangles that have been marked-up for tile 1.
  3. Tile-1 rendering: The actual rendering pass, with colour and Z-writing. The only remaining overdraw is for portions of triangles that need anti-aliasing and those triangles that are alpha-blended.
Steps 2 and 3 are then repeated for each successive tile.

Step 2 isn't actually needed - but it's a good way of speeding up rendering (by reducing overdraw) whether you're rendering on Xenos or another GPU.

Jawed
I'm pretty sure step 1 requires 1280x720 with 4x AA (if you plan to render the final scene with 4x AA) and step 2 isn't necessary. When building the Hi-Z buffer you need sub-samples because it is possible for a triangle to contribute to a tile with AA enabled yet it doesn't contribute to the tile without AA.
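
A small Python sketch of why the sub-samples matter: a sliver of a triangle can miss the pixel centre yet still hit one of the 4x sample points. The sample positions below are an assumed rotated-grid pattern, not the real Xenos pattern, and the triangle is made up for the example.

```python
def edge(a, b, p):
    """Signed area term: which side of edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def inside(tri, p):
    """Point-in-triangle test that accepts either winding order."""
    a, b, c = tri
    e = (edge(a, b, p), edge(b, c, p), edge(c, a, p))
    return all(v >= 0 for v in e) or all(v <= 0 for v in e)

CENTRE = [(0.5, 0.5)]  # the single sample used without AA
# Assumption: a rotated-grid 4x pattern, in pixel-relative coordinates.
SAMPLES_4X = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

# A small triangle tucked into one corner of the pixel.
tri = [(0.8, 0.3), (0.95, 0.3), (0.875, 0.45)]

covered_no_aa = any(inside(tri, s) for s in CENTRE)
covered_4x = any(inside(tri, s) for s in SAMPLES_4X)
print("touches pixel without AA:", covered_no_aa)
print("touches pixel with 4x AA:", covered_4x)
```

If that pixel sits on a tile boundary, an extents pass run without AA would fail to mark the triangle for that tile.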
 
OK, so we're back to square one :)

We (mostly) agreed that the Z-buffer is likely to be 32-bit, we (mostly) agreed that the Z-buffer needs to be full-resolution (i.e. 1280x720x4) - so how do you fit the Z-only pass Z-buffer in the eDRAM?

I can see how Jawed's scheme would be possible, if the rendertarget setup API on the Xenos allows the users to preserve the HiZ on-die buffers between passes.
 
Alstrong said:
Dave, your comment is confusing...especially when your article uses 32-bit Z-buffer in the example calculation! :???: :?:
Mmmm, I guess I misremembered. I remember doing the calcs with one depth and got feedback to do it the other.

Jawed said:
Yeah, I think you're right. I was forgetting that pixels can only be culled by hierarchical-Z and the precision of Zs in the hierarchical-Z buffer can't be improved by a z-only prepass with MSAA turned on (step 2).
If you think back to the PC figures, you'll see that the number of rejected pixels increases with the level of AA. I don't think the HierZ is changing (when it's at its max resolution) with AA; it's just getting coarser. I think the 1280x1024 @ 2x FSAA figure in the MS presentation represents the lowest level of detail that can be stored in the HiZ, but when 4x is used the HiZ is effectively just coarser.

assen said:
We (mostly) agreed that the Z-buffer is likely to be 32-bit, we (mostly) agreed that the Z-buffer needs to be full-resolution (i.e. 1280x720x4) - so how do you fit the Z-only pass Z-buffer in the eDRAM?
Again, I'm not at all sure that the pixel-level Z-buffer is a concern; the primary function is to reduce shader overhead, which pixel-level Z rejects don't do. Pixel-level rejects are there for bandwidth saving, and that's not an issue with Xenos.
 
assen said:
In the article on Xenos it is stated that the expected way to perform tiling on Xenos is to do a Z-only pass first, during which to determine the screen-space extents of each command buffer entry, then use them to guide the predicated rendering during the next passes. However, a 1280x720 4xMSAA 32-bit Z-buffer still occupies 14 MB, and still doesn't fit in the eDRAM.

How come?

The Z-only pass during tiling fills not only the Z-buffer but also the Hi-Z buffer, which is used for early rejection of fragments before the fragment shader is run. The buffer reserved for Hi-Z is not big enough to hold all the information at 720p 4x, so a Z-only pass for each tile is required. At 2xMSAA, the Hi-Z buffer is big enough to hold all the information, so only one Z-only pass is required for all tiles. When this mode is activated, a single DIP is predicated against the tile and is also predicated against occluders: basically you get hardware occlusion culling "for free", in the sense that you don't write code for it, which is nice and simple, especially if you can layer your scene to render big occluders first (the landscape, for example).

Fran/Fable2
 
I'm actually quite confused about that feature, the eDRAM. Let's take a crackpot-ish idea and assume developer X makes an Xbox 360 game aimed at 1080i, 32 bits, and 4xMSAA, HDR and AF, and tries to get as much framerate as possible: does the size of the eDRAM matter?
 
kimg said:
I'm actually quite confused about that feature, the eDRAM. Let's take a crackpot-ish idea and assume developer X makes an Xbox 360 game aimed at 1080i, 32 bits, and 4xMSAA, HDR and AF, and tries to get as much framerate as possible: does the size of the eDRAM matter?

Of course the size of the eDRAM would matter.
 
Because the screen can be rendered in tiles, the size of the eDRAM doesn't prohibit full-feature rendering. It just means the more tiles you need, the more overhead. Which is the same for any hardware: the more features you use, the more demands are placed on the system.
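
As a rough illustration of how the tile count grows with the feature set, here is a back-of-the-envelope sketch in Python. It assumes 10 MB of eDRAM, 32-bit colour plus 32-bit Z/stencil per sample, and a full 1920x1080 frame for the 1080i case. It gives a lower bound only, since real tile dimensions are subject to hardware alignment rules that are not modelled here; `min_tiles` is a made-up helper.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # assumption: 10 MB of eDRAM

def min_tiles(width, height, samples, colour_bytes=4, depth_bytes=4):
    """Lower bound on the number of tiles needed to cover the frame."""
    total = width * height * samples * (colour_bytes + depth_bytes)
    return math.ceil(total / EDRAM_BYTES)

for label, args in [
    ("720p, no AA", (1280, 720, 1)),
    ("720p, 2xMSAA", (1280, 720, 2)),
    ("720p, 4xMSAA", (1280, 720, 4)),
    ("1080, 4xMSAA", (1920, 1080, 4)),
]:
    print(f"{label}: at least {min_tiles(*args)} tile(s)")
```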
 