I don't want to turn this into a big back-and-forth argument, so I'm going to try to clarify things.
Crusher said:
Mintmaster said:
Crusher, what Basic means by "hidden pixels" is "z-fail". If they fail, they are in effect hidden. He's just talking about how ATI's Hierarchical Z doesn't work when drawing the stencil volumes because of this. Trust me, Basic is a smart, knowledgeable guy that doesn't easily get "confused".
That's not what it sounds like he is talking about to me. Nor does your description of what Hierarchical Z does sound correct. You say HZ doesn't work when drawing stencil volumes because it throws away hidden pixels, and you claim the hidden pixels are the parts of the volume face that fail the z-test (i.e. is this pixel in front of or behind the pixel stored in the same location in the z-buffer?). Throwing away pixels that fail the z-test is PERFECTLY FINE. It should never be keeping those pixels anyway, since you explicitly disable z-buffer writes before you render the volume. All you care about is the result of the test--the fact that it did fail--so that you can alter the entry in the stencil buffer accordingly. As long as the driver accurately reports the result of the depth test, everything should be peachy. And I don't see any possible reason why HZ would affect the depth tests themselves.
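To make sure we're talking about the same thing, here's a toy sketch (my own illustration, not any real driver or hardware code) of the per-pixel behavior being described: z-writes are disabled, the depth buffer is never touched, and the stencil buffer is updated only when the depth test fails.

```python
# Hypothetical per-pixel sketch of depth-fail ("z-fail") stencil rendering.
# Names and conventions are mine; smaller z means closer to the camera.

def shade_stencil_pixel(frag_z, depth_buffer, stencil_buffer, x, y, delta):
    """Process one shadow-volume fragment in depth-fail mode.

    frag_z -- depth of the incoming volume fragment
    delta  -- +1 for back faces, -1 for front faces (z-fail convention)
    """
    if frag_z < depth_buffer[y][x]:
        # Depth test PASSES: in depth-fail mode this fragment does nothing.
        return
    # Depth test FAILS: bump the stencil count. The depth buffer is left
    # untouched because z-writes were disabled before drawing the volume.
    stencil_buffer[y][x] += delta

depth = [[0.5]]    # one-pixel "scene" depth buffer
stencil = [[0]]

shade_stencil_pixel(0.8, depth, stencil, 0, 0, +1)  # fails test -> stencil +1
shade_stencil_pixel(0.2, depth, stencil, 0, 0, -1)  # passes test -> no change
```

The pixel that fails the test is the one that does the useful work, which is exactly why it can't simply be thrown away.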
HZ on R300 keeps the max of all Z values in each tile (assuming the convention where higher Z is further away). If a current polygon's closest z (i.e. lowest z) is bigger than the value in the corresponding HZ tile, the polygon is discarded, as it entirely fails the test. This is ordinary depth pass rendering.
If you now want to do depth fail rendering, you have no HZ acceleration. HZ holds the max, so you can't tell when a polygon entirely passes and thus should not be rendered in depth fail mode. In other words, pixels from the stencil volumes that pass the depth test will not change the stencil buffer, but cannot be discarded rapidly by HZ. Polygons that fail can be discarded, but you don't want to discard them since they need to update the stencil buffer one pixel at a time.
NOTE: when I say polygons wrt HZ, I mean tiles or blocks of pixels within polygons.
This is what ATI's HZ performance guidelines say: if you change the Z function from depth pass to depth fail, HZ can't work. I just explained to you why, and this is what Basic was talking about. ATI then falls back to ordinary Z-buffering.
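The asymmetry is easy to see in a toy model (a sketch of the idea, not ATI's actual hardware): each HZ entry stores only the MAX depth of its tile, which is exactly the value a depth-pass early-out needs, and exactly not the value a depth-fail early-out would need.

```python
# Toy model of a Hierarchical Z tile test. Larger z = farther away.
# Each HZ entry stores the MAX depth of all pixels in its tile.

def hz_can_reject_depth_pass(poly_min_z, tile_max_z):
    """Depth-pass (z-less) rendering: if the nearest point of the incoming
    block is still behind everything in the tile, every pixel fails the
    test, so the whole block is discarded without per-pixel work."""
    return poly_min_z > tile_max_z

def hz_could_skip_depth_fail(poly_max_z, tile_min_z):
    """Depth-fail rendering: a block could be skipped only if every pixel
    PASSES, i.e. its farthest point is in front of the nearest stored
    depth -- but that needs the tile MIN, which HZ simply doesn't keep."""
    return poly_max_z < tile_min_z  # tile_min_z is not available in HZ

tile_max_z = 0.6  # what HZ actually has for this tile

# Depth pass: a block entirely behind the tile is rejected in one test.
print(hz_can_reject_depth_pass(poly_min_z=0.9, tile_max_z=tile_max_z))  # True

# Depth fail: the analogous early-out would need the tile MIN, so the
# hardware has to fall back to ordinary per-pixel Z testing instead.
```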
Mintmaster said:
And I don't know what you mean by "backside of models", but you do render the backside of the stencil volumes, which is what Grall is talking about. He said "stencils", not models.
He said models, but perhaps he was talking about the stencil volumes instead of models, in which case I misunderstood him.
Grall said stencils, then said models in a statement enclosed in parentheses immediately thereafter. There is no need for you to nitpick.
Mintmaster said:
Finally, we are obviously talking about situations that aren't CPU limited. What good does a fast GPU do you then?
Nowhere in Carmack's .plan update did he even suggest that Doom 3 was not CPU limited, and since the comments I'm responding to are referring to the FX's performance in Doom 3, I don't see how it could be obvious that we're talking about situations that aren't CPU limited.
If we are CPU limited, why the hell would he be talking about video card performance? All you have to do to test the video card is raise the resolution until framerates drop significantly below what you get when disabling rendering altogether.
Whenever you talk about video card performance, you mean situations that aren't CPU limited. Otherwise you are either talking about driver overhead or have no idea what you're talking about, neither of which apply to John Carmack's statements.
Mintmaster said:
While the graphics card is handling the intense texturing for one frame, the CPU is doing the stencil volumes for the next frame. NV30 should very well be able to burn through them.
NV30 might be able to burn through the z tests, and the rendering pass that adds the shadow from the stencil mask might not take too long--that much I could agree with. My point is, while the NV30 might be able to handle its share of the workload for the stencil volumes, the end performance probably isn't going to be "blazing fast" the way he seemed to expect, since there are still lots of things that have to be done to calculate them. And your comment implies that the NV30 has other things to do while the CPU is computing the volumes, which isn't normally the case.
The rendering process is usually that you build the volume for one occluder, do the z tests and update the stencil mask, then build the volume for the next occluder. In this situation, if the GPU can do the transform and z tests faster than the CPU can compute the volume for the next occluder, the GPU will be sitting idle waiting for that information. And if you have this all being done in the same function (or even the same thread), the transformation and z testing won't be done concurrently with the volume production anyway, so each will end up waiting on the other. You could generate all the shadow volumes before you begin transforming and doing z tests, but I don't think that would be any faster, and you'd have to store a lot more vertices in each frame.
The CPU does not wait for the GPU to finish the stencil drawing, nor the other way around. Things get queued up, with the GPU finishing rendering one frame while the driver caches the draw commands for the next. The drawing calls in the function do not wait for the GPU to finish before returning. This is probably the most fundamental of driver enhancements to reduce CPU usage.
The only time this queuing fails is when you change rendering resources like textures or vertex buffers in the middle of a frame, or if you need to get a result back, like doing a framebuffer read or using occlusion query, in which case you empty the queue. Even the latter has mechanisms for issuing the query and retrieving results later. If you're working with dynamic vertex buffers, the driver can make a copy of the vertex data--via the CPU (or AGP, I think)--and queue that too.
Carmack knows very well how to optimize a program. He will not let both GPU and CPU have any significant idle time in the same frame. If the driver doesn't do what I said, then he will ping-pong between vertex buffers from frame to frame.
So even if you generate your shadow volumes and send them to the GPU one at a time, the driver will effectively wind up drawing them all together some time later.
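Here's a toy illustration of that queuing (a sketch under my own assumptions, not real driver code): draw calls only append to a buffer and return immediately, so the CPU can start building the next shadow volume while the GPU drains the commands for the previous ones some time later.

```python
# Hypothetical driver model: draw() never waits on the GPU.
from collections import deque

class FakeDriver:
    def __init__(self):
        self.queue = deque()   # buffered draw commands
        self.gpu_log = []      # what the "GPU" eventually rendered

    def draw(self, volume):
        # Returns immediately; the CPU is never stalled here.
        self.queue.append(volume)

    def gpu_drain(self):
        # Stand-in for the GPU pulling buffered commands asynchronously.
        while self.queue:
            self.gpu_log.append(self.queue.popleft())

driver = FakeDriver()
for occluder in ["crate", "pillar", "monster"]:
    volume = f"volume({occluder})"   # CPU builds the stencil volume...
    driver.draw(volume)              # ...and queues it without stalling.

driver.gpu_drain()                   # GPU winds up drawing them all together
print(driver.gpu_log)
```

Submitting the volumes one at a time costs the CPU almost nothing; the serialization you're worried about only appears if something forces the queue to empty mid-frame.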