Why the heck wouldn't this be the case when writing Z to a rendertarget, as Arun and I (and the slide you posted) have been clearly saying since the very beginning of this debate?
Lower resolution Z? What? Nobody in this thread has suggested using a lower resolution. The DX10 fallback is to have Z (well, distance to be more precise, since that's more useful) in your G-buffer. The slide you keep referring to says that, Arun said that, SuperCow said that, I said that, and even you said that!
Well, it seems that the lower-cost option (which also has lower IQ, and which I've been assuming is the normal choice) is to use hardware Z instead of writing Z/distance into the G-buffer. That appears to be what Sebbbi was doing (though he didn't say so explicitly):
http://forum.beyond3d.com/showpost.php?p=1094257&postcount=21
The IQ loss isn't major (minor artefacting on overlapping triangles).
If you go through the "fallback" options in that set of slides, they are all more costly (bandwidth/memory/time). Writing Z to an MRT adds to the cost. D3D10.1 allows the developer to fix the minor IQ artefacts at no increase in G-buffer creation cost.
(In theory D3D10.1 hardware Z provides slightly better IQ because the rasteriser outputs true Z for each sample (whereas the pixel shader can only write Z per pixel) and because the developer knows the positions of each sample within the pixel. But that's just an aside as I wasn't using that stuff as the basis of "better IQ" - my argument rested on just bandwidth versus IQ.)
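Just to pin down what I mean by "writing Z/distance into the G-buffer", here's a throwaway sketch of that fallback. The names are invented and it's written CUDA-style over per-pixel view-space positions rather than as actual shader code, so treat it purely as illustration; in a real renderer this is the G-buffer pixel shader writing to one of its MRTs.

```cuda
// Illustrative sketch only: store eye-to-surface distance (rather than
// post-projection Z) in a dedicated fp32 G-buffer channel, so the lighting
// pass can reconstruct position along the view ray directly.
#include <cuda_runtime.h>

__global__ void writeDistanceChannel(const float3* viewPos,   // hypothetical per-pixel view-space positions
                                     float*        gbufDist,  // hypothetical fp32 distance channel of the G-buffer
                                     int           numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;

    float3 p = viewPos[i];
    // True distance from the eye to the surface point, not hardware Z.
    gbufDist[i] = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z);
}
```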
Stencil? Marking visible and in-range pixels for lighting is the biggest advantage of deferred rendering.
You've got me there.
That's exactly what I was talking about. If writing to MSAA textures saves bandwidth, as you claimed, then compression must be enabled. Yet when I suggested this, it's "ridiculous" and "a load of baloney".
You're also suggesting that after one is finished writing to a MSAA texture, the card goes through it and uncompresses it for use in readback. You have solid info for this? Assuming you're right, for each pixel that's compressed, you read one sample worth of data and then write it to all samples (this is done in a tiled manner for efficiency, of course).
Something like this is unavoidable. Particularly as fp32 texels are not generally compressible, so the texture units won't have the capability to de-compress the "compressed fp32 render target". If render target compression of fp32 MRTs is something like "this tile has all four samples the same in pixels 1 and 2, the remaining samples are: ...", it's not going to be a suitable technique for texture compression (because on ordinary textures you'd get no compression at all). So, in my view, there won't be any hardware in the texture pipes to "uncompress render target data".
It seems to me that render target decompression is much like AA resolve, just without the averaging. Unfortunately we're not going to get an absolute answer on this...
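In software terms, the guess would look something like the sketch below. It's purely illustrative: the "fully-covered pixel stores one sample's worth of data plus a flag" layout is my assumption, and the real hardware (if it works this way at all) does this in fixed function, tile by tile.

```cuda
// Illustrative "decompress = resolve without the averaging" sketch.
// Assumed (not documented anywhere): a fully-covered pixel stores a single
// sample's worth of data; decompression replicates it to every sample.
#include <cuda_runtime.h>

struct CompressedPixel {
    float4 value;        // the one stored sample for a fully-covered pixel
    bool   fullyCovered; // if false, per-sample data lives elsewhere (not modelled here)
};

__global__ void decompressToAllSamples(const CompressedPixel* src,
                                       float4*                dst,  // numPixels * samplesPerPixel entries
                                       int                    numPixels,
                                       int                    samplesPerPixel)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;

    if (src[i].fullyCovered) {
        // Read one sample's worth of data, write it out to all samples.
        for (int s = 0; s < samplesPerPixel; ++s)
            dst[i * samplesPerPixel + s] = src[i].value;
    }
    // An AA resolve would average the samples down to one value instead;
    // here nothing is averaged, the data is only expanded.
}
```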
You think this saves a lot of bandwidth over just writing to all samples in the first place?
Of course, because during G-buffer creation you've got overdraw (5x still seems like a reasonable number these days). Also G-buffer creation has to test Z for every pixel it creates, so you want the most bandwidth-efficient Z-testing possible, which comes courtesy of hierarchical Z and all the rest of it.
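Rough, purely illustrative numbers (B here is just my shorthand for the per-pixel G-buffer footprint, and 5x overdraw is the figure above): per screen pixel,

\[
\text{4xAA, uncompressed: } \approx 5 \times 4B \text{ bytes written}, \qquad \text{4xAA, compressed (fully-covered pixels): } \approx 5 \times 1B \text{ bytes written.}
\]

Edge pixels cost more than the compressed figure, but most pixels aren't edge pixels.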
One more thing: Without MSAA, G-buffer creation is not bandwidth bound unless you're using RSX. It's fillrate bound. Whether you are outputting one rendertarget or 10, you are writing at the same rate: 96 bytes per clock on G80, 64 bytes per clock on R600, both well within BW limits. More rendertargets in fact make it easier to reach the peak speed, since Z read/write is done only once.
OK, I'm missing something here perhaps. On R600: 4 MRTs, each with 4 bytes of colour, plus 4 bytes of Z, is 20 bytes × 16 pixels per clock = 320 bytes per clock, which at 742 MHz is ~237 GB/s.
Now, I admit, the shader that generates each G-buffer pixel should be running for longer than 4 cycles (R600 has 64 pixels in flight but only 16 can write to the RBEs per clock), but even at 8 cycles per G-buffer pixel that's still 119 GB/s of data coming out of the ALU pipes. 8 cycles is enough to fetch 5 or 6 textures plus do some maths on them, and vertex shading should add a few cycles on top, I suppose...
But then you have to add in the bandwidth consumed by fetching those 5 or 6 textures per G-buffer pixel...
Looks thoroughly bandwidth constrained to me. What am I missing?
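Spelling the arithmetic out:

\[
(4 \times 4\,\mathrm{B\ colour} + 4\,\mathrm{B\ Z}) \times 16\ \mathrm{pixels/clock} = 320\ \mathrm{B/clock}, \qquad 320\ \mathrm{B/clock} \times 742\ \mathrm{MHz} \approx 237\ \mathrm{GB/s}
\]
\[
\text{at 8 cycles per pixel (so 8 pixels completing per clock): } 8 \times 20\,\mathrm{B} \times 742\ \mathrm{MHz} \approx 119\ \mathrm{GB/s}
\]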
What's the angle of 4x rotated grid on ATI's and NVidia's parts? Hint: much less than 45 degrees.
As you've hopefully noticed by now, even with Xmas's suggested 26.5 degrees (which seems to be the angle ATI is using now that I've measured), you're still looking at 80%+ wasted space, which is a lot if you've got MRTs, not just a single render target...
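For what it's worth, that 26.5 degrees isn't an arbitrary number: it's the angle you get by rotating the grid with a 1-in-2 offset,

\[
\theta = \arctan\!\left(\tfrac{1}{2}\right) \approx 26.57^\circ .
\]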
Jawed