What does DX10.1 add vs. DX10 that makes such a big difference? [deferred shading]

When triangles from two different objects share a pixel, a single shared Z is meaningless. In a conventional MSAA resolve this is irrelevant - but deferred rendering's shading pass must have Z in order to make sense of the G-buffer. If the shading pass reads the G-buffer at 2560x2048 (4xMSAA on 1280x1024) but reads Z at 1280x1024, you will get subtle rendering errors wherever triangles from two different meshes meet within a pixel. Subtle, but there nonetheless.
No one is talking about lower resolution depth, that isn't even possible in a single pass as all render targets need to have the same number of samples. With multisampled depth output to a "color" buffer you still get different depth values per triangle, just not per sample inside the same triangle.

If you render at a significantly smaller angle you get the cost of 4xSS at the IQ of ~2xSS. Great.
45° is a pretty poor angle. For 5x "rotated ordered grid" you'd use 26.5°.
 
No one is talking about lower resolution depth, that isn't even possible in a single pass as all render targets need to have the same number of samples. With multisampled depth output to a "color" buffer you still get different depth values per triangle, just not per sample inside the same triangle.
Which is more costly than the D3D10.1 approach - but yeah, not as costly as supersampling which I'll happily admit. I realise now this is what Mintmaster was referring to as "MSAA rendertarget Z".

45° is a pretty poor angle. For 5x "rotated ordered grid" you'd use 26.5°.
For 1280x1024 that's an overhead of 82% (1603x1488, before rounding up to a multiple of 4 or 16) and for 1920x1200 it's 89% (2254x1931, subject to the same rounding).
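For anyone who wants to check those figures, a minimal sketch of the arithmetic (26.5° and the resolutions above; the alignment rounding is left out since it depends on the hardware's tiling):

```cpp
// Bounding box of the framebuffer rotated by theta, and the area
// overhead versus the unrotated resolution.
#include <cmath>
#include <cstdio>

void overhead(double w, double h, double degrees) {
    const double t = degrees * 3.14159265358979323846 / 180.0;
    const double rw = w * std::cos(t) + h * std::sin(t); // rotated width
    const double rh = w * std::sin(t) + h * std::cos(t); // rotated height
    std::printf("%.0fx%.0f -> %.0fx%.0f, overhead %.0f%%\n",
                w, h, rw, rh, 100.0 * (rw * rh / (w * h) - 1.0));
}

int main() {
    overhead(1280, 1024, 26.5); // ~1603x1488, ~82%
    overhead(1920, 1200, 26.5); // ~2254x1931, ~89%
}
```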

Jawed
 
I agree with the rest, but I don't think this is true for an FP32 depth buffer.
Probably true, but I guess we have to test it out to see if the distribution of values is good enough. I know you've made some posts on this matter, putting 1 at the near plane, but I guess we'll have to see. Can you enable stencil with FP32 Z? Stencil is very important in deferred rendering, as it allows you to light only visible pixels in range of the light.

You write Z to the depth buffer anyway, so writing it to another render target as well is wasted bandwidth.
In that part of my post I'm not comparing rendertarget (DX10) to depth buffer (MSAA). Jawed was saying supersampling the G-buffer uses far less BW than multisampling it, but that's only possible with color compression.

However, as SuperCow already pointed out, you can't use that buffer simultaneously for depth testing and for the lighting calculations, so if you want to benefit from depth testing when rendering the light extents you may need to create a copy of the buffer. Copying the buffer obviously has no overdraw, but it's a read and a write instead of just a write.
Good point. Plus, it's a serial step that can't be overlapped with (and occasionally hidden by) vertex-limited parts of G-buffer rendering. Recovering distance from Z also requires more math. Add it all up, and avoiding Z in a new render target may not help much.
 
For the best image quality on edges, each triangle that falls within a pixel must have a 1:1 relationship between Z and albedo, normal, specularity etc.
Why the heck wouldn't this be the case when writing Z to a rendertarget, as Arun and I (and the slide you posted) have been clearly saying since the very beginning of this debate?

Lower resolution Z creates an edge artefact for those aspects of lighting that are dependent on Z.
Lower resolution Z? What? Nobody in this thread has suggested using a lower resolution. The DX10 fallback is to have Z (well, distance to be more precise, since that's more useful) in your G-buffer. The slide you keep referring to says that, Arun said that, SuperCow said that, I said that, and even you said that!

G-buffer creation is bandwidth bound - MSAA'd creation, particularly in making use of the GPU's compression features which are there to save bandwidth, is a big win.

32-bit Z to the rescue.
Stencil? Marking visible and in-range pixels for lighting is the biggest advantage of deferred rendering.

The saving is during creation, not on read-back.
That's exactly what I was talking about. If writing to MSAA textures saves bandwidth, as you claimed, then compression must be enabled. Yet when I suggested this, it's "ridiculous" and "a load of baloney". :rolleyes:

You're also suggesting that after one is finished writing to a MSAA texture, the card goes through it and uncompresses it for use in readback. You have solid info for this? Assuming you're right, for each pixel that's compressed, you read one sample worth of data and then write it to all samples (this is done in a tiled manner for efficiency, of course). You think this saves a lot of bandwidth over just writing to all samples in the first place?

One more thing: Without MSAA, G-buffer creation is not bandwidth bound unless you're using RSX. It's fillrate bound. Whether you are outputting one rendertarget or 10, you are writing at the same rate: 96 bytes per clock on G80, 64 bytes per clock on R600, both well within BW limits. More rendertargets in fact make it easier to reach the peak speed, since Z read/write is done only once.

Enable MSAA on R600 and the max theoretical speed boost is 2x vs. rendering a larger G-buffer. Now, in deferred rendering you're suffering the extra load of G-buffer writing and then reading to save on lighting calculations, so clearly lighting is the biggest workload, right? How does halving (at best) only the G-buffer creation time give you a big performance boost?

It won't, hence the 2A+4B vs. 4A+4B comment.
(Sorry, I flipped my inequality in the last post. A << B here.)
Two overlapping triangles 10m apart have quite different Z values.
Put Z in a rendertarget (i.e. as part of your G-buffer) and the subsamples have quite different values too. WTF is your point?

If you render at a significantly smaller angle you get the cost of 4xSS at the IQ of ~2xSS. Great.
What's the angle of 4x rotated grid on ATI's and NVidia's parts? Hint: much less than 45 degrees.
 
Why the heck wouldn't this be the case when writing Z to a rendertarget, as Arun and I (and the slide you posted) have been clearly saying since the very beginning of this debate?

Lower resolution Z? What? Nobody in this thread has suggested using a lower resolution. The DX10 fallback is to have Z (well, distance to be more precise, since that's more useful) in your G-buffer. The slide you keep referring to says that, Arun said that, SuperCow said that, I said that, and even you said that!
Well, it seems that the lower-cost option (which also has lower IQ, and which I've been assuming is the normal choice) is to use hardware Z instead of writing Z/distance into the G-buffer. That appears to be what Sebbbi was doing (though he didn't say so explicitly):

http://forum.beyond3d.com/showpost.php?p=1094257&postcount=21

The IQ loss isn't major (minor artefacting on overlapping triangles).

If you go through the "fallback" options in that set of slides, they are all more costly (bandwidth/memory/time). Writing Z to an MRT adds to the cost. D3D10.1 allows the developer to fix the minor IQ artefacts at no increase in G-buffer creation cost.

(In theory D3D10.1 hardware Z provides slightly better IQ because the rasteriser outputs true Z for each sample (whereas the pixel shader can only write Z per pixel) and because the developer knows the positions of each sample within the pixel. But that's just an aside as I wasn't using that stuff as the basis of "better IQ" - my argument rested on just bandwidth versus IQ.)
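To make the D3D10.1 side concrete, here's a hedged sketch of the resource setup being discussed - a multisampled depth buffer that is both depth-testable and readable per sample in the shading pass. It assumes an existing device pointer created against a 10.1 device; the typeless-format pattern is standard, but treat the specific formats and sizes as illustrative:

```cpp
// 4xMSAA depth buffer that the lighting pass can read per sample.
// On a plain D3D10 device this combination of MSAA + shader-resource
// binding on a depth format is exactly what's missing; D3D10.1 allows it.
D3D10_TEXTURE2D_DESC td = {};
td.Width            = 1280;
td.Height           = 1024;
td.MipLevels        = 1;
td.ArraySize        = 1;
td.Format           = DXGI_FORMAT_R24G8_TYPELESS; // typeless so the two views below can differ
td.SampleDesc.Count = 4;                          // 4xMSAA
td.Usage            = D3D10_USAGE_DEFAULT;
td.BindFlags        = D3D10_BIND_DEPTH_STENCIL | D3D10_BIND_SHADER_RESOURCE;

ID3D10Texture2D* depthTex = nullptr;
device->CreateTexture2D(&td, nullptr, &depthTex);

// View 1: depth-stencil view for the G-buffer creation pass.
D3D10_DEPTH_STENCIL_VIEW_DESC dsvDesc = {};
dsvDesc.Format        = DXGI_FORMAT_D24_UNORM_S8_UINT;
dsvDesc.ViewDimension = D3D10_DSV_DIMENSION_TEXTURE2DMS;
ID3D10DepthStencilView* depthDSV = nullptr;
device->CreateDepthStencilView(depthTex, &dsvDesc, &depthDSV);

// View 2: shader resource view (depth portion only) for the lighting pass.
D3D10_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Format        = DXGI_FORMAT_R24_UNORM_X8_TYPELESS;
srvDesc.ViewDimension = D3D10_SRV_DIMENSION_TEXTURE2DMS;
ID3D10ShaderResourceView* depthSRV = nullptr;
device->CreateShaderResourceView(depthTex, &srvDesc, &depthSRV);
```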

Stencil? Marking visible and in-range pixels for lighting is the biggest advantage of deferred rendering.
You've got me there.

That's exactly what I was talking about. If writing to MSAA textures saves bandwidth, as you claimed, then compression must be enabled. Yet when I suggested this, it's "ridiculous" and "a load of baloney". :rolleyes:

You're also suggesting that after one is finished writing to a MSAA texture, the card goes through it and uncompresses it for use in readback. You have solid info for this? Assuming you're right, for each pixel that's compressed, you read one sample worth of data and then write it to all samples (this is done in a tiled manner for efficiency, of course).
Something like this is unavoidable. Particularly as fp32 texels are not generally compressible, so the texture units won't have the capability to decompress a "compressed fp32 render target". If render target compression of fp32 MRTs is something like "this tile has all four samples the same in pixels 1 and 2; the remaining samples are: ...", it's not going to be a suitable technique for texture compression (with average textures you'd get no compression at all). So, in my view, there won't be any hardware in the texture pipes to uncompress render target data.

It seems to me that render target decompression is much like AA resolve, just without the averaging. Unfortunately we're not going to get an absolute answer on this...

You think this saves a lot of bandwidth over just writing to all samples in the first place?
Of course, because during G-buffer creation you've got overdraw (5x still seems like a reasonable number these days). Also G-buffer creation has to test Z for every pixel it creates, so you want the most bandwidth-efficient Z-testing possible, which comes courtesy of hierarchical Z and all the rest of it.

One more thing: Without MSAA, G-buffer creation is not bandwidth bound unless you're using RSX. It's fillrate bound. Whether you are outputting one rendertarget or 10, you are writing at the same rate: 96 bytes per clock on G80, 64 bytes per clock on R600, both well within BW limits. More rendertargets in fact make it easier to reach the peak speed, since Z read/write is done only once.
OK, I'm missing something here perhaps, on R600: 4 MRTs, each with 4 bytes of colour, + 4 bytes of Z is 20 bytes x 16 pixels per clock = 320 bytes, at 742MHz is 237GB/s.

Now, I admit, the shader that generates each G-buffer pixel should be running for longer than 4 cycles (R600 has 64 pixels in flight but only 16 can write to the RBEs per clock), but even at 8 cycles per G-buffer pixel, that's 119GB/s of data coming out of the ALU pipes. 8 cycles is enough to fetch 5 or 6 textures + do some math on them. Vertex shading should add a few cycles I suppose...

But then you have to add in the bandwidth consumed by fetching those 5 or 6 textures per G-buffer pixel...

Looks thoroughly bandwidth constrained to me. What am I missing?

What's the angle of 4x rotated grid on ATI's and NVidia's parts? Hint: much less than 45 degrees.
As you've hopefully noticed by now, even with Xmas's suggested 26.5 degrees (which, now that I've measured it, seems to be the angle ATI is using), you're still looking at 80%+ wasted space - a lot when you've got MRTs rather than just a single render target...

Jawed
 
I think there are not enough ROPs to write 320 bytes per clock, Jawed.
Yes... at plain fillrate a single render target is 16 x (4 bytes colour + 4 bytes Z) = 128 bytes per clock, which at 742MHz is 95GB/s - before Z testing or texture fetching. Hmm...

Jawed
 
If you go through the "fallback" options in that set of slides, they are all more costly (bandwidth/memory/time). Writing Z to an MRT adds to the cost. D3D10.1 allows the developer to fix the minor IQ artefacts at no increase in G-buffer creation cost.
One more rendertarget is a small cost in the grand scheme of things. If you use stencil (as all DRs should) and can't use 32-bit Z, then it gives you higher quality too due to precision. Note also that reading back Z needs more math than distance does, and you need to copy the Z buffer if you're stenciling, so the difference is even smaller.

Anyway, neither me nor Arun had any issue with admitting there's a small performance penalty for adding distance to the G-buffer. This whole debate has been about the IQ difference between that and Z.

Of course, because during G-buffer creation you've got overdraw (5x still seems like a reasonable number these days). Also G-buffer creation has to test Z for every pixel it creates, so you want the most bandwidth-efficient Z-testing possible, which comes courtesy of hierarchical Z and all the rest of it.
5x overdraw for opaque pixels? With all the early Z rejection hardware enabled? That's nonsense.

OK, I'm missing something here perhaps, on R600: 4 MRTs, each with 4 bytes of colour, + 4 bytes of Z is 20 bytes x 16 pixels per clock = 320 bytes, at 742MHz is 237GB/s.
First of all, Z is compressed. That's why modern video cards can reach their peak pixel rates (check xbitlabs reviews).

More importantly, 4 MRTs take 4 cycles for an ROP to output. Forget about the shader, as the limitation is at the ROPs. You could only output 4 pixels per clock in your scenario, so even ignoring Z-compression, it's 59GB/s.
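As a sanity check, a minimal sketch of both numbers under the R600 assumptions used above (16 ROPs, 742MHz, 4 colour MRTs of 4 bytes each plus 4 bytes of Z):

```cpp
// The naive figure assumes all 16 pixels complete every clock; the
// corrected figure assumes one ROP cycle per MRT, so with 4 MRTs only
// 4 finished pixels leave the ROPs per clock.
#include <cstdio>

int main() {
    const double clockHz       = 742e6;
    const int    ropWidth      = 16;                // pixels per clock, R600
    const int    mrtCount      = 4;
    const int    bytesPerPixel = mrtCount * 4 + 4;  // colour MRTs + Z

    const double naive      = ropWidth * bytesPerPixel * clockHz / 1e9;
    const double ropLimited = (ropWidth / mrtCount) * bytesPerPixel * clockHz / 1e9;

    std::printf("naive: %.0f GB/s, ROP-limited: %.0f GB/s\n", naive, ropLimited);
    // prints: naive: 237 GB/s, ROP-limited: 59 GB/s
}
```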

BW limitations kick in with alpha-blending and uncompressible MSAA, and should it happen with the latter, then the improvement is even less than the 2x I assumed earlier. Regarding texture BW per pixel, it's a fraction of the large framebuffer BW in DR. Most textures are compressed, and shadowmap samples are done in the shading pass.

As you've hopefully noticed by now, even with Xmas's suggested 26.5 degrees (which, now that I've measured it, seems to be the angle ATI is using), you're still looking at 80%+ wasted space - a lot when you've got MRTs rather than just a single render target...
Like I said, that's just one of many possibilities. You can pick almost any rotation angle and any resolution, run the lighting shader on each pixel, then fit it to the screen as you would with any arbitrarily sized and oriented data in image processing. I doubt you can have high resolutions anyway when supersampling the lighting shader.
 
One more rendertarget is a small cost in the grand scheme of things.
I'm surprised you're writing off an extra 25%, say.

If you use stencil (as all DRs should) and can't use 32-bit Z,

http://forum.beyond3d.com/showpost.php?p=1094748&postcount=1554

there seems to be no problem with 32-bit Z and stencil.

then it gives you higher quality too due to precision. Note also that reading back Z needs more math than distance does, and you need to copy the Z buffer if you're stenciling, so the difference is even smaller.
You can't stencil the lighting with the Z-buffer in place? Or do you mean after you've stencilled and want to read it for lighting?

5x overdraw for opaque pixels? With all the early Z rejection hardware enabled? That's nonsense.
I think Carmack, in presenting Rage, mentioned 5x overdraw.

You didn't seem unhappy with 5x:

http://forum.beyond3d.com/showpost.php?p=932025&postcount=18

First of all, Z is compressed. That's why modern video cards can reach their peak pixel rates (check xbitlabs reviews).
Unfortunately, as you add MRTs, the proportion of Z diminishes, which means the proportion of bandwidth saving you'll get from Z compression diminishes even further.

More importantly, 4 MRTs take 4 cycles for an ROP to output. Forget about the shader, as the limitation is at the ROPs. You could only output 4 pixels per clock in your scenario, so even ignoring Z-compression, it's 59GB/s.
I agree, provided that's how the ROPs treat MRT writes (which, admittedly, seems likely). Seems a shame that when outputting MRTs the ROPs are unable to "re-use" the otherwise idle "Z bandwidth". I suppose, in theory, the way the ROPs handle transfer of blocks into memory should maximise utilisation of bandwidth - so this "unused Z bandwidth" problem is prolly a red herring.

BW limitations kick in with alpha-blending and uncompressible MSAA, and should it happen with the latter, then the improvement is even less than the 2x I assumed earlier. Regarding texture BW per pixel, it's a fraction of the large framebuffer BW in DR. Most textures are compressed, and shadowmap samples are done in the shading pass.
Well, it's hard for me to quantify the bandwidth associated with the textures: how many textures for albedo; what maps are used: normal, diffuse, specular, ambient etc. In a forward renderer these texture reads tend to be spaced out a bit, whereas the G-buffer pass tends to cram them all together, which is asking for texture cache thrashing on top of whatever bandwidth they consume.

But it may be that with G80/R600, bandwidth per se is no longer the issue, simply because they have quite a lot of it relative to their colour rate. But then colour rate becomes the problem. And so you still have the fact that writing Z explicitly as an MRT is going to have a noticeable additional cost.

Jawed
 
I'm surprised you're writing off an extra 25%, say.
25% of just the G-buffer pass and only when not vertex limited. I'd bet it's under 10% of the total frame cost.

http://forum.beyond3d.com/showpost.php?p=1094748&postcount=1554

there seems to be no problem with 32-bit Z and stencil.
Yeah, looks like I erred there. I didn't see the new 64-bit per sample depth-stencil format in DX10 (DXGI_FORMAT_D32_FLOAT_S8X24_UINT). I seriously doubt that using it comes for free, though.

You can't stencil the lighting with the Z-buffer in place? Or do you mean after you've stencilled and want to read it for lighting?
Not according to Xmas and SuperCow. You can't use the depth/stencil buffer for culling pixels (those not visible by a light source) and simultaneously use the same buffer for lighting. So we're talking about a 64-bit per sample read and write!

I think Carmack, in presenting Rage, mentioned 5x overdraw.

You didn't seem unhappy with 5x:

http://forum.beyond3d.com/showpost.php?p=932025&postcount=18
Carmack was not talking about opaque pixels, nor was he talking about overdraw that made it through all the Hi-Z/early-Z culling. Deferred rendering is very friendly to sorting too because all pixels pretty much use the same pixel shader. As for me, I was intentionally picking a large number as a worst case to show how huge 10GPix/s is.

I agree, provided that's how the ROPs treat MRT writes (which, admittedly, seems likely). Seems a shame that when outputting MRTs the ROPs are unable to "re-use" the otherwise idle "Z bandwidth".
No it's not a shame, because Z BW is a lot smaller than colour BW for pixels that get through culling. If colour+Z BW is 5 bytes for one MRT, it'll be 4.5 for two MRTs over two clocks, 4.25 for four MRTs over four clocks. Does that difference really warrant doubling the MRT output? Especially when MRT is unused in almost all forward renderers?

Well, it's hard for me to quantify the bandwidth associated with the textures: how many textures for albedo; what maps are used: normal, diffuse, specular, ambient etc. In a forward renderer these texture reads tend to be spaced out a bit, whereas the G-buffer pass tends to cram them all together, which is asking for texture cache thrashing on top of whatever bandwidth they consume.
There are a few things to notice between a DR's G-buffer pass and a comparable FR to determine BW loads:
A) Both have equal texturing demands aside from shadowing
B) G-buffer pass does no shadow map sampling, but FR does (32-bit, uncompressed, multiple samples, no mipmapping)
C) G-buffer has 4-5x the write BW

If we assume the framebuffer-to-texture BW ratio in a FR is 1:1 (and I personally believe texture BW is lower), then the points above suggest the same ratio in the G-buffer pass will be well over 5:1.
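To make that concrete with one hedged, made-up set of per-pixel byte counts (illustrative only, not measured from any game):

```cpp
// Point-by-point toy model of the list above: equal texturing (A),
// shadow samples dropped from the G-buffer pass (B), 4-5x write BW (C).
#include <cstdio>

int main() {
    const double frWrite = 8.0;            // FR: 4B colour + 4B Z per pixel
    const double frTex   = 8.0;            // the 1:1 assumption, shadow samples included
    const double gbWrite = frWrite * 4.5;  // point C: "4-5x the write BW"
    const double gbTex   = frTex - 4.0;    // point B: no shadow map sampling
    std::printf("G-buffer write:texture = %.0f:%.0f (~%.0f:1)\n",
                gbWrite, gbTex, gbWrite / gbTex);
    // prints: G-buffer write:texture = 36:4 (~9:1), i.e. "well over 5:1"
}
```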

And so you still have the fact that writing Z explicitly as an MRT is going to have a noticeable additional cost.
For the last time, DX10.1 does not remove that for free. To get rid of one 32-bit rendertarget, you must:
A) Use a 64-bit per sample Z/stencil-buffer to maintain accuracy (and it's still not as good as FP32 distance, but hopefully good enough)
B) Copy this big Z-buffer before lighting (sketched below)
C) Do extra math to convert Z into distance in the shading pass

This is not free, and you're doing it all as an alternative to what I estimate is under 10% of a perf hit.
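A hedged sketch of what step B looks like at the API level - depthTex, depthCopy and depthDesc are hypothetical names, and both resources must be created with identical descriptions for the copy to be legal:

```cpp
// Duplicate the depth-stencil buffer so one copy stays bound for
// stencil culling while the lighting shader reads the other.
ID3D10Texture2D* depthCopy = nullptr;
device->CreateTexture2D(&depthDesc, nullptr, &depthCopy); // same desc as depthTex
device->CopyResource(depthCopy, depthTex);                // full-resource GPU copy
// Bind an SRV of depthCopy for lighting; keep depthTex bound as depth/stencil.
```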
 
Yeah, looks like I erred there. I didn't see the new 64-bit per sample depth-stencil format in DX10 (DXGI_FORMAT_D32_FLOAT_S8X24_UINT). I seriously doubt that using it comes for free, though.
It's more data so it can't be free, but I doubt it's implemented with 24 wasted bits.

A) Use a 64-bit per sample Z/stencil-buffer to maintain accuracy (and it's still not as good as FP32 distance, but hopefully good enough)
FP32 depth is usually already limited by vertex shader precision, and interpolation of Z should be more precise as it is linear in screen space.

C) Do extra math to convert Z into distance in the shading pass
One rcp if you store 1/Zeye.
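Spelled out (a sketch: it assumes the pixel shader interpolates a view ray r scaled so its eye-space z component is 1):

```latex
\begin{align*}
w &= 1/z_{\mathrm{eye}}                   && \text{stored per pixel}\\
z_{\mathrm{eye}} &= \operatorname{rcp}(w) && \text{the one rcp}\\
d &= z_{\mathrm{eye}}\,\lVert\mathbf{r}\rVert
  && \text{distance, with position } \mathbf{p}_{\mathrm{eye}} = z_{\mathrm{eye}}\,\mathbf{r}
\end{align*}
```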
 
It's more data so it can't be free, but I doubt it's implemented with 24 wasted bits.
Possibly, but even if you're right, it's an additional cost for every Z-test done during the G-buffer pass. As an aside, do you think the stencil buffer is kept separate? Or do you think the depth/stencil has a stride of 5 or 6 bytes as opposed to 8? Trying to save some space this way might pose problems/costs for depth/stencil readback, as all other formats written have power-of-two strides, and a split buffer has its own costs.

FP32 depth is usually already limited by vertex shader precision, and interpolation of Z should be more precise as it is linear in screen space.
You think iterator interpolation isn't done at full precision? I know it's mathematically more complex than screenspace interpolation, but I'd still expect it to be done properly.

I'll admit, though, that in my last post I forgot about how you can get a great distribution of values when mapping 1 to the near plane, as you've pointed out in other threads. So my minor point about precision is now debunked.
One rcp if you store 1/Zeye.
True, but it's still something. I'm just pointing out that when you add all these things together and compare it to the cost of another rendertarget in the G-buffer, the total difference won't be particularly big.
 
Not according to Xmas and SuperCow. You can't use the depth/stencil buffer for culling pixels (those not visible by a light source) and simultaneously use the same buffer for lighting. So we're talking about a 64-bit per sample read and write!
If you use stencil marking for all your deferred lights then you should get away with not having to copy the depth buffer:

1) Mark stencil for pixels inside light volume using ZFail method (only writing to stencil)
2) Unbind Z buffer, bind Z buffer as texture and apply lighting equation for stencil-marked pixels
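A hedged D3D10 sketch of the state setup for step 1, assuming an existing device pointer - this is one common increment/decrement arrangement, not necessarily the exact one anyone here uses:

```cpp
// Two-sided stencil state: depth-test but never write depth, count
// ZFail events so that pixels inside the volume end with stencil != 0.
// Draw the light volume with culling disabled and colour writes off.
D3D10_DEPTH_STENCIL_DESC dsd = {};
dsd.DepthEnable      = TRUE;
dsd.DepthWriteMask   = D3D10_DEPTH_WRITE_MASK_ZERO;   // stencil-only pass
dsd.DepthFunc        = D3D10_COMPARISON_LESS;
dsd.StencilEnable    = TRUE;
dsd.StencilReadMask  = 0xFF;
dsd.StencilWriteMask = 0xFF;
// Back faces: scene geometry in front of the volume's far shell -> increment.
dsd.BackFace.StencilFailOp       = D3D10_STENCIL_OP_KEEP;
dsd.BackFace.StencilDepthFailOp  = D3D10_STENCIL_OP_INCR;
dsd.BackFace.StencilPassOp       = D3D10_STENCIL_OP_KEEP;
dsd.BackFace.StencilFunc         = D3D10_COMPARISON_ALWAYS;
// Front faces: cancel the increment unless the pixel lies inside the volume.
dsd.FrontFace.StencilFailOp      = D3D10_STENCIL_OP_KEEP;
dsd.FrontFace.StencilDepthFailOp = D3D10_STENCIL_OP_DECR;
dsd.FrontFace.StencilPassOp      = D3D10_STENCIL_OP_KEEP;
dsd.FrontFace.StencilFunc        = D3D10_COMPARISON_ALWAYS;

ID3D10DepthStencilState* zfailMark = nullptr;
device->CreateDepthStencilState(&dsd, &zfailMark);
device->OMSetDepthStencilState(zfailMark, 0);
```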

The problem is that sometimes it is not necessarily economical to use stencil for all your volumes. E.g. it can be sufficient to directly apply lighting by rendering the back faces of a light volume whenever the camera is inside the volume. In those cases you need access to both Z at the pixel shader level while still having your depth buffer bound for reads (which means a ZBuffer copy or writing Z into an MRT in the first place).
 
Possibly, but even if you're right, it's an additional cost for every Z-test done during the G-buffer pass. As an aside, do you think the stencil buffer is kept separate? Or do you think the depth/stencil has a stride of 5 or 6 bytes as opposed to 8? Trying to save some space this way might pose problems/costs for depth/stencil readback, as all other formats written have power-of-two strides, and a split buffer has its own costs.
I believe that even for uncompressed render targets we're looking at tiled buffers, so the stride is always at least a multiple of 32 bits (or more likely 128 bits). Z and stencil could be separate or interleaved at the pixel or tile level, in the latter case possibly even with different tile sizes. I'm not sure there's a single best implementation.

With separate tiles you could use different compression methods, which might make sense depending on the scene. Stencil typically only changes at object edges, but it can be read-modify-write, so hidden edges can cause a change in stencil too. Depth slope changes at (most) visible triangle edges.

You think iterator interpolation isn't done at full precision? I know it's mathematically more complex than screenspace interpolation, but I'd still expect it to be done properly.
A more complex operation will likely cause some precision loss, even if "done properly".

True, but it's still something. I'm just pointing out that when you add all these things together and compare it to the cost of another rendertarget in the G-buffer, the total difference won't be particularly big.
Oh, I certainly agree there's no huge difference to begin with, and it's only getting smaller in the future.


If you use stencil marking for all your deferred lights then you should get away with not having to copy the depth buffer:

1) Mark stencil for pixels inside light volume using ZFail method (only writing to stencil)
2) Unbind Z buffer, bind Z buffer as texture and apply lighting equation for stencil-marked pixels
You can't unbind the Z buffer without unbinding the stencil buffer. So at this point you have lost access to stencil information.

Demirug is right that a smart driver can handle using the buffer simultaneously without a copy, but I wouldn't rely on it.
 
You can't unbind the Z buffer without unbinding the stencil buffer. So at this point you have lost access to stencil information.
Doh! Need more coffee!
And you can't even access the stencil buffer as texture either since the Z Buffer is already viewed as a texture for Z data (Z and stencil require different views). Not that it would be a good idea anyway since you'd lose the early stencil optimizations.
 
Doh! Need more coffee!
And you can't even access the stencil buffer as texture either since the Z Buffer is already viewed as a texture for Z data (Z and stencil require different views). Not that it would be a good idea anyway since you'd lose the early stencil optimizations.
You can bind multiple views of the same resource simultaneously as inputs AFAIK. But yes, marking lit pixels with stencil only makes sense if you can use an early stencil test.
 
1. Cube map arrays. Can be used in GI techniques.
2. Separate Blend Modes per-MRT. Good for ... hmm...
3. Increased Vertex Shader Inputs & Outputs. Doubled from 16 to 32. Well, more interpolation registers are always good.
4. Gather4 sampler. Good for GPGPU and PCF shadows.
5. LOD instruction. Good for custom texture filtering.
6. Multi-sample buffer. Lots of uses: custom edge filters, adaptive or custom AA, HDR resolve, etc. Basically you have access to samples, not only pixels.
7. Programmable AA Sample Patterns. To experiment with custom AA for extra image quality.
8. Standardized minimum features: min 4x MSAA, FP32 and int16 filtering/blending, increased FP precision (0.5 ULP)

For me the two most important features are cube map arrays (#1) and the multi-sample buffer (#6)... and they're enough to require a new graphics card, so start blaming your shiny new GF8800. DX10.1 is supposed to come with Vista SP1.

As for deferred shading, it can be hard to do without these new things... you have to weigh the blending problems + MRT bandwidth and memory consumption + rendering light volumes with stencil ops against multipass/multiple-light iteration in the shader with a pre-Z pass. I personally don't find it very suitable, and I hope these DX10.1 features can help to improve it.
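As one concrete illustration of feature #1, a hedged sketch of creating a cube map array on a D3D10.1 device (device1 is an assumed ID3D10Device1*; sizes and format are arbitrary placeholder choices):

```cpp
// N cubes stored as a 6*N-slice texture array, exposed to SM4.1
// shaders through the TextureCubeArray view that D3D10.1 added.
const UINT numCubes = 4;

D3D10_TEXTURE2D_DESC td = {};
td.Width            = 256;
td.Height           = 256;
td.MipLevels        = 1;
td.ArraySize        = 6 * numCubes;            // six faces per cube
td.Format           = DXGI_FORMAT_R16G16B16A16_FLOAT;
td.SampleDesc.Count = 1;
td.Usage            = D3D10_USAGE_DEFAULT;
td.BindFlags        = D3D10_BIND_SHADER_RESOURCE | D3D10_BIND_RENDER_TARGET;
td.MiscFlags        = D3D10_RESOURCE_MISC_TEXTURECUBE;

ID3D10Texture2D* cubes = nullptr;
device1->CreateTexture2D(&td, nullptr, &cubes);

D3D10_SHADER_RESOURCE_VIEW_DESC1 srvDesc = {};
srvDesc.Format        = td.Format;
srvDesc.ViewDimension = D3D10_1_SRV_DIMENSION_TEXTURECUBEARRAY;
srvDesc.TextureCubeArray.MipLevels = 1;
srvDesc.TextureCubeArray.NumCubes  = numCubes;

ID3D10ShaderResourceView1* cubeArraySRV = nullptr;
device1->CreateShaderResourceView1(cubes, &srvDesc, &cubeArraySRV);
```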
 
Regarding #6, remember that DX10 had multisample buffer access for everything except the depth-stencil buffer.

How can deferred rendering take advantage of per-MRT blend modes? I'm curious about that, as I don't really see any way out of doing transparency in a forward rendering pass.

BTW, I agree with your assessment of deferred rendering. However, there are some techniques that just can't get by without it.
 
How can deferred rendering take advantage of per-MRT blend modes? I'm curious about that, as I don't really see any way out of doing transparency in a forward rendering pass.
I was asking myself the same question. Off the top of my head it could be useful to apply decals (bullet holes, blood splatter, etc.) to the G-Buffer. Maybe you want to completely overwrite the normal but blend the diffuse with the existing value in the G-Buffer or something similar. (In theory you'd probably really want to blend the normal too but you need programmable blending for that).
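Following that idea, a hedged sketch of D3D10.1's per-MRT blend setup (ID3D10Device1::CreateBlendState1) for such a decal pass - the MRT0 = diffuse, MRT1 = normal layout is an assumption, not anyone's actual G-buffer:

```cpp
// Per-render-target blend: alpha-blend the decal's diffuse, overwrite
// its normal. device1 is an assumed ID3D10Device1*.
D3D10_BLEND_DESC1 bd = {};
bd.IndependentBlendEnable = TRUE;            // the 10.1 feature in question

// MRT0 (diffuse): blend the decal over the existing G-buffer value.
bd.RenderTarget[0].BlendEnable           = TRUE;
bd.RenderTarget[0].SrcBlend              = D3D10_BLEND_SRC_ALPHA;
bd.RenderTarget[0].DestBlend             = D3D10_BLEND_INV_SRC_ALPHA;
bd.RenderTarget[0].BlendOp               = D3D10_BLEND_OP_ADD;
bd.RenderTarget[0].SrcBlendAlpha         = D3D10_BLEND_ONE;
bd.RenderTarget[0].DestBlendAlpha        = D3D10_BLEND_ZERO;
bd.RenderTarget[0].BlendOpAlpha          = D3D10_BLEND_OP_ADD;
bd.RenderTarget[0].RenderTargetWriteMask = D3D10_COLOR_WRITE_ENABLE_ALL;

// MRT1 (normal): no blending - just overwrite.
bd.RenderTarget[1].BlendEnable           = FALSE;
bd.RenderTarget[1].RenderTargetWriteMask = D3D10_COLOR_WRITE_ENABLE_ALL;

ID3D10BlendState1* decalBlend = nullptr;
device1->CreateBlendState1(&bd, &decalBlend);
const float factor[4] = {0, 0, 0, 0};
device1->OMSetBlendState(decalBlend, factor, 0xFFFFFFFF);
```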
 