What does DX10.1 add vs. DX10 that makes such a big difference? [deferred shading]

Discussion in 'Rendering Technology and APIs' started by Jawed, Nov 14, 2007.

  1. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,314
    Likes Received:
    140
    Location:
    On the path to wisdom
    No one is talking about lower resolution depth, that isn't even possible in a single pass as all render targets need to have the same number of samples. With multisampled depth output to a "color" buffer you still get different depth values per triangle, just not per sample inside the same triangle.

    45° is a pretty poor angle. For a 5x "rotated ordered grid" you'd use 26.5°.
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    Which is more costly than the D3D10.1 approach - but yeah, not as costly as supersampling which I'll happily admit. I realise now this is what Mintmaster was referring to as "MSAA rendertarget Z".

    For 1280x1024 that's an overhead of 82% (1603x1488, needs rounding up to the nearest 4 or 16) and for 1920x1200 89% (2254x1931 subject to rounding).
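    That bounding-box arithmetic can be sketched like this (my reconstruction: an axis-aligned bounding box around the render target rotated by atan(1/2) ≈ 26.57°, the "rotated ordered grid" angle Xmas mentioned):

    ```python
    import math

    def rotated_bbox(w, h, angle_deg):
        """Axis-aligned bounding box of a w x h target rotated by angle_deg."""
        a = math.radians(angle_deg)
        return (w * math.cos(a) + h * math.sin(a),
                w * math.sin(a) + h * math.cos(a))

    def overhead(w, h, angle_deg):
        """Fraction of wasted space vs. the unrotated target."""
        bw, bh = rotated_bbox(w, h, angle_deg)
        return (math.ceil(bw) * math.ceil(bh)) / (w * h) - 1.0

    angle = math.degrees(math.atan(0.5))    # ~26.57 degrees
    print(rotated_bbox(1280, 1024, angle))  # ~(1603, 1488)
    print(overhead(1280, 1024, angle))      # ~0.82
    print(overhead(1920, 1200, angle))      # ~0.89
    ```

    Rounding the bounding box up to the nearest 4 or 16 pixels, as noted above, would nudge these slightly higher.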

    Jawed
     
  3. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Probably true, but I guess we have to test it out to see if the distribution of values is good enough. I know you've made some posts on this matter, putting 1 at the near plane, but I guess we'll have to see. Can you enable stencil with FP32 Z? Stencil is very important in deferred rendering, as it allows you to light only visible pixels in range of the light.

    In that part of my post I'm not comparing rendertarget (DX10) to depth buffer (MSAA). Jawed was saying supersampling the G-buffer uses far less BW than multisampling it, but that's only possible with color compression.

    Good point. Plus, it's a serial step that can't be parallelized and occasionally hidden in vertex limited parts of G-buffer rendering. Recovering distance from Z also requires more math. Add it all up, and avoiding Z in a new render target may not help much.
     
    #43 Mintmaster, Nov 16, 2007
    Last edited by a moderator: Nov 16, 2007
  4. Mintmaster

    Why the heck wouldn't this be the case when writing Z to a rendertarget, as Arun and I (and the slide you posted) have been clearly saying since the very beginning of this debate?

    Lower resolution Z? What? Nobody in this thread has suggested using a lower resolution. The DX10 fallback is to have Z (well, distance to be more precise, since that's more useful) in your G-buffer. The slide you keep referring to says that, Arun said that, SuperCow said that, I said that, and even you said that!

    Stencil? Marking visible and in-range pixels for lighting is the biggest advantage of deferred rendering.

    That's exactly what I was talking about. If writing to MSAA textures saves bandwidth, as you claimed, then compression must be enabled. Yet when I suggested this, it's "ridiculous" and "a load of baloney". :roll:

    You're also suggesting that after one is finished writing to a MSAA texture, the card goes through it and uncompresses it for use in readback. You have solid info for this? Assuming you're right, for each pixel that's compressed, you read one sample worth of data and then write it to all samples (this is done in a tiled manner for efficiency, of course). You think this saves a lot of bandwidth over just writing to all samples in the first place?

    One more thing: Without MSAA, G-buffer creation is not bandwidth bound unless you're using RSX. It's fillrate bound. Whether you are outputting one rendertarget or 10, you are writing at the same rate: 96 bytes per clock on G80, 64 bytes per clock on R600, both well within BW limits. More rendertargets in fact make it easier to reach the peak speed, since Z read/write is done only once.
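    A quick sanity check of those write rates against memory bandwidth; the clock and bandwidth figures here are my own assumptions for an 8800 GTX (~575MHz core, ~86.4GB/s) and an HD 2900 XT (~742MHz, ~105.6GB/s), not taken from the post:

    ```python
    def write_rate_gbps(bytes_per_clock, clock_mhz):
        """Peak colour write rate in GB/s."""
        return bytes_per_clock * clock_mhz / 1000.0

    # Assumed specs: 8800 GTX ~575MHz / 86.4GB/s, HD 2900 XT ~742MHz / 105.6GB/s
    print(write_rate_gbps(96, 575))  # ~55 GB/s, well within 86.4
    print(write_rate_gbps(64, 742))  # ~47 GB/s, well within 105.6
    ```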

    Enable MSAA on R600 and the max theoretical speed boost is 2x vs. rendering a larger G-buffer. Now, in deferred rendering you're suffering the extra load of G-buffer writing and then reading to save on lighting calculations, so clearly lighting is the biggest workload, right? How does halving (at best) only the G-buffer creation time give you a big performance boost?

    It won't, hence the 2A+4B vs. 4A+4B comment.
    (Sorry, I got my inequality backwards in the last post. A << B here.)
    Put Z in a rendertarget (i.e. as part of your G-buffer) and the subsamples have quite different values too. WTF is your point?

    What's the angle of 4x rotated grid on ATI's and NVidia's parts? Hint: much less than 45 degrees.
     
  5. Jawed

    Well it seems that the lower cost option (which also has lower IQ and what I've been assuming is the normal choice) is to use hardware Z instead of writing Z/distance into the G-buffer. That appears to be what Sebbbi was doing (though he didn't say so explicitly):

    http://forum.beyond3d.com/showpost.php?p=1094257&postcount=21

    The IQ loss isn't major (minor artefacting on overlapping triangles).

    If you go through the "fallback" options in that set of slides, they are all more costly (bandwidth/memory/time). Writing Z to an MRT adds to the cost. D3D10.1 allows the developer to fix the minor IQ artefacts at no increase in G-buffer creation cost.

    (In theory D3D10.1 hardware Z provides slightly better IQ because the rasteriser outputs true Z for each sample (whereas the pixel shader can only write Z per pixel) and because the developer knows the positions of each sample within the pixel. But that's just an aside as I wasn't using that stuff as the basis of "better IQ" - my argument rested on just bandwidth versus IQ.)

    You've got me there.

    Something like this is unavoidable. Particularly as fp32 texels are not generally compressible so the texture units won't have the capability to de-compress the "compressed fp32 render target". If render target compression of fp32 MRTs is something like "this tile has all four samples the same in pixels 1 and 2, the remaining samples are: ..." it's not going to be a suitable technique for texture compression (because with average textures you'll get no compression at all). So, in my view, there won't be any hardware in the texture pipes to "uncompress render target data".

    It seems to me that render target decompression is much like AA resolve, just without the averaging. Unfortunately we're not going to get an absolute answer on this...

    Of course, because during G-buffer creation you've got overdraw (5x still seems like a reasonable number these days). Also G-buffer creation has to test Z for every pixel it creates, so you want the most bandwidth-efficient Z-testing possible, which comes courtesy of hierarchical Z and all the rest of it.

    OK, I'm missing something here perhaps, on R600: 4 MRTs, each with 4 bytes of colour, + 4 bytes of Z is 20 bytes x 16 pixels per clock = 320 bytes, at 742MHz is 237GB/s.

    Now, I admit, the shader that generates each G-buffer pixel should be running for longer than 4 cycles (R600 has 64 pixels in flight but only 16 can write to the RBEs per clock), but even at 8 cycles per G-buffer pixel, that's 119GB/s of data coming out of the ALU pipes. 8 cycles is enough to fetch 5 or 6 textures + do some math on them. Vertex shading should add a few cycles I suppose...

    But then you have to add in the bandwidth consumed by fetching those 5 or 6 textures per G-buffer pixel...

    Looks thoroughly bandwidth constrained to me. What am I missing?
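    The arithmetic above, spelled out (just reproducing the numbers in the post):

    ```python
    R600_CLOCK_HZ = 742e6
    BYTES_PER_GBUF_PIXEL = 4 * 4 + 4   # 4 MRTs x 4 bytes colour + 4 bytes Z = 20

    # 16 pixels per clock reaching the RBEs, as assumed in the post
    peak_4_cycles = BYTES_PER_GBUF_PIXEL * 16 * R600_CLOCK_HZ / 1e9
    print(peak_4_cycles)       # ~237 GB/s
    # At 8 cycles per G-buffer pixel the output rate halves
    print(peak_4_cycles / 2)   # ~119 GB/s
    ```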

    As you've hopefully noticed by now, even with Xmas's suggested 26.5 degrees (which, now that I've measured it, seems to be the angle ATI is using), you're still looking at 80%+ wasted space, which is a lot if you've got MRTs, not just a single render target...

    Jawed
     
  6. tinokun

    Newcomer Subscriber

    Joined:
    Jul 23, 2004
    Messages:
    57
    Likes Received:
    66
    Location:
    Peru
    I think there are not enough ROPs to write 320 bytes per clock, Jawed.
     
  7. Jawed

    Yes... 128 bytes per clock at plain fillrate for just a single render target - 16 x (4 bytes colour + 4 bytes Z) - at 742MHz is 95GB/s - before Z testing or texture fetching. Hmm...

    Jawed
     
  8. Xmas

    Yes.
     
  9. Mintmaster

    One more rendertarget is a small cost in the grand scheme of things. If you use stencil (as all DR's should) and can't use 32-bit Z, then it gives you higher quality too due to precision. Note also that Z readback needs more complicated math too compared to distance, and you need to copy the Z buffer if you're stenciling, so the difference is even less.

    Anyway, neither Arun nor I had any issue with admitting there's a small performance penalty for adding distance to the G-buffer. This whole debate has been about the IQ difference between that and Z.

    5x overdraw for opaque pixels? With all the early Z rejection hardware enabled? That's nonsense.

    First of all, Z is compressed. That's why modern video cards can reach their peak pixel rates (check xbitlabs reviews).

    More importantly, 4 MRTs take 4 cycles for the ROPs to output. Forget about the shader, as the limitation is at the ROPs. You could only output 4 pixels per clock in your scenario, so even ignoring Z-compression, it's 59GB/s.
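    The ROP-limited version of the same calculation (same 20 bytes per G-buffer pixel as in Jawed's figures):

    ```python
    R600_CLOCK_HZ = 742e6
    BYTES_PER_GBUF_PIXEL = 4 * 4 + 4   # 4 MRTs x 4 bytes colour + 4 bytes Z

    # 4 MRTs take 4 ROP cycles, so only 4 G-buffer pixels complete per clock
    bw = BYTES_PER_GBUF_PIXEL * 4 * R600_CLOCK_HZ / 1e9
    print(bw)   # ~59 GB/s, ignoring Z compression
    ```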

    BW limitations kick in with alpha-blending and uncompressible MSAA, and should it happen with the latter, then the improvement is even less than the 2x I assumed earlier. Regarding texture BW per pixel, it's a fraction of the large framebuffer BW in DR. Most textures are compressed, and shadowmap samples are done in the shading pass.

    Like I said, that's just one of many possibilities. You can pick almost any rotation angle and any resolution, run the lighting shader on each pixel, then fit it to the screen as you would with any arbitrarily sized and oriented data in image processing. I doubt you can have high resolutions anyway when supersampling the lighting shader.
     
  10. Jawed

    I'm surprised you're writing off an extra 25%, say.

    http://forum.beyond3d.com/showpost.php?p=1094748&postcount=1554

    There seems to be no problem with 32-bit Z and stencil.

    You can't stencil the lighting with the Z-buffer in place? Or do you mean after you've stencilled and want to read it for lighting?

    I think Carmack, in presenting Rage, mentioned 5x overdraw.

    You didn't seem unhappy with 5x:

    http://forum.beyond3d.com/showpost.php?p=932025&postcount=18

    Unfortunately, as you add MRTs, the proportion of Z diminishes, which means the proportion of bandwidth saving you'll get from Z compression diminishes even further.

    I agree, provided that's how the ROPs treat MRT writes (which, admittedly, seems likely). Seems a shame that when outputting MRTs the ROPs are unable to "re-use" the otherwise idle "Z bandwidth". I suppose, in theory, the way the ROPs handle transfer of blocks into memory should maximise utilisation of bandwidth - so this "unused Z bandwidth" problem is prolly a red herring.

    Well, it's hard for me to quantify the bandwidth associated with the textures: how many textures for albedo; what maps are used: normal, diffuse, specular, ambient etc. In a forward renderer these texture reads tend to be spaced out a bit, whereas the G-buffer pass tends to cram them all together, which is asking for texture cache thrashing on top of whatever bandwidth they consume.

    But it may be that with G80/R600 bandwidth, per se, is no longer the issue just because they have quite a lot compared to their colour rate. But colour rate then becomes the problem. And so you still have the fact that writing Z explicitly as an MRT is going to have a noticeable additional cost.

    Jawed
     
  11. Mintmaster

    25% of just the G-buffer pass and only when not vertex limited. I'd bet it's under 10% of the total frame cost.

    Yeah, looks like I erred there. I didn't see the new 64-bit per sample depth-stencil format in DX10 (DXGI_FORMAT_D32_FLOAT_S8X24_UINT). I seriously doubt using it comes without a performance penalty, though.

    Not according to Xmas and SuperCow. You can't use the depth/stencil buffer for culling pixels (those not visible by a light source) and simultaneously use the same buffer for lighting. So we're talking about a 64-bit per sample read and write!

    Carmack was not talking about opaque pixels, nor was he talking about overdraw that made it through all the Hi-Z/early-Z culling. Deferred rendering is very friendly to sorting too because all pixels pretty much use the same pixel shader. As for me, I was intentionally picking a large number as a worst case to show how huge 10GPix/s is.

    No it's not a shame, because Z BW is a lot smaller than colour BW for pixels that get through culling. If colour+Z BW is 5 bytes for one MRT, it'll be 4.5 for two MRTs over two clocks, 4.25 for four MRTs over four clocks. Does that difference really warrant doubling the MRT output? Especially when MRT is unused in almost all forward renderers?
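    Those per-clock averages work out if you assume compressed Z costs about 1 byte per pixel against 4 bytes of colour per MRT (my reading; the post doesn't state the split explicitly):

    ```python
    def avg_bytes_per_clock(n_mrts, colour_bytes=4, z_bytes=1):
        """n MRTs take n ROP clocks; Z is touched once per pixel, not per MRT."""
        return (n_mrts * colour_bytes + z_bytes) / n_mrts

    print(avg_bytes_per_clock(1))  # 5.0
    print(avg_bytes_per_clock(2))  # 4.5
    print(avg_bytes_per_clock(4))  # 4.25
    ```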

    There are a few things to notice between a DR's G-buffer pass and a comparable FR to determine BW loads:
    A) Both have equal texturing demands aside from shadowing
    B) G-buffer pass does no shadow map sampling, but FR does (32-bit, uncompressed, multiple samples, no mipmapping)
    C) G-buffer has 4-5x the write BW

    If we assume the framebuffer to texture BW ratio in a FR is 1:1 (and I personally believe texture BW is lower), then all the points above suggest that the G-buffer pass will be well over 5:1.

    For the last time, DX10.1 does not remove that for free. To get rid of one 32-bit rendertarget, you must:
    A) Use a 64-bit per sample Z/stencil-buffer to maintain accuracy (and it's still not as good as FP32 distance, but hopefully good enough)
    B) Copy this big Z-buffer before lighting
    C) Do extra math to convert Z into distance in the shading pass

    This is not free, and you're doing it all as an alternative to what I estimate is under 10% of a perf hit.
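    Point C is the usual non-linear depth mapping. A sketch of the conversion for a standard D3D-style projection (derived from the projection matrix, not from the post; n and f are the near and far planes):

    ```python
    def z_buf_from_eye(z_eye, n, f):
        """Standard D3D projection: eye-space depth -> [0,1] depth-buffer value."""
        return (f / (f - n)) * (1.0 - n / z_eye)

    def z_eye_from_zbuf(z_buf, n, f):
        """The extra per-pixel math a deferred shader pays to undo it."""
        return n * f / (f - z_buf * (f - n))

    # Storing eye-space distance in the G-buffer makes this a plain read instead
    z = z_eye_from_zbuf(z_buf_from_eye(10.0, 1.0, 1000.0), 1.0, 1000.0)  # ~10.0
    ```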
     
  12. Xmas

    It's more data so it can't be free, but I doubt it's implemented with 24 wasted bits.

    FP32 depth is usually already limited by vertex shader precision, and interpolation of Z should be more precise as it is linear in screen space.

    One rcp if you store 1/Zeye.
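    A toy example of why 1/Zeye is the convenient thing to store: it interpolates linearly in screen space, and recovery is the single rcp mentioned above (numbers are illustrative only):

    ```python
    # Edge endpoints at eye-space depths 2 and 8; store 1/z_eye per vertex
    z0, z1 = 2.0, 8.0
    w0, w1 = 1.0 / z0, 1.0 / z1

    # Plain screen-space (linear) interpolation of 1/z at the midpoint...
    w_mid = 0.5 * (w0 + w1)
    # ...then one rcp recovers the perspective-correct depth
    z_mid = 1.0 / w_mid
    print(z_mid)   # 3.2, not the 5.0 a naive linear z would give
    ```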
     
  13. Mintmaster

    Possibly, but even if you're right, it's an additional cost for every Z-test done during the G-buffer pass. As an aside, do you think the stencil buffer is kept separate? Or do you think the depth/stencil has a stride of 5 or 6 bytes as opposed to 8? Trying to save some space this way might pose problems/costs for depth/stencil readback, as all other formats written have power-of-two strides, and a split buffer has its own costs.

    You think iterator interpolation isn't done at full precision? I know it's mathematically more complex than screenspace interpolation, but I'd still expect it to be done properly.

    I'll admit, though, that in my last post I forgot about how you can get a great distribution of values when mapping 1 to the near plane, as you've pointed out in other threads. So my minor point about precision is now debunked.
    True, but it's still something. I'm just pointing out that when you add all these things together and compare it to the cost of another rendertarget in the G-buffer, the total difference won't be particularly big.
     
  14. SuperCow

    Newcomer

    Joined:
    Sep 12, 2002
    Messages:
    106
    Likes Received:
    4
    Location:
    City of cows
    If you use stencil marking for all your deferred lights then you should get away with not having to copy the depth buffer:

    1) Mark stencil for pixels inside light volume using ZFail method (only writing to stencil)
    2) Unbind Z buffer, bind Z buffer as texture and apply lighting equation for stencil-marked pixels

    The problem is that it is not always economical to use stencil for all your volumes. E.g. it can be sufficient to directly apply lighting by rendering the back faces of a light volume whenever the camera is inside the volume. In those cases you need access to Z at the pixel shader level while still having your depth buffer bound for reads (which means a Z-buffer copy, or writing Z into an MRT in the first place).
     
  15. Xmas

    I believe that even for uncompressed render targets we're looking at tiled buffers, so the stride is always at least a multiple of 32 bits (or more likely 128 bits). Z and stencil could be separate or interleaved at the pixel or tile level, in the latter case possibly even with different tile sizes. I'm not sure there's a single best implementation.

    With separate tiles you could use different compression methods, which might make sense depending on the scene. Stencil typically only changes at object edges, but since it can be read-modify-write, hidden edges can cause a change in stencil too. Depth slope changes at (most) visible triangle edges.

    A more complex operation will likely cause some precision loss, even if "done properly".

    Oh, I certainly agree there's no huge difference to begin with, and it's only getting smaller in the future.


    You can't unbind the Z buffer without unbinding the stencil buffer. So at this point you have lost access to stencil information.

    Demirug is right that a smart driver can handle using the buffer simultaneously without a copy, but I wouldn't rely on it.
     
  16. SuperCow

    Doh! Need more coffee!
    And you can't even access the stencil buffer as texture either since the Z Buffer is already viewed as a texture for Z data (Z and stencil require different views). Not that it would be a good idea anyway since you'd lose the early stencil optimizations.
     
  17. Xmas

    You can bind multiple views of the same resource simultaneously as inputs AFAIK. But yes, marking lit pixels with stencil only makes sense if you can use an early stencil test.
     
  18. santyhammer

    Newcomer

    Joined:
    Apr 22, 2006
    Messages:
    85
    Likes Received:
    2
    Location:
    Behind you
    1. Cube map arrays. Can be used in GI techniques.
    2. Separate Blend Modes per-MRT. Good for ... hmm...
    3. Increased Vertex Shader Inputs & Outputs. Doubled from 16 to 32. Well, more interpolation registers are always good.
    4. Gather 4 sampler. Good for GPGPU and PCF shadows.
    5. LOD instruction. Good for custom texture filtering.
    6. Multi-sample buffer access. Lots of uses: custom edge filters, adaptive or custom AA, HDR resolve, etc. Basically you have access to samples, not only pixels.
    7. Programmable AA Sample Patterns. To experiment with custom AA for extra image quality.
    8. Standardized minimum features: min 4x MSAA, FP32 and INT16 filtering/blending, increased FP precision (0.5 ULP)

    For me the two most important features are cube map arrays (#1) and the multi-sample buffer (#6)... and they're enough to require a new graphics card, so start blaming your new and shiny GF8800. DX10.1 is supposed to come with Vista SP1.

    As for deferred shading, it can be hard to do without these new things... and you need to weigh the blending problems + MRT bandwidth and memory consumption + rendering light volumes with stencil ops against multipass/multiple-light iteration in the shader with a pre-Z pass. I personally don't find it very suitable, but hopefully these DX10.1 features can help to improve it.
     
    #58 santyhammer, Nov 21, 2007
    Last edited by a moderator: Nov 22, 2007
  19. Mintmaster

    Regarding #6, remember that DX10 had multisample buffer access for everything except the depth-stencil buffer.

    How can deferred rendering take advantage of per-MRT blend modes? I'm curious about that, as I don't really see any way out of doing transparency in a forward rendering pass.

    BTW, I agree with your assessment of deferred rendering. However, there are some techniques that just can't get by without it.
     
  20. SuperCow

    I was asking myself the same question. Off the top of my head it could be useful to apply decals (bullet holes, blood splatter, etc.) to the G-Buffer. Maybe you want to completely overwrite the normal but blend the diffuse with the existing value in the G-Buffer or something similar. (In theory you'd probably really want to blend the normal too but you need programmable blending for that).
     
