Three new DirectX 11 demos

Discussion in 'Rendering Technology and APIs' started by Andrew Lauritzen, Jul 31, 2010.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Sigh I totally forgot about shuffle - I remember being excited about that, once :lol:

    Well both GPUs support concurrent compute kernels now (still pretty fuzzy on what that really means on ATI, though) so if it isn't in 12, well there's always LRB...

    I agree generally.

    The learning curve is still too steep even for the hardware designers (local atomics and tessellation being great examples, currently), so I think this needs repeated revisiting. (Then there's the physics of what's buildable, which I think is why tessellation is poor in ATI currently - not sure what the deal is with local atomics in NVidia.)

    Some of Intel's Terascale work looks more like Transputer than Larrabee (maybe that's me in wishful thinking mode) and there's quite a vocal contingent who think the cache architecture of Larrabee isn't viable in the long term, where we're talking hundreds and thousands of cores.

    So if we're really going to talk about programming models that can last longer than 10 years, then at best one can only hope for "local cache", whatever that actually means when programming 1000 cores.

    :razz: Well my posting this morning was the summation of my thoughts as I fell asleep last night before having a proper look at your code.

    You reported about 1.4ms to compute the histogram on ATI, as I understand it (I've no idea what kind of overhead that has...). That's about 380 million cycles. For about 2 million samples? (Or is this 4xMSAA, 8 million samples?)

    So around 200 cycles (or 50 for 4xMSAA?) per sample?

    The inner loop is 31 cycles according to GPU Shader Analyzer.

    Yeah, I was assuming a "persistent" kernel.

    As I haven't done this stuff for real I don't understand the issue here. But I have a feeling I'm seeing coherent fetches where there aren't any :???:

    I made that point partly to defend against the cache thrashing that technique 3 engenders. On ATI clause switches cost significant cycles, so a single clause of 16 TEX is preferable to 16 clauses of 1 TEX - the latency of the latter and the increase in cache thrashing it directly causes would hurt 3.

    Does the entire histogram need to be generated each frame?

    As the camera translates the histogram is, in general, shifting coherently, isn't it?
     
  2. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    If you are sharing active data between hundreds and thousands of cores, it doesn't matter if you are coherent or not. Which is what all the coherency bad people apparently don't get. What coherency buys you is significantly increased flexibility and dynamic scalability. The actual cost of the coherency when done correctly is fairly low in the actual hardware. In a lot of ways its a software/hardware trade-off that has been going on for decades. Every single time, coherency has won, simply because it is more flexible and thus requires orders of magnitude less software complexity as you scale.

    It means you don't have to track every bit of data and can instead optimize only for the data that has a performance impact.
     
  3. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yeah it's nice although packstore is still the niftiest instruction IMHO :)

    Right although "concurrent" doesn't necessarily mean "can be launched by the GPU" or "guaranteed parallel execution" or anything like that. It's unclear how restricted the GPU schedulers are here.

    I hear that vocal section but TBH they're mostly software people making broad claims about hardware engineering based on generalized logic at best (of course this doesn't apply to everyone, but I've met a lot of these people :)). When I ask the hardware people though they say that it scales fine and I'm inclined to trust them more...

    I'll accept some argument over requiring full coherence but I really don't see how we can get away without proper caches in the future. (As an aside, I'm also told that once you have a cache, coherence isn't very hard or expensive and scales fine too...) Unless people come up with something equivalently clever it will relegate GPUs to operate only on the fairly small subset of regular problems that they can already handle now. I think Fermi made a definite step in the right direction on this one and I'm excited to see what comes next.

    What do you mean by "local cache"? Fermi's L1$ counts IMHO, and that's all I'm really asking for. If they get it fully pipelined with atomics and the like then we're 90% there.

    No MSAA at the moment so yeah, ~2 million samples.

    The "gather" fetch in the shading languages is sort of like bilinear in that you give it a float texture coordinate and get back the four surrounding elements. Not only does this mean having to do a silly int->float in this case but it also messes with the typical parallel domain since now each work item is handling those 4 elements in its kernel. Not a huge deal and definitely possible to do but not a trivial change to the code either.

    I also don't get why they don't just rearrange the compute domain into gather4-style blocks if their hardware benefits from this... DirectCompute leaves that completely up in the air and implementation-defined. I must admit to not completely understanding the fast path here on ATI and how the gather4 stuff relates to linear memory loads.

    For complete correctness, yes. It's like occlusion culling... even a minor shift from one frame to the next could reveal a new object (or occude an old one). You can start to play games but you immediately lose the guarantee that ever screen-space sample will have a shadow map sample. How important this is depends on stuff like the frequency of geometry in the frame and the speed of movement of the camera and objects.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    SCCC doesn't have coherency, that's my point.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Yes, I think Fermi L1$ fits the bill.

    Well, that's pretty disturbing then, because on ATI histogram generation is running at ~1/6th throughput: theoretical 31 cycles for the inner loop takes ~200 cycles.

    I have to admit after seeing the projection/unprojection stuff in the code I'm lost. I think you are sampling raw depth, so should be able to sample contiguous areas from depth coherently (and fast), but...

    The mapping from compute domain to data should be totally under the programmer's control, ultimately because it's a balancing act. e.g. it can be optimal to do 16 gather4s or only 1. That decision is the programmer's. On NVidia the optimisation profile is different...

    Every fetch of a mere 32 bits in your code results in the 128-bit pipe from L1 to registers being 25% utilised (per core it's a 512-bit pipe). You've only got ~1TB/s of that bandwidth in Cypress (54GB/s per core).

    In the end ALU:TEX is so high that Z fetches shouldn't be in the picture.

    What is the worst-case with 100% collision rate?
     
  6. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    But you're assuming no collisions there right? There are *lots* of collisions! These need to be serialized across the SIMD lanes. For the 7 scatters and assuming 1 cycle/colliding SIMD lane at best...

    Yes it's pure 1:1 streaming. Is there a faster way than to just write the obvious code that maps work items to pixels and loops over chunks of the screen? All the code I've seen uses this style and I've never seen explicit "gather4"s for performance or anything.

    Right but in all the current APIs it is implementation defined. NVIDIA "sort of" defines it as row major and I believe ATI tends to use the same convention but none of the APIs guarantee it. The best you can do is just use a 1-D domain, assume the SIMD packing is linear in this domain and do the addressing math yourself.

    Right but how does this interact with the work-item packing? Can it not collect four of those fetches from adjacent work items and do a 128-bit fetch similar to NVIDIA's coalescing?

    It doesn't get much slower, which is why this implementation is good. You can test yourself by just moving the camera and walking up to a wall :) That case absolutely destroys the vertex scatter implementations but this one remains pretty fast due to the local atomics. Of course the SIMD lanes still get serialized but there is really no avoiding that.
     
    #26 Andrew Lauritzen, Aug 13, 2010
    Last edited by a moderator: Aug 14, 2010
  7. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    meh, research project. Not a product. huge difference in end users, resources, time, etc.
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Yeah, that's why I asked about the worst case.

    But now that we've discussed it more, it seems the algorithm needs to be optimal for what is currently causing 100% collision rate. i.e. it would be preferable if it was collision-free, or at least to minimise the cost of the worst-case.

    In general, if bandwidth/latency-hiding is a bottleneck, then it's faster. I think the DC sample universe is a bit new as yet.

    Sampling each element only once means that bandwidth is capped by memory controllers, so the potential gain from gather4 is then mostly about defending against cache thrashing (since multiple hardware threads share a core) and, in the case of ATI, minimising systematic latency caused by clause switching (I think NVidia is mostly immune here).

    Obviously if sampling is truly stochastic then elements will end up being fetched multiple times simply due to the banking/burstiness of DDR.

    But your kernel is far away from being Z-fetch bandwidth bound so that isn't a motivation for gather4.

    Hmm, I was under the impression it's strictly defined row-major for work-items and work groups in all APIs.

    http://msdn.microsoft.com/en-us/library/ff471570(v=VS.85).aspx

    If work items address contiguous areas of samples, then yes you'll get coalesced memory access and nice cache usage.

    But bandwidth twixt TEX L1 and registers will go unused if each fetch is less than 128-bits (on ATI - who knows what it is on NVidia?)

    Bear in mind that a filtered texel's components are delivered to the pixel shader as floats, i.e. in graphics a 4-component texel is 128-bits in the register file, in general. So 128-bits of data in the register file are expanded from, say, 4 bits of compressed texel.

    So gather4 was created as it's common to have a single-component resource, but single fetches of 32-bits leaves a lot of L1->register-file bandwidth on the table.

    In CUDA the coalescing that's often discussed is for fetches that don't go through the texturing hardware. Texturing does, too, but texturing hardware is not used much in CUDA.

    In DC shaders it's unclear to me how NVidia uses the available hardware - e.g. in your shader I have no idea whether the texturing hardware is being used or whether the load/store path is being used. These two paths are distinct and parallel, each with a dedicated L1. Fermi documentation suggests that load/store is the fastest path for compute. So there's a decent chance that 32-bit fetches are actually optimal on GF100.

    In ATI you can see from GPU Shader Analyzer that it's using the texturing hardware:

    Code:
            14 TEX: ADDR(274) CNT(1) 
                 27  LD R0.x___, R2.xy0w, t2, s0
    and that 96 bits of bandwidth are lost to the three "_"s.

    The advantage of fetching say four samples per work item is that you can test if all four samples have the same bin. In that case you can use a single atomic to increment by 4 instead of four distinct atomics of 1, so this should be 4x faster for the worst case.

    Similarly you can aggregate min and max for the light if all four samples hit the same bin, for another 4x speed-up.

    Then you can ask: is it also worth testing if any 3 samples have the same bin? What about any 2?

    Now, the fun bit is what's the optimal count of samples per work item? It might be 2. Or 8. etc.

    It would be amusing if 2 was optimal. Though I suspect 2 won't help make this approach viable on NVidia, it appears you would need 8.

    ---

    On NVidia, because it seems atomics are running at "return-value rate" and since a sample's light coordinate can't be both min and max, it might be worth making min/max atomics mutually exclusive. i.e. use the return value from max to decide if the min atomic is even worth executing.

    Still, it seems unlikely this will be enough of a performance gain.

    Might be worth giving this memory, i.e. each work item remembers binID and min/max for the prior sample. Potentially have several bins memorised like this. Anything to avoid a collision...
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Larrabee is not a product either. And as for time, I'm explicitly referring to 10 years' time when we have 1000 core chips.
     
  10. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    That is only in terms of the indices that you get into your shader. There's no discussion/guarantees about how those indices get packed into a SIMD block/warp/wavefront since they still like to pretend that these machines might not be SIMD :) Implementations could pack indices [0,0]-[16,0] into a SIMD warp, [0,0]-[0,16], [0,0]-[4,4], etc. Obviously each of these changes the memory access pattern of the kernel.


    It seems like they use the load/store path for everything but real texture lookups (not entirely sure what they do with "Load", but definitely for the fetches that require no data or address conversions, etc). These seem to go through the regular path and get cached in the core L1.

    Ah I see, interesting. I guess that's probably because it's from a Z-buffer which is most likely swizzled, etc. That actually begs the question as to whether or not NVIDIA has to go through the TEX unit to get the address swizzling as well... quite possibly yes.

    The next question of course is how gather4 works (or doesn't?) with MSAAed surfaces since this code technically references a sample index as well (although it basically gets compiled out when MSAA_SAMPLES=1).

    Sure, and this is precisely my argument for lane swizzling! :) But it should be noted that it reduces some of the overall parallelism of the kernel (turns it into ILP effectively) which with these high-end ATI GPUs is becoming problematic... they're so wide and require so much latency hiding that it's hard to fill them, particularly at lower resolutions.

    Sure. With lane swizzling I do the address math across the SIMD lanes and conservatively combine ones that match. I don't test all of the possibilities but I do a pow-2 style paired reduction and only combine addresses that match. This catches the most important case of "all the same" and does pretty well in the average case too. With Gather4 I could certainly do something similar for the 4-wide vector.

    Well, it can be both min and max if it's the first/only sample in the bin :)

    Software cache style? I assume it would be best then to reorganize the threads so that each one traverses a continuous range rather than each iteration of the warp handling a block. This would sabotage memory coalescing on the load but I guess you could technically always load it in and reorganize in shared memory... if that worked out to be worth it.

    Definitely an interesting suggestion... I'll see if I can get some time to play with it!
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    With the dual constraints of local memory (shared) and workgroup synchronisation the hardware threads that constitute a workgroup share a core.

    Underlying that is a 1D execution domain and one or more 1D buffers, regardless of what's declared.

    Maybe NVidia's latest tools provide some insight?

    Well, I don't have any insights on what NVidia prefers, and I'm not much good at D3D either...

    Not sure what happens with 8xMSAA, is 4xMSAA a special case with a nice swizzle and everything else is linear?...

    Which I think is viable on the GPUs, through local memory as a communication medium.

    And what happens with 1000 cores?

    Actually, of course, 1000 cores means you've got "1000 atomics per clock" - so that would be pretty useful for this kernel. Ironically Larrabee has the highest core count of all these competing chips currently.

    So is the Larrabee version faster than 1.4ms? What's the worst-case like?

    Is the worst case uniform Z? I suspect not - I guess uniform Z is the fastest and more random Z is slower.

    Ha, well that comes out in the wash in the end, if the minimum never budges from its initial, dummy, value.

    Yeah some kind of software cache approach.

    Quite apart from the variations in GPUs out there, even given a single chip there's so many algorithm choices. It's like trying to find the single cut through a 12-dimensional cake that gets the best bits.
     
  12. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yes for sure, but my point is that you need to know that mapping of "declared" domain to "real" domain to manipulate the memory access patterns.

    Fine, but like i said it's not reasonable in the current programming model due to syncs.

    Ahmdahl's law :) Seriously though we need to put some hardware into more cleverness now... we're reaching the limit of just "more" IMHO.

    Can't really talk specifics unfortunately.

    For any scatter-based algorithm uniform is going to be the worst case... ideally that scatter/atomics taper nicely into log(n) reduction as data becomes more uniform but in practice it tends towards linear at best for a non-adaptive algorithm.

    Sure, but you'd still have to do a separate branch for each of the 3 min's... not sure that would be worth it but I guess you could try. It would also necessitate adding a return value but as you note maybe on NVIDIA you're already paying for that.

    Indeed and I have tried a variety of approaches but in my past experience software caches that actually involve any address math are far too slow. Maybe a single-element "cache" could be a win though for this data set. Again having to go so wide to fill the machine means there's a limited number of work-item iterations that can actually benefit from this and I imagine trying to do this in shared memory would be prohibitively slow compare to just the atomics.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    OK, being more explicit: the mapping is linear on the GPUs. The developer doesn't have to unwind a space-filling curve or some other abomination.

    Of course the developer might decide that it's worth implementing a space-filling curve for their algorithm's data-access. prunedtree's matrix multiplication algorithm does this, I think - effectively simulating the pixel shader access pattern in a compute shader. A newer variant of the algorithm, which uses a pixel shader as a compute shader (i.e. it doesn't use pixel outputs at all), gets the same performance without manipulating addresses as prunedtree does.

    Hmm, but your algorithm is adaptive on Larrabee, which is why it seems like it should be fastest for the uniform case, since that's a single write to memory whereas any other Z distribution incurs more atomics.
     
  14. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    That's true in practice, but that is not guaranteed by the APIs/spec! That's my point.

    Sure, but they are only "local" memory "atomics" so it's not a big deal relatively-speaking. SIMD size is relatively smaller than GPUs and scalar throughput is better as well. In practice there is only moderate speed variation depending on the data collision pattern, and in any real graphics scenes the range is between "very locally coherent -> completely locally coherent" :) You never see highly random patterns even with foliage and the like.
     
  15. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Andrew (and Marco), thanks again for great research. If you have tried to port your algorithm to console hardware, I would like to hear about your experiments.

    I am planning to use SDSM in our next console project (we are already using EVSM filtering and PSSM), so I started formulating the min/max z and the cascade min/max xyz operations for hardware without compute shader synchronization features or atomics.

    The first draft (z min/max from 1280x720 depth texture --> min/max in two floats in a constant buffer):

    - One PS pixel samples a 4x4 grid (16 tex instructions), and outputs the min & max value (to a 16f-16f RT pixel)
    - Rendering to 4x4 rendertarget (16f-16f), one triangle covers the whole rendertarget (4x4 = 16 pixel triangles - enough for proper utilization)
    - In total one (4x4 pixel) rendered triangle processes 4x4 * 4x4 = 16x16 = 256 input pixels
    - Render (1280x720/16 = 80X45 =) 3600 triangles (4x4 pixels each) on top of each other with max blend in a single draw call.
    - Resolve the 4x4 render from EDRAM to a texture.
    - Read the 4x4 texture (16 samples) with one PS instance, get min/max from the 16 samples and memexport the min/maz z values to a constant buffer.

    Blender provides the atomic max operation (and negative values with max blend = atomic min). And blending to EDRAM is bandwidth free. There is no need for a vertex buffer (vertex index is enough to calculate the "full screen" triangles with same vertex positions and different texcoords each).

    Rendering to a 8x8 RT would drop the triangle count to 900, and 16x16 would drop it to 225. But both require one additional small downsample operation (read 16x16 -> output min/max to 4x4).

    Similar algorithm would likely be usable for finding the min/max xyz values for all the four cascades, but it would require four rendertargets (of 4x4 size each). And the final memexport step could be done in parallel for all the four cascades... saves a few cycles :).

    Alternatively I have been toying around with an algorithm doing recursive min/max downsampling and memexporting the result xyz maximums/minimums of each block. Too bad there are no atomic min/max memexports, it would make things much easier :)
     
  16. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    At least in CUDA, the mapping is guaranteed and explicitly set forth in the documentation.
    See pages 8 and 81 of the current CUDA programming guide:
    http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf
     
  17. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Given the nature of our current jobs, we have not, but I'm really interested in hearing about your experiences doing it and happy to share any experience we do have with the algorithm!

    Awesome! Curiously, 16-bit EVSM? Impressive that that works well enough if so.

    Note as well that we toyed with the idea of undersampling the depth buffer (i.e. only look at every other value in each dimension for instance)... you lose the guarantee of having a shadow sample for every pixel but it should be rare that you hit really bad cases. Might be something to play with if performance is a concern.

    Clever - yeah definitely makes sense to use/abuse the EDRAM/ROP setup in the 360 :) The typical solution FWIW is to do an O(logn) pass reduction with ping-ponged textures where each pass renders a primative that covers ~N pixels where N is around how many SIMD lanes you need to fill the machine. Then in the pixel shader it loops over the relevant block (resolution/N) and accumulates the reduction in registers, writing the final result out. You usually only need a few passes of this given current GPU core counts. It's usually worth doing the lookups "strided" for a given pixel invocation as that makes the lookups across pixels in the SIMD batch coherent (consecutive) and avoids bank conflicts.

    That's the implementation we used in DX9/10 and it's quite fast. Yours seems even more clever on the 360 though as long as the rendering of 3600 triangles doesn't bottleneck too much on primitive setup or otherwise. I'm interested to see your results.

    Yup we do the same.

    True, although are you planning to run the SDSM partition algorithm on the GPU or CPU? If the latter reading back the last level of the reduction can actually be faster than doing it on the GPU since GPUs are actually pretty bad at the "last few levels" of these reductions (which are basically serial work).

    Yes for RTs or four passes, whichever is faster.

    Yes that's similar to the pixel shader implementation that I described above. It should work well enough.

    Very interested to see your results and I'm glad you're trying it out!
     
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    The fill rate deficit for using four render targets should be pretty much masked by the 16 tex instructions. So four RTs should be faster (in my algorithm draft at least).

    For 2x2 recursive downsampling, it should be likely the best choice to render the first pass to 4 render targets (first pass source data contains pixels that belong to all cascades), and the later passes to single render targets (pixel data for each cascade is already separated in the first pass to the four render targets).

    Actually the xyz min/max (= 2*3 channels) for four cascades would take 2*3*4 = 24 channels. So that would require six render targets on pre DX10 hardware (as pre DX10 hardware doesn't have integer registers and integer bit operations, so packing 2x16 bit values to a single 32 bit integer output is not possible). And DX9 didn't support more than 4 simultaneous render targets, so doing it this way requires at least two passes (3 rt:s each).

    Recursive output with memexport would allow me to write all the 24 values (xyz min/max for four cascades) in adjacent memory locations from a single shader. But I would lose the blender atomic min/max operations, so it would be more like a traditional recursive downsampling that way. Seems I have to implement both and do some profiling to see the bottlenecks.

    Actually there are some visible graining artifacts, as I currently use 16-16 integer buffers. I use separate bias and scale for both channels, as they have different value range. Did you do experiments with integer formats when you developed the algorithm? (on consoles filtering from integer buffers is generally faster)
     
    #38 sebbbi, Oct 26, 2010
    Last edited by a moderator: Oct 26, 2010
  19. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Makes sense. You may be able to early-out of certain tiles on the screen even as it's common for sub-tiles to below entirely to one partition (and fairly easy to detect). That said this falls into "optimizing for the easy case" which I typically try to avoid in game algorithms.

    Right although if you have a small enough scene you may be able to ignore the z bounds... they really do help the light bleeding and precision with any filterable representation though.

    Hmm yeah... unfortunate sequence of missing features :(

    Yeah unfortunately I don't have good intuition for the 360 memexport/EDRAM trade-offs. I'm curious about the results though if you do implement and profile one or both!

    Interesting, and you manage to make that filter correctly? Impressive :) No I haven't actually played with fixed-point math with EVSMs since they so perfectly fit the floating point representation. Indeed they have better useful precision with float formats than conventional VSMs which only effectively make use of the mantissa (although of course this works well for fixed-point formats).

    On PC as you probably know there's no support for 32-bit fixed-point/integer filtering which makes 32-bit float (which is filterable on all released DX10+ HW luckily even though not required to be) the most viable solution. I briefly looked into whether EVSMs made 16-bit float more viable and while it does improve the situation over VSMs slightly it's not really enough for large scenes. This was all before the SDSM work though so there's a possibility that the aggressive depth range clamping helps a lot there.

    Again, very interested in your experiences. I'm glad to see you taking this stuff to the next level already!
     
  20. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Optimal depth bounds are very important because of the EVSM filtering, and that's why I want to have the tight SDSM z-bounds as well. A tight near clip allows us to clamp the objects before near clip very close to the receivers (like you said, this reduces light leaking considerably), and a tight far plane in combination with tight near plane brings the best possible precision to the 16 bit fixed point.

    The 16 bit EVSM works quite well. The filter quality is very good for the closest cascades, but there's visible graining in the furthest cascades (we have around 2 kilometer view distance). The quality (with four cascade PSSM) is not yet something I would release as a final product. But with tighter SDSM bounds and some fine tuning, the 16 bits are likely enough (at least I hope so).

    I plan to use the z-buffer we render for the virtual texture page fetching to feed the SDSM system. It's 4x4 smaller than the game resolution, so the downsampling performance should be better. Currently we render the virtual texture page fetch frame by using a crude camera movement approximation (future frame). But I could modify it to render the next frame almost perfectly: perfect next frame camera matrix and near perfect object movement approximation (position & rotation added with velocity & angular velocity). This would allow me to lock the result next frame (prevent lock stalls) and use the information for SDSM (it should provide more correct result than just using last frame min/max xyz). And the downsampled depth buffer is really useful for occlusion culling as well (we plan to have similar real time occlusion culling system than they have in CryEngine 3)... Two weeks ago I refactored our tile based deferred away, but if we get near perfect tile near/far depth values (without lock stalls), it might change my plans a bit :)
     
Loading...

Share This Page

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.
Loading...