How much can raytracing be separated from shading?

Discussion in 'Rendering Technology and APIs' started by Shifty Geezer, Oct 10, 2019.

  1. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,891
    Likes Received:
    11,483
    Location:
    Under my bridge
    A rumour in the console prediction discussion has a theoretical 7TF Navi console with 800 GB/s bandwidth and an unspecified amount of raytracing hardware.

    DXR incorporates raytracing into shaders, with surfaces shaded on the shader cores and the accelerating RTX hardware invoked only to perform ray-intersection tests, so RT performance is bound to compute. Under that model, 7 TF of compute seems ineffective and doesn't tally with the very high bandwidth.

    If this console is Sony's PS5, it would not be bound to DXR, and Sony would be free to pursue other avenues. Ergo, what are the options for RT solutions given such a bandwidth-heavy design? Could you trace geometry in bespoke hardware and populate a geometry/surface buffer with surface IDs, perhaps, and then shade? The traditional raytracing algorithm is very simple, tracing and shading surfaces as it goes, iteration by iteration, but given the success of deferred rendering, is there an opportunity for some form of deferred raytracing, or something else that handles geometry and shading differently, wherein a 7TF console would need those shaders only for surface shading?
     
    mahtel, TheAlSpark and JoeJ like this.
  2. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    513
    Likes Received:
    614
    Sounds reasonable to me. Separating tracing and shading, and doing both independently, is what I would have expected for realtime HW.
    So it could look like this (a rough code sketch follows the list):

    1. Trace for one intersection, appending all hitpoints to a huge list.
    2. Bin/sort hitpoints by material/textures to get cache-efficient shading, like we have with rasterization.
    3. Determine the next ray direction of the path (which depends on the material if we want proper MIS).
    4. Optional: reorder rays so they are spatially coherent, giving cache-efficient tracing too.
    5. Continue with step 1 for the next ray of the path (submitting a large buffer of rays to the RT cores).
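
    A minimal CPU-side sketch of that loop, just to make the data flow concrete - traceClosestHit and shadeAndSpawn are illustrative stubs standing in for the RT cores and shader cores, not any real API:

    Code:
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Vec3 { float x, y, z; };
    struct Ray  { Vec3 origin, dir; uint32_t pixel; };
    struct Hit  { uint32_t materialId; float t; Ray ray; };

    // Illustrative stubs standing in for the RT cores and the shader cores.
    Hit  traceClosestHit(const Ray& r)          { return Hit{ r.pixel % 8u, 1.0f, r }; }
    bool shadeAndSpawn(const Hit& h, Ray& next) { next = h.ray; return false; }

    void wavefrontTrace(std::vector<Ray> rays, int maxBounces)
    {
        for (int bounce = 0; bounce < maxBounces && !rays.empty(); ++bounce) {
            // 1. Trace one intersection per ray, appending every hitpoint to one big list.
            std::vector<Hit> hits;
            hits.reserve(rays.size());
            for (const Ray& r : rays) hits.push_back(traceClosestHit(r));

            // 2. Bin/sort hitpoints by material so shading touches textures coherently.
            std::sort(hits.begin(), hits.end(),
                      [](const Hit& a, const Hit& b) { return a.materialId < b.materialId; });

            // 3. Shade in material order; the material decides the next ray of the path.
            std::vector<Ray> next;
            for (const Hit& h : hits) {
                Ray r;
                if (shadeAndSpawn(h, r)) next.push_back(r);
            }

            // 4. Optional: re-sort 'next' spatially (e.g. by a Morton code of origin)
            //    so the next trace pass also hits caches coherently.
            rays = std::move(next);   // 5. loop back to step 1 with the new ray buffer
        }
    }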

    This is how powerful HW raytracing might, and should, look, and initially I assumed RTX would already do all this.
    Likely, tracing and shading can run in parallel, shading on compute cores while the RT cores trace independently. Both make heavy demands on bandwidth, so doubling it would make sense.

    Downside: reduced general-purpose compute power, because the necessary RT die area would be large this time?
    Solution: remove the rasterization hardware. At that point there is no more need for hybrid rendering. (IMO.)

    And yes, I know we're talking about rumours here, just for fun :)
     
    milk and Shifty Geezer like this.
  3. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    701
    Likes Received:
    587
    Location:
    55°38′33″ N, 37°28′37″ E
    I suspect there is little point in deferred raytracing right now. Tracing intersections will be heavily memory bandwidth bound, not shader computation bound.

    With raytracing, we start from a pixel in screen coordinates and shoot rays into the scene at different angles - since each ray needs to traverse a significant part of the BVH tree, especially when the tree has to be rebuilt every frame, memory access will be very incoherent.

    For each source pixel, there could be multiple hits along each ray, invoking any-hit shaders for all transparent surfaces and the closest-hit shader for the final non-transparent surface, each involving texture accesses and/or additional rays spawned towards light sources.
    That could take enormous amounts of memory bandwidth, depending on the properties of the BVH partitioning algorithm.
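
    To make the access pattern concrete, here is a minimal stack-based traversal sketch (the types and layout are illustrative, not from any real implementation). Every node fetch address depends on the individual ray, which is exactly why neighbouring rays stop sharing cache lines:

    Code:
    #include <algorithm>
    #include <cfloat>
    #include <vector>

    struct Ray  { float o[3], d[3]; };
    struct AABB { float lo[3], hi[3]; };
    // Leaf nodes have triCount > 0; inner nodes reference two children by index.
    struct BVHNode { AABB bounds; int left, right, firstTri, triCount; };

    // Standard slab test against the node bounds.
    bool intersectAABB(const AABB& b, const Ray& r, float tMax)
    {
        float t0 = 0.0f, t1 = tMax;
        for (int a = 0; a < 3; ++a) {
            float inv   = 1.0f / r.d[a];
            float tNear = (b.lo[a] - r.o[a]) * inv;
            float tFar  = (b.hi[a] - r.o[a]) * inv;
            if (inv < 0.0f) std::swap(tNear, tFar);
            t0 = std::max(t0, tNear);
            t1 = std::min(t1, tFar);
            if (t0 > t1) return false;
        }
        return true;
    }

    int traverse(const std::vector<BVHNode>& nodes, const Ray& ray)
    {
        int   stack[64];
        int   sp = 0;
        int   closestTri = -1;
        float closestT   = FLT_MAX;
        stack[sp++] = 0;                                // start at the root
        while (sp > 0) {
            const BVHNode& n = nodes[stack[--sp]];      // node fetch at a ray-dependent address
            if (!intersectAABB(n.bounds, ray, closestT)) continue;
            if (n.triCount > 0) {
                // Leaf: test the triangles, update closestT / closestTri (omitted).
            } else {
                stack[sp++] = n.left;                   // which subtrees survive depends on
                stack[sp++] = n.right;                  // the ray, so adjacent rays diverge
            }
        }
        return closestTri;
    }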


    I'd guess the initial hardware implementations will still be based on the current DXR paradigm, where raytracing constitutes a separate processing pipeline, as opposed to the 'traditional' rasterization pipeline (or the new task/mesh rasterization pipeline).
    These pipelines will be scheduled to run in parallel, with each one having its own command queues.

    This way, the 'traditional' rasterization pipeline, and especially the meshlet-based rasterizer, would be compute-shader intensive to allow highly parallel processing of complex geometry, while the raytracing pipeline would be memory access intensive - and to maintain reasonable performance tradeoff, it would operate with simplified scene geometry and reduced texture resolution (and maybe reduced front-buffer resolution as well).
     
    #3 DmitryKo, Oct 10, 2019
    Last edited: Oct 11, 2019
  4. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    513
    Likes Received:
    614
    Instead of using any-hit shaders and recursive control flow, my proposal would only do closest hits, requiring multiple passes to deal with transparency.
    But ignoring transparency and using hacks has worked well for games so far, and even with any-hit shaders we will continue with those hacks in any case.
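
    A sketch of what those multiple passes could look like, with illustrative stand-in helpers: keep re-casting the same ray from just past each transparent closest hit, accumulating transmittance until something opaque (or nothing) is hit.

    Code:
    #include <cstdint>

    struct Vec3 { float x, y, z; };
    struct Ray  { Vec3 origin, dir; };
    struct Hit  { bool valid; uint32_t materialId; Vec3 position; };

    // Illustrative stubs; a real renderer supplies these.
    Hit   traceClosestHit(const Ray&)      { return Hit{ false, 0, {0, 0, 0} }; }
    bool  isTransparent(uint32_t material) { return material != 0; }
    float alphaOf(const Hit&)              { return 0.5f; }

    // Emulate any-hit transparency with closest-hit-only passes.
    float traceThroughTransparency(Ray r, Hit& finalHit, int maxLayers = 4)
    {
        const float kEps = 1e-4f;              // offset to step past the surface
        float transmittance = 1.0f;
        for (int layer = 0; layer < maxLayers; ++layer) {
            finalHit = traceClosestHit(r);
            if (!finalHit.valid || !isTransparent(finalHit.materialId))
                break;                         // miss, or an opaque surface: done
            transmittance *= 1.0f - alphaOf(finalHit);
            r.origin.x = finalHit.position.x + r.dir.x * kEps;   // restart the ray
            r.origin.y = finalHit.position.y + r.dir.y * kEps;   // just behind the
            r.origin.z = finalHit.position.z + r.dir.z * kEps;   // transparent layer
        }
        return transmittance;
    }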

    Agree, but they could aim for a leap and keep quiet about it. Technically it might be possible - see the more advanced ImgTech hardware, which has reordering, and maybe material sorting too (can't remember). Both those tasks could utilize the same FF HW.

    But it's not that I believe this will happen so quickly. We could make similar speculations about Ampere if we want.
    Reordering is only worth it for GI, not for shadow rays or sharp reflections.
    Material sorting might only be worth it for sharp reflections.
    Photorealistic path tracing still seems out of reach, and this would also be a very inefficient way to get there.

    So this really is far-fetched speculation. But it's the only explanation for the rumoured 7 TF at 800 GB/s we came up with. (...which would be the opposite impression of what we got from AMD's TMU patent. They could have licensed ImgTech... who knows.)
     
  5. PSman1700

    Regular Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    486
    Likes Received:
    113
    But that's talking about 7 TF with RT - what about double that in 2020 GPUs, most likely with improved RT hardware?
     
  6. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    701
    Likes Received:
    587
    Location:
    55°38′33″ N, 37°28′37″ E
    The increased bandwidth could be required for the ray intersection engine to operate. No matter which block performs the actual intersection processing and flow control, the BVH nodes still need to be walked through one by one, which involves traversing the tree multiple times in a very irregular memory access pattern.

    https://devblogs.nvidia.com/thinking-parallel-part-iii-tree-construction-gpu/
    https://devblogs.nvidia.com/thinking-parallel-part-ii-tree-traversal-gpu/
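
    For reference, the construction described in the linked "Thinking Parallel" series starts by sorting primitive centroids along a space-filling curve so the tree can be built (and rebuilt) in parallel every frame. This 30-bit Morton encoding is adapted from the article's code into plain C++:

    Code:
    #include <algorithm>

    // Expands a 10-bit integer into 30 bits by inserting two zeros after each bit.
    unsigned int expandBits(unsigned int v)
    {
        v = (v * 0x00010001u) & 0xFF0000FFu;
        v = (v * 0x00000101u) & 0x0F00F00Fu;
        v = (v * 0x00000011u) & 0xC30C30C3u;
        v = (v * 0x00000005u) & 0x49249249u;
        return v;
    }

    // 30-bit Morton code for a point inside the unit cube [0,1]^3. Sorting
    // primitives by this key groups spatially close primitives together.
    unsigned int morton3D(float x, float y, float z)
    {
        x = std::min(std::max(x * 1024.0f, 0.0f), 1023.0f);
        y = std::min(std::max(y * 1024.0f, 0.0f), 1023.0f);
        z = std::min(std::max(z * 1024.0f, 0.0f), 1023.0f);
        unsigned int xx = expandBits((unsigned int)x);
        unsigned int yy = expandBits((unsigned int)y);
        unsigned int zz = expandBits((unsigned int)z);
        return xx * 4 + yy * 2 + zz;
    }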


    The shader units in the AMD raytracing patent are mostly used for control flow, while the actual BVH traversal and processing is performed by the fixed function ray intersection engine, which has its own dedicated ALUs for ray intersection testing, sharing the texture data paths with the TMU to fetch BVH nodes.

     
    #6 DmitryKo, Oct 10, 2019
    Last edited: Oct 11, 2019
    iroboto, BRiT and PSman1700 like this.
  7. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    513
    Likes Received:
    614
    Sure, but why would AMD need twice the bandwidth of NV? Because they are bad engineers? No, no - if those numbers are true, secret sauce is much more likely.
    Maybe the patent (which also mentions optional FF for control flow), together with "selective lighting effects", is meant just as a distraction, and they mean RT when they say "Nvidia Killer" :D

    Although it's off-topic, I forgot to mention that HW binning would have many other applications (generic sketch below):
    Acceleration structure building for geometry, photons, rays, SPH fluids, maybe stuff like SAT for denoising, etc.
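
    The common core of all of those is one bin/scatter pass, i.e. a counting sort: histogram, prefix sum, scatter. A generic sketch of what such FF binning hardware would be accelerating - GPUs do this today with atomics and a parallel scan:

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One generic bin/scatter pass (a counting sort): histogram, prefix sum,
    // scatter. Returns item indices grouped by bin: BVH leaves, photon cells,
    // SPH neighbour lists all start from this pattern.
    std::vector<uint32_t> binItems(const std::vector<uint32_t>& binOf, uint32_t numBins)
    {
        std::vector<uint32_t> offset(numBins + 1, 0);
        for (uint32_t b : binOf) ++offset[b + 1];      // pass 1: count items per bin
        for (uint32_t i = 1; i <= numBins; ++i)
            offset[i] += offset[i - 1];                // pass 2: exclusive prefix sum
        std::vector<uint32_t> order(binOf.size());
        for (std::size_t i = 0; i < binOf.size(); ++i)
            order[offset[binOf[i]]++] = (uint32_t)i;   // pass 3: scatter into bins
        return order;
    }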
     
  8. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    701
    Likes Received:
    587
    Location:
    55°38′33″ N, 37°28′37″ E
    For doing what, exactly? We don't have performance numbers to draw any conclusion.
    There are many ways to make efficient use of this bandwidth - I'd hope they offer better raytracing performance than the GeForce RTX, as well as new features in the RT tier 1.1.
     
    #8 DmitryKo, Oct 11, 2019
    Last edited: Oct 11, 2019
    iroboto and JoeJ like this.
  9. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    513
    Likes Received:
    614
    I found an explanation that makes sense: assuming the patent and the rumour both apply, AMD could eventually double the TMUs, so while tracing, the CU can switch to a shading wavefront and access textures using the second TMU.
    That would explain the demand on BW, and even if CUs control the outer traversal loop, they should be available for shading most of the time.
    So no big surprise other than higher RT performance than expected. (I did expect 2060 RT perf., but not more.)

    Yay, some LOD! :D
     
    Dictator and iroboto like this.
  10. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    701
    Likes Received:
    587
    Location:
    55°38′33″ N, 37°28′37″ E
    They don't need to increase the TMU count for that. TMUs are dedicated fixed-function ALUs that perform texture loads and trilinear/anisotropic filtering.

    If they are adding REs (ray intersection engines), that is, fixed-function ALUs for BVH loads and ray intersection tests, they would need to increase the width of the memory bus and the dedicated caches, so the bandwidth could be effectively shared between TMUs and REs working in parallel.

    But the number of REs is not necessarily bound to the number of TMUs one-to-one - in fact, I would expect them to add quite a significant amount of REs, so the scheduler could optimize for rays coming in approximately the same direction, and BVH nodes could be shared and re-used by different rays to improve cache locality.
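
    As a hedged illustration of that kind of scheduling, one cheap grouping key is the direction octant - the three sign bits of the ray direction. Rays in the same octant tend to walk the BVH in a similar front-to-back order, so their node fetches can share the cache:

    Code:
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Vec3 { float x, y, z; };
    struct Ray  { Vec3 origin, dir; };

    // Three sign bits of the direction give eight coarse direction classes.
    uint32_t directionOctant(const Ray& r)
    {
        return (r.dir.x < 0.0f ? 1u : 0u)
             | (r.dir.y < 0.0f ? 2u : 0u)
             | (r.dir.z < 0.0f ? 4u : 0u);
    }

    // Group a batch so rays of the same octant are traced back to back; they
    // tend to visit the BVH in a similar order, so node fetches share the cache.
    void sortByOctant(std::vector<Ray>& rays)
    {
        std::stable_sort(rays.begin(), rays.end(),
                         [](const Ray& a, const Ray& b)
                         { return directionOctant(a) < directionOctant(b); });
    }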
     
    #10 DmitryKo, Oct 11, 2019
    Last edited: Oct 11, 2019
  11. milk

    Veteran Regular

    Joined:
    Jun 6, 2012
    Messages:
    3,017
    Likes Received:
    2,589
    dude, the patent...
     
  12. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,891
    Likes Received:
    11,483
    Location:
    Under my bridge
    I've just remembered Sony were rumoured to have patented photon-mapping tech. That's a tech I know little about. How does photon mapping shift the balance of processing versus memory access? Like anything, you can use different approaches to favour one resource or the other, but is there scope with photon mapping to get proportionally better acceleration through memory consumption rather than compute, especially if you focus more on lighting than reflections, without needing per-surface calculations for secondary rays?
     
  13. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    513
    Likes Received:
    614
    Makes more sense, but then what is the connection to TMUs at all, other than the memory bus? (This question has bothered me since reading the patent - I'm neither good with patents nor hardware :) )

    A scheduler able to sort rays would be neat. I wonder if RTX does this already.
     
    PSman1700 likes this.
  14. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    513
    Likes Received:
    614
    Don't know much either, but photons need some spatial structure to access them, and unlike an RT BVH, the data changes completely every frame, so you cannot refit the bottom levels but may need a full rebuild each frame. Likely very BW-intense.
    How much this could save on compute may be a question of additional FF HW for photon gathering and those rebuilds.
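
    To make the rebuild-every-frame cost concrete, a minimal sketch assuming a hashed uniform grid over the photons (all names illustrative): the whole structure is re-scattered from scratch every frame, and gathering then touches the 27 cells around each shading point. Both steps stream a lot of memory for little math, which fits the bandwidth-heavy picture.

    Code:
    #include <cmath>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Vec3   { float x, y, z; };
    struct Photon { Vec3 pos, power; };

    // Hashed uniform grid over photon positions. Because the photons move every
    // frame, the whole structure is rebuilt from scratch each frame - unlike a
    // BVH over static geometry, which can merely be refitted.
    struct PhotonGrid
    {
        float cellSize = 1.0f;
        std::unordered_map<uint64_t, std::vector<const Photon*>> cells;

        uint64_t key(float x, float y, float z) const
        {
            auto q = [&](float v) { return (uint64_t)(int64_t)std::floor(v / cellSize) & 0x1FFFFFu; };
            return (q(x) << 42) | (q(y) << 21) | q(z);
        }

        void rebuild(const std::vector<Photon>& photons)   // full rebuild, every frame
        {
            cells.clear();
            for (const Photon& ph : photons)               // pointers stay valid while
                cells[key(ph.pos.x, ph.pos.y, ph.pos.z)].push_back(&ph);  // 'photons' lives
        }

        // Gather: sum photon power within 'radius' (<= cellSize) of point x by
        // probing the 27 surrounding cells - mostly memory traffic, little math.
        Vec3 gather(const Vec3& x, float radius) const
        {
            Vec3 sum{0, 0, 0};
            for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
            for (int dz = -1; dz <= 1; ++dz) {
                auto it = cells.find(key(x.x + dx * cellSize,
                                         x.y + dy * cellSize,
                                         x.z + dz * cellSize));
                if (it == cells.end()) continue;
                for (const Photon* ph : it->second) {
                    float ex = ph->pos.x - x.x, ey = ph->pos.y - x.y, ez = ph->pos.z - x.z;
                    if (ex * ex + ey * ey + ez * ez <= radius * radius) {
                        sum.x += ph->power.x; sum.y += ph->power.y; sum.z += ph->power.z;
                    }
                }
            }
            return sum;
        }
    };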
     
  15. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,891
    Likes Received:
    11,483
    Location:
    Under my bridge
  16. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    701
    Likes Received:
    587
    Location:
    55°38′33″ N, 37°28′37″ E
    The patent does not claim any particular issue ratio between texture filtering and ray intersection units in the texture processor.

    The essential claims of the patent are that the texture addressing unit and texture caches are reused for loading BVH nodes and communicating with the shader unit wavefronts.

    There could be various implementations of the ray intersection engine: 1) the TMU is designed as one single block, so that the texture filtering unit and the ray intersection unit are switched functions tied to a single texture addressing block, or 2) texture filtering and ray intersection run in parallel, with their texture addressing blocks being a kind of scheduled, out-of-order wavefront processor with concurrent access to the memory controller.
     
    TheAlSpark and JoeJ like this.
  17. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,792
    Likes Received:
    5,879
    Location:
    ಠ_ಠ
    Where the texture is, there must be a polygon? or something

    ...along those...

    ...lines?

     
  18. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,891
    Likes Received:
    11,483
    Location:
    Under my bridge
    Actually, no. There needs to be a surface, or rather, instructions to shade a pixel (or sample) a certain way. You could use a texture to colour a 3D fluid-simulated volume without any triangles. You could use a texture to shade pixels based on ray depth, or anything. All that's needed, somehow, is to create an arrangement of colouring instructions in relation to the scene geometry. So, say, trace 64x64-pixel tiles, populate each pixel with surface data, then send that tile off to the shaders to shade while the next tile is worked on. The problem then, I think, becomes one of batching shaders to work on coherent pixels.
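
    A sketch of that tile pipeline, with illustrative names throughout: trace a 64x64 tile into a small surface buffer, sort the tile by surface ID, then submit one shading batch per run of equal IDs. In hardware, this tile would shade while the next one traces.

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr int TILE = 64;

    // One traced sample: which 'colouring instruction' applies, plus hit data.
    struct SurfaceSample { uint32_t surfaceId; float depth; uint32_t pixel; };

    // Illustrative stubs standing in for the tracer and the shader cores.
    SurfaceSample traceTilePixel(int, int, int px, int py)
    {
        return SurfaceSample{ uint32_t((px ^ py) & 7), 1.0f, uint32_t(py * TILE + px) };
    }
    void shadeBatch(const SurfaceSample*, std::size_t) { /* shade one coherent batch */ }

    void traceAndShadeTile(int tileX, int tileY)
    {
        // 1. Trace: populate the tile's surface buffer. No shading happens yet.
        std::vector<SurfaceSample> tile;
        tile.reserve(TILE * TILE);
        for (int py = 0; py < TILE; ++py)
            for (int px = 0; px < TILE; ++px)
                tile.push_back(traceTilePixel(tileX, tileY, px, py));

        // 2. Batch: group the tile's pixels by surfaceId so each shader batch
        //    does coherent work - the batching problem raised above.
        std::sort(tile.begin(), tile.end(),
                  [](const SurfaceSample& a, const SurfaceSample& b)
                  { return a.surfaceId < b.surfaceId; });

        // 3. Shade: one submission per run of equal surfaceIds. In hardware this
        //    tile would be shaded while the next tile is being traced.
        std::size_t start = 0;
        for (std::size_t i = 1; i <= tile.size(); ++i) {
            if (i == tile.size() || tile[i].surfaceId != tile[start].surfaceId) {
                shadeBatch(&tile[start], i - start);
                start = i;
            }
        }
    }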
     
    TheAlSpark likes this.
  19. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    513
    Likes Received:
    614
    Just compared Vega 64 vs. Radeon VII on TechPowerUp. The latter has 122% relative performance. Both are around 13 TF, about one apart, while the VII's bandwidth is a bit more than 2x. Just to show there is likely not much benefit from 800 GB/s for rasterization and shading in games.

    I would rule out fancy stuff like reordering, because it's not worth it yet. The same applies to material sorting. (If schedulers can do some in-place sorting, it will take even longer until such things become a win in practice.)
    I would also rule out photon mapping. I think it's nice for offline, but for realtime I see much better options, like texture-space shading or other surface-caching methods (although with some hurdles for content creation and engine complexity - stuff the movie industry would not want).
    And I don't believe in 3rd-party extra hardware or big differences between Sony / MS.

    So in the end I'm back at the start, assuming very similar tech to RTX / DXR, but with perf closer to the highest-end NV models, not the lower-end ones. To get there, the 800 GB/s might just be more effective than more TF / a larger GPU.
    I would not be surprised by HW BVH build, but anything else sounds too far-fetched.
     