RDNA 2 Ray Tracing

Discussion in 'Architecture and Products' started by Jawed, Dec 8, 2020.

  1. JoeJ

    Now knowing that even RT cores are programmable, I see my previous view of 'flexible AMD' vs. 'fast but limited NV' is pointless - sorry for that.
    Still, the problem remains much the same:
    It's difficult for both NV and AMD to expose flexibility to devs, because neither can guarantee such things will still work on future HW.
    Proposal: experimental extensions for research purposes. They could be used for games too, but then future HW would require patching those games.
    Currently I guess only the big fish like Epic collaborate with HW vendors, discussing options and doing some tests. The results will lead to new standards and better APIs.
    That's fine, but because the LOD problem has no standard solution, making such extensions public would make sense to spur more research and explore more options.

    However, the situation seems much better than I had thought all this time. The chances we get traversal shaders, with Turing support, are pretty good I guess.
     
  2. JoeJ

    Not sure what you disagree about, but I meant the TMU patent gave two options: either implement the traversal loop in fixed function, or leave it to the shader cores. They did not say which they might use, but it was already clear back then that their intersectors operate on 4 boxes.

    With a BVH there is no strict need to query all children simultaneously, nor do we have to process them in closest-first order. Stackless traversal algorithms usually do neither. (Just to mention - the norm seems to be ordered traversal using a short stack.)
    A research result of 3 being the optimal branching factor can't be general. It depends on the algorithm, the quality of the BVH, the scene, etc. The only general argument I could come up with is a geometric one: 'If we want to subdivide space so children have half the size of their parents, then in 3D we get an ideal number of 8 children, like an octree.' But that is a very weak argument as well.

    Agreed that BVH2 is not a good idea. But I think we could implement octrees or sparse grids by calling the intersection multiple times and refining the order manually if necessary. That's not really what I'm personally interested in, but hey - I could raytrace my fluid simulator, which uses a sparse grid.
    What I am interested in is reusing my GI BVH for HW RT. I use 8 children at most, but at the surface it looks like a quadtree of surfels and mostly has 4 children. I guess it would just work.

    What I mean is simply... writing some code:

    Code:
    struct Node
    {
        AABox bbox;
        int childIndex;
        int childCount;

        // optional, depending on the traversal scheme:
        // int parentIndex;
        // int adjacencyIndices;
    };

    The intersection instructions take only pointers to bounding boxes; they make no assumptions about anything else. So it's completely up to us what acceleration structure we use and how we traverse it.
    All my talk here is about the idea of not using DXR at all, but implementing the data structure and traversal in compute. This makes sense because AMD lacks fixed-function traversal hardware, so we should get the same or better performance than their generic DXR implementation, with extra features like LOD.
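
    A minimal sketch of what I mean - purely my own pseudocode, with a software slab test standing in for the HW box-intersection instruction, all names invented, and leaf handling omitted:

    Code:
    #include <algorithm>
    #include <cstdint>

    struct AABox { float min[3], max[3]; };
    struct Ray   { float orig[3], invDir[3]; float tMax; };
    struct Node  { AABox bbox; int childIndex; int childCount; }; // childCount == 0: leaf

    // Stand-in for the HW box test: classic ray/AABB slab intersection.
    static bool hitBox(const Ray& r, const AABox& b)
    {
        float t0 = 0.0f, t1 = r.tMax;
        for (int a = 0; a < 3; ++a)
        {
            float tNear = (b.min[a] - r.orig[a]) * r.invDir[a];
            float tFar  = (b.max[a] - r.orig[a]) * r.invDir[a];
            if (tNear > tFar) std::swap(tNear, tFar);
            t0 = std::max(t0, tNear);
            t1 = std::min(t1, tFar);
        }
        return t0 <= t1;
    }

    // Short-stack traversal over a custom BVH - entirely software, no DXR.
    // The fixed-size stack is fine for shallow, wide trees.
    void traverse(const Node* nodes, const Ray& ray)
    {
        int stack[64];
        int top = 0;
        stack[top++] = 0; // root
        while (top > 0)
        {
            const Node& n = nodes[stack[--top]];
            for (int i = 0; i < n.childCount; ++i)
            {
                int c = n.childIndex + i; // children stored sequentially
                if (!hitBox(ray, nodes[c].bbox)) continue;
                if (nodes[c].childCount > 0) stack[top++] = c; // internal: push
                // else: leaf - intersect its content (surfels, triangles, grid cell, ...)
            }
        }
    }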

    I did not pay attention to triangles, but only one triangle per node sounds like the BVH will take a lot of memory. I assume they may iterate over a list of triangles serially, but I have no idea.


    Parent-child dependencies on refit can be avoided by enlarging boxes so they still bound their children under animation. That's useful at least for the lowest levels - we only need to transform the boxes with the geometry.
    To further reduce dependencies, we can at least process level x of all the BLAS we have, then sync and continue with level x+1. So, ideally we get one dispatch per tree level.
    Alternatively, NV often does research on doing it all in one go, at the cost of extra work like searching and sorting. I'm not up to date with this. Assuming modern NV is good with async compute, the number of dispatches should no longer be a big problem if other work keeps the GPU saturated.
    What I personally do as well is build multiple top levels in a single wide workgroup, which moves the barriers from the API level into the compute program. That's quite a win for me. (See the sketch below.)
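
    As a hedged CPU-side sketch of that level-by-level refit (node layout and names are my own assumptions): the outer loop over levels is what would become one dispatch - or one barrier inside a wide workgroup - per level:

    Code:
    #include <algorithm>
    #include <vector>

    struct AABox { float min[3], max[3]; };
    struct Node  { AABox bbox; int firstChild; int childCount; };

    // Refit bottom-up: the deepest level first, so every child is already
    // up to date when its parent's box is recomputed.
    void refit(std::vector<Node>& nodes,
               const std::vector<std::vector<int>>& nodesPerLevel)
    {
        for (int level = (int)nodesPerLevel.size() - 1; level >= 0; --level)
            for (int idx : nodesPerLevel[level]) // parallel across the level on GPU
            {
                Node& n = nodes[idx];
                if (n.childCount == 0) continue; // leaf: box moves with the geometry
                AABox box = nodes[n.firstChild].bbox;
                for (int c = 1; c < n.childCount; ++c)
                    for (int a = 0; a < 3; ++a)
                    {
                        box.min[a] = std::min(box.min[a], nodes[n.firstChild + c].bbox.min[a]);
                        box.max[a] = std::max(box.max[a], nodes[n.firstChild + c].bbox.max[a]);
                    }
                n.bbox = box;
            }
    }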

    How deep the tree has to be depends mostly on the branching factor. A binary tree needs too many levels.
    With my BVH8 I need about 20 levels for open world, but my surfels are 10 cm, so larger than game triangles.
    For fluid sim I use a sparse grid (to me that's just an octree with 4^3 instead of 2^3 children), and IIRC I use only two internal levels for grid sizes of 1024^3. How aggressively a larger branching factor can reduce tree levels is very interesting and not so obvious at first.
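
    Back-of-the-envelope: the depth is about ceil(log_B N), which a few lines make concrete (my own sketch, for N = one million leaves):

    Code:
    #include <cmath>
    #include <cstdio>

    // Tree depth over N leaves with branching factor B: ceil(log(N) / log(B)).
    int main()
    {
        const double N = 1e6;
        for (int B : { 2, 4, 8, 64 })
            std::printf("B = %2d -> %2.0f levels\n", B, std::ceil(std::log(N) / std::log(B)));
        // Prints: B = 2 -> 20, B = 4 -> 10, B = 8 -> 7, B = 64 -> 4
    }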
     
  3. DmitryKo

    No, it's the other way round. In the raytracing pipeline, generated rays are submitted to the traversal units, which walk each ray through the BVH tree and invoke the specific raytracing shaders (intersection, any-hit, closest-hit) when they find ray-triangle intersection(s), or the miss shader when a ray misses.
     
  4. JoeJ

    Agreed, no instructions necessary. But how about inline tracing? If there were traversal units, wouldn't there be at least one traversal instruction for that?
     
  5. manux

    I believe you could also have multiple BVH structures and decide which BVH to use when casting a ray.
     
  6. DmitryKo

    Inline raytracing makes no difference. The whole raytracing pipeline is data-driven - you shoot rays from screen space using the ray generation shader, and its output implicitly flows into the traversal/intersection units (no matter whether fixed function or shader-based), which walk the BVH tree to find ray-triangle intersections and spawn 'hit' or 'miss' shaders, which modify each ray's attributes to define the final color of your pixel.

    Inlining combines these separate stages into one 'uber-shader', but it's just an abstraction to simplify programming of low-res effects that do not require full performance. You don't really need flow instructions which interrupt your ray shader program and start BVH traversal - scheduling multiple work units and controlling fixed-function processing is the graphics driver's job (which may be aided by a hardware scheduler to reduce driver overhead).
     
  7. Krteq

    Hmm, isn't Specialized TraceRayInline control flow pretty simplified?
     
  8. JoeJ

    I'm still not convinced. It may look that way from the API side, but if we use inline tracing from compute, there should be some related instruction.
    I can imagine it would also work to put the ray parameters into some memory/registers, then call a generic sync instruction to signal that work is waiting. But then I would expect at least some RT-related tokens to be mentioned for that particular instruction.

    Which would give us stochastic LOD. Overcoming DXR's instancing limitations would be another application.
     
  9. manux

    Shouldn't this work pretty well?
    https://devblogs.microsoft.com/directx/dxr-1-1/
     
  10. Ethatron

    The instruction(s) take a node pointer and some parameters. While you could in principle provide a node in situ, you would have to export it to memory, as these are pure memory instructions.
    It's a similar case to trying to provide texture content from within the shader: it seems possible, but you have to spill, as the fetch/sample instruction is a pure memory instruction.
    The bit-pattern of the node/leaf itself is not specified. The instruction, though, returns actual new node pointers, which only need to be passed to the next iteration (in any order you like), except if you hit a leaf - then it's whatever is configured in the T# uop field.
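
    A hedged C-level sketch of that contract for the box-node case (all names are mine, and the node encoding is exactly the unspecified part):

    Code:
    #include <cstdint>

    struct BoxResult { uint32_t child[4]; }; // new node pointers, or invalid sentinels

    BoxResult bvh_intersect_box(uint64_t nodePtr, // pure memory operand
                                const float rayOrig[3],
                                const float rayInvDir[3],
                                float tMax)
    {
        // Stub: on HW the texture unit dereferences nodePtr itself, tests the
        // ray against the boxes stored there, and returns surviving children.
        // Leaf behaviour is whatever the T# descriptor configures.
        (void)nodePtr; (void)rayOrig; (void)rayInvDir; (void)tMax;
        return BoxResult{};
    }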

    Edit: BTW it seems easy enough to peek into the DXR buffers and get an understanding of what precisely is up ... it's a public buffer, so suit yourself. :)
     
  11. manux

    The node/leaf patterns could easily have a HW-dependent format and even compression :)
     
  12. JoeJ

    My goal would be to generate the 1D BVH textures in video memory myself, not to create temporary nodes during traversal, so this would be fine.
    But it reminds me that I often think it would be nice to have texture filter support for some pixels in LDS memory.

    I'll do this. ...if I ever get to the bottom of the todo list, where RT support is :D
     
  13. Ethatron

    It certainly is hardware dependent. So far, I think, every new generation of AMD has changed the texture descriptor bit-field.
    For sure it's possible to hack the coding and patch it together yourself, but it's of extremely limited use, even just for fun.

    When I worked out the advantages of an embedded system (as the console is) vs. the PC in the other thread, this is it. You'll never be able to be forward compatible (as PC games/drivers need to be) if you tailor for some specific architecture revision. And the API by definition wants to guarantee absolute compatibility, vertically and laterally. Which means the only thing you can do is the stuff defined in the API, and nothing else.

    Normally programming means you can do whatever you want and produce infinite permutations of your idea. In this case you have precisely 2 broadly defined approaches (pipeline and inline), and you can't deviate at all. It happens often enough that a single line of change within a specific setup has major performance implications (I'm not saying this is the case here generally, or for RT specifically, but it happens regularly in tight-spot optimization), and this possibility simply doesn't exist with DXR.

    Obviously Microsoft or Khronos could offer more specific and neat instructions, identifying smart ways of breaking out of the box, but at the cost of generality (forward compatibility means it can't be taken out later, or changed - see CPU SIMD instructions), and if it happens at all it's an extremely long extension process. Take as an example the lane instructions in the Direct3D Shader Models: it's already taking years to get the most simple of common low-level features into the spec, the drivers and ultimately the public.

    On the consoles you can be an RT programmer; on the PC, with RT as it is, you're a manager, not a programmer. At least it feels like that. :)
     
  14. Jawed

    "Simultaneously" is strictly speaking incorrect, agreed. But all children need to be evaluated.

    In fixed function hardware, given that the number of children is small (2 or 4 or even 8) and that the evaluation of each sibling is independent of the others, it makes sense to evaluate them all simultaneously.

    I think we can come up with a minimal node definition for AMD's BVH4:

    Code:
    struct Node
    {
        AABox bbox;
        nodeIndex childIndices[4];
    };
    Until we see the generated ISA for a traversal shader, we can't know how a path/stack is used though (e.g. is there a "nodeIndex parentIndex" in the struct?).

    We can assume that bbox can be set to all zeroes to indicate a degenerate node (one that can't be hit) for scenarios where there are fewer than four children, or some similar coding, since a hardware implementation would have zero time overhead while avoiding the per-child storage overhead. Though AABox might use some kind of compression scheme (rather than 6 raw floats for diagonally opposed cuboid corners), in which case the coding could be more obscure.

    Anyway, what I'm trying to do here is separate the minimal conceptual model the hardware could use (because we do not have access to the data structure of the BVH nodes) from algorithmic layers that interact with hardware-accelerated BVH4 queries.

    Presumably, if you build a queue of top levels, you can then feed them into running workgroups as rays expire - a sort of "hybrid persistent compute" traversal model. (A toy sketch of this follows.)
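
    For what it's worth, here is a CPU analogue of that persistent model - entirely my own sketch, with an invented traversal entry point: workers outlive individual rays and keep pulling new ray indices from a shared atomic queue until it runs dry. On the GPU, the workers would be workgroups.

    Code:
    #include <algorithm>
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<int> nextRay{0};
    constexpr int rayCount = 1 << 20;

    void worker()
    {
        for (;;)
        {
            int ray = nextRay.fetch_add(1); // grab a new ray as the previous one expires
            if (ray >= rayCount) return;    // queue drained: the 'workgroup' retires
            // traverseFromQueuedTopLevel(ray); -- hypothetical traversal entry point
        }
    }

    int main()
    {
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
        for (auto& t : pool) t.join();
        std::printf("all %d rays consumed\n", rayCount);
    }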

    To me the interesting idea here is to opportunistically generate rays deep in the BVH, benefitting from cache locality and taking decisions about the degree of ray coherence or what happens in a ray's diffuse bounce.

    This picture, from https://psychopath.io/post/2017_08_03_bvh4_without_simd:

    [Graph: number of node tests (Y-axis) vs. branch factor B (X-axis). Although the graph assumes a fixed N, the patterns hold for all values of N.]
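
    The shape of that curve also falls out of a simple back-of-the-envelope model (my own sketch, assuming an idealized tree where one leaf is found among N and all B children are tested at each of the log_B(N) levels): node tests ~ B * log_B(N) = (B / ln B) * ln N, and B / ln B has its minimum at B = e ~ 2.72, which is why 3 wins among integers.

    Code:
    #include <cmath>
    #include <cstdio>

    // Node tests to reach one leaf among N, testing all B children per level:
    // tests(B) = B * log(N) / log(B). Minimized near B = e under this model.
    int main()
    {
        const double N = 1 << 20;
        for (int B = 2; B <= 8; ++B)
            std::printf("B = %d -> %6.1f node tests\n", B, B * std::log(N) / std::log(B));
        // Minimum lands at B = 3, matching the graph above.
    }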
     
  15. JoeJ

    We could likely figure that out. But even if it shows that those child pointers are there (which is likely), it does not tell us whether they are used only from software (AMD's DXR implementation, with the outer traversal loop),
    or whether there is a hardware mechanism which uses them too (the outer traversal loop processed by some fixed function block).

    After having learned about potential microcode (possibly making NV RT cores partially programmable from their side), I realize my view on compute vs. fixed function might be naive.
    But my assumption would be this: AMD did some work with Radeon Rays, and algorithms were taken from there to implement DXR. Those programs are not much different from the compute shaders we can use within our gfx APIs.
    If that's true, it becomes very likely we could beat their performance with a customized solution in many cases. Unlimited flexibility would not be the only advantage.

    You mean to start rays at the surface from a known BVH node, so the traversal from the root down just to find the starting point can be avoided?
    Yeah, that's a nice idea for such custom optimizations!
    I remember I thought about this some time back, probably when working on a path tracer for reference renders, but did not try it out. For paths this could be a big win, because we can reuse the nodes at the hitpoint.
    But I see a big problem for the game application: no paths, and if we start rays from pixels we don't know any starting node yet. We could fetch it from a texture, but generating and storing that, dealing with LOD, etc. - not easy. It would be more practical if we used primary rays too, with no more rasterization.
     
  16. Krteq

  17. Jawed

    I've realised that you can shift the storage of bounding boxes upwards by one level in the tree:

    Code:
    struct Node
    {
        BBox childBoxes[4];
        NodeIndex childIndices[4];
    };
    and this could present an opportunity to increase the compression ratio for bounding boxes. With 24 floats there could be shared exponents and some rule-breaking with respect to IEEE 754 (a variant of denormal formatting). Or run-length encoding. Or both combined. In the end you need a compression algorithm that produces conservatively erroneous bounds (i.e. the errors produce larger boxes, not smaller ones). This is similar to how texture formats rely upon fairly complex bitwise operations, which of course are trivial for hardware to decompress. The key is to make a node a fixed size, smaller than the raw data.
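
    A hedged sketch of one such conservative scheme - my own toy version, not AMD's: quantize each child box to 8-bit offsets inside the parent's box, rounding mins down and maxes up, so decompressed boxes can only ever grow:

    Code:
    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    struct AABox { float min[3], max[3]; };
    struct QBox  { uint8_t min[3], max[3]; }; // 6 bytes instead of 24

    // Quantize a child into 256 steps of the parent's extent. Rounding min
    // down and max up guarantees decode(encode(b)) contains b: errors only
    // enlarge boxes, so no intersection is ever missed.
    QBox encode(const AABox& parent, const AABox& child)
    {
        QBox q;
        for (int a = 0; a < 3; ++a)
        {
            float scale = 255.0f / (parent.max[a] - parent.min[a]);
            q.min[a] = (uint8_t)std::clamp(std::floor((child.min[a] - parent.min[a]) * scale), 0.0f, 255.0f);
            q.max[a] = (uint8_t)std::clamp(std::ceil ((child.max[a] - parent.min[a]) * scale), 0.0f, 255.0f);
        }
        return q;
    }

    AABox decode(const AABox& parent, const QBox& q)
    {
        AABox b;
        for (int a = 0; a < 3; ++a)
        {
            float step = (parent.max[a] - parent.min[a]) / 255.0f;
            b.min[a] = parent.min[a] + q.min[a] * step;
            b.max[a] = parent.min[a] + q.max[a] * step;
        }
        return b;
    }

    The price is a few extra false-positive box hits from the slightly inflated bounds, which is exactly the trade such formats make.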

    So each node is "larger", but now there is one less level to reach an arbitrary leaf. And since all four children need to be tested anyway, it makes sense to perform a single memory access and then run four tests, instead of four incoherent memory accesses before running four tests.

    We know that the test of a ray against four AABBs is in hardware: there's a dedicated instruction for that. We know that there is no fixed function traversal in hardware.

    I get the sense that algorithms are evolving rapidly...

    I'm thinking about how, in global illumination, you have per-triangle locality when generating bounce rays. So you can start with extremely sparse rays to find triangles that are candidates for GI contribution, and then spawn bounce rays at each triangle.
     
  18. JoeJ

    Makes sense.
    Though it's also possible they store children sequentially in memory and need only a firstChild index with two bits for childCount. Something like the sketch below.
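
    Just my guess at such a layout, nothing confirmed:

    Code:
    #include <cstdint>

    struct AABox { float min[3], max[3]; };

    // Children stored contiguously: one index plus 2 bits of count replaces
    // four child pointers. childCount stores count-1, so 2 bits cover 1..4.
    struct PackedNode
    {
        AABox    bbox;
        uint32_t firstChild : 30;
        uint32_t childCount : 2;
    };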

    I just thought about something else: they store the whole BVH in a texture, so we could use the virtual texturing feature to stream BLAS very easily. And with their support for large BVHs there is no need to resolve indirection through pointers.
    I really think consoles go that route. Still requesting it for PC too...

    Sounds a bit like photon mapping? However, the locality you aim for is a must-have IMO, because modern hardware really wants it. Things like Quake RTX or Metro fail to utilize it, and path tracing in general misses it too. That's a big reason why this classical RT with DXR is so fishy for games.
     
  19. PSman1700

    They must be fooling us - ray traced Quake performance isn't what it could have been, and people are stuck with bad performance.
    People should sue whoever's behind DXR and ray tracing.
     
  20. JoeJ

    You failed to notice my sarcasm in the other post, but I don't fail to detect yours here ;)

    Seriously, there's nothing wrong with brainstorming some optimization ideas about new HW.
    But if you think DXR is just fine as is, then enlighten us about how you think UE5 could support it so easily, or tell me how I should use it on geometry that constantly changes due to LOD...
     
    PSman1700 and milk like this.