AMD Radeon RDNA2 Navi (RX 6500, 6600, 6700, 6800, 6900 XT)

In general, RDNA2 appears to use Level 2 acceleration for ray tracing, while Turing and Ampere are Level 3.
Levels 4 and 5 take more die area and require even more highly specialized units; this is an area where NVIDIA might venture in their upcoming architectures.

https://gfxspeak.com/2020/09/28/the-levels-tracing/
I think both AMD's and nVidia's implementations are Level 3. The difference is in the traversal, not the BVH.
Nope, you should have opened the link too.
  • Level 0: Legacy solutions
  • Level 1: Software on traditional GPUs
  • Level 2: Ray/box and ray/tri-testers in hardware
  • Level 3: Bounding Volume Hierarchy (BVH) processing in hardware
  • Level 4: BVH processing and coherency sorting in hardware
  • Level 5: Coherent BVH processing with Scene Hierarchy Generation (SHG) in hardware
RDNA2 is Level 2, but I'm not sure NVIDIA counts as Level 3 either. Reading through it, that level covers more than just traversal of the BVH tree (which is what NVIDIA has over AMD).
 
Primitive shaders have a hardware-assisted mode that uses the input assembler and a fast-launch mode that looks like compute. Mesh Shaders use the latter mode, and it has existed since Vega, though a few tweaks were needed to support the Mesh Shader API.

AMD abandoned the IA a long time ago, for the better. They experimented a lot over the years and across products with the primitive pipeline. It might not be so visible on the public surface.
I would say this is a fairly good proto-concept for mesh shaders: https://patents.google.com/patent/US20140362081A1/en
Basically, in that iteration there's a large degree of freedom possible in the span from "fetch shader" to "compute vertex front" to "vertex shader", where the purposes of the conceptual stages already overlap. I tend to think of "Primitive Shaders" as a cleanup of the inevitable mess from the experimentation, a consolidation into a cleaner, more unified hardware concept.

Amplification isn't embedded into these as a first-class citizen yet, but if you have no IA, you can do whatever you want through the draw parameters and just treat the whole thing as a procedural generation problem. You can see Amplification being realizable in terms of instancing. It's very interesting how this loops back to the DX9 tessellation add-on, where the amplification basically happened [conceptually] inside the IA and the vertex shader was fed barycentrics. Amazing flexibility.

Looking at the AMD ISA, I feel the hardware can implement a large number of different abstract rasterization pipeline models without much of a problem.

I think this whole history and evolution of the primitive front-end would be a very nice article for Beyond3D. :love:
 
Nvidia is Level 3, as it has hardware acceleration for the BVH traversal process.
Does running it on MIMD cores make it "hardware accelerated" over SIMD cores? Because that's literally the difference: NVIDIA has a MIMD processor in the RT core for traversal, while AMD runs it on SIMD cores.
 
Does running it on MIMD cores make it "hardware accelerated" over SIMD cores? Because that's literally the difference: NVIDIA has a MIMD processor in the RT core for traversal, while AMD runs it on SIMD cores.
“MIMD” is a vague implementation detail of Nvidia’s hardware BVH processor (“RT cores”) by Level 3’s definition.

It is debatable, though, given the loose level definitions. Say the BVH processor is a microprocessor core (as implied by “MIMD”) with special data paths (like many GPU subsystems); you are then free to argue it isn't Level 3, since it is controlled by software/microcode.

Likewise, RDNA 2 accelerates not only the intersection, but also the memory access with its vector gather memory pipeline. So even if it runs the traversal loop in the CUs, one can't truly say it is “just Level 2”, as if there were no hardware acceleration of BVH traversal/walking.
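As a rough sketch of that split, here is what a Level-2-style pipeline looks like in pseudocode-level Python: the traversal loop (stack management, node ordering) is ordinary shader code, while the per-node ray/box and ray/triangle tests are the hardware-accelerated steps. All names and node layouts here are illustrative assumptions, not any vendor's actual format; the leaf test reuses the box test as a stand-in for a real ray/triangle intersector.

```python
def slab_test(origin, inv_dir, box_min, box_max, t_max):
    """Ray/AABB slab test; stands in for the hardware ray/box instruction.

    Returns the entry distance, or None on a miss within [0, t_max].
    """
    t_near, t_far = 0.0, t_max
    for a in range(3):
        t0 = (box_min[a] - origin[a]) * inv_dir[a]
        t1 = (box_max[a] - origin[a]) * inv_dir[a]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return t_near if t_near <= t_far else None

def trace_closest(origin, direction, nodes, root):
    """Software traversal loop calling the 'hardware' tests per node."""
    inv_dir = [1.0 / d if d != 0.0 else float("inf") for d in direction]
    stack = [root]                      # the traversal stack lives in software
    closest_t, closest_prim = float("inf"), None
    while stack:
        node = nodes[stack.pop()]
        if node["leaf"]:
            # Stand-in for the hardware ray/triangle test: this demo leaf
            # just stores its bounds, so the box test is reused here.
            t = slab_test(origin, inv_dir, *node["bounds"], closest_t)
            if t is not None:
                closest_t, closest_prim = t, node["prim"]
        else:
            for child in node["children"]:
                if slab_test(origin, inv_dir, *nodes[child]["bounds"],
                             closest_t) is not None:
                    stack.append(child)
    return closest_t, closest_prim
```

The levels debate is essentially about which parts of this loop move into fixed-function hardware: Level 2 keeps the `while` loop in the shader, Level 3 moves it into the RT unit as well.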
 
This should shed some more light on this:
[Image: image-56-1024x686.png]
 
Does running it on MIMD cores make it "hardware accelerated" over SIMD cores?
It does, because these are specialized MIMD cores with a specialized ISA and likely specialized formats, which offload the main SIMD cores and make traversal as fast as possible (it would be weird to pick that number of cores if they couldn't saturate the intersection units).
On the other hand, there are the general SIMD cores with general formats and precisions; these can be OK for coherent rays but bad at incoherent rays for a million reasons: divergence, memory-boundedness, etc.
General format and precision requirements might cause BVH bloat and increased memory traffic, since one of the reasons specialized HW is so efficient is that it uses the minimal precision for the task and specialized compact formats.

I guess a lot can be debated about Level 4 and Level 5, though.
Coherency sorting doesn't seem to be a solved problem that generalizes well in HW and works for most cases (and it's not really a problem for MIMD cores).
Imagination's point about coherency sorting for better memory accesses is arguable too; that can be handled by better memory-request coalescing, better cache logic, larger caches, etc.
Doing BVH building completely in HW doesn't make a lot of sense if you can do the same efficiently on the SIMDs and hide the processing time in async queues (I don't see modern games with millions of triangles suffering from this).
And the main critique of the article: there is no evidence that the additional levels would bring any performance improvement, while there is evidence (real performance numbers) that the current Level 3 works much better than Level 2.
Making stuff complex (sorting in HW) doesn't always work. Ironically, Imagination's retreat from desktop PCs is the best proof of that statement.
 
AMD abandoned the IA a long time ago, for the better. They experimented a lot over the years and across products with the primitive pipeline. It might not be so visible on the public surface.
I would say this is a fairly good proto-concept for mesh shaders: https://patents.google.com/patent/US20140362081A1/en
Basically, in that iteration there's a large degree of freedom possible in the span from "fetch shader" to "compute vertex front" to "vertex shader", where the purposes of the conceptual stages already overlap. I tend to think of "Primitive Shaders" as a cleanup of the inevitable mess from the experimentation, a consolidation into a cleaner, more unified hardware concept.

Amplification isn't embedded into these as a first-class citizen yet, but if you have no IA, you can do whatever you want through the draw parameters and just treat the whole thing as a procedural generation problem. You can see Amplification being realizable in terms of instancing. It's very interesting how this loops back to the DX9 tessellation add-on, where the amplification basically happened [conceptually] inside the IA and the vertex shader was fed barycentrics. Amazing flexibility.

Looking at the AMD ISA, I feel the hardware can implement a large number of different abstract rasterization pipeline models without much of a problem.

I think this whole history and evolution of the primitive front-end would be a very nice article for Beyond3D. :love:
It depends on what you consider to be the IA. I was referring to hardware that reads the index buffer, forms primitives, and performs vertex reuse. That patent refers to a concept with a feature name called Dispatch Draw. It predated Primitive Shaders and has similarities though it's implemented very differently.
 
Why do you think one triangle per leaf node is mistaken? I can think of some disadvantages but nothing particularly huge as far as I can tell.
 
Exponential increase in BVH memory footprint.


Why exponential? The majority of the BVH is just going to be the raw triangle data, which is a lower bound anyway (40 bytes for 9 floats plus the triangle id). For a packing with N triangles (and assuming each box node has at least 2 children) you need N triangle nodes plus N/2 box nodes plus N/4 box nodes, etc., coming to N triangle nodes + N - 1 box nodes.
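The arithmetic above can be checked directly. A quick sketch, assuming a full binary BVH; the 40-byte triangle size comes from the post, while the 32-byte box node is an assumed size for illustration (real node formats vary by vendor):

```python
import math

def bvh_footprint(n_tris, tris_per_leaf=1, tri_bytes=40, box_bytes=32):
    """Node counts and total bytes for a full binary BVH.

    With L leaves, a full binary tree has L - 1 internal (box) nodes,
    which is the L/2 + L/4 + ... geometric sum from the post.
    """
    n_leaves = math.ceil(n_tris / tris_per_leaf)
    n_boxes = max(n_leaves - 1, 0)
    return n_boxes, n_tris * tri_bytes + n_boxes * box_bytes
```

For a million triangles this gives roughly 72 MB at one triangle per leaf versus 48 MB at four per leaf, so going to one triangle per leaf scales the footprint by a constant factor rather than exponentially, which is the point being made.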
 
Why was this brought up again? There's a dedicated thread for the HU "anti-RT" arguments. Nothing has changed, and neither have your opinions. So why complain about your beloved Nvidia again when it's going to go nowhere?
 