AMD made a mistake letting NVIDIA drive Ray Tracing APIs

MfA

AMD's console dominance gave them the power to drive APIs even without first-mover advantage ... and they should have grasped it, both for their own good and for the good of game developers.

The way AMD implemented ray tracing through "just" an intersection instruction might be harder to scale up to NVIDIA's performance, but if properly exposed it can do what RTX can't ... be a natural part of compute based rendering. With Nanite hitting, the importance of that is huge IMO. The right way would have been to add Vulkan shader/compute extensions for their intersection instructions, maybe also for ray differentials since NVIDIA didn't (implemented with microcode, but leaving the door open for a hardware implementation). The existing API is fine for lazy devs, but it's a bad fit for developers who want to explore non-standard acceleration structures and geometry representations.
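To make "properly exposed" concrete: a compute pass only needs to see something with these semantics, here a ray/box test against a BVH node. This is a plain C++ stand-in with made-up names (a standard slab test), not AMD's actual instruction or data layout:

```cpp
#include <algorithm>

struct Ray  { float ox, oy, oz; float dx, dy, dz; float tMax; };
struct AABB { float min[3], max[3]; };

// Hypothetical semantics for a compute-visible ray/box test: returns true and
// the entry distance if the ray hits the box within [0, tMax]. This is a plain
// software slab test standing in for what the hardware instruction would do.
bool intersectBox(const Ray& r, const AABB& b, float& tEntry)
{
    float t0 = 0.0f, t1 = r.tMax;
    const float o[3] = { r.ox, r.oy, r.oz };
    const float d[3] = { r.dx, r.dy, r.dz };
    for (int axis = 0; axis < 3; ++axis) {
        float inv   = 1.0f / d[axis];               // sketch assumes non-zero direction components
        float tNear = (b.min[axis] - o[axis]) * inv;
        float tFar  = (b.max[axis] - o[axis]) * inv;
        if (tNear > tFar) std::swap(tNear, tFar);
        t0 = std::max(t0, tNear);
        t1 = std::min(t1, tFar);
        if (t0 > t1) return false;                  // slabs don't overlap: miss
    }
    tEntry = t0;
    return true;
}
```

A triangle variant returning the hit distance plus barycentrics is the other half; the point is that both would be ordinary things a shader can call from anywhere, not steps locked inside a fixed ray tracing pipeline.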

The way NVIDIA is driving ray tracing is boring and proprietary; it will eventually get where it needs to go, but at a drip-feed pace ... Micro-Mesh is a good example of everything wrong with it: why the hell did that need an entirely new hardware generation? Give Brian Karis & co the tools they need.

AMD let themselves be maneuvered into a second CUDA trap, but this time they had the power to forge their own way ... between consoles giving them market power and Nanite showing the market need.
 
Nothing is stopping AMD from exposing a compute based RT API and proving its viability. Do the consoles do RT using compute or dedicated traversal / intersection transistors?

A shader extension for Micro-Mesh would make no sense as you can already write a hit shader to do the same thing. In fact you “have” to do it in compute on RDNA 2/3/Turing/Ampere.

Intrepid developers are also free to write pure compute RT implementations that bypass the hardware completely and they’re free to use whatever data structures they like. Just like Epic wrote a compute based rasterizer for Nanite you can write a compute based raytracer for Nanite. It’ll just be slow as balls.
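To be clear about what "pure compute" means in practice, it's roughly this shape: a stack-based traversal over whatever acceleration structure you invent, running as ordinary shader code. Plain C++ sketch with an invented node layout, purely for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative node layout (not AMD's, not DXR's): internal nodes store two
// children, leaves store a triangle range into some external triangle array.
struct Node {
    float   bmin[3], bmax[3];
    int32_t left  = -1;       // index of left child, -1 if leaf
    int32_t right = -1;       // index of right child, -1 if leaf
    int32_t firstTri = 0;     // leaf only: first triangle
    int32_t triCount = 0;     // leaf only: number of triangles
};

struct Ray { float o[3], d[3]; float tMax; };

static bool hitBox(const Ray& r, const Node& n)
{
    float t0 = 0.0f, t1 = r.tMax;
    for (int a = 0; a < 3; ++a) {
        float inv = 1.0f / r.d[a];          // sketch assumes non-zero direction components
        float tn  = (n.bmin[a] - r.o[a]) * inv;
        float tf  = (n.bmax[a] - r.o[a]) * inv;
        if (tn > tf) std::swap(tn, tf);
        t0 = std::max(t0, tn);
        t1 = std::min(t1, tf);
    }
    return t0 <= t1;
}

// Visit every leaf the ray touches; 'visitLeaf' would do the triangle tests.
template <typename LeafFn>
void traverse(const std::vector<Node>& nodes, const Ray& ray, LeafFn visitLeaf)
{
    int32_t stack[64];                     // fixed-depth stack is plenty for a sketch
    int sp = 0;
    stack[sp++] = 0;                       // start at the root
    while (sp > 0) {
        const Node& n = nodes[stack[--sp]];
        if (!hitBox(ray, n)) continue;
        if (n.left < 0) {                  // leaf: hand its triangles to the caller
            visitLeaf(n.firstTri, n.triCount);
        } else {                           // internal: push both children
            stack[sp++] = n.left;
            stack[sp++] = n.right;
        }
    }
}
```

Every line of that fights the SIMDs for ALU and registers, which is exactly where the "slow as balls" comes from.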
 
Intrepid developers are also free to write pure compute RT implementations that bypass the hardware completely and they’re free to use whatever data structures they like. Just like Epic wrote a compute based rasterizer for Nanite you can write a compute based raytracer for Nanite. It’ll just be slow as balls.
Not only slower but lower quality, too. Lumen has a nice disco effect indoors, which makes it worse than Metro Exodus EE's RTGI implementation. Reflections are bad as well. The software solution in Fortnite simply lags behind any implementation in AAA games.
 
The BVH/tri intersection instruction isn't exposed to developers, except on Linux if you want to get down to the driver.

RDNA 2 already had the BVH/tri intersection instruction. That nothing stops AMD from exposing it leads you back to the title of the thread.
 
The existing API is fine for lazy devs, but it's a bad fit for developers who want to explore non-standard acceleration structures and geometry representations.
Please, we've had enough of developers claiming to be code masters, only for their ports to come out with subpar performance and compatibility. Eventually those very same developers abandon their projects and do the most half-assed job possible.

Just look at the mess DX12 created in the PC space for almost 10 years: stutters, constant long shader compilation, and bad memory management plague most DX12 PC ports, and developers are directly to blame. They claimed they could do better, but they couldn't deliver. So excuse me for not trusting a single word they say.
 
The BVH/tri intersection instruction isn't exposed to developers, except on Linux if you want to get down to the driver.

RDNA 2 already had the BVH/tri intersection instruction. That nothing stops AMD from exposing it leads you back to the title of the thread.
There is an AMD GPU Services library that allows developers to call Radeon shader intrinsics directly, but there are no intersection testing intrinsics defined yet. Why don't you ask AMD to add them?
 
The BVH/tri intersection instruction isn't exposed to developers, except on Linux if you want to get down to the driver.

It's an interesting idea but it's not clear how it helps. If we take texturing as an analogy, TMUs are mostly a black box and we don't give developers free rein to define their own texture formats or filtering logic, and it seems to work just fine because the black box gives you speed. You need to lock down certain parameters in order to enable hardware acceleration.

A BVH traversal instruction by itself isn't useful without defining the BVH data structure, which is basically what's already happening today with DXR. A triangle intersection instruction is probably easier since everyone already agrees on the definition of a triangle, but it will be almost impossible for developers to beat hardware at that task for any given batch of triangles. Hardware has a massive advantage there in data management via compression, caching, sorting and coalescing data accesses, etc. DXR allows you to write your own intersection shader if you want to use something besides triangles, but nobody is doing that for obvious reasons.

Hardware acceleration and freedom are competing priorities. If freedom is the goal there's always the pure compute option with the resulting tradeoff in performance.
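For what it's worth, this is the scale of thing a custom intersection shader does for a non-triangle primitive: a simple analytic ray/sphere test. Sketched here in plain C++ rather than HLSL, purely to illustrate the kind of math involved:

```cpp
#include <cmath>

struct Ray    { float o[3], d[3]; float tMin, tMax; };
struct Sphere { float c[3], r; };

// The kind of analytic test a custom intersection shader performs for a
// procedural (non-triangle) primitive. Returns the nearest hit distance in
// [tMin, tMax], or false on a miss. Software sketch, not a DXR shader.
bool intersectSphere(const Ray& ray, const Sphere& s, float& tHit)
{
    float oc[3] = { ray.o[0] - s.c[0], ray.o[1] - s.c[1], ray.o[2] - s.c[2] };
    float a = ray.d[0]*ray.d[0] + ray.d[1]*ray.d[1] + ray.d[2]*ray.d[2];
    float b = 2.0f * (oc[0]*ray.d[0] + oc[1]*ray.d[1] + oc[2]*ray.d[2]);
    float c = oc[0]*oc[0] + oc[1]*oc[1] + oc[2]*oc[2] - s.r * s.r;
    float disc = b*b - 4.0f*a*c;
    if (disc < 0.0f) return false;          // ray misses the sphere entirely
    float sq = std::sqrt(disc);
    float t  = (-b - sq) / (2.0f * a);      // try the near root first
    if (t < ray.tMin) t = (-b + sq) / (2.0f * a);
    if (t < ray.tMin || t > ray.tMax) return false;
    tHit = t;
    return true;
}
```

The freedom exists in the API today; what you give up is the fixed-function traversal and intersection rate.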
 

This analysis tells me that the only result of giving AMD the capability to define DXR on their own would be a severe limitation on what RT h/w other IHVs could implement in their chips - meaning less performance, basically, for the sole benefit of AMD's RT h/w not being the slowest.
 
Which wasn't what I was suggesting at all; that was a done deal anyway. They should have exposed their hardware in Vulkan extensions and fought to have it exposed in DirectX.

Just for instance, the intersection instruction would probably trivially fit in the Nanite inner loop ... no ray tracing even necessary to make it useful.
 
A mistake.
What if there is actually a reason, like the lack of apparent benefits, or such an option not being a good fit for where AMD themselves are planning to evolve their RT h/w?

I very much doubt that any of these huge corporations make such "mistakes" without something to back these decisions up. You could argue that the whole idea of DXR was a "mistake", but then there would have to be something to prove why.
 
They are fighting from behind; sacrificing the present for the future is not really a luxury they can afford.

Carving out a niche is more important ... in this case, the Nanite niche.
 
They are fighting from behind; sacrificing the present for the future is not really a luxury they can afford.

How many years should real-time RT have been delayed in order to enable this better future? What guarantee is there that this better future would ever come?
 
It's an open question whether it would delay it at all. At the end of the day tessellation and even geometry shaders delayed dicing rather than helping it ... the solution was compute all along.

RTX and Nanite are on opposite sides of a spectrum. NVIDIA has got RTX locked up, AMD should pick the other side. Maybe Brian can use the instruction not just to speed up Nanite but find better ways to do ray tracing too.
 
It's an open question whether it would delay it at all. At the end of the day tessellation and even geometry shaders delayed dicing rather than helping it ... the solution was compute all along.

DX11 tessellation didn't delay Nanite. If Nanite had been possible via compute on Fermi back in 2010, Epic wouldn't have waited 10 years to do it. It seems far-fetched to expect that, with the same transistor budget, Turing could have delivered a more programmable version of RT in 2018 with usable performance. AMD spent even fewer transistors several years later and didn't change the game from a programmability or performance perspective. Because it wasn't possible.

RTX and Nanite are on opposite sides of a spectrum. NVIDIA has got RTX locked up, AMD should pick the other side. Maybe Brian can use the instruction not just to speed up Nanite but find better ways to do ray tracing too.

If raytracing against Nanite clusters was practical Nvidia would also call that "RTX". Let's say AMD agrees and exposes a freely available "intersect triangle" instruction. Is there any reason to think the RDNA 2/3 hardware implementation of triangle intersection can be used efficiently with arbitrary triangle data provided by Nanite? Before you even get there you have to solve for traversing the Nanite cluster hierarchy. There's no hardware from any IHV to help with that.

A good starting point would be to write a pure compute based Nanite raytracing solution. I don't think an individual Nanite node is tree based so tracing rays through it is probably horrendously slow. Either way you need to select the right cluster and its LOD for the hit point that you're shading. Epic has already said that Nanite is too dense to just toss the triangles into a DXR BVH because current hardware traversal is too slow for that much geometry. Once that's figured out you would then need to figure out how to export the hit triangles in a way that's friendly to AMD's triangle intersection hardware. Only after doing that compute based proof-of-concept would it be reasonable to ask AMD to expose the instruction.
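Sketching the "select the right cluster and its LOD for the hit point" step, under the usual projected-error idea (walk down a cluster hierarchy until the node's geometric error is sub-pixel at the hit distance). The hierarchy layout and error metric below are generic placeholders for illustration, not Epic's actual data structures:

```cpp
#include <cmath>
#include <vector>

// Placeholder hierarchy node: coarser LODs sit nearer the root with larger
// geometric error, finer LODs below them with smaller error.
struct ClusterNode {
    float geometricError;                     // world-space simplification error of this LOD
    int   children[4] = { -1, -1, -1, -1 };   // -1 = no child (leaf = finest LOD)
    int   clusterId   = -1;                   // payload: which cluster to trace/shade
};

// Projected size in pixels of a world-space error at distance 'dist',
// for a camera with the given vertical resolution and field of view.
static float projectedErrorPixels(float error, float dist, float fovY, float resY)
{
    return (error / dist) * (resY / (2.0f * std::tan(fovY * 0.5f)));
}

int selectLod(const std::vector<ClusterNode>& nodes, int root,
              float hitDist, float fovY, float resY, float pixelThreshold = 1.0f)
{
    int cur = root;
    for (;;) {
        const ClusterNode& n = nodes[cur];
        bool coarseEnough = projectedErrorPixels(n.geometricError, hitDist, fovY, resY)
                            <= pixelThreshold;
        if (coarseEnough || n.children[0] < 0)
            return n.clusterId;        // this LOD is fine for shading the hit point
        cur = n.children[0];           // refine (sketch only: a real cut would pick
                                       // the children covering the hit, not just the first)
    }
}
```

Only once something like that works end to end does it make sense to ask how the resulting triangles should be fed to the intersection hardware.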

That compute-first approach is exactly what Nvidia did with OptiX. They built compute based raytracing pipelines for years before doing hardware acceleration. Any developer or IHV that believes in a compute based RT future has all the tools available to them today to make their point. The fact that nobody is actually doing that speaks volumes. Again, results do matter.
 
Let's say AMD agrees and exposes a freely available "intersect triangle" instruction. Is there any reason to think the RDNA 2/3 hardware implementation of triangle intersection can be used efficiently with arbitrary triangle data provided by Nanite?

Sure, it just intersects a ray with a triangle to get you a barycentric coordinate.

I'm not suggesting ray tracing the Nanite data structure, I'm just suggesting using the instruction to get the barycentric coordinate for the visibility buffer. That's the beauty of compute, flexibility.
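To be concrete, the instruction only has to hand you something equivalent to this (standard Moller-Trumbore, plain C++ just to show the outputs; the hardware version obviously does it far cheaper per test):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b)   { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static Vec3  cross(Vec3 a, Vec3 b) { return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x }; }
static float dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Standard Moller-Trumbore: ray (o, d) against triangle (v0, v1, v2).
// Outputs the hit distance t and the barycentrics (u, v) you would store in a
// visibility buffer to interpolate attributes later. Software stand-in for
// what a single hardware intersection test returns.
bool rayTriangleBarycentric(Vec3 o, Vec3 d, Vec3 v0, Vec3 v1, Vec3 v2,
                            float& t, float& u, float& v)
{
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    Vec3 p  = cross(d, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < 1e-7f) return false;     // ray parallel to triangle
    float inv = 1.0f / det;
    Vec3 s = sub(o, v0);
    u = dot(s, p) * inv;
    if (u < 0.0f || u > 1.0f) return false;
    Vec3 q = cross(s, e1);
    v = dot(d, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(e2, q) * inv;
    return t > 0.0f;
}
```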
 
Sure, it just intersects a ray with a triangle to get you a barycentric coordinate.

I'm not suggesting ray tracing the Nanite data structure, I'm just suggesting using the instruction to get the barycentric coordinate for the visibility buffer. That's the beauty of compute, flexibility.

Ah, you're proposing that AMD looks for a way to accelerate the Nanite software rasterizer. That would be smart. Not sure if there's a ton of opportunity for hardware to help though.

Where would you stick the intersection instruction?

[Attached image: nanite-rasterizer.png — Nanite software rasterizer code]
 
You start one loop up: just take all the pixels in the screen space bounding box and do one intersection per pixel.
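Roughly like this, with a plain C++ stand-in for the intersection instruction (rayTri below is the same Moller-Trumbore as in the earlier sketch; makeRayOrigin/makeRayDir are assumed helpers that build the per-pixel eye ray):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };
struct Tri  { Vec3 v0, v1, v2; uint32_t id; };

static Vec3  sub(Vec3 a, Vec3 b)   { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static Vec3  cross(Vec3 a, Vec3 b) { return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x }; }
static float dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Software stand-in for the hardware test: Moller-Trumbore with barycentrics.
static bool rayTri(Vec3 o, Vec3 d, const Tri& t, float& tHit, float& u, float& v)
{
    Vec3 e1 = sub(t.v1, t.v0), e2 = sub(t.v2, t.v0);
    Vec3 p = cross(d, e2); float det = dot(e1, p);
    if (std::fabs(det) < 1e-7f) return false;
    float inv = 1.0f / det; Vec3 s = sub(o, t.v0);
    u = dot(s, p) * inv; if (u < 0.0f || u > 1.0f) return false;
    Vec3 q = cross(s, e1);
    v = dot(d, q) * inv; if (v < 0.0f || u + v > 1.0f) return false;
    tHit = dot(e2, q) * inv; return tHit > 0.0f;
}

struct VisSample { float depth; float u, v; uint32_t triId; };

// "One loop up": for every pixel in the triangle's screen-space bounding box,
// build one eye ray through the pixel and do one intersection, then depth-test
// into the visibility buffer (assumed cleared to depth = FLT_MAX beforehand).
void binTriangle(const Tri& tri, int x0, int y0, int x1, int y1, int width,
                 std::vector<VisSample>& visBuffer,
                 Vec3 (*makeRayOrigin)(int, int), Vec3 (*makeRayDir)(int, int))
{
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x) {
            float t, u, v;
            if (!rayTri(makeRayOrigin(x, y), makeRayDir(x, y), tri, t, u, v))
                continue;                              // pixel is outside the triangle
            VisSample& dst = visBuffer[y * width + x];
            if (t < dst.depth)                         // nearest surface wins
                dst = { t, u, v, tri.id };
        }
}
```

A real version would do the depth test with an atomic on a packed depth+payload value, as Nanite's software rasterizer does, and take the bounding box from the triangle setup stage; the point here is just where the per-pixel intersection slots in.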
 
That would be an interesting face-off. Each CU does 1 hardware triangle intersection per clock, so it's starting off with a 64x disadvantage vs the SIMDs.
 