> A BVH traversal instruction by itself isn't useful without defining the BVH data structure, which is basically what's already happening today with DXR.

The data structure isn't defined. It's a blackbox lacking any specification and access. That's the whole problem with DXR.
Hardware acceleration and freedom are competing priorities. If freedom is the goal there's always the pure compute option with the resulting tradeoff in performance.
> This analysis tells me that the only result of giving AMD the capability to define DXR on their own would be a severe limitation on what RT h/w other IHVs can implement in their chips - meaning less performance, basically, for the only benefit of AMD's RT h/w not being the slowest.

Didn't read the analysis, but anyway - it depends.
> Maybe Brian can use the instruction not just to speed up Nanite but find better ways to do ray tracing too.

They had no RT in mind, but the obvious goal would be to use the Nanite BVH for RT too, eliminating building costs completely.
> DX11 tessellation didn't delay Nanite. If Nanite was possible via compute on Fermi back in 2010, Epic wouldn't have waited 10 years to do it.

I think Karis has mentioned the development of Nanite itself already took something like 10 years.
> Intent to switch to fixed function traversal in the future.

Yeah, I'm kinda surprised that AMD is improving their ray tracing capabilities on the hardware side. Eventually their ray accelerators will perform much like fixed-function hardware.
Not sure I agree with this. Having access to the model that is trained and the data that is used for training are 2 completely different things. Trained algorithms are always by definition blackboxed. You have no access to the original data, only the final model. You cannot manually manipulate the weights on a model to get a better response; only retraining the model will get a different model that works.
> Eventually their ray accelerators will perform much like fixed-function hardware.

They perform like FF h/w right now. The difference is that their capabilities are limited when compared to what "RT cores" are able to do in other h/w.
Being first has a lot of drawbacks too, especially during the research phase. I don't know if AMD wants to, or can, be a pipe cleaner in the GPU market.
The data structure isn't defined. It's a blackbox lacking any specification and access. That's the whole problem with DXR.
Nobody requests a programmable texture filter or triangle ray intersection function. Those are small building blocks requiring no modifications. But this does not hold for acceleration structures, or just any form of data in general.
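To make that distinction concrete, here is a plain C++ sketch of a ray/triangle test (Möller–Trumbore). It's a tiny, stateless function with no persistent data format behind it, which is why nobody needs it to be programmable - unlike the acceleration structure, which is data that applications build and keep.

```cpp
// A ray/triangle intersection is a small, self-contained building block:
// pure math over the inputs, no data structure that anyone needs to own.
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b)   { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static Vec3  cross(Vec3 a, Vec3 b) { return { a.y * b.z - a.z * b.y,
                                              a.z * b.x - a.x * b.z,
                                              a.x * b.y - a.y * b.x }; }
static float dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns true and the hit distance t if the ray (orig, dir) hits triangle (v0, v1, v2).
bool RayTriangle(Vec3 orig, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, float& t)
{
    const float eps = 1e-7f;
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    Vec3 p = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < eps) return false;   // ray parallel to triangle plane
    float invDet = 1.0f / det;
    Vec3 s = sub(orig, v0);
    float u = dot(s, p) * invDet;
    if (u < 0.0f || u > 1.0f) return false;
    Vec3 q = cross(s, e1);
    float v = dot(dir, q) * invDet;
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(e2, q) * invDet;
    return t > eps;
}
```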
This is the first time an API puts a blackbox around data. : (
However, that's just about ideology.
Because this whole argument that blackboxed BVH data is required so IHVs can optimize for performance is wrong, and completely misses the point.
All we want is for them to specify their data structures, so we can access, generate, and modify them. Respecting whatever requirements or preferences the HW has.
So please do just that. And if we then end up with 10 different data structures for all the chip gens already out, we may - eventually - complain that's a mess and too much to handle. And we may request an API to manage this diversity.
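To make the ask concrete, here is a hedged sketch of what "specify the data structure" could mean in practice. Everything below is invented for illustration; no real vendor format is known to look like this.

```cpp
// Made-up example of a vendor-documented BVH node layout. With a published
// layout like this, a compute shader or CPU tool could read, generate, or
// patch the structure directly (e.g. swap a subtree when a model changes its
// level of detail) while still feeding the fixed-function traversal hardware.
#include <cstdint>

struct VendorBvhNode {
    float    boundsMin[3];   // child bounding box
    float    boundsMax[3];
    uint32_t leftChild;      // node index, or triangle offset if leaf
    uint32_t rightChild;
    uint32_t flags;          // leaf/internal, primitive count, etc.
};

// With the layout known, refitting a parent after its children changed is trivial:
inline void RefitNode(VendorBvhNode& parent,
                      const VendorBvhNode& left,
                      const VendorBvhNode& right)
{
    for (int i = 0; i < 3; ++i) {
        parent.boundsMin[i] = left.boundsMin[i] < right.boundsMin[i]
                            ? left.boundsMin[i] : right.boundsMin[i];
        parent.boundsMax[i] = left.boundsMax[i] > right.boundsMax[i]
                            ? left.boundsMax[i] : right.boundsMax[i];
    }
}
```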
The freedom we can have from flexible compute is not an argument or excuse to justify the blackbox. It's the opposite: the progress we got from programmable GPUs clearly shows that this flexibility is needed, and is even more important than raw performance in the long run.
Imagine there had never been any cuda, compute shaders, or ML acceleration, but NV had a blackboxed DLSS running on the same HW, just not accessible to anyone else.
There would be no Nanite or ChatGPT, just DLSS. (And NV would not be as successful as they are either.)
But right now we don't know anything about how diverse or complex this is. We don't know if improvements are currently possible or impractical for every IHV. There is no comment on the topic from IHVs or API designers.
The only thing we know for sure is that there are hard API limitations, preventing any traceable geometry with gradual level of detail. So the current status quo is not acceptable in the long run.
Thus any defense of this status quo is against the progress of raytracing.
> Not sure I agree with this. Having access to the model that is trained and the data that is used for training are 2 completely different things. Trained algorithms are always by definition blackboxed. You have no access to the original data, only the final model. You cannot manually manipulate the weights on a model to get a better response; only retraining the model will get a different model that works.

Yeah, but for my argument this does not matter. What i mean is: Cuda allowed people to get massive speedups for all kinds of applications which can be parallelized. ML coincidentally was one of them, and only after Cuda we saw the rise of deep neural networks. But Cuda was not built or tailored towards ML applications. Specialized ML HW came only after this demand had shown up. If there was no Cuda, rapid progress on deep learning would not have happened, and so there would be no ChatGPT either.
> I suspect you're underestimating the complexity involved in exposing the internal BVH structure.

No. I even assume the worst case, which is that NV uses treelets for compression and optimization. Treelets would mean that altering the BVH can't be done at node granularity, but only on larger branches of the tree.
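For context, a purely speculative sketch of what such a treelet layout might look like (the actual encoding, if NV uses one at all, is not public): because child boxes are quantized against a shared local frame, touching one node means re-encoding the whole group rather than patching a single node.

```cpp
// Purely illustrative guess at a compressed treelet layout - not any real
// vendor format. Child AABBs are quantized relative to a shared treelet
// origin/scale, so changing one node's bounds forces re-encoding the whole
// treelet, which is why edits only work at branch granularity.
#include <cstdint>

struct QuantizedChild {
    uint8_t  lo[3];        // AABB min, quantized against the treelet frame
    uint8_t  hi[3];        // AABB max
    uint16_t flags;        // leaf / internal
    uint32_t childIndex;   // next treelet or triangle range
};

struct Treelet {
    float          origin[3];   // shared decompression frame for all children
    float          scale[3];
    QuantizedChild children[8]; // e.g. an 8-wide node group
};
```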
> But you're complaining that today's solutions are bad without offering an alternative that is possible "today" with the same available resources.

I just did. Releasing specifications has no cost on resources, performance, money, or anything.
> Yeah, but for my argument this does not matter. What i mean is: Cuda allowed people to get massive speedups for all kinds of applications which can be parallelized. ML coincidentally was one of them, and only after Cuda we saw the rise of deep neural networks. But Cuda was not built or tailored towards ML applications. Specialized ML HW came only after this demand had shown up. If there was no Cuda, rapid progress on deep learning would not have happened, and so there would be no ChatGPT either.

No worries. It was actually the opposite. Data scientists were struggling with throughput, so they leveraged pixel shaders to do math. Nvidia saw this and built CUDA to make it simpler. Then everyone went straight to CUDA. They changed the hardware several times on Nvidia machines to support faster ML processing before they eventually moved to tensor cores, which only accelerate neural network type models. We still use the standard compute cores for everything else like regression etc.
But correct me if i'm wrong about this Cuda - ML relation. I do not really know much about the history of ML.
All i want is that every IHV specifies its custom format, so i can write vendor-specific code implementing the same required functionality for all of them.
Releasing specifications has no cost on resources, performance, money, or anything.
> I don’t disagree about your issues with DXR, but it’s just we may need to start with fixed acceleration until hardware is fast enough to break out of it.

I don't think it's much more fixed for NVIDIA than it is for AMD; we just know what the underlying programmable hardware looks like for AMD.
> I don't think it's much more fixed for NVIDIA than it is for AMD; we just know what the underlying programmable hardware looks like for AMD.

Yea, at the end of the day both can launch Ray Intersection hardware from the graphics and compute pipeline, and DXR 1.1 allows an inline shader invocation from both graphics and compute. Overall, I consider this pretty good for starting; I mean, DXR will likely look very different 10 years from now. I see this as being flexible, so I would agree with you, it's not entirely stuck in the graphics pipeline, which is a good thing.
Even when it's a couple of times as fast as AMD's, AABB/tri intersection blocks are still quite slow in the grand scheme of things ... it doesn't make much sense to put them entirely in a fixed function pipeline. Putting some programmable hardware around them is relatively cheap.
Is this about "software-based RT vs hw-accelerated fixed function blocks"?
> I don’t disagree about your issues with DXR, but it’s just we may need to start with fixed acceleration until hardware is fast enough to break out of it.

So i'm still the compute warrior, who hates any fixed function units and wishes to replace them with programmable units?
> So i'm still the compute warrior, who hates any fixed function units and wishes to replace them with programmable units?

Lol. I hear you, but perhaps I don't understand where the issue is.
No! I'm still misunderstood, sigh.
One more time: Fixed function acceleration is fine. No need to change it, or anything related to the HW. There is not even a need to change the API, but ideally they add query functions to get specifics about the BVH data structure of the installed GPU.
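Something along these lines, purely hypothetical and not part of DXR or any real driver API (all names are invented), is all that's being asked for:

```cpp
// Imaginary API sketch - nothing like this exists in DXR today. The idea:
// the driver reports how its BVH nodes are laid out, so applications can
// read or patch the acceleration structure themselves while still feeding
// the fixed-function traversal hardware.
#include <cstdint>

struct BvhLayoutInfo {
    uint32_t nodeSizeBytes;        // stride of one internal node
    uint32_t childrenPerNode;      // branching factor (2, 4, 8, ...)
    uint32_t childBoxOffsetBytes;  // where child AABBs sit inside a node
    uint32_t childPtrOffsetBytes;  // where child indices sit
    bool     boxesAreCompressed;   // edits may require re-encoding a block
};

// Hypothetical device query, in the spirit of CheckFeatureSupport:
// bool QueryAccelerationStructureLayout(Device* device, BvhLayoutInfo* out);
```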
I remember, initially i was not too excited about the fixed function and single threaded ray traversal of RTX. I would have expected something like packet traversal per workgroup.
And i have complained that NV has kind of stolen our task to make RT faster, so we can't do any meaningful research on this problem anymore on our side.
But that was years ago, shortly after DXR / RTX was announced. Back then i was not yet aware of what the REAL problem with DXR is.
I did not yet realize that DXR prevents progress with LOD, although i had already worked on that myself back then.
And i really should have known, because years before, it was me correcting people when they said 'RT in realtime does not work because building the BVH is too expensive'. I responded with 'Just build your BVH offline per model. At runtime you then only need to build the BVH over all models, not over all triangles. The cost is low, so realtime RT will come.'
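A minimal sketch of that two-level idea, as my own illustration rather than DXR code: the per-model hierarchies are baked once, and the per-frame work only touches instances, which is why the cost stays low - and also why the scheme breaks down once a model's own geometry must change with LOD, as described below.

```cpp
// Two-level scheme sketch: per-model BVHs are built offline once, and per
// frame only a small top level over the model instances is rebuilt, so
// runtime cost scales with instance count rather than triangle count.
#include <algorithm>
#include <vector>

struct Aabb { float lo[3], hi[3]; };

struct ModelBvh {              // built offline per model ("BLAS" in DXR terms)
    std::vector<Aabb> nodes;   // triangle-level hierarchy, baked once
};

struct Instance {
    const ModelBvh* model;     // shared static geometry hierarchy
    Aabb worldBounds;          // bounds of the transformed instance
};

// Degenerate "top level" for illustration: just the union of instance bounds.
// A real build would recursively partition the instances into a small tree,
// which is still cheap because there are few instances compared to triangles.
Aabb BuildTopLevelBounds(const std::vector<Instance>& instances) {
    Aabb root = { { 1e30f, 1e30f, 1e30f }, { -1e30f, -1e30f, -1e30f } };
    for (const Instance& inst : instances)
        for (int i = 0; i < 3; ++i) {
            root.lo[i] = std::min(root.lo[i], inst.worldBounds.lo[i]);
            root.hi[i] = std::max(root.hi[i], inst.worldBounds.hi[i]);
        }
    return root;
}
```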
Maybe it was only after Nanite was shown that i realized that this static BLAS / dynamic TLAS solution i initially liked breaks with LOD.
Karis knew it. His early Twitter response was 'and how to do LOD with DXR?', and he was not happy. I understood what he really meant only much later, and that i had the exact same problem.
So eventually, when the API was decided, there were not enough people (or nobody at all) around who saw the flaw.
Maybe the failure was not shortsighted intent, but indeed just human error. Maybe, if they had known, MS would have defined data structure specifications or even a uniform standard, as they usually do.
Idk. But when i presently talk about 'more flexibility for RT!', i really mean the flexibility we would get from access to data structures.
I do not mean the flexibility we would get from programming traversal or intersections. It's not needed. HW accelerate the fuck out of that - make it a single cycle op - i'm fine with it.
I really think this acceleration structure data is ours, not theirs. It's my data - i want access to it. So i'm not the compute warrior, but maybe the Edward Snowden of RT, hihihi ;D
In my understanding, the fixed-function hardware for traversal and intersection should also require a very specific data structure?
If so, is modification of the structure in ways that the silicon isn't meant to handle going to result in a crash?
In some ways the two are unlikely to be decoupled, I think.