> A BVH traversal instruction by itself isn't useful without defining the BVH data structure, which is basically what's already happening today with DXR.

The data structure isn't defined. It's a blackbox lacking any specification and access. That's the whole problem with DXR.
Hardware acceleration and freedom are competing priorities. If freedom is the goal there's always the pure compute option with the resulting tradeoff in performance.
> This analysis tells me that the only result of giving AMD the capability to define DXR on their own would be a severe limitation on what RT h/w other IHVs can implement in their chips - meaning less performance, basically, for the only benefit of AMD's RT h/w not being the slowest.

Didn't read the analysis, but anyway - it depends.
> Maybe Brian can use the instruction not just to speed up Nanite but find better ways to do ray tracing too.

They had no RT in mind, but the obvious goal would be to use the Nanite BVH for RT too, eliminating building costs completely.
> DX11 tessellation didn't delay Nanite. If Nanite was possible via compute on Fermi back in 2010, Epic wouldn't have waited 10 years to do it.

I think Karis has mentioned the development of Nanite itself already took something like 10 years.
> Intent to switch to fixed function traversal in the future.

Yeah, I'm kinda surprised that AMD is improving their ray tracing capabilities on the hardware side. Eventually their ray accelerators will perform much like fixed-function hardware.
Not sure I agree with this. Having access to the model that is trained and the data that is used for training are 2 completely different things. Trained algorithms are always by definition blackboxed. You have no access to the original data, only the final model. You cannot manually manipulate the weights on a model to get a better response; only retraining the model will get a different model that works.
> Eventually their ray accelerators will perform much like fixed-function hardware.

They perform like FF h/w right now. The difference is that their capabilities are limited when compared to what "RT cores" are able to do in other h/w.
Being first has a lot of drawbacks too, especially during the research phase. I don't know if AMD wants to, or can, be a pipe cleaner in the GPU market.
The data structure isn't defined. It's a blackbox lacking any specification and access. That's the whole problem with DXR.
Nobody requests a programmable texture filter or triangle ray intersection function. Those are small building blocks requiring no modifications. But this does not hold for acceleration structures, or just any form of data in general.
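To make that distinction concrete, here is a plain C++ sketch of a ray/triangle test (Möller–Trumbore). It's a tiny, stateless function with no persistent data format behind it, which is why nobody needs it to be programmable - unlike the acceleration structure, which is data that applications build and keep.

```cpp
// A ray/triangle intersection is a small, self-contained building block:
// pure math over the inputs, no data structure that anyone needs to own.
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b)   { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static Vec3  cross(Vec3 a, Vec3 b) { return { a.y * b.z - a.z * b.y,
                                              a.z * b.x - a.x * b.z,
                                              a.x * b.y - a.y * b.x }; }
static float dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns true and the hit distance t if the ray (orig, dir) hits triangle (v0, v1, v2).
bool RayTriangle(Vec3 orig, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, float& t)
{
    const float eps = 1e-7f;
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    Vec3 p = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < eps) return false;   // ray parallel to triangle plane
    float invDet = 1.0f / det;
    Vec3 s = sub(orig, v0);
    float u = dot(s, p) * invDet;
    if (u < 0.0f || u > 1.0f) return false;
    Vec3 q = cross(s, e1);
    float v = dot(dir, q) * invDet;
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(e2, q) * invDet;
    return t > eps;
}
```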
This is the first time an API puts a blackbox around data. : (
However, that's just about ideology.
Because this whole argument that blackboxed BVH data is required so IHVs can optimize for performance is wrong, and completely misses the point.
All we want is for them to specify their data structures, so we can access, generate, and modify them. Respecting whatever requirements or preferences the HW has.
So please do just that. And if we then end up with 10 different data structures for all the chip gens already out, we may - eventually - complain that's a mess and too much to handle. And we may request an API to manage this diversity.
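To make the ask concrete, here is a hedged sketch of what "specify the data structure" could mean in practice. Everything below is invented for illustration; no real vendor format is known to look like this.

```cpp
// Made-up example of a vendor-documented BVH node layout. With a published
// layout like this, a compute shader or CPU tool could read, generate, or
// patch the structure directly (e.g. swap a subtree when a model changes its
// level of detail) while still feeding the fixed-function traversal hardware.
#include <cstdint>

struct VendorBvhNode {
    float    boundsMin[3];   // child bounding box
    float    boundsMax[3];
    uint32_t leftChild;      // node index, or triangle offset if leaf
    uint32_t rightChild;
    uint32_t flags;          // leaf/internal, primitive count, etc.
};

// With the layout known, refitting a parent after its children changed is trivial:
inline void RefitNode(VendorBvhNode& parent,
                      const VendorBvhNode& left,
                      const VendorBvhNode& right)
{
    for (int i = 0; i < 3; ++i) {
        parent.boundsMin[i] = left.boundsMin[i] < right.boundsMin[i]
                            ? left.boundsMin[i] : right.boundsMin[i];
        parent.boundsMax[i] = left.boundsMax[i] > right.boundsMax[i]
                            ? left.boundsMax[i] : right.boundsMax[i];
    }
}
```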
The freedom we can have from flexible compute is not an argument or excuse to justify the blackbox. It's the opposite: the progress we got from programmable GPUs clearly shows that this flexibility is needed, and is even more important than raw performance in the long run.
Imagine there had never been any cuda, compute shaders, or ML acceleration, but NV had a blackboxed DLSS running on the same HW, just not accessible to anyone else.
There would be no Nanite or ChatGPT, just DLSS. (And NV would not be as successful as they are either.)
But right now we don't know anything about how diverse or complex this is. We don't know if improvements are currently possible or impractical for every IHV. There is no comment on the topic from IHVs or API designers.
The only thing we know for sure is that there are hard API limitations, preventing any traceable geometry with gradual level of detail. So the current status quo is not acceptable in the long run.
Thus any defense of this status quo is against the progress of raytracing.
> Not sure I agree with this. Having access to the model that is trained and the data that is used for training are 2 completely different things. Trained algorithms are always by definition blackboxed. You have no access to the original data, only the final model. You cannot manually manipulate the weights on a model to get a better response; only retraining the model will get a different model that works.

Yeah, but for my argument this does not matter. What i mean is: Cuda allowed people to get massive speedups for all kinds of applications which can be parallelized. ML coincidentally was one of them, and only after Cuda we saw the rise of deep neural networks. But Cuda was not built or tailored towards ML applications. Specialized ML HW came only after this demand had shown up. If there was no Cuda, rapid progress on deep learning would not have happened, and so there would be no ChatGPT either.
> I suspect you're underestimating the complexity involved in exposing the internal BVH structure.

No. I even assume the worst case, which is that NV uses treelets for compression and optimization. Treelets would mean that altering the BVH can't be done at node granularity, but only on larger branches of the tree.
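For context, a purely speculative sketch of what such a treelet layout might look like (the actual encoding, if NV uses one at all, is not public): because child boxes are quantized against a shared local frame, touching one node means re-encoding the whole group rather than patching a single node.

```cpp
// Purely illustrative guess at a compressed treelet layout - not any real
// vendor format. Child AABBs are quantized relative to a shared treelet
// origin/scale, so changing one node's bounds forces re-encoding the whole
// treelet, which is why edits only work at branch granularity.
#include <cstdint>

struct QuantizedChild {
    uint8_t  lo[3];        // AABB min, quantized against the treelet frame
    uint8_t  hi[3];        // AABB max
    uint16_t flags;        // leaf / internal
    uint32_t childIndex;   // next treelet or triangle range
};

struct Treelet {
    float          origin[3];   // shared decompression frame for all children
    float          scale[3];
    QuantizedChild children[8]; // e.g. an 8-wide node group
};
```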
> But you're complaining that today's solutions are bad without offering an alternative that is possible "today" with the same available resources.

I just did. Releasing specifications has no cost on resources, performance, money, or anything.
> Yeah, but for my argument this does not matter. What i mean is: Cuda allowed people to get massive speedups for all kinds of applications which can be parallelized. ML coincidentally was one of them, and only after Cuda we saw the rise of deep neural networks. But Cuda was not built or tailored towards ML applications. Specialized ML HW came only after this demand had shown up. If there was no Cuda, rapid progress on deep learning would not have happened, and so there would be no ChatGPT either.

No worries. It was actually the opposite. Data scientists were struggling with throughput, so they leveraged pixel shaders to do math. Nvidia saw this and built CUDA to make it simpler. Then everyone went straight to CUDA. They changed the hardware several times on Nvidia machines to support faster ML processing before they eventually moved to tensor cores, which only accelerate neural network type models. We still use the standard compute cores for everything else like regression etc.
But correct me if i'm wrong about this Cuda - ML relation. I do not really know much about the history of ML.
All i want is that every IHV specifies its custom format, so i can write vendor-specific code implementing the same required functionality for all of them.
Releasing specifications has no cost on resources, performance, money, or anything.
> I don’t disagree about your issues with DXR, but it’s just we may need to start with fixed acceleration until hardware is fast enough to break out of it.

I don't think it's much more fixed for NVIDIA than it is for AMD; we just know what the underlying programmable hardware looks like for AMD.
> I don't think it's much more fixed for NVIDIA than it is for AMD; we just know what the underlying programmable hardware looks like for AMD.

Yea, at the end of the day both can launch Ray Intersection hardware from the graphics and compute pipeline, and DXR 1.1 allows an inline shader invocation from both graphics and compute. Overall, I consider this pretty good for starting; I mean, DXR will likely look very different 10 years from now. I see this as being flexible, so I would agree with you, it's not entirely stuck in the graphics pipeline, which is a good thing.
Even when it's a couple of times as fast as AMD's, AABB/tri intersection blocks are still quite slow in the grand scheme of things ... it doesn't make much sense to put them entirely in a fixed function pipeline. Putting some programmable hardware around them is relatively cheap.
Is this about "software-based RT vs hw-accelerated fixed function blocks"?
> I don’t disagree about your issues with DXR, but it’s just we may need to start with fixed acceleration until hardware is fast enough to break out of it.

So i'm still the compute warrior, who hates any fixed function units and wishes to replace them with programmable units?
> So i'm still the compute warrior, who hates any fixed function units and wishes to replace them with programmable units?

Lol. I hear you, but perhaps I don't understand where the issue is.
No! I'm still misunderstood, sigh.
One more time: Fixed function acceleration is fine. No need to change it, or anything related to the HW. There is not even a need to change the API, but ideally they add query functions to get specifics about the BVH data structure of the installed GPU.
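Something along these lines, purely hypothetical and not part of DXR or any real driver API (all names are invented), is all that's being asked for:

```cpp
// Imaginary API sketch - nothing like this exists in DXR today. The idea:
// the driver reports how its BVH nodes are laid out, so applications can
// read or patch the acceleration structure themselves while still feeding
// the fixed-function traversal hardware.
#include <cstdint>

struct BvhLayoutInfo {
    uint32_t nodeSizeBytes;        // stride of one internal node
    uint32_t childrenPerNode;      // branching factor (2, 4, 8, ...)
    uint32_t childBoxOffsetBytes;  // where child AABBs sit inside a node
    uint32_t childPtrOffsetBytes;  // where child indices sit
    bool     boxesAreCompressed;   // edits may require re-encoding a block
};

// Hypothetical device query, in the spirit of CheckFeatureSupport:
// bool QueryAccelerationStructureLayout(Device* device, BvhLayoutInfo* out);
```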
I remember, initially i was not too excited about the fixed function and single threaded ray traversal of RTX. I would have expected something like packet traversal per workgroup.
And i have complained that NV has kind of stolen our task to make RT faster, so we can't do any meaningful research on this problem anymore on our side.
But that was years ago, shortly after DXR / RTX was announced. Back then i was not yet aware of what the REAL problem with DXR is.
I did not yet realize that DXR prevents progress with LOD, although i had already worked on that myself back then.
And i really should have known, because years before, it was me correcting people when they said 'RT in realtime does not work because building the BVH is too expensive'. I responded with 'Just build your BVH offline per model. At runtime you then only need to build the BVH over all models, not over all triangles. The cost is low, so realtime RT will come.'
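A minimal sketch of that two-level idea, as my own illustration rather than DXR code: the per-model hierarchies are baked once, and the per-frame work only touches instances, which is why the cost stays low - and also why the scheme breaks down once a model's own geometry must change with LOD, as described below.

```cpp
// Two-level scheme sketch: per-model BVHs are built offline once, and per
// frame only a small top level over the model instances is rebuilt, so
// runtime cost scales with instance count rather than triangle count.
#include <algorithm>
#include <vector>

struct Aabb { float lo[3], hi[3]; };

struct ModelBvh {              // built offline per model ("BLAS" in DXR terms)
    std::vector<Aabb> nodes;   // triangle-level hierarchy, baked once
};

struct Instance {
    const ModelBvh* model;     // shared static geometry hierarchy
    Aabb worldBounds;          // bounds of the transformed instance
};

// Degenerate "top level" for illustration: just the union of instance bounds.
// A real build would recursively partition the instances into a small tree,
// which is still cheap because there are few instances compared to triangles.
Aabb BuildTopLevelBounds(const std::vector<Instance>& instances) {
    Aabb root = { { 1e30f, 1e30f, 1e30f }, { -1e30f, -1e30f, -1e30f } };
    for (const Instance& inst : instances)
        for (int i = 0; i < 3; ++i) {
            root.lo[i] = std::min(root.lo[i], inst.worldBounds.lo[i]);
            root.hi[i] = std::max(root.hi[i], inst.worldBounds.hi[i]);
        }
    return root;
}
```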
Maybe it was only after Nanite was shown that i realized that this static BLAS / dynamic TLAS solution i initially liked breaks with LOD.
Karis knew it. His early Twitter response was 'and how to do LOD with DXR?', and he was not happy. I understood what he really meant only much later, and that i had the exact same problem.
So eventually, when the API was decided, there were not enough people (or nobody at all) around who saw the flaw.
Maybe the failure was not shortsighted intent, but indeed just human error. Maybe, if they had known, MS would have defined data structure specifications or even a uniform standard, as they usually do.
Idk. But when i presently talk about 'more flexibility for RT!', i really mean the flexibility we would get from access to data structures.
I do not mean the flexibility we would get from programming traversal or intersections. It's not needed. HW accelerate the fuck out of that - make it a single cycle op - i'm fine with it.
I really think this acceleration structure data is ours, not theirs. It's my data - i want access to it. So i'm not the compute warrior, but maybe the Edward Snowden of RT, hihihi ;D
In my understanding, the fixed-function hardware for traversal and intersection should also require a very specific data structure?
If so, is modification of the structure in ways that the silicon isn't meant to handle going to result in a crash?
In some ways the two are unlikely to be decoupled, I think.