NVidia Ada Speculation, Rumours and Discussion

I gotta admit... as a consumer.. listening to developers talk sometimes is depressing. They're never satisfied :D

Half joking aside though, I VASTLY prefer playing games on PC: the hardware, the performance, the openness... It's the best. Sometimes though, when you visit forums like this, or Twitter (again, as a know-nothing consumer), you get an unexpected healthy dose of reality... because among themselves, developers tend to speak very frankly about things, and sometimes unfiltered. And that's the thing to remember. A lot of the time, when developers talk about the bottlenecks they face, it's not that they're really trying to "hate" on anything. They're simply speaking about the realities they face, as they see them, and they're in a position to speak on it. Often it's just opinion, sometimes it's fact. And that's the thing: nothing would ever get any better if developers just accepted what was given to them. The big powerhouse developers face these issues head on, pushing state-of-the-art technology in the direction they think it should head... and that often takes years upon years to eventually become a reality, if it ever does at all.

So all they can do is push, and call for the changes/improvements they'd like to see. The reality is that closed-box consoles have certain inherent advantages over platforms with more open ecosystems and abstracted hardware... and vice versa. Anyway, I'm blabbing on, but I just think it's important to remember that not everything is an attack on a device, an operating system, an API, or a piece of packaged software. Some people want different things, and everyone has different goals and ambitions. Ultimately the market will decide what works and what doesn't... and things will evolve from there.

Now that consoles and PCs are very similar architecturally, or at least more so than ever before... both sides will have influence... and hopefully that leads to better products on all sides. "The best" or "better implementation" can be subjective... so let's just see what happens. :smile2:
 
Depends on which of the things he’s saying that you’re referring to. Literally nobody thinks that flexibility is bad. That is not the problem with what he’s been saying.

It’s unsubstantiated stuff like “at some point, the console becomes faster no matter what's the difference in HW power.” There is no indication whatsoever that flexibility of current consoles will ever compensate for their significant deficit of raw RT performance. It would be great if this was demonstrably true but he’s making such claims with no theory or evidence to back it up. It’s all hand waving.

Indeed. Anyway, it's quite obvious the PC has a (very) strong lead in ray tracing (on both the AMD and NV sides of things), as well as in AI/ML reconstruction tech. That's aside from the much more capable conventional rendering going on, which is still going to play a huge role in the total picture.

I gotta admit... as a consumer.. listening to developers talk sometimes is depressing. They're never satisfied :D [...]

Good post. Explains a lot. Things have always been like that: the PC has always been this platform with driver overheads, different IHVs, varied Windows configurations, low-end to high-end systems, etc. Consoles have always had the advantage of closer-to-the-metal APIs, fixed hardware targets, etc. What I must add is that things have gotten better over time on PC in this regard, much better. APIs have improved, MS/Windows is in a different state today than 20 years ago (Xbox<>PC integration), scaling, etc.

Not saying he's wrong in every sense, but he's not right in every sense either.
 
However, the trade-off is that peak achievable RT performance on RDNA is much lower than with the black-box traversal found in Turing/Ampere.
It appears it's not that simple:

GPSnoopy/RayTracingInVulkan: Implementation of Peter Shirley's Ray Tracing In One Weekend book using Vulkan and NVIDIA's RTX extension. (github.com)

as the test-results section of the page, "RayTracer Release 6 (NVIDIA drivers 461.40, AMD drivers 21.1.1)", shows the 6900 XT varying between 35% and 124% of the performance of the RTX 3090. The "peak" is higher on RDNA 2.

In this note on performance:

I suspect the RTX 2000 series RT cores to implement ray-AABB collision detection using reduced float precision. Early in the development, when trying to get the sphere procedural rendering to work, reporting an intersection every time the rint shader is invoked allowed to visualise the AABB of each procedural instance. The rendering of the bounding volume had many artifacts around the boxes edges, typical of reduced precision.
there's a reference to reduced precision. The note is old:

Adding random thoughts and observations · GPSnoopy/RayTracingInVulkan@11c2fb3 (github.com)

predating Ampere, which is why it only references Turing. There's every reason to expect Ampere would also be using reduced precision. Is reduced precision being used on RDNA 2? If the code changes to solve reduced precision traversal on NVidia are removed, what's the effect (quality and performance) on RDNA 2?
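For context on how reduced precision could show up around box edges, here's a minimal C++ sketch of the standard ray-AABB "slab" test, with an illustrative epsilon padding of the kind a renderer might add to compensate. The struct names and the padding constant are assumptions for illustration, not taken from the RayTracingInVulkan code or from any vendor's hardware.

// Minimal sketch of a ray-AABB "slab" test, illustrating where reduced
// precision bites: if the per-axis tmin/tmax are computed at lower precision,
// rays that graze box edges can miss (or falsely hit), producing the edge
// artifacts described in the note. Padding the box by a small epsilon is a
// common workaround; the constant here is purely illustrative.
#include <algorithm>

struct Ray  { float ox, oy, oz; float invDx, invDy, invDz; };   // origin + 1/direction
struct AABB { float minX, minY, minZ, maxX, maxY, maxZ; };

bool hitAABB(const Ray& r, AABB b, float tMin, float tMax)
{
    const float pad = 1e-4f;              // assumed compensation for low-precision tests
    b.minX -= pad; b.minY -= pad; b.minZ -= pad;
    b.maxX += pad; b.maxY += pad; b.maxZ += pad;

    // One slab per axis: intersect the ray with the two parallel planes.
    float t0 = (b.minX - r.ox) * r.invDx;
    float t1 = (b.maxX - r.ox) * r.invDx;
    tMin = std::max(tMin, std::min(t0, t1));
    tMax = std::min(tMax, std::max(t0, t1));

    t0 = (b.minY - r.oy) * r.invDy;
    t1 = (b.maxY - r.oy) * r.invDy;
    tMin = std::max(tMin, std::min(t0, t1));
    tMax = std::min(tMax, std::max(t0, t1));

    t0 = (b.minZ - r.oz) * r.invDz;
    t1 = (b.maxZ - r.oz) * r.invDz;
    tMin = std::max(tMin, std::min(t0, t1));
    tMax = std::min(tMax, std::max(t0, t1));

    return tMin <= tMax;                  // overlap of all three slabs => hit
}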

A theory about RDNA 2 performance relates to triangle intersection testing throughput:

"Compared to Turing, Ampere achieves 2x RT performance only when using ray-triangle intersection (as expected as per NVIDIA Ampere whitepaper), otherwise performance per RT core is the same. [...] The triangle-based geometry scenes highlight how efficient Ampere RT cores are in handling triangle-ray intersections; unsurprisingly as these scenes are more representative of what video games would do in practice."

So what's interesting is that all architectures do hardware-accelerated triangle intersection and it appears to have a dominant role in performance.

Does that mean that accelerated BVH traversal is not the problem on RDNA 2?

I find it quite difficult to reason about performance factors in this particular application.
 
If the compact Nanite geometry representation is compatible with RT, then it's probably worth adding support for it to the BVH builder and the problem is solved: no need to store two sets of meshes, memory footprints stay small, and continuous LODs can be converted to a few discrete LODs for RT on the fly.
That's probably not a good idea. Nanite is very good, but still limited. It likely has issues with changes in topology, both for geometry and UVs (I failed to test this by importing custom models).
We don't know about a general LOD solution yet, so we cannot yet specify an API, or something like 'Alembic with LOD'. I'm not so optimistic we'll ever have such a general LOD solution at all.
Really, all we can and should do is specify the data structures various GPUs use, and leave it to developers whether they want to go down this rabbit hole or not. Maybe, after many years, some standards could form from similar practices emerging across the whole industry.

I expect there will be a lot of kitbashing in games with Nanite, and I have a hard time imagining good RT perf with that even if BVH patching were available and free performance-wise, so the proxy workaround seems to be the only viable solution at the moment. Automatic geometry merging and clustering would also help a lot, and hopefully they will be able to implement it.
Yeah, Nanite is just the only example I can use to make an argument.
For me it's quite different: my LOD and geometry stuff emerged from the GI work, so I did it the other way around from Epic. It's important for me to remove all intersections of geometry, so no GI samples are created on surfaces which end up inside a wall, for example.
Thus I don't have this overlapping-geometry-from-kitbashing problem, but I also don't have the big compression advantage of instancing a 'small' set of models. Compression is still an open problem for me.
My idea to convert the BVH for RT from an existing BVH for LOD also probably differs from Epic's plans to make Nanite compatible with RT.
However: just specify the data structures as needed by HW, and everybody can do whatever they want. No need to discuss those applications in detail.

Characters will stay in this middle age for a while with Nanite too, but what's the issue with discrete LODs for static geometry?
Actually I'm fine with discrete LODs for characters or other small objects.
The problem comes with large models - terrain, architecture, etc. In the past, those things were often low poly without LOD at all. If we want high detail on them LOD is needed, but the models are too large and hard to subdivide into smaller models so we can use discrete LODs.
It's pretty scene dependent, and also related to the manual work needed to make discrete LODs work for whole scenes.
With RT, you can change these LODs when triangles are subpixel so that nobody would even notice that these LODs exist.
It takes a lot of BVH memory to have all models at one-triangle-per-pixel level. I've never seen that being possible. And BVH generation cost is of the same order as its space complexity.
So I don't believe you on that. But if so, again, with larger models the closer and more distant sections of those models will have a different triangle-per-pixel ratio.
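To put a rough number on that memory concern, here's a back-of-envelope C++ sketch. The overdraw factor and the bytes-per-triangle figure are assumptions for illustration only, not measured numbers for any particular BVH builder.

// Rough back-of-envelope estimate of BVH + triangle memory for "one triangle
// per pixel" detail. The ~64 bytes/triangle figure (node share + vertex data)
// and the 4x overdraw factor are assumptions for illustration.
#include <cstdio>

int main()
{
    const double pixels         = 3840.0 * 2160.0;   // 4K frame
    const double overdrawFactor = 4.0;                // off-screen + occluded geometry kept in the BVH (assumed)
    const double bytesPerTri    = 64.0;               // BVH node share + compressed triangle (assumed)

    const double triangles = pixels * overdrawFactor;
    const double gib       = triangles * bytesPerTri / (1024.0 * 1024.0 * 1024.0);

    std::printf("~%.0f M triangles -> ~%.1f GiB of BVH + geometry\n",
                triangles / 1e6, gib);                // roughly 33 M tris, ~2 GiB
    return 0;
}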
 
Turing and Ampere can do RT in the exact same fashion as RDNA2. It would lead to some halving of performance, but if somebody thinks this is good ...
Is this exposed in some API, or are you just assuming they can?
 
Turing and Ampere can do RT in the exact same fashion as RDNA2. It would lead to some halving of performance, but if somebody thinks this is good ...

There’s no reason to think that Ampere can pass control to the shader during BVH traversal. If this was true it would be exposed in Optix or some Vulkan extension but it’s not. Patents make no mention of it.
 
It appears it's not that simple: [...] I find it quite difficult to reason about performance factors in this particular application.

We seem to have understood the author's speculation differently. RDNA 2 pulls ahead in simple scenes with procedural geometry, meaning that intersection testing of leaf-node primitives occurs on the shader core and not the RT hardware on both architectures. Maybe RDNA 2 is simply better at this due to leaf data being more readily available to the WGP.

The Riftbreaker could be a good benchmark to illustrate this. It has a lot of alpha-tested geometry that can't be handled by the triangle intersection unit and has to be processed on the shader core. There's a demo out, but I can't seem to find any benchmarks on the web.

The 6900 XT results show the RDNA 2 architecture performing surprisingly well in procedural geometry scenes. Is it because the RDNA2 BVH-ray intersections are done using the generic computing units (and there are plenty of those), whereas Ampere is bottlenecked by its small number of RT cores in these simple scenes? Or is RDNA2 Infinity Cache really shining here? The triangle-based geometry scenes highlight how efficient Ampere RT cores are in handling triangle-ray intersections; unsurprisingly as these scenes are more representative of what video games would do in practice.
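As an illustration of what "intersection testing on the shader core" means for procedural geometry, here's a minimal C++ sketch of the ray-sphere test an intersection shader evaluates (the kind of procedural spheres this benchmark renders). The names and structs are illustrative, not taken from the RayTracingInVulkan shaders; the point is that this math runs on the generic compute units on both vendors, not on the fixed-function triangle testers.

// Sketch of the ray-sphere test a procedural-geometry intersection shader
// would run on the shader core. Returns the ray parameter t of the nearest
// hit inside [tMin, tMax], if any.
#include <cmath>
#include <optional>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

std::optional<float> intersectSphere(const Vec3& origin, const Vec3& dir,
                                     const Vec3& center, float radius,
                                     float tMin, float tMax)
{
    const Vec3  oc   = sub(origin, center);
    const float a    = dot(dir, dir);
    const float b    = dot(oc, dir);                  // half-b form of the quadratic
    const float c    = dot(oc, oc) - radius * radius;
    const float disc = b * b - a * c;
    if (disc < 0.0f) return std::nullopt;             // ray misses the sphere

    const float sq = std::sqrt(disc);
    const float t1 = (-b - sq) / a;                   // nearer root first
    const float t2 = (-b + sq) / a;
    if (t1 > tMin && t1 < tMax) return t1;
    if (t2 > tMin && t2 < tMax) return t2;
    return std::nullopt;
}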
 
There’s no reason to think that Ampere can pass control to the shader during BVH traversal. If this was true it would be exposed in Optix or some Vulkan extension but it’s not. Patents make no mention of it.

Correct, BVH traversal on NV is likely to be a state machine rather than a shader program. If we want to trigger a different traversal routine (of which there's a fixed number in this case), then we apply "state changes" to this state machine ...

If traversal were handled by a program, then we could construct a virtually infinite number of routines ...
 
I'm pretty sure NV could implement the stochastic LOD algorithm in future HW easily, although the mutations of those traversal routines would double their number.
But I'm not sure a robust general solution to its issues (the space between two discrete LODs causing self-intersections, etc.) is guaranteed. I did not fully understand Intel's paper in this regard, but it felt a bit fishy.
 
We seem to have understood the author's speculation differently. RDNA 2 pulls ahead in simple scenes with procedural geometry, meaning that intersection testing of leaf-node primitives occurs on the shader core and not the RT hardware on both architectures. Maybe RDNA 2 is simply better at this due to leaf data being more readily available to the WGP.

The Riftbreaker could be a good benchmark to illustrate this. It has a lot of alpha-tested geometry that can't be handled by the triangle intersection unit and has to be processed on the shader core. There's a demo out, but I can't seem to find any benchmarks on the web.
With procedural geo on Ampere and Turing, it should still get traversal acceleration - curious. I ought to try out some benches there.
 
Anyhit shader already does that in DXR 1.0.

No it doesn’t. The any-hit shader is called after a triangle hit in a leaf node, not during BVH traversal. Inside of the any-hit shader you can choose to resume/recast the ray. But this isn’t the same as dynamic traversal through the BVH.

With procedural geo on Ampere and Turing, it should still get traversal acceleration - curious. I ought to try out some benches there.

You still get BVH traversal acceleration with procedural geo. But the triangle intersection hardware doesn’t help.
 
No it doesn’t. The any-hit shader is called after a triangle hit in the leaf node, not during BVH traversal.
I don't see how this is different. If they can call it on a triangle hit, then they should be able to call it on an AABB hit as well. The h/w is programmable after all; the fact that it's a black box doesn't mean that it's not.
 
I don't see how this is different. If they can call it on a triangle hit, then they should be able to call it on an AABB hit as well. The h/w is programmable after all; the fact that it's a black box doesn't mean that it's not.

The data structures for BVH nodes and triangles are different and they’re processed by different components of the RT unit with different data paths. Of course this assumes the actual implementation is anything like the patent description. Just because the RT unit can send triangles to the shaders doesn’t mean it can exchange BVH node data with the shader too.

Is it theoretically possible? Yes. Is there any evidence of this anywhere? Nope.

The TTU includes dedicated hardware to determine whether a ray intersects bounding volumes and dedicated hardware to determine whether a ray intersects primitives of the tree data structure. In some embodiments, the TTU may perform a depth-first traversal of a bounding volume hierarchy using a short stack traversal with intersection testing of supported leaf node primitives and mid-traversal return of alpha primitives and unsupported leaf node primitives

Notice the return type is always a geometry primitive, not the encapsulating BVH node.
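To make the distinction concrete, here's a schematic C++ sketch of a fixed-function-style traversal loop under the assumptions above (short-stack depth-first traversal, shader callbacks only at leaf primitives). The names and structure are illustrative, not the actual TTU design; it just shows that the any-hit callback fires after a leaf triangle test, while the descent through internal nodes never returns control to the shader.

// Schematic sketch of a fixed-function-style BVH traversal loop (names and
// structure are illustrative, not the actual TTU design).
#include <stack>
#include <cstdint>

struct Node { bool isLeaf; uint32_t left, right, primitive; bool alphaTested; };
struct Hit  { float t; uint32_t primitive; };

// Placeholder fixed-function tests and the programmable any-hit callback.
bool rayVsBox(const Node&)            { /* fixed-function AABB test */ return true; }
bool rayVsTriangle(const Node&, Hit&) { /* fixed-function triangle test */ return false; }
bool anyHitShader(const Hit&)         { /* programmable alpha test */ return true; }

bool traverse(const Node* nodes, uint32_t root, Hit& closest)
{
    bool found = false;
    std::stack<uint32_t> stack;           // real hardware uses a short stack
    stack.push(root);

    while (!stack.empty())
    {
        const Node& n = nodes[stack.top()]; stack.pop();
        if (!rayVsBox(n)) continue;       // internal descent: no shader involvement here

        if (!n.isLeaf)                    // push children; visit order is fixed by the unit
        {
            stack.push(n.left);
            stack.push(n.right);
            continue;
        }

        Hit h{};
        if (!rayVsTriangle(n, h)) continue;

        // Only at this point does control return to a shader, and only for
        // non-opaque (alpha-tested) primitives; the traversal path so far was
        // decided entirely by the fixed-function unit.
        if (n.alphaTested && !anyHitShader(h)) continue;

        if (!found || h.t < closest.t) { closest = h; found = true; }
    }
    return found;
}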
 
A 3080 is 36% faster than a 6800 XT at 4K Ultra RT.


It seems they tackled the alpha transparency problem by increasing geometry detail and relying less on alpha. This should work in Ampere’s favor given the doubled triangle intersection throughput.

Another unique challenge our engineers had to solve while implementing raytracing techniques into The Riftbreaker was also connected to the vegetation system. Textures used by foliage are mostly alpha-tested and it is not uncommon for a texture in our game to have large areas that are fully transparent. We make extensive use of such textures, so it was necessary for us to find a solution to this and add support for alpha-testing in our AnyHit shader.

What we learned in this process is that alpha-tested materials are much more computationally expensive for raytraced shadows than for traditional shadow mapping and they are best avoided if possible. We optimize our content by limiting the surface area of all transparent objects to reduce the number of ray hits on transparent surfaces. The Riftbreaker’s camera view is isometric, so the number of polygons visible at once is naturally limited and we can easily increase the polycount of most objects without affecting GPU performance. In some cases, it is actually more efficient for us to increase the polycount of an object if we can transform its material from alpha-tested to a fully opaque one.

https://www.gamasutra.com/blogs/Pio...shadows_implementation_in_The_Riftbreaker.php
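For reference, the per-hit work they're describing boils down to something like the following C++ sketch: every candidate hit on an alpha-tested triangle has to sample the texture on the shader core and decide whether to accept or ignore the hit, whereas fully opaque triangles are resolved entirely by the fixed-function tester. Names and the threshold are illustrative assumptions, not code from The Riftbreaker.

// Sketch of the decision an alpha-test any-hit shader makes per candidate hit.
#include <vector>
#include <cstdint>
#include <cstddef>

struct Texture
{
    int width = 0, height = 0;
    std::vector<uint8_t> alpha;                        // one 8-bit alpha value per texel

    uint8_t sample(float u, float v) const             // nearest-neighbour lookup, u/v in [0,1]
    {
        const int x = static_cast<int>(u * (width  - 1) + 0.5f);
        const int y = static_cast<int>(v * (height - 1) + 0.5f);
        return alpha[static_cast<std::size_t>(y) * width + x];
    }
};

// true  -> accept the hit (e.g. the shadow ray is blocked)
// false -> ignore it and let traversal continue behind the transparent texel
bool anyHitAlphaTest(const Texture& tex, float u, float v, uint8_t threshold = 128)
{
    return tex.sample(u, v) >= threshold;
}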
 
We seem to have understood the author's speculation differently. RDNA 2 pulls ahead in simple scenes with procedural geometry, meaning that intersection testing of leaf-node primitives occurs on the shader core and not the RT hardware on both architectures. Maybe RDNA 2 is simply better at this due to leaf data being more readily available to the WGP.
I don't understand what you mean when you say we view the speculation differently?
 
I don't understand what you mean when you say we view the speculation differently?

I think you're saying that Ampere pulls ahead with triangle-based geometry because of its advantage there, and not primarily due to being faster at traversal. There's probably truth to that, but the other way to interpret the test results is that RDNA 2 is faster at intersecting procedural geometry, which negates Ampere's traversal advantage. We would need more controlled experiments to be sure.
 
So, given the state of DXR, what can NVidia ('cause we are on an NVidia topic, but I'm no fanboy) do to accelerate RT even more, outside of brute-forcing it with more RT cores / increased frequency?
 