AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

Thread Status:
Not open for further replies.
  1. Qesa

    Newcomer

    Joined:
    Feb 23, 2020
    Messages:
    27
    Likes Received:
    46
    4*52 CUs*1.825 GHz gives me 379.6 billion intersections/second. I suspect it's a theoretical figure assuming full occupancy, much like TFLOPS. Nvidia's "10 gigarays" was actual throughput against some actual model -- ultimately the numbers aren't at all comparable without knowing both the actual achieved occupancy and the number of intersections needed to compute a ray.
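    For what it's worth, a quick back-of-the-envelope check of that figure, assuming one intersection engine per TMU and 4 TMUs per CU (the multiplier discussed later in the thread):

    ```cpp
    // Sketch of the theoretical peak quoted above: 4 intersection engines per CU
    // (an assumption: one per TMU), 52 CUs, 1.825 GHz, one box test per engine per cycle.
    #include <cstdio>

    int main() {
        const double engines_per_cu = 4.0;
        const double cus            = 52.0;
        const double clock_ghz      = 1.825;
        std::printf("peak: %.1f billion intersections/s\n",
                    engines_per_cu * cus * clock_ghz);  // ~379.6
    }
    ```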
     
  2. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    350
    Likes Received:
    391
    With consoles, you can create a custom BVH to feed to the GPU. It depends on how 'deep' the BVH tree is: it could easily be 380 billion rays/second in the case of a full-screen quad contained in a single BVH node, or 38 billion rays/second with a 10-node-deep BVH.

    As I explained previously, a BVH's layout structure is a 'tree', so some 'branches' may require up to 10 node traversals to reach the leaf node, while others may need as few as 5 traversals in total. How 'deep' a BVH can be is variable in practice.

    On consoles it's possible to design your own specific BVH optimized for your content. You could have a 1-node-deep BVH, but you will not get very many rays when the geometry being tested against represents a very small area of the BVH. You could have a 20-node-deep BVH, which gives a very 'tight' BVH bound with a high chance of a hit, but it ends up needing many traversals, so you still won't get many rays this way either. Developers will have to find the ideal balance for their content, trading off between how 'deep' the structure is (to ensure a high enough hit rate) and reducing the 'tightness' (doing more intersection tests to achieve a higher ray count). Every TMU comes with its own intersection engine, so it's not a surprise to see the number 4 as a multiplier, since every CU comes with exactly 4 TMUs.
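    A purely illustrative sketch of the kind of stack-based traversal being described (the node layout and hit tests are placeholders, not AMD's hardware format); the point is that the cost per ray scales with how many nodes the ray visits, which is exactly what the depth/tightness trade-off controls:

    ```cpp
    // Purely illustrative stack-based BVH traversal (node layout and hit tests
    // are placeholders, not AMD's hardware format). The per-ray cost scales with
    // the number of nodes visited, which is what tree depth/tightness controls.
    #include <cstdint>
    #include <vector>

    struct AABB { float min[3], max[3]; };

    struct Node {
        AABB    bounds;
        int32_t firstChild;  // first child node index, or first triangle index if leaf
        int32_t count;       // number of children, or number of triangles if leaf
        bool    isLeaf;
    };

    // Returns how many nodes one ray visited (a proxy for its traversal cost).
    int traverse(const std::vector<Node>& nodes,
                 bool (*rayHitsBox)(const AABB&),
                 void (*rayHitsTriangle)(int32_t))
    {
        int visited = 0;
        std::vector<int32_t> stack{0};            // start at the root node
        while (!stack.empty()) {
            const Node& n = nodes[stack.back()];
            stack.pop_back();
            ++visited;
            if (!rayHitsBox(n.bounds)) continue;  // miss: prune this whole subtree
            if (n.isLeaf) {
                for (int32_t i = 0; i < n.count; ++i)
                    rayHitsTriangle(n.firstChild + i);   // leaf: test triangles
            } else {
                for (int32_t i = 0; i < n.count; ++i)
                    stack.push_back(n.firstChild + i);   // inner node: descend
            }
        }
        return visited;
    }
    ```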

    With other vendors giving a 'hard' ray-count figure for their hardware, it could imply that they have a fixed BVH structure.
     
  3. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,584
    Likes Received:
    4,310
    What are the advantages and disadvantages of a fixed BVH structure?
     
  4. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,405
    Likes Received:
    1,941
    Location:
    msk.ru/spb.ru
    You can (and actually have to) design your own BVH for your content everywhere, not just "on consoles".
     
  5. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    350
    Likes Received:
    391
    By fixing the BVH depth in hardware, you can guarantee more predictable performance by paying a constant traversal cost per ray, but this means the BVH layout can't be customized for the developer's content, so they have to rely on the driver to generate the BVH for them.

    By offering a customizable BVH to developers, performance becomes highly variable, depending on which parts of the sub-tree are traversed and how often. How 'shallow' or 'deep' a sub-tree runs lets the hardware scale from higher to lower performance.
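    A toy cost model of that trade-off (every number below is invented purely for illustration): a fixed-depth tree pays the same traversal cost for every ray, while a custom tree pays a scene-dependent average:

    ```cpp
    // Toy cost model for the trade-off above; every number here is invented
    // purely for illustration, not taken from any hardware.
    #include <cstdio>

    int main() {
        const double fixedDepth = 10.0;  // fixed tree: every ray visits 10 nodes
        // Hypothetical custom tree: 70% of rays hit shallow geometry (5 levels),
        // 30% land in a deep, detailed region (16 levels).
        const double customAvg = 0.7 * 5.0 + 0.3 * 16.0;  // 8.3 nodes/ray on average
        std::printf("fixed: %.1f nodes/ray, custom: %.1f nodes/ray (but variable)\n",
                    fixedDepth, customAvg);
    }
    ```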

    Actually, that's incorrect according to a joint presentation by Microsoft and Nvidia on page 8 of the slides. The geometry format of the acceleration structure is described as 'opaque' with the "layout determined by the driver and hardware".

    On console APIs the programmer can explicitly define their BVH layout for the GPU to use.
     
  6. Dictator

    Regular Newcomer

    Joined:
    Feb 11, 2011
    Messages:
    462
    Likes Received:
    2,706
    One thing is that the cost of ray-triangle intersection in HW is not going to be the same as the traversal cost. It will be slower, so the ray count there is not just a matter of counting traversals/intersections and levels and doing the multiplication. I asked MS; they told me the 380 billion number is AABB bounding traversal tests and that triangle intersections are more expensive (they did not say by how much).
     
    Newguy, tinokun, w0lfram and 4 others like this.
  7. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    350
    Likes Received:
    391
    Now that you mention it, I remember another detail in the patent:

    This potentially means that testing ray-triangle intersections is 4x more expensive compared to the ray-box tests.
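    If that reading is right (one triangle test versus four box tests per engine per clock is an assumption, not a confirmed figure), the same arithmetic as before gives a much lower triangle-test peak:

    ```cpp
    // Same arithmetic as before, with the (unconfirmed) assumption that each
    // engine does 4 box tests or 1 triangle test per clock.
    #include <cstdio>

    int main() {
        const double peak_box_gs = 4.0 * 52.0 * 1.825;   // ~379.6 G box tests/s
        const double peak_tri_gs = peak_box_gs / 4.0;    // ~94.9 G triangle tests/s
        std::printf("%.1f G box tests/s vs %.1f G triangle tests/s\n",
                    peak_box_gs, peak_tri_gs);
    }
    ```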
     
    Lightman, tinokun, chris1515 and 3 others like this.
  8. eloyc

    Veteran Regular

    Joined:
    Jan 23, 2009
    Messages:
    2,474
    Likes Received:
    1,632
    Stupid question, maybe, but when are the new GPUs supposed to be ready? What hardware is running the ray tracing tech demo? I'm a bit worried about RTRT in comparison to Nvidia's solution, because it all seems a bit obscure. Why aren't they showing demos with current games which feature RTRT? It seems as if they're not confident enough in their own solution.
     
  9. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,231
    Likes Received:
    575
    ~Q3 or so.
    Depends on how it goes now lmao, with both customers and supply chains being lowkey on fire.
    Either console silicon or N21.
    Pretty sure AMD DXR drivers are WIP still.
     
    Entropy and eloyc like this.
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,309
    Likes Received:
    1,944
    Location:
    New York
    In either case, any quoted rays-per-second metric is pretty meaningless, as there is no standard measure for comparing theoretical performance across IHVs. We need a well-defined RT equivalent of a texture or pixel throughput test.

    I do like AMD’s intersections per second more than Nvidia’s gigarays in that it provides some insight into max hardware capabilities.

    Maybe 3DMark will do us a favor and whip up a few theoretical tests. Deep bounding box traversal with a few triangles in each leaf node. And triangle intersection with some trivial number of bounding boxes.
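    A sketch of the first of those cases, using the same illustrative node layout as earlier: a degenerate chain of nested AABBs N levels deep with triangles only in the deepest node, so a measurement would be dominated by box traversal rather than triangle tests:

    ```cpp
    // Sketch of a traversal-bound synthetic case: a chain of nested AABBs
    // `depth` levels deep, with triangles only in the deepest node, so the
    // measurement is dominated by box traversal. Node layout is illustrative.
    #include <vector>

    struct AABB { float min[3], max[3]; };
    struct Node { AABB bounds; int firstChild; int count; bool isLeaf; };

    std::vector<Node> makeTraversalStressBVH(int depth, int leafTriangles) {
        std::vector<Node> nodes(depth);
        for (int i = 0; i < depth; ++i) {
            const float r = float(depth - i);              // boxes shrink toward the leaf
            nodes[i].bounds = {{-r, -r, -r}, {r, r, r}};
            if (i + 1 < depth) {
                nodes[i].firstChild = i + 1;               // single child: pure traversal
                nodes[i].count      = 1;
                nodes[i].isLeaf     = false;
            } else {
                nodes[i].firstChild = 0;                   // triangles live only here
                nodes[i].count      = leafTriangles;
                nodes[i].isLeaf     = true;
            }
        }
        return nodes;
    }
    ```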
     
    no-X, pharma, chris1515 and 1 other person like this.
  11. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    617
    Likes Received:
    1,076
    What do you mean by "custom"?

    Pretty sure the bounding-box data format and precision can't be changed without losing HW acceleration capability, since the HW works with fixed formats (not arbitrary data).
    You can't change the bounding volume shapes either; this would break HW compatibility too.
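    To make that point concrete, a purely hypothetical example of what a "fixed format" node might look like; this is not AMD's actual node encoding, it just illustrates why the byte layout has to be owned by the driver/HW once fixed-function units parse it:

    ```cpp
    // Hypothetical "fixed format" node, purely to illustrate the point: a 4-wide
    // box node with fp32 bounds at fixed offsets, padded to two 64-byte cache
    // lines. This is NOT AMD's actual node encoding.
    #include <cstdint>

    struct HwBoxNode4 {
        float    minX[4], minY[4], minZ[4];  // bounds of up to 4 children (SoA)
        float    maxX[4], maxY[4], maxZ[4];
        uint32_t child[4];                   // packed child indices/flags
        uint32_t pad[4];                     // pad to a cache-line multiple
    };
    static_assert(sizeof(HwBoxNode4) == 128, "fixed node size the HW would expect");
    ```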

    What makes you think there are no empty-space optimizations in the driver's BVH builder?
    In reality it will take many more tests per ray, because there will be more than 2 triangles in the leaf node.

    You are describing offline BVH creation here; what about scenes with many dynamic objects that can be moved or destroyed, like in BFV?

    This needn't imply anything; that's just a real number for the simplest case: primary rays against a high-poly model.
     
    #2071 OlegSH, Mar 23, 2020
    Last edited: Mar 23, 2020
    pharma likes this.
  12. w0lfram

    Regular Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    254
    Likes Received:
    48

    BFV doesn't have any ray tracing on things that move... it's static ray tracing.
     
  13. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,584
    Likes Received:
    4,310
    My god, man! Please help yourself to some information before posting! Just think for a moment: what would be the difference between static ray tracing and simple cube map reflections?

    Reflective surfaces in Battlefield V reflect anything that moves, down to the tiny fire flares of rocket tails.
     
    #2073 DavidGraham, Mar 23, 2020
    Last edited: Mar 23, 2020
    Picao84, pharma, Rootax and 1 other person like this.
  14. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    350
    Likes Received:
    391
    You can change the tree structure itself, and it can be programmer-defined too, because the hardware's shader unit tracks the BVH traversal state itself.

    I never implied that the driver doesn't do empty space optimization.

    By being able to customize the tree, the developers can appropriately optimize how deep parts of the BVH will be for their scene representation.

    Sure, or developers can create a tighter BVH with the AABBs if they need to save on ray-triangle tests.

    I have yet to factor in the cost of rebuilding the BVH, regardless. There's also 'refitting' a BVH, which 'degrades' the quality of the acceleration structure by reducing the hit rate but is cheaper than rebuilding the whole BVH.
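    A minimal sketch of what such a refit pass looks like, assuming a flat node array stored parent-before-children (types and layout are illustrative, not any particular implementation): keep the topology, recompute only the bounds bottom-up.

    ```cpp
    // Minimal refit sketch: keep the tree topology, recompute only the bounds
    // bottom-up after geometry moves. Cheaper than a rebuild, but the boxes get
    // looser over time -- the hit-rate 'degradation' mentioned above.
    // Assumes nodes are stored parent-before-children so a reverse sweep works.
    #include <algorithm>
    #include <vector>

    struct AABB {
        float min[3], max[3];
        void grow(const AABB& o) {
            for (int a = 0; a < 3; ++a) {
                min[a] = std::min(min[a], o.min[a]);
                max[a] = std::max(max[a], o.max[a]);
            }
        }
    };

    struct Node {
        AABB bounds;
        int  firstChild;   // child node index, or first primitive index if leaf
        int  count;        // child count, or primitive count if leaf
        bool isLeaf;
    };

    void refit(std::vector<Node>& nodes, const std::vector<AABB>& primBounds) {
        const AABB empty = {{ 1e30f,  1e30f,  1e30f}, {-1e30f, -1e30f, -1e30f}};
        for (int i = int(nodes.size()) - 1; i >= 0; --i) {  // children before parents
            Node& n = nodes[i];
            n.bounds = empty;
            for (int c = 0; c < n.count; ++c)
                n.bounds.grow(n.isLeaf ? primBounds[n.firstChild + c]
                                       : nodes[n.firstChild + c].bounds);
        }
    }
    ```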

    Of course, I only raised this as a possibility, since not a lot of low-level detail has been revealed about their number.

    The number could also assume a perfect 100% hit-rate since it fits the area of the scene representation.
     
    #2074 Lurkmass, Mar 23, 2020
    Last edited by a moderator: Mar 23, 2020
  15. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,302
    Likes Received:
    1,564
    Do you know what the branching factor for AMD is? The patent mentioned 4. If this were flexible (which I doubt), we could trade lower tree depth against a higher branching factor.

    You mention the number of triangles in leaf nodes has no maximum. But what about the bounding box shape?
    Personally I would be interested in dividing geometry into small patches (call them meshlets if you want).
    For LOD I'd like to geomorph a number of such small patches to fit a lower-res parent patch, and finally drop the detailed patches and replace them with the parent. Kind of a hierarchical progressive mesh for the whole static scene.
    (In contrast to the proposed stochastic solution for LOD, this would prevent the need to teleport rays to a lower-detailed version of the scene, which causes divergent memory access.)
    To make this compatible with RT, it would be necessary to enlarge bounding volumes so they bound the whole morphing transition of mesh patches (sketched below). Technically that's surely possible on any HW.
    And secondly it would be necessary to pick a BVH node, declare it a leaf, assign triangles to it and zero the child node pointer(s). If this works it would be compatible with the BLAS and TLAS approach and cause no other extra work.
    So, do you think consoles could eventually allow this?
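    The "enlarged bounding volume" part is easy to sketch: bound each geomorphing patch by the union of its bounds at the two ends of the transition (assuming vertices interpolate linearly between those states), so the box stays valid for any intermediate morph state without a per-frame refit:

    ```cpp
    // Conservative bounds for a geomorphing patch: the union of its AABB at full
    // detail and at the parent LOD, valid for any intermediate morph state as
    // long as vertices interpolate linearly between the two.
    #include <algorithm>

    struct AABB { float min[3], max[3]; };

    AABB morphBounds(const AABB& atFullDetail, const AABB& atParentLOD) {
        AABB out;
        for (int a = 0; a < 3; ++a) {
            out.min[a] = std::min(atFullDetail.min[a], atParentLOD.min[a]);
            out.max[a] = std::max(atFullDetail.max[a], atParentLOD.max[a]);
        }
        return out;
    }
    ```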

    Finally, MS mentioned RT works with mesh shaders in their DX12 Complete presentation.
    But I have no idea how this could work.
    How would you know the bounds in advance?
    And does each ray (or small group of rays) end up processing a meshlet hundreds of times?
    That sounds like a terrible idea. More likely there is a way to store mesh shader results in memory for RT reuse, but then I still don't get how a full BVH rebuild could be avoided.
     
    pharma likes this.
  16. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    350
    Likes Received:
    391
    There don't appear to be any limits to how the tree can be structured! It's implied that both the depth and the number of child nodes are up to the programmer to decide. On consoles you can customize the tree structure of the BVH in nearly any way.
     
    JoeJ likes this.
  17. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,370
    Likes Received:
    317
    Location:
    San Francisco
    That makes no sense.

    Fixing the acceleration structure depth either implies there is no upper bound on the number of children an internal node can have, making any efficient software or hardware implementation a lost battle, or that you always have to traverse the maximally deep tree in its entirety, which would also kill performance and defeat the point of having an acceleration structure in the first place.

    I believe you might be confusing the acceleration structure branching factor with its depth.
    If you are implying a software implementation could change the branching factor for different parts of the tree... that's definitely a possibility, but not a particularly useful one. There are lots of reasons for this, ranging from making traversal unnecessarily complex (a variable-branching-factor traversal stack?!) with little to show for it, to quickly coming to terms with the fact that the only branching factors you want to use are the ones that fit nicely in one or more cache lines, if you don't want to frequently fetch data from memory that you'll never use.
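    The cache-line point in rough numbers, assuming fp32 AABBs plus a 4-byte child index per child (an illustrative per-child layout, not any specific hardware format):

    ```cpp
    // The cache-line argument in rough numbers, assuming fp32 AABBs plus a 4-byte
    // child index per child (illustrative layout, not a real HW format).
    #include <cstdio>
    #include <initializer_list>

    int main() {
        const int bytesPerChild = 6 * 4 + 4;  // min/max xyz as fp32 + child index
        const int cacheLine     = 64;
        for (int branching : {2, 4, 8, 16}) {
            const int nodeBytes = branching * bytesPerChild;
            const int lines     = (nodeBytes + cacheLine - 1) / cacheLine;
            std::printf("branching %2d -> %3d bytes -> %d cache line(s)\n",
                        branching, nodeBytes, lines);
        }
    }
    ```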

    Also, the idea that game developers are going to spend a considerable amount of time developing their own acceleration structure is for the birds. There is a vast literature on the subject and all the low-hanging fruit was picked a long time ago. There is no magical BVH (or any other acceleration structure) out there that will suddenly give you much better performance, unless one starts from zero and ignores the last two decades of publicly available research on the subject.
     
    Dale Cooper, Alexko, tinokun and 11 others like this.
  18. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,798
    Likes Received:
    6,984
    Wow, a well-respected poster returns.
     
  19. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,302
    Likes Received:
    1,564
    An open-ended branching factor does indeed seem impractical (as said above), but choosing 4 or 8 could make sense depending on the HW implementation.

    Sounds pretty good. :)
     
  20. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    350
    Likes Received:
    391
    TBH, just having those few branching modes would be enough in most practical cases. I can't really see anything above 8 being all that useful, since memory-fetch overhead comes into play with larger nodes.
     