AMD: Navi Speculation, Rumours and Discussion [2019-2020]

That is the case with every new major API feature: its effective usability trickles from high-end GPUs down the stack. It was the case with tessellation, HDR lighting, Shader Model 3, etc.

In fact, most Ultra graphics settings in today's PC games are curiosity switches for 1060/580 owners. That doesn't stop developers from implementing them in EVERY PC title. And the way I see it, RT is no exception to that rule, especially now that consoles, the major APIs, almost all engines and AAA games are supporting it.
This is true.
My concern is with the "trickling down the stack", because what has allowed this in the past is lithographic advances. But as those come ever more slowly, particularly for high-power chips, that trickling may simply dry up if the new technology doesn't also offer increased efficiency.
There is nothing wrong with having part of the lighting optionally computed using raytracing for high-end PC graphics! But then again, if confined to that niche, its overall impact on real-time graphics in the industry will be modest.
 
That is why Dr. Su said they will release ray tracing when it doesn't impact performance.
Total nonsense. How is this possible?
There could be a lot of reasons why AMD decided to skip the initial 'tier 1_0' hardware implementation. They could 1) work on preliminary 'tier 2_0' specs which offer performance benefits, 2) research improved heterogeneous integration options that allow faster on-die memory and multi-die interconnects, or 3) just wait for game developers to learn the API and optimize their software paths. Either way, their first-generation implementation could actually be faster than the competition's current implementation.

One can interpret their comments about the need to 'get the ecosystem ready' as primarily relying on case 3). That said, AMD wouldn't really admit to cases 1) and 2); they have been very tight-lipped about their future plans recently, probably owing to the multiple delays with Navi.
It's just a ruse to cover up the fact that they are two years behind NVIDIA in ray tracing. Worse yet, they will have it next year in consoles while their current $700 and $500 GPUs lack even the most basic DXR support.
They probably already taped out a hardware raytracing implementation, but it's designed for a $200 mid-level APU in a $500 game console - so it would indeed be 'most basic' compared to $1000-plus GPUs which still struggle to provide acceptable performance levels.

I'd rather take their mid-2020 implementation designed for a high-end desktop GPU.
 
Yeah, it's definitely something I don't think I would personally use, preferring to have double the frame rate or a higher resolution. But going forward it would be a nice option to have, not necessarily a game changer, yet.
 
I wonder how much die area AMD's RT implementation will take. With Navi, which is a "small" chip on 7nm, it seems they have a hard time competing with Nvidia's big TU104/TU106, which are still on 12nm, efficiency/power wise (based on what they said, but I concede we need to wait for the reviews to be sure of that). 7nm won't save them if the implementation is not efficient...
 
Either way, their first-generation implementation could actually be faster than the competition's current implementation.
That assumes the competition would sit on their laurels and do nothing to improve their current RT solution, which we know is unlikely to happen, as NVIDIA will push their RT angle to the extremes.
I'd rather take their mid-2020 implementation designed for a high-end desktop GPU.
And I would rather take a CPU from 2025 paired with a GPU from the same era, but we are talking about the here and now.
 
I wonder how much die area AMD's RT implementation will take. With Navi, which is a "small" chip on 7nm, it seems they have a hard time competing with Nvidia's big TU104/TU106, which are still on 12nm, efficiency/power wise (based on what they said, but I concede we need to wait for the reviews to be sure of that). 7nm won't save them if the implementation is not efficient...
There are a number of different efficiency metrics.
Let's never forget that in the desktop market the most significant by far is performance/$. AMD is pretty much competitive here and differences will be small between the manufacturers unless one player in the duopoly makes a major push for market share.
If you're a manufacturer you have reason to care about performance/mm2 since you pay for wafer starts. This is modified by different cost structures for different nodes, but even before Navi AMD did quite well. They have a distinct advantage of course using a denser node, pitching their 251 mm2 Navi against Nvidia's TU104 (RTX 2070 Super) at 545 mm2 and TU106 (RTX 2070) at 445 mm2.
To try to avoid the effects of process and attempt to evaluate architectural efficiency, you would look at performance/gate instead, where the same chips have 10.3, 13.6 and 10.6 billion transistors respectively. AMD is definitely competitive.
And of course you have performance/W, which is a tricky one because it changes so drastically depending on frequency and voltages in the relevant intervals. This is also where the lack of test data from the RX 5700 makes comparisons difficult at the moment, but that will be rectified within days. It matters mostly at the extreme limit of cooling ability. For mid-range products, the current manufacturer scheme unfortunately seems to be to push the chips as far as they will go on 200W or so of power, which is possible to cool at reasonable cost.

Looking at the overall picture, and bearing in mind that independent test data is lacking, AMD and Nvidia seem to be within spitting distance of each other, apart from performance/mm2 which is mostly a manufacturing concern that isn’t critical in the midrange. It will be interesting to see what Nvidia will achieve once they move to finer lithography.
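
As a quick sanity check, here is a rough C++ calculation of the transistor density implied by the die sizes and counts quoted above (any performance-per-metric figure would still need independent benchmark data plugged in; the numbers here are only the ones from this post):

#include <cstdio>

int main() {
    struct Chip { const char* name; double area_mm2; double transistors_bn; };
    const Chip chips[] = {
        { "Navi 10", 251.0, 10.3 },
        { "TU104",   545.0, 13.6 },
        { "TU106",   445.0, 10.6 },
    };
    for (const Chip& c : chips) {
        // Million transistors per mm2 - a proxy for the node-density advantage.
        double density = c.transistors_bn * 1000.0 / c.area_mm2;
        printf("%-8s %5.0f mm2, %4.1fB transistors -> %4.1f MTr/mm2\n",
               c.name, c.area_mm2, c.transistors_bn, density);
    }
    return 0;
}

That works out to roughly 41 MTr/mm2 for Navi 10 versus about 25 and 24 MTr/mm2 for the Turing chips, which illustrates why performance/mm2 across different nodes mostly measures the process, while performance/transistor is closer to an architectural comparison.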
 
And I would rather take a CPU from 2025 paired with a GPU from the same era, but we are talking about the here and now.

Yeah, and I'd like a CPU and GPU combo from the 2030s. That's a bullshit comparison. Waiting one year doesn't equate to waiting five.
 
The hardware itself doesn't have that extreme of a granularity.

It looks like it does, indirectly, through occupancy considerations. My interpretation is that it is constrained by how the 256-VGPR budget (4x64) can be split:
256/1 = 256 (4x64, 1 wave)
256/2 = 128 (4x32, 2 wave)
256/3 = 84 (4x21, 3 wave)
256/4 = 64 (4x16, 4 wave)
256/5 = 48 (4x12, 5 wave)
etc.
As you can see, there is no divisor between 1 and 2 that would allow 4x48 VGPRs as a result. The compiler then decides to maximize register use within the occupancy bin.
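
To make the binning concrete, here is a small C++ sketch of how those occupancy bins fall out of a 256-VGPR budget with an assumed allocation granularity of 4 registers (the granularity and the allocation behaviour are my assumptions for illustration, not documented compiler behaviour):

#include <cstdio>

int main() {
    const int budget = 256;      // VGPR budget per lane assumed above (4x64)
    const int granularity = 4;   // assumed allocation granularity (the "4x" factor)
    for (int waves = 1; waves <= 5; ++waves) {
        // Largest per-wave VGPR count that still fits 'waves' waves,
        // rounded down to the allocation granularity.
        int vgprs = (budget / waves) / granularity * granularity;
        printf("%d wave(s): up to %3d VGPRs (4x%d)\n", waves, vgprs, vgprs / granularity);
    }
    // Note there is no bin between 1 and 2 waves: a shader needing 192 (4x48)
    // VGPRs still lands in the 1-wave bin, so it may as well use all 256.
    return 0;
}

This reproduces the 256/128/84/64/48 sequence in the list above.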
 
There are a number of different efficiency metrics.
Let's never forget that in the desktop market the most significant by far is performance/$. AMD is pretty much competitive here and differences will be small between the manufacturers unless one player in the duopoly makes a major push for market share.
If you're a manufacturer you have reason to care about performance/mm2 since you pay for wafer starts. This is modified by different cost structures for different nodes, but even before Navi AMD did quite well. They have a distinct advantage of course using a denser node, pitching their 251 mm2 Navi against Nvidia's TU104 (RTX 2070 Super) at 545 mm2 and TU106 (RTX 2070) at 445 mm2.
To try to avoid the effects of process and attempt to evaluate architectural efficiency, you would look at performance/gate instead, where the same chips have 10.3, 13.6 and 10.6 billion transistors respectively. AMD is definitely competitive.
And of course you have performance/W, which is a tricky one because it changes so drastically depending on frequency and voltages in the relevant intervals. This is also where the lack of test data from the RX 5700 makes comparisons difficult at the moment, but that will be rectified within days. It matters mostly at the extreme limit of cooling ability. For mid-range products, the current manufacturer scheme unfortunately seems to be to push the chips as far as they will go on 200W or so of power, which is possible to cool at reasonable cost.

Looking at the overall picture, and bearing in mind that independent test data is lacking, AMD and Nvidia seem to be within spitting distance of each other, apart from performance/mm2 which is mostly a manufacturing concern that isn’t critical in the midrange. It will be interesting to see what Nvidia will achieve once they move to finer lithography.


I really like your post, it's very informative.

But IMO, it's much simpler. Navi is a 250mm2 chip on 7nm at 200W+ for the XT version. Nvidia is doing bigger, faster, on an older, less dense node, in the same power envelope (or less)...
 
I wonder how much die area AMD's RT implementation will take. With Navi, which is a "small" chip on 7nm, it seems they have a hard time competing with Nvidia's big TU104/TU106, which are still on 12nm, efficiency/power wise (based on what they said, but I concede we need to wait for the reviews to be sure of that). 7nm won't save them if the implementation is not efficient...

We can make some assumptions from the AMD RT patent, and from what we can guess about NV:



AMD uses TMUs to process one iteration of the RT loop, which means the shader program issues an instruction to intersect one level of the BVH (box or triangle intersection). A compute shader would look like this (simplified):

queue.push(BVH_root_node)
while (!queue.empty())
{
    intersection_info = TMU.intersect(ray, queue) // may push new nodes to the queue, maybe implemented using LDS memory
    if (intersection_info.hitType == triangle && intersection_info.t < closestHit.t)
        closestHit = intersection_info // keep the nearest triangle hit, compared by hit distance t
}
store closest intersection for later processing...

This means the shader is busy while raytracing, but there is also flexibility in programming (it could terminate if the traversal takes too long, and maybe enable other really interesting things...)
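
For illustration, here is a hedged, CPU-side C++ sketch of that traversal loop. The node layout, the 'tmu_intersect' stand-in for the texture-unit instruction, and the stack handling are my assumptions, not the patent's actual interfaces:

#include <cstdint>
#include <limits>
#include <vector>

struct Ray { float origin[3]; float dir[3]; };

struct IntersectResult {
    bool hit_triangle = false;                        // true if this node was a triangle leaf
    float t = std::numeric_limits<float>::max();      // hit distance along the ray
    uint32_t triangle_id = ~0u;
    std::vector<uint32_t> children;                   // for box nodes: children whose boxes the ray hits
};

// Stand-in for the TMU intersection instruction: tests one BVH node against the ray.
// Stubbed out here; on the GPU it would fetch the node through the texture/cache path.
IntersectResult tmu_intersect(const Ray&, uint32_t /*node*/) { return {}; }

uint32_t trace_closest(const Ray& ray, uint32_t bvh_root) {
    std::vector<uint32_t> stack{ bvh_root };          // on the GPU this queue/stack could live in LDS
    float closest_t = std::numeric_limits<float>::max();
    uint32_t closest_tri = ~0u;
    while (!stack.empty()) {
        uint32_t node = stack.back();
        stack.pop_back();
        IntersectResult r = tmu_intersect(ray, node);
        if (r.hit_triangle) {
            if (r.t < closest_t) { closest_t = r.t; closest_tri = r.triangle_id; }
        } else {
            // The shader decides what happens next: push children, reorder them,
            // cull, or terminate early - the programmability mentioned above.
            for (uint32_t c : r.children) stack.push_back(c);
        }
    }
    return closest_tri;
}

int main() {
    Ray ray{};                                        // dummy ray; the stub returns no hits
    return trace_closest(ray, 0) == ~0u ? 0 : 1;
}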



On NV it looks more likely just like this:

intersection_info = RT.Core.FindClosestIntersection(ray);

Which means the shader core likely becomes available for other pending tasks after this command (like hit-point shading, async compute, ...).
Also, we have no indication NV's RT cores would use the TMUs or share the cache to access the BVH.



The conclusion is that NV's RT is likely faster but takes more chip area. AMD likely again offers more general compute performance, which could compensate for this.
But it could also happen that AMD adds a FF unit to process the outer loop I have written above; the patent mentions this as optional. Still, fetching textures while raytracing would compromise performance more than on NV - maybe. (The patent mentions that the advantage of sharing TMUs/VGPRs is avoiding the need for specialized large buffers to hold BVH or ray payload data.)

It will become very interesting to compare performance, and to see what programming flexibility (if any) can add...


That assumes the competition would sit on their laurels and do nothing to improve their current RT solution, which we know is unlikely to happen, as NVIDIA will push their RT angle to the extremes.
My bet (better said, my hope) is that the next logical step would be to make BVH generation more flexible.
For example, if they want to be compatible with mesh shaders, they just have to make this dynamic.
This would be awesome because it solves the LOD limitation. (I would not even care if BVH generation becomes FF too :) )

After that it would make sense to decrease ROPs and increase RT cores, up to the point where rasterization is implemented only with compute. (Texture filtering remains, of course.)

And only after that would I see a need for ray reordering. (Which I initially thought to be FF already now, and the assumed complexity was a main reason to question RTX.)
 
If you’re a manufacturer you have reason to care about performance/mm2 since you pay for wafer starts. This is modified by different cost structures for different nodes, but even before Navi AMD did quite well.

Here I'd say AMD did "well" in performance/mm^2 largely by sacrificing performance/watt a lot, and by investing more in PCBs with higher-end voltage regulation components to bring their graphics chips well beyond their ideal performance/watt curves. A fully enabled Polaris 11 has a 35W TDP in a MacBook Pro with ~85% of the performance of the desktop version that needs ~70W. A Vega 64 can be set to consume 180W at over 90% of its performance, but then it'd perform consistently lower than a GTX 1080 and the marketing team couldn't have that happening.
But even then Vega 10 had around 55% more transistors than GP104, though that can somewhat be attributed to the fact that Vega 10 was designed for a multitude of loads (gaming, server, compute, etc.) and not just rasterization. And of course to the fact that AMD had been counting on Vega 10 to reach much higher clocks, probably closer to what Radeon VII hit on a new process.
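
Putting the Polaris 11 example in numbers (the percentages and wattages are the poster's estimates above, not measured data):

#include <cstdio>

int main() {
    // ~85% of desktop performance at 35W vs. the desktop part at ~70W (figures from the post above)
    double mobile_perf = 0.85, mobile_watts = 35.0;
    double desktop_perf = 1.00, desktop_watts = 70.0;
    double ratio = (mobile_perf / mobile_watts) / (desktop_perf / desktop_watts);
    printf("perf/W advantage of the 35W configuration: ~%.1fx\n", ratio);   // ~1.7x
    return 0;
}

Roughly 1.7x better perf/W in exchange for giving up ~15% of the performance, which is the kind of trade-off being described.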
 
Here I'd say AMD did "well" in performance/mm^2 largely by sacrificing performance/watt a lot, and by investing more in PCBs with higher-end voltage regulation components to bring their graphics chips well beyond their ideal performance/watt curves. A fully enabled Polaris 11 has a 35W TDP in a MacBook Pro with ~85% of the performance of the desktop version that needs ~70W. A Vega 64 can be set to consume 180W at over 90% of its performance, but then it'd perform consistently lower than a GTX 1080 and the marketing team couldn't have that happening.
Well, that's the reality of the PC market unfortunately. Reviewers have their part in this, I feel, since you often see "winners" declared even when the differences are minuscule, often with words like "dominates" or "crushes" and so on, even when describing what in actual gameplay would be imperceptible.
And that translates into market value. When another 15% performance is enough to shift your product into a different pricing bracket and correspondingly better margins, it's not surprising that the chips are pushed as far as the cooling allows.
I'm getting too old for the PC gaming market. It is geared towards the excitable youth.
 
Variable rate shading is transparent to the user and it "only" boosts performance; according to the results we've seen from Ice Lake U presentations, it's mostly useful for GPUs that lack raw compute power for their segment (i.e. not most AMD solutions).
It does improve performance, even if not by leaps and bounds, on high-end RTX cards too, so its benefits are not limited to weak hardware.

There could be a lot of reasons why AMD decided to skip the initial 'tier 1_0' hardware implementation. They could 1) work on preliminary 'tier 2_0' specs which offer performance benefits, 2) research improved heterogeneous integration options that allow faster on-die memory and multi-die interconnects, or 3) just wait for game developers to learn the API and optimize their software paths. Either way, their first-generation implementation could actually be faster than the competition's current implementation.

One can interpret their comments about the need to 'get the ecosystem ready' as primarily relying on case 3). That said, AMD wouldn't really admit to cases 1) and 2); they have been very tight-lipped about their future plans recently, probably owing to the multiple delays with Navi.

They probably already taped out a hardware raytracing implementation, but it's designed for a $200 mid-level APU in a $500 game console - so it would indeed be 'most basic' compared to $1000-plus GPUs which still struggle to provide acceptable performance levels.

I'd rather take their mid-2020 implementation designed for a high-end desktop GPU.
Considering the timeframes, it's pretty safe to assume that the consoles use next-gen RDNA instead of first-gen, and I think AMD already confirmed somewhere that 2nd-gen RDNA includes RT hardware (probably the TMU thing they patented).
 
It does improve performance, even if not by leaps and bounds, on high-end RTX cards too, so its benefits are not limited to weak hardware.

Considering the timeframes, it's pretty safe to assume that the consoles use next-gen RDNA instead of first-gen, and I think AMD already confirmed somewhere that 2nd-gen RDNA includes RT hardware (probably the TMU thing they patented).

We could be looking at 3 different RTRT implementations: Sony + Microsoft Xbox + PC AMD.
Though preferably it should be only one, for developers' sanity.
 