AMD Radeon RDNA2 Navi (RX 6500, 6600, 6700, 6800, 6900 XT)

Hm... Navi21 has 50% more transistors than GA104 and offers only 25% more performance (6800 XT vs 3070). GA102 has 7% more transistors than Navi 21, on the other hand...

I'm a little shocked that Navi doesn't offer better performance. It will be interesting to see the slower versions and how they compete against GA104 and GA106.

Yes, Ampere is still the better gaming arch it seems. NV needs to come out with 16 GB or higher RAM configurations though.
 
Hm... Navi21 has 50% more transistors than GA104 and offers only 25% more performance (6800 XT vs 3070). GA102 has 7% more transistors than Navi 21, on the other hand...

I'm a little shocked that Navi doesn't offer better performance. It will be interesting to see the slower versions and how they compete against GA104 and GA106.

https://www.realworldtech.com/transistor-count-flawed-metric/

Note the comments around cache densities relative to logic.
 
So RDNA2 doesn't really overclock; it basically maintains a clock within a 50-100 MHz range and it's power consumption that really varies. Which is very aligned with what the PS5 does, and at even higher clocks.
This might come down to the improved circuit implementation and better characterization of the silicon. There was often more consistent behavior out of earlier GPUs once the overgenerous voltage levels were pruned, and it seems AMD has put in physical optimization work that had been sorely lacking, to an extent I hadn't considered.

Also, I've just realized that the XSX is really the only RDNA2-based GPU with 14 CUs per SA. The rest of them, including the PS5, have 10. I wonder what the pros and cons of the approach are.
Wavefront launch seems like it could be handled at the shader engine level, which means the average lifetime of wavefronts needs to be higher to avoid underutilization. Per-shader-array resources would see roughly 40% more contention (14 CUs vs. 10), so things like the L1 and export bus could be more crowded.


(attached benchmark chart from Linus's review)
It stands out how close the minimums and averages are. Given how much broader the gap between minimum and average is on the other implementation, it feels like there could be some driver or software issue capping performance, or there is a very pervasive bottleneck.

4K is really poor. The Infinity Cache "virtual 2TB/s" bandwidth is even more wasted than 5700XT's 448GB/s. Wow.
Some elements of the architecture may not have seen the same improvement. It does seem like the L2 hasn't received much attention, and if we believe it played any role in significantly amplifying bandwidth in prior GPUs, not increasing its bandwidth or concurrency means it's amplifying things significantly less now.

I'm also curious about getting the full slide deck, including footnotes. The memory latency figures are something I'm curious about. The improvement numbers do seem to hint that there's substantial latency prior to the infinity cache, which deadens some of its benefits.
It's also possible that if it's functioning as a straightforward victim cache, it's thrashing more due to streaming data. The driver code mentioning controlling allocation would seem to point to more guidance being needed to separate working sets that fit from those that thrash even a 128 MB cache.

From my developer conversations AMD is just behind - it is a hardware difference. Their RT implementation does not accelerate as much of the RT pipeline. It does not accelerate ray traversal, so an AMD GPU is spending normal compute on that in comparison to NV having a hw unit in the RT core doing that. It is also getting contention due to how it is mapped in hardware. I have been told AMD RT is decidedly slower at incoherent rays than NV RT implementation. So the more incoherent rays a game has, the more rays shot out, the more objects possibly hit... the greater the difference in relative performance in the architectures becomes. But I would like to test it to see where exactly the differences lie.
I'm curious if there's also an effect based on not just incoherence but also how quickly the rays in a wavefront resolve. Incoherence can bring in a lot of extra memory accesses, but being tied together in batches of 32 or so can also lead to a greater number of SIMD-RT block transactions if at least one ray winds up needing measurably more traversal steps than the rest of the wavefront.
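As a toy illustration of that batching effect (made-up step counts, purely my own sketch, not measured data): if the SIMD has to keep the whole wave resident until its slowest ray finishes, the wave's cost tracks the maximum of the per-ray traversal steps rather than the mean.

```python
import random

random.seed(0)
WAVE_SIZE = 32  # rays batched per wavefront

def mean_ray_cost(waves):
    return sum(sum(w) for w in waves) / (len(waves) * WAVE_SIZE)

def mean_wave_cost(waves):
    # Cost of a wave ~= steps of its slowest ray (all lanes stay resident).
    return sum(max(w) for w in waves) / len(waves)

# Coherent rays: everyone takes ~20 steps. Incoherent rays: same average,
# but with a wide spread, so one straggler drags the whole wave along.
coherent   = [[random.randint(18, 22) for _ in range(WAVE_SIZE)] for _ in range(1000)]
incoherent = [[random.randint(2, 38)  for _ in range(WAVE_SIZE)] for _ in range(1000)]

for name, waves in (("coherent", coherent), ("incoherent", incoherent)):
    print(f"{name:10s}: {mean_ray_cost(waves):.1f} steps/ray, "
          f"{mean_wave_cost(waves):.1f} steps/wave")
```

The per-ray averages come out the same in both cases; only the spread differs, yet the per-wave cost nearly doubles for the incoherent batch.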
It looks like AMD does credit the infinity cache for holding much of the BVH working set, and apparently the latency improvement is noted. That does point to there being a greater sensitivity to latency with all the pointer chasing, but that can go back to my question about the actual latency numbers. Even if it's better, it seems like it can be interpreted that the latency figures prior to the infinity cache are still substantial.

(attached chart: Infinity Cache hit rate vs. resolution)

I think this explains it. AMD said a 58% cache hit rate at 4K, and it looks like ~75% at 1080p and ~67-68% at 1440p, so it's a significant drop-off at 4K.
It does seem to show that, even at the large capacity offered, there are a lot of accesses that remain very intractable for caching. If the old rule of thumb that miss rate drops in proportion to the square root of cache size held here, the cache would have been more than sufficient to get very high hit rates.
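Putting rough numbers on that (my own back-of-the-envelope; the 512 GB/s bus figure and the small-cache baseline miss rate are assumptions, only the hit rates come from the slides):

```python
import math

# Quoted Infinity Cache hit rates per resolution; 512 GB/s assumed for the
# 256-bit GDDR6 bus. If only misses reach DRAM, the bandwidth amplification
# factor is roughly 1 / (1 - hit_rate), ignoring writes and latency effects.
DRAM_BW_GBS = 512.0
hit_rates = {"1080p": 0.75, "1440p": 0.675, "4K": 0.58}

for res, hit in hit_rates.items():
    amplification = 1.0 / (1.0 - hit)
    print(f"{res:5s}: hit {hit:.0%} -> effective BW ~{DRAM_BW_GBS * amplification:.0f} GB/s")

# Sanity check of the sqrt rule of thumb (miss rate ~ 1/sqrt(capacity)):
# scaling a hypothetical 8 MB cache with an assumed 80% miss rate at 4K up to
# 128 MB (16x capacity) would predict misses dropping by sqrt(16) = 4x.
baseline_miss_4k = 0.80
predicted_miss = baseline_miss_4k / math.sqrt(128 / 8)
print(f"sqrt-rule prediction for 128 MB: hit ~{1 - predicted_miss:.0%} vs the quoted 58%")
```

That works out to roughly 4x amplification at 1080p but only ~2.4x at 4K, and the 4K hit rate falls well short of what the square-root rule would predict, which is what makes a big streaming component look likely.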
 
Maybe because in "their" upcoming titles they expect ray tracing would not be so bad? This may be an explanation.

And perhaps there is more to performance than just ray tracing? And they offer more VRAM as well as lower power consumption than the competition.
That Intel comparison is a good point. Over all the deserved praise Ryzen 5000 is getting, people seem to forget that its main opponent is something like the contingency plan for the contingency plan for the contingency plan, should something happen to Intel's original plan. If Intel had executed as planned, Zen 3 would still be a great product, but it would not stomp over the competition like it does in reality as it unfolded. Nvidia, on the other hand, did not screw up that badly, even though their success with Samsung 8N and the accompanying power consumption is debatable at least.

Ryzen's success is somewhat making people forget where AMD's graphics performance was before RDNA, just over a year and a half back. To reach where they have in a relatively short period of time is still a commendable achievement. RDNA 2 is more like Zen 2 in that respect, bringing them back into contention. Will RDNA 3 be their Zen 3 for graphics? Apparently AMD have committed to another 50% perf/W improvement for RDNA 3.
I'm also curious about getting the full slide deck, including footnotes. The memory latency figures are something I'm curious about. The improvement numbers do seem to hint that there's substantial latency prior to the infinity cache, which deadens some of its benefits.
It's also possible that if it's functioning as a straightforward victim cache, it's thrashing more due to streaming data. The driver code mentioning controlling allocation would seem to point to more guidance being needed to separate working sets that fit from those that thrash even a 128 MB cache.

Techpowerup did have the slides posted here, they cover some of the memory latency aspects - https://www.techpowerup.com/review/amd-radeon-rx-6800-xt/2.html
 
Well, between this and the Apple M1 review I'm convinced "Infinity Cache" was designed primarily for APUs in mobile devices, and was just used for RDNA2 as well because the design was simple and cheap to scale. The M1's ultra-efficient performance, especially for single-core stuff, looks to be at least partially due to its ability to saturate bandwidth per core for both main memory and the system-wide last level cache. With AMD wanting a piece of that high-margin laptop market, no doubt they found much the same. Which is good news for their laptop chips next year, bad news for RDNA2's 4K performance. Despite all the "benefits" it's still no substitute for an actual main memory bus, oh well.

What I'm not worried about though is the poor hardware raytracing performance. Hardware raytracing is good for coherent rays and that's about it anyway. In fact, it's primarily good for sharp reflections and that's it, period. Even shadows are questionably useful at best since you have so much content restriction with them (severely limited movement, environment detail, etc.). E.g. the Demon's Souls remake doesn't use a whiff of hardware raytracing, and looks better than either Watch Dogs: Legion or Miles Morales. It feels typically weird to watch Digital Foundry do their "the difference is obvious" thing when at least half the time I'm narrowly squinting at the glass in a game hoping to see what's so obvious about it.

That being said it is a victory for Nvidia PR wise. This is what AMD gets for letting their rival control the narrative of what's "important" in GPU tech.
 
Yep, I did notice that of course, and even 48 MB has a reasonably high cache hit rate at 1080p, so even that is possible.


N21 does not support HBM but overclocked AIB versions are coming on the 25th.

Buzzkill :mad:

So, 6 Shader Engines, 120 CUs, HBM3 version when?

There were rumors that AMD were targeting GA104 with Navi21, but they probably revised their targets once they saw Ampere underperforming.
The 256-bit bus is quite the cost-cutting measure; wonder how many at AMD are now wishing that they had gone for the jugular to regain the crown, at least till Nvidia sorts out their Ampere woes.
 
It looks like AMD does credit the infinity cache for holding much of the BVH working set, and apparently the latency improvement is noted. That does point to there being a greater sensitivity to latency with all the pointer chasing, but that can go back to my question about the actual latency numbers. Even if it's better, it seems like it can be interpreted that the latency figures prior to the infinity cache are still substantial.

On this point, I'm thinking that retention/discarding of the BVH structure by the cache may be a driver and application matter more than something hardwired. That would also explain part of the need for specific optimization for AMD's ray tracing implementation.
 
Too many reviews, too many pages. But the one thing I want to know still seems unclear.

TechPowerUp:
The RDNA2 graphics architecture uses Ray Accelerators, fixed-function hardware which calculate ray intersections with boxes and triangles (4-box intersections per clock, or one triangle intersection per clock). Intersection is the most math-intensive step, which warranted special hardware. Most other stages of the raytracing pipeline leverage the vast SIMD resources at the GPU's disposal, while NVIDIA's RT core offers full BVH traversal processing via special hardware.

That's what I had expected and hoped for. If there is no traversal HW, the BVH might be arbitrary, with the DXR implementation backed by compute traversal and AMD's software BVH data structures.
If this is fast enough to be useful, it means full flexibility: a custom BVH can be shared for multiple purposes, and LOD is possible. All my initial complaints about RT might be resolved. (But AMD needs to expose intersection extensions.)
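Just to sketch the split being described, i.e. traversal as ordinary shader/compute code with only the box/triangle tests mapping to the intersection hardware. All names and structures below are hypothetical and not AMD's actual BVH format:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Node:
    bounds_min: Vec3
    bounds_max: Vec3
    children: List["Node"] = field(default_factory=list)  # BVH4-style, up to 4
    triangle: Optional[Tuple[Vec3, Vec3, Vec3]] = None    # leaf payload

def intersect_box(origin: Vec3, inv_dir: Vec3, bmin: Vec3, bmax: Vec3) -> bool:
    # Slab test; this (and the triangle test) is the part the fixed-function
    # intersection hardware would evaluate, 4 boxes per clock.
    tmin, tmax = 0.0, float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, bmin, bmax):
        t0, t1 = (lo - o) * inv, (hi - o) * inv
        tmin, tmax = max(tmin, min(t0, t1)), min(tmax, max(t0, t1))
    return tmin <= tmax

def trace(origin: Vec3, direction: Vec3, root: Node) -> int:
    # Traversal control flow: a plain stack-driven loop running as normal
    # wavefront code, competing for registers, LDS and cache like any shader.
    inv_dir = tuple(1.0 / d if d != 0.0 else float("inf") for d in direction)
    hits, stack = 0, [root]
    while stack:
        node = stack.pop()
        if not intersect_box(origin, inv_dir, node.bounds_min, node.bounds_max):
            continue
        if node.triangle is not None:
            hits += 1          # leaf: the triangle test would also be the FF unit
        else:
            stack.extend(node.children)
    return hits
```

The upside of keeping the loop in software is exactly the flexibility mentioned above (custom BVH layouts, LOD tricks); the downside is that every traversal step occupies regular SIMD issue slots and registers.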

PCGH:
Furthermore, when asked, they stated for the record that ray tracing consists purely of compute shader launches. Shading, texture fetching and BVH traversal (by far the most expensive component) always run in parallel on Navi 21.
umm... these sentences contradict themselves. Not sure what he was trying to say (although German is my native language), but it seems he means:
'AMD has confirmed traversal and shading can run in parallel' - which would imply traversal runs on a FF unit like NV's does.

I guess the truth is somewhat in between, but maybe somebody can clarify?
 
It stands out how close the minimums and averages are.
Minimums are very often close in many titles across many architectures and can be caused simply by game logic - chunk generation/shader changes/etc.
I would not make any conclusions based on the min FPS results.
Here you can see exactly what I am talking about: another test, and they did not hit the same min FPS hiccup.
 
If this is fast enough to be useful, it means full flexibility: a custom BVH can be shared for multiple purposes, and LOD is possible. All my initial complaints about RT might be resolved. (But AMD needs to expose intersection extensions.)

PCGH:

umm... these sentences contradict themselves. Not sure what he was trying to say (although German is my native language), but it seems he means:
'AMD has confirmed traversal and shading can run in parallel' - which would imply traversal runs on a FF unit like NV's does.

I guess the truth is somewhat in between, but maybe somebody can clarify?
In this case, based on the Hot Chips presentation and other information I have seen, it means you can run other compute tasks on the CUs while the CUs are also doing traversal.
 
It's a bit of a Zen 2 moment. Almost there but not quite. I'm looking forward to next year's Navi 3, but they could have beaten the 3080 completely if they had used GDDR6X. Infinity Cache mitigates, but doesn't match.
 
So there's a big discrepancy between AMD's internal pre-launch benchmarks and the actual performance results from 17 reviews. No surprise that review dates and the launch date coincided.
The same result runs through all the test reports visited; only the magnitude differs: the nVidia card is ahead by a minimum of +3.3% and a maximum of +12.4%, on average by +7.4%. Compared to AMD's own benchmarks, which even with the SAM feature saw the Radeon RX 6800 XT minimally (by +1.7%) ahead of the GeForce RTX 3080, this is an astonishingly large discrepancy. The same SAM feature has been left out of these independent benchmarks, usually automatically through the choice of test platform (mainly Intel or Zen 2 systems).
...
But the 8.6 percentage point difference between AMD's performance picture of the Radeon RX 6800 XT and the results of the independent test reports is somewhat harsh, and just above what is usually granted to manufacturers' own benchmarks as a "margin of error". Above all, it clearly shifts the perception, at least of the Radeon RX 6800 XT: no longer on par with the GeForce RTX 3080, but noticeably below it.
https://www.3dcenter.org/news/radeo...ltate-zur-ultrahd4k-performance-im-ueberblick
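Making the arithmetic behind that figure explicit (just re-deriving 3DCenter's number, nothing new):

```python
# AMD's SAM-assisted slides: 6800 XT ~1.7% ahead of the 3080.
# Independent reviews: 3080 ahead by ~7.4% on average.
amd_claim = 1.017          # 6800 XT relative to RTX 3080 = 1.000 (AMD slides)
reviews   = 1.0 / 1.074    # reviews: 3080 +7.4%, so the 6800 XT lands at ~0.931
gap_points = (amd_claim - reviews) * 100
print(f"AMD {amd_claim:.3f} vs reviews {reviews:.3f}: gap ~{gap_points:.1f} percentage points")
# -> ~8.6, matching the figure quoted above
```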
 
I notice some quite impressive multi-monitor idle power draw (with memory at 7 MHz, which only gives a quarter of the required bandwidth). https://www.techpowerup.com/review/amd-radeon-rx-6800-xt/31.html
I guess that's a strong hint that they are simply presenting the screen from the Infinity Cache. Maybe basically running the GPU with the Infinity Cache as main memory. I wonder how much they can extend this to other "2D" usage scenarios. Obviously they are not doing it for video playback right now, where video memory is running at full(!) speed (see the same page above).

Also, it seems that the (at least) 64 MB requirement for 2×4K is too much for this mode: https://www.computerbase.de/2020-11..._leistungsaufnahme_desktop_youtube_und_spiele (again probably falling back to full speed instead of some 2D memory clock)
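A quick back-of-the-envelope on that threshold (assuming plain 32 bpp scanout surfaces, one per display, no compression; the actual formats aren't public):

```python
# Two uncompressed 4K scanout surfaces land right around the 64 MB mark,
# which would be consistent with dual 4K falling out of a cache-resident mode.
bytes_per_pixel = 4                      # assumed 8:8:8:8 or 10:10:10:2
fb_4k = 3840 * 2160 * bytes_per_pixel
print(f"one 4K surface : {fb_4k / 1e6:5.1f} MB ({fb_4k / 2**20:5.1f} MiB)")
print(f"two 4K surfaces: {2 * fb_4k / 1e6:5.1f} MB ({2 * fb_4k / 2**20:5.1f} MiB)")
```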
 
PCGH:

umm... these sentences contradict themselves. Not sure what he was trying to say (although German is my native language), but it seems he means:
'AMD has confirmed traversal and shading can run in parallel' - which would imply traversal runs on a FF unit like NV's does.

I guess the truth is somewhat in between, but maybe somebody can clarify?
I don't speak German, but from your post I'd guess they're trying to say that the traversal isn't reserving any CUs/stream processors. The CU chugs along running shaders while the Ray Accelerator does its thing and queues traversal to be run on that CU next.
 
That's what I had expected and hoped for. If there is no traversal HW, the BVH might be arbitrary, with the DXR implementation backed by compute traversal and AMD's software BVH data structures.
If this is fast enough to be useful, it means full flexibility: a custom BVH can be shared for multiple purposes, and LOD is possible. All my initial complaints about RT might be resolved. (But AMD needs to expose intersection extensions.)
As I understand it, DXR specifically obfuscates the data format of the BVH. It seems the intention here is that each IHV can optimise the data format to match the way the hardware works.

So, for example, perhaps inline ray tracing (DXR 1.1) is preferred on AMD and the BVH data format is optimised for that.
 