AMD Radeon RDNA2 Navi (RX 6500, 6600, 6700, 6800, 6900 XT)

I think the 6800's appeal will be for overclockers... if the card uses the same PCB (and cooling) as the 6800 XT, then it should be quite easy to bring it up to 2.25 GHz and maybe more with limited thermal issues, and with good scaling since it has the same memory subsystem.
Wasn't it a 2-slot cooler for the 6800 and 2.5 slots for the 6800 XT/6900 XT?
 
Something isn't right there. Why would the 2080 Ti be only half as fast as the 3080? Its RT performance is usually more or less in line with its rasterization performance vs the 3080, even in the fully path-traced demos as far as I've seen.
From the looks of it, this RT test scene has only about 20 BVH nodes, each using custom intersection code for some SDF primitives, but no triangles at all.
So this SDK sample does not tell us anything about RT performance in practice.
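For reference, "custom intersection code for SDF primitives" means the sample sphere-traces a distance function instead of doing ray/triangle tests, so fixed-function BVH traversal barely gets exercised. A minimal CPU-side sketch of that technique (the two-sphere scene here is made up for illustration, not the SDK sample's actual content):

```python
# Minimal CPU-side sketch of sphere tracing an SDF -- the kind of custom
# intersection a DXR intersection shader runs in place of fixed-function
# ray/triangle tests. The two-sphere "scene" is made up for illustration.
import math

def sdf_sphere(p, center, radius):
    """Signed distance from point p to a sphere surface."""
    return math.dist(p, center) - radius

def scene_sdf(p):
    # A handful of procedural primitives -- comparable to a BVH with only
    # ~20 leaf nodes, so traversal cost is negligible.
    return min(sdf_sphere(p, (0.0, 0.0, 5.0), 1.0),
               sdf_sphere(p, (2.0, 0.0, 6.0), 0.5))

def sphere_trace(origin, direction, max_steps=128, eps=1e-4, t_max=100.0):
    """March along the ray; the SDF bounds how far we can safely step."""
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        d = scene_sdf(p)
        if d < eps:
            return t      # hit at parametric distance t
        t += d            # safe step: no surface is closer than d
        if t > t_max:
            break
    return None           # miss

print(sphere_trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)))  # hits first sphere at t = 4.0
```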
 
Food for thought ...

In addition, the general question to be raised is whether AMD's "Smart Access Memory" is not a step in the wrong direction. The exact mode of operation of the feature is still a bit unclear, but in any case it requires the interplay of a Zen 3 processor, a 500-series motherboard and an RDNA2 graphics card - in other words, it is a feature that only works with 100% AMD technology. If something like this came from Intel or nVidia, all hell would certainly break loose - with AMD it is currently drowning in all the euphoria, but it still provides evidence that having a single supplier design the market for both PC processors and PC graphics cards is simply suboptimal from a competition and fairness standpoint.

In particular, AMD now enables Intel & nVidia to initiate comparable developments without hesitation - which can be conveniently justified by pointing to this AMD feature. It is not very edifying that AMD, the moment it gets a whiff of real success, immediately copies the (bad) methods of the previous top dogs.
https://www.3dcenter.org/news/hardware-und-nachrichten-links-des-28-oktober-2020
 
Based on all that they have said about the L3 cache, and all the leaks, I think the Infinity Cache consists of 16 slices of 8 MB with a 512b interface on each of them. The easiest, most sensible way of varying its size is to increase/decrease the number of those slices. As the driver will definitely need to know the amount of cache on the card, I think it's a good guess that they would include this as a new property in the drivers. Looking at the macOS driver properties, there are 3 new ones. unknown2 seems to be the total CU count, and unknown0 doesn't fit, but unknown1 is 16 on Navi 21, which is exactly right.

Based entirely on this, I predict that the 40-CU Navi 22 will have 96 MB of Infinity Cache, that the 32-CU Navi 23 will have 64 MB, and that the upcoming APUs will both have 32 MB.

The APUs sound about right, assuming they are targeting 1080p gaming (framebuffers scale linearly with resolution, and 1080p is 1/4 of 4K). The middle chips are a bit weird. They definitely can't do 4K (not just because of the lack of cache, but also compute power), but if you are targeting 2560x1440, you don't really even need 64 MB. Maybe the number of slices is high for the bandwidth, not for the cache amount?
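To make the slice-scaling guess concrete, here is a tiny sketch of what those predictions imply for both capacity and aggregate interface width (the per-chip slice counts are the guesses above, not confirmed specs):

```python
# Sketch of the slice-scaling guess above: capacity and aggregate interface
# width both scale with the number of 8 MB / 512-bit slices. The per-chip
# slice counts are this post's predictions, not confirmed specs.
SLICE_MB, SLICE_BITS = 8, 512

guessed_slices = {"Navi 21": 16, "Navi 22": 12, "Navi 23": 8, "APU": 4}

for chip, n in guessed_slices.items():
    print(f"{chip}: {n * SLICE_MB} MB, {n * SLICE_BITS}-bit aggregate interface")
# Navi 21: 128 MB / 8192-bit ... APU: 32 MB / 2048-bit
```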
 
Because AMD has always contributed to public standards instead of using its market position to force proprietary technologies, and Smart Access Memory just increases FPS a little and nothing more.
 
Yeah, it's not anything proprietary, just a slight performance increase similar to SmartShift in laptops. And AMD is obviously trying to incentivize people to buy their own CPU+GPU combinations; nothing wrong with that. The competition is free to do something of their own.
 
Based on all that they have said about the L3 cache, and all the leaks, I think the Infinity Cache consists of 16 slices of 8 MB with a 512b interface on each of them. The easiest, most sensible way of varying its size is to increase/decrease the number of those slices. As the driver will definitely need to know the amount of cache on the card, I think it's a good guess that they would include this as a new property in the drivers. Looking at the macOS driver properties, there are 3 new ones. unknown2 seems to be the total CU count, and unknown0 doesn't fit, but unknown1 is 16 on Navi 21, which is exactly right.

Unknown0 seems to be the same as DCUs per SE.

Based entirely on this, I predict that the 40-CU Navi 22 will have 96 MB of Infinity Cache, that the 32-CU Navi 23 will have 64 MB, and that the upcoming APUs will both have 32 MB.

The APUs sound about right, assuming they are targeting 1080p gaming (framebuffers scale linearly with resolution, and 1080p is 1/4 of 4K). The middle chips are a bit weird. They definitely can't do 4K (not just because of the lack of cache, but also compute power),

They have enough compute power for 4K if you don't play the newest games and don't want more than 60 FPS. Most 4K displays are still 60 Hz. I played some (non-FPS) games at 4K even on a low-end RX 460.

But they are optimized for 1440p (to play new games at a solid 60 FPS too), not for 4K. So for these it's acceptable/normal that the cache capacity will be a bottleneck at 4K.

but if you are targeting 2560x1440, you don't really even need 64 MB. Maybe the number of slices is high for the bandwidth, not for the cache amount?

The pixel count of 1440p is 44% of the pixel count of 4K, so if 128 MB is the sweet spot for pixel-related data at 4K, then 64 MB should be the sweet spot for 1440p.

But there is one major thing consuming memory that does not scale with the resolution: the geometry data and the BVH tree required for ray tracing, and these easily consume a couple of tens of megabytes.

We really want to fit most of the BVH tree into the cache because traversing the tree is very latency-critical. That's why the "optimal" cache size will not scale linearly with the pixel count; instead we want the cache capacity to look like (typical geometry + BVH size) + (C * target_pixel_count).
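A back-of-the-envelope sketch of that model, calibrating C against the known 128 MB at 4K and assuming (purely for illustration) a ~30 MB fixed geometry + BVH term:

```python
# Back-of-the-envelope sketch of the capacity model above:
#   capacity ~= (geometry + BVH) + C * pixel_count
# The ~30 MB fixed term is an assumption for illustration; C is calibrated
# so that 4K lands on the known 128 MB Infinity Cache.
fixed_mb = 30.0                      # assumed geometry + BVH footprint
pixels_4k = 3840 * 2160              # ~8.29 Mpix
pixels_1440p = 2560 * 1440           # ~3.69 Mpix (~44% of 4K)

C = (128.0 - fixed_mb) / pixels_4k   # MB per pixel of pixel-related data

est_1440p = fixed_mb + C * pixels_1440p
print(f"estimated 1440p sweet spot: {est_1440p:.0f} MB")  # ~74 MB, not 44% of 128
```

Under these made-up numbers the 1440p optimum lands above a straight 44% scaling, which is exactly the point: the fixed geometry + BVH term does not shrink with resolution.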
 
Based on all that they have said about the L3 cache, and all the leaks, I think the Infinity Cache consists of 16 slices of 8 MB with a 512b interface on each of them.

From AMD's bandwidth numbers, we arrived at the conclusion that the GPU LLC is 4096 bits wide.

4096 bit * 2.25 GHz = 9216 Gb/s = 1152 GB/s. 1152 GB/s from the LLC + 512 GB/s from GDDR6 = 1664 GB/s, which is how much bandwidth AMD is claiming on their slides.
Your suggestion of 16 slices at 512 bit each would result in 8192 bits total. I guess it could be that too, but then the LLC would need to clock at half the core clock.
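As a quick sanity check, both hypotheses can hit the same peak number; they just trade width against clock (the clocks here are this post's assumptions):

```python
# Sanity check of the two LLC width hypotheses; the clocks are this post's
# assumptions, not confirmed figures.
def peak_gbs(width_bits, clock_ghz):
    """Peak bandwidth in GB/s for a given bus width and clock."""
    return width_bits / 8 * clock_ghz

print(peak_gbs(4096, 2.25))        # 1152.0 -> 4096-bit at the core clock
print(peak_gbs(8192, 1.125))       # 1152.0 -> 8192-bit at half the core clock
print(peak_gbs(4096, 2.25) + 512)  # 1664.0 -> plus 512 GB/s of GDDR6
```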

Or perhaps we're looking at 8 slices, one for every 10 CUs / 5 WGPs in Navi 21.
However, it doesn't look like the cache amount is tied to the CU count, considering even the 6800, with 25% of its CUs deactivated, gets access to the full 128 MB of LLC.


There is nothing 'proprietary' about SAM. What are they even on about? AMD basically just made it work on Windows. Intel and Nvidia can do the same.

https://docs.microsoft.com/en-us/windows-hardware/drivers/display/resizable-bar-support
I wonder if SAM is only effective if the GPU is connected through a PCIe 4.0 x16 bus, in which case an Intel implementation would need a Tiger Lake, Rocket Lake or later CPU model.
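For what it's worth, on Linux you can inspect the PCI BAR apertures this all hinges on directly in sysfs; a small sketch (the PCI address is an example, substitute your own GPU's):

```python
# Sketch: inspect a GPU's PCI BAR apertures on Linux via sysfs. Each line of
# the "resource" file is "start end flags" in hex. The PCI address below is
# an example; substitute your own GPU's.
from pathlib import Path

dev = Path("/sys/bus/pci/devices/0000:03:00.0")

for i, line in enumerate((dev / "resource").read_text().splitlines()):
    start, end, _flags = (int(field, 16) for field in line.split())
    if end:
        print(f"BAR{i}: {(end - start + 1) / 2**20:.0f} MiB")
# Without Resizable BAR the VRAM aperture is typically 256 MiB; with
# SAM/ReBAR enabled it spans the whole VRAM (e.g. 16384 MiB on Navi 21).
```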
 
From AMD's bandwidth numbers, we arrived at the conclusion that the GPU LLC is 4096 bits wide.

4096 bit * 2.25 GHz = 9216 Gb/s = 1152 GB/s. 1152 GB/s from the LLC + 512 GB/s from GDDR6 = 1664 GB/s, which is how much bandwidth AMD is claiming on their slides.
Your suggestion of 16 slices at 512 bit each would result in 8192 bits total. I guess it could be that too, but then the LLC would need to clock at half the core clock.
The Zen 2 L3 saturates at 667 GB/s on my 3700X at around 4.3 GHz; based on that, it's around 256b per slice at 4 MB per slice. 16 * 8 MB seems reasonable, as that would match the slice size of Zen 3 and give you a 32B cache line.
 
Based on all that they have said about the L3 cache, and all the leaks, I think the Infinity Cache consists of 16 slices of 8 MB with a 512b interface on each of them. The easiest, most sensible way of varying its size is to increase/decrease the number of those slices. As the driver will definitely need to know the amount of cache on the card, I think it's a good guess that they would include this as a new property in the drivers. Looking at the macOS driver properties, there are 3 new ones. unknown2 seems to be the total CU count, and unknown0 doesn't fit, but unknown1 is 16 on Navi 21, which is exactly right.

Based entirely on this, I predict that the 40-CU Navi 22 will have 96 MB of Infinity Cache, that the 32-CU Navi 23 will have 64 MB, and that the upcoming APUs will both have 32 MB.

The APUs sound about right, assuming they are targeting 1080p gaming (framebuffers scale linearly with resolution, and 1080p is 1/4 of 4K). The middle chips are a bit weird. They definitely can't do 4K (not just because of the lack of cache, but also compute power), but if you are targeting 2560x1440, you don't really even need 64 MB. Maybe the number of slices is high for the bandwidth, not for the cache amount?


The L3 cache is described in one of the slides' footnotes, as was mentioned here before:
"" Measurement calculated by AMD engineering, on a Radeon RX 6000 series card with 128 MB AMD Infinity Cache and 256-bit GDDR6. Measuring 4k gaming average AMD Infinity Cache hit rates of 58% across top gaming titles, multiplied by theoretical peak bandwidth from the 16 64B AMD Infinity Fabric channels connecting the Cache to the Graphics Engine at boost frequency of up to 1.94 GHz. RX-535"

So the bus to the cache is 16 x 512 bit at 1.94 GHz, for just under 2 TB/s.
This L3 cache seems to work completely transparently, caching reused data with a 58% hit rate. That must be data reused within one frame. They could be pinning the frame buffer to the L3 cache, but I guess that would not even be needed, or would be counterproductive.
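Redoing the footnote's arithmetic as a sketch; note that applying the quoted 58% hit rate to the peak happens to land on the same ~1152 GB/s figure discussed earlier in the thread:

```python
# Reproducing the footnote's arithmetic: 16 Infinity Fabric channels of
# 64 B each, at up to 1.94 GHz, weighted by the quoted 58% hit rate.
channels, channel_bytes, clock_ghz, hit_rate = 16, 64, 1.94, 0.58

peak = channels * channel_bytes * clock_ghz                   # GB/s
print(f"peak LLC bandwidth: {peak:.0f} GB/s")                 # ~1987 GB/s, just under 2 TB/s
print(f"hit-rate weighted:  {peak * hit_rate:.0f} GB/s")      # ~1152 GB/s
# 1152 GB/s from the cache + 512 GB/s of GDDR6 = the 1664 GB/s on the slide.
```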
 
The L3 cache is described in one of the slides' footnotes, as was mentioned here before:
"" Measurement calculated by AMD engineering, on a Radeon RX 6000 series card with 128 MB AMD Infinity Cache and 256-bit GDDR6. Measuring 4k gaming average AMD Infinity Cache hit rates of 58% across top gaming titles, multiplied by theoretical peak bandwidth from the 16 64B AMD Infinity Fabric channels connecting the Cache to the Graphics Engine at boost frequency of up to 1.94 GHz. RX-535"

Thank you. I was certain I read that yesterday before going to sleep, but had no idea where it was and was looking for it to source my previous statement.


So the bus to the cache is 16 x 512 bit at 1.94 GHz, for just under 2 TB/s.
This L3 cache seems to work completely transparently, caching reused data with a 58% hit rate. That must be data reused within one frame. They could be pinning the frame buffer to the L3 cache, but I guess that would not even be needed, or would be counterproductive.

That would somehow be extremely surprising and not surprising at all at the same time. I think there have to be ways to get better utilization than just a simple LRU policy like that, but at the same time those would easily get very complex.

I think at 4K they would almost certainly want to pin the framebuffer, because otherwise it and the texture data would just flush everything every frame, leaving only the benefit of spatial locality without any benefit from temporal locality.
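A toy simulation of that argument: under plain LRU, a large once-per-frame streaming pass evicts a small working set that would otherwise be reused every frame (all sizes are arbitrary illustration values, not RDNA2 parameters):

```python
# Toy LRU simulation: a small per-frame working set (think BVH top levels)
# versus a large once-per-frame streaming pass (think framebuffer/textures).
# All sizes are arbitrary illustration values, not RDNA2 parameters.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = self.refs = 0

    def access(self, addr):
        self.refs += 1
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)        # refresh recency
        else:
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=1000)
working_set = range(500)          # reused every frame
stream = range(10_000, 14_000)    # touched once per frame, never reused

for _frame in range(5):
    for addr in working_set:
        cache.access(addr)
    for addr in stream:           # the streaming pass flushes the working set
        cache.access(addr)

print(f"hit rate: {cache.hits / cache.refs:.1%}")  # 0.0% -- all reuse destroyed
```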
 
The Zen 2 L3 saturates at 667 GB/s on my 3700X at around 4.3 GHz; based on that, it's around 256b per slice at 4 MB per slice. 16 * 8 MB seems reasonable, as that would match the slice size of Zen 3 and give you a 32B cache line.

Zen cache lines are 64B, like every cache line in any remotely recent x86 CPU. (They have to keep it that way, as too many programmers have depended on that size, so it is now essentially part of the living x86 spec.)

The L3 in Zen serves a single line over two cycles.
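If one combines that (64 B over two cycles, i.e. 32 B/cycle per slice) with 8 slices on a 3700X, the theoretical peak comes out well above the measured 667 GB/s, which is plausible for a saturation benchmark; a quick sketch, where every parameter is an assumption taken from this exchange:

```python
# Assumed figures from this exchange: 8 slices on a 3700X, each serving a
# 64 B line over two cycles (32 B/cycle), at ~4.3 GHz.
slices, bytes_per_cycle, clock_ghz = 8, 32, 4.3

peak = slices * bytes_per_cycle * clock_ghz
print(f"theoretical L3 peak: {peak:.0f} GB/s")  # ~1101 GB/s
# The measured ~667 GB/s sits well below this, as expected for a real
# saturation benchmark (queueing, tag conflicts, cross-CCX traffic).
```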
 
 
The Zen 2 L3 saturates at 667 GB/s on my 3700X at around 4.3 GHz; based on that, it's around 256b per slice at 4 MB per slice. 16 * 8 MB seems reasonable, as that would match the slice size of Zen 3 and give you a 32B cache line.
RDNA has 128B cache lines, btw.

But there is one major thing consuming memory that does not scale with the resolution: the geometry data and the BVH tree required for ray tracing, and these easily consume a couple of tens of megabytes.

We really want to fit most of the BVH tree into the cache because traversing the tree is very latency-critical. That's why the "optimal" cache size will not scale linearly with the pixel count; instead we want the cache capacity to look like (typical geometry + BVH size) + (C * target_pixel_count).
Infinity Cache sounds very likely to be a proper hardware cache that is transparent to the IPs, not a software-managed memory pool like HBCC.

So it shouldn't need to fit a whole render target/buffer to be effective. For example, pixel shader dispatch is already locality-aware since the introduction of the DSBR.

For the BVH though, the farther away from the root, the lower the reuse rate is by nature. So I feel like for a large BVH that doesn't fit in the cache, it is more likely that only the first several levels of the tree would stay cached, speeding up the first few iterations of traversal.
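As a rough sense of scale for that: in a binary BVH each level doubles the node count, so a cache budget only ever buys the top levels, and each doubling of budget buys one more level (64 B per node is an assumed figure for illustration):

```python
# Rough sense of scale: how many top levels of a binary BVH fit in a given
# budget. 64 B per node is an assumed size for illustration.
NODE_BYTES = 64

def levels_that_fit(budget_mb):
    """The top L levels of a binary tree hold 2**L - 1 nodes."""
    budget_nodes = budget_mb * 2**20 // NODE_BYTES
    L = 0
    while 2 ** (L + 1) - 1 <= budget_nodes:
        L += 1
    return L

for mb in (16, 32, 64):
    print(f"{mb} MB holds the top {levels_that_fit(mb)} levels")
# 16 MB -> 18, 32 MB -> 19, 64 MB -> 20: each doubling of budget buys just
# one more level, so only the upper, most-reused part of a big BVH stays hot.
```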
 
I think at 4K they would almost certainly want to pin the framebuffer, because otherwise it and the texture data would just flush everything every frame, leaving only the benefit of spatial locality without any benefit from temporal locality.

I'm not convinced; reading a 4K frame buffer at 100 FPS requires only about 3.3 GB/s, and it could be even less with compression.
GPUs render the frame buffer in a tiled fashion, frequently reusing tiles and preventing them from being evicted from the cache by other data.
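The 3.3 GB/s figure checks out, assuming an uncompressed 4 bytes per pixel:

```python
# Checking the frame buffer read figure: 4K, once per frame, 100 FPS,
# assuming an uncompressed 4 bytes per pixel.
width, height, bytes_per_pixel, fps = 3840, 2160, 4, 100
print(f"{width * height * bytes_per_pixel * fps / 1e9:.1f} GB/s")  # 3.3 GB/s
```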
 