AMD: RDNA 3 Speculation, Rumours and Discussion

Jawed · Oct 26, 2022

Throughout the history of GPUs, texture processing has never been dependent upon massive caches, simply because latency hiding comes to the rescue. Additionally, very short-term re-use, e.g. during trilinear filtering, depends upon a tiny amount of cache capacity.

It's occurred to me that a simple refinement in RDNA 3 would be to exclude texture data from Infinity Cache - as it's data that essentially pollutes and doesn't directly improve performance while occupying lines there. All of the performance can be achieved with other levels of cache, closer to the TMUs.

So, RDNA 3's Infinity Cache would no longer be a "dumb" cache for all memory transactions against VRAM.

Of course, we don't really know whether RDNA 2's Infinity Cache is that dumb. Maybe there are synthetic tests out there that could identify the behaviour, but I'm not aware of such right now.

Frenetic Pony · Oct 26, 2022

Jawed said:
Throughout the history of GPUs, texture processing has never been dependent upon massive caches, simply because latency hiding comes to the rescue. Additionally, very short-term re-use, e.g. during trilinear filtering, depends upon a tiny amount of cache capacity.

It's occurred to me that a simple refinement in RDNA 3 would be to exclude texture data from Infinity Cache - as it's data that essentially pollutes and doesn't directly improve performance while occupying lines there. All of the performance can be achieved with other levels of cache, closer to the TMUs.

So, RDNA 3's Infinity Cache would no longer be a "dumb" cache for all memory transactions against VRAM.

Of course, we don't really know whether RDNA 2's Infinity Cache is that dumb. Maybe there are synthetic tests out there that could identify the behaviour, but I'm not aware of such right now.

In general improving infinity cache behavior towards selectivity would improve it.

I suppose you could do that at shader build time? All triple a games are slowly switching to Vulkan/DX12 for various reasons, go through and have the drivers use a heuristic to judge how many times X is accessed throughout a frame, prioritizing LLC caching of anything accessed many times over. Hypothetically I guess you could even do that for RDNA2 support through a driver update.

Florin · Oct 27, 2022

digitalwanderer said:
I think it'll overshadow the whole 4000 series and will loom large over nVidia for this generation, but it'll be interesting to see either way.

I recall similarly pessimistic views expressed at the time of the 3080 instability/crashes to desktop reports shortly after release - anyone remember those?

vjPiedPiper · Oct 27, 2022

Jawed said:
Throughout the history of GPUs, texture processing has never been dependent upon massive caches, simply because latency hiding comes to the rescue. Additionally, very short-term re-use, e.g. during trilinear filtering, depends upon a tiny amount of cache capacity.

It's occurred to me that a simple refinement in RDNA 3 would be to exclude texture data from Infinity Cache - as it's data that essentially pollutes and doesn't directly improve performance while occupying lines there. All of the performance can be achieved with other levels of cache, closer to the TMUs.

So, RDNA 3's Infinity Cache would no longer be a "dumb" cache for all memory transactions against VRAM.

Of course, we don't really know whether RDNA 2's Infinity Cache is that dumb. Maybe there are synthetic tests out there that could identify the behaviour, but I'm not aware of such right now.

This is a really great idea.
So what would be kept in the Infinity cache then?
- Shader instructions
- Vertex / mesh's

I'm not so familiar with GPU data structures and usages, more so with CPU stuff,
but it makes a lot of sense to avoid polluting the cache with the stuff you know is not latency sensitive.
I wonder if there is any info about what type of Cache the infinity cache actually is? Victim cache? inclusive, exclusive?
( a quick google shows no info? )

I guess you would want to cache all vertex and Mesh data, all BVH style data, all instruction data, maybe some thread control and state.
Everything BUT textures?
maybe reserve a certain % for non texture data, and then use the rest as a normal eviction cache for you texture Data,
it should still help on the most used textures, and keeping all the non-texture data on die ( or close to the die ) woudl be a big win.

Granath · Oct 27, 2022

Jawed said:
Throughout the history of GPUs, texture processing has never been dependent upon massive caches, simply because latency hiding comes to the rescue. Additionally, very short-term re-use, e.g. during trilinear filtering, depends upon a tiny amount of cache capacity.

It's occurred to me that a simple refinement in RDNA 3 would be to exclude texture data from Infinity Cache - as it's data that essentially pollutes and doesn't directly improve performance while occupying lines there. All of the performance can be achieved with other levels of cache, closer to the TMUs.

So, RDNA 3's Infinity Cache would no longer be a "dumb" cache for all memory transactions against VRAM.

Of course, we don't really know whether RDNA 2's Infinity Cache is that dumb. Maybe there are synthetic tests out there that could identify the behaviour, but I'm not aware of such right now.

I remember that AMD explicitly said that the cache would be smarter, they won't cache everything this time.

Rootax · Oct 27, 2022

I believed simple things like AF used bandwith too, hence why some console ports didn't have it enabled ?

Jawed · Oct 27, 2022

vjPiedPiper said:
This is a really great idea.
So what would be kept in the Infinity cache then?

If a game is 100GB installed on disk, 80% of that is probably texture data. Even if a game is only using 1GB of texture data for a scene, at 60fps that's 60GB of data - or put another way, within one frame that texture data would fill 6900XT's Infinity Cache 8 times. Now you can argue that mipmapping and occlusion both reduce the quantity of texture data that is read, but at the same time it's clear that only with transistor budgets of 10s of billions have GPUs been able to afford on-die caches at the multi-MB scale. Of course it also coincides with VRAM bandwidth hitting a plateau that looks kinda terminal...

There are some uses of texture data that arguably shouldn't be considered polluting - these typically involve post-processing, where a render target is read back by the GPU for "whole screen" effects. Algorithms like TAA use even more data, too (multiple frames, and motions vectors).

Then you get into more interesting algorithms where memory is bound for read and write, but ordering constraints and collisions make those kinds of use either depend upon global atomics or work with tiling - the former would suit large-scale caching I suppose while the latter wouldn't.

Render targets are generally considered the main bandwidth bottleneck in GPUs, which is why some kind of tiled rendering automation (in hardware), if not implemented by the developer, is considered beneficial. With tiled rendering a lot of possible cases of unwanted overdraw are eliminated. Visibility buffer techniques, which are very much a developer implemented algorithm, demonstrate substantial speed-ups and some of that genuinely does come from the reduction in unwanted overdraw which wastes bandwidth.

Geometric data is also pretty substantial in games and should also be considered polluting, I suspect - locality due to on-chip caches and/or pipelining buffers solve the large scale performance issues such that a large cache should have no real effect. Developer implemented techniques such as mesh shading require pretty advanced algorithms to beat GPUs at their intrinsic capability, showing just how effective fixed function hardware backed by fairly small on-die memories really is.

So, render targets are the traditional big consumer of bandwidth. Advanced techniques, such as Nanite, radically change the relationship between scene complexity and render target bandwidth, arguably making render target caching less and less interesting as time goes by.

In the ray tracing era, the construction and use of bounding volume hierarchies is a critical bottleneck, seemingly dominated by bandwidth and latency. Temporal caching techniques implemented by developers (such as irradiance grids) also add a layer of bandwidth usage that can't be considered polluting.

pTmdfx · Oct 27, 2022

Jawed said:
So, RDNA 3's Infinity Cache would no longer be a "dumb" cache for all memory transactions against VRAM.

RDNA 2 (gfx10) supports No Allocate policy at page level (through a PTE bit), i.e., mostly fixed at memory/buffer allocation time. Also unclear if this is exposed for any form of developer control.

RDNA 3 (gfx11) additionally supports ISA-level control of Infinity Cache policy on individual reads and writes. So it is more flexible, and definitely accessible by developers.

DieH@rd · Oct 27, 2022

Jawed said:
If a game is 100GB installed on disk, 80% of that is probably texture data.

The moment devs switch from one BD disc to two, they will splurge on their wildest dreams. FF7Remake [2 BD game] has ~20 min video file for end credits.

chris1515 · Oct 27, 2022

Jawed said:
If a game is 100GB installed on disk, 80% of that is probably texture data. Even if a game is only using 1GB of texture data for a scene, at 60fps that's 60GB of data - or put another way, within one frame that texture data would fill 6900XT's Infinity Cache 8 times. Now you can argue that mipmapping and occlusion both reduce the quantity of texture data that is read, but at the same time it's clear that only with transistor budgets of 10s of billions have GPUs been able to afford on-die caches at the multi-MB scale. Of course it also coincides with VRAM bandwidth hitting a plateau that looks kinda terminal...

There are some uses of texture data that arguably shouldn't be considered polluting - these typically involve post-processing, where a render target is read back by the GPU for "whole screen" effects. Algorithms like TAA use even more data, too (multiple frames, and motions vectors).

Then you get into more interesting algorithms where memory is bound for read and write, but ordering constraints and collisions make those kinds of use either depend upon global atomics or work with tiling - the former would suit large-scale caching I suppose while the latter wouldn't.

Render targets are generally considered the main bandwidth bottleneck in GPUs, which is why some kind of tiled rendering automation (in hardware), if not implemented by the developer, is considered beneficial. With tiled rendering a lot of possible cases of unwanted overdraw are eliminated. Visibility buffer techniques, which are very much a developer implemented algorithm, demonstrate substantial speed-ups and some of that genuinely does come from the reduction in unwanted overdraw which wastes bandwidth.

Geometric data is also pretty substantial in games and should also be considered polluting, I suspect - locality due to on-chip caches and/or pipelining buffers solve the large scale performance issues such that a large cache should have no real effect. Developer implemented techniques such as mesh shading require pretty advanced algorithms to beat GPUs at their intrinsic capability, showing just how effective fixed function hardware backed by fairly small on-die memories really is.

So, render targets are the traditional big consumer of bandwidth. Advanced techniques, such as Nanite, radically change the relationship between scene complexity and render target bandwidth, arguably making render target caching less and less interesting as time goes by.

In the ray tracing era, the construction and use of bounding volume hierarchies is a critical bottleneck, seemingly dominated by bandwidth and latency. Temporal caching techniques implemented by developers (such as irradiance grids) also add a layer of bandwidth usage that can't be considered polluting.

This is the disk usage of Marvel Spiderman. Texture. It is 17 GB of textures. I count diffuse lightning and indoor lighting(probably lightmap). People are underestimating animation size for example.

Malo · Oct 27, 2022

I didn't realize animations can take up so much space!

Xmas · Oct 27, 2022

Jawed said:
If a game is 100GB installed on disk, 80% of that is probably texture data. Even if a game is only using 1GB of texture data for a scene, at 60fps that's 60GB of data - or put another way, within one frame that texture data would fill 6900XT's Infinity Cache 8 times. Now you can argue that mipmapping and occlusion both reduce the quantity of texture data that is read, but at the same time it's clear that only with transistor budgets of 10s of billions have GPUs been able to afford on-die caches at the multi-MB scale. Of course it also coincides with VRAM bandwidth hitting a plateau that looks kinda terminal...

At 4k resolution 1 GiB is ~129B per pixel. Considering most textures will be compressed that strikes me as a lot of texture data. Even with relatively high average anisotropy and overdraw factored in, that's a fair few layers to store various material properties. But only a fraction of texture data needs to be read more than once within the same frame. I would hazard a guess that in most games, inter-frame reuse is much higher than intra-frame reuse, so high-level-caching of textures is indeed not that useful until you can cache most of them so you benefit from the inter-frame reuse.

Jawed · Oct 27, 2022

pTmdfx said:
RDNA 2 (gfx10) supports No Allocate policy at page level (through a PTE bit), i.e., mostly fixed at memory/buffer allocation time. Also unclear if this is exposed for any form of developer control.

RDNA 3 (gfx11) additionally supports ISA-level control of Infinity Cache policy on individual reads and writes. So it is more flexible, and definitely accessible by developers.

Are there any synthetics that demonstrate that texturing (with/without anisotropic filtering) in RDNA 2 is bypassing Infinity Cache?...

pTmdfx · Oct 28, 2022

Jawed said:
Are there any synthetics that demonstrate that texturing (with/without anisotropic filtering) in RDNA 2 is bypassing Infinity Cache?...

I missed the part where RDNA 2 texture descriptors also have a LLC NoAlloc bit, that can override the page NoAlloc bit. So that is one avenue to look for.

Seanspeed · Oct 28, 2022

Gipsel said:
For cost efficiency one wants to reuse the MCD die across all models. An increase of the MCD integrated cache that appears to be justifiable if one looks solely at N31, may not be looking very appealing for smaller models. But on the other hand, the price AMD can achieve for the fastest version of N31 might justify the additional cost of stacking some additional cache dies. At the high end, the price/performance ratio is far from linear. Even relatively small performance increases at the top justify relatively larger cost increases for the manufacturer. That is not the case at the low end. Furthermore, stacking cache increases the cache size by a multiple, not just a small increment, so it ends up with a way bigger return (in terms of performance) than just a moderately larger cache on all MCDs.
In other words: Increasing the cache integrated in the MCD increases the cost across the complete range of products, which may be as cost inefficient as increasing the L3 in all AMD CPUs (instead of just using a cache stack on the 5800X3D). Using stacking only in an enthusiast model AMD incurs the associated costs only for this model. So the situation with RDNA3 may in fact not that different.

Only Navi 31 and 32 are using these MCD's as far as we know. And I'd expect both to consist of all the higher end($500+) models.

pTmdfx said:
To make the grand assertion of yours at this point of time, you are basically saying ceteris paribus i.e. nothing else outside the Infinity Cache has changed. Like no cache policy changes, no access pattern optimisation, no change to L0 or GL1, no improvements to RBE/DCC, etc. There is simply no information.

I dont really need to know the architectural specifics for what I'm saying. And to be clear, I'm not making any single conclusion, either. I'm saying that AMD either a) gimped their top part out of cheapness, or b) the bandwidth increases from Vcache wont be that significant in terms of performance improvements. It seems to me one of these has to be true.

DavidGraham · Oct 28, 2022

fehu said:
So 2X the RT performance and still about 2X less than the competition?

https://twitter.com/x/status/1585984908412928001

Subtlesnake · Oct 28, 2022

https://twitter.com/x/status/1582630645674934272

https://twitter.com/x/status/1582631013242785792

Cyan · Oct 28, 2022

Subtlesnake said:
https://twitter.com/x/status/1582630645674934272

https://twitter.com/x/status/1582631013242785792

that's the most important thing, after playing several games with RT on, imho traditional rasterization with fake lighting everywhere is a dead end, it will disappear, it's condemned

techuse · Oct 28, 2022

Non-RT performance could be ballpark 4090. RT performance ballpark 3090/3090ti.

Jawed · Oct 28, 2022

techuse said:
Non-RT performance could be ballpark 4090. RT performance ballpark 3090/3090ti.

The hardware would have to be broken for it to reach only 3090Ti for ray tracing...

AMD: RDNA 3 Speculation, Rumours and Discussion

Jawed

Frenetic Pony

Florin

Merrily dodgy

vjPiedPiper

Granath

Rootax

Jawed

pTmdfx

DieH@rd

chris1515

Malo

Yak Mechanicum

Xmas

Porous

Jawed

pTmdfx

Seanspeed

DavidGraham

Subtlesnake

Cyan

orange

techuse

Jawed

Similar threads