AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Leoneazzurro5 · Oct 25, 2020

Because AFAIK 512bit and GDDR6 is quite complex - I heard many people saying that you need quite clean signalling (in terms of noise) already or a 384bit bus (with GDDR6), and with 512bit it starts to become prohibitive. I think even the GDDR6X PCB on A102 boards must have very strict requirements. Maybe for AMD the reasoning was that increased cost of die <= increased cost of PCB and RAM (especially high cost RAM like HBM) and this gives more possibility of easier mobile designs, too.

trinibwoy · Oct 25, 2020

Jawed said:
I've seen patent documents talk about the problem of dealing with the overheads of small cachelines (e.g. 64 bytes) while using monster caches. The solution is based upon regions (from/to addresses). Which would map nicely to things like render targets and UAVs.

Anyone taking bets on Infinity Cache being gigabytes of 3D stacked memory? It goes without saying that would be pretty amazeballs.

https://www.freepatentsonline.com/y2020/0183848.html - Cache for Storing Regions of Data

One example of the memory technology is three-dimensional integrated circuits (3D ICs) used to include two or more layers of active electronic components integrated both vertically and horizontally into a single circuit. The 3D packaging, known as System in Package (SiP) or Chip Stack multi-chip module (MCM), saves space by stacking separate chips in a single package. Components within these layers communicate using on-chip signaling, whether vertically or horizontally. This signaling provides reduced interconnect signal delay over known two-dimensional planar layout circuits.

The manufacturing trends in the above description lead to gigabytes of integrated memory within a single package. In some cases, the computing system uses the additional on-chip storage as a last-level cache before accessing off-chip memory. A reduced miss rate achieved by the additional memory helps hide the latency gap between a processor and its off-chip memory. However, cache access mechanisms for row-based memories are inefficient for this additional integrated memory. Large tag data arrays, such as a few hundred megabytes for a multi-gigabyte cache, are expensive to place on the microprocessor die and provide a high latency for lookups of the large tag arrays. The lookup and data retrieval consume too much time as the tags and data are read out in a sequential manner.

Increasing the size of a data cache line for the additional integrated memory, such as growing from a 64-byte line to a 4-kilobyte (KB) line, reduces both a number of cache lines in the integrated memory and the size of a corresponding tag. However, dirty bits and coherency information are still maintained on a granularity of the original cache line size (64-byte line). Therefore, the on-package DRAM provides a lot of extra data storage, but cache and DRAM access mechanisms are inefficient.

Scott_Arm · Oct 25, 2020

https://wccftech.com/asus-rog-strix...-big-navi-gpu-boosts-beyond-2-5-ghz-290w-tgp/

It's looking like the top 6800XT cards are going to have pretty high power consumption. If 290W TGP is processor plus VRAM, there's still probably another 20W for fans etc. Even with a 250/260W TGP, they're pushing close to 300W.

pTmdfx · Oct 25, 2020

trinibwoy said:
Anyone taking bets on Infinity Cache being gigabytes of 3D stacked memory? It goes without saying that would be pretty amazeballs.

https://www.freepatentsonline.com/y2020/0183848.html - Cache for Storing Regions of Data

The linage of DRAM cache researches and patents from AMD have been around for half a decade, including tricks to lower/hide DRAM access latency, coarse-grained caching and region-based coherence. Most literature I’ve read put focus on evaluating its potential as CPU LLC, and its latency & performance impact on cache coherence protocols.

and TBH if they are going to put stacked DRAM, I feel like doing it as multi VRAM pools with HBCC page migrations is more likely a starting point, especially if the main VRAM pool would not be drastically larger than the DRAM cache.

yuri · Oct 25, 2020

Scott_Arm said:
It's looking like the top 6800XT cards are going to have pretty high power consumption. If 290W TGP is processor plus VRAM, there's still probably another 20W for fans etc. Even with a 250/260W TGP, they're pushing close to 300W.

Is this really surprising at all? It's been 7 years since AMD went for 300+-25W highend (Fury X, Vega 64, Vega VII).

So all those reports like "top model 250W" were strage.

3dilettante · Oct 25, 2020

Digidi said:
Anybody understand this:

[PATCH] drm/amdgpu: correct the cu and rb info for sienna cichlid
https://lists.freedesktop.org/archives/amd-gfx/2020-October/055069.html

The code has existing functions for registering which blocks like the RBEs are active or have been disabled by fuses or BIOS settings.
It seems new to this is the disabling of shader arrays. Before this, there seemed to have been facilities for disabling at the shader engine or individual RBE level, so Sienna Cichlid would appear to be adding an intermediate level of salvage.

I was able to search for a mention of new function (gfx_v10_3_get_disabled_sa) that showed up:
https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg53790.html

[PATCH] drm/amdgpu: add function to program pbb mode for sienna cichlid
Add function for sienna_cichlid to force PBB workload mode to zero by
checking whether there have SE been harvested.

What PBB mode does exactly, I'm not sure of. It does seem that there is at least some distinction between a fully-enabled GPU and one with one or more deactivated shader arrays. Perhaps this means load-balancing is handled differently due to a shift in the ROP versus rasterizer capacity, or the algorithm for allocating screen space is altered if the RBEs and rasterizers remained linked at a shader array level.

Jawed said:
Seems to imply that Navi 21 is the only GPU where ROPs and shader arrays are disabled.

I wonder if ROPs are bound to shader arrays.

For RDNA, the RBEs were clients to the L1, which is per shader array for Navi 10.

Having the option to disable at a shader array level may be a change due to how redundant many of the resources are. There are many CUs in an array, and the code seems to have a separate pre-existing bitmask for handling disabled RBEs.
This may indicate 8 shader arrays is enough to warrant the trouble versus a similar amount of rasterizer and RBE hardware per array in Navi10, or that additional less-redundant hardware is at that level.

Another possibility is that the function gfx_v10_3_program_pbb_mode from my earlier link actually goes through quite a bit of setup just to check whether a shader engine's shader arrays are active. Perhaps it's meant for future scalability or consistency in the code, but building a bitmask based on system parameters when the traditional configuration is 1 or 2 per SE may mean a larger number could be possible for this family.

trinibwoy said:
it is a bit odd that all the leaks are for 3dmark. Maybe AMD did like Nvidia and whitelisted only certain apps for AIB drivers.

3dmark leaks seem to happen with regularity for AMD, including multiple console chips. It happens enough that I'd have to suspect it's on purpose or policies are such that AMD doesn't stop it from happening. I'd imagine the benchmark is a readily available non-trivial 3D application for early testing and validation, and one that the vendor has put more effort into optimizing for or dedicating functions to in the driver. This might make it more likely that there's programming and debugging resources available, and possibly special frameworks or driver paths explicitly for getting early testing functional on it.
That it happens to upload results to the world at this point would be well-understood, and might be part of a controlled leak for marketing or maybe giving certain interested parties an idea of how to plan their market segmentation.

trinibwoy said:
Anyone taking bets on Infinity Cache being gigabytes of 3D stacked memory? It goes without saying that would be pretty amazeballs.

https://www.freepatentsonline.com/y2020/0183848.html - Cache for Storing Regions of Data

3D on the SOC seems unlikely given the TDP numbers bandied about.
On-package would be possible, if AMD committed to an interposer solution. That might not scale down as well for the smaller SKUs that allegedly have something like this.
As a counterargument, I'm not sure as much die area would be needed for this compared to what rumors indicate the extra area is. The concerns for cache coherence and latency aren't traditionally GPU-related.

The rumors also didn't clearly cite this sort of MCM package or links, and while HBM2 has bandwidth the amount it provides versus GDDR6 is similar enough that I'd be interested if it would be worth the complexity balancing between the types.

Frenetic Pony · Oct 25, 2020

I'm not sure the tgp numbers matter for stacked ram. Memory access as source of heat is going to be negligible for these gpus regardless of how they're structured.

And it could be that the new cache structure is as much about being forward looking and planning for a chiplet arch as it about rdna2. Bandwidth requirements are obviously starting to outrun gddr, and so some solution is needed for it. What it technically looks like, other than hbm, still baffles me though. Sram is very die expensive, once you get to an l3 cache the latency advantage over just going out to main memory starts getting pretty small, and unlike CPUs memory access patterns for gpus are fairly coherent and predictable. So while having enough sram is important, needing a lot of data in cache just in case isn't as much of a thing.

Well, only a few days. Hopefully they'll offer at least an initial explanation of it's not hbm.

3dilettante · Oct 25, 2020

Frenetic Pony said:
I'm not sure the tgp numbers matter for stacked ram. Memory access as source of heat is going to be negligible for these gpus regardless of how they're structured.

If it's in the same stack as the SOC, it matters more since nominal refresh rates are typically maintained at 85C or below. Higher temperatures force higher refresh rates, which AMD's papers on stacked GPU+DRAM combinations listed as threshold to avoid. That proposal capped stack power at ~8-17W for logic, which is an order of magnitude below the chip TDPs mentioned in the rumors.

Jawed · Oct 26, 2020

Frenetic Pony said:
And it could be that the new cache structure is as much about being forward looking and planning for a chiplet arch as it about rdna2.

Yes. I've talked about L2s being distributed across chiplets and that L1s need to be able to talk to all L2s, therefore the interposer (or other 3D stacking tech) becomes vital for this network at several TB/s.

Bandwidth requirements are obviously starting to outrun gddr, and so some solution is needed for it. What it technically looks like, other than hbm, still baffles me though. Sram is very die expensive, once you get to an l3 cache the latency advantage over just going out to main memory starts getting pretty small, and unlike CPUs memory access patterns for gpus are fairly coherent and predictable.

GPUs care most about bandwidth (thence power), so any argument against a large L3 cache based on latency is moot.

Bandwidth on-die is always going to be far higher, and at far lower power, than GDDR or HBM off-die.

So while having enough sram is important, needing a lot of data in cache just in case isn't as much of a thing.

At 60fps with 512GB/s of memory, the GPU can only access 8.5GB of memory per frame. Ideally the GPU would read or write no byte more than once.

I still don't subscribe to the monster last level cache theory though. I suspect cache mechanics in GPUs have a lot of unexploited depth.

MuteyM · Oct 26, 2020

I can't post links yet, but there's been several very interesting public commits to the amd-staging-drm-next branch of the Linux kernel:

The first one is titled "add GC 10.3 NOALLOC registers" and adds a bunch of register bitfields with "LLC_NOALLOC" in their name.
The next one is "add support to configure MALL for sienna_cichlid" where MALL is "Memory Access at Last Level"
The last one is "display: add MALL support" and includes this gem:

+ // TODO: remove hard code size
+ if (surface_size < 128 * 1024 * 1024) {
Putting it all together I'm guessing that this 128MB "Infinity Cache" last-level-cache rumor has some truth to it, and at least one use will be to pin framebuffers to it (including displayable color buffers and depth/stencil) for some crazy high fillrate.

pjbliverpool · Oct 26, 2020

MuteyM said:
I can't post links yet, but there's been several very interesting public commits to the amd-staging-drm-next branch of the Linux kernel:

The first one is titled "add GC 10.3 NOALLOC registers" and adds a bunch of register bitfields with "LLC_NOALLOC" in their name.

The next one is "add support to configure MALL for sienna_cichlid" where MALL is "Memory Access at Last Level"

The last one is "display: add MALL support" and includes this gem:

+ // TODO: remove hard code size
+ if (surface_size < 128 * 1024 * 1024) {
Putting it all together I'm guessing that this 128MB "Infinity Cache" last-level-cache rumor has some truth to it, and at least one use will be to pin framebuffers to it (including displayable color buffers and depth/stencil) for some crazy high fillrate.

So similar to the XB360's esram but without requiring developer intervention?

MuteyM · Oct 26, 2020

pjbliverpool said:
So similar to the XB360's esram but without requiring developer intervention?

There was also this in one of the commit messages:

"We need to add UAPI so userspace can request MALL per buffer"

So it seems like it's under control of the GL and/or Vulkan drivers exactly which buffers go into the MALL region.

Ethatron · Oct 26, 2020

MuteyM said:
size
+ if (surface_size < 128 * 1024 * 1024) {

I love that surface_size is unitialized ...

Lurkmass · Oct 26, 2020

MuteyM said:
There was also this in one of the commit messages:

"We need to add UAPI so userspace can request MALL per buffer"
So it seems like it's under control of the GL and/or Vulkan drivers exactly which buffers go into the MALL region.

Can this be used to store the framebuffer state and accessed by the fragment shaders ?

Scott_Arm · Oct 26, 2020

MuteyM said:
I can't post links yet, but there's been several very interesting public commits to the amd-staging-drm-next branch of the Linux kernel:

The first one is titled "add GC 10.3 NOALLOC registers" and adds a bunch of register bitfields with "LLC_NOALLOC" in their name.

The next one is "add support to configure MALL for sienna_cichlid" where MALL is "Memory Access at Last Level"

The last one is "display: add MALL support" and includes this gem:

+ // TODO: remove hard code size
+ if (surface_size < 128 * 1024 * 1024) {
Putting it all together I'm guessing that this 128MB "Infinity Cache" last-level-cache rumor has some truth to it, and at least one use will be to pin framebuffers to it (including displayable color buffers and depth/stencil) for some crazy high fillrate.

That actually makes a good amount of sense. Instead of a general cache you have a cache for specific framebuffers that will have high read/write access.

andermans · Oct 26, 2020

Jawed said:
GPUs care most about bandwidth (thence power), so any argument against a large L3 cache based on latency is moot.

While GPUs will always crave bandwidth I believe GPUs these are slowly getting more latency sensitive. Remember that for a fixed latency, if we want to increase bandwidth we also need to increase the number of operations in flight, which has two limitations:

More operations in flight means we have to keep more shader invocations alive, which results in bigger register usage. Note that as shaders become more complicated over the years they're already taking more registers per invocation while the number of registers per flop hasn't really. When you are register limited (or LDS space etc.) then a latency reduction can result in a real speedup.
As we see with some of the wider GPUs these days and 1080p/1440p/4K, the workload needs to be parallel enough to actually do that which is hard due to actual usecases.

Probably still significantly less sensitive than a CPU, but I think the days of pretty much not caring at all are over. Consider that GCN->RDNA1 mostly improved on this metric (lower cache latencu for non-filtered fetches, and allowing to use more compute with a low number of shader invocations).

Scott_Arm said:
That actually makes a good amount of sense. Instead of a general cache you have a cache for specific framebuffers that will have high read/write access.

Given that the flags are for NOALLOC instead of ALLOC, I suspect everything will be cached by default, and I asusme MALL is just the option of also reading the framebuffer to the display through the last level cache as well (instead of first requiring a flush to memory. I think that gets expensive with 128 MiB cache ...)

Funnily enough I think the size check is an indication this might be 128 MiB for real.

3dilettante · Oct 26, 2020

I'm listing entries that appear to be related:
https://lists.freedesktop.org/archives/amd-gfx/2020-October/055005.html
https://lists.freedesktop.org/archives/amd-gfx/2020-October/055006.html
https://lists.freedesktop.org/archives/amd-gfx/2020-October/055024.html
https://lists.freedesktop.org/archives/amd-gfx/2020-October/055007.html

MuteyM said:
I can't post links yet, but there's been several very interesting public commits to the amd-staging-drm-next branch of the Linux kernel:

The first one is titled "add GC 10.3 NOALLOC registers" and adds a bunch of register bitfields with "LLC_NOALLOC" in their name.

The next one is "add support to configure MALL for sienna_cichlid" where MALL is "Memory Access at Last Level"

The last one is "display: add MALL support" and includes this gem:

+ // TODO: remove hard code size
+ if (surface_size < 128 * 1024 * 1024) {
Putting it all together I'm guessing that this 128MB "Infinity Cache" last-level-cache rumor has some truth to it, and at least one use will be to pin framebuffers to it (including displayable color buffers and depth/stencil) for some crazy high fillrate.

That is a possible size for storage, although in the context of surfaces for display or buffers 128*1024*1024 shows up in multiple places as a maximum buffer size even for architectures without large amounts of dedicated storage.
Perhaps there's limitations to what the handling hardware can address, or the TODO indicates that some parameter will be introduced to give a per-implementation limit. That could leave the possibility that Sienna Cichlid's implementation happens to have a resource that matches this upper limit.

pjbliverpool said:
So similar to the XB360's esram but without requiring developer intervention?

It's apparently visible to some kind of software, although it's the driver that is mentioned so far and that can have varying levels of visibility to client software. There's an addition to the page table format for Sienna Cichlid that indicates that allocation for this mode can be handled at a page granularity. Using page table entries for that purpose has been done before, such as with the ESRAM for the Xbox One. That could go either way as far as developer visibility. Some elements point to the driver making some decisions based on the type of target or functionality.

trinibwoy · Oct 26, 2020

This is likely a key benefit of the RDNA vs CDNA split. The LLC scratchpad is more amenable to pinned render targets than it would be to general HPC.

Erinyes · Oct 26, 2020

Jubei said:
So in RTGs latest video, he claims he was shown slides that compare gaming performance between 6800XT and 3080. 6800XT wins 5, loses 3 and ties in 2 out of 10 at 4K resolution. And wins 8/10 in 1440p

I wonder if this is a controlled leak by AMD

Yeah does seem like it. The other leaks have been coming from AIBs supposedly, and more on N22. If AMD is being secretive, unlikely they'd have given AIBs those slides.

SimBy said:
It's interesting that the biggest change in both Zen3 and RDNA2 seems to be massively reworked cache.

With Zen3 I'd argue that it's more like rearranged cache rather than reworked. It's got much bigger changes in the other parts of the chip.

yuri said:
Is this really surprising at all? It's been 7 years since AMD went for 300+-25W highend (Fury X, Vega 64, Vega VII).

So all those reports like "top model 250W" were strage.

Yea nothing surprising really, it was to be expected. And even if AMD themselves don't release 300W+ cards, seems very likely that AIBs will.

3dilettante · Oct 26, 2020

trinibwoy said:
This is likely a key benefit of the RDNA vs CDNA split. The LLC scratchpad is more amenable to pinned render targets than it would be to general HPC.

There's already blocking for matrix multiplication to fit within the register files and on-die caches. A large cache might be another level for bandwidth optimization or locality. AMD's posited more complex tiers for HPC already to keep hot data sets on-board or on-die, although maybe the counter is that it's getting complex enough?

On the RDNA vs CDNA split, there's also mention of a Navi 10 blockchain SKU.
Example: https://lists.freedesktop.org/archives/amd-gfx/2020-October/055070.html

AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Leoneazzurro5

trinibwoy

Meh

Scott_Arm

pTmdfx

yuri

3dilettante

Frenetic Pony

3dilettante

Jawed

MuteyM

pjbliverpool

B3D Scallywag

MuteyM

Ethatron

Lurkmass

Scott_Arm

andermans

3dilettante

trinibwoy

Meh

Erinyes

3dilettante