AMD: Navi Speculation, Rumours and Discussion [2019-2020]

L1 is shared by all CUs in a shader array, 10 in RX 5700 XT for example.
Quoting from the RDNA whitepaper:
The graphics L1 cache is shared across a group of dual compute units
Are you certain?
I disagree. L1s need to be able to talk to all L2 slices. That isn't changed by a "chiplet" design where LLC is spread amongst chiplets. 2-4TB/s bandwidth amongst chiplets over an interposer seems pretty easy.
A store-and-forward pattern, with a proxy acting as a client to both L2 shards, is also applicable when partitioning the LLC; a monolithic crossbar above all L2 slices is not strictly required. That is even what that specific patent proposes.

I have to disagree on that bandwidth figure. With only a simple PHY, there is still a practical limit of around 2 Gbit/s per pin. Targeting only the current performance level of 2TB/s (full duplex!) would require a >2000-pin interconnect, more than twice as many pins as a GPU die with 4 stacks of HBM2 would need. Blow up the PHY and you can save on the number of pins, but not for free. Either way, a combined 4TB/s of bandwidth over an interconnect with strict latency requirements is, while not impossible, still far from a realistic option when minimizing die size is a goal. The only hurdle I'm not worried about is power consumption; if my math is correct, that should go as low as ~20W for such an interconnect.
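For anyone who wants to sanity-check that, here is a back-of-envelope sketch in Python; the 2 Gbit/s per pin and the ~0.6 pJ/bit figure are my own assumptions, not vendor numbers:

# Back-of-envelope sketch of the pin-count and power argument above.
# The per-pin rate and the pJ/bit figure are assumptions for illustration.

def pins_needed(bandwidth_GBps, per_pin_Gbps):
    """Data pins needed to carry one direction of traffic."""
    return bandwidth_GBps * 8 / per_pin_Gbps

def link_power_W(total_bandwidth_GBps, pJ_per_bit):
    """Rough link power: bits moved per second times energy per bit."""
    return total_bandwidth_GBps * 8e9 * pJ_per_bit * 1e-12

# 2 TB/s in one direction over a simple, HBM-like PHY at 2 Gbit/s per pin:
print(pins_needed(2000, 2))       # 8000.0 -> a huge pin count for one direction alone

# ~20 W for 4 TB/s combined, assuming ~0.6 pJ/bit for a short interposer link:
print(link_power_W(4000, 0.6))    # ~19.2 W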
 
Are you certain?
There are four L1s in RX 5700 XT. One per shader array.

Either way, a combined 4TB/s bandwidth over an interconnect with strict requirements on latency
GPUs don't really care about latency, especially when the primary victim of latency in a chiplet design would be TEX (and MEM). Hiding the latency of TEX is easy. The side-effect, ultimately, is on the count of work items in flight (because they are waiting for data), affecting register file use and cache coherency.

As for interconnect bandwidth, remember the ROPs will be using none! That's what tiled rasterisation (as already seen in RDNA) gives you. It should be years before 4TB/s is required.
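To put "hiding the latency of TEX is easy" in concrete terms, here is a crude Little's-law sketch; the cycle counts are illustrative assumptions, not measured figures:

# Little's law: requests in flight = issue rate x latency.
# More latency to hide -> more waves (and registers) kept resident.
# All cycle counts below are illustrative assumptions.

def waves_needed(latency_cycles, cycles_between_tex_requests_per_wave):
    """Waves that must be in flight so the TEX pipe never starves."""
    return latency_cycles / cycles_between_tex_requests_per_wave

print(waves_needed(300, 50))   # 6.0 waves per SIMD for a ~300-cycle miss path
print(waves_needed(400, 50))   # 8.0 waves if a chiplet hop adds ~100 cycles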
 
GL1 in RDNA 1 is a quad-banked 128KB cache. It absorbs all traffic (ingress from L2/egress to L2) from the 10 inner L0 V$, 5 I$, 5 K$ and other non-CU clients (e.g., RBEs and Prim Unit). The bullet points for GL1 also check some of the boxes that the mechanism in this patent was set up to solve, e.g., reducing L2 traffic & increasing effective cache capacity.

The patent doesn't seem to fit the RDNA paradigm, unless another level of CU caches is introduced, or GL1 gets ditched in a way that compensates the non-CU clients for the loss of cache capacity. :-?
 
There are enough exceptions in L2 configurations through GCN's history that, in conjunction with RDNA 2 allegedly being clocked higher, these combinations are logical guesses IMO:

1. 40 CUs with 12 L2 slices and 256-bit GDDR6 bus
2. 80 CUs with 16 L2 slices and 384-bit GDDR6 bus

:mrgreen:

L2-L1 bandwidth loss due to fewer slices can be offset by higher clocks, and also by doubling the L2-L1 fabric bus width if desired (given that RDNA 1 introduced 128B cache lines).
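A quick sketch of that offset argument; the 64B/clk per slice and the clocks are assumptions for illustration, not confirmed figures:

# L2 -> GL1 bandwidth scales with slice count, per-slice bus width and clock.

def l2_to_l1_bandwidth_GBps(slices, bytes_per_clk_per_slice, clock_GHz):
    return slices * bytes_per_clk_per_slice * clock_GHz

print(l2_to_l1_bandwidth_GBps(16, 64, 1.9))    # ~1946 GB/s, Navi 10-like baseline
print(l2_to_l1_bandwidth_GBps(12, 128, 2.2))   # ~3379 GB/s, fewer slices but wider + faster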
 
Does anyone have any idea what num_packer_per_sc is? It appears to be doubled from RDNA1.

Packers are the subsystem behind ordered append/consume buffers, and rasterizer ordered view.

https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf

DS_ORDERED_COUNT

GDS-only. Add (count_bits(exec_mask)) to one of 4 dedicated ordered-count counters (aka 'packers'). Additional bits of instr.offset field are overloaded to hold packer-id, 'last'.

GDS Only: Intercepted by GDS and processed by ordered append module. The ordered append module will queue requests until this request wave is the oldest in the queue, at which time the oldest wave request will be dispatched to the DS with an atomic add for execution and broadcast back to ALL lanes of a wave. This is an ordered count operation and can only be called once per issue with the release flag set. If the release flag is not sent, the wave will have full control over the order count module until it sends a request with the release flag. Unlike append/consume this op needs to be sent even if there are no valid lanes when it is issued. The GDS will add zero and advance the tracking walker that needs to match up with the dispatch counter.
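Purely as a software illustration of what that ordered-append module amounts to (a conceptual sketch, not a description of the actual GDS hardware): waves can arrive in any order, but their adds are applied, and the old counter value returned, strictly in wave-creation order.

# Conceptual model of an ordered-count packer. Software illustration only.
import heapq

class OrderedCounter:
    def __init__(self):
        self.counter = 0
        self.next_wave = 0          # creation order we must honour
        self.pending = []           # (wave_id, count) waiting their turn

    def request(self, wave_id, count):
        """Queue a wave's add; apply it (and any now-unblocked waves) in order."""
        heapq.heappush(self.pending, (wave_id, count))
        results = {}
        while self.pending and self.pending[0][0] == self.next_wave:
            wid, c = heapq.heappop(self.pending)
            results[wid] = self.counter   # old value broadcast back to the wave
            self.counter += c             # atomic add (zero if no lanes were active)
            self.next_wave += 1
        return results

oc = OrderedCounter()
print(oc.request(2, 5))   # {}            -> wave 2 must wait for 0 and 1
print(oc.request(0, 3))   # {0: 0}        -> wave 0 is oldest, goes first
print(oc.request(1, 4))   # {1: 3, 2: 7}  -> unblocks 1, then 2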
 
I would prefer something in between, like 64 CUs. Unless Big Navi is cheap, then no problem, I can take that.

If history is any indication, there will be one if not two cut-down SKUs from the full 80CU Navi21, so you will likely get your wish. 72CU and 64CU cut down versions would align exactly with what AMD did with Navi10 and RX5600 + RX5700.
 
If history is any indication, there will be one if not two cut-down SKUs from the full 80CU Navi21, so you will likely get your wish. 72CU and 64CU cut down versions would align exactly with what AMD did with Navi10 and RX5600 + RX5700.
That's right, but the 5600 was an outlier. I don't think they ever cut down any GCN chip that much. Probably something related to yields on N7.
 
With the 3 Navi2 GPUs, AMD seems to have a lot of space between their tiers.

Navi 21 with 80 CUs then Navi 22 with 40 CUs then Navi 23 with 20? CUs.

This is a lot of performance difference between the GPU tiers.
 
With the 3 Navi2 GPUs, AMD seems to have a lot of space between their tiers.

Navi 21 with 80 CUs then Navi 22 with 40 CUs then Navi 23 with 20? CUs.

This is a lot of performance difference between the GPU tiers.

GPUs <> Card SKUs.

They could have:

N21: 80 CU = 6900 (XT) and <80 CU = 6900 (plain) or 6800.
N22: 40 CU = 6800 (XT) and <40 CU = 6800 (plain) or 6700.

etc.
 
By having both HBM and GDDR phy/io, could AMD potentially increase their wafer yields?
In the sense that having more of any kind of unit means there's more redundancy, I suppose it might be possible depending on the chip counts for both bus types.
If there were multiple HBM stacks and a significant number of GDDR6 channels, maybe.
If the idea is that there are significantly fewer channels overall, perhaps it's worse. The failure granularity for HBM is seemingly at the stack level, while GDDR6 might be at the 64-bit or 32-bit granularity.
If it's 2xHBM and 4x64-bit GDDR6 (32-bit per chip, but AMD seems to like pairing channels) versus a potentially impractical 512-bit GDDR6 bus, it's worse.
It's a wash with a 384-bit bus, unless AMD went with more granular GDDR6 controllers.

Even at the roughly equivalent level of redundancy, I would have some concern that the HBM side makes yields incrementally worse overall due to the higher failure rate of interposer integration versus PCB mounting--coupled with the fact that there's some multiplicative yield failure due to needing both interposer integration and then some small yield loss due to the GDDR6 bus.

Perhaps it would help from a product-mix standpoint: if there were GDDR6-only products and HBM-only products, the same production could more readily satisfy multiple categories, versus AMD's historically poor ability to juggle multiple product lines/types/channels. However, given the very specific needs of one type over the other, and maybe questions like die-thinning or whatever preparatory work is needed for an interposer, it may not be that flexible. Even for yield-recovery purposes, there could also be a question of when failures can be detected. Some issues may not be fully detectable until after some point of no return, like after interposer mounting.

If the chip were to always have an interposer, even if HBM weren't in use (discounting what is needed to route GDDR6 through an interposer), it would be a cost adder that would probably be more significant than whatever yield gain there was from redundancy in the controller.
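To sketch the multiplicative-yield point, with made-up numbers (every step yield below is an assumption purely to show the shape of the math):

# A part is only good if every step in the flow is good:
# die yield x interposer/HBM mounting yield x GDDR6 PHY yield.

def combined_yield(*step_yields):
    total = 1.0
    for y in step_yields:
        total *= y
    return total

print(combined_yield(0.80, 0.95, 0.99))   # ~0.75 -> die + interposer mount + GDDR6 bus
print(combined_yield(0.80, 0.99))         # ~0.79 -> hypothetical GDDR6-only flow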


Even though I'm having trouble seeing where this is applicable? It appears to be focused primarily on reducing the load on the (shared) L2 cache?
Is the L2 cache on GPU actually still that much slower than L1, so that even 3 trips over the crossbar can outweigh an L2 hit?
The L2 is globally shared and tends to be significantly longer latency. As the LLC, it is also in high demand with contention making things worse.
The exact threshold is something that isn't necessarily clear.
For Nvidia, the L2 has been benchmarked as being nearly 7x longer latency for Volta and Kepler, whereas the Maxwell/Pascal generations were in the nearly 3x range. This is more related to the much lower L1 latency for Volta and Kepler rather than a massive shift in L2 latency.
https://arxiv.org/pdf/1804.06826.pdf
GCN's L2 is only 1.7x the latency of the L1, mainly due to an estimated L1 latency of ~114 cycles per a GDC 2018 presentation. The L2 is apparently a little faster in cycle terms than Nvidia's, although this may be closer in wall-clock terms due to Nvidia's typically higher clocks in the past.

Ampere's a question mark, though if it's similar to Volta the L1 latency should be on the lower end of the GPU scale.
I haven't seen numbers for RDNA1/2 officially, although if we believe the github leak for the PS5 it might be roughly in the same area as GCN, to a little lower latency in the L1.
The overall picture wouldn't be that different at maybe 90-100 cycles.
The improvement or degradation in latency may depend on how quickly L1 remote accesses can be resolved. If every leg in the transaction were on the order of 90 cycles, I'd say it wouldn't be a good latency bargain. The significant constraints when it comes to L2 bandwidth and contention might still provide an upside, and there seem to be at least some reasonable ways to skip parts of the L1 pipeline in a remote-hit scenario, given how straightforward it should be to detect that an access won't be locally cached.
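A crude break-even sketch for the remote-L1-hit-versus-L2-hit question; the hop and lookup costs are assumptions layered on the rough figures above:

# Is forwarding a miss to a peer L1 worth it versus just going to the L2?
# The hop/lookup cycle counts are assumptions for illustration only.

def remote_l1_hit_cycles(detect_local_miss, xbar_hop, remote_l1_lookup):
    # detect the local miss, hop to the owning L1, look up, hop back with data
    return detect_local_miss + xbar_hop + remote_l1_lookup + xbar_hop

L2_HIT = 190   # ~1.7x a ~114-cycle L1, per the GCN figures above

print(remote_l1_hit_cycles(90, 90, 90))   # 360 -> every leg ~90 cycles: no bargain
print(remote_l1_hit_cycles(10, 20, 90))   # 140 -> shortcut the local pipeline: maybe a win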

I ran across a link from Anandtech's forum that points to a paper by the individuals involved in the patent concerning this sort of scheme:
https://adwaitjog.github.io/docs/pdf/sharedl1-pact20.pdf

I haven't yet had the time to really read the paper. One thing I did notice is that the model GPU architecture isn't quite a fit for GCN or RDNA, with 48KB local scratchpad, 16KB L1, and a separate texture cache. There's perhaps some irony that an AMD-linked paper/patent has analysis profiled on an exemplar that reminds me more of Kepler, although I think I may have seen something like that happen before.
Whether the math necessarily holds at 90-114 cycles latency versus 28 is an unanswered question.

edit: Actually, I just ran across a mention of the coherence mechanism assumed by the analysis, and it's Nvidia's L1 flush at kernel or synchronization boundaries. That could significantly deviate from the expected behavior of a GCN/RDNA cache.

By the looks of it, yes. (3.90TB/s L1 bandwidth, 1.95TB/s L2 bandwidth on the Radeon RX 5700XT.)
Depends on the context of L1 vs L0. The terminology is being handled loosely, and some portions of the description that indicate L1 capacity changes with CU count may mean L0/L1 depending on the GPU.

For what it's worth, the author also often spoke only about a small number of CUs in each cluster, and about locality on the chip. So possibly this isn't even aiming at a global crossbar, but actually only at 4-8 CUs maximum in a single cluster?
I suppose 8 CUs with 128kB of L1 each still yield a 1MB memory pool. And you have to keep in mind the L0 cache still sits above that, so at least some additional latency isn't going to ruin performance.
The CUs have 16KB L0/L1 caches, and one of the main benefits of this scheme is that its area cost should be lower than increasing the capacity of AMD's minuscule caches. I'm not sure how many tenths of a mm2 the L0 takes up in a current non-Kepler GPU, however.


ROPs are clients of L1 in Navi. ROP write operations are L1-write-through though, i.e. L1 doesn't support writes per se, so L2 is updated directly by ROP writes.
It's an odd arrangement with the ROPs versus the L1. The RDNA whitepaper considers the ROPs clients of the L1 and touts how it reduces memory traffic. However, if their output is considered write-through, what savings would making them an L1 client bring?
Given screen-space tiling, a given tile will not be loaded by any other RBE but one (no sharing), and unless the RBE loads a tile and proceeds to not write anything to it, the L1 at best holds ROP data that is read once and must be evicted once it leaves the RBE caches (no reuse).
 
GPUs <> Card SKUs.

They could have:

N21: 80 CU = 6900 (XT) and <80 CU = 6900 (plain) or 6800.
N22: 40 CU = 6800 (XT) and <40 CU = 6800 (plain) or 6700.

etc.
Yeah, but even then, the cut-down chips usually lose at most 20% of the performance, which still leaves a massive gap between the tiers. A cut-down 80 CU part would probably still be 72 CUs or close to it, which is still massively above a 40 CU part.
 