AMD Execution Thread [2023]

CDNA 3 whitepaper https://www.amd.com/content/dam/amd...-docs/white-papers/amd-cdna-3-white-paper.pdf

Chips and cheese piece on CDNA 3 https://chipsandcheese.com/2023/12/17/amds-cdna-3-compute-architecture/

Haven't read through it properly yet, but some things stand out: the 2100 MHz peak clock is very high for 304 CUs (1.23x higher than last gen), they doubled the width of the matrix units, and INT8 is now full rate (it ran at the same rate as FP16 on CDNA 2), so that's 6.8x faster; TF32 and FP8 were added, which we already knew. FP64 is 81.7 TFLOPS, up from 47.9 (1.7x), for classic HPC stuff. Cache size and bandwidth are a bit bonkers: the 256 MB LLC/Infinity Cache (the whitepaper uses both names) has 17.2 TB/s of total bandwidth (128x 2 MB slices, 64 bytes/cycle per slice for 8192 bytes/cycle total, which is 17.2 TB/s at 2.1 GHz).
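As a quick sanity check of that cache bandwidth figure (the slice count, bytes/cycle and clock are straight from the whitepaper; the rest is just arithmetic):

```cuda
// Host-side sanity check of the Infinity Cache bandwidth figure quoted above.
#include <cstdio>

int main() {
    const double slices          = 128;   // 128 x 2 MB slices = 256 MB LLC
    const double bytes_per_clk   = 64;    // per slice, per cycle
    const double clock_ghz       = 2.1;   // peak clock
    const double bytes_per_cycle = slices * bytes_per_clk;               // 8192 B/cycle
    const double tb_per_s        = bytes_per_cycle * clock_ghz / 1000.0; // GB/s -> TB/s
    printf("%.0f B/cycle -> %.1f TB/s\n", bytes_per_cycle, tb_per_s);    // 8192 B/cycle -> 17.2 TB/s
    return 0;
}
```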

In general it seems like a big improvement over CDNA2/MI250X, and I'm sure the unified memory in MI300A will be big for efficiency/perf in cases that take advantage of it; that removes a big inefficiency barrier.
 
Haven't read everything yet, but do we know how they achieved this: "Even better, MI300X can expose all of those CUs as a single GPU"?
 
By providing ~4 TB/s of peak bandwidth to each tile. L2 is per tile; there's no global LLC, but a MALL per memory controller.

But I guess performance is not the only thing allowing that. I mean, I would guess they need a software or hardware "dispatcher" to allocate the workload between tiles/GPUs? As I write this, I actually wonder if my "old" definition of a GPU is just obsolete: from the software's point of view, why would there be a difference between 2 CUs and 2 whole tiles/GPUs...?
 
But I guess performance is not the only thing allowing that. I mean, I would guess they need a software or hardware "dispatcher" to allocate the workload between tiles/GPUs? As I write this, I actually wonder if my "old" definition of a GPU is just obsolete: from the software's point of view, why would there be a difference between 2 CUs and 2 whole tiles/GPUs...?

When you are writing for something like CUDA, dispatch is almost "built-in", because your program is already written as dispatching to many "CUs" which operate quite independently (there are sometimes locks involved, but for a NUMA system there should be a global lock available). So if each CU can logically access the same memory pool, it should be fine. Of course, you'll still want to reduce remote memory access, but that's mostly a question of optimization.
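Just to illustrate that: a kernel is already expressed as a grid of independent blocks, and nothing in the source pins a block to a particular CU or die, so the scheduler is free to spread them across all of them. A minimal generic CUDA-style sketch (nothing MI300-specific, and the sizes are arbitrary):

```cuda
// The grid below is just "lots of independent blocks"; which CU (or XCD) runs
// each block is entirely up to the hardware scheduler.
#include <cuda_runtime.h>

__global__ void scale(float* data, float factor, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1 << 24;
    float* d = nullptr;
    cudaMalloc((void**)&d, n * sizeof(float));
    // Blocks may run in any order, on any CU; the source never says which.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```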
 
The LLC being attached to memory controllers seems like an interesting decision. So the cache in each tile will only include data in memory in the same tile. If you're naively treating it as UMA, 3/4 of your L2 hits are going to be across tiles served at 1.2-1.5 TB/s rather than the on-paper 17 TB/s of the cache, which won't be too different (at least ignoring power) from hitting memory directly.

I'd have thought the cache prioritising data from other tiles would have made the most sense. Unless you're exposing an explicit NUMA arrangement.
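A rough bound to put numbers on that (assuming uniform striping across four quadrants and taking the ~1.2-1.5 TB/s per-link figure above as given; both are assumptions): if 3/4 of the bytes have to cross a link, the link caps what one XCD can pull out of the LLC.

```cuda
// Host-side arithmetic only: bottleneck bound for naive UMA striping.
#include <cstdio>

int main() {
    const double remote_fraction = 0.75;        // 3 of 4 quadrants are remote
    const double links_tbps[]    = {1.2, 1.5};  // per-link bandwidth quoted above
    for (double link : links_tbps) {
        // If remote_fraction of all bytes traverse the link, total rate <= link / fraction.
        printf("link %.1f TB/s -> per-XCD LLC ceiling ~%.1f TB/s\n",
               link, link / remote_fraction);
    }
    return 0;
}
```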
 
But I guess performance is not the only thing allowing that. I mean, I would guess they need a software or hardware "dispatcher" to allocate the workload between tiles/GPUs? As I write this, I actually wonder if my "old" definition of a GPU is just obsolete: from the software's point of view, why would there be a difference between 2 CUs and 2 whole tiles/GPUs...?

Dispatching work is easy. The problem with treating multiple pieces of silicon as a single GPU is memory coherency: when your program touches the same value from two different places, how do you make sure both see the same thing?
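Concretely (a generic CUDA-style sketch, nothing MI300-specific): the moment two blocks that may land on different dies touch the same word, you have to go through atomics, which is exactly what the coherency hardware has to back; plain stores are just a data race.

```cuda
// Every block increments the same counter; atomicAdd is coherent no matter
// which die the block ran on, while a plain "*counter += 1" would be a race.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void contend(int* counter) {
    atomicAdd(counter, 1);
}

int main() {
    int* c = nullptr;
    cudaMallocManaged(&c, sizeof(int));
    *c = 0;
    contend<<<1024, 256>>>(c);
    cudaDeviceSynchronize();
    printf("count = %d (expect %d)\n", *c, 1024 * 256);
    cudaFree(c);
    return 0;
}
```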

The LLC being attached to memory controllers seems like an interesting decision. So the cache in each tile will only include data in memory in the same tile. If you're naively treating it as UMA, 3/4 of your L2 hits are going to be across tiles served at 1.2-1.5 TB/s rather than the on-paper 17 TB/s of the cache, which won't be too different (at least ignoring power) from hitting memory directly.

I'd have thought the cache prioritising data from other tiles would have made the most sense. Unless you're exposing an explicit NUMA arrangement.
How would you do coherency?

I would expect there to be significant software optimization opportunities in trying to make sure that kernels use the local 1/4th of memory whenever possible, even when just partitioning the GPU into 4 is not an option.
 
How would you do coherency?
With a snooper for eventual coherence. But also just don't expect coherence without using atomics or synchronisation.
I would expect there to be significant software optimization opportunities in trying to make sure that kernels use the local 1/4th of memory whenever possible, even when just partitioning the GPU into 4 is not an option
Likewise. However, I've seen a fair bit of talk about it seamlessly behaving like a single chip, which still doesn't seem to be the case.
 
So the cache in each tile will only include data in memory in the same tile.
No.
MALL is physically striped across the IMCs much the same way it's striped on Navi2, Navi3, STX-halo, you name 'em.
even when just partitioning the GPU into 4 is not an option.
You can, but why.
However, I've seen a fair bit of talk about it seamlessly behaving like a single chip, which still doesn't seem to be the case.
It is very much one, no matter how you cope.
Doubly so for MI300A, which has only UMA modes.
How would you do coherency?
MALL has a chongus directory baked in, lifted straight up from Genoa.
 
The LLC being attached to memory controllers seems like an interesting decision. So the cache in each tile will only include data in memory in the same tile. If you're naively treating it as UMA, 3/4 of your L2 hits are going to be across tiles served at 1.2-1.5 TB/s rather than the on-paper 17 TB/s of the cache, which won't be too different (at least ignoring power) from hitting memory directly.

I'd have thought the cache prioritising data from other tiles would have made the most sense. Unless you're exposing an explicit NUMA arrangement.
It looks like memory interleaving is fairly fine-grained (every 256B like many discrete GPUs).

Memory access pattern can still be optimised to make better use of the local LLC, if one can influence the cross-XCD interleaving granularity of workgroups.
 
It looks like memory interleaving is fairly fine-grained (every 256B like many discrete GPUs).

Memory access pattern can still be optimised to make better use of the local LLC, if one can influence the cross-XCD interleaving granularity of workgroups.
Where did you read/hear that it's 256B fine-grained interleaving? The whitepaper says (for MI300A at least) that memory is "Interleaved between HBM stacks: switch stack every 4KiB through physical memory space" (Page 13 of https://www.amd.com/content/dam/amd...-docs/white-papers/amd-cdna-3-white-paper.pdf)

There's a tricky trade-off for interleaving: ideally, for maximum DRAM power efficiency (& performance), you want to read the entire DRAM page (aka row) in one go, which is 1KiB for HBM2/3 in pseudo-channel mode I think. If you can read 1024 bytes in one go, that'll be the most efficient path. But by requesting a bunch of data from a single page at a time (and therefore also a single LLC partition), you risk congestion on the fabric (and possibly the cache), with bandwidth demand being bursty/unequal between links over very short periods (even if the long-term average is still balanced).

Even if you do >=1KiB interleaving, though, because your lower-level caches deal in 128B lines, it can be tricky to get the data requests to arrive at (roughly) the same time, which kind of defeats the point, unless your DRAM controller is very good at coalescing those requests (at the cost of higher latency), which makes the whole thing potentially moot anyway... (As a side note, I wonder if NVIDIA's TMA could help here by effectively doing larger requests than the cacheline size?)

256B is a common compromise for multiple vendors AFAIK, and it'd still make more sense to me than 4KiB, which is larger than the HBM DRAM row/page. Maybe 4KiB is only for MI300A with CPU/GPU coherency, but I'm not quite sure why they'd need to match the OS page size for how data is distributed across HBM stacks?!
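To put rough numbers on that trade-off (the 128B line and 1KiB row sizes are the ones discussed above; the candidate stripe sizes are just illustrative):

```cuda
// How many consecutive 128 B cache lines stay on the same channel (and so can
// hit the same open DRAM row) for a few candidate interleave granularities.
#include <cstdio>

int main() {
    const int cache_line = 128;   // bytes, lower-level cache line size
    const int row_bytes  = 1024;  // assumed HBM pseudo-channel row (page) size
    const int stripes[]  = {256, 1024, 4096};
    for (int stripe : stripes) {
        printf("stripe %4d B: %2d consecutive lines per channel (%d%% of a %d B row)\n",
               stripe, stripe / cache_line, 100 * stripe / row_bytes, row_bytes);
    }
    return 0;
}
```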
 
Where did you read/hear that it's 256B fine-grained interleaving? The whitepaper says (for MI300A at least) that memory is "Interleaved between HBM stacks: switch stack every 4KiB through physical memory space" (Page 13 of https://www.amd.com/content/dam/amd...-docs/white-papers/amd-cdna-3-white-paper.pdf)
It is interleaving across stacks every 4KiB. Each HBM3 stack has 16 channels, so that most likely implies 256B interleaving within a stack, which is also (not so coincidentally) the interleaving granularity used by AMD discrete GPUs and the (past? *) iGPU private aperture.

Edit: If I remember correctly, the unified memory pool of newer console APUs also operates similarly, i.e., everything is fine-grained interleaving for maximal data-parallelism friendliness, at the expense of CPU loads, which generally favour page-level interleaving as you mentioned.

* Not quite sure about consumer Zen APUs since Raven Ridge/Renoir…
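To make the implied layout concrete, a trivial decode under those assumptions (8 stacks, 16 channels per stack, 4KiB stack stripe, hence 256B per channel; real hardware almost certainly hashes address bits rather than using a plain modulo like this):

```cuda
// Toy physical-address decode for the assumed interleave scheme above.
#include <cstdio>
#include <cstdint>

int main() {
    for (uint64_t pa = 0; pa < 5 * 4096; pa += 1024) {  // walk a few 1 KiB steps
        const uint64_t stack   = (pa / 4096) % 8;   // switch stack every 4 KiB
        const uint64_t channel = (pa / 256)  % 16;  // 4 KiB / 16 channels = 256 B each
        printf("PA 0x%06llx -> stack %llu, channel %llu\n",
               (unsigned long long)pa, (unsigned long long)stack,
               (unsigned long long)channel);
    }
    return 0;
}
```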
 
It is interleaving across stacks every 4KiB. Each HBM3 stack has 16 channels, so that most likely implies 256B interleaving within a stack, which is also (not so coincidentally) the interleaving granularity used by AMD discrete GPUs and the (past? *) iGPU private aperture.

Edit: If I remember correctly, the unified memory pool of newer console APUs also operates similarly, i.e., everything is fine-grained interleaving for maximal data-parallelism friendliness, at the expense of CPU loads, which generally favour page-level interleaving as you mentioned.

* Not quite sure about consumer Zen APUs since Raven Ridge/Renoir…
Ahhh, thank you, I think you're right; I've not personally worked with HBM designs (mostly LPDDR), so I was confused about the terminology. From what I can tell there are multiple banks per (pseudo-)channel though, so it could theoretically be divided by more than 16 if it's 4KiB per stack, but I suppose "use one bank from each channel on a single stack" would be a reasonable strategy, so maybe not.

You said the following previously:
Memory access pattern can still be optimised to make better use of the local LLC, if one can influence the cross-XCD interleaving granularity of workgroups.
It sounds like it should be possible but how exactly would you do it in practice? You'd need full knowledge of the entire hashing algorithm *and* you'd need the virtual=>physical memory mapping to remain consistent across way more than 4KiB.

I think the virtual=>physical mapping on its own might be OK: with 4KiB per stack using all 16 pseudo-channels and "8 stacks * 8 banks per pseudo-channel", you'd need >=256KiB allocations to guarantee things don't get shuffled around, so 2MiB pages might work but I don't think 64KiB pages would. I think 2MiB pages are pretty common on desktop GPUs (they certainly aren't on Android mobile GPUs, sigh), so it might be OK, but I'm not 100% sure whether 2MiB pages would be as prevalent on something like an MI300A.
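Working through that back-of-envelope explicitly (all of these are assumed parameters from the discussion above, not confirmed hardware details):

```cuda
// Period over which the assumed stack/bank striping repeats.
#include <cstdio>

int main() {
    const int stack_stripe = 4 * 1024;                      // 4 KiB per stack
    const int stacks       = 8;
    const int banks_per_pc = 8;                             // assumed banks per pseudo-channel
    const int stack_period = stack_stripe * stacks;         // 32 KiB: every stack touched once
    const int full_period  = stack_period * banks_per_pc;   // 256 KiB: same bank revisited
    printf("stack period %d KiB, full period %d KiB -> a 2 MiB page spans %d full periods\n",
           stack_period / 1024, full_period / 1024, (2 * 1024 * 1024) / full_period);
    return 0;
}
```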

You'd still need the hashing algorithm not to take into account higher bits of the physical address that are dependent on the virtual=>physical mapping, or for all the data inside a given 2KiB row/bank to be inside the same 2MiB page (or whatever the page size is). Again, that might work assuming you "come back" to the same set of banks every 256KiB (see above) and you have 256B per bank, as 256B * (2MiB / 256KiB) = 256B * 8 = 2KiB...

Assuming the LLC hashing matches the DRAM row/bank hashing (which isn't strictly guaranteed), I was thinking I could use atomics to figure this out: if two memory addresses target the same LLC bank, there will be congestion and atomic performance will be lower (typically 2x lower if the atomic HW is the bottleneck rather than the fabric) than if the atomics were being processed in independent LLC banks, even though the addresses are different in both cases, so there is no address-level collision.
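Roughly the kind of probe I have in mind (a completely hypothetical, untested sketch in generic CUDA-style code; the offsets, iteration count and launch shape are arbitrary): hammer two addresses with atomics and look for offsets where throughput drops, which would suggest the pair shares an LLC bank.

```cuda
// Sweep the offset between two atomic targets; a slow outlier hints that the
// pair of addresses landed in the same LLC bank / atomic unit.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void hammer(unsigned* a, unsigned* b, int iters) {
    for (int i = 0; i < iters; ++i)
        atomicAdd((blockIdx.x & 1) ? a : b, 1u);  // half the blocks hit A, half hit B
}

static float probe(unsigned* buf, size_t offset_bytes, int iters) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    unsigned* a = buf;
    unsigned* b = (unsigned*)((char*)buf + offset_bytes);
    cudaEventRecord(t0);
    hammer<<<1024, 256>>>(a, b, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms;
}

int main() {
    unsigned* buf = nullptr;
    cudaMalloc((void**)&buf, 64 << 20);  // 64 MiB scratch buffer
    cudaMemset(buf, 0, 64 << 20);
    for (size_t off = 256; off <= (32 << 20); off <<= 1)
        printf("offset %8zu B: %.3f ms\n", off, probe(buf, off, 10000));
    cudaFree(buf);
    return 0;
}
```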

It all seems somewhat feasible but I'm more than a little bit skeptical that any developers or even AMD are considering doing this. I hope I'm wrong though!
 