AMD CDNA Discussion Thread

https://github.com/llvm/llvm-project/commit/47bac63d3f6b9e64fdf997aff1f145bc948f02d9
It seems like cache coherence in GFX940 (CDNA 3?) is achieved by:

* memory local to an L2: reads and writes are cacheable;
* remote memory (other L2s or the CPU): reads are uncached; writes are write-through and send an invalidation to the home L2;
* each L2 keeps a probe filter for lines of its local memory cached by the CPU, and forces CPU invalidation or writeback as appropriate.

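To make that concrete, here is a minimal HIP sketch (my interpretation, not code from the commit) of where the distinction surfaces: the scope argument on atomics is what selects between the local-L2 path and the write-through/invalidate path above.

```cpp
// Hedged sketch: uses ROCm's __hip_atomic_* builtins; the GFX940 behaviour
// attributed to each scope is my reading of the commit, not documented fact.
#include <hip/hip_runtime.h>

__global__ void producer(int* data, int* flag_agent, int* flag_system) {
    data[0] = 42;

    // Agent scope: only other waves on this GPU must observe the release,
    // so the local (home) L2 alone can satisfy it.
    __hip_atomic_store(flag_agent, 1, __ATOMIC_RELEASE,
                       __HIP_MEMORY_SCOPE_AGENT);

    // System scope: the CPU or another agent's L2 must observe it, so the
    // write is pushed through to the home L2 and remote copies invalidated.
    __hip_atomic_store(flag_system, 1, __ATOMIC_RELEASE,
                       __HIP_MEMORY_SCOPE_SYSTEM);
}
```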
It appears that GFX940 no longer has a unified L2 cache shared by all CUs. It is configurable:

1. from many smaller agents/virtual devices, each having its own private L2 cache;

2. to one single agent having multiple L2 caches.

Each L2 now owns a disjoint(?) region of device memory, though each still appears to have internal interleaved “channel” partitions.

This makes perfect sense in multi-agent mode, where each small agent gets a fixed contiguous region that can be owned outright by a single L2. But I am not sure how the more “monolithic” configurations, where one agent sees multiple standalone L2s, would work effectively. Page-level interleaving, eh?
 
Does that "MI250X/1" indicate that it's running on half a MI250X, i.e. only on one of the two devices?
 
Yes, same as for SVD. They clearly stated "single GPU" in the figure description, "MI250X/1" in the graph legends, and "Single GCD" in the SVD figure description.

Also, some comparisons are V100 vs. MI100 with old ROCm versions, etc.
 
The June TOP500 list is out with MI250X systems, including Frontier breaking 1 EF in HPL.

HPL
HPCG

MI250X looks amazing at Linpack: it crushes everything else per node and per watt. At HPCG it seems not so great. Frontier didn't submit a run, and judging by LUMI's result I think that's because they would've come in just under Fugaku and landed at #2. It also looks like the dual-die MI250X only barely outperforms a single A100 here, comparing LUMI to Perlmutter (5,120 MI250Xs performing very similarly to 6,159 A100s), and with worse perf/W. That seems to mirror the benchmarks above.

Also, the Rpeak on TOP500 seems a bit odd/inconsistent for the GPUs with matrix acceleration. For MI250X it uses the vector performance, i.e. ~48 TFLOPS. For A100 it's about 14 TFLOPS, somewhere between the vector and matrix performance (geomean of the two?). I initially thought AMD had managed to seriously improve their HPL utilisation compared to the early benchmarks, but it doesn't seem like it.
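If the geomean guess is right, the A100 numbers roughly check out: its FP64 peaks are 9.7 TFLOPS vector and 19.5 TFLOPS tensor, and sqrt(9.7 × 19.5) ≈ 13.8 TFLOPS, close to the ~14 TOP500 uses. Applying the same rule to MI250X (47.9 vector, 95.7 matrix) would give ≈ 67.7 TFLOPS rather than the ~48 listed, so the two do seem to be treated inconsistently.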
 
Are they really going to update their HPL and HPCG runs? The machines will likely be busy running actual customer workloads, and it can be costly to run these benchmarks again. In the past, sites have generally only rerun them when the configuration changed (e.g. nodes were added). Of course, it'd be great if they actually do, with much better results.
 
What is the point of such a configuration? Sharing the HBM memory with the CPU doesn't sound logical.

Per the footnote, the >8x performance gain comes from using FP8 for training. So at >850 W it will be barely better than NVIDIA's H100 PCIe card at 350 W. And the Grace Hopper Superchip at ~1000 W will be much better for workloads needing huge data, i.e. AI training...
 
There are different interconnects on the market with memory coherency, and this whole concept only makes sense when the information isn't going off-chip. So in a distributed computing system, using CDNA 3 APUs has no advantage over traditional concepts.
 
Interconnects within the same package, utilizing modern bridges etc., can offer bandwidth and latency in a completely different league from anything going over to a separate socket/slot. Memory coherency makes sense even going off socket/slot, and even more so within the same package. And if the problem is with the company, look at NVIDIA doing the same on Grace/Hopper, settling for going off socket even if it's a short way.
 

Intel is putting HBM on a CPU; AMD is putting a CPU on HBM. 18 months from now, isolated use cases in data centers will be only a small part of the business.
Accelerators exist because x86 CPUs are not fast enough. Combining a slow x86 CPU with an accelerator is a paradox.
 
I am excited about this. AMD has had support for the GPU accessing pageable system memory since Kaveri; it has just always been capped at the PCIe link or system memory bandwidth, whichever is lower. Enabling the GPU to tap into demand paging with multiple TB/s of bandwidth* can truly change how problems with never-fully-resident (i.e., just very large) data sets are approached. I would also speculate that, with CXL.mem support (if true), we may see options to configure some PCIe links for CXL-attached high-capacity DRAM/NVRAM as first-level swap.

(edit: * and getting rid of the system-device page migration overhead)

There is also a hypothetical world where, in the consumer GPU space, a sparse texture/buffer could simply be a normal mmap-ed range, faulted in from storage by the OS on demand. :mrgreen:
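To make the idea concrete, a hedged sketch, assuming a ROCm stack with HMM/XNACK enabled so a kernel can dereference an ordinary pageable host pointer ("dataset.bin" and the launch shape are made up for illustration):

```cpp
// Sketch only: relies on HMM/XNACK so the mmap-ed range is demand-paged
// for the GPU; on hardware without that support this would fault.
#include <hip/hip_runtime.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

__global__ void sum(const float* data, size_t n, float* out) {
    float acc = 0.f;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        acc += data[i];            // pages fault in from the file on demand
    atomicAdd(out, acc);
}

int main() {
    int fd = open("dataset.bin", O_RDONLY);
    size_t bytes = lseek(fd, 0, SEEK_END);
    // A plain mmap-ed file: never copied to, or pinned for, the device.
    auto* data = static_cast<float*>(
        mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0));

    float* out;
    hipMallocManaged(reinterpret_cast<void**>(&out), sizeof(float));
    *out = 0.f;
    hipLaunchKernelGGL(sum, dim3(1), dim3(256), 0, 0,
                       data, bytes / sizeof(float), out);
    hipDeviceSynchronize();
    printf("sum = %f\n", *out);

    munmap(data, bytes);
    close(fd);
    return 0;
}
```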
 