AMD CDNA Discussion Thread

https://github.com/llvm/llvm-project/commit/47bac63d3f6b9e64fdf997aff1f145bc948f02d9
It seems like cache coherence in GFX940 (CDNA 3?) is achieved by:

* memory local to an L2: reads and writes are cacheable;
* remote memory (other L2s or the CPU): reads are uncached; writes are write-through and send an invalidation to the home L2;
* each L2 keeps a probe filter for lines of its local memory cached by the CPU, and forces CPU invalidation or writeback as appropriate.

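To make that concrete, here is a minimal HIP sketch (my interpretation, not code from the commit) of where the distinction surfaces: the scope argument on atomics is what selects between the local-L2 path and the write-through/invalidate path above.

```cpp
// Hedged sketch: uses ROCm's __hip_atomic_* builtins; the GFX940 behaviour
// attributed to each scope is my reading of the commit, not documented fact.
#include <hip/hip_runtime.h>

__global__ void producer(int* data, int* flag_agent, int* flag_system) {
    data[0] = 42;

    // Agent scope: only other waves on this GPU must observe the release,
    // so the local (home) L2 alone can satisfy it.
    __hip_atomic_store(flag_agent, 1, __ATOMIC_RELEASE,
                       __HIP_MEMORY_SCOPE_AGENT);

    // System scope: the CPU or another agent's L2 must observe it, so the
    // write is pushed through to the home L2 and remote copies invalidated.
    __hip_atomic_store(flag_system, 1, __ATOMIC_RELEASE,
                       __HIP_MEMORY_SCOPE_SYSTEM);
}
```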
It appears that GFX940 no longer has a unified L2 cache shared by all CUs. It is configurable:

1. from many smaller agents/virtual devices, each having its own private L2 cache;

2. to one single agent having multiple L2 caches.

Each L2 now owns a disjoint(?) region of device memory, though each still appears to have internal interleaved “channel” partitions.

This makes perfect sense in multi-agent mode, where each small agent gets a fixed contiguous region that can be owned outright by a single L2. But I am not sure how the more “monolithic” configurations, where one agent sees multiple standalone L2s, would work effectively. Page-level interleaving, eh?
 
Does that "MI250X/1" indicate that it's running on half a MI250X, i.e. only on one of the two devices?
 
Yes, same as for SVD. They clearly stated "single GPU" in the figure description, "MI250X/1" in the graph legends, and "Single GCD" in the SVD figure description.

Also, some comparisons are V100 vs. MI100 with old ROCm versions, etc.
 
The June TOP500 list is out with MI250X systems, including Frontier breaking 1 EF in HPL.

HPL
HPCG

MI250X looks amazing at Linpack: it crushes everything else per node and per watt. At HPCG it seems not so great. Frontier didn't submit a run, and judging by LUMI's result I think that's because they would've come in just under Fugaku and landed at #2. It also looks like the dual-die MI250X only barely outperforms a single A100 here, comparing LUMI to Perlmutter (5,120 MI250Xs performing very similarly to 6,159 A100s), and with worse perf/W. That seems to mirror the benchmarks above.

Also, the Rpeak on TOP500 seems a bit odd/inconsistent for the GPUs with matrix acceleration. For MI250X it uses the vector performance, i.e. ~48 TFLOPS. For A100 it's about 14 TFLOPS, somewhere between the vector and matrix performance (geomean of the two?). I initially thought AMD had managed to seriously improve their HPL utilisation compared to the early benchmarks, but it doesn't seem like it.
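If the geomean guess is right, the A100 numbers roughly check out: its FP64 peaks are 9.7 TFLOPS vector and 19.5 TFLOPS tensor, and sqrt(9.7 × 19.5) ≈ 13.8 TFLOPS, close to the ~14 TOP500 uses. Applying the same rule to MI250X (47.9 vector, 95.7 matrix) would give ≈ 67.7 TFLOPS rather than the ~48 listed, so the two do seem to be treated inconsistently.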
 
Are they really going to update their HPL and HPCG runs? The machines will likely be busy running actual customer workloads, and it can be costly to run these benchmarks again. In the past, sites have generally only rerun them when the configuration changed (e.g. nodes were added). Of course, it'd be great if they actually do, with much better results.
 
What is the point of such a configuration? Sharing the HBM memory with the CPU doesn't sound logical.

Per the footnote, the >8x performance gain comes from using FP8 for training. So at >850 W it will be barely better than NVIDIA's H100 PCIe card at 350 W. And the Grace Hopper Superchip at ~1000 W will be much better for workloads needing huge data, i.e. AI training...
 
There are different interconnects on the market with memory coherency, and this whole concept only makes sense when the information isn't going off-chip. So in a distributed computing system, using CDNA 3 APUs has no advantage over traditional concepts.
 
Interconnects within the same package, utilizing modern bridges etc., can offer bandwidth and latency in a completely different league from anything going over to a separate socket/slot. Memory coherency makes sense even going off socket/slot, and even more so within the same package. And if the problem is with the company, look at NVIDIA doing the same on Grace/Hopper, settling for going off socket even if it's a short way.
 

Intel is putting HBM on a CPU; AMD is putting a CPU on HBM. 18 months from now, isolated use cases in data centers will be only a small part of the business.
Accelerators exist because x86 CPUs are not fast enough. Combining a slow x86 CPU with an accelerator is a paradox.
 
I am excited about this. AMD has had support for the GPU accessing pageable system memory since Kaveri; it has just always been capped at the PCIe link or system memory bandwidth, whichever is lower. Enabling the GPU to tap into demand paging with multiple TB/s of bandwidth* can truly change how problems with never-fully-resident (i.e., just very large) data sets are approached. I would also speculate that, with CXL.mem support (if true), we may see options to configure some PCIe links for CXL-attached high-capacity DRAM/NVRAM as first-level swap.

(edit: * and getting rid of the system-device page migration overhead)

There is also a hypothetical world where, in the consumer GPU space, a sparse texture/buffer could simply be a normal mmap-ed range, faulted in from storage by the OS on demand. :mrgreen:
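To make the idea concrete, a hedged sketch, assuming a ROCm stack with HMM/XNACK enabled so a kernel can dereference an ordinary pageable host pointer ("dataset.bin" and the launch shape are made up for illustration):

```cpp
// Sketch only: relies on HMM/XNACK so the mmap-ed range is demand-paged
// for the GPU; on hardware without that support this would fault.
#include <hip/hip_runtime.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

__global__ void sum(const float* data, size_t n, float* out) {
    float acc = 0.f;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        acc += data[i];            // pages fault in from the file on demand
    atomicAdd(out, acc);
}

int main() {
    int fd = open("dataset.bin", O_RDONLY);
    size_t bytes = lseek(fd, 0, SEEK_END);
    // A plain mmap-ed file: never copied to, or pinned for, the device.
    auto* data = static_cast<float*>(
        mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0));

    float* out;
    hipMallocManaged(reinterpret_cast<void**>(&out), sizeof(float));
    *out = 0.f;
    hipLaunchKernelGGL(sum, dim3(1), dim3(256), 0, 0,
                       data, bytes / sizeof(float), out);
    hipDeviceSynchronize();
    printf("sum = %f\n", *out);

    munmap(data, bytes);
    close(fd);
    return 0;
}
```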
 