AMD CDNA Discussion Thread

itsmydamnation · Jun 11, 2022

troyan said:
Intel is putting HBM on a CPU, AMD is putting a CPU on HBM. 18 months from now isolated use cases in data centers will be only a small part of the business. Accelerators exist because x86 CPUs are not fast enough. Combining a slow x86 CPU with an accelerator is a paradox.

Then why do so many clusters have CPU only nodes?

Your view is stupidly simplistic.
Remember: executing on data is easy, moving data is hard. CPUs are orders of magnitude better at moving data that you don't know you need yet, or when you don't know which data you need (OoOE machinery).
GPUs are orders of magnitude wider than CPUs and have far more concurrency.
CPU SIMD units are stupidly fast (latency: 4-5 cycle FMA at 2.5-5 GHz).

HBM limits how large your working set is but has high throughput (but is not really "fast" per se).
DIMMs have low throughput but can have TBs of working space.
DIMMs can have persistent memory.

By putting all of these on package, with coherency end to end, you can:
reduce unnecessary copying of data between "engines" (remember: move hard, execute easy)
remove slow, power-hungry off-package I/O interfaces
increase working set size
more easily accelerate the workload in either direction

The matrix engines so beloved by this forum's Team Nv are nothing more than exactly the above at play, but for a very specific workload: a very large number of reads to a very low number of writes.
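
The "move hard, execute easy" point can be illustrated with a toy cost model. This is only a sketch: the function names, the link speeds, and the CPU and accelerator rates are all hypothetical placeholders chosen for illustration, not figures from the post or from any real part. It also assumes the copy and the compute do not overlap.

```python
# Toy cost model. Every number here is a hypothetical placeholder, not a measurement.

def offload_time_s(bytes_moved, flops, link_bytes_per_s, accel_flops_per_s):
    """Seconds to copy the working set to an accelerator and compute there.

    Assumes the copy and the compute do not overlap.
    """
    return bytes_moved / link_bytes_per_s + flops / accel_flops_per_s


def local_time_s(flops, cpu_flops_per_s):
    """Seconds to compute in place on the CPU, with no copy."""
    return flops / cpu_flops_per_s


def offload_pays_off(bytes_moved, flops, link_bytes_per_s,
                     accel_flops_per_s, cpu_flops_per_s):
    """True when copy-plus-accelerate beats staying on the CPU."""
    offload = offload_time_s(bytes_moved, flops, link_bytes_per_s, accel_flops_per_s)
    return offload < local_time_s(flops, cpu_flops_per_s)


# Hypothetical machine: 1 TFLOPS CPU, 20 TFLOPS accelerator.
CPU_RATE, ACCEL_RATE = 1e12, 20e12
OFF_PACKAGE_LINK = 32e9   # bytes/s, assumed off-package link
ON_PACKAGE_LINK = 1e12    # bytes/s, assumed on-package coherent fabric
WORKING_SET = 8e9         # bytes

# Compute-heavy job: the copy is amortised, so offloading wins.
heavy = offload_pays_off(WORKING_SET, 1e12, OFF_PACKAGE_LINK, ACCEL_RATE, CPU_RATE)
# Compute-light job over the slow link: the copy dominates, so the CPU wins.
light = offload_pays_off(WORKING_SET, 1e11, OFF_PACKAGE_LINK, ACCEL_RATE, CPU_RATE)
# Same light job with a fast on-package path: offloading wins again.
light_on_package = offload_pays_off(WORKING_SET, 1e11, ON_PACKAGE_LINK, ACCEL_RATE, CPU_RATE)

print(heavy, light, light_on_package)
```

Under these made-up numbers, the same accelerator goes from a loss to a win on the light workload purely because the data path got cheaper, which is the argument for putting everything on package.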

Granath · Jun 22, 2022

https://www.hpcwire.com/2022/06/21/amds-mi300-apus-to-power-exascale-el-capitan-supercomputer/

Granath · Jun 22, 2022

Lawrence Livermore’s “El Capitan” To Take AMD’s Instinct APU Mainstream

In March 2020, when Lawrence Livermore National Laboratory announced the exascale “El Capitan” supercomputer contract had been awarded to system builder

www.nextplatform.com

TopSpoiler · Jul 18, 2022

https://twitter.com/x/status/1547901766183710727

A single MI250X shows similar performance and efficiency to two A100s.

Compare with their paper specs:
MI250X - 47.9 TFLOPS @ 560W
2xA100 - 19.4 TFLOPS @ 800W

DavidGraham · Jul 18, 2022

TopSpoiler said:
Compare with their paper specs:
MI250X - 47.9 TFLOPS @ 560W
2xA100 - 19.4 TFLOPS @ 800W

32 A100s still consumed significantly lower wattage than 16 MI250X (32 logical GPUs), while achieving similar performance. The situation becomes inverted at 128 logical GPUs though, with the 128 A100s using slightly more power.

536571616e74 · Jul 18, 2022

itsmydamnation said:
Then why do so many clusters have CPU only nodes?

Your view is stupidly simplistic.
Remember: executing on data is easy, moving data is hard. CPUs are orders of magnitude better at moving data that you don't know you need yet, or when you don't know which data you need (OoOE machinery).
GPUs are orders of magnitude wider than CPUs and have far more concurrency.
CPU SIMD units are stupidly fast (latency: 4-5 cycle FMA at 2.5-5 GHz).

HBM limits how large your working set is but has high throughput (but is not really "fast" per se).
DIMMs have low throughput but can have TBs of working space.
DIMMs can have persistent memory.

By putting all of these on package, with coherency end to end, you can:
reduce unnecessary copying of data between "engines" (remember: move hard, execute easy)
remove slow, power-hungry off-package I/O interfaces
increase working set size
more easily accelerate the workload in either direction

The matrix engines so beloved by this forum's Team Nv are nothing more than exactly the above at play, but for a very specific workload: a very large number of reads to a very low number of writes.

That's a lot of "stupids" for one post. It's hard to have a civil discussion when you're reducing people and processes to the idiotic and extreme.

Qesa · Jul 18, 2022

DavidGraham said:
32 A100s still consumed significantly lower wattage than 16 MI250X (32 logical GPUs), while achieving similar performance. The situation becomes inverted at 128 logical GPUs though, with the 128 A100s using slightly more power.

I got the impression that was their point? On paper (at least if the only metrics you look at are peak TFLOPS and TDP) MI250X should be ~3.5x more efficient than A100, but IRL it's very similar for this problem.
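
The "~3.5x" figure follows directly from the paper specs quoted above. A quick check, using only the peak TFLOPS and TDP numbers given in the thread (the variable and function names are just for illustration):

```python
def tflops_per_watt(tflops, watts):
    """Paper efficiency: peak TFLOPS divided by TDP in watts."""
    return tflops / watts


# Paper specs as quoted in the thread.
mi250x_eff = tflops_per_watt(47.9, 560)       # one MI250X
two_a100_eff = tflops_per_watt(19.4, 800)     # two A100s combined

paper_ratio = mi250x_eff / two_a100_eff
print(f"MI250X paper efficiency advantage: {paper_ratio:.2f}x")
```

This only reproduces the on-paper ratio. It says nothing about delivered efficiency, which per the linked results was roughly on par for that workload.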

itsmydamnation · Jul 18, 2022

536571616e74 said:
That's a lot of "stupids" for one post. It's hard to have a civil discussion when you're reducing people and processes to the idiotic and extreme.

No it isn't. It's the fundamentals of computation for the last 70 years, and the problem is that it's really hard and there is no free lunch. As technology (process, package, etc.) improves, it's just the wheel of reincarnation.

But feel free to actually put some sort of meaningful objection with some kind of actual substance forward......

536571616e74 · Jul 18, 2022

itsmydamnation said:
No it isn't. It's the fundamentals of computation for the last 70 years, and the problem is that it's really hard and there is no free lunch. As technology (process, package, etc.) improves, it's just the wheel of reincarnation.

But feel free to actually put some sort of meaningful objection with some kind of actual substance forward......

The meaningful objection is clear, I think: we cannot have productive conversations when you're being hyperbolic and personally insulting people.

TopSpoiler · Jul 27, 2022

https://twitter.com/x/status/1552226811454623746

troyan · Jul 27, 2022

"Single GCD is theoretical 1.6TB/s, in ideal synthetic benchmark case <1.4TB/s. They market it as a combined 3.2TB/s on MI250. Clearer would be "2x 1.6TB/s"."

Bingo. MI250X is just two CDNA2 chips glued together without an advanced interconnect (and only 400GB/s...).
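
For scale, the bandwidth figures in this post can be put side by side. A small sketch using only the numbers quoted above (1.6TB/s theoretical per GCD, under 1.4TB/s in the synthetic best case, 400GB/s between the GCDs); the constant names are made up for illustration:

```python
PER_GCD_PEAK_TBS = 1.6       # theoretical HBM bandwidth per GCD, as quoted
SYNTHETIC_BEST_TBS = 1.4     # upper bound; the quote says "<1.4TB/s"
GCDS_PER_PACKAGE = 2
INTER_GCD_LINK_TBS = 0.4     # 400GB/s between the two GCDs, as quoted

# The marketed figure is simply the per-GCD peak times two.
combined_peak_tbs = PER_GCD_PEAK_TBS * GCDS_PER_PACKAGE

# Best-case synthetic result as a fraction of the theoretical peak (an upper bound).
synthetic_fraction = SYNTHETIC_BEST_TBS / PER_GCD_PEAK_TBS

# How fast one GCD can reach the other GCD's memory, relative to its own HBM.
remote_vs_local = INTER_GCD_LINK_TBS / PER_GCD_PEAK_TBS

print(f"combined peak: {combined_peak_tbs:.1f} TB/s")
print(f"synthetic best case: under {synthetic_fraction:.1%} of peak")
print(f"cross-GCD link: {remote_vs_local:.0%} of local HBM bandwidth")
```

So a kernel running on one GCD sees its own 1.6TB/s, and at most a quarter of that when pulling data from the other GCD, which is why "2x 1.6TB/s" is the clearer description.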

DavidGraham · Jul 28, 2022

So in these tests, the A100 40GB is faster than a single GCD of the MI250 by:

FP64/FP64: 41%
FP64/FP32: 37%
FP32/FP32: 49%
FP32/FP16s: 90%
FP32/FP16c: 35%

Memory efficiency is around 60% for the MI250, and 90% for the A100.

TopSpoiler said:
https://twitter.com/x/status/1552226811454623746
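
To read those percentages the other way around, the sketch below converts "A100 is faster by X%" into the relative throughput of one GCD, and then of a full two-GCD MI250. The two-GCD figure assumes perfect scaling across both GCDs, which is an assumption added here and not something the tests demonstrate; the function names are likewise illustrative.

```python
# A100 40GB lead over a single MI250 GCD, in percent, as listed above.
a100_lead_pct = {
    "FP64/FP64": 41,
    "FP64/FP32": 37,
    "FP32/FP32": 49,
    "FP32/FP16s": 90,
    "FP32/FP16c": 35,
}


def gcd_vs_a100(lead_pct):
    """Throughput of one GCD relative to one A100 (1.0 means parity)."""
    return 1.0 / (1.0 + lead_pct / 100.0)


def mi250_vs_a100(lead_pct, scaling=1.0):
    """Throughput of a full MI250 (two GCDs) relative to one A100.

    scaling=1.0 assumes both GCDs scale perfectly, which is an assumption.
    """
    return 2.0 * scaling * gcd_vs_a100(lead_pct)


for mode, lead in a100_lead_pct.items():
    print(f"{mode}: one GCD = {gcd_vs_a100(lead):.2f}x A100, "
          f"full MI250 (ideal scaling) = {mi250_vs_a100(lead):.2f}x A100")
```

Even with ideal scaling, the full package only barely edges out a single A100 in the FP32/FP16s case, and any scaling loss across the two GCDs eats into the other cases as well.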

Granath · Aug 1, 2022

First Benchmarks with AMD Instinct MI250 GPUs at JSC

A few months ago, we extended the JURECA Evaluation Platform at JSC by two nodes with AMD Instinct MI250 GPUs (four GPUs each). The nodes are Gigabyte G262-ZO0 servers, each with a dual socket AMD EPYC 7443 processor (24 cores per socket, SMT-2) and with four MI250 GPUs (128 GB memory)...

x-dev.pages.jsc.fz-juelich.de

Deleted member 2197 · Aug 7, 2022

August 2, 2022

AMD Instinct MI250 Benchmark | HPC Systems Tech Blog

Frontier, the first-ever exascale system: in the TOP500 list announced at ISC 2022, Frontier overtook second-place Fugaku (442.01 PFLOPS) with 1.102 EFLOPS, taking first place as the first system to reach an "EFLOPS" figure. Frontier houses 9,408 nodes in 74 Cray EX cabinets, each node configured with one AMD Milan "Trento" 7A53 Epyc CPU and four AMD Instinct MI250X GPUs. The total number of GPUs is 37,632. AMD Instinct MI250...

www-hpc-co-jp.translate.goog

This time, I had the opportunity to benchmark a machine (SuperMicro 4124GQ-TNMI) equipped with four AMD Instinct MI250s, the lower-end sibling of the AMD Instinct MI250X used in Frontier. NVIDIA has long been the default name in GPUs, so I compared it against the NVIDIA A100 to see whether AMD could compete.
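
The Frontier figures in the preview above are easy to sanity-check. A short sketch using only the numbers quoted there, plus the two-GCDs-per-package count discussed earlier in the thread (variable names are illustrative):

```python
NODES = 9_408
GPUS_PER_NODE = 4
GCDS_PER_MI250X = 2            # two GCDs ("logical GPUs") per MI250X package

total_gpus = NODES * GPUS_PER_NODE
total_logical_gpus = total_gpus * GCDS_PER_MI250X

FRONTIER_PFLOPS = 1_102.0      # 1.102 EFLOPS expressed in PFLOPS
FUGAKU_PFLOPS = 442.01

frontier_vs_fugaku = FRONTIER_PFLOPS / FUGAKU_PFLOPS

print(f"GPU packages: {total_gpus}, logical GPUs: {total_logical_gpus}")
print(f"Frontier / Fugaku on that TOP500 list: {frontier_vs_fugaku:.2f}x")
```

The GPU count matches the 37,632 stated in the article, and the Linpack gap works out to roughly 2.5x.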

Deleted member 2197 · Aug 24, 2022

AMD MI250X and Topologies Explained at HC34

At Hot Chips 34, we learned more about the AMD MI250X for HPC and the MI250 OAM for mainstream GPU compute and AI/ML workloads

www.servethehome.com

troyan · Nov 2, 2022

Looking at our broader Data Center portfolio, as expected, Data Center GPU sales were down significantly from a year ago when we had substantial shipments supporting the build-out of the Frontier exascale supercomputer.

Advanced Micro Devices, Inc. (AMD) Q3 2022 Earnings Call Transcript

Advanced Micro Devices, Inc. (NASDAQ:AMD) Q3 2022 Earnings Conference Call, November 1, 2022, 5:00 PM ET. Company Participants: Ruth Cotter - Senior Vice...

seekingalpha.com

The HPC world isn't just Linpack.

xpea · Nov 2, 2022

troyan said:
Advanced Micro Devices, Inc. (AMD) Q3 2022 Earnings Call Transcript

Advanced Micro Devices, Inc. (NASDAQ:AMD) Q3 2022 Earnings Conference Call, November 1, 2022, 5:00 PM ET. Company Participants: Ruth Cotter - Senior Vice...

seekingalpha.com

The HPC world isn't just Linpack.

That's what I said a long time ago. Outside of these government-subsidized deals, MI250 is nowhere to be seen. But kudos to AMD for having the right solution for a stupid exascale race that the US wanted to win.
More interesting now is what will happen next with CDNA, as it's so far behind in most commercial AI/ML workloads. I don't see CDNA being profitable with the tiny revenue it generates. Maybe that's the reason they are moving to big HPC hybrid CPU+GPGPU APU chiplets...

Qesa · Nov 17, 2022

Frontier finally submitted HPCG in the November TOP500 list. Despite another 6 months to optimise, they still fell ~12% behind Fugaku.

HPCG - November 2022 | TOP500

www.top500.org

Newguy · Dec 3, 2022

CDNA2/MI200 arch and node talk at HC34

Silent_Buddha · Dec 4, 2022

Qesa said:
Frontier finally submitted HPCG in the November TOP500 list. Despite another 6 months to optimise, they still fell ~12% behind Fugaku.

HPCG - November 2022 | TOP500

www.top500.org

Pretty impressive nonetheless for a supercomputer leveraging GPUs. Fugaku doesn't use GPUs and instead relies on customized ARM cores.

Regards,
SB
