AMD CDNA Discussion Thread

CarstenS · Nov 18, 2020

Erinyes said:
Wouldn't the 4th be for connectivity to the CPU? So you'd have each GPU connected to the CPU and 3 other GPUs? There still might be more for redundancy as you suggest, though they do have all HBM PHYs enabled unlike Nvidia with the A100. I do expect at least one more further cut down part later in the lifecycle. They had two with the Vega 20 chip, and that was much smaller (though on a new node at the time).

Good point! I'm sure though if they can use IF over PCIe for connecting to the CPU, for normal systems they are using PCIe 4.0 for CPU <->GPU-Hive.

Deleted member 2197 · Nov 18, 2020

madhatter said:
MI100 is not primarily geared towards AI/ML.

AMD's MI100 specs indicate otherwise. Benchmark results should be interesting.

Erinyes · Nov 18, 2020

CarstenS said:
Good point! I'm sure though if they can use IF over PCIe for connecting to the CPU, for normal systems they are using PCIe 4.0 for CPU <->GPU-Hive.

For Frontier, they're said to be using a "custom" CPU with IF links for CPU to GPU. This is probably an EPYC Milan with a new IO die and the same Zen 3 chiplets. They had mentioned that it would be available to other customers a well and I guess this is what they'd be pushing to HPC users for Instinct.

Deleted member 2197 · Nov 18, 2020

MI100 Block Diagram

AMD Instinct MI100 Enhanced Compute Unit With SIMD View

AMD-Radeon-Instinct-MI100-Enhanced-Compute-Unit-with-SIMD-View.jpg

https://www.servethehome.com/amd-radeon-instinct-mi100-32gb-cdna-gpu-launched/

yuri · Nov 18, 2020

Erinyes said:
For Frontier, they're said to be using a "custom" CPU with IF links for CPU to GPU. This is probably an EPYC Milan with a new IO die and the same Zen 3 chiplets. They had mentioned that it would be available to other customers a well and I guess this is what they'd be pushing to HPC users for Instinct.

It's rumored that Frontier uses Zen 3 cores together with a custom IO die featuring stacked memory (TSMC X3D) - code name "Trento". So enabling IF links to accelerators sounds rather easy.

Newguy · Nov 19, 2020

CDNA whitepaper https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf

pcchen · Nov 19, 2020

While many HPL jobs still heavily rely on FP64, with the current emphasis on AI computation, people are starting to look on mixed precision algorithms. For example, the HPL-AI benchmark utilizes FP32 and FP16 units in various iterations to improve its solutions to FP64 accuracy. It achieves more than 4X performance compared to traditional FP64 LINPACK on Fugaku, the fastest supercomputer today. Of course, this is not necessarily applicable to all algorithms, but the performance gain is surely going to get the attention of some researchers to consider a similar approach.

Erinyes · Nov 19, 2020

yuri said:
It's rumored that Frontier uses Zen 3 cores together with a custom IO die featuring stacked memory (TSMC X3D) - code name "Trento". So enabling IF links to accelerators sounds rather easy.

I had read about that as well, but stacked memory isn't very exciting tbh. Stacked silicon is the real deal. And would stacked HBM offer enough capacity? Or will it have both HBM and DRAM? I think the Fujitsu chip does have a similar setup.

Frenetic Pony · Nov 19, 2020

Erinyes said:
I had read about that as well, but stacked memory isn't very exciting tbh. Stacked silicon is the real deal. And would stacked HBM offer enough capacity? Or will it have both HBM and DRAM? I think the Fujitsu chip does have a similar setup.

Stacked silicon just screams thermal problems to me. I suspect it's why stacked memory is so heavily invested in. Very little thermal headroom problems with stacking, and bandwidth is a definitive issue coming up more and more. See both RDNA2 getting capped by it's LLC at high resolutions and even that taking up a huge die space, as well as Apple's M1 likely getting capped in multi core scenarios by bandwidth. I'd say it's more relevant and sexier than you might give it credit for.

Erinyes · Nov 19, 2020

Frenetic Pony said:
Stacked silicon just screams thermal problems to me. I suspect it's why stacked memory is so heavily invested in. Very little thermal headroom problems with stacking, and bandwidth is a definitive issue coming up more and more. See both RDNA2 getting capped by it's LLC at high resolutions and even that taking up a huge die space, as well as Apple's M1 likely getting capped in multi core scenarios by bandwidth. I'd say it's more relevant and sexier than you might give it credit for.

Thermal density is definitely a concern and we'll have to see what solutions the industry takes to tackle it. Stacked memory does serve a purpose and there will be a number of cases where it is very advantageous. However interconnect speed and power between different dies on the same package has been growing at a rapid rate and that's one of the issues that stacked silicon is intended to solve.

no-X · Nov 19, 2020

DavidGraham said:
These chips are directed towards AI markets primarily, it's as if you are saying MI100 is irrelevant.

A100 is over 3 times faster in AI workloads, while being behind by 15% or so in traditional FP32/FP64 workloads. A100 has 33% to 66% more memory bandwidth, a very important factor for reaching peak throughput more often.

A100 vs M100
FP32: 19.5 vs 23.5
FP64: 9.5 vs 11.5
FP16: 624 vs 184
BF16: 624 vs 92
TB/s: 1.6 to 2 vs 1.2
Matrix FP32: 310 vs 46
Matrix FP64: 19.5 vs nill?

And that's not even the full A100 core, it has 20 CUs disabled, 108 out of 128 CUs.

The FP16 value for A100 is wrong. It's not 624, but 78. So less than MI100. Source.
The Matrix FP32 value for A100 is also wrong. Not 310 but 19,5. So less than MI100, again. (310 is only for TF32 = reduced precision).

manux · Nov 19, 2020

no-X said:
The FP16 value for A100 is wrong. It's not 624, but 78. So less than MI100. Source.
The Matrix FP32 value for A100 is also wrong. Not 310 but 19,5. So less than MI100, again. (310 is only for TF32 = reduced precision).

You can't discount tensor core performance if the goal is to measure ai/dnn training/inference performance. TF32 is neat format in between of fp16/fp32. Holy grail is to use as little precision as possible while getting good results. tf32 is unproven but might be very useful if full fp32 accuracy is not needed while fp16 either is not accurate enough or there isn't resources to optimize. Coolness in tf32 is that other than accuracy it's drop in replacement for fp32. No engineering needed to take advantage of tf32. Another neat thing about tf32 is that input, output and accumulation are done in full fp32 precision. afaik. only multiplies are done in lower accuracy

Flops alone are only so important. Closest to real thing is to compare mlperf results. mlperf shows how those flops are put into use in real life scenarios while also providing categories from single chip to datacenter. I believe mlperf is the basis many organizations use to evaluate how different hw's perform/scale and what to buy/rent.

Fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services.

https://mlperf.org/
https://www.nvidia.com/en-us/data-center/mlperf/

DavidGraham · Nov 19, 2020

no-X said:
The FP16 value for A100 is wrong. It's not 624, but 78. So less than MI100. Source.

That's Matrix FP16 for both A100 and MI100.

no-X · Nov 19, 2020

manux said:
You can't discount tensor core performance if the goal is to measure ai/dnn training/inference performance.

I'm not discounting tensor core performance, I'm correcting misplaced values. DavidGraham stated "Matrix FP32", but provided number for TF32 precision, not FP32 precision. TF32 precision may be a "neat format", but it's not identical to FP32, because of lower precision. The formats aren't interchangeable.

CarstenS · Nov 19, 2020

Oh noes!! Not a single-number-denominator any more to judge HPC accelerators by. Who would've thought.

Voxilla · Nov 19, 2020

DavidGraham said:
These chips are directed towards AI markets primarily, it's as if you are saying MI100 is irrelevant.

Matrix FP32: 310 vs 46

What about:
Matrix TF32 310 vs FP32 46 ?
It's easy to confuse, but that obviously was the purpose.
(Similar as QLED vs OLED kind of marketing trick)

Esrever · Nov 19, 2020

The peak performance of any of the numbers mean very little anyways. People who buy these are going to test their custom code on it.

Would be interesting even to see how well something basic like ImageNet runs for a comparison.

CarstenS · Dec 19, 2020

CDNA ISA reference Guide is out - actually, a couple of days already:
https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf
(had not time yet to skim through it)

DmitryKo · Jan 1, 2021

ROCm 4.0 release notes
https://github.com/RadeonOpenCompute/ROCm#INTRODUCING-AMD-INSTINCT-MI100

INTRODUCING AMD INSTINCT MI100

The AMD Instinct™ MI100 accelerator is the world’s fastest HPC GPU, and a culmination of the AMD CDNA architecture, with all-new Matrix Core Technology, and AMD ROCm™ open ecosystem to deliver new levels of performance, portability, and productivity. AMD CDNA is an all-new GPU architecture from AMD to drive accelerated computing into the era of exascale computing. The new architecture augments scalar and vector processing with new Matrix Core Engines and adds Infinity Fabric™ technology to scale up to larger systems. The open ROCm ecosystem puts customers in control and is a robust, mature platform that is easy to develop for and capable of running the most critical applications. The overall result is that the MI100 is the first GPU to break the 10TFLOP/s FP64 barrier designed as the steppingstone to the next generation of Exascale systems that will deliver pioneering discoveries in machine learning and scientific computing.

Key Features of AMD Instinct™ MI100

Important features of the AMD Instinct™ MI100 accelerator include:

Extended matrix core engine with Matrix Fused Multiply-Add (MFMA) for mixed-precision arithmetic and operates on KxN matrices (FP32, FP16, BF16, Int8)
Added native support for the bfloat16 data type
3 Infinity fabric connections per GPU enable a fully connected group of 4 GPUs in a ‘hive’

Matrix Core Engines and GFX908 Considerations

The AMD CDNA architecture builds on GCN’s foundation of scalars and vectors and adds matrices while simultaneously adding support for new numerical formats for machine learning and preserving backward compatibility for any software written for the GCN architecture. These Matrix Core Engines add a new family of wavefront-level instructions, the Matrix Fused MultiplyAdd or MFMA. The MFMA family performs mixed-precision arithmetic and operates on KxN matrices using four different types of input data: 8-bit integers (INT8), 16-bit half-precision FP (FP16), 16-bit brain FP (bf16), and 32-bit single-precision (FP32). All MFMA instructions produce either a 32-bit integer (INT32) or FP32 output, which reduces the likelihood of overflowing during the final accumulation stages of matrix multiplication.

On nodes with gfx908, MFMA instructions are available to substantially speed up matrix operations. This hardware feature is used only in matrix multiplications functions in rocBLAS and supports only three base types f16_r, bf16_r, and f32_r.

For half precision (f16_r and bf16_r) GEMM, use the function rocblas_gemm_ex, and set the compute_type parameter to f32_r.
For single precision (f32_r) GEMM, use the function rocblas_sgemm.
For single precision complex (f32_c) GEMM, use the function rocblas_cgemm.

References

For more information about bfloat16, see https://rocblas.readthedocs.io/en/master/usermanual.html
For more details about AMD Instinct™ MI100 accelerator key features, see https://www.amd.com/system/files/documents/instinct-mi100-brochure.pdf
For more information about the AMD Instinct MI100 accelerator, refer to the following sources:
- AMD CDNA whitepaper at https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf
- MI100 datasheet at https://www.amd.com/system/files/documents/instinct-mi100-brochure.pdf
AMD Instinct MI100/CDNA1 Shader Instruction Set Architecture (Dec. 2020) – This document describes the current environment, organization, and program state of AMD CDNA “Instinct MI100” devices. It details the instruction set and the microcode formats native to this family of processors that are accessible to programmers and compilers.
https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf

yuri · Feb 19, 2021

A next-gen CDNA has started showing at LLVM. It's IP name is GFX90A.

Full-rate FP64 (!?) with ability to run packed FP32, new stuff like ability to split thread groups somehow and a dose of new instructions.

Haven't got time to go through that yet.

Initial commit:
https://github.com/llvm/llvm-project/commit/a8d9d50762c42d726274d3f1126ec97ff96e2a22

// EDIT
There is a description of the tgsplit:
Enable/disable generating code that assumes work-groups are launched in threadgroup split mode. When enabled the waves of a work-group may be launched in different CUs.

The wavefronts for a single work-group are executed in the same CU but may be executed by different SIMDs. The exception is when in tgsplit execution mode when the wavefronts may be executed by different SIMDs in different CUs.

Each CU has a single LDS memory shared by the wavefronts of the work-groups executing on it. The exception is when in tgsplit execution mode when no LDS is allocated as wavefronts of the same work-group can be in different CUs.

EDIT #2
There seems to be 512 VGPRs compared to GCN/RDNA's 256. They probably merged the VGPRs and AccVGPRs together, given existence of ACCUM_OFFSET.

AMD CDNA Discussion Thread

CarstenS

Moderator

Deleted member 2197

Guest

Erinyes

Deleted member 2197

Guest

yuri

Newguy

pcchen

Moderator

Erinyes

Frenetic Pony

Erinyes

no-X

manux

DavidGraham

no-X

CarstenS

Moderator

Voxilla

Esrever

CarstenS

Moderator

DmitryKo

yuri