AMD CDNA Discussion Thread

Wouldn't the 4th be for connectivity to the CPU? So you'd have each GPU connected to the CPU and 3 other GPUs? There still might be more for redundancy as you suggest, though they do have all HBM PHYs enabled unlike Nvidia with the A100. I do expect at least one more further cut down part later in the lifecycle. They had two with the Vega 20 chip, and that was much smaller (though on a new node at the time).
Good point! I'm not sure though whether they can use IF over PCIe for connecting to the CPU; for normal systems they are using PCIe 4.0 for the CPU <-> GPU hive.
 

For Frontier, they're said to be using a "custom" CPU with IF links for CPU to GPU. This is probably an EPYC Milan with a new IO die and the same Zen 3 chiplets. They had mentioned that it would be available to other customers as well, and I guess this is what they'd be pushing to HPC users for Instinct.
 
[Image: AMD Radeon Instinct MI100 Block Diagram]

[Image: AMD Instinct MI100 Enhanced Compute Unit with SIMD View]

https://www.servethehome.com/amd-radeon-instinct-mi100-32gb-cdna-gpu-launched/
 
For Frontier, they're said to be using a "custom" CPU with IF links for CPU to GPU. This is probably an EPYC Milan with a new IO die and the same Zen 3 chiplets. They had mentioned that it would be available to other customers as well, and I guess this is what they'd be pushing to HPC users for Instinct.
It's rumored that Frontier uses Zen 3 cores together with a custom IO die featuring stacked memory (TSMC X3D) - code name "Trento". So enabling IF links to accelerators sounds rather easy.
 
While many HPL jobs still rely heavily on FP64, with the current emphasis on AI computation people are starting to look at mixed-precision algorithms. For example, the HPL-AI benchmark uses FP32 and FP16 units in an iterative scheme to refine its solution to FP64 accuracy. It achieves more than 4X the performance of traditional FP64 LINPACK on Fugaku, the fastest supercomputer today. Of course, this is not applicable to every algorithm, but the performance gain is surely going to get some researchers to consider a similar approach.
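
To make the idea concrete, here is a minimal CPU-only sketch of mixed-precision iterative refinement, the general technique behind HPL-AI: the expensive solve is done in FP32 while the residual and the accumulated solution are kept in FP64, so the result converges toward FP64 accuracy. The 3x3 system, the iteration count and the naive Gaussian elimination solver are all made up for illustration; the real benchmark uses FP16/FP32 factorizations plus GMRES-based refinement on enormous systems.

```cpp
// Toy mixed-precision iterative refinement: FP32 solves, FP64 residuals.
#include <cstdio>
#include <cmath>
#include <vector>

// Solve A*x = b in single precision with naive Gaussian elimination
// (no pivoting; fine for this diagonally dominant toy matrix).
static std::vector<float> solve_fp32(std::vector<float> A, std::vector<float> b) {
    const int n = static_cast<int>(b.size());
    for (int k = 0; k < n; ++k) {
        for (int i = k + 1; i < n; ++i) {
            float m = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    std::vector<float> x(n);
    for (int i = n - 1; i >= 0; --i) {
        float s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i * n + j] * x[j];
        x[i] = s / A[i * n + i];
    }
    return x;
}

int main() {
    const int n = 3;
    // FP64 "ground truth" problem; diagonally dominant so the FP32 solve behaves.
    std::vector<double> A = {10, 1, 2,
                              1, 12, 3,
                              2,  3, 14};
    std::vector<double> b = {13, 16, 19};   // exact solution is x = (1, 1, 1)

    std::vector<double> x(n, 0.0);
    for (int it = 0; it < 5; ++it) {
        // 1) residual r = b - A*x, computed in FP64
        std::vector<double> r(n);
        for (int i = 0; i < n; ++i) {
            double s = b[i];
            for (int j = 0; j < n; ++j) s -= A[i * n + j] * x[j];
            r[i] = s;
        }
        // 2) correction d from A*d = r, solved in FP32 (the cheap step)
        std::vector<float> Af(A.begin(), A.end()), rf(r.begin(), r.end());
        std::vector<float> d = solve_fp32(Af, rf);
        // 3) update the FP64 solution
        for (int i = 0; i < n; ++i) x[i] += static_cast<double>(d[i]);

        double rnorm = 0.0;
        for (double v : r) rnorm += v * v;
        std::printf("iter %d  ||r|| = %.3e\n", it, std::sqrt(rnorm));
    }
    std::printf("x = (%f, %f, %f)\n", x[0], x[1], x[2]);
    return 0;
}
```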
 
It's rumored that Frontier uses Zen 3 cores together with a custom IO die featuring stacked memory (TSMC X3D) - code name "Trento". So enabling IF links to accelerators sounds rather easy.

I had read about that as well, but stacked memory isn't very exciting tbh. Stacked silicon is the real deal. And would stacked HBM offer enough capacity? Or will it have both HBM and DRAM? I think the Fujitsu chip does have a similar setup.
 

Stacked silicon just screams thermal problems to me. I suspect that's why stacked memory is so heavily invested in: it has far fewer thermal headroom problems, and bandwidth is a definite issue coming up more and more. See both RDNA2 getting capped by its LLC at high resolutions (and even that taking up a huge amount of die space), as well as Apple's M1 likely getting capped in multi-core scenarios by bandwidth. I'd say it's more relevant and sexier than you might give it credit for.
 

Thermal density is definitely a concern, and we'll have to see what solutions the industry takes to tackle it. Stacked memory does serve a purpose, and there will be a number of cases where it is very advantageous. However, the interconnect speed and power required between different dies on the same package have been growing at a rapid rate, and that's one of the issues stacked silicon is intended to solve.
 
These chips are directed primarily at AI markets; it's as if you are saying the MI100 is irrelevant.

A100 is over 3 times faster in AI workloads, while being behind by 15% or so in traditional FP32/FP64 workloads. A100 has 33% to 66% more memory bandwidth, a very important factor for reaching peak throughput more often.

A100 vs MI100 (TFLOPS except where noted)
FP32: 19.5 vs 23.5
FP64: 9.5 vs 11.5
FP16: 624 vs 184
BF16: 624 vs 92
Memory bandwidth (TB/s): 1.6 to 2 vs 1.2
Matrix FP32: 310 vs 46
Matrix FP64: 19.5 vs nil?

And that's not even the full A100 chip: it has 20 SMs disabled, with 108 out of 128 enabled.
The FP16 value for A100 is wrong. It's not 624 but 78, so less than the MI100. Source.
The Matrix FP32 value for A100 is also wrong: not 310 but 19.5, so again less than the MI100 (310 is only for TF32, i.e. reduced precision).
 

You can't discount tensor core performance if the goal is to measure AI/DNN training/inference performance. TF32 is a neat format in between FP16 and FP32. The holy grail is to use as little precision as possible while still getting good results. TF32 is unproven but might be very useful when full FP32 accuracy is not needed but FP16 either isn't accurate enough or there aren't resources to optimize for it. The cool thing about TF32 is that, other than accuracy, it's a drop-in replacement for FP32: no engineering needed to take advantage of it. Another neat thing about TF32 is that input, output and accumulation are done in full FP32 precision; AFAIK only the multiplies are done at lower accuracy.
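
For what it's worth, here's a tiny sketch of the TF32 idea under some simplifying assumptions of mine: it emulates TF32 by truncating an FP32 value to 10 mantissa bits (real hardware rounds rather than truncates), multiplies the truncated inputs, and accumulates the products in full FP32, which is the "drop-in for FP32" behaviour described above.

```cpp
// Emulate the TF32 multiply path: FP32 exponent range, ~10 mantissa bits
// on the multiply inputs, accumulation in full FP32.
#include <cstdint>
#include <cstdio>
#include <cstring>

static float to_tf32(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0xFFFFE000u;          // drop the low 13 of the 23 mantissa bits (truncation, not rounding)
    std::memcpy(&x, &u, sizeof u);
    return x;
}

int main() {
    const int n = 4;
    float a[n] = {1.000123f, 3.141592f, 0.333333f, 2.718281f};
    float b[n] = {0.999877f, 0.318310f, 3.000003f, 0.367879f};

    float dot_fp32 = 0.0f, dot_tf32 = 0.0f;
    for (int i = 0; i < n; ++i) {
        dot_fp32 += a[i] * b[i];                    // plain FP32 multiply-accumulate
        dot_tf32 += to_tf32(a[i]) * to_tf32(b[i]);  // TF32-style: reduced-precision inputs, FP32 accumulate
    }
    std::printf("fp32: %.7f  tf32-emulated: %.7f\n", dot_fp32, dot_tf32);
    return 0;
}
```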

Flops alone are only so important. The closest thing to the real world is to compare MLPerf results. MLPerf shows how those flops are put to use in real-life scenarios while also providing categories from single chip to datacenter. I believe MLPerf is the basis many organizations use to evaluate how different hardware performs and scales, and what to buy or rent.

Fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services.
https://mlperf.org/
https://www.nvidia.com/en-us/data-center/mlperf/
 
You can't discount tensor core performance if the goal is to measure AI/DNN training/inference performance.
I'm not discounting tensor core performance; I'm correcting misplaced values. DavidGraham stated "Matrix FP32" but provided the number for TF32 precision, not FP32 precision. TF32 may be a "neat format", but it's not identical to FP32 because of its lower precision. The formats aren't interchangeable.
 
ROCm 4.0 release notes
https://github.com/RadeonOpenCompute/ROCm#INTRODUCING-AMD-INSTINCT-MI100

INTRODUCING AMD INSTINCT MI100

The AMD Instinct™ MI100 accelerator is the world’s fastest HPC GPU, and a culmination of the AMD CDNA architecture, with all-new Matrix Core Technology, and AMD ROCm™ open ecosystem to deliver new levels of performance, portability, and productivity. AMD CDNA is an all-new GPU architecture from AMD to drive accelerated computing into the era of exascale computing. The new architecture augments scalar and vector processing with new Matrix Core Engines and adds Infinity Fabric™ technology to scale up to larger systems. The open ROCm ecosystem puts customers in control and is a robust, mature platform that is easy to develop for and capable of running the most critical applications. The overall result is that the MI100 is the first GPU to break the 10TFLOP/s FP64 barrier designed as the steppingstone to the next generation of Exascale systems that will deliver pioneering discoveries in machine learning and scientific computing.

Key Features of AMD Instinct™ MI100

Important features of the AMD Instinct™ MI100 accelerator include:
  • Extended matrix core engine with Matrix Fused Multiply-Add (MFMA) for mixed-precision arithmetic and operates on KxN matrices (FP32, FP16, BF16, Int8)
  • Added native support for the bfloat16 data type
  • 3 Infinity fabric connections per GPU enable a fully connected group of 4 GPUs in a ‘hive’
Matrix Core Engines and GFX908 Considerations

The AMD CDNA architecture builds on GCN’s foundation of scalars and vectors and adds matrices while simultaneously adding support for new numerical formats for machine learning and preserving backward compatibility for any software written for the GCN architecture. These Matrix Core Engines add a new family of wavefront-level instructions, the Matrix Fused Multiply-Add or MFMA. The MFMA family performs mixed-precision arithmetic and operates on KxN matrices using four different types of input data: 8-bit integers (INT8), 16-bit half-precision FP (FP16), 16-bit brain FP (bf16), and 32-bit single-precision (FP32). All MFMA instructions produce either a 32-bit integer (INT32) or FP32 output, which reduces the likelihood of overflowing during the final accumulation stages of matrix multiplication.

On nodes with gfx908, MFMA instructions are available to substantially speed up matrix operations. This hardware feature is used only in matrix multiplications functions in rocBLAS and supports only three base types f16_r, bf16_r, and f32_r.
  • For half precision (f16_r and bf16_r) GEMM, use the function rocblas_gemm_ex, and set the compute_type parameter to f32_r.
  • For single precision (f32_r) GEMM, use the function rocblas_sgemm.
  • For single precision complex (f32_c) GEMM, use the function rocblas_cgemm.
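
As a rough illustration of the rocBLAS path described in the list above (not an official example), a half-precision GEMM with compute_type set to f32_r might look like the sketch below. The wrapper name and the assumption that dA/dB/dC already hold device data are mine; allocation and error checking are omitted.

```cpp
#include <rocblas.h>   // rocBLAS C API

// FP16 inputs, FP32 compute type, so rocBLAS can use the MFMA path on gfx908.
void half_precision_gemm(rocblas_handle handle,
                         int m, int n, int k,
                         const rocblas_half* dA,   // m x k, column-major, on device
                         const rocblas_half* dB,   // k x n, on device
                         rocblas_half* dC)         // m x n, used as both C and D
{
    const float alpha = 1.0f, beta = 0.0f;         // host scalars in the compute type
    rocblas_gemm_ex(handle,
                    rocblas_operation_none, rocblas_operation_none,
                    m, n, k,
                    &alpha,
                    dA, rocblas_datatype_f16_r, m,
                    dB, rocblas_datatype_f16_r, k,
                    &beta,
                    dC, rocblas_datatype_f16_r, m,
                    dC, rocblas_datatype_f16_r, m,  // D: output buffer
                    rocblas_datatype_f32_r,         // compute_type = f32_r, as noted above
                    rocblas_gemm_algo_standard, 0, 0);
}
```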
 
A next-gen CDNA has started showing up in LLVM. Its IP name is GFX90A.

Full-rate FP64 (!?), with the ability to run packed FP32, new stuff like the ability to split thread groups somehow, and a dose of new instructions.

Haven't got time to go through that yet.

Initial commit:
https://github.com/llvm/llvm-project/commit/a8d9d50762c42d726274d3f1126ec97ff96e2a22

// EDIT
There is a description of the tgsplit feature:
Enable/disable generating code that assumes work-groups are launched in threadgroup split mode. When enabled the waves of a work-group may be launched in different CUs.

The wavefronts for a single work-group are executed in the same CU but may be executed by different SIMDs. The exception is when in tgsplit execution mode when the wavefronts may be executed by different SIMDs in different CUs.

Each CU has a single LDS memory shared by the wavefronts of the work-groups executing on it. The exception is when in tgsplit execution mode when no LDS is allocated as wavefronts of the same work-group can be in different CUs.
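
A small HIP sketch of why that matters, under my own assumptions (it isn't from the commit): a kernel whose waves communicate through __shared__ (LDS) memory inherently needs its whole work-group on one CU, whereas a kernel that avoids LDS has no such dependency and could tolerate its waves landing on different CUs in tgsplit mode.

```cpp
#include <hip/hip_runtime.h>

// Block-wide reduction via LDS: all waves of the work-group must share one
// LDS allocation, i.e. they must sit on the same CU.
// Assumes a launch with blockDim.x == 256 (power of two).
__global__ void block_sum_lds(const float* in, float* out, int n) {
    __shared__ float lds[256];                 // per-work-group LDS buffer
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    lds[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                           // all waves must see the same LDS
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) lds[tid] += lds[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, lds[0]);
}

// Same result with no LDS and no cross-wave sharing: nothing here cares
// whether the group's waves end up on one CU or several.
__global__ void block_sum_no_lds(const float* in, float* out, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) atomicAdd(out, in[gid]);
}
```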

EDIT #2
There seem to be 512 VGPRs compared to GCN/RDNA's 256. They probably merged the VGPRs and AccVGPRs together, given the existence of ACCUM_OFFSET.
 