AMD CDNA: MI300 & MI400 (Analysis, Speculation and Rumors in 2024)

A couple of things from looking over the whitepaper again: they doubled the vector (CU) FP32 rate?
Yea.
More VOPD shenanigans?
No, it's a more straightforward dual-issue.
GFX940 has no VOPD (the ISA manual is out).
HWS and ACE have been known to be firmware running on embedded cores for a few generations, dating back to at least when "HWS" was introduced (Polaris?).
Tonga, I think.
while a single agent GFX940 configuration can have only 1 HWS by nature
I'm not sure that's correct, given even the monolithic dies have two HWS per die.
 
MI325X is the HBM3E CDNA3 refresh coming this year, with 288GB of memory, 6TB/s of bandwidth, and increased compute.

MI350 is CDNA4 on 3nm, coming in 2025, with a claimed 35x inference improvement (FP4/FP6) compared to CDNA3.
1.5x more memory and 1.2x "AI compute" TFLOPS compared to B200 (quick sanity check on the memory ratio below).

MI400 is CDNA Next coming in 2026.
AMD Computex 2024 stream link
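
As a quick sanity check on those ratios, a minimal sketch assuming B200 ships with 192 GB of HBM3e (NVIDIA's GTC 2024 figure, not stated in this thread): the "1.5x more memory" claim lines up with the 288GB quoted above.

```python
# Back-of-the-envelope check on AMD's Computex comparison.
# Assumption: B200 has 192 GB of HBM3e (from NVIDIA's GTC 2024
# announcement, not from this thread).
b200_memory_gb = 192
mi350_memory_gb = 1.5 * b200_memory_gb  # AMD's "1.5x more memory" claim
print(f"Implied MI350 memory: {mi350_memory_gb:.0f} GB")  # 288 GB, matching the 288GB above
```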
 
AMD’s MI300X Outperforms NVIDIA’s H100 for LLM Inference

Edit: Tensorwave seems to have chosen the slowest, least optimized inference engine for the H100.
The thing is that NVDA has tripled H100 performance through huge software optimizations since it was announced. AMD is not competing against a stationary target like they did with Intel CPUs...

[Image: MLPerf Training 4.0 NVIDIA LLM performance]
 
Based on authenticated and published test results alone, Intel is the second-best option according to the MLPerf tests.
It's a marketing attempt by Tensorwave to generate sales; they did something similar a few months ago.

Pretty unusual for AMD to miss both MLPerf tests with hardware available. They seem to want to move on pretty quickly from MI300x.
 
The thing is that NVDA has tripled H100 performance through huge software optimizations since it was announced. AMD is not competing against a stationary target like they did with Intel CPUs...
AFAICT, they have a significant ~27% performance gain for 512 H100s (iso-GPU-count) from June 2023 to June 2024, but they are also comparing 2023 to 2024 numbers at 3.2x performance with 3.2x GPUs to claim "perfect scaling", even though the June 2024 numbers for 3584 H100s would be noticeably higher than the 2023 baseline, so they do not actually achieve perfect scaling, but more like ~85%, which is pretty standard...
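
A minimal sketch of that arithmetic, using only the figures above (3.2x GPUs, 3.2x performance, ~27% iso-GPU-count software gain at 512 GPUs). How much of that software gain carries over to the larger baseline run is an assumption, so the efficiency is computed for a few values: the full 27% would imply ~79%, and the ~85% figure corresponds to an effective software gain of roughly 18% at scale.

```python
# Scaling-efficiency arithmetic behind the "perfect scaling" claim.
# gpu_ratio and perf_ratio come from NVIDIA's 2023-vs-2024 comparison;
# sw_gain (how much faster the *same* GPU count got from software alone)
# is an assumption, measured only at 512 GPUs (~27%).
gpu_ratio = 3.2   # 2024 GPU count / 2023 GPU count
perf_ratio = 3.2  # 2024 performance / 2023 performance

for sw_gain in (0.27, 0.18, 0.0):
    # Fair 2024-vs-2024 comparison: discount the claimed speedup by the
    # software gain, then divide by the increase in GPU count.
    efficiency = (perf_ratio / (1 + sw_gain)) / gpu_ratio
    print(f"software gain {sw_gain:.0%} -> scaling efficiency {efficiency:.0%}")

# software gain 27% -> scaling efficiency 79%
# software gain 18% -> scaling efficiency 85%
# software gain 0% -> scaling efficiency 100%
```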

EDIT: ... to bring it back to AMD/MI300X ...

For training, AMD is competing against something "close enough" to a fixed target in my opinion; +27% from software in a year is nothing amazing in this industry, and there have been a lot of optimisations in PyTorch and generic things like Flash Attention 2 which also apply to AMD. I wouldn't be surprised if NVIDIA-specific perf improvements for training were more like 10-15%, which is kinda what you'd expect in that timeframe for a new-ish architecture.

For inference, both NVIDIA *and* AMD are extremely fast-moving targets, partly due to inference framework improvements, and NVIDIA's unique advantage here is TensorRT, which AMD doesn't have an in-house alternative to. But there are 3rd-party optimised inference frameworks that can be compared, e.g. MK1 Flywheel, which is used in TensorWave's blog post - however, comparing that to vllm on NVIDIA is a bit silly. Either you compare the same framework (e.g. vllm or MK1 on both NVIDIA and AMD) or you compare the best you can find on AMD (e.g. MK1 Flywheel?) to the best you can find on NVIDIA (e.g. TensorRT). So TensorWave's blog post is definitely not a fair comparison, and there's a bit of a pattern from them of not-very-rigorous claims imo...
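
For reference, an iso-framework comparison could be as simple as running the same script unchanged on both boxes, since vllm ships both CUDA and ROCm backends. A minimal sketch; the model name, batch size, and tensor_parallel_size below are placeholders, not what TensorWave actually ran.

```python
# Minimal iso-framework throughput probe: run this file unchanged on an
# 8x H100 (CUDA) node and an 8x MI300X (ROCm) node, so only the hardware
# and its vllm backend differ. Model and workload are placeholders.
import time
from vllm import LLM, SamplingParams

prompts = ["Summarize the history of GPU compute."] * 64   # placeholder batch
sampling = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding

llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=8)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} generated tokens in {elapsed:.1f}s "
      f"-> {generated / elapsed:.0f} tok/s")
```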
 
I have to hand it to NVIDIA's product & marketing teams, this is so confusingly misleading that I honestly don't know what they are trying to say...

AFAICT, they have a significant ~27% performance gain for 512 H100s (iso-GPU-count) from June 2023 to June 2024, but they are also comparing 2023 to 2024 numbers at 3.2x performance with 3.2x GPUs to claim "perfect scaling", even though the June 2024 numbers for 3584 H100s would be noticeably higher than the 2023 baseline, so they do not actually achieve perfect scaling...
I don't find too much confusing about the statement. "Nearly perfect scaling" and "perfect scaling" are not the same thing, and they also mentioned the 3.2x increase was due to larger scale and software improvements.
The result of this work is a 3.2x performance increase in just a year coming from a larger scale, and significant software improvements. This combination also delivered nearly perfect scaling — as the number of GPUs increased by 3.2x, so did the delivered performance.

Anyway, not the thread for this discussion.
 