CDNA 3 whitepaper https://www.amd.com/content/dam/amd...-docs/white-papers/amd-cdna-3-white-paper.pdf
Chips and cheese piece on CDNA 3 https://chipsandcheese.com/2023/12/17/amds-cdna-3-compute-architecture/
Haven't read through it properly yet, but some of the headline stuff: the 2100MHz peak clock is very high for 304 CUs (1.23x higher than last gen), they doubled the width of the matrix units, and Int8 is now full rate (it ran at the same rate as FP16 on CDNA 2), so that's 6.8x faster. TF32 and FP8 were added, which we already knew. FP64 is 81.7 TFLOPS, up from 47.9 (1.7x), for classic HPC stuff. Cache size and bandwidth are a bit bonkers: the 256MB LLC/Infinity Cache (the whitepaper uses both names) has 17.2TB/s total bandwidth (128x 2MB slices, 64 bytes/cycle per slice for 8192 bytes/cycle total, which works out to 17.2TB/s at 2.1GHz).
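The cache numbers above all follow from the slice layout and clock, so here's a quick sanity check of the arithmetic (figures taken from the whitepaper as quoted above; nothing here is measured):

```python
# Sanity-check the MI300X Infinity Cache figures quoted above:
# 128 slices x 2MB = 256MB total, 64 bytes/cycle per slice, 2.1GHz clock.
slices = 128
slice_size_mb = 2
bytes_per_cycle_per_slice = 64
clock_hz = 2.1e9

total_cache_mb = slices * slice_size_mb                # 256 MB
bytes_per_cycle = slices * bytes_per_cycle_per_slice   # 8192 bytes/cycle
bandwidth_tb_s = bytes_per_cycle * clock_hz / 1e12     # ~17.2 TB/s

print(total_cache_mb, bytes_per_cycle, round(bandwidth_tb_s, 1))
```

The FP64 ratio checks out the same way: 81.7 / 47.9 ≈ 1.7x.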
In general it seems like a big improvement over CDNA 2/MI250X, and I'm sure the unified memory in MI300A will be big for efficiency/perf in workloads that can take advantage of it; that removes a major inefficiency barrier.