AMD CDNA Discussion Thread

Discussion in 'Architecture and Products' started by Frenetic Pony, Nov 16, 2020.

  1. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,280
    Likes Received:
    2,952
    Location:
    Germany
    Good point! I'm sure, though, that even if they can use IF over PCIe for connecting to the CPU, normal systems are using PCIe 4.0 for the CPU <-> GPU hive link.
     
  2. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    4,109
    Likes Received:
    3,230
    AMD's MI100 specs indicate otherwise. Benchmark results should be interesting.
     
  3. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    For Frontier, they're said to be using a "custom" CPU with IF links for CPU-to-GPU. This is probably an EPYC Milan with a new IO die and the same Zen 3 chiplets. They had mentioned that it would be available to other customers as well, and I guess this is what they'd be pushing to HPC users for Instinct.
     
    Lightman likes this.
  4. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    4,109
    Likes Received:
    3,230
    Lightman, Krteq and Alexko like this.
  5. yuri

    Regular Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    263
    Likes Received:
    270
    It's rumored that Frontier uses Zen 3 cores together with a custom IO die featuring stacked memory (TSMC X3D) - code name "Trento". So enabling IF links to accelerators sounds rather easy.
     
    Lightman likes this.
  6. Newguy

    Regular Newcomer

    Joined:
    Nov 10, 2014
    Messages:
    258
    Likes Received:
    116
  7. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,888
    Likes Received:
    374
    Location:
    Taiwan
    While many HPC jobs still rely heavily on FP64, with the current emphasis on AI computation, people are starting to look at mixed-precision algorithms. For example, the HPL-AI benchmark uses FP32 and FP16 units in an iterative scheme to refine its solution to FP64 accuracy. It achieves more than 4X the performance of traditional FP64 LINPACK on Fugaku, the fastest supercomputer today. Of course, this is not necessarily applicable to all algorithms, but the performance gain is surely going to get some researchers to consider a similar approach. A minimal sketch of the idea follows.
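
    A rough numpy sketch of the trick (mine, not the actual HPL-AI code; the matrix, tolerance and iteration cap are made up for illustration): do the expensive solve in low precision, then polish with residual corrections computed in FP64.

    ```python
    import numpy as np

    # Mixed-precision iterative refinement: cheap low-precision solve,
    # then refine toward FP64 accuracy using FP64 residuals.
    rng = np.random.default_rng(0)
    n = 512
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned system
    b = rng.standard_normal(n)

    # "Cheap" solve, entirely in FP32 (stand-in for the FP16/FP32 GEMMs
    # that HPL-AI runs on the accelerator).
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

    for it in range(10):
        r = b - A @ x                                 # residual in full FP64
        if np.linalg.norm(r) / np.linalg.norm(b) < 1e-14:
            break
        # Correction solved in low precision again, applied in FP64.
        x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

    print(it, "iterations, rel. residual:",
          np.linalg.norm(b - A @ x) / np.linalg.norm(b))
    ```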
     
    Lightman, pharma, sonen and 1 other person like this.
  8. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    I had read about that as well, but stacked memory isn't very exciting tbh. Stacked silicon is the real deal. And would stacked HBM offer enough capacity? Or will it have both HBM and DRAM? I think the Fujitsu chip does have a similar setup.
     
  9. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    631
    Likes Received:
    323
    Stacked silicon just screams thermal problems to me. I suspect that's why stacked memory is so heavily invested in: memory stacking has very few thermal headroom problems, and bandwidth is a definitive issue coming up more and more. See RDNA2 getting capped by its LLC at high resolutions, even with that cache taking up a huge chunk of die space, as well as Apple's M1 likely getting capped in multi-core scenarios by bandwidth. I'd say it's more relevant and sexier than you might give it credit for.
     
  10. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    Thermal density is definitely a concern, and we'll have to see what solutions the industry adopts to tackle it. Stacked memory does serve a purpose, and there will be a number of cases where it is very advantageous. However, the speed and power cost of interconnects between different dies on the same package have been growing at a rapid rate, and that's one of the issues stacked silicon is intended to solve.
     
  11. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,370
    Likes Received:
    353
    The FP16 value for A100 is wrong. It's not 624 but 78, so less than MI100. Source.
    The Matrix FP32 value for A100 is also wrong: not 310 but 19.5, so less than MI100 again. (310 is only for TF32 = reduced precision.)
     
  12. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    2,547
    Likes Received:
    1,682
    Location:
    Earth
    You can't discount tensor core performance if the goal is to measure AI/DNN training/inference performance. TF32 is a neat format in between fp16 and fp32. The holy grail is to use as little precision as possible while still getting good results. TF32 is unproven, but it might be very useful when full fp32 accuracy is not needed and fp16 either isn't accurate enough or there aren't the resources to optimize for it. The cool thing about TF32 is that, accuracy aside, it's a drop-in replacement for fp32; no engineering is needed to take advantage of it. Another neat thing is that input, output and accumulation are done in full fp32 precision; afaik only the multiplies are done at lower precision. Something like the sketch below.
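
    To make that concrete, here is a rough numpy emulation of a TF32-style GEMM (my own sketch, not NVIDIA's implementation; the helper names are made up). TF32 keeps FP32's 8-bit exponent but only 10 mantissa bits, and products are accumulated at full precision:

    ```python
    import numpy as np

    def round_to_tf32(x):
        # TF32 keeps FP32's 8-bit exponent but only 10 mantissa bits; here
        # we simply truncate the low 13 mantissa bits (real hardware rounds).
        bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFFE000)).view(np.float32)

    def tf32_matmul(a, b):
        # The multiplies see the reduced-precision operands...
        a_t, b_t = round_to_tf32(a), round_to_tf32(b)
        # ...but products are accumulated at full precision (FP64 stands in
        # for the FP32 accumulator to keep the sketch short).
        return (a_t.astype(np.float64) @ b_t.astype(np.float64)).astype(np.float32)

    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256)).astype(np.float32)
    b = rng.standard_normal((256, 256)).astype(np.float32)
    print("max deviation from plain FP32 matmul:",
          np.abs(tf32_matmul(a, b) - a @ b).max())
    ```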

    Flops alone are only so important. The closest thing to the real world is to compare MLPerf results. MLPerf shows how those flops are put to use in real-life scenarios, while also providing categories from single chip to datacenter scale. I believe MLPerf is the basis many organizations use to evaluate how different hardware performs/scales and what to buy/rent.

    https://mlperf.org/
    https://www.nvidia.com/en-us/data-center/mlperf/
     
    #32 manux, Nov 19, 2020
    Last edited: Nov 19, 2020
    pharma and DavidGraham like this.
  13. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,445
    Likes Received:
    3,975
    That's Matrix FP16 for both A100 and MI100.
     
  14. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,370
    Likes Received:
    353
    I'm not discounting tensor core performance, I'm correcting misplaced values. DavidGraham stated "Matrix FP32" but provided the number for TF32 precision, not FP32 precision. TF32 may be a "neat format", but it's not identical to FP32 because of its lower precision. The formats aren't interchangeable.
     
    ethernity, Erinyes and Lightman like this.
  15. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,280
    Likes Received:
    2,952
    Location:
    Germany
    Oh noes!! Not a single-number-denominator any more to judge HPC accelerators by. Who would've thought.
     
    Lightman and ethernity like this.
  16. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    829
    Likes Received:
    478
    What about:
    Matrix TF32 310 vs FP32 46?
    It's easy to confuse the two, but that was obviously the purpose.
    (Similar to the QLED vs OLED kind of marketing trick.)
     
    #36 Voxilla, Nov 19, 2020
    Last edited: Nov 19, 2020
    ethernity likes this.
  17. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    822
    Likes Received:
    616
    The peak performance numbers mean very little anyway. People who buy these are going to test their custom code on them.

    It would even be interesting to see how well something basic like ImageNet runs for a comparison.
     
    ethernity and CarstenS like this.
  18. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,280
    Likes Received:
    2,952
    Location:
    Germany
    Ethatron, tinokun, BRiT and 5 others like this.
  19. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    862
    Likes Received:
    970
    Location:
    55°38′33″ N, 37°28′37″ E
    ROCm 4.0 release notes
    https://github.com/RadeonOpenCompute/ROCm#INTRODUCING-AMD-INSTINCT-MI100

    INTRODUCING AMD INSTINCT MI100

    The AMD Instinct™ MI100 accelerator is the world’s fastest HPC GPU, and a culmination of the AMD CDNA architecture, with all-new Matrix Core Technology, and AMD ROCm™ open ecosystem to deliver new levels of performance, portability, and productivity. AMD CDNA is an all-new GPU architecture from AMD to drive accelerated computing into the era of exascale computing. The new architecture augments scalar and vector processing with new Matrix Core Engines and adds Infinity Fabric™ technology to scale up to larger systems. The open ROCm ecosystem puts customers in control and is a robust, mature platform that is easy to develop for and capable of running the most critical applications. The overall result is that the MI100 is the first GPU to break the 10TFLOP/s FP64 barrier designed as the steppingstone to the next generation of Exascale systems that will deliver pioneering discoveries in machine learning and scientific computing.

    Key Features of AMD Instinct™ MI100

    Important features of the AMD Instinct™ MI100 accelerator include:
    • Extended matrix core engine with Matrix Fused Multiply-Add (MFMA) for mixed-precision arithmetic and operates on KxN matrices (FP32, FP16, BF16, Int8)
    • Added native support for the bfloat16 data type
    • 3 Infinity fabric connections per GPU enable a fully connected group of 4 GPUs in a ‘hive’
    Matrix Core Engines and GFX908 Considerations

    The AMD CDNA architecture builds on GCN’s foundation of scalars and vectors and adds matrices while simultaneously adding support for new numerical formats for machine learning and preserving backward compatibility for any software written for the GCN architecture. These Matrix Core Engines add a new family of wavefront-level instructions, the Matrix Fused Multiply-Add or MFMA. The MFMA family performs mixed-precision arithmetic and operates on KxN matrices using four different types of input data: 8-bit integers (INT8), 16-bit half-precision FP (FP16), 16-bit brain FP (bf16), and 32-bit single-precision (FP32). All MFMA instructions produce either a 32-bit integer (INT32) or FP32 output, which reduces the likelihood of overflowing during the final accumulation stages of matrix multiplication.

    On nodes with gfx908, MFMA instructions are available to substantially speed up matrix operations. This hardware feature is used only in matrix multiplications functions in rocBLAS and supports only three base types f16_r, bf16_r, and f32_r.
    • For half precision (f16_r and bf16_r) GEMM, use the function rocblas_gemm_ex, and set the compute_type parameter to f32_r.
    • For single precision (f32_r) GEMM, use the function rocblas_sgemm.
    • For single precision complex (f32_c) GEMM, use the function rocblas_cgemm.
    References
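
    To illustrate the MFMA contract described in the notes above (low-precision inputs, FP32 accumulation so the final sums don't overflow), a minimal numpy sketch; the function name is made up and this is not the actual ISA semantics:

    ```python
    import numpy as np

    def mfma_like_gemm(a_lo, b_lo, c_fp32):
        # Widen the low-precision inputs and multiply-accumulate in FP32,
        # mirroring MFMA's FP16/BF16/INT8-in, FP32-out contract.
        return c_fp32 + a_lo.astype(np.float32) @ b_lo.astype(np.float32)

    rng = np.random.default_rng(0)
    a = rng.standard_normal((64, 64)).astype(np.float16)
    b = rng.standard_normal((64, 64)).astype(np.float16)
    c = np.zeros((64, 64), dtype=np.float32)

    d = mfma_like_gemm(a, b, c)
    # Rounding the result back to FP16 shows the precision (and headroom
    # against overflow) that the FP32 accumulator preserves:
    loss = np.abs(d - d.astype(np.float16).astype(np.float32)).max()
    print("max loss if the output were kept in FP16:", loss)
    ```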
     
    #39 DmitryKo, Jan 1, 2021
    Last edited: Jan 6, 2021
    Lightman, Krteq, DavidGraham and 3 others like this.