AMD CDNA Discussion Thread

Discussion in 'Architecture and Products' started by Frenetic Pony, Nov 16, 2020.

  1. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    742
    Likes Received:
    419
    AMD just announced its CDNA-architecture-based "MI100" card. Source

    Relevant specs:

    Design Full-height, Dual-slot, 10.5 in. long
    Compute Units 120
    Stream Processors 7,680
    FP64 TFLOPs (Peak) 11.5
    FP32 TFLOPs (Peak) 23.1
    FP32 Matrix TFLOPs (Peak) 46.1
    FP16/FP16 Matrix TFLOPs (Peak) 184.6
    INT4/INT8 TOPS (Peak) 184.6
    bfloat16 TFLOPs (Peak) 92.3
    HBM2 ECC Memory 32 GB
    Memory Interface 4,096-bit
    Memory Clock 1.2 GHz
    Memory Bandwidth 1.23 TB/s
    PCIe Support Gen4
    Infinity Fabric Links/Bandwidth 3 / 276 GB/s
    TDP 300 W
    Cooling Passively cooled
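
    As a quick sanity check, the headline numbers above hang together. A minimal sketch (Python); note the ~1.5 GHz boost clock is inferred from the published FP32 figure rather than listed in the table:

    Code:
    # Sanity-check the MI100 spec table above. The boost clock is not
    # listed; it is implied by peak FP32 at 2 FLOPs per SP per clock (FMA).
    cus = 120
    sps = cus * 64                        # 64 stream processors per CU -> 7,680

    clock_hz = 23.1e12 / (sps * 2)        # implied boost clock
    print(f"implied boost clock: {clock_hz / 1e9:.2f} GHz")     # ~1.50

    # FP64 runs at half the FP32 rate on CDNA
    print(f"peak FP64: {sps * clock_hz / 1e12:.2f} TFLOPs")     # ~11.55

    # Memory: 4,096-bit bus, HBM2 at 1.2 GHz -> 2.4 Gbps per pin (DDR)
    print(f"bandwidth: {4096 / 8 * 2.4e9 / 1e12:.2f} TB/s")     # ~1.23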
     
  2. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    742
    Likes Received:
    419
    Of note, of course, is that the leaks here were way off, at least those pinning it as a straight Vega derivative with a ton of standard FP32 flops. Instead, this appears to be a straight-up matrix/machine-learning competitor, at least for model training, aimed at challenging Nvidia and others. And it has, at least theoretically, the performance to do so if it's priced right.
     
    #2 Frenetic Pony, Nov 16, 2020
    Last edited: Nov 16, 2020
    Leovinus and Lightman like this.
  3. Leovinus

    Newcomer

    Joined:
    May 31, 2019
    Messages:
    142
    Likes Received:
    73
    Location:
    Sweden
    I seem to recall there was some speculation on whether CDNA would be a short-lived product, precisely because it was assumed to be mostly Vega. Would this more radical departure speak for CDNA continuing in parallel with RDNA? And would there be some degree of symbiosis in having two architectures developed in parallel like this?
     
  4. rSkip

    Newcomer

    Joined:
    Jan 10, 2012
    Messages:
    17
    Likes Received:
    35
    Location:
    Shanghai
    From the die shot and the placement of the Infinity links, my guess is that MI100 uses 3x Infinity Fabric links + WAFL(?), with 3 more links left unused. Just guessing, I could be totally wrong.
    links.png
     
    pharma likes this.
  5. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    From the die shot pics, I've seen estimates of 709-713 mm² posted on Twitter. That's pretty large, but still smaller than the A100. AMD enabling 120 CUs out of 128 is not bad at all and points to reasonably good yields. We'll probably see a slightly further cut-down part later once they're in full production.

    Capacity is only 32GB, but just like Nvidia, I'm guessing they'll have a part with HBM2E and double the capacity in a quarter or two.
    Not sure what the leaks were, but anyone who's been following AMD's path in HPC and its supercomputer wins would have known it's not just a rehashed Vega. This is going into the Frontier supercomputer next year, plus a number of other smaller installations. This is AMD's push back into the lucrative HPC segment, and it certainly looks like they've invested significantly in it (looking at the readiness of ROCm 4.0 as well).

    Pricing is apparently around $6,400 as per a link from @CarstenS in another thread, but these products aren't usually sold at retail. AMD themselves claim between 1.8x and 2.1x higher performance per dollar than the Nvidia A100.
    AMD has already committed to CDNA 2, which is going into the El Capitan supercomputer in 2022. The key feature they've announced is a next-gen Infinity Fabric which allows cache coherency with EPYC CPUs. Not sure about symbiosis, but the whole point of CDNA and RDNA is to maximize the utility of each with features specific to the segments they're intended for. Certain technologies such as Infinity Cache, the fabric, etc. may of course be shared across both, along with some amount of physical design, but otherwise it seems they will be developed in parallel.
    Well, the official specs from AMD say 3 links - https://www.amd.com/en/products/server-accelerators/instinct-mi100

    And given that they're promoting 4-GPU configurations using the 3 links, I don't expect there are more going unused, or they'd be using them.
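
    For what it's worth, 3 links per GPU is exactly what a fully connected 4-GPU hive requires, since full connectivity between n GPUs needs n-1 links each. A trivial sketch of the counting:

    Code:
    # Full connectivity between n GPUs needs n-1 links per GPU,
    # so 3 IF links each makes 4 GPUs the natural hive size.
    from itertools import combinations

    n_gpus = 4
    links = list(combinations(range(n_gpus), 2))    # one link per GPU pair
    print(len(links), "links total")                # 6
    for g in range(n_gpus):
        print(f"GPU {g} terminates {sum(g in link for link in links)} links")  # 3 each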
     
    Alexko, BRiT, Leovinus and 1 other person like this.
  6. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,295
    Likes Received:
    7,248
    Yes but can it play Crysis?
    Narrator's voice: It could not play Crysis.


    BTW, why are they still calling this a GPU if it can't even render 2D graphics?
     
    Leovinus likes this.
  7. bridgman

    Newcomer Subscriber

    Joined:
    Dec 1, 2007
    Messages:
    62
    Likes Received:
    123
    Location:
    Toronto-ish
    G is for GEMM :yes:... what were you thinking?

    Seriously, it's a fair question... we do refer to it as an "accelerator" frequently but it is still more GPU-ish than non-GPU-accelerator-ish and there's an argument for using familiar terminology for a while.
     
    #7 bridgman, Nov 17, 2020
    Last edited: Nov 17, 2020
    ethernity and Lightman like this.
  8. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    Well... I suppose it still looks like a GPU, at least.

    They have dropped the Radeon branding though. Now it's just AMD Instinct.
     
  9. Digidi

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    396
    Likes Received:
    226
    Same performance as GA100 with 100 W less power, and both are on 7nm...

    AMD also has ~100 mm² less die area.
     
  10. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    GA100 does have more memory and a larger portion of the chip disabled, though. The 400W is the (max) power of the SXM version, and there are PCIe versions of GA100 at less than 300W, so it's not like it's 100W less in all conditions. AMD certainly looks to be ahead in FP64, but Nvidia seems to have the advantage in AI/ML-related ops.

    Given the large die sizes, a chiplet-style approach certainly seems to be the future for these kinds of chips. Rumours already point towards this for Nvidia Hopper and CDNA2.
     
  11. Digidi

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    396
    Likes Received:
    226
    You should compare the FLOPS figures. At 400 W Nvidia delivers 9.7 TFLOPs of FP64; AMD delivers 11.5 TFLOPs from a 300 W card. OK, AMD has 8 GB less RAM, that's a point, but the rest is all faster. Only matrix multiplication at FP16/FP32 is higher on Nvidia.

    So for AI go with Nvidia; for the rest, please buy AMD.
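
    To put rough perf-per-watt numbers on that, here's a minimal sketch from the board figures quoted in this thread (vector FP64 only; the A100's tensor-core FP64 path, mentioned below, changes the picture):

    Code:
    # Vector FP64 per watt from the rated board powers discussed here.
    boards = {
        "A100 (SXM, 400 W)": (9.7, 400),    # TFLOPs, watts
        "MI100 (300 W)":     (11.5, 300),
    }
    for name, (tflops, watts) in boards.items():
        print(f"{name}: {tflops / watts * 1000:.1f} GFLOPs/W")
    # ~24.2 GFLOPs/W vs ~38.3 GFLOPs/W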
     
  12. Qesa

    Newcomer

    Joined:
    Feb 23, 2020
    Messages:
    28
    Likes Received:
    48
    Peak FP64 throughput is a poor predictor of actual performance when comparing between different architectures. This is as true for GPGPU as it is for graphics.
     
  13. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    As I mentioned, 400W is simply the max power draw, not the actual draw. It was likely rated that high to ensure that the 80GB variants are drop-in replacements in existing infrastructure; 80GB of HBM can consume a significant amount of power. And there are PCIe versions with a lower rated power draw as well. I already mentioned AMD is better at FP64, but you cannot discount NV's performance in the other metrics, or the CUDA ecosystem.
     
    Lightman and pharma like this.
  14. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    822
    Likes Received:
    616
    Maybe it can run Crysis through software rendering with GPGPU.
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,556
    Likes Received:
    4,729
    Location:
    Well within 3d
    The whitepaper also indicates a few things, like the apparent doubling of the register file to 128KB per SIMD.
    There is a separate class of registers mentioned for matrix operations, although I'm not clear if that's part of the total register file diagrammed.
    Matrix instructions are also diagrammed as being a separate issue type from vector instructions. I'm assuming it's still one instruction per clock per wave, although that wouldn't rule out a longer cadence for matrix instructions or concurrent issue from neighboring waves. Perhaps there are limits based on how many operands are being sourced from the register file.
    Some code commits mentioned that while it is possible to mix matrix and vector math, there was concern about thermal throttling being more likely if that happened.
    Clocks seem to reflect that while the matrix operations may be more efficient per FLOP, there's a lot of hardware being exercised across the chip.
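
    Assuming CDNA keeps GCN's four SIMDs per CU (which the whitepaper's CU diagram suggests), that doubling adds up to a striking amount of register capacity chip-wide:

    Code:
    # Chip-wide register capacity implied by 128 KB per SIMD.
    # The 4-SIMDs-per-CU figure is an assumption carried over from GCN.
    kb_per_simd, simds_per_cu, cus = 128, 4, 120
    per_cu_kb = kb_per_simd * simds_per_cu                  # 512 KB per CU
    print(f"{per_cu_kb} KB/CU, ~{per_cu_kb * cus / 1024:.0f} MB chip-wide")  # ~60 MB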
     
    Lightman likes this.
  16. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,371
    Likes Received:
    320
    Location:
    San Francisco
    A100 FP64 peak via tensor cores is 19.5TF/s.
     
    Jensen Krage and pharma like this.
  17. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,412
    Likes Received:
    3,202
    Location:
    Germany
    I would guess they'd have a 4th one at least as a spare, since these cards don't seem to be sold (yet) in a harvested edition and you'd need full connectivity in data centers. If one of your IF links is broken, you'd have to throw away the whole chip.
    Maybe the attached file helps clarify; it's the original used in my CDNA article here.
     

    Attached Files:

    Lightman and fellix like this.
  18. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,590
    Likes Received:
    4,317
    These chips are directed primarily at AI markets; it's as if you are saying the MI100 is irrelevant.

    A100 is over 3 times faster in AI workloads, while being behind by 15% or so in traditional FP32/FP64 workloads. A100 also has 33% to 66% more memory bandwidth, a very important factor for reaching peak throughput more often.

    A100 vs MI100 (A100 FP16/BF16 matrix figures are with sparsity):
    FP32: 19.5 vs 23.1
    FP64: 9.7 vs 11.5
    FP16: 624 vs 184.6
    BF16: 624 vs 92.3
    TB/s: 1.6-2.0 vs 1.2
    Matrix FP32: 312 vs 46.1
    Matrix FP64: 19.5 vs nil?

    And that's not even the full A100 chip; it has 20 SMs disabled, with 108 out of 128 enabled.
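
    For the record, the ratios behind those two claims, computed from the figures above (again taking the A100 FP16/BF16 entries as the with-sparsity numbers):

    Code:
    # Per-metric ratios from the comparison list above.
    specs = {          # (A100, MI100)
        "FP16": (624, 184.6),
        "BF16": (624, 92.3),
        "FP32": (19.5, 23.1),
        "FP64": (9.7, 11.5),
    }
    for metric, (a, m) in specs.items():
        lead = "A100" if a > m else "MI100"
        print(f"{metric}: {lead} ahead by {max(a, m) / min(a, m):.2f}x")
    # FP16 3.38x / BF16 6.76x for A100; FP32 1.18x / FP64 1.19x for MI100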
     
    #18 DavidGraham, Nov 18, 2020
    Last edited: Nov 18, 2020
    pharma likes this.
  19. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    Wouldn't the 4th be for connectivity to the CPU? So you'd have each GPU connected to the CPU and to 3 other GPUs. There might still be more links for redundancy as you suggest, though they do have all HBM PHYs enabled, unlike Nvidia with the A100. I do expect at least one further cut-down part later in the lifecycle; they had two with the Vega 20 chip, and that was a much smaller die (though on a new node at the time).
     
  20. madhatter

    Newcomer

    Joined:
    Jul 23, 2020
    Messages:
    29
    Likes Received:
    22
    MI100 is not primarily geared towards AI/ML.
     