AMD CDNA Discussion Thread

Discussion in 'Architecture and Products' started by Frenetic Pony, Nov 16, 2020.

  1. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    149
    Likes Received:
    183
    Nice results. 3.3x from MI100 isn't looking bad. I'm wondering what's limiting these results, as MI100 isn't scaling with more GPUs?

    And why aren't the ORNL people comparing with their existing Summit system? That would be a much more interesting additional data point. In the worst case this could be not much better than GV100s and we wouldn't know. I don't think that's the case, but it's always fishy to include only very selected competitors in a comparison.
     
    #341 Samwell, Nov 21, 2021
    Last edited: Nov 21, 2021
    Lightman, DavidGraham and pharma like this.
  2. itsmydamnation

    Veteran

    Joined:
    Apr 29, 2007
    Messages:
    1,349
    Likes Received:
    470
    Location:
    Australia
    They mention why they're comparing to Titan in the video itself.
     
    Lightman likes this.
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Service Post:
    Explanation starts at 5:50 (timestamped link).
    • Titan results are from 2018; the system has since been decommissioned
    • QPD-JIT accelerating non-QUDA parts of Chroma
    • Strong scaling constraints on MI100 from 64 to 128 GPUs (7:30)
    • Not so much on MI250X, which are treated as 128 separate GCDs by ORNL (7:50), but calculated as 64 for 166x speedup
     
    Lightman likes this.
  4. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    V100 and A100 numbers:

     
    tinokun, xpea, DegustatoR and 3 others like this.
  5. Granath

    Newcomer

    Joined:
    Jul 26, 2021
    Messages:
    80
    Likes Received:
    81
    Kind of a shitstorm in the comments
     
  6. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,210
    Turns out this is indeed the case!

     
  7. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    They said they're running the same version of the benchmark that Titan ran back in 2018. That doesn't mean results of the same benchmark aren't available from other systems, though.

    Lol, super shady. The software sees and manages 128 distinct GPUs, not 64. You would think scientists are immune to those kinda marketing shenanigans. Either way MI250x isn't looking very impressive in this particular benchmark.
     
  8. itsmydamnation

    Veteran

    Joined:
    Apr 29, 2007
    Messages:
    1,349
    Likes Received:
    470
    Location:
    Australia
    What they said was that they had a speedup target vs. Titan, which is why it was included in the graph. I was replying to why they included an old system in the comparison.
     
    Krteq and trinibwoy like this.
  9. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Around 86% Linpack-Efficiency for MI250X-based LUMI-G:
    https://www.lumi-supercomputer.eu/lumis-full-system-architecture-revealed/
    • The GPU partition will consist of 2560 nodes, each node with one 64 core AMD Trento CPU and four AMD MI250X GPUs.
    • [...]
    • The committed Linpack performance of LUMI-G is 375 Pflop/s
    • [...]
    • A single MI250X card is capable of delivering 42.2 TFLOP/s of performance in the HPL benchmarks. More in-depth performance results for the card can be found on AMD’s website
    And it can run Crysis:
    • For visualization workloads LUMI has 64 Nvidia A40 GPUs
    Unfortunately, it is a bit delayed.
     
    Lightman and Krteq like this.
  10. Qesa

    Newcomer

    Joined:
    Feb 23, 2020
    Messages:
    57
    Likes Received:
    107
  11. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,622
    That's the efficiency of weak scaling over nodes, I guess; the total HPL efficiency should then be calculated like this: 0.86*0.44 = 0.38, or 38%.
    Let's check that: the system has 2560 nodes, each with 4x MI250X GPUs, so 10240 accelerators in total (or 20480 GCDs), and each accelerator has a peak of 95.7 TFLOPS.
    10240*95.7 = 979,968 TFLOPS, or ~980 petaflops. The committed Linpack performance of LUMI-G is 375 Pflop/s, so that's 375/980 = 0.38, or 38% efficiency. As you can see, the numbers add up.
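
    A quick sketch to double-check the arithmetic above (all figures are the ones quoted in this thread and the LUMI announcement; nothing here is measured independently):

```python
# Numbers as quoted: 2560 nodes, 4x MI250X per node, 95.7 TFLOPS FP64 matrix
# peak per accelerator, 375 Pflop/s committed Linpack for LUMI-G.
nodes = 2560
accels_per_node = 4
peak_per_accel_tf = 95.7          # TFLOPS, FP64 matrix peak per MI250X
committed_hpl_tf = 375_000        # TFLOPS (375 Pflop/s)

total_peak_tf = nodes * accels_per_node * peak_per_accel_tf
efficiency = committed_hpl_tf / total_peak_tf
print(f"total peak ~{total_peak_tf / 1000:.0f} Pflop/s, "
      f"HPL efficiency {efficiency:.0%}")
# -> total peak ~980 Pflop/s, HPL efficiency 38%
```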
     
    pharma likes this.
  12. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    I was referring to LUMI-G, going by 42.2 TFlops per MI250X, which is 432 PFlops (2560*4*42.2), as a reference point for HPE Cray EX system-level efficiency. There, it's 86-ish %.

    Relevance: Have a data point to compare the larger HPE Cray EX that will be Frontier to.
     
    Lightman likes this.
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    Does it make sense? The top HPL crunchers are achieving 60-80% of peak flops system-wide. How is it that a single MI250X is only hitting 44%? Do we know that the benchmark was run on the matrix cores?

    Maybe it used the regular cores with a peak of 47.9. That would put the individual card at 42.2/47.9 = 88% efficiency.

    Total system would be 375/490 = 76% efficiency which makes a lot more sense to me.
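
    The two readings can be laid side by side (a sketch using the figures from this thread: 95.7 and 47.9 TFLOPS as the FP64 matrix and vector peaks, 42.2 as AMD's quoted HPL result per card):

```python
# Per-card HPL efficiency under the two peak-rate assumptions discussed above.
hpl_per_card = 42.2    # TFLOPS achieved per MI250X (AMD's number)
matrix_peak = 95.7     # TFLOPS FP64 with matrix cores
vector_peak = 47.9     # TFLOPS FP64 on the regular (vector) cores

print(f"vs matrix peak: {hpl_per_card / matrix_peak:.0%}")  # -> 44%
print(f"vs vector peak: {hpl_per_card / vector_peak:.0%}")  # -> 88%
```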
     
    Lightman likes this.
  14. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,122
    LUMI quotes AMD's own numbers and they did HPL with Matrix engines. That is 44% of peak performance.
     
  15. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,622
    MI100 DP efficiency in HPL was 69%, for reference, so unless MI250X not only doubled FP64 flops on the regular cores but also increased efficiency by a lot, the 88% figure is unlikely to be the case.
     
    T2098, pharma, DavidGraham and 2 others like this.
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    Where does it say AMD’s numbers were from the Matrix engines?
     
    Lightman and no-X like this.
  17. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    ... and I even tried to give a subtle hint at why I did the comparison this way. *sigh* Maybe in bold, all caps? Or a larger font? Or just a repetition: "Relevance: Have a data point to compare the larger HPE Cray EX that will be Frontier to."
     
    T2098 likes this.
  18. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    That would be a bit disappointing.

    FWIW, the Nvidia numbers for A100 that AMD quoted (15.33 TFlops) have to include the Tensor Cores. Why would AMD try and compete without their HPC equivalent of them? Just lowballing?
     
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    AMD is being real generous. Dell only got 14.2 on their A100 test.

    AMD would want to show MI250x in the best light of course. You guys are probably right and it’s running on the Matrix cores. In the end it’s the effective flops that matter and 42.2 >>> 15.5.
     
    Lightman likes this.
  20. Leoneazzurro5

    Regular

    Joined:
    Aug 18, 2020
    Messages:
    335
    Likes Received:
    348
    Flops per W, mainly, as this is one of the main metrics for HPC. So the comparison is 42.2 TFLOPS @ 600W vs 15.5 TFLOPS @ 400W (as the referenced A100 is not the PCIe version, I'll use the nominal TDP for both since we have no data about the real power draw). That's still 70.3 GFLOPs/W vs 38.75 GFLOPs/W, or 1.8x the power efficiency in this particular benchmark. Of course, going slower (than the process would allow) and wider always pays off in this metric, even with the added silicon/packaging costs, which are probably not an issue for this market.
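
    As a sketch, the perf/W comparison above works out like this (nominal TDPs as assumed in the post, not measured power draw):

```python
# Perf/W from the nominal TDPs quoted in the thread (not measured power).
mi250x_tf, mi250x_w = 42.2, 600    # HPL TFLOPS, nominal TDP
a100_tf, a100_w = 15.5, 400        # HPL TFLOPS, nominal TDP (SXM)

mi250x_eff = mi250x_tf * 1000 / mi250x_w   # GFLOPs/W
a100_eff = a100_tf * 1000 / a100_w
print(f"MI250X {mi250x_eff:.1f} GFLOPs/W vs A100 {a100_eff:.2f} GFLOPs/W "
      f"-> {mi250x_eff / a100_eff:.1f}x")
# -> MI250X 70.3 GFLOPs/W vs A100 38.75 GFLOPs/W -> 1.8x
```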
     
    Lightman likes this.
