AMD CDNA Discussion Thread

Discussion in 'Architecture and Products' started by Frenetic Pony, Nov 16, 2020.

  1. Samwell

    Samwell Newcomer

    Nice results. 3.3x from MI100 isn't looking bad. I'm wondering what the limiting factor in these results is, as MI100 isn't scaling with more GPUs.

    And why aren't the ORNL people comparing with their existing Summit system? That would be a much more interesting additional data point. In the worst case this could be not much better than GV100s and we wouldn't know. I don't think that's the case, but it's always fishy to include only very select competitors in a comparison.
     
    Last edited: Nov 21, 2021
    Lightman, DavidGraham and pharma like this.
  2. itsmydamnation

    itsmydamnation Veteran

    They mention why they're comparing to Titan in the Video itself.
     
    Lightman likes this.
  3. CarstenS

    CarstenS Legend Subscriber

    Service Post:
    Explanation starts at 5:50 (timestamped link).
    • Results for Titan are from 2018; the system has since been decommissioned
    • QDP-JIT accelerating the non-QUDA parts of Chroma
    • Strong scaling constraints on MI100 from 64 to 128 GPUs (7:30)
    • Not so much on MI250X, which are treated as 128 separate GCDs by ORNL (7:50), but counted as 64 for the 166x speedup
     
    Lightman likes this.
  4. nAo

    nAo Nutella Nutellae Veteran

    V100 and A100 numbers:

     
    tinokun, xpea, DegustatoR and 3 others like this.
  5. Granath

    Granath Newcomer

    Kind of a shit storm in the comments
     
  6. DavidGraham

    DavidGraham Veteran

    Turns out this is indeed the case!

     
  7. trinibwoy

    trinibwoy Meh Legend

    They said they're running the same version of the benchmark as Titan did back in 2018. That doesn't mean results of the same benchmark aren't available from other systems, though.

    Lol, super shady. The software sees and manages 128 distinct GPUs, not 64. You'd think scientists were immune to that kind of marketing shenanigans. Either way, MI250X isn't looking very impressive in this particular benchmark.
     
  8. itsmydamnation

    itsmydamnation Veteran

    What they said was that they had a speed-up target vs. Titan, which is why it was included in the graph. I was replying to why they included an old system in the comparison.
     
    Krteq and trinibwoy like this.
  9. CarstenS

    CarstenS Legend Subscriber

    Around 86% Linpack-Efficiency for MI250X-based LUMI-G:
    https://www.lumi-supercomputer.eu/lumis-full-system-architecture-revealed/
    • The GPU partition will consist of 2560 nodes, each node with one 64 core AMD Trento CPU and four AMD MI250X GPUs.
    • [...]
    • The committed Linpack performance of LUMI-G is 375 Pflop/s
    • [...]
    • A single MI250X card is capable of delivering 42.2 TFLOP/s of performance in the HPL benchmarks. More in-depth performance results for the card can be found on AMD’s website
    And it can run Crysis:
    • For visualization workloads LUMI has 64 Nvidia A40 GPUs
    Unfortunately, it is a bit delayed.
     
    Lightman and Krteq like this.
  10. Qesa

    Qesa Newcomer

  11. OlegSH

    OlegSH Regular

    That's the efficiency of weak scaling over nodes, I guess; the total HPL efficiency should then be calculated like this: 0.86 * 0.44 = 0.38, or 38%.
    Let's check that: the system has 2560 nodes, each with 4x MI250X GPUs, so 10,240 accelerators in total (or 20,480 GCDs); each accelerator peaks at 95.7 TFLOPS.
    10240 * 95.7 = 979,968 TFLOPS, or ~980 PFLOPS. The committed Linpack performance of LUMI-G is 375 PFLOPS, so that's 375/980 = 0.38, or 38% efficiency. As you can see, the numbers add up.
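    The arithmetic above can be sanity-checked with a quick script (all figures are taken from the posts in this thread; the 95.7 TFLOPS per MI250X is AMD's quoted peak FP64 matrix rate):

```python
# Sanity check of the LUMI-G HPL efficiency numbers quoted above.
nodes = 2560
gpus_per_node = 4                  # MI250X accelerators per node
peak_per_gpu_tflops = 95.7         # AMD peak FP64 matrix rate per MI250X

peak_total_tflops = nodes * gpus_per_node * peak_per_gpu_tflops
committed_linpack_pflops = 375     # LUMI-G committed HPL performance

efficiency = committed_linpack_pflops * 1000 / peak_total_tflops
print(f"Theoretical peak: {peak_total_tflops / 1000:.0f} PFLOPS")
print(f"HPL efficiency vs. matrix peak: {efficiency:.0%}")
```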
     
    pharma likes this.
  12. CarstenS

    CarstenS Legend Subscriber

    I was referring to LUMI-G, going by 42.2 TFLOPS per MI250X, which is 432 PFLOPS (2560*4*42.2), as a reference point for HPE Cray EX system-level efficiency. There, it's 86-ish %.

    Relevance: Have a data point to compare the larger HPE Cray EX that will be Frontier to.
     
    Lightman likes this.
  13. trinibwoy

    trinibwoy Meh Legend

    Does it make sense? The top HPL crunchers are achieving 60-80% of peak flops system-wide. How is it that a single MI250X is only hitting 44%? Do we know that the benchmark was run on the matrix cores?

    Maybe it used the regular cores with a peak of 47.9 TFLOPS. That would put the individual card at 42.2/47.9 = 88% efficiency.

    The total system would be 375/490 = 76% efficiency, which makes a lot more sense to me.
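    The two hypotheses (HPL run on the matrix cores vs. on the regular vector units) can be compared with a short sketch; the peak figures are the ones quoted in this thread:

```python
# Per-card HPL efficiency under the two hypotheses discussed above:
# (a) AMD's 42.2 TFLOPS result used the matrix cores (95.7 TFLOPS peak),
# (b) it used the regular vector units (47.9 TFLOPS peak).
hpl_per_card_tflops = 42.2
for label, peak in [("matrix cores", 95.7), ("vector units", 47.9)]:
    print(f"{label}: {hpl_per_card_tflops / peak:.0%} per-card efficiency")

# System level: 375 PFLOPS committed vs. 10,240 cards at the vector peak.
system_peak_pflops = 10240 * 47.9 / 1000   # ~490 PFLOPS
print(f"system vs. vector peak: {375 / system_peak_pflops:.0%}")
```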
     
    Lightman likes this.
  14. troyan

    troyan Regular

    LUMI quotes AMD's own numbers, and AMD ran HPL with the Matrix engines. That is 44% of peak performance.
     
  15. OlegSH

    OlegSH Regular

    MI100's FP64 efficiency in HPL was 69%, for reference, so unless MI250X not only doubled FP64 flops on the regular cores but also increased efficiency by a lot, 88% efficiency is unlikely to be the case.
     
    T2098, pharma, DavidGraham and 2 others like this.
  16. trinibwoy

    trinibwoy Meh Legend

    Where does it say AMD’s numbers were from the Matrix engines?
     
    Lightman and no-X like this.
  17. CarstenS

    CarstenS Legend Subscriber

    ... and I even tried to give a subtle hint at why I did the comparison this way. *sigh* Maybe in bold, all caps? Or a larger font? Or just a repetition: "Relevance: Have a data point to compare the larger HPE Cray EX that will be Frontier to."
     
    T2098 likes this.
  18. CarstenS

    CarstenS Legend Subscriber

    That would be a bit disappointing.

    FWIW, the Nvidia A100 number that AMD quoted (15.33 TFlops) has to include the Tensor Cores. Why would AMD try to compete without their HPC equivalent of those? Just lowballing?
     
  19. trinibwoy

    trinibwoy Meh Legend

    AMD is being real generous. Dell only got 14.2 on their A100 test.

    AMD would want to show MI250X in the best light, of course. You guys are probably right and it's running on the Matrix cores. In the end it's the effective flops that matter, and 42.2 >>> 15.5.
     
    Lightman likes this.
  20. Leoneazzurro5

    Leoneazzurro5 Regular

    Flops per watt, mainly, as this is one of the main metrics for HPC. So the comparison is 42.2 TFLOPS @ 600 W vs. 15.5 TFLOPS @ 400 W (as the referenced A100 is not the PCIe version, I'll take the nominal TDP for both, since we have no data on the real power draw). That's still 70.3 GFLOPS/W vs. 38.75 GFLOPS/W, or 1.8x the power efficiency in this particular benchmark. Of course, going slower (than the process would allow) and wider always pays off in this metric, even with the added silicon/packaging costs, which are probably not an issue for this market.
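    The perf-per-watt figures above work out like this (nominal TDPs and HPL results as quoted in the post; a rough sketch, not measured power draw):

```python
# Perf-per-watt comparison using the HPL results and nominal TDPs above.
mi250x = {"tflops": 42.2, "watts": 600}   # OAM module, nominal TDP
a100   = {"tflops": 15.5, "watts": 400}   # SXM variant, nominal TDP

for name, gpu in [("MI250X", mi250x), ("A100", a100)]:
    print(f"{name}: {gpu['tflops'] * 1000 / gpu['watts']:.1f} GFLOPS/W")

ratio = (mi250x["tflops"] / mi250x["watts"]) / (a100["tflops"] / a100["watts"])
print(f"efficiency ratio: {ratio:.1f}x")   # ~1.8x in MI250X's favor
```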
     
    Lightman likes this.