AMD CDNA Discussion Thread

Not sure how many people watched this video with results from @AMD MI200 (MI250x) prepared by AMD/ORNL and others. Quite good results. Video: https://t.co/xFS12CwNEz #HPC https://t.co/IoDzyZdPc4

Nice results. 3.3x over MI100 isn't looking bad. I'm wondering what's limiting these results, as MI100 isn't scaling with more GPUs?

And why aren't the ORNL people comparing with their existing Summit system? It would be much more interesting as an additional data point. In the worst case this could be not much better than GV100s and we wouldn't know. I don't think that's the case, but it's always fishy to only include a very select set of competitors in the comparison.
 
They mention why they're comparing to Titan in the Video itself.
 
Service Post:
Explanation starts at 5:50 (timestamped link).
  • Results for Titan are from 2018; the system has since been decommissioned
  • QPD-JIT accelerating non-QUDA parts of Chroma
  • Strong scaling constraints on MI100 from 64 to 128 GPUs (7:30)
  • Not so much on MI250X, which are treated as 128 separate GCDs by ORNL (7:50), but calculated as 64 for 166x speedup
 
V100 and A100 numbers:

 
They mention why they're comparing to Titan in the Video itself.

They said they're running the same version of the benchmark as Titan did back in 2018. It doesn't mean that results of the same benchmark aren't available from other systems though.

MI250X...are treated as 128 separate GCDs by ORNL (7:50), but calculated as 64 for 166x speedup

Lol, super shady. The software sees and manages 128 distinct GPUs, not 64. You would think scientists are immune to those kinda marketing shenanigans. Either way MI250x isn't looking very impressive in this particular benchmark.
 
They said they're running the same version of the benchmark as Titan did back in 2018. It doesn't mean that results of the same benchmark aren't available from other systems though.
What they said was that they had a speedup target vs. Titan, which is why it was included in the graph. I was replying to why they included an old system in the comparison.
 
Around 86% Linpack efficiency for the MI250X-based LUMI-G:
https://www.lumi-supercomputer.eu/lumis-full-system-architecture-revealed/
  • The GPU partition will consist of 2560 nodes, each node with one 64 core AMD Trento CPU and four AMD MI250X GPUs.
  • [...]
  • The committed Linpack performance of LUMI-G is 375 Pflop/s
  • [...]
  • A single MI250X card is capable of delivering 42.2 TFLOP/s of performance in the HPL benchmarks. More in-depth performance results for the card can be found on AMD’s website
And it can run Crysis:
  • For visualization workloads LUMI has 64 Nvidia A40 GPUs
Unfortunately, it is a bit delayed.
 
About 44%, no? Its matrix FP64 throughput is 95.7 TFLOPS.
That's the efficiency of weak scaling over nodes, I guess; total HPL efficiency should then be 0.86 * 0.44 = 0.38, or 38%.
Let's check that: the system has 2560 nodes, each with 4 MI250X GPUs, so 10,240 accelerators in total (or 20,480 GCDs, since each card exposes two), each rated at 95.7 TFLOPS.
10,240 * 95.7 = 979,968 TFLOPS, or ~980 petaflops. The committed Linpack performance of LUMI-G is 375 PFlop/s, so that's 375/980 = 0.38, or 38% efficiency. As you can see, the numbers add up.
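The arithmetic above can be sanity-checked with a quick sketch in Python (the 95.7 TFLOPS matrix-FP64 figure per MI250X and the 375 PFlop/s committed Linpack number are the ones quoted in this thread):

```python
# LUMI-G system-wide HPL efficiency against matrix-FP64 peak
nodes = 2560
cards_per_node = 4           # MI250X accelerators per node
matrix_fp64 = 95.7           # TFLOPS per MI250X on the matrix cores
committed_hpl = 375_000      # TFLOPS (375 PFlop/s committed Linpack)

peak = nodes * cards_per_node * matrix_fp64   # ~979,968 TFLOPS
efficiency = committed_hpl / peak
print(f"peak: {peak/1000:.0f} PFLOPS, efficiency: {efficiency:.0%}")
# peak: 980 PFLOPS, efficiency: 38%
```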
 
About 44% no? Its matrix fp64 throughput is 95.7 TFLOPS.
I was referring to LUMI-G, going by 42.2 TFlops per MI250X, which is 432 PFlops (2560*4*42.2), as a reference point for HPE Cray EX system-level efficiency. There, it's 86-ish %.

Relevance: Have a data point to compare the larger HPE Cray EX that will be Frontier to.
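For anyone following along, that 86-ish % falls out of the numbers already posted (a minimal sketch, using AMD's quoted 42.2 TFLOPS per-card HPL result as the baseline):

```python
# System-level scaling efficiency relative to AMD's single-card HPL number
nodes = 2560
cards_per_node = 4
hpl_per_card = 42.2          # TFLOPS, AMD's quoted single-MI250X HPL result
committed_hpl = 375_000      # TFLOPS, LUMI-G's committed Linpack performance

aggregate = nodes * cards_per_node * hpl_per_card   # ~432,128 TFLOPS
print(f"{committed_hpl / aggregate:.1%}")           # ~86.8%
```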
 
Does it make sense? The top HPL crunchers achieve 60-80% of peak flops system-wide. How is it that a single MI250X is only hitting 44%? Do we know that the benchmark was run on the matrix cores?

Maybe it used the regular cores, with a peak of 47.9 TFLOPS. That would put the individual card at 42.2/47.9 = 88% efficiency.

Total system would be 375/490 = 76% efficiency which makes a lot more sense to me.
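The vector-core hypothesis can be put into numbers (a sketch, assuming HPL ran on the regular FP64 units at their 47.9 TFLOPS peak rather than on the matrix engines):

```python
# Hypothesis: HPL ran on the regular (vector) FP64 units, not the matrix cores
vector_fp64 = 47.9           # TFLOPS vector-FP64 peak per MI250X
hpl_per_card = 42.2          # TFLOPS, AMD's quoted single-card HPL result

card_eff = hpl_per_card / vector_fp64              # ~88% per card
system_peak = 2560 * 4 * vector_fp64 / 1000        # ~490 PFLOPS
system_eff = 375 / system_peak                     # ~76% system-wide
print(f"card: {card_eff:.0%}, system: {system_eff:.0%}")
```

Both ratios line up with the typical 60-80% system-wide HPL efficiencies mentioned above.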
 
I was referring to LUMI-G, going by 42.2 TFlops per MI250X, which is 432 PFlops (2560*4*42.2), as a reference point for HPE Cray EX system-level efficiency. There, it's 86-ish %.

Relevance: Have a data point to compare the larger HPE Cray EX that will be Frontier to.

LUMI quotes AMD's own numbers and they did HPL with Matrix engines. That is 44% of peak performance.
 
LUMI quotes AMD's own numbers and they did HPL with Matrix engines. That is 44% of peak performance.
... and I even tried to give a subtle hint at why I did the comparison this way. *sigh* Maybe in bold, all caps? Or a larger font? Or just a repetition: "Relevance: Have a data point to compare the larger HPE Cray EX that will be Frontier to."
 
Total system would be 375/490 = 76% efficiency which makes a lot more sense to me.
That would be a bit disappointing.

FWIW, the Nvidia numbers for the A100 that AMD quoted (15.33 TFlops) have to include the Tensor Cores. Why would AMD try to compete without their HPC equivalent of it? Just lowballing?
 
That would be a bit disappointing.

FWIW, the Nvidia numbers for the A100 that AMD quoted (15.33 TFlops) have to include the Tensor Cores. Why would AMD try to compete without their HPC equivalent of it? Just lowballing?

AMD is being real generous. Dell only got 14.2 on their A100 test.

AMD would want to show MI250x in the best light of course. You guys are probably right and it’s running on the Matrix cores. In the end it’s the effective flops that matter and 42.2 >>> 15.5.
 
AMD is being real generous. Dell only got 14.2 on their A100 test.

AMD would want to show MI250x in the best light of course. You guys are probably right and it’s running on the Matrix cores. In the end it’s the effective flops that matter and 42.2 >>> 15.5.

Flops per W, mainly, as this is one of the main metrics for HPC. So the comparison is 42.2 TFLOPS @ 600 W vs 15.5 TFLOPS @ 400 W (as the referenced A100 is not the PCIe version, I'll use the nominal TDP for both, since we have no data about the real power draw). Still 70.3 GFLOPs/W vs 38.75 GFLOPs/W, or a 1.8x power-efficiency advantage in this particular benchmark. Of course, going slower (than the process would allow) and wider always pays off in this metric, even with the added silicon/packaging costs, which are probably not an issue for this market.
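Spelling out the perf-per-watt comparison (a sketch at nominal TDP, per the caveat above that the real power draw under HPL is unknown):

```python
# HPL perf-per-watt at nominal TDP (real power draw unknown)
mi250x_hpl, mi250x_tdp = 42.2, 600   # TFLOPS, watts (OAM module)
a100_hpl, a100_tdp = 15.5, 400       # TFLOPS, watts (SXM, not PCIe)

mi250x_ppw = mi250x_hpl * 1000 / mi250x_tdp   # GFLOPS/W -> ~70.3
a100_ppw = a100_hpl * 1000 / a100_tdp         # GFLOPS/W -> 38.75
print(f"{mi250x_ppw:.1f} vs {a100_ppw:.2f} GFLOPS/W, "
      f"ratio {mi250x_ppw / a100_ppw:.1f}x")
# 70.3 vs 38.75 GFLOPS/W, ratio 1.8x
```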
 