AMD CDNA Discussion Thread

Discussion in 'Architecture and Products' started by Frenetic Pony, Nov 16, 2020.

  1. Samwell

    Samwell Newcomer

    Nice results. 3.3x from MI100 isn't looking bad. I'm wondering what the limiting factor in these results is, as MI100 isn't scaling with more GPUs.

    And why aren't the ORNL people comparing with their existing Summit system? That would be a much more interesting additional data point. In the worst case this could be not much better than GV100s and we wouldn't know. I don't think that's the case, but it's always fishy to include only very select competitors in a comparison.
     
    Last edited: Nov 21, 2021
    Lightman, DavidGraham and pharma like this.
  2. itsmydamnation

    itsmydamnation Veteran

    They mention why they're comparing to Titan in the Video itself.
     
    Lightman likes this.
  3. CarstenS

    CarstenS Legend Subscriber

    Service Post:
    Explanation starts at 5:50 (timestamped link).
    • Results for Titan are from 2018; the system has since been decommissioned
    • QDP-JIT accelerating the non-QUDA parts of Chroma
    • Strong scaling constraints on MI100 from 64 to 128 GPUs (7:30)
    • Not so much on MI250X, which are treated as 128 separate GCDs by ORNL (7:50), but counted as 64 for the 166x speedup
     
    Lightman likes this.
  4. nAo

    nAo Nutella Nutellae Veteran

    V100 and A100 numbers:

     
    tinokun, xpea, DegustatoR and 3 others like this.
  5. Granath

    Granath Newcomer

    Kind of a shit storm in the comments
     
  6. DavidGraham

    DavidGraham Veteran

    Turns out this is indeed the case!

     
  7. trinibwoy

    trinibwoy Meh Legend

    They said they're running the same version of the benchmark as Titan did back in 2018. That doesn't mean results of the same benchmark aren't available from other systems, though.

    Lol, super shady. The software sees and manages 128 distinct GPUs, not 64. You'd think scientists were immune to that kind of marketing shenanigans. Either way, MI250X isn't looking very impressive in this particular benchmark.
     
  8. itsmydamnation

    itsmydamnation Veteran

    What they said was that they had a speed-up target vs. Titan, which is why it was included in the graph. I was replying to why they included an old system in the comparison.
     
    Krteq and trinibwoy like this.
  9. CarstenS

    CarstenS Legend Subscriber

    Around 86% Linpack-Efficiency for MI250X-based LUMI-G:
    https://www.lumi-supercomputer.eu/lumis-full-system-architecture-revealed/
    • The GPU partition will consist of 2560 nodes, each node with one 64 core AMD Trento CPU and four AMD MI250X GPUs.
    • [...]
    • The committed Linpack performance of LUMI-G is 375 Pflop/s
    • [...]
    • A single MI250X card is capable of delivering 42.2 TFLOP/s of performance in the HPL benchmarks. More in-depth performance results for the card can be found on AMD’s website
    And it can run Crysis:
    • For visualization workloads LUMI has 64 Nvidia A40 GPUs
    Unfortunately, it is a bit delayed.
     
    Lightman and Krteq like this.
  10. Qesa

    Qesa Newcomer

  11. OlegSH

    OlegSH Regular

    That's the efficiency of weak scaling over nodes, I guess; the total HPL efficiency should then be calculated like this: 0.86 * 0.44 = 0.38, or 38%.
    Let's check that: the system has 2560 nodes, each with 4x MI250X GPUs, so 10,240 accelerators in total (or 20,480 GCDs); each accelerator peaks at 95.7 TFLOPS.
    10240 * 95.7 = 979,968 TFLOPS, or ~980 PFLOPS. The committed Linpack performance of LUMI-G is 375 PFLOPS, so that's 375/980 = 0.38, or 38% efficiency. As you can see, the numbers add up.
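    The arithmetic above can be sanity-checked with a quick script (all figures are taken from the posts in this thread; the 95.7 TFLOPS per MI250X is AMD's quoted peak FP64 matrix rate):

```python
# Sanity check of the LUMI-G HPL efficiency numbers quoted above.
nodes = 2560
gpus_per_node = 4                  # MI250X accelerators per node
peak_per_gpu_tflops = 95.7         # AMD peak FP64 matrix rate per MI250X

peak_total_tflops = nodes * gpus_per_node * peak_per_gpu_tflops
committed_linpack_pflops = 375     # LUMI-G committed HPL performance

efficiency = committed_linpack_pflops * 1000 / peak_total_tflops
print(f"Theoretical peak: {peak_total_tflops / 1000:.0f} PFLOPS")
print(f"HPL efficiency vs. matrix peak: {efficiency:.0%}")
```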
     
    pharma likes this.
  12. CarstenS

    CarstenS Legend Subscriber

    I was referring to LUMI-G, going by 42.2 TFLOPS per MI250X, which is 432 PFLOPS (2560*4*42.2), as a reference point for HPE Cray EX system-level efficiency. There, it's 86-ish %.

    Relevance: Have a data point to compare the larger HPE Cray EX that will be Frontier to.
     
    Lightman likes this.
  13. trinibwoy

    trinibwoy Meh Legend

    Does it make sense? The top HPL crunchers are achieving 60-80% of peak flops system-wide. How is it that a single MI250X is only hitting 44%? Do we know that the benchmark was run on the matrix cores?

    Maybe it used the regular cores with a peak of 47.9 TFLOPS. That would put the individual card at 42.2/47.9 = 88% efficiency.

    The total system would be 375/490 = 76% efficiency, which makes a lot more sense to me.
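    The two hypotheses (HPL run on the matrix cores vs. on the regular vector units) can be compared with a short sketch; the peak figures are the ones quoted in this thread:

```python
# Per-card HPL efficiency under the two hypotheses discussed above:
# (a) AMD's 42.2 TFLOPS result used the matrix cores (95.7 TFLOPS peak),
# (b) it used the regular vector units (47.9 TFLOPS peak).
hpl_per_card_tflops = 42.2
for label, peak in [("matrix cores", 95.7), ("vector units", 47.9)]:
    print(f"{label}: {hpl_per_card_tflops / peak:.0%} per-card efficiency")

# System level: 375 PFLOPS committed vs. 10,240 cards at the vector peak.
system_peak_pflops = 10240 * 47.9 / 1000   # ~490 PFLOPS
print(f"system vs. vector peak: {375 / system_peak_pflops:.0%}")
```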
     
    Lightman likes this.
  14. troyan

    troyan Regular

    LUMI quotes AMD's own numbers, and AMD ran HPL with the Matrix engines. That is 44% of peak performance.
     
  15. OlegSH

    OlegSH Regular

    MI100's FP64 efficiency in HPL was 69%, for reference, so unless MI250X not only doubled FP64 flops on the regular cores but also increased efficiency by a lot, 88% efficiency is unlikely to be the case.
     
    T2098, pharma, DavidGraham and 2 others like this.
  16. trinibwoy

    trinibwoy Meh Legend

    Where does it say AMD’s numbers were from the Matrix engines?
     
    Lightman and no-X like this.
  17. CarstenS

    CarstenS Legend Subscriber

    ... and I even tried to give a subtle hint at why I did the comparison this way. *sigh* Maybe in bold, all caps? Or a larger font? Or just a repetition: "Relevance: Have a data point to compare the larger HPE Cray EX that will be Frontier to."
     
    T2098 likes this.
  18. CarstenS

    CarstenS Legend Subscriber

    That would be a bit disappointing.

    FWIW, the Nvidia A100 number that AMD quoted (15.33 TFlops) has to include the Tensor Cores. Why would AMD try to compete without their HPC equivalent of those? Just lowballing?
     
  19. trinibwoy

    trinibwoy Meh Legend

    AMD is being real generous. Dell only got 14.2 on their A100 test.

    AMD would want to show MI250X in the best light, of course. You guys are probably right and it's running on the Matrix cores. In the end it's the effective flops that matter, and 42.2 >>> 15.5.
     
    Lightman likes this.
  20. Leoneazzurro5

    Leoneazzurro5 Regular

    Flops per watt, mainly, as this is one of the main metrics for HPC. So the comparison is 42.2 TFLOPS @ 600 W vs. 15.5 TFLOPS @ 400 W (as the referenced A100 is not the PCIe version, I'll take the nominal TDP for both, since we have no data on the real power draw). That's still 70.3 GFLOPS/W vs. 38.75 GFLOPS/W, or 1.8x the power efficiency in this particular benchmark. Of course, going slower (than the process would allow) and wider always pays off in this metric, even with the added silicon/packaging costs, which are probably not an issue for this market.
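    The perf-per-watt figures above work out like this (nominal TDPs and HPL results as quoted in the post; a rough sketch, not measured power draw):

```python
# Perf-per-watt comparison using the HPL results and nominal TDPs above.
mi250x = {"tflops": 42.2, "watts": 600}   # OAM module, nominal TDP
a100   = {"tflops": 15.5, "watts": 400}   # SXM variant, nominal TDP

for name, gpu in [("MI250X", mi250x), ("A100", a100)]:
    print(f"{name}: {gpu['tflops'] * 1000 / gpu['watts']:.1f} GFLOPS/W")

ratio = (mi250x["tflops"] / mi250x["watts"]) / (a100["tflops"] / a100["watts"])
print(f"efficiency ratio: {ratio:.1f}x")   # ~1.8x in MI250X's favor
```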
     
    Lightman likes this.