AMD CDNA Discussion Thread

Some Aldebaran bits from a recent Linux kernel patch series => 2 dies + 128GB HBM2(e) confirmed

Code:
On newer heterogeneous systems from AMD with GPU nodes connected via
xGMI links to the CPUs, the GPU dies are interfaced with HBM2 memory.

This patchset applies on top of the following series by Yazen Ghannam
AMD MCA Address Translation Updates
[https://patchwork.kernel.org/project/linux-edac/list/?series=505989]

This patchset does the following
1. Add support for northbridges on Aldebaran
   * x86/amd_nb: Add Aldebaran device to PCI IDs
   * x86/amd_nb: Add support for northbridges on Aldebaran
2. Add HBM memory type in EDAC
   * EDAC/mc: Add new HBM2 memory type
3. Modifies the amd64_edac module to
   a. Handle the UMCs on the noncpu nodes,
   * EDAC/mce_amd: extract node id from InstanceHi in IPID
   b. Enumerate HBM memory and add address translation
   * EDAC/amd64: Enumerate memory on noncpu nodes
   c. Address translation on Data Fabric version 3.5.
   * EDAC/amd64: Add address translation support for DF3.5
   * EDAC/amd64: Add fixed UMC to CS mapping


Aldebaran has 2 Dies (enumerated as a MCx, x= 8 ~ 15)
  Each Die has 4 UMCs (enumerated as csrowx, x=0~3)
  Each die has 2 root ports, with 4 misc ports per root.
  Each UMC manages 8 UMC channels each connected to 2GB of HBM memory.

Muralidhara M K (3):
  x86/amd_nb: Add Aldebaran device to PCI IDs
  x86/amd_nb: Add support for northbridges on Aldebaran
  EDAC/amd64: Add address translation support for DF3.5

Naveen Krishna Chatradhi (3):
  EDAC/mc: Add new HBM2 memory type
  EDAC/mce_amd: extract node id from InstanceHi in IPID
  EDAC/amd64: Enumerate memory on noncpu nodes

Yazen Ghannam (1):
  EDAC/amd64: Add fixed UMC to CS mapping

So... yes, @Bondrewd was right - 128GB HBM2(e) -> 2 dies - 4 UMCs per die - 8 channels per UMC - 2GB per channel
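The hierarchy from the cover letter multiplies out exactly to that capacity; a trivial sketch (the constant names are mine, not from the patchset):

```python
# Aldebaran HBM capacity per the EDAC patchset cover letter.
DIES = 2              # GPU dies per package
UMCS_PER_DIE = 4      # unified memory controllers per die
CHANNELS_PER_UMC = 8  # HBM channels behind each UMC
GB_PER_CHANNEL = 2    # capacity behind each channel

total_gb = DIES * UMCS_PER_DIE * CHANNELS_PER_UMC * GB_PER_CHANNEL
print(total_gb)  # 128
```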
 
So... yes, @Bondrewd was right - 128GB HBM2(e) -> 2 dies - 4 UMCs per die - 8 channels per UMC - 2GB per channel

So they never disable any stack, out of the 8 stacks in the PCB?
Nvidia always disables one stack out of 6, for every A100 GPU. Could AMD be that much more confident with their HBM2e's yields?

Perhaps what @Bondrewd meant with the "more or less" part is that some SKUs may come with one or more stacks disabled.
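For reference, the stack arithmetic behind those scenarios; the per-stack capacity is inferred from 128 GB spread over 8 stacks, and the cut-down configurations are hypothetical:

```python
# Capacity under the stack configurations discussed above.
STACKS = 8
GB_PER_STACK = 16  # inferred: 128 GB / 8 stacks (HBM2e)

full         = STACKS * GB_PER_STACK         # all stacks live
one_disabled = (STACKS - 1) * GB_PER_STACK   # one stack fused off, A100-style
half_sized   = STACKS * (GB_PER_STACK // 2)  # hypothetical half-sized stacks
print(full, one_disabled, half_sized)  # 128 112 64
```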
 
Oh yeah and LUMI (the pre-exascale Euro system) is a Frontier copy with 1/3rd the node count.
For 1/3rd the money that is.
So they never disable any stack, out of the 8 stacks in the PCB?
Nope.
Nvidia always disables one stack out of 6, for every A100 GPU
Volumes.
NV also bins relatively more % for A100 versus MI100.
Perhaps what @Bondrewd meant with the "more or less" part is that some SKUs may come with one or more stacks disabled.
Tbh it's a question of AMD selling MI200 with half-sized stacks.
128GB of HBM is a very pricey endeavour to casually yeet it into the wider market.
 
So it's a total of 256CUs capable of full-rate FP64 / packed FP32?

We're looking at 256 CUs * 64 ALUs each * 2 ops per FMA (multiply + add) * 2x FP32 RPM = ~100 TFLOPs FP32, or ~200 TFLOPs FP16 (if clocked at 1.5GHz).

I wonder how this would fare in a software renderer.
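Spelling out that arithmetic (the 1.5 GHz clock and 256-CU count are the assumptions above, not confirmed specs):

```python
# Peak-throughput estimate for a hypothetical 256-CU part at 1.5 GHz.
CUS = 256
ALUS_PER_CU = 64
OPS_PER_FMA = 2  # a fused multiply-add counts as two FLOPs
FP32_RPM = 2     # packed/double-rate FP32
CLOCK_GHZ = 1.5

fp32_tflops = CUS * ALUS_PER_CU * OPS_PER_FMA * FP32_RPM * CLOCK_GHZ / 1000
fp16_tflops = fp32_tflops * 2  # FP16 at twice the FP32 rate
print(round(fp32_tflops, 1), round(fp16_tflops, 1))  # 98.3 196.6
```

So the ~100/200 TFLOPs figures are the rounded-up versions of 98.3/196.6.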
 
They also say doubled BF16 rates to match FP16, which is nice (roughly 4x faster in total with the doubled CUs). However, there's no faster int8/int4 or sparsity, which is interesting; both will stick with FP16 rates as before, I presume. If they're going all-in on FP64, I guess it makes sense: HPC focused, not AI/ML:



A bit less but yea.

No chance of a 1731MHz boost clock AMD edition for 113.4 TFLOPs? A = 1, M = 13, D = 4. Maybe that's reserved for the top-tier RDNA3 GPU, assuming that has double-rate FP32 too: 160 CUs, so 2769MHz.
 
They also say doubled BF16 rates to match FP16, which is nice (roughly 4x faster in total with the doubled CUs). However, there's no faster int8/int4 or sparsity, which is interesting; both will stick with FP16 rates as before, I presume. If they're going all-in on FP64, I guess it makes sense: HPC focused, not AI/ML:

If the Xilinx acquisition goes through (looks like it may, only China approval pending), they might end up not doing exotic ML ops at all in the near future.
They have a bunch of patents for what look like far more advanced implementations of Xilinx's ACAP concept.
A bunch of these patents were also, I see, funded by the Path Forward/Fast Forward 1/2 DoE programmes.
Some new startups have been using Xilinx Versal for dramatic speedups in ML:
https://www.zdnet.com/article/xilin...c-speed-up-of-neural-nets-versus-nvidia-gpus/

ROCm's unified programming model for FPGAs/GPUs/CPUs was already demonstrated last year.
https://forums.xilinx.com/t5/Xilinx...onverged-ROCm-Runtime-Technology/ba-p/1175091

I think there is some possibility they will use this approach.
Lisa's messaging also seems to point in this direction, and has been consistent lately.

 
https://www.hpcwire.com/2021/07/05/...-data-deluge-from-the-square-kilometre-array/

That aussie HPC cluster does in fact use the MI200.
 
One can also count the FP64 TF number per board for lulz.
200,000 Milan cores @ 2.5TF per 64 cores = 7.8PF DPFP, plus 42.2PF from 750 MI200s.
Each MI200 can do 56TF DPFP; that is ~5x MI100.
MI100 does half-rate FP64 ops, and LLVM indicates MI200 can do full-rate FP64 ops.
With a 1.2x clock boost over MI100, full-rate FP64 and 2x dies, MI200 @ ~1.8GHz can put out 56TF DPFP
Sounds formidable.

def FeatureISAVersion9_0_8 : FeatureSet<
[FeatureGFX9,
HalfRate64Ops,
FeatureFmaMixInsts,
FeatureLDSBankCount32,
FeatureDsSrc2Insts,
FeatureExtendedImageInsts,
FeatureMadMacF32Insts,
FeatureDLInsts,
FeatureDot1Insts,
FeatureDot2Insts,
FeatureDot3Insts,
FeatureDot4Insts,
FeatureDot5Insts,
FeatureDot6Insts,
FeatureDot7Insts,
FeatureMAIInsts,
FeaturePkFmacF16Inst,
FeatureAtomicFaddInsts,
FeatureSupportsSRAMECC,
FeatureMFMAInlineLiteralBug,
FeatureImageGather4D16Bug]>;

def FeatureISAVersion9_0_A : FeatureSet<
[FeatureGFX9,
FeatureGFX90AInsts,
FeatureFmaMixInsts,
FeatureLDSBankCount32,
FeatureDLInsts,
FeatureDot1Insts,
FeatureDot2Insts,
FeatureDot3Insts,
FeatureDot4Insts,
FeatureDot5Insts,
FeatureDot6Insts,
FeatureDot7Insts,
Feature64BitDPP,
FeaturePackedFP32Ops,
FeatureMAIInsts,
FeaturePkFmacF16Inst,
FeatureAtomicFaddInsts,
FeatureMadMacF32Insts,
FeatureSupportsSRAMECC,
FeaturePackedTID,
FullRate64Ops]>;
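A quick sanity check on those figures; the CU count and clock in the FP64 line are guesses for illustration, not confirmed specs:

```python
# Back-of-the-envelope check on the Setonix/MI200 numbers above.
MILAN_CORES = 200_000
TF_PER_64_CORES = 2.5
MI200_BOARDS = 750
MI200_TF_FP64 = 56

cpu_pf = MILAN_CORES / 64 * TF_PER_64_CORES / 1000  # PF from the CPU partition
gpu_pf = MI200_BOARDS * MI200_TF_FP64 / 1000        # PF from the MI200s

# Full-rate FP64: each ALU retires one FMA (2 FLOPs) per clock.
ASSUMED_CUS = 220   # hypothetical active CUs across both dies
ASSUMED_GHZ = 1.8
fp64_tf = ASSUMED_CUS * 64 * 2 * ASSUMED_GHZ / 1000

print(round(cpu_pf, 1), round(gpu_pf, 1), round(fp64_tf, 1))  # 7.8 42.0 50.7
```

With all 256 CUs active, the same formula gives ~59 TF at 1.8 GHz, so the 56TF figure sits between those two assumptions.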
 
MI200 @ ~1.8GHz can put out 56TF DPFP
Sounds formidable.
A bit less than that, but yes.
That's what won AMD Frontier and then some.
It's an abhorrently mean stick as far as HPC classical prowess is concerned.
Which is why MI100 was more or less relegated to the world's most expensive devkit role.
Was basically a toy versus what went prod only 3Q after the thing.
 
It's nice to see AMD coming back and pushing performance forward. Since Lisa Su took over, they have had a very pragmatic approach and always go for the market segments where they can get good ROI. We see it on CPUs, where EPYC gets priority over APUs and Ryzen. On GPUs, they have gone for the high end since RDNA2.
Because they are so supply-constrained at TSMC, it's the right approach to quickly get cash flow back and increase their R&D expenditure, as their last financial report shows. Ironically, TSMC is both the biggest reason for their success (as Intel is struggling with their nodes) and their biggest limiting factor.
Regarding CDNA2, they are clearly betting on the last generation of the old-school HPC market, where FP64 is still relevant, before everything moves to lower-precision AI/ML workflows. The timing is good, with CDNA2 coming a few quarters before Hopper and getting as many supercomputer deals as they can. Obviously, the key to this market opportunity is that CDNA2 doesn't need any software or ecosystem investment. It's a basic FP64 brute-force approach.
The problem is that when Grace-Hopper-BlueField-3 systems arrive, the party will be over, and AMD won't have any choice but to get their AI/ML stack up to the task. Even traditional FP64 markets like weather and fluid simulations are transitioning to AI/ML (it's f#cking time; this brute-force FP64 approach is a never-ending story, as you never have enough compute power to simulate the quintillions of particles of our atmosphere...)
The interesting part of the story is what happens next. CDNA2 is a one-time shot. Nvidia will have MCM solutions too, and a much stronger AI/ML hardware and software stack...
 
Cringe again.
Even nV hweng org dudes shudder when they hear about MI200 then MI300.
Sad!
Even traditional FP64 markets like weather and fluid simulations are transitioning to AI/ML
Not really.
But FP32?
Hell yeah.
Obviously, the key to this market opportunity is that CDNA2 doesn't need any software or ecosystem investment
Of course it does.
Thank you Intel!
BlueField-3
Cringe again.
Nothing outside of hyperscale gives a shit about SmartNICs, and hyperscale builds their own!
See EFA.
 
Of course it does.
Thank you Intel!

Cringe again.
Nothing outside of hyperscale gives a shit about SmartNICs, and hyperscale builds their own!
See EFA.
You clearly have no idea what you are talking about.
1. No, Intel's AI/ML stack won't benefit AMD. At least not fully, as their AI/ML hardware is vastly different. The only (little) savior is the government investment in ROCm as part of the FRONTIER deal. It was done to avoid CUDA's total dominance in the field. But frankly, a few million is not enough to make a dent in CUDA, especially considering the pathetic current state of ROCm...

2. You should open your eyes and see how Nvidia operates when launching a new AI accelerator. They don't sell a GPU; they sell exclusively DGX systems for the first 6 months, and then their HGX reference platform for another few months. Finally, one year later, they open the market with SXM/PCIe cards. In other words, all customers will have to buy Grace-Hopper-BF3 systems to get their hands on next-gen AI/ML performance. And the thing is that customers are already lining up for the new NVDA platform and are already porting their hypervisors to BlueField-3 (Google, MS, FB, Baidu and Tencent are already at work). In fact, NVDA's datacenter business is expected to be more than twice as big in the Grace-Hopper-BF3 generation as it was with Ampere, when Nvidia was only selling the GPU with a bit of Mellanox SmartNIC (it was not a DPU yet) and relying on Intel/AMD for the CPU.

And that's the interesting point. It's no longer GPU vs GPU or CPU vs CPU. It's one [hardware + software] platform against another. With the Grace CPU and BlueField DPU, NVDA finally has a complete platform, and an extremely performant one, against AMD and Intel. In fact, Grace-Hopper-BF3 + Nvidia's software ecosystem (hypervisor + CUDA) has no equivalent, and it's a huge selling point, something AMD can only dream about...
 
No, Intel's AI/ML stack won't benefit AMD
It's SYCL.
At least not fully, as their AI/ML hardware is vastly different
They're just barely-programmable matrix engines; all the relevant stuff is abstracted away.
The only (little) savior is the government investment in ROCm as part of the FRONTIER deal.
Oh jeez, Codeplay is literally writing a Level Zero backend for gfx9/you name it.
You should open your eyes and see how Nvidia operates when launching a new AI accelerator
Tell me something I don't know.
They don't sell a GPU; they sell exclusively DGX systems for the first 6 months, and then their HGX reference platform
The opposite.
HGX ships first to the super 8, then DGX, then ODM HGX units.
In other words, all customers will have to buy Grace-Hopper-BF3 systems to get their hands on next-gen AI/ML performance
?
They will continue to use what they want.
The super 8 have infinite leverage over any IHV house.
they are already porting their hypervisors to BlueField-3
Irrelevant.
They'll cook their own, much like AMZN did with Nitro + EFA.
It's no longer GPU vs GPU or CPU vs CPU.
OH YES IT IS.
Just in a bitta different way.
Grace-Hopper-BF3 + Nvidia's software ecosystem (hypervisor + CUDA) has no equivalent, and it's a huge selling point
Oh jeez, that's peak LARP.
 