AMD CDNA Discussion Thread

Some Aldebaran bits from a recent Linux kernel patch series => 2 dies + 128GB HBM2(e) confirmed

Code:
On newer heterogeneous systems from AMD with GPU nodes connected via
xGMI links to the CPUs, the GPU dies are interfaced with HBM2 memory.

This patchset applies on top of the following series by Yazen Ghannam
AMD MCA Address Translation Updates
[https://patchwork.kernel.org/project/linux-edac/list/?series=505989]

This patchset does the following
1. Add support for northbridges on Aldebaran
   * x86/amd_nb: Add Aldebaran device to PCI IDs
   * x86/amd_nb: Add support for northbridges on Aldebaran
2. Add HBM memory type in EDAC
   * EDAC/mc: Add new HBM2 memory type
3. Modifies the amd64_edac module to
   a. Handle the UMCs on the noncpu nodes,
   * EDAC/mce_amd: extract node id from InstanceHi in IPID
   b. Enumerate HBM memory and add address translation
   * EDAC/amd64: Enumerate memory on noncpu nodes
   c. Address translation on Data Fabric version 3.5.
   * EDAC/amd64: Add address translation support for DF3.5
   * EDAC/amd64: Add fixed UMC to CS mapping


Aldebaran has 2 Dies (enumerated as a MCx, x= 8 ~ 15)
  Each Die has 4 UMCs (enumerated as csrowx, x=0~3)
  Each die has 2 root ports, with 4 misc ports per root.
  Each UMC manages 8 UMC channels each connected to 2GB of HBM memory.

Muralidhara M K (3):
  x86/amd_nb: Add Aldebaran device to PCI IDs
  x86/amd_nb: Add support for northbridges on Aldebaran
  EDAC/amd64: Add address translation support for DF3.5

Naveen Krishna Chatradhi (3):
  EDAC/mc: Add new HBM2 memory type
  EDAC/mce_amd: extract node id from InstanceHi in IPID
  EDAC/amd64: Enumerate memory on noncpu nodes

Yazen Ghannam (1):
  EDAC/amd64: Add fixed UMC to CS mapping

So... yes, @Bondrewd was right - 128GB HBM2(e) -> 2 dies - 4 UMCs per die - 8 channels per UMC - 2GB per channel
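The hierarchy from the cover letter multiplies out exactly to that capacity; a trivial sketch (the constant names are mine, not from the patchset):

```python
# Aldebaran HBM capacity per the EDAC patchset cover letter.
DIES = 2              # GPU dies per package
UMCS_PER_DIE = 4      # unified memory controllers per die
CHANNELS_PER_UMC = 8  # HBM channels behind each UMC
GB_PER_CHANNEL = 2    # capacity behind each channel

total_gb = DIES * UMCS_PER_DIE * CHANNELS_PER_UMC * GB_PER_CHANNEL
print(total_gb)  # 128
```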
 
So... yes, @Bondrewd was right - 128GB HBM2(e) -> 2 dies - 4 UMCs per die - 8 channels per UMC - 2GB per channel

So they never disable any stack, out of the 8 stacks in the PCB?
Nvidia always disables one stack out of 6, for every A100 GPU. Could AMD be that much more confident with their HBM2e's yields?

Perhaps what @Bondrewd meant with the "more or less" part is that some SKUs may come with one or more stacks disabled.
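For reference, the stack arithmetic behind those scenarios; the per-stack capacity is inferred from 128 GB spread over 8 stacks, and the cut-down configurations are hypothetical:

```python
# Capacity under the stack configurations discussed above.
STACKS = 8
GB_PER_STACK = 16  # inferred: 128 GB / 8 stacks (HBM2e)

full         = STACKS * GB_PER_STACK         # all stacks live
one_disabled = (STACKS - 1) * GB_PER_STACK   # one stack fused off, A100-style
half_sized   = STACKS * (GB_PER_STACK // 2)  # hypothetical half-sized stacks
print(full, one_disabled, half_sized)  # 128 112 64
```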
 
Oh yeah and LUMI (the pre-exascale Euro system) is a Frontier copy with 1/3rd the node count.
For 1/3rd the money that is.
So they never disable any stack, out of the 8 stacks in the PCB?
Nope.
Nvidia always disables one stack out of 6, for every A100 GPU
Volumes.
NV also bins relatively more % for A100 versus MI100.
Perhaps what @Bondrewd meant with the "more or less" part is that some SKUs may come with one or more stacks disabled.
Tbh it's a question of AMD selling MI200 with half-sized stacks.
128GB of HBM is a very pricey endeavour to casually yeet it into the wider market.
 
So it's a total of 256CUs capable of full-rate FP64 / packed FP32?

We're looking at 256 CUs * 64 ALUs each * 2 ops per FMA (multiply + add) * 2x FP32 RPM = ~100 TFLOPs FP32, or ~200 TFLOPs FP16 (if clocked at 1.5GHz).

I wonder how this would fare in a software renderer.
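Spelling out that arithmetic (the 1.5 GHz clock and 256-CU count are the assumptions above, not confirmed specs):

```python
# Peak-throughput estimate for a hypothetical 256-CU part at 1.5 GHz.
CUS = 256
ALUS_PER_CU = 64
OPS_PER_FMA = 2  # a fused multiply-add counts as two FLOPs
FP32_RPM = 2     # packed/double-rate FP32
CLOCK_GHZ = 1.5

fp32_tflops = CUS * ALUS_PER_CU * OPS_PER_FMA * FP32_RPM * CLOCK_GHZ / 1000
fp16_tflops = fp32_tflops * 2  # FP16 at twice the FP32 rate
print(round(fp32_tflops, 1), round(fp16_tflops, 1))  # 98.3 196.6
```

So the ~100/200 TFLOPs figures are the rounded-up versions of 98.3/196.6.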
 
They also say doubled BF16 rates to match FP16, which is nice (roughly 4x faster in total with the doubled CUs). However, there's no faster int8/int4 or sparsity, which is interesting; both will stick with FP16 rates as before, I presume. If they're going all-in on FP64, I guess it makes sense: HPC focused, not AI/ML:



A bit less but yea.

No chance of a 1731MHz boost clock AMD edition for 113.4 TFLOPs? A = 1, M = 13, D = 4. Maybe that's reserved for the top-tier RDNA3 GPU, assuming that has double-rate FP32 too: 160 CUs, so 2769MHz.
 
They also say doubled BF16 rates to match FP16, which is nice (roughly 4x faster in total with the doubled CUs). However, there's no faster int8/int4 or sparsity, which is interesting; both will stick with FP16 rates as before, I presume. If they're going all-in on FP64, I guess it makes sense: HPC focused, not AI/ML:

If the Xilinx acquisition goes through (looks like it may, only China approval pending), they might end up not doing exotic ML ops at all in the near future.
They have a bunch of patents for what look like far more advanced implementations of Xilinx's ACAP concept.
A bunch of these patents were also, I see, funded by the Path Forward/Fast Forward 1/2 DoE programmes.
Some new startups have been using Xilinx Versal for dramatic speedups in ML:
https://www.zdnet.com/article/xilin...c-speed-up-of-neural-nets-versus-nvidia-gpus/

ROCm's unified programming model for FPGAs/GPUs/CPUs was already demonstrated last year.
https://forums.xilinx.com/t5/Xilinx...onverged-ROCm-Runtime-Technology/ba-p/1175091

I think there is some possibility they will use this approach.
Lisa's messaging also seems to point in this direction, and has been consistent lately.

 
https://www.hpcwire.com/2021/07/05/...-data-deluge-from-the-square-kilometre-array/

That aussie HPC cluster does in fact use the MI200.
 
One can also count the FP64 TF number per board for lulz.
200,000 Milan cores @ 2.5TF per 64 cores = 7.8PF DPFP, plus 42.2PF from 750 MI200s.
Each MI200 can do 56TF DPFP; that is ~5x MI100.
MI100 does half-rate FP64 ops, and LLVM indicates MI200 can do full-rate FP64 ops.
With a 1.2x clock boost over MI100, full-rate FP64 and 2x dies, MI200 @ ~1.8GHz can put out 56TF DPFP
Sounds formidable.

def FeatureISAVersion9_0_8 : FeatureSet<
[FeatureGFX9,
HalfRate64Ops,
FeatureFmaMixInsts,
FeatureLDSBankCount32,
FeatureDsSrc2Insts,
FeatureExtendedImageInsts,
FeatureMadMacF32Insts,
FeatureDLInsts,
FeatureDot1Insts,
FeatureDot2Insts,
FeatureDot3Insts,
FeatureDot4Insts,
FeatureDot5Insts,
FeatureDot6Insts,
FeatureDot7Insts,
FeatureMAIInsts,
FeaturePkFmacF16Inst,
FeatureAtomicFaddInsts,
FeatureSupportsSRAMECC,
FeatureMFMAInlineLiteralBug,
FeatureImageGather4D16Bug]>;

def FeatureISAVersion9_0_A : FeatureSet<
[FeatureGFX9,
FeatureGFX90AInsts,
FeatureFmaMixInsts,
FeatureLDSBankCount32,
FeatureDLInsts,
FeatureDot1Insts,
FeatureDot2Insts,
FeatureDot3Insts,
FeatureDot4Insts,
FeatureDot5Insts,
FeatureDot6Insts,
FeatureDot7Insts,
Feature64BitDPP,
FeaturePackedFP32Ops,
FeatureMAIInsts,
FeaturePkFmacF16Inst,
FeatureAtomicFaddInsts,
FeatureMadMacF32Insts,
FeatureSupportsSRAMECC,
FeaturePackedTID,
FullRate64Ops]>;
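A quick sanity check on those figures; the CU count and clock in the FP64 line are guesses for illustration, not confirmed specs:

```python
# Back-of-the-envelope check on the Setonix/MI200 numbers above.
MILAN_CORES = 200_000
TF_PER_64_CORES = 2.5
MI200_BOARDS = 750
MI200_TF_FP64 = 56

cpu_pf = MILAN_CORES / 64 * TF_PER_64_CORES / 1000  # PF from the CPU partition
gpu_pf = MI200_BOARDS * MI200_TF_FP64 / 1000        # PF from the MI200s

# Full-rate FP64: each ALU retires one FMA (2 FLOPs) per clock.
ASSUMED_CUS = 220   # hypothetical active CUs across both dies
ASSUMED_GHZ = 1.8
fp64_tf = ASSUMED_CUS * 64 * 2 * ASSUMED_GHZ / 1000

print(round(cpu_pf, 1), round(gpu_pf, 1), round(fp64_tf, 1))  # 7.8 42.0 50.7
```

With all 256 CUs active, the same formula gives ~59 TF at 1.8 GHz, so the 56TF figure sits between those two assumptions.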
 
MI200 @ ~1.8GHz can put out 56TF DPFP
Sounds formidable.
A bit less than that, but yes.
That's what won AMD Frontier and then some.
It's an abhorrently mean stick as far as HPC classical prowess is concerned.
Which is why MI100 was more or less relegated to the world's most expensive devkit role.
Was basically a toy versus what went prod only 3Q after the thing.
 
It's nice to see AMD coming back and pushing performance forward. Since Lisa Su took over, they have had a very pragmatic approach and always go for the market segments where they can get good ROI. We see it on CPUs, where EPYC gets priority over APUs and Ryzen. On GPUs, they have gone for the high end since RDNA2.
Because they are so supply-constrained at TSMC, it's the right approach to quickly get cash flow back and increase their R&D expenditure, as their last financial report shows. Ironically, TSMC is both the biggest reason for their success (as Intel is struggling with their nodes) and their biggest limiting factor.
Regarding CDNA2, they are clearly betting on the last generation of the old-school HPC market, where FP64 is still relevant, before everything moves to lower-precision AI/ML workflows. The timing is good, with CDNA2 coming a few quarters before Hopper and getting as many supercomputer deals as they can. Obviously, the key to this market opportunity is that CDNA2 doesn't need any software or ecosystem investment. It's a basic FP64 brute-force approach.
The problem is that when Grace-Hopper-BlueField-3 systems arrive, the party will be over, and AMD won't have any choice but to get their AI/ML stack up to the task. Even traditional FP64 markets like weather and fluid simulations are transitioning to AI/ML (it's f#cking time; this brute-force FP64 approach is a never-ending story, as you never have enough compute power to simulate the quintillions of particles of our atmosphere...)
The interesting part of the story is what happens next. CDNA2 is a one-time shot. Nvidia will have MCM solutions too, and a much stronger AI/ML hardware and software stack...
 
Cringe again.
Even nV hweng org dudes shudder when they hear about MI200 then MI300.
Sad!
Even traditional FP64 markets like weather and fluid simulations are transitioning to AI/ML
Not really.
But FP32?
Hell yeah.
Obviously, the key to this market opportunity is that CDNA2 doesn't need any software or ecosystem investment
Of course it does.
Thank you Intel!
BlueField-3
Cringe again.
Nothing outside of hyperscale gives a shit about SmartNICs, and hyperscale builds their own!
See EFA.
 
Of course it does.
Thank you Intel!

Cringe again.
Nothing outside of hyperscale gives a shit about SmartNICs, and hyperscale builds their own!
See EFA.
You clearly have no idea what you are talking about.
1. No, Intel's AI/ML stack won't benefit AMD. At least not fully, as their AI/ML hardware is vastly different. The only (little) savior is the government investment in ROCm as part of the FRONTIER deal. It was done to avoid CUDA's total dominance in the field. But frankly, a few million is not enough to make a dent in CUDA, especially considering the pathetic current state of ROCm...

2. You should open your eyes and see how Nvidia operates when launching a new AI accelerator. They don't sell a GPU; they sell exclusively DGX systems for the first 6 months, and then their HGX reference platform for another few months. Finally, one year later, they open the market with SXM/PCIe cards. In other words, all customers will have to buy Grace-Hopper-BF3 systems to get their hands on next-gen AI/ML performance. And the thing is that customers are already lining up for the new NVDA platform and are already porting their hypervisors to BlueField-3 (Google, MS, FB, Baidu and Tencent are already at work). In fact, NVDA's datacenter business is expected to be more than twice as big in the Grace-Hopper-BF3 generation as it was with Ampere, when Nvidia was only selling the GPU with a bit of Mellanox SmartNIC (it was not a DPU yet) and relying on Intel/AMD for the CPU.

And that's the interesting point. It's no longer GPU vs GPU or CPU vs CPU. It's one [hardware + software] platform against another. With the Grace CPU and BlueField DPU, NVDA finally has a complete platform, and an extremely performant one, against AMD and Intel. In fact, Grace-Hopper-BF3 + Nvidia's software ecosystem (hypervisor + CUDA) has no equivalent, and it's a huge selling point, something AMD can only dream about...
 
No, Intel's AI/ML stack won't benefit AMD
It's SYCL.
At least not fully, as their AI/ML hardware is vastly different
They're just barely-programmable matrix engines; all the relevant stuff is abstracted away.
The only (little) savior is the government investment in ROCm as part of the FRONTIER deal.
Oh jeez, Codeplay is literally writing a Level Zero backend for gfx9/you name it.
You should open your eyes and see how Nvidia operates when launching a new AI accelerator
Tell me something I don't know.
They don't sell a GPU; they sell exclusively DGX systems for the first 6 months, and then their HGX reference platform
The opposite.
HGX ships first to the super 8, then DGX, then ODM HGX units.
In other words, all customers will have to buy Grace-Hopper-BF3 systems to get their hands on next-gen AI/ML performance
?
They will continue to use what they want.
The super 8 have infinite leverage over any IHV house.
they are already porting their hypervisors to BlueField-3
Irrelevant.
They'll cook their own, much like AMZN did with Nitro + EFA.
It's no longer GPU vs GPU or CPU vs CPU.
OH YES IT IS.
Just in a bitta different way.
Grace-Hopper-BF3 + Nvidia's software ecosystem (hypervisor + CUDA) has no equivalent, and it's a huge selling point
Oh jeez, that's peak LARP.
 