AMD CDNA Discussion Thread

AMD just announced its CDNA-architecture-based "MI100" card. Source

Relevant specs:

Design: Full-height, dual-slot, 10.5 in. long
Compute Units: 120
Stream Processors: 7,680
FP64 TFLOPs (Peak): 11.5
FP32 TFLOPs (Peak): 23.1
FP32 Matrix TFLOPs (Peak): 46.1
FP16/FP16 Matrix TFLOPs (Peak): 184.6
Int4/Int8 TOPS (Peak): 184.6
bFLOAT16 TFLOPs (Peak): 92.3
HBM2 ECC Memory: 32 GB
Memory Interface: 4,096-bit
Memory Clock: 1.2 GHz
Memory Bandwidth: 1.23 TB/s
PCIe Support: Gen4
Infinity Fabric Links/Bandwidth: 3 / 276 GB/s
TDP: 300 W
Cooling: Passively cooled
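
As a rough sanity check, several of the peak figures in the spec list follow from simple arithmetic. The sketch below assumes 2 FLOPs per stream processor per clock (one fused multiply-add) and double-data-rate HBM2 signalling. The engine clock is not in the list, so it is backed out of the FP32 figure rather than taken from AMD; treat it as an estimate.

```python
# Back-of-the-envelope check of the MI100 peak figures from the spec list.
CUS = 120
STREAM_PROCESSORS = 7_680
FP32_TFLOPS = 23.1
BUS_WIDTH_BITS = 4_096
MEMORY_CLOCK_GHZ = 1.2

FLOPS_PER_SP_PER_CLOCK = 2    # assumption: one fused multiply-add per clock
HBM2_TRANSFERS_PER_CLOCK = 2  # assumption: double data rate signalling


def sps_per_cu():
    """Stream processors per compute unit."""
    return STREAM_PROCESSORS // CUS


def implied_engine_clock_ghz():
    """Engine clock that would produce the listed FP32 peak (an estimate)."""
    flops_per_clock = STREAM_PROCESSORS * FLOPS_PER_SP_PER_CLOCK
    return FP32_TFLOPS * 1e12 / flops_per_clock / 1e9


def memory_bandwidth_tb_s():
    """Peak memory bandwidth from bus width and memory clock."""
    bits_per_second = BUS_WIDTH_BITS * MEMORY_CLOCK_GHZ * 1e9 * HBM2_TRANSFERS_PER_CLOCK
    return bits_per_second / 8 / 1e12


if __name__ == "__main__":
    print(f"SPs per CU:           {sps_per_cu()}")
    print(f"Implied engine clock: {implied_engine_clock_ghz():.3f} GHz")
    print(f"Memory bandwidth:     {memory_bandwidth_tb_s():.2f} TB/s")
```

Under those assumptions the bandwidth works out to the listed 1.23 TB/s, and the implied engine clock lands at roughly 1.5 GHz.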
 
Of note, of course, is that the leaks here were way off, at least those pinning it as a straight Vega derivative with a ton of standard FP32 FLOPs. Instead, this appears to be a straight-up matrix/machine-learning competitor, at least for model training, to challenge Nvidia and others. And it has, at least theoretically, the performance to do so if it's priced right.
 

Leovinus

Newcomer
I seem to recall that there was some speculation on whether CDNA would be a short-lived product precisely because it was assumed to be mostly Vega. Would this more radical departure speak for CDNA continuing in parallel with RDNA? And would there be some degree of symbiosis in having two architectures developed in parallel like this?
 

rSkip

Newcomer
From the die shot and the placement of the Infinity Fabric links, I guess that MI100 uses 3x Infinity Fabric links + WAFL(?), with three more links unused. Just guessing, I could be totally wrong.
links.png
 

Erinyes

Regular
From the die shot pics, I've seen estimates of 709-713 mm² posted on Twitter. That's pretty large but still smaller than the A100. AMD enabling 120 CUs out of 128 is not bad at all and points to a reasonably good yield. We'll probably see a slightly further cut-down part later once they are in full production.

Capacity is only 32GB but just like Nvidia, I'm guessing they'll have a part with HBM2E and double the capacity in a quarter or two.
Of note, of course, is that the leaks here were way off, at least those pinning it as a straight Vega derivative with a ton of standard FP32 FLOPs. Instead, this appears to be a straight-up matrix/machine-learning competitor, at least for model training, to challenge Nvidia and others. And it has, at least theoretically, the performance to do so if it's priced right.

Not sure what the leaks were, but anyone who's been following AMD's path in HPC and its supercomputer wins would have known it's not just a rehashed Vega. This is going into the Frontier supercomputer next year and a number of other smaller installations. This is AMD's push back into the lucrative HPC segment, and it certainly looks like they've invested significantly in it (looking at the readiness of ROCm 4.0 as well).

Pricing is apparently around $6400 as per a link from @CarstenS in another thread but these products aren't usually sold at retail. AMD themselves claim between 1.8X - 2.1X higher performance per dollar than the Nvidia A100.
I seem to recall that there was some speculation on whether CDNA would be a short-lived product precisely because it was assumed to be mostly Vega. Would this more radical departure speak for CDNA continuing in parallel with RDNA? And would there be some degree of symbiosis in having two architectures developed in parallel like this?

AMD has already committed to CDNA 2, which is going into the El Capitan supercomputer in 2022. The key feature they've announced is that it uses a next-gen Infinity Fabric which allows cache coherency with EPYC CPUs. Not sure about symbiosis, but the whole point of CDNA and RDNA is to maximize the utility of each with specific features for the segments they are intended for. Certain technologies such as Infinity Cache, Infinity Fabric, etc. may of course be used across both, along with some amount of physical design, but otherwise it seems like they will be developed in parallel.
From the die shot and the placement of the Infinity Fabric links, I guess that MI100 uses 3x Infinity Fabric links + WAFL(?), with three more links unused. Just guessing, I could be totally wrong.

Well the official specs from AMD say 3 links - https://www.amd.com/en/products/server-accelerators/instinct-mi100

And given that they're promoting 4 GPU configurations using the 3 links, I don't expect there to be more unused ones or they'd be using them.
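
The "3 links, 4 GPUs" argument can be sketched as a small topology count. This assumes the 4-GPU configuration is a fully connected (all-to-all) mesh, which is my reading of the claim and not something the spec page is quoted as saying here; the function names are purely illustrative.

```python
from itertools import combinations


def full_mesh(gpu_count):
    """Return every point-to-point link in an all-to-all topology."""
    return list(combinations(range(gpu_count), 2))


def links_per_gpu(gpu_count):
    """Count how many links terminate at each GPU in the full mesh."""
    links = full_mesh(gpu_count)
    return {gpu: sum(gpu in link for link in links) for gpu in range(gpu_count)}


if __name__ == "__main__":
    for n in (2, 4, 8):
        per_gpu = links_per_gpu(n)[0]
        print(f"{n} GPUs: {len(full_mesh(n))} links total, {per_gpu} per GPU")
```

In a full mesh each GPU needs one link to every other GPU, so four GPUs use exactly three links each, and anything larger would need more links per GPU or a non-mesh topology.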
 

Deleted member 13524

Guest
Yes but can it play Crysis?
Narrator's voice: It could not play Crysis..


BTW, why are they still calling this a GPU, if it can't render 2D graphics even?
 

bridgman

Newcomer
Subscriber
BTW, why are they still calling this a GPU, if it can't render 2D graphics even?

G is for GEMM :yes:... what were you thinking ?

Seriously, it's a fair question... we do refer to it as an "accelerator" frequently but it is still more GPU-ish than non-GPU-accelerator-ish and there's an argument for using familiar terminology for a while.
 

Erinyes

Regular
Yes but can it play Crysis?
Narrator's voice: It could not play Crysis..


BTW, why are they still calling this a GPU, if it can't render 2D graphics even?

Well..I suppose it still looks like a GPU at least.

They have dropped the Radeon branding though. Now it's just AMD Instinct.
 

Digidi

Regular
Same performance as GA100 with 100W less power, and both are on 7nm......

AMD also has about 100mm² less die area.
 

Erinyes

Regular
Same performance as GA100 with 100W less power, and both are on 7nm......

AMD also has about 100mm² less die area.

GA100 does have more memory and a larger portion of the chip disabled, though. The 400W is the (max) power of the SXM version. And there are PCIe versions of GA100 at less than 300W, so it's not like it's 100W less in all conditions. AMD certainly looks to be ahead in FP64, but Nvidia seems to have the advantage in AI/ML-related ops.

Given the large die sizes, a chiplet style approach certainly seems to be the future of these kinds of chips. Rumours already point towards this for Nvidia Hopper and CDNA2.
 

Digidi

Regular
You should compare the FLOPs values. At 400 W, Nvidia delivers 9.7 TFLOPs of FP64, while AMD delivers 11.5 TFLOPs on a 300 W card. OK, AMD has 8 GB less RAM, that's a point, but the rest is all faster. Only matrix multiplication at FP16/FP32 is higher on Nvidia.

So for AI go with Nvidia, the rest please buy AMD
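
For reference, the comparison above can be put into numbers with a quick sketch. It takes the two figures as peak FP64 TFLOPs and divides by rated board power (400 W and 300 W as quoted), so it says nothing about measured draw or sustained throughput.

```python
def gflops_per_watt(peak_tflops, board_power_w):
    """Peak GFLOPs per rated watt; uses rated power, not measured draw."""
    return peak_tflops * 1000 / board_power_w


# Figures as quoted in this thread (peak FP64 TFLOPs, rated board power in W).
a100 = gflops_per_watt(9.7, 400)
mi100 = gflops_per_watt(11.5, 300)

if __name__ == "__main__":
    print(f"A100:  {a100:.1f} GFLOPs/W")
    print(f"MI100: {mi100:.1f} GFLOPs/W")
    print(f"MI100 / A100: {mi100 / a100:.2f}x")
```

On paper that is roughly a 1.6x advantage for MI100 in peak FP64 per rated watt, with the caveat that both inputs are ceilings and not measurements.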
 

Qesa

Newcomer
Peak FP64 throughput is a poor predictor of actual performance when comparing between different architectures. This is as true for GPGPU as it is for graphics.
 

Erinyes

Regular
You should compare the FLOPs values. At 400 W, Nvidia delivers 9.7 TFLOPs of FP64, while AMD delivers 11.5 TFLOPs on a 300 W card. OK, AMD has 8 GB less RAM, that's a point, but the rest is all faster. Only matrix multiplication at FP16/FP32 is higher on Nvidia.

So for AI go with Nvidia, the rest please buy AMD

As I mentioned, 400W is simply the max power draw, not the actual draw. It was likely rated that high to ensure that the 80GB variants are drop-in replacements in the existing infrastructure. 80GB of HBM memory can consume a significant amount of power. And there are PCIe versions with lower rated power draw as well. I already mentioned AMD is better for FP64, but you cannot discount NV's performance in other metrics or the CUDA ecosystem.
 

3dilettante

Legend
Alpha
The whitepaper also indicates a few things, like the apparent doubling of the register file to 128KB per SIMD.
There is a separate class of registers mentioned for matrix operations, although I'm not clear if that's part of the total register file diagrammed.
Matrix instructions are also diagrammed as being a separate issue type from vector instructions. I'm assuming it's still one instruction per clock per wave, although that wouldn't rule out a longer cadence for matrix instructions or concurrent issue from neighboring waves. Perhaps there are limits based on how many operands are being sourced from the register file.
Some code commits mentioned that while it is possible to mix matrix and vector math, there was concern about thermal throttling being more likely if that happened.
Clocks seem to reflect that while the matrix operations may be more efficient per FLOP, there's a lot of hardware being exercised across the chip.
 

CarstenS

Legend
Subscriber
Well the official specs from AMD say 3 links - https://www.amd.com/en/products/server-accelerators/instinct-mi100
And given that they're promoting 4 GPU configurations using the 3 links, I don't expect there to be more unused ones or they'd be using them.
I would guess, they'd have a 4th one at least for spares, since those cards seem not to be sold (yet) in a harvested edition and you'd need full connectivity in data centers. If one of your IF-links is broken, you can throw away the whole chip.
Maybe the attached file helps for clarification, it's the original used in my CDNA-article here.
 

Attachments

  • 20xxxxxx_AMD INSTINCT MI100 Die_Face.png
So for AI go with Nvidia, the rest please buy AMD
These chips are directed primarily towards AI markets; it's as if you are saying MI100 is irrelevant.

A100 is over 3 times faster in AI workloads, while being behind by 15% or so in traditional FP32/FP64 workloads. A100 has 33% to 66% more memory bandwidth, a very important factor for reaching peak throughput more often.

A100 vs MI100
FP32: 19.5 vs 23.1
FP64: 9.7 vs 11.5
FP16: 624 vs 184
BF16: 624 vs 92
TB/s: 1.6 to 2 vs 1.2
Matrix FP32: 310 vs 46
Matrix FP64: 19.5 vs nil?

And that's not even the full GA100 die: it has 20 SMs disabled, with 108 out of 128 enabled.
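
A quick sketch of the ratios behind the "over 3 times" and "15% or so" claims. The MI100 numbers are taken from the spec list at the top of the thread, the A100 FP64 number is the 9.7 quoted earlier in the thread, and the 624 figures are taken at face value from this post; all are peak TFLOPs, so this is paper math only.

```python
# Peak TFLOPs as quoted in this thread; not measured results.
A100 = {"FP32": 19.5, "FP64": 9.7, "FP16 matrix": 624.0, "BF16 matrix": 624.0}
MI100 = {"FP32": 23.1, "FP64": 11.5, "FP16 matrix": 184.6, "BF16 matrix": 92.3}


def a100_relative_to_mi100():
    """Ratio of A100 peak to MI100 peak for each listed metric."""
    return {metric: A100[metric] / MI100[metric] for metric in A100}


if __name__ == "__main__":
    for metric, ratio in a100_relative_to_mi100().items():
        print(f"{metric:12s} A100 = {ratio:.2f}x MI100")
```

With these inputs A100 sits at about 0.84x of MI100 in vector FP32/FP64 (so roughly 15% behind) and above 3x in the FP16 matrix figure.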
 

Erinyes

Regular
I would guess, they'd have a 4th one at least for spares, since those cards seem not to be sold (yet) in a harvested edition and you'd need full connectivity in data centers. If one of your IF-links is broken, you can throw away the whole chip.
Maybe the attached file helps for clarification, it's the original used in my CDNA-article here.

Wouldn't the 4th be for connectivity to the CPU? So you'd have each GPU connected to the CPU and 3 other GPUs? There still might be more for redundancy as you suggest, though they do have all HBM PHYs enabled unlike Nvidia with the A100. I do expect at least one more further cut down part later in the lifecycle. They had two with the Vega 20 chip, and that was much smaller (though on a new node at the time).
 