AMD CDNA: MI300 & MI400 (Analysis, Speculation and Rumors in 2024)

[MOD EDIT] This is the new 2024 CDNA/MI300/MI400 thread that has been split from the Blackwell thread.

Please try to keep things civil and not make too many assumptions about relative competitiveness without good data backing them; that's a slippery road to epic fanboy flame wars we'd all rather avoid...
 
From the looks of things MI300 can't even kill Hopper so I don't think Nvidia should be that worried.
Does AMD even have a switch for large-scale training? Some raw specs may look good on paper, but their relevance to large-scale training without the supporting infrastructure is beyond me.
 
Does AMD even have a switch for large-scale training? Some raw specs may look good on paper, but their relevance to large-scale training without the supporting infrastructure is beyond me.

I haven’t seen anything bigger than their 8-way solution. Nothing like Nvidia’s DGX stuff.

“The AMD Instinct MI300X Platform integrates 8 fully connected MI300X GPU OAM modules onto an industry-standard OCP design via 4th-Gen AMD Infinity Fabric™ links, delivering up to 1.5TB HBM3 capacity for low-latency AI processing. This ready-to-deploy platform can accelerate time-to-market and reduce development costs when adding MI300X accelerators into existing AI rack and server infrastructure.”
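For what it's worth, the 1.5TB figure checks out against the per-GPU capacity. A quick sanity check (the 192 GB per MI300X is from AMD's launch specs, not from this platform blurb):

```python
# Quick sanity check on the "up to 1.5TB HBM3" platform claim.
# The 192 GB per-GPU capacity is from AMD's MI300X launch specs, not this blurb.
GPUS_PER_PLATFORM = 8
HBM3_PER_GPU_GB = 192

total_gb = GPUS_PER_PLATFORM * HBM3_PER_GPU_GB
print(f"{total_gb} GB = {total_gb / 1024:.1f} TB")   # 1536 GB = 1.5 TB
```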
 
I haven’t seen anything bigger than their 8-way solution.
Even this can be limited by the bandwidth between the accelerators, as AMD doesn't seem to provide full bandwidth between every pair of accelerators due to the absence of an NVSwitch analog.
 
Discuss the points or don't discuss.
jeez, we're back to "AMD bad".
This forum is useless for anything but giving NV fellatio I guess.
You're not even talking B100.
MI400 will be a year later than B100
No, it's mid'25.
From the looks of things MI300 can't even kill Hopper so I don't think Nvidia should be that worried.
Proof?
You're unlikely to have MI300X numbers while they're gated under MS Azure Early Access, so why even talk.
Does AMD even have a switch for large-scale training?
Yes MI400 has XSwitch and the funny networking (Pensando Salina stuff with custom embeddings, also coming to Venice-E).
but their relevance to large-scale training without the supporting infrastructure is beyond me.
The money (well, the sustained kind, not the bubble hypium) is in inference.
I haven’t seen anything bigger than their 8-way solution.
8-way is what everyone uses.
The MI300X baseboard plugs into HGX mobos directly, since NV contributed the HGX connector/mobo/baseboard spec to OCP like 18 months ago.
Nothing like Nvidia’s DGX stuff.
DGX is 8 GPUs.
Since P100.
as AMD doesn't seem to provide full bandwidth between every pair of accelerators due to the absence of an NVSwitch analog.
That's straight-up FUD; it's a 7-link all-to-all setup which tops out at 896 GB/s per GPU.
AMD switches are for APUs anyway; people like their 32p xNC gigatrons still.
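For anyone wanting to sanity-check that figure, the per-peer number falls straight out of the aggregate. Rough arithmetic only, assuming the 7 links are symmetric:

```python
# Back-of-envelope split of the quoted 896 GB/s aggregate across 7 peer links.
# Purely illustrative; assumes every Infinity Fabric link carries an equal share.
NUM_PEERS = 7
AGGREGATE_GBPS = 896

per_peer_gbps = AGGREGATE_GBPS / NUM_PEERS
print(f"{per_peer_gbps:.0f} GB/s to any single peer")   # 128 GB/s
```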
 
There aren't any 3rd-party numbers, so nothing to get excited about yet.
It has a lot of customer traction, which is the Very Relevant Bit that no other vendor besides NV has managed.
[image: 256-GPU GH200 setup]
I was thinking of this guy.
256 GPUs is a meme for real training clusters, and past that you're hitting your IB/Ethernet/Slingshot network anyway, so why did you buy these again?
GH200 is a meme.

Frankly, you can evaluate whether a merchant ML part is a meme by doing a Microsoft test.
If it's on Azure, it's not a meme.
If it's not (hello Graphcore), it is.
 
That's straight-up FUD; it's a 7-link all-to-all setup which tops out at 896 GB/s per GPU.
That's just a bunch of theoretical numbers slapped together. The effectiveness of the ring p2p connections for the 8-GPU setup is certainly in question without any benchmarks. Given the scarce topology information they provided, achieving the claimed 900 GB/s p2p bandwidth may require going through multiple hops to the other GPUs, making things complicated and uneven. If that is not the case and there are any benchmarks, don't be ashamed to point them out. What AMD has shown so far regarding training has not been impressive at all, especially considering the cherry-picking nature of single-point results (SPEC, MLPerf, etc. exist for a reason). I suppose they are developing the switch not just for fun, right? (assuming your speculations about the XSwitch are correct).
 
The effectiveness of the ring p2p connections for the 8-GPU setup is certainly in question without any benchmarks
It's not a ring.
It's all-to-all; you get 7 links to 7 GPUs.
making things complicated and uneven.
It's simple as bricks but I digress.
If that is not the case and there are any benchmarks, don't be ashamed to point them out
Pay for Azure early access and you can do it yourself!
Or get on good graces with OCI people, not like I care.
What AMD has shown so far regarding training
They don't want training, they want inference.
Yeah, like that one.
Meme.
I suppose they are developing the switch not just for fun, right?
Those people invented glueless MP as you know it so they're doing it for fun, too.
Plus tiny 4P nodes aren't sustainable for HPC, we gotta RETVRN to xNC'd 32p monstrosities.
I would strongly support that bit
Yes, but the usual suspects can't breathe without saying "AMD bad".

Not like B100 is anything interesting.
You get two reticle-sized dies, 4 HBM stacks each. And more math. And more watts. All very by-the-numbers stuff that's been happening ever since V100 (it's usually double the ideal-workload perf at 30-40% moar power).
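Taking that rule of thumb at face value (the 2x perf and 30-40% power figures are the poster's, not measured numbers), the implied perf/watt gain is roughly:

```python
# Implied gen-over-gen perf/watt from the rule of thumb above:
# ~2x ideal-workload perf at ~1.3-1.4x the power. Illustrative arithmetic only.
perf_scale = 2.0
for power_scale in (1.3, 1.4):
    print(f"{perf_scale:.1f}x perf at {power_scale:.1f}x power -> "
          f"{perf_scale / power_scale:.2f}x perf/watt")
# 2.0x perf at 1.3x power -> 1.54x perf/watt
# 2.0x perf at 1.4x power -> 1.43x perf/watt
```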
 
Why so angry? This is an Nvidia thread, not AMD talk.
 
It's not a ring.
Quoting AMD data sheets here.

It's all-to-all; you get 7 links to 7 GPUs.
So, do you imagine them having a dedicated 900 GB/s connection from each GPU to every other one?
Of course it will still require hopping through the other GPUs to achieve the 900 GB/s bandwidth and waiting for the slowest one.

It's simple as bricks but I digress.
Lol, so simple that they can't even post the meme benchmarks, apparently

For the sake of a good SNR: more signal, less noise.
All of this is just trash talk until there are any "meme" benchmarks posted. If AMD has anything at hand, they will post it right before GTC or soon after, for obvious reasons.
 

Good eye. It doesn't seem to be a ring though. Each GPU has 7 IF links, one link to each peer. So fully P2P, but at nowhere near full bandwidth. This is in contrast to NVSwitch, which does provide full bandwidth between all peers. Will be interesting to compare 8-way scaling benchmarks once available.

NVSwitch has been around for a long time now, so it should have been an obvious target. The Broadcom announcement seems to address inter-node communication, but presumably it will tackle inter-GPU as well.
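To make the contrast concrete, here's a toy comparison of worst-case pairwise bandwidth under the two topologies. All numbers are the peak figures quoted in this thread (896 GB/s MI300X IF aggregate, ~900 GB/s NVLink), nothing measured:

```python
# Toy comparison of worst-case pairwise bandwidth under the two topologies.
# Numbers are the peak figures quoted in this thread; nothing here is measured.

def pairwise_bw_direct(aggregate_gbps: float, num_peers: int) -> float:
    """Direct all-to-all: each peer only ever sees its own fixed link slice."""
    return aggregate_gbps / num_peers

def pairwise_bw_switched(aggregate_gbps: float) -> float:
    """Switched fabric: one pair can burst the GPU's whole link budget."""
    return aggregate_gbps

print("MI300X-style direct  :", pairwise_bw_direct(896, 7), "GB/s to one peer")   # 128.0
print("NVSwitch-style fabric:", pairwise_bw_switched(900), "GB/s to one peer")    # 900
```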
 
The high-level SM structure hasn't changed that much.
Pretty much all the GPUs in the industry have settled on something like 16-32-wide SIMDs with a pile of L1 and shmem.
Mature technology.
Funny, wonder if Dell got in trouble for casually leaking specs like that.
Lenovo would've been carpetbombed by Intel/AMD if that was the case.
1000 W for a dual-big-chip B100 with 192GB would likely be a significant improvement in perf/watt.
Same rules as MI250X, new uarch + Si spam works.
If AMD has anything at hand, they will post it right before GTC or soon after, for obvious reasons
Ughhh.
Really?
 
Exciting! Seems shady on both sides though (shocking, right?)

1. Is a batch size of 1 representative of real-world inference workloads? Surely not.
2. Is vLLM widely adopted by H100 users, or is everyone actually using TensorRT?
 
The GitHub repo has 16k stars and 2k forks.
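On point 1, a batch size of 1 mostly matters for latency demos; vLLM's offline API batches whatever prompts you hand it, which is why single-prompt numbers say little about throughput. A minimal sketch based on vLLM's public quickstart (the model name is a placeholder and exact signatures may have drifted):

```python
# Rough sketch of vLLM's offline batching API, based on the public quickstart;
# the model name is a placeholder and exact signatures may have drifted.
from vllm import LLM, SamplingParams

prompts = [f"Summarize request #{i}:" for i in range(64)]   # a real batch, not 1
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

llm = LLM(model="meta-llama/Llama-2-7b-hf")    # placeholder checkpoint
outputs = llm.generate(prompts, sampling)       # continuous batching under the hood

for out in outputs[:2]:
    print(out.prompt, "->", out.outputs[0].text[:60])
```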
 