Datacenter GPU market supply and demand (2024)

xpea

MI325X is the HBM3E refresh of CDNA3, coming this year with 288GB of memory, 6TB/s of bandwidth, and increased compute.

MI350 is CDNA4 on 3nm, coming in 2025: a claimed 35x inference improvement compared to CDNA3 (FP4/FP6),
and 1.5x more memory and 1.2x "AI compute" TFLOPS compared to B200.
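For what it's worth, the memory claim lines up with published capacities (a quick sanity check; this assumes B200's announced 192GB of HBM3E and that MI350 keeps the 288GB of MI325X):

```python
# Sanity check of the "1.5x more memory than B200" claim above.
# Assumes MI350 keeps MI325X's 288GB and B200 ships with 192GB of HBM3E.
mi350_hbm_gb = 288
b200_hbm_gb = 192

ratio = mi350_hbm_gb / b200_hbm_gb
print(f"Memory ratio vs B200: {ratio:.1f}x")
```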

MI400 is CDNA Next coming in 2026.
AMD Computex 2024 stream link
MI325X is good for inference, the only application AMD can touch for now.
MI350 is too late. At best it will fight a superior B200 Ultra, and in the worst-case scenario it will be available in quantity only after Rubin.
MI400 is too late. Far from Rubin's training performance and its insane interconnect, and no RAM advantage for inference.

Unfortunately, AMD is only a bargaining tool to keep NV prices in check until they solve their huge interconnect deficit for the lucrative training business. (There's no big money to be made with inference because it competes with hyperscalers' in-house silicon.)
 
Links/sources?
 
Sales channel feedback:
Rubin allocation is through the roof. Even reluctant customers like AWS are pouring big money into NV's pocket for Rubin (and for the story, AWS is very worried that their own silicon can't keep up with AMD/NV, plus user utilization is low).
On the other side, you can buy any Instinct easily. AMD has no allocation problem because interest is "low".
 
Why was Jensen's talk so f.... boring?
 
I found it boring too.
Traditionally NV doesn't announce new stuff at Computex. The end of the year (RTX 5080/90) and maybe CES (WoA SoC) will be different though.

Edit: To illustrate what I said about insane NVDA demand vs AMD:


Elon Musk shows $9B of spending in a tweet for ~300k B200s with CX8 networking. Meta, MS, and Google are doing the exact same thing. AMD simply doesn't have a competitive solution until they fix their huge interconnect deficit. I repeat, nobody cares how fast a single accelerator is when you must train multi-trillion-parameter LLMs...
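The tweet's numbers work out as a simple back-of-envelope (a sketch only; the $9B covers GPUs plus CX8 networking, so the per-unit figure is a blended cost, not a GPU list price):

```python
# Implied blended spend per accelerator from the figures quoted above.
total_spend_usd = 9e9   # ~$9B, per the tweet
gpu_count = 300_000     # ~300k B200s (networking included in the total)

per_unit = total_spend_usd / gpu_count
print(f"Implied spend per B200 (incl. networking): ${per_unit:,.0f}")
```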
 
Really??? No lead times whatsoever?
 
AMD is demand constrained. They have secured enough resources for the full year and are waiting for (new) customers. But with the Computex announcement and Nvidia's Blackwell, I guess there won't be any new ones using MI300X in a meaningful way.

I like how AMD is doing marketing for Nvidia. Claiming that MI350X will be 35x faster at inference makes it sound like Blackwell is as fast as MI350X.
 
Lead time for AMD Instinct OAM racks is less than 3 months, whereas it's still more than 6 months for Hopper stuff (for example, X.Ai will get their 100k H200 only this winter)

Let me put it this way:
AMD had a once-in-a-lifetime opportunity with MI300X and its huge RAM advantage over Hopper's 80GB: when they announced it, availability was promised 3 quarters before H200 (and with still more RAM). But all the first MI300X batches went into the 2 exascale supercomputers bought by the US government. Then the deployment with hyperscalers was plagued with a lot of performance issues that were only fixed 4 months later by firmware/software updates. AMD started to sell MI300X for training and inference, but H100 is faster at training (MI300X has no transformer engine and no fast interconnect), so MI300X is now only sold for inference. AMD's tactic was simple: throw more and more silicon at the problem to outperform Nvidia, and sell their stuff for less to compensate for their weak ecosystem. But Nvidia is not Intel and is not a stationary target. In fact, Jensen accelerated the roadmap and also threw more silicon/RAM at the problem to keep the performance leadership. AMD's only advantage disappeared and OEMs stayed with Nvidia.
Fast forward one year, and AMD can't keep up and is in total panic mode. The original MI350X was cancelled and replaced by the mild MI325X update (just MI300X with more RAM). The original MI400 became MI350 (a much less ambitious design) to be launched before Rubin, and the old MI500 became MI400...
Nvidia won.
 
Poor Lisa; luckily she doesn't read the Beyond3D forum.
 
This is one easy way to tell the whole thing is an absolutely enormous bubble. Businesses try to make money, hype bubbles try to spend money.
 
It’s absolutely insane. It’s going to take ages to put all of this infra to use and start generating revenue. Revenues, margins and stock valuations are unsustainable long term once everyone has enough AI accelerators.
 
That's the thing: AI is already profitable for many applications. Meta saw a 20% improvement on their ads. Foxconn saw a 30% cost reduction from using AI for their factory layout and organization. Amazon warehouse productivity exploded with their AI-driven robots.
Of course, the business plan of these chatbot startups spending billions of dollars training trillion-parameter LLMs is still dubious. But for now, like in every new industry, the goal is to outperform the competition and attract customers.
The truth is, AI will be everywhere and computational needs can only increase over time. We really are only at the beginning, and investment in infrastructure won't slow down for the next 3~4 years. There's still plenty of money to be made by the shovel vendors...
 
Weird... Other partners/sources are stating a 6+ month lead time.
Funny how you went from "easy to buy" and "low interest/demand" to "maybe ~3 months lead time."
 
While AMD probably also has a lead time, the fact remains that MI300X and even MI300A are seeing much softer demand: ~$1B/quarter compared to ~$20B/quarter for Nvidia DC. Even the advantages of higher memory capacity and unified memory do not seem enough to overcome Nvidia. This is also due to software, of course; Nvidia has a significant advantage there.
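Taken at face value, those run rates put AMD at around 5% of the combined datacenter GPU revenue (a rough sketch using the ~$1B and ~$20B per-quarter figures above; it ignores every other vendor):

```python
# Rough revenue-share estimate from the per-quarter figures quoted above.
amd_q_usd = 1e9     # ~$1B/quarter for MI300X/MI300A
nvda_q_usd = 20e9   # ~$20B/quarter for Nvidia datacenter

share = amd_q_usd / (amd_q_usd + nvda_q_usd)
print(f"AMD share of combined DC GPU revenue: {share:.1%}")
```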
 
Which one? AMD has announced that Blackwell(!) is 30x or so faster than their not-even-released MI325X card. They have Osborned their product stack until next year. There won't be any supply problem going forward.
 
It doesn’t really feel like AMD is playing in the big leagues yet. They’re still talking about single node & single accelerator performance and memory capacity while the big boys are talking rack scale and datacenter scale. AMD won’t be a real player in AI until they have an answer for NVLink & Infiniband and Nvidia’s software ecosystem. That’s probably a few years away.
 
I think AMD does offer up to 8x GPUs as a default configuration with 2x EPYC CPUs, but you're right, that's still far below what Nvidia is offering in terms of scale. Whatever advantage AMD had with a unified CPU and GPU is almost mitigated by Grace. Part of it is also that some of these decisions are made years in advance, and AMD bet on HPC while Nvidia bet on AI. I doubt even Nvidia saw the market moving the way it has when they originally planned Hopper/Blackwell. In terms of execution as well, Nvidia has executed flawlessly while AMD has seen a few delays. While AMD would probably win the HPC market, the AI market is at least an order of magnitude larger at the moment. I don't see anyone even catching up to Nvidia until 2026 at the earliest.
 
I don’t think NVIDIA executes flawlessly - they just execute better than the competition. IIRC Hopper was a little bit late to ship in volume compared to their original announcement, and Grace was significantly late needing at least an extra respin. Also, Jen-Hsun has clearly been convinced AI will overtake HPC for many years, but I agree they probably didn’t expect the market to move so abruptly and Hopper was clearly designed to be very competitive in HPC as well (Blackwell might be different, tbd?)

I’m not sure whether the 8xGPU vs rackscale matters as much as that for all hyperscalers; e.g. see The Information’s reports on Microsoft developing their own Ethernet-based chips and infrastructure to replace Infiniband in their future datacenters (and that actually being ahead of their own AI chips). But NVIDIA clearly has a huge advantage by providing the “full solution” both in terms of competitiveness and addressable revenue.

8xMI300X has 1536GB of HBM which feels like plenty to run inference on most high-end LLM models; I’d trust MS’s claim that it provides leading perf/$ for GPT4(-o?) inference.

I think it remains to be seen how widely FP4 inference can be deployed for LLMs. Obviously a lot of engineering/research R&D is going to go into that now that Blackwell supports it, but I’m not convinced it puts MI300X/325X out of the running short-term.
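On the 1536GB point: that's 8 x 192GB of HBM3 per MI300X, and whether a model fits comes down mostly to weight precision (plus KV cache and overheads, which the sketch below ignores; the 1T parameter count is a hypothetical example, not a claim about any real model):

```python
# Rough LLM weight-memory sizing vs a single 8x MI300X node.
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB; ignores KV cache, activations, overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

node_hbm_gb = 8 * 192  # 8x MI300X = 1536GB of HBM

for bits in (16, 8, 4):
    gb = weights_gb(1000, bits)  # hypothetical 1T-parameter model
    fits = gb < node_hbm_gb
    print(f"1T params @ FP{bits}: {gb:.0f} GB -> fits in one node: {fits}")
```

At FP16 the weights alone overflow the node, which is why the FP8/FP4 question matters so much for the capacity argument.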
 
Which one? AMD has announced that Blackwell(!) is 30x or so faster than their not-even-released MI325X card. They have Osborned their product stack until next year. There won't be any supply problem going forward.
In order: a few. Maybe; there is a caveat in there if you look closely. Not really. And that's quite a statement to make.

While AMD probably also has a lead time, the fact remains that MI300X and even MI300A are seeing much softer demand: ~$1B/quarter compared to ~$20B/quarter for Nvidia DC. Even the advantages of higher memory capacity and unified memory do not seem enough to overcome Nvidia. This is also due to software, of course; Nvidia has a significant advantage there.
The shifting goalposts are always amusing. Obviously the competitive landscape has changed, but that doesn't change the underlying numbers.
If you notice, this discussion was from the CDNA thread; other than posting the information AMD released, I didn't mention anything about Nvidia.
People just can't help themselves.
 
I don’t think NVIDIA executes flawlessly - they just execute better than the competition. IIRC Hopper was a little bit late to ship in volume compared to their original announcement, and Grace was significantly late needing at least an extra respin. Also, Jen-Hsun has clearly been convinced AI will overtake HPC for many years, but I agree they probably didn’t expect the market to move so abruptly and Hopper was clearly designed to be very competitive in HPC as well (Blackwell might be different, tbd?)

I’m not sure whether the 8xGPU vs rackscale matters as much as that for all hyperscalers; e.g. see The Information’s reports on Microsoft developing their own Ethernet-based chips and infrastructure to replace Infiniband in their future datacenters (and that actually being ahead of their own AI chips). But NVIDIA clearly has a huge advantage by providing the “full solution” both in terms of competitiveness and addressable revenue.

8xMI300X has 1536GB of HBM which feels like plenty to run inference on most high-end LLM models; I’d trust MS’s claim that it provides leading perf/$ for GPT4(-o?) inference.

I think it remains to be seen how widely FP4 inference can be deployed for LLMs. Obviously a lot of engineering/research R&D is going to go into that now that Blackwell supports it, but I’m not convinced it puts MI300X/325X out of the running short-term.
True, I should perhaps have said "executes better"; there are those delays you mentioned. Blackwell continues to push on the AI front and has cut down on HPC, IIRC. We also have to remember that AMD's limited resources until the last few years meant they had to pick and choose where to focus. However, with the increased spend on R&D and manpower, I see them catching up over the next few years.

Yes, the UALink standard, I think you mean? But that will take time. Clearly the market is not happy paying the premium for Infiniband, but in the near term they don't seem to have any choice. Intel does not have a competing solution either, and we'll have to see when and how it pans out.

Parameter sizes keep increasing though, and memory demand will keep increasing as well. MI325X in Q4 this year should again bring an advantage there, and Nvidia would not be able to catch up until Rubin in Q4'25. MI350 seems to be an evolution and probably keeps the same memory size until HBM4 and MI400 in 2026.

Well, MI350 will also support FP4/FP6, so whatever research goes into FP4 will ultimately benefit MI350 as well. My other thought is that ultimately it will also become about efficiency rather than just brute horsepower. Datacenter power usage has been increasing at an alarming rate over the past few years, and while energy efficiency has been improving, absolute power has gone up significantly as well.
 