AMD Execution Thread [2024]

Speaking of which, I want to discuss the difference between the H100 and MI300X in terms of die size and cost.

We know the MI300X uses about 920mm² of TSMC 5nm chiplets (8 compute chiplets) stacked on top of 1460mm² of TSMC 6nm chiplets (4 IO chiplets). Combined, this makes for a total silicon area of 2380mm², compared to the single 814mm² die of the H100.

https://www.semianalysis.com/p/amd-mi300-taming-the-hype-ai-performance

NVIDIA's transistor footprint is much smaller, but a single large die yields worse than small chiplets. However, the H100 is not a full-die product: about 20% of the die is disabled to improve yields. So the difference in effective yields between the two may not be that large.
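For intuition, here's a back-of-the-envelope Poisson yield model (a minimal sketch in Python; the defect density is an assumed round number for illustration, not a published figure):

```python
import math

# Poisson yield model: Y = exp(-D * A)
# D = 0.001 defects/mm² (i.e. 0.1/cm²) is an assumed round number
D = 0.001

def yield_rate(area_mm2: float) -> float:
    """Probability that a die of the given area has zero defects."""
    return math.exp(-D * area_mm2)

# H100: one monolithic 814mm² die
print(f"H100 monolithic die: {yield_rate(814):.1%}")         # ~44%

# MI300X compute silicon: 8 chiplets of ~115mm² each (920mm² total);
# a defective chiplet is discarded individually, not the whole package
print(f"Single ~115mm² chiplet: {yield_rate(920 / 8):.1%}")  # ~89%
```

Of course, NVIDIA doesn't discard an 814mm² die over a single defect; disabling ~20% of the die lets them salvage most defective dies, which is exactly why the effective yield gap may be smaller than the raw numbers suggest.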

MI300X costs significantly more due to its larger overall size, more complex packaging and the need for additional 6nm chiplets.

Any thoughts?

This could be a hint or not: MI300's yield seems somewhat low due to its complex packaging and the early stage of the ramp.
Jean Hu
Executive Vice President, Chief Financial Officer and Treasurer at Advanced Micro Devices
Yes. Thank you, Joe. Our team has done an incredible job to ramp MI300. As you probably know, it's very complex product and we are still at the first year of the ramp, both from EO [Phonetic], the testing time and the process improvement. Those things are still ongoing. We do think, over time, the gross margin should be accretive to corporate average.
Jean Hu
Executive Vice President, Chief Financial Officer and Treasurer at Advanced Micro Devices
Yes. I think you're right. It's the GPU gross margin right now is below the data center gross margin level. I think there are two reasons -- actually, the major reason is we actually increased the investment quite significantly to, as Lisa mentioned, to expand and accelerating our road map in the AI side. That's one of the major drivers for the operating income coming down slightly. On the gross margin side, going back to your question, we said in the past and we continue to believe the case is, data center GPU gross margin over time will be accretive to corporate average, but it will take a while to get to the server level for gross margin.
 
This could be a hint or not: MI300's yield seems somewhat low due to its complex packaging and the early stage of the ramp.
Yeah, I myself was shocked to discover how big the MI300X is compared to the H100! Even the transistor count is huge (163 billion for MI300X vs 80 billion for H100).
 
I feel like the usual PPA metrics (or at least the A) don't really apply here. We're talking hundreds of dollars of silicon on a product that sells for tens of thousands, while the bulk of the cost is the HBM. Even if MI300X's core costs twice as much as H100's, it's only a couple percent gross margin difference. And a couple percent less than Nvidia's ~80% is still really damn good.
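To put rough numbers on that (all of these figures are assumptions for illustration, not known BOM data):

```python
# Hypothetical $30,000 accelerator; compare a $500 compute die
# (H100-like) against a $1,000 one (MI300X-like, 2x the silicon).
PRICE = 30_000
OTHER_BOM = 5_000  # assumed HBM + packaging + board, same in both cases

def gross_margin(cost_of_goods: float) -> float:
    return (PRICE - cost_of_goods) / PRICE

for silicon in (500, 1_000):
    print(f"silicon ${silicon}: margin {gross_margin(OTHER_BOM + silicon):.1%}")
# silicon $500:  margin 81.7%
# silicon $1000: margin 80.0%
```

Doubling the core silicon cost moves gross margin by less than two points under these assumptions, which is the point: the die is not where the margin battle is fought.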

Where they would actually be stung on margin is selling 192 GB of HBM for similar money to, or less than, what Nvidia charges for 80 GB.
 
Where they would actually be stung on margin is selling 192 GB of HBM for similar money to, or less than, what Nvidia charges for 80 GB.
Which they do! The MI300X 192GB sells for significantly less than the H100 80GB: about $15k (sometimes $10k), while the H100 goes anywhere from $30k to $40k or more.

an inside source claims that Microsoft has purchased the Instinct MI300X 192 GB model for ~$10,000 a piece

The leakers claim that businesses further down the (AI and HPC) food chain are having to shell out $15,000 per MI300X unit

we have seen NVIDIA's H100 80 GB HBM2E add-in-card available for $30,000, $40,000, and even much more at eBay. Meanwhile, the more powerful H100 80 GB SXM with 80 GB of HBM3 memory tends to cost more than an H100 80 GB AIB

 
we have seen NVIDIA's H100 80 GB HBM2E add-in-card available for $30,000, $40,000, and even much more at eBay. Meanwhile, the more powerful H100 80 GB SXM with 80 GB of HBM3 memory tends to cost more than an H100 80 GB AIB
You removed the beginning of the quotation and also the links, so it isn't obvious that you are comparing the August 2023 price of the H100 to the April 2024 price of the MI300X.

Source of the quotation:

the missing part: "over the recent quarters, we have seen NVIDIA's H100 80 GB HBM2E"

links to (February 2, 2024):

"we have seen Nvidia's H100 80GB HBM2E add-in-card available for $30,000, $40,000"

links to (August 15, 2023):

"Nvidia H100 80GB HBM2E […] retails for around $30,000 in the U.S."
links to a retailer:

So you are comparing the H100's price from a random retailer in August 2023 to a contract price for the MI300X arranged between the manufacturer and a big customer in April 2024.
 
So you are comparing the H100's price from a random retailer in August 2023 to a contract price for the MI300X arranged between the manufacturer and a big customer in April 2024.
I don't understand the problem; the point still stands. In fact, as of April 2024 the H100 costs $40k or more and the MI300X costs about $15k (or $10k for Microsoft).

If you mean that a random retailer is an untrusted source, then you should know the H100's price was set at $33k or more before it even launched in 2022, and it has been on the rise ever since.


Nvidia H100 80GB HBM2E […] retails for around $30,000 in the U.S
That's the cheapest SKU (HBM2E and PCIe). You seem to have misread the link I posted: it lists prices for all H100 SKUs. The highest SKU (HBM3 and SXM5) costs $40k to $50k, and on eBay it goes for as much as $60k as of right now.

From the same retailer you linked, it costs $70k to own the HBM3 version as of today.

 
I don't understand the problem; the point still stands. In fact, as of April 2024 the H100 costs $40k or more and the MI300X costs about $15k (or $10k for Microsoft).

If you mean that a random retailer is an untrusted source, then you should know the H100's price was set at $33k or more before it even launched in 2022, and it has been on the rise ever since.
There should be a difference between the street price and the actual price negotiated between the manufacturer and a "big" customer.
On the street, the biggest, fattest, fastest 256 GB DDR5 memory modules for servers cost around $18,000 running at 4.8 GHz, which works out to around $70 per GB. But skinnier modules that only scale to 32 GB cost only $35 per GB. So that puts HBM2e at around $110 per GB at a “greater than 3X” as the Nvidia chart above shows. That works out to around $10,600 for 96 GB. It is hard to say what the uplift to HBM3 and HBM3E might be worth at the “street price” for the device, but if it is a mere 25 percent uplift to get to HBM3, then of the approximate $30,000 street price of an H100 with 80 GB of capacity, the HBM3 represents $8,800 of that. Moving to 96 GB of HBM3E might raise the memory cost at “street price” to $16,500 because of another 25 percent technology cost uplift and that additional 16 GB of memory and the street price of the H100 96 GB should be around $37,700.

It will be interesting to hear the rumors about what the H200, with 141 GB of capacity (not 144 GB for some reason), might cost. But if this kind of memory price stratification holds – and we realize these are wild estimates – then that 141 GB of HBM3E is worth around $25,000 all by itself. But at such prices, an H200 “street price” would be somewhere around $41,000.
 
There should be a difference between the street price and the actual price negotiated between the manufacturer and a "big" customer.
I am not debating that, but we are talking about the difference in price between the MI300X and H100 and the related margins and bill of materials. We know the street price difference is large; it stands to reason that the big-customer price difference is also large.

According to your link, HBM3 sells for about $110 per GB, which would put the 192GB of HBM3 in the MI300X at roughly $21k. If that were true, AMD would be selling the MI300X at a loss (at a street price of $15k), which I don't think is the case. We know the margins on the MI300X are not that large, but it's definitely not being sold at a loss.
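As a sanity check on that arithmetic (taking the $110/GB figure from the quoted article at face value; actual HBM contract pricing is not public):

```python
# $/GB estimate taken from the article quoted above
HBM_PRICE_PER_GB = 110      # USD, street-price estimate, not contract
MI300X_HBM_GB = 192
STREET_PRICE = 15_000       # rumored MI300X price, USD

hbm_cost = HBM_PRICE_PER_GB * MI300X_HBM_GB
print(f"Implied HBM cost: ${hbm_cost:,}")                 # $21,120
print(f"Left over for everything else: ${STREET_PRICE - hbm_cost:,}")
# A negative result means the memory alone would cost more than the
# whole card sells for -- strong evidence AMD pays far less than the
# $110/GB street estimate under contract.
```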
 
Sorry, I don't have a direct link, but I'm hearing on Twitter that AMD may have been hacked and that a huge leak is incoming, with potential PS5 Pro and PS6 info.

Edit:
Link:
Ok so I'm quoting this again as yes... information is starting to get out there and if you look you can easily find it. A twitter account I saw was basically showing proof but soliciting payment for the unredacted information in a tweet, so obviously I'm not going to post them... but it's getting out there now. There's employee information and product information coming.. along with future product spec sheets and source code/firmware and finances.

God damn this really sucks for AMD.. I hate these ransom hackers..
 
A company spokesperson told Bloomberg, “Based on our investigation, we believe a limited amount of information related to specifications used to assemble certain AMD products was accessed on a third-party vendor site.”
A little less dramatic, according to official statements.
 

A little less dramatic, according to official statements.
This bit, "AMD's statement doesn't say so outright, but seems to suggest no customer or employee information was obtained.", worries me. If no customer or employee data was taken, they'd just say so; they wouldn't have to word it so carefully.

Bad news for AMD again. :(
 
Because making definitive statements could be cause for even more legal trouble if they later find out some of that data got out. Reporting a data breach is a precarious thing, because you somehow have to work out exactly what they got away with. Until you have a way to know for certain, you always err on the side of hedging statements.

Source: former director of infrastructure and security operations for a Fortune 250 org. I ran, among a lot of things, the Security Operations Center, the Computer Emergency Response Team, the Incident Operations team, and the Red/Blue teams. We were the ones responsible for figuring out where a thing started, where it moved to, and where / who it was being controlled by. Reacting to a breach is an incredibly delicate matter, even at a technology level.
 
Well, that's the actually bad part. If it had "only" been customer and employee data, it's just a matter of paying for people's credit monitoring and probably some CPRA/GDPR lawsuits costing dozens or maybe hundreds of millions.

But if it's all their engineering data for current and future products, that could literally cost them their business in the medium to long term.
 
Well, that's the actually bad part. If it had "only" been customer and employee data, it's just a matter of paying for people's credit monitoring and probably some CPRA/GDPR lawsuits costing dozens or maybe hundreds of millions.

But if it's all their engineering data for current and future products, that could literally cost them their business in the medium to long term.
Yup, I guess that would have implications for the contracts with Sony, Microsoft and so on.

How do you work out, over time, how much data, or exactly what data, was breached and taken from your servers? I mean, you'd have to be able to see the data transfer logs, right? Is there a way to know which data was extracted to a certain IP or computer? Just curious...
 
It really depends on the environment and whether you actually know (or can reasonably suspect) where they got in -- or at least have a grasp of what data they have, to give you a hint of where it might have been sourced.

It's worth noting: the vast majority of breaches are caught in the act and subsequently contained, and you'll never hear about them. Examples from my past typically started with a piece of malware on a PC somewhere, which begins reaching laterally across the same network to find other vulnerable PCs, and then will start reaching out into other routed networks (usually those linked to mapped network drives or open network sessions from the infected PCs) to begin scouring those areas for weaknesses. Most attempts are blocked at the source with high-quality EDR software; you can think of it as a far more advanced form of antivirus combined with a good heuristics analysis engine. Even if a brand-new piece of malware which has never been seen before lands on a PC with a good EDR solution, the behavior of a brand-new process having been spawned out of a browser parent process, which then immediately starts walking the network sessions list in the kernel, is immediately flagged as bad news and stopped.
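As a toy illustration of that kind of behavioral rule (a minimal sketch; the event fields are hypothetical, and real EDR engines correlate far more signals than this):

```python
# Toy behavioral detection: flag a brand-new process whose parent is a
# browser and whose first actions include enumerating network sessions.
BROWSERS = {"chrome.exe", "firefox.exe", "msedge.exe"}
SUSPICIOUS_ACTIONS = {"enumerate_net_sessions", "walk_smb_shares"}

def is_suspicious(event: dict) -> bool:
    """event: a hypothetical process-start record from an EDR sensor."""
    return bool(
        event["parent"].lower() in BROWSERS
        and event["age_seconds"] < 60                 # just spawned
        and SUSPICIOUS_ACTIONS & set(event["actions"])
    )

evt = {"parent": "chrome.exe", "age_seconds": 5,
       "actions": ["enumerate_net_sessions"]}
print(is_suspicious(evt))  # True -> kill the process, page the SOC
```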

If that piece of malware is far sneakier, maybe it just runs and does nothing for hours, or days, or maybe even longer. Maybe it only opens up one network session every few hours, to a single IP and port, just to test. What you're looking for is traffic that shouldn't be there: why is John Doe's laptop from Accounting trying to access the engineering file server? Why is our internet egress bandwidth somehow 5% larger today than typical, and why does that excess seem to be destined for an Akamai endpoint? Why did Jane Doe's account get authentication denied against ten completely different servers over the past two days, when she's never accessed any of those servers before?

A diligent organization will have detailed logging around processes which stopped and started on each machine, network session establishment and teardown for those same machines, where authentications were approved or rejected across the enterprise, and sample netflows for things crossing internal and external network boundaries. If you suddenly get an alarm for non-normal behavior, you get to start walking that all backwards -- where did this new network session come from? OK, now from that source PC: who has it been talking to, and who has been talking to it? Walk those backwards; are they normal? These five network sessions are non-normal; where did they come from, and whose account were they using to auth? Where else has that account logged in successfully in the last day? Two days? Weeks? You just keep following all the little silk threads until you finally have covered the entire spider web.
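A trivially simplified version of that "is this auth normal?" check might look like this (the log format here is made up; a real SIEM does this over far richer data and longer baselines):

```python
from collections import defaultdict

# Hypothetical auth log: (account, server) pairs from a baseline window
baseline = [("jdoe", "fileserver01"), ("jdoe", "mail01"),
            ("asmith", "eng-nas")]

seen = defaultdict(set)
for account, server in baseline:
    seen[account].add(server)

def flag_anomalies(events):
    """Yield auth events where an account touches a never-seen server."""
    for account, server in events:
        if server not in seen[account]:
            yield account, server

today = [("jdoe", "mail01"), ("jdoe", "eng-nas"), ("jdoe", "hr-db")]
for account, server in flag_anomalies(today):
    print(f"ALERT: {account} -> {server} (not in baseline)")
# ALERT: jdoe -> eng-nas (not in baseline)
# ALERT: jdoe -> hr-db (not in baseline)
```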

Really good threat actors don't smash and grab; they sit very quietly in your environment for weeks or months or longer, and will take very tiny and very carefully planned steps to discover the topology of your environment, unearth where the interesting things may lie, and then move data out in the most inconspicuous way possible. Whatever this data breach was at AMD, I can already tell you those threat actors had probably been in their org for weeks if not months. And if the SOC only discovered the breach when the news headlines hit the virtual airwaves, they have a world of pain waiting for them as they now try to reverse-engineer all the places that data could exist, and what and who touched that data in the prior weeks or months, and so the process grows...
 