Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
We don't know if chiplets are cheaper, but consider that the AD103 die is 378.6 mm^2 on N5 ("4N") and performs roughly on par with a Navi 31 system, where the GCD is 306 mm^2 on a similar N5 process and the MCDs are 37.5 mm^2 each on a cheaper N6. Given that, I wouldn't be so certain that chiplets are in fact cheaper, at least in this particular comparison.
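To put rough numbers on that comparison, here is a minimal sketch that just adds up the silicon. The areas are the ones quoted above; the count of six MCDs on a full Navi 31 package is an assumption not stated in the post, and the names are purely illustrative. It says nothing about wafer prices or packaging cost.

```python
# Die areas in mm^2, as quoted in the post above.
AD103_AREA = 378.6
NAVI31_GCD_AREA = 306.0
NAVI31_MCD_AREA = 37.5

# Assumption (not stated in the post): a full Navi 31 package carries 6 MCDs.
NAVI31_MCD_COUNT = 6


def navi31_total_area(mcd_count=NAVI31_MCD_COUNT):
    """Total silicon area of one GCD plus `mcd_count` MCDs, in mm^2."""
    return NAVI31_GCD_AREA + mcd_count * NAVI31_MCD_AREA


total = navi31_total_area()
n6_area = NAVI31_MCD_COUNT * NAVI31_MCD_AREA

print(f"Navi 31: {total:.1f} mm^2 total ({NAVI31_GCD_AREA:.1f} on N5 + {n6_area:.1f} on N6)")
print(f"AD103:   {AD103_AREA:.1f} mm^2 on 4N")
print(f"Area ratio Navi 31 / AD103: {total / AD103_AREA:.2f}x")
```

Under that assumption the chiplet package uses noticeably more total silicon than AD103, although a good part of it sits on the cheaper node.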
 
I think RDNA3 had the potential to be more cost efficient, but that required AMD not to flop massively with the architecture. I mean, unless you're going for massive scaling beyond monolithic reticle limits, the whole point of using chiplets is to reduce costs. It can definitely be cheaper, but you can't flub your architectural gains and expect to benefit from these cost advantages against tough competition.

I have no doubt AMD would have liked to price their RDNA3 parts higher than they are now, but they can't because the performance isn't there.
 
Is that an official statement from AMD including packaging costs? There’s no guarantee the economics are the same for Nvidia.

Packaging is cheap versus large monolithic dies. There's no possibility of power/performance savings: right now the signal goes over a solder ball and loses power/latency, and at best it goes copper to copper and is kinda close to monolithic, but with heat problems. No one would be doing chiplets if it weren't cheaper right now. It's the same economics for Nvidia and Intel, which is exactly why both are going chiplets.

That being said, I don't know if graphics is chiplets for Nvidia this gen, or if, like AMD, it'll be compute only first (see CDNA2) and then graphics later.
 
The cheapest RDNA3 GPU is monolithic. The cheapest AMD CPUs are monolithic.

Hopefully this isn't veering off topic, but I feel the technical marketing used and the mindshare from AMD's Zen CPUs seem to result in some overgeneralizations with respect to chiplets.

If you just look at the technology fundamentally the benefit of chiplets over monolithic has to do with scaling. If the end product does effectively leverage said scaling (over the drawbacks of doing so) it can lead to things like a cost advantage over monolithic, but you can't just throw out the generalization that chiplets are cheaper, and take for granted that any end chiplet implementation is cheaper.

AMD's chiplet CPU implementation and its market dynamics vs. Intel's monolithic CPUs over the last few years are very different from what we are seeing so far in the GPU space. The CPUs have the compute component scaling to a point that one solution can serve consumers at $200 (or lower I guess depending on sales) through to HPC at over $10,000 apiece, from 6 cores through to 128 cores. That scaling through to the top end is where they have a huge advantage and are making massive inroads in enterprise, because the monolithic solutions simply can't practically match it. It's also implemented with relatively simple packaging.

That isn't what we are getting in GPUs so far or, as far as I know, what is being proposed.
 

That’s my thinking as well. Chiplets only start making sense when scaling to performance levels higher than what a monolithic chip can provide.

The economics for Nvidia are different because they will likely not use chiplets in the consumer space for performance levels that can be reached with a single die.
 
If you just look at the technology fundamentally the benefit of chiplets over monolithic has to do with scaling.
It can be scaling if we're talking big enterprise/DC/server stuff, but the main draw at the consumer end is basically cost reduction through better yields and, as RDNA3 does, being able to portion off parts of the processor that can be fabbed on older nodes without (much) compromise.

Both of these are pretty significant. The larger the die you want to make, the more defects are gonna pummel your yields. Doing smaller chiplets also means you're gonna have more 'prime' dies to use in higher end, higher margin SKUs.
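The die size vs. yield point can be illustrated with a simple Poisson defect model, where the chance of a die having zero defects is exp(-area x defect density). This is only a sketch: the defect density `D0` and the die sizes below are hypothetical values picked for illustration, not foundry data, and the model ignores wafer-edge losses and die harvesting.

```python
import math


def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of dies expected to have zero defects (simple Poisson model)."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


# Hypothetical defect density; real foundry figures are not given in this thread.
D0 = 0.1  # defects per cm^2

for area in (150, 300, 600):  # illustrative die sizes in mm^2
    y = poisson_yield(area, D0)
    # Silicon spent per defect-free die, ignoring edge losses and harvesting.
    print(f"{area:4d} mm^2: yield {y:.1%}, {area / y:.0f} mm^2 of wafer per perfect die")
```

In this model the fraction of perfect dies falls off exponentially with area, which is the effect being described; how much it matters in practice depends on how many imperfect dies can still be sold.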

Getting multiple GCDs working together for graphics will be a golden opportunity to reduce costs if Nvidia/AMD/Intel and packaging partners can figure it out, and I expect they will in time cuz doing very large dies on these increasingly expensive nodes is gonna become unfeasible for the consumer market before too long.

But for now, Nvidia is gonna run with what they know til such a breaking point, I expect. Certainly after Lovelace, they have reset consumer market pricing quite drastically, so they probably expect there's still room to play before making any chiplet shift just yet.
 
The larger the die you want to make, the more defects are gonna pummel your yields. Doing smaller chiplets also means you're gonna have more 'prime' dies to use in higher end, higher margin SKUs.

Getting multiple GCDs working together for graphics will be a golden opportunity to reduce costs if Nvidia/AMD/Intel and packaging partners can figure it out, and I expect they will in time cuz doing very large dies on these increasingly expensive nodes is gonna become unfeasible for the consumer market before too long.

Is that true though? For a given performance level it's not yet proven that multiple smaller heterogeneous dies are cheaper overall than harvesting a big die on an expensive node. RDNA3 certainly isn't proof of that.
 
I was gonna say NVIDIA probably has some of the best margins they've ever had. Much better than AMD. There is a disconnect between the theory and the reality of chiplets for GPUs at the moment.
 
Eh, these margins are from DC products though, which are partially "chiplet" based in Nvidia's case too. That's even disregarding the whole AI boom which is responsible for said margins. So this isn't any proof either.
 
I'm very skeptical that multiple dies on the same process are cheaper *if* there's a lot of redundancy in the monolithic die. What percentage of an AD102 die can be defective and still result in a viable SKU? It looks to me like less than 15% absolutely has to work (probably less than 10% even, e.g. PCI-Express, video decode/encode, central control logic for processing & distributing commands, various multiplexers/buses, etc...) - that means there's potentially less non-redundant logic than on many smartphone chips where practically everything has to work! Now let's say you moved all that to a smaller die that you can test separately, but you've decreased your packaging yields and increased packaging costs while adding a few % of area for the cross-die PHYs/TSVs... you really haven't gained that much I think, and it's a bit academic whether it's 1% better or 1% worse...
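The redundancy argument can be sketched with the same kind of Poisson model: if only a small "must work" fraction of the die is non-redundant, a die is sellable as some SKU whenever that fraction is defect-free. Everything here is an assumption for illustration (the 600 mm^2 die area, the defect density, the function names), and the sketch deliberately ignores faults that kill the whole chip no matter where they land.

```python
import math


def zero_defect_yield(area_mm2, defects_per_cm2):
    """Fraction of dies with no defect anywhere (simple Poisson model)."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


def harvestable_yield(area_mm2, defects_per_cm2, critical_fraction):
    """Fraction of dies with no defect in the non-redundant area.

    Defects in the remaining (redundant) area are assumed to be fused off.
    """
    return zero_defect_yield(area_mm2 * critical_fraction, defects_per_cm2)


DIE_AREA = 600.0  # mm^2, hypothetical large GPU die
D0 = 0.1          # defects per cm^2, hypothetical

for frac in (0.10, 0.15, 1.00):
    y = harvestable_yield(DIE_AREA, D0, frac)
    print(f"critical fraction {frac:.0%}: {y:.1%} of dies usable in some SKU")
```

With a 10-15% critical fraction almost every die is usable in this toy model, which is why moving the non-redundant logic to its own die buys little on yield alone.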

But here's a dirty little secret: a *LOT* of GPU dies are probably much more functional than the SKUs they end up in. This depends on the process generation and maturity, as sometimes the process really has fairly low yields and you might struggle to build enough of the top bins... but the rest of the time, a lot of the distinction is artificial (at least in terms of the chips, you still reduce costs with lower TDPs resulting in simpler board designs etc...) - and you might not be able to reliably predict TSMC's yields 12-18 months in advance of mass production so you can't really optimise around it in the design phase too much either.

RDNA3's approach is really elegant there: they can remove an MCD completely for lower-end SKUs, which will save them a lot more money than disabling a fully functional memory controller. The PHY on the GCD isn't free but it's a lot smaller than the area/cost of an MCD. This is also similar (but in the other direction) to how AMD has a single I/O die for multiple SKUs with different numbers of CPU chiplets. There are also verification & tape-out cost benefits if you reuse the same chiplet for different products or even generations, e.g. for AMD's CPUs it's less verification effort to tape out one I/O die and one CPU chiplet than it would be to tape out two monolithic CPU+IO dies.

But the biggest benefit of chiplets for GPUs/CPUs is the ability to use different processes for different chips. That includes N5 for GCD and N6 for MCD on RDNA3. I'm convinced AMD is right that RDNA3 chiplets are significantly more cost effective than a hypothetical monolithic Navi31, given the different processes and the ability to reduce the number of MCDs (which is orthogonal to whether RDNA3 is more or less cost effective than Ada).

My new favourite example of this is Intel's Clearwater Forest which they detailed a bit more yesterday at their foundry event:

https://spectrum.ieee.org/intel-18a

In Clearwater Forest, billions of transistors are divided among three different types of silicon ICs, called dies or chiplets, interconnected and packaged together. The heart of the system is as many as 12 processor-core chiplets built using the Intel 18A process. These chiplets are 3D-stacked atop three “base dies” built using Intel 3, the process that makes compute cores for the Sierra Forest CPU, due out this year. Housed on the base die will be the CPU’s main cache memory, voltage regulators, and internal network. “The stacking improves the latency between compute and memory by shortening the hops, while at the same time enabling a larger cache,” says senior principal engineer Pushkar Ranade.

Finally, the CPU’s I/O system will be on two dies built using Intel 7, which in 2025 will be trailing the company’s most advanced process by a full four generations. In fact, the chiplets are basically the same as those going into the Sierra Forest and Granite Rapids CPUs, lessening the development expense.

So it's Intel 18A for the CPU cores (and presumably L1+L2 cache), Intel 3 for the base die that includes the L3 cache (no scaling/power benefit to doing the L3 on 18A since SRAM has effectively stopped scaling, but you still want L1 and probably L2 on the same die to minimise latency), and Intel 7 for the I/O dies (since I/O has been scaling even worse than SRAM for longer), for a total of 12xCPUs + 3xBase + 2xIO = 17 chiplets. Assuming their packaging yields are good, I expect this to be a lot cheaper than a single 18A monolithic die.
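The "assuming their packaging yields are good" caveat matters because assembly losses compound across every die in the package. A minimal sketch, using the chiplet counts from the article and purely hypothetical per-die attach yields (the names and the independence assumption are mine):

```python
# Chiplet counts per package, as described in the quoted article.
CHIPLETS = {"cpu_18a": 12, "base_intel3": 3, "io_intel7": 2}

total_chiplets = sum(CHIPLETS.values())


def package_yield(per_die_attach_yield, die_count):
    """Probability that every die attaches successfully, assuming independent steps."""
    return per_die_attach_yield ** die_count


print(f"chiplets per package: {total_chiplets}")
for p in (0.999, 0.99, 0.97):  # hypothetical per-die attach yields
    print(f"per-die attach yield {p:.1%} -> package yield {package_yield(p, total_chiplets):.1%}")
```

Even a small per-die loss adds up over 17 attach steps, so the cost case depends on the packaging process being very mature.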

Similarly, I don't think NVIDIA has much reason to go for "multiple GCD-like chiplets" ala MI300X short-term (especially given how profitable their ex-Mellanox datacenter networking business is), but they'd be missing a trick if they don't go for heterogeneous chiplets given how bad SRAM/IO scaling is on TSMC N3E. At their current scale, they have to consider the risk of supply constraints for advanced packaging though, so they might decide to be more conservative for that reason.
 
Is that true though? For a given performance level it's not yet proven that multiple smaller heterogeneous dies are cheaper overall than harvesting a big die on an expensive node. RDNA3 certainly isn't proof of that.
All else being equal, yes, it should be cheaper, especially if you're not jumping onto the latest node on Day 1, which Nvidia and AMD don't do.

Again, with RDNA3, it doesn't seem to have worked out because something is wrong with it in terms of performance. You of course need to actually be able to leverage newer nodes not just for density, but performance and efficiency as well. Nobody is targeting static performance and efficiency though, and you will run into limits with those if you stick with older nodes. Also, RDNA3 uses two different GCDs, so it's still not the optimal way to do things yet, either. Being able to produce just a single GCD that you scale up and down for the whole range (ala Ryzen) lets you reduce design costs and simplify testing and manufacturing.

Done right, it should ultimately be a win. I can see lower end GPUs in a range using older nodes going forward with an otherwise larger die though, sure. There's simply less there to 'win' through yields and spinning off I/O/cache onto a separate chiplet.
 
Similarly, I don't think NVIDIA has much reason to go for "multiple GCD-like chiplets" ala MI300X short-term (especially given how profitable their ex-Mellanox datacenter networking business is), but they'd be missing a trick if they don't go for heterogeneous chiplets given how bad SRAM/IO scaling is on TSMC N3E. At their current scale, they have to consider the risk of supply constraints for advanced packaging though, so they might decide to be more conservative for that reason.

Nvidia's SRAM is currently all L1 & L2 though. There's no L3 to carve out into chiplets. Question is whether following AMD down that path is a net win from a performance and margin perspective.
 
What percentage of an AD102 die can be defective and still result in a viable SKU? It looks to me like less than 15% absolutely has to work

Note that you cannot harvest every defective die even if every part of the die is redundant. Faults don't just mean "transistor doesn't work", there are plenty of potential faults that trash the entire chip even if the part it occurred in isn't important. The canonical example is a direct short between power and ground, so when you give it any power it melts.
 
Nvidia's SRAM is currently all L1 & L2 though. There's no L3 to carve out into chiplets. Question is whether following AMD down that path is a net win from a performance and margin perspective.

Their L2 is equivalent to AMD's L3 in a lot of ways, just a giant shared SRAM cache meant to soak up traffic that would otherwise go out to RAM. AMD just appears to have been designing for chiplet SRAM from the start, while Nvidia has not.

That being said, since chiplet SRAM seems to be the cost win it's meant to be (heck we see it in mid range consumer products from AMD now) the first consumer graphics chiplets we see from Nvidia might be SRAM.
 
Only if Nvidia breaks from their two-level cache hierarchy. My understanding is that Nvidia really tries to avoid the chiplet/tile route for their designs and keep the cache subsystem streamlined even at the cost of compensating it with faster and more expensive GDDR memory. Power consumption might be one of the motivations, since any high-speed interface going off die will degrade power/perf metrics and signal latency. The Infinity Cache implementation in Navi31, while providing more throughput than the previous generation, takes a significant hit in latency.
 
Ironically, A100/H100’s split cache seems like a perfect design for “two identical chiplets”, and as per chips & cheese’s testing, the latency to the “Far L2” is surprisingly high, so the impact of using actual chiplets with die-to-die links might not be that high. I’m not sure they would want to stick with this L2 design anyway as AD102 has proven they can make a larger combined cache that is also lower latency.

Anyway, my argument about CPU L2 not being suitable for chiplets is really specific to CPUs which are much more latency sensitive, and NVIDIA L2 is more like a CPU L3 as others have said.

BTW remember AMD actually has 4 levels of cache (32KiB L0, L1 for multiple WGPs, L2 potentially per GCD, and L3) versus NVIDIA’s 2 levels. I could easily imagine NVIDIA adding a per-GPC “L1.5” especially as cluster shared memory on H100 is moving them in the direction of GPCs being more important in terms of the memory hierarchy (well, they always were as the MMUs/TLBs are per-GPC and the L2 is in physical address space, but that doesn’t have as much direct software impact).
 