With the chiplet approach, it may be possible to create a solution that wouldn't present itself as multi-GPU. An interview with Raja Koduri about the future being more about leveraging explicit multi-GPU with DX12 and Vulkan seems to contradict that. On top of that, the chiplet method would be a more extreme break with the usual approach, since it's potentially fissioning the GPU rather than just making it plural. Just counting on developers to multi-GPU their way around an interposer with two small dies is the sort of half-measure whose results we've seen from RTG so far.

Also, the leaked slide with GPU and server segments positions Navi 10 and 11 as the direct successors to Vega 10 and 11 respectively, at least as far as that slide is concerned. I don't see any sign of the speculated multi-die approach for Navi on it, although I can't rule it out either. (Unless I'm missing something, this slide is the only mention of specific codenames and positioning for Navi so far.)
Also, comparing Navi's timeline with the filing date of the interposer patent, the approach may be too new to be rolled into that generation.
There are some elements of the GPU that might make it amenable to a chiplet approach, in that much of a GPU is actually a collection of semi-autonomous or mostly independent processors and various bespoke networks. Unlike a tightly integrated CPU core, a GPU has various fault lines in its architecture, visible in the different barriers at the API level and even in some of the ISA-exposed wait states in GCN, where independent or weakly coupled domains already trade information at comparatively poor latency. A chiplet solution would be something of a return to the days before VLSI, when a processor pipeline was made of multiple discrete components, none of which could perform a function without the substrate's common resources and the other components.
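For anyone who hasn't run into those wait states: GCN keeps counters of outstanding operations (vmcnt, lgkmcnt, expcnt) and shader code has to wait on them explicitly with s_waitcnt before consuming results from the memory, LDS/GDS, or export paths. Here's a toy sketch of that idea in Python, with made-up latencies and none of the real scheduling rules, just to show how the weak coupling looks from the shader's side:

```python
# Toy model of a GCN-style ISA-exposed wait counter (vmcnt-like):
# issuing a request to another domain bumps a counter of outstanding ops,
# and an explicit wait instruction stalls until the count drops to a
# threshold. All numbers here are illustrative, not real hardware values.
from collections import deque

class WaitCounter:
    def __init__(self, latency_cycles):
        self.latency = latency_cycles   # assumed round-trip latency to the other domain
        self.in_flight = deque()        # completion times of outstanding requests
        self.cycle = 0

    def issue(self):
        """Issue a request; returns immediately, completion arrives later."""
        self.in_flight.append(self.cycle + self.latency)

    def s_waitcnt(self, threshold):
        """Stall until at most `threshold` requests remain outstanding."""
        while len(self.in_flight) > threshold:
            self.cycle = max(self.cycle + 1, self.in_flight[0])
            while self.in_flight and self.in_flight[0] <= self.cycle:
                self.in_flight.popleft()

    def step(self, n=1):
        """Advance time by n cycles of independent work."""
        self.cycle += n
        while self.in_flight and self.in_flight[0] <= self.cycle:
            self.in_flight.popleft()

# A wavefront issues a batch of loads, does some independent ALU work,
# then has to wait on the counter before it can use the results.
wf = WaitCounter(latency_cycles=350)    # made-up memory latency
for _ in range(4):
    wf.issue()
wf.step(100)                            # only 100 cycles of work to hide it
wf.s_waitcnt(0)                         # roughly "s_waitcnt vmcnt(0)"
print("stalled until cycle", wf.cycle)  # ~350: the remainder wasn't hidden
```

The point being that software already has to account for those cross-domain latencies explicitly, which is part of why those boundaries look like candidates for a split.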
It's such a major jump in complexity relative to the mediocre returns of AMD's 2.5D integration thus far that I would expect to see more intermediate progress before it happens.
I'm open to being pleasantly surprised, although not much has shown up to indicate that GCN as we know it is really moving that quickly to adapt to it. One possible fault line would put the command processor and front ends on one side of a divide, with dispatch handling and wavefront creation on the other, given the front end's somewhat modest requirements and the general lack of benefit in replicating it on every die. Another might be at GDS and ROP export, where the CU arrays already exist as an endpoint on various buses.
However, one item of concern, now that we've seen AMD's multi-die EPYC solution and Infinity Fabric starting to show itself, is how little the current scheme differs from separate sockets from a latency standpoint. The bandwidths also fall seriously short for some of the possible GPU use cases. GCN's ability to mask presumably short intra-GPU latencies doesn't match up with what Infinity Fabric delivers even for intra-die accesses.
There may be other good reasons for what happened with Vega's wait states for memory access, but the one known area where AMD has admitted to adding Infinity Fabric is also the one place where Vega's ISA quadrupled its wait count, which gives me pause.
That, and the bandwidth, connection pitch, and power efficiency AMD has demonstrated or speculated about so far, really don't seem good enough for taking something formerly intra-GPU and exposing it.
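To put some very rough numbers on the latency side (these are guesses for illustration, not measured figures): by Little's law, the amount of outstanding work needed to cover a latency grows linearly with that latency, so every hop a fabric adds has to be paid for in occupancy and queue depth somewhere.

```python
# Little's law sketch: bytes in flight = latency * bandwidth.
# Every number below is an assumption picked for illustration.

def outstanding_requests(latency_ns, bandwidth_bytes_per_s, request_bytes):
    """Requests that must be in flight to keep the link busy at that latency."""
    bytes_in_flight = latency_ns * 1e-9 * bandwidth_bytes_per_s
    return bytes_in_flight / request_bytes

REQUEST = 64 * 4      # a 64-wide wavefront issuing 4-byte loads (assumed)
BANDWIDTH = 500e9     # ~HBM2-class, 500 GB/s (assumed)

for label, latency_ns in [("intra-GPU memory path (guess)", 300),
                          ("plus an on-package fabric hop (guess)", 500),
                          ("plus a socket-like fabric hop (guess)", 800)]:
    n = outstanding_requests(latency_ns, BANDWIDTH, REQUEST)
    print(f"{label}: ~{n:.0f} requests in flight")
```

Whatever the real figures turn out to be, the queues, wait counters, and occupancy GCN has today were sized for something like the first of those cases, not the last.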
Nvidia's multi-die scheme includes its own interconnect for on-package communication, with much better power and bandwidth than AMD's Infinity Fabric links, and even then Nvidia seems to have gone only part of the way AMD's chiplet proposal does. Nvidia does seem to be promising a sort of return to the daughter-die days of G80 and Xenos, with the addition of an on-package die containing logic that wouldn't fit on the main silicon. In this case, it would be hardware and IO that would be wasted if mirrored on all the dies.
Glad to see that I'm not the only one on this.
But even if that's the way Navi will go, I don't see how, say, two 400 mm² dies on an interposer could be cheaper than one 750 mm² die with a few shader cores disabled. Yield is so easy to recover with a bit of redundancy, especially when the size of an individual core is getting smaller and smaller.
And that's assuming the two-die solution would match the performance of the single-die one, which I doubt as well.
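A back-of-the-envelope version of that comparison, using a Poisson yield model with an assumed defect density (real numbers for a new node are anyone's guess), and crediting the big die with salvage through disabled CUs:

```python
# Crude yield comparison: one big die with redundancy vs. two smaller dies.
# Defect density, wafer size, and salvage rate are all assumptions.
import math

def products_per_wafer(die_area_mm2, dies_per_product=1, salvage_fraction=0.0,
                       defect_density_per_cm2=0.2, wafer_diameter_mm=300):
    """Good products per wafer under a Poisson yield model.
    salvage_fraction: share of defective dies recovered by disabling
    redundant units (e.g. fusing off a few CUs)."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    # standard gross-die approximation with an edge-loss correction term
    gross = (wafer_area / die_area_mm2
             - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))
    die_yield = math.exp(-defect_density_per_cm2 * die_area_mm2 / 100.0)
    good = gross * (die_yield + (1 - die_yield) * salvage_fraction)
    return good / dies_per_product

print(f"1x750 mm2, CUs fused off: {products_per_wafer(750, salvage_fraction=0.5):.0f} per wafer")
print(f"2x400 mm2, no salvage:    {products_per_wafer(400, dies_per_product=2):.0f} per wafer")
```

With those (admittedly made-up) inputs the big die with redundancy comes out ahead; the small dies could use redundancy too, but then they're also spending extra area on the duplicated IO and glue needed to stitch them back together.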
Perhaps they are hedging against issues they think they might hit with early ~7nm (whatever label those nodes end up with) or 7nm without EUV, or they have reason to expect the node to be that bad for a while?
Nvidia seems to be proposing moving off the logic that would otherwise be replicated unnecessarily in a multi-die solution. AMD's chiplet solution takes it to the point that a lot of logic can be plugged in or removed as desired. The socket analogy might go further, since AMD's chiplet and interposer scheme actually leaves open the possibility of removable or swappable chiplets that haven't been permanently adhered to their sites.
The ASSP-like scheme may also allow more logic blocks to be applied across more markets, if the fear isn't that a large die cannot recover yields but that it cannot cover enough segments to get the necessary volume.
Whether two 400 mm² dies that can use their area more fully, because a 50 mm² daughter die gives each of them more room, would be sufficient, I wouldn't know. Given the bulking up in cache, interface IO, and other overheads just to make those dies useful again, it seems risky.
Just putting two regular GPUs on an interposer or MCM, with the complex integration, raised connectivity stakes, and redundant logic that entails, seems like a milquetoast explicit multi-GPU solution, and one not far removed from what Koduri promised. It would be nice if AMD demonstrated integration tech and an interconnect that could even do that sufficiently.