So a single 64CU GPU would work... but a GPU using two 32CU chiplets wouldn’t? I’m not following the logic here.
In certain parts of the architecture, that is the case, given what we know.
Certain things that are internal to a single GPU are multiple times the width of what is seen externally, like the very wide fabric between the L2 and the CUs and/or L1.
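To put rough numbers on that gap, here's a quick back-of-envelope sketch; every figure in it is an assumed round number for illustration, not an actual AMD spec:

```python
# Back-of-envelope comparison of on-die fabric bandwidth vs. an off-die link.
# All figures below are assumed round numbers, not real specs.

clock_ghz = 1.8               # assumed GPU/fabric clock
l2_slices = 16                # assumed number of L2 slices
bytes_per_slice_per_clk = 64  # assumed width per slice toward the L1/CUs

internal_gbs = l2_slices * bytes_per_slice_per_clk * clock_ghz   # GB/s
print(f"Internal L2<->L1 fabric: ~{internal_gbs:.0f} GB/s")       # ~1843 GB/s

# A chip-to-chip link in the same rough class as today's external fabrics,
# again just an assumed figure.
external_link_gbs = 100
print(f"Off-die link: ~{external_link_gbs} GB/s "
      f"({internal_gbs / external_link_gbs:.0f}x narrower)")
```

With numbers anywhere in that ballpark, the external fabric is an order of magnitude or more short of what the CUs see internally.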
The data fabric AMD uses for multi-chip connectivity is in the wrong place in that hierarchy, as it sits outside the L2. Many parts of the architecture rely on the L2 as the point of coherence/consistency, and the L2 itself isn't designed to be split between chips and still function correctly. Other items in the graphics pipeline also don't show signs of being designed to work correctly if the graphics context they support exists on more than one chip.
Navi's introduction of the L1 cache might make this problem worse, since it's yet another layer of cache that either isn't coherent or requires more special handling and a loss of performance to keep its contents from going stale.
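To make the staleness point concrete, here's a toy model of a cache layer sitting below the point of coherence. It's purely conceptual (nothing to do with AMD's actual protocol): a reader on the other "chiplet" keeps seeing old data until its private cache is explicitly invalidated.

```python
# Toy model of a non-coherent cache layer: two "chiplets" each cache lines
# privately above a shared backing store (standing in for the point of
# coherence). Conceptual sketch only, not AMD's actual design.

backing_store = {"pixel": 0}

class PrivateCache:
    def __init__(self):
        self.lines = {}
    def read(self, addr):
        # Serves whatever is already cached, even if it's stale.
        if addr not in self.lines:
            self.lines[addr] = backing_store[addr]
        return self.lines[addr]
    def write(self, addr, value):
        self.lines[addr] = value        # local copy updated...
        backing_store[addr] = value     # ...and written through to the shared level
    def invalidate(self):
        self.lines.clear()              # the "special handling" cost

chiplet_a, chiplet_b = PrivateCache(), PrivateCache()

chiplet_b.read("pixel")         # B caches the old value
chiplet_a.write("pixel", 42)    # A produces new data
print(chiplet_b.read("pixel"))  # 0 -> stale, even though the shared level is up to date
chiplet_b.invalidate()
print(chiplet_b.read("pixel"))  # 42 -> correct only after an explicit invalidation
```

Every one of those invalidations is the kind of extra traffic and stalling you'd be signing up for if the layers below the point of coherence get split across chips.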
There would need to be substantial architectural changes, and "architectural" in this case is closer to the original meaning: the model presented to programmers. It's not really defined how a graphics pipeline is supposed to work if there's more than one heavyweight context, whereas there's a long-established sense of how CPU architecture works.
There are possible directions solutions could take, and some signs that it might eventually be possible to get enough bandwidth (at least if combined with large architectural changes made just for MCM). But whether a console maker would get enough out of that to foot the bill seems questionable, and AMD has basically said it's not committed to the idea--though it might think about it later.
Some of my baseless theorizing as to why Sony/MS may be aiming for 12+ TF: perhaps the chances of a Pro version of either console in the next 3-4 years are rather slim, simply due to the slowdown in process tech.
It would have been difficult to achieve a major separation from the original current-gen consoles at 10nm, and the mid-gen refreshes made that problem significantly worse. Getting ~2x the Xbox One X seems at least somewhat practical, given that 7nm is a somewhat better than average node transition--and also something like a bare minimum to get some daylight between the mid-gen refresh and the next gen. Even from the base consoles, this jump might be just enough not to be considered too disappointing, relative to how some were underwhelmed by the power jump from last gen to current.
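For scale, the rough math behind a ~2x Xbox One X target (the CU count and clock in the second line are just illustrative picks to hit the number, not leaked specs):

```python
# Rough FLOPS math for a ~2x Xbox One X target. The Xbox One X config is
# public; the "~2x target" CU count and clock are illustrative picks only.

def tflops(cus, clock_ghz, alus_per_cu=64, flops_per_alu=2):
    return cus * alus_per_cu * flops_per_alu * clock_ghz / 1000

print(f"Xbox One X: {tflops(40, 1.172):.1f} TF")   # ~6.0 TF
print(f"~2x target: {tflops(56, 1.70):.1f} TF")    # ~12.2 TF with 56 CUs @ 1.7 GHz
```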
I was wondering how the next gen could stand out enough, and it seems other features or design elements are being leveraged, given the diminishing returns on raw scaling and on how perceptible a given performance jump actually is.
I’ll submit that such price estimates mean little when you are making PS5-size orders. “We’ll buy 100 million stacks over the next three years.” Those are orders that you plan/finance your fab structure around, and commodity pricing is irrelevant. Sony would pay less, is about all we can say.
The throughput of those involved in the package integration might indicate whether this would happen, and if that's not expected to scale sufficiently, the memory manufacturers aren't going to buy into the volume promise. There's also the question of whether it's worth committing to 100 million stacks at console component prices versus possible organic growth in the AI, networking, and HPC markets that are all willing to pay so much more. Why commit to volume production for a sub-retail buyer?
If ReRAM is as fast as that presentation claimed, then PS5 could go with 8/16GB HBM + 256GB of ~50GB/s ReRAM [+M.2 slot for NVMe storage].
The power numbers would hopefully be improved from that presentation, where the upper end was closing in on 30W of power consumption. While I haven't been able to find the chip-only power limit for the Radeon 5700 XT, at least some GPU-Z shots show load can get to at least 120W out of a 225W max. Taking out inefficiency in VRM conversion and miscellaneous consumption, GDDR6 should be taking 60-70W as a possible max. Committing to a solid-state solution with that ~30W consumption is akin to giving up about half of a 256-bit GDDR6 bus's power budget, or possibly going from ~768GB/s to ~512GB/s. While having a high-speed solid-state solution is very helpful, hopefully other factors would intervene so that I wouldn't need to weigh losing that much memory bandwidth.
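The bandwidth side of that trade-off, spelled out; the bus widths, 16Gbps pin speed, and the 60-70W GDDR6 figure are just the assumptions from above:

```python
# Bandwidth/power trade-off sketch. Bus widths, pin speed, and the GDDR6
# power estimate are the assumptions from the post, not known console specs.

def gddr6_bandwidth(bus_bits, gbps_per_pin=16):
    return bus_bits * gbps_per_pin / 8   # GB/s

full = gddr6_bandwidth(384)   # ~768 GB/s
cut  = gddr6_bandwidth(256)   # ~512 GB/s
print(f"384-bit: {full:.0f} GB/s, 256-bit: {cut:.0f} GB/s, "
      f"lost: {full - cut:.0f} GB/s")

# If a full 256-bit GDDR6 setup runs ~60-70 W, then ~30 W of ReRAM is
# roughly the power of half that bus -- i.e. the 128 bits / ~256 GB/s above.
gddr6_watts_256bit = 65   # midpoint of the 60-70 W guess
reram_watts = 30          # upper end from the presentation
print(f"ReRAM budget ~= {reram_watts / gddr6_watts_256bit:.0%} "
      f"of a 256-bit GDDR6 bus")
```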
Honestly, it looks like the memory paging magic is HBCC.
At least so far, HBCC hasn't shown up in unified memory situations. The lack of an expansion bus and separate memory pool may make the win less noticeable, and the APU GPUs potentially have a more fine-grained way of interacting with paged memory, since there's an XNACK mask that can track faults at per-lane granularity.
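For what per-lane fault tracking buys you, here's a toy sketch. It's conceptual only: the real XNACK mechanism is a hardware replay mask, and the page size, wave width, and addresses here are made up for illustration.

```python
# Conceptual sketch of per-lane fault tracking a la XNACK: each lane of a
# wave issues its own address; only lanes whose pages aren't resident get
# flagged for replay once the pages are brought in. Purely illustrative.

PAGE_SIZE = 4096
resident_pages = {0, 1}        # pages currently backed by local memory

def xnack_mask(lane_addresses):
    """Return a bitmask with one bit set per faulting lane."""
    mask = 0
    for lane, addr in enumerate(lane_addresses):
        if addr // PAGE_SIZE not in resident_pages:
            mask |= 1 << lane
    return mask

# An 8-lane toy wave (real waves are 32/64 lanes) with scattered addresses.
addrs = [0x0100, 0x1100, 0x5100, 0x0200, 0x9100, 0x1200, 0x5200, 0x0300]
mask = xnack_mask(addrs)
print(f"faulting lanes: {mask:08b}")     # only lanes touching non-resident pages

# Page in the missing pages, then replay just those lanes.
resident_pages.update(a // PAGE_SIZE for i, a in enumerate(addrs) if mask >> i & 1)
print(f"after paging:   {xnack_mask(addrs):08b}")  # all clear
```

The point being that only the faulting lanes need servicing and replay, rather than treating the whole access as one coarse miss against a separate pool.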