One of the things that gives me concern with the concept of Navi being merely a set of compute dies with memory stacked atop them is the ratio of compute to memory, in terms of both logic density and power.
It might not be in the cards for Navi. The AMD proposals describe a succession of computing projects whose timelines stretch past 2020.
Navi does seem to be keyed to next-gen memory and scalability; however, the likely next-gen memory candidates such as HBM3, GDDR6, or perhaps cost-reduced HBM don't seem especially suited to the PIM model, or well-timed for Navi's 2018-2019 time frame.
AMD wants to apply the same GPU silicon to many markets. In addition to the memory and power concerns brought up earlier, there are other trade-offs in terms of density, inter-chip communication, and physical optimization that would compromise Navi if it tried to adopt a TOP-PIM model while remaining a GPU that can slip into the current product spaces.
The active interposer and chiplet scheme, however, is a significant jump in implementation and cost over passive interposers and MCM packaging. MCMs are well-understood, but even AMD's best effort there, the die-to-die links in EPYC, falls orders of magnitude short of the needs here. Passive interposers might get closer, but may be marginal even for Navi in terms of cost and complexity, and AMD would still need to do significantly better on interconnect power, bandwidth, and density.
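To put rough numbers on that gap (the EPYC link bandwidth is the commonly cited figure, the energy per bit is approximate, and the per-chiplet bandwidth requirement is my assumption, not anything AMD has stated):

```python
# Rough scale of the gap between EPYC's die-to-die links and what GPU
# chiplets would need. All figures are approximate or assumed.

ifop_bw_gbs = 42.6      # commonly cited bandwidth of one EPYC die-to-die link
ifop_pj_bit = 2.0       # roughly the published energy cost of those links

hbm2_stack_gbs = 256.0  # one HBM2 stack: 1024-bit bus at 2 Gbps/pin
# Assumption: chiplets acting as one GPU need cross-chiplet bandwidth on the
# order of local memory bandwidth; call it two stacks' worth per neighbor.
need_gbs = 2 * hbm2_stack_gbs

bw_gap = need_gbs / ifop_bw_gbs
link_power_w = need_gbs * 1e9 * 8 * ifop_pj_bit * 1e-12

print(f"bandwidth gap vs one IFOP link: {bw_gap:.0f}x")            # ~12x
print(f"power at IFOP energy/bit for {need_gbs:.0f} GB/s: {link_power_w:.1f} W")
```

Even if the bandwidth could be ganged up with more links, roughly 8W per inter-chiplet connection at EPYC-class energy per bit would eat most of the per-chiplet power budget discussed below.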
AMD's assumed active interposer is seemingly further out, and may be necessary for what it intends to do.
Nvidia's similar concept seems to be better documented, referencing nearer-term products as possible contemporaries rather than AMD's promised signaling advances, and it relies on fan-out packaging rather than an interposer. It has at least one daughter die carrying some of the more miscellaneous IO and other hardware that would otherwise be uselessly replicated across the chips.
Realistically, one or two stacks of HBMx sitting atop compute chips pumping out a hundred or more watts is not going to happen.
AMD's stacked-compute papers have assumed a maximum of around 10W for logic under the stack, with perhaps another ~5W for the memory stack itself, in order to keep DRAM temperatures at 85°C or less.
The latest exascale concept has a 160W node power budget (200W minus internode and system infrastructure power); spread across 8 GPU chiplets, that is 20W each, even assuming the CPU sections draw 0W.
More realistically, AMD's workload profiling puts off-interposer system memory access at 40-70W. Even assuming the CPU sections and other chiplets draw nothing while idling or supporting the GPUs, that leaves barely more than 10W sustainable per GPU chiplet.
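The arithmetic, for what it's worth (the 200W, 40W, and 40-70W figures are from the papers; the rest is just division):

```python
# Back-of-envelope per-chiplet power from the exascale node figures above.
# The inputs come from the papers; the split is my own arithmetic.

NODE_POWER_W = 200           # total node budget
INFRA_POWER_W = 40           # internode links and system infrastructure
MEM_ACCESS_W = (40, 70)      # profiled off-interposer memory access range
GPU_CHIPLETS = 8

compute_budget = NODE_POWER_W - INFRA_POWER_W    # the 160 W cited above

for mem_w in MEM_ACCESS_W:
    per_chiplet = (compute_budget - mem_w) / GPU_CHIPLETS
    print(f"memory access at {mem_w} W -> {per_chiplet:.1f} W per GPU chiplet")

# Prints 15.0 W and 11.2 W, and that still charges nothing to the CPU
# sections: right around the ~10 W under-stack limit the thermal papers assume.
```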
So the processor-in-memory concept needs to be seen as a subset of GPU functionality, which I think goes back to my initial idea about PIM: that memory-heavy fixed-function hardware would be located in the PIM, with the rest of the GPU somewhere else.
Some of that fixed-function hardware still needs relatively high-bandwidth connectivity to the programmable portions, and the GPU's latency-hiding isn't as strong for intra-block communication and synchronization as it is for texturing latency.
Unfortunately, without something like AMD's active interposer concept, plus a next-gen fabric that somehow allays the concern that even the active interposer is insufficient, its efforts so far look unsuitable.
Effectively it's describing a "one-size-fits-all" interposer, upon which varying configurations of chiplets can be deployed.
My interpretation is that this covers various forms of interposer, one of which seems to generally describe the active interposer concept from AMD's latest exascale concept. The exascale paper briefly mentions the interposer providing miscellaneous functionality and networking as part of its duties for supporting the chiplet.
What is more universal is the standardization of interface site formats, which chiplets can conform to and different interposer designs can combine. I think one rough interpretation is that it does for the interposer what various slots, package ball-outs, and sockets do for motherboards.
This follows from AMD's dis-integration goals with interposers: splitting functions and silicon processes apart and recombining them in a 2.5D package. Most players seem to be on the same page, although others like Nvidia and Intel are looking at not needing an active interposer, and stand a good chance of soon having in practice what AMD has on paper.
What AMD might be trying to do is dis-integrate to the point that various areas can be treated as a kind of Application-Specific Standard Product. It could sell chiplets containing just subsets of the overall GPU, and freely include other blocks or outside IP in a custom product. The silicon itself would become more generic as a result, although I still question how generic it can be, given what the exascale project does to the CU implementation. Among other things, clock rates are currently too high, and Vega is not moving in a promising direction. Unless AMD gives more concrete details on how it intends to improve interconnect bandwidth, power, and interface pitch, the drop in perf/mm2 and perf/W could readily eclipse the benefit of splitting up the die.
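For scale, the benefit of splitting the die is mostly a yield story, and it can be sketched with a standard negative-binomial yield model; the defect density, alpha, and die areas below are generic assumptions, not AMD's numbers:

```python
# Wafer silicon consumed per good "product" for one big die vs four chiplets,
# using a generic negative-binomial yield model. All parameters are assumed.

def die_yield(area_mm2, d0_per_mm2=0.002, alpha=3.0):
    """Negative-binomial model: yield = (1 + A * D0 / alpha) ** -alpha."""
    return (1.0 + area_mm2 * d0_per_mm2 / alpha) ** -alpha

BIG_AREA = 484.0                 # hypothetical monolithic GPU, mm^2
CHIPLET_AREA = BIG_AREA / 4      # same logic split into four chiplets

# Chiplets are assumed tested before packaging, so a bad one is discarded
# individually rather than condemning a whole package.
silicon_mono = BIG_AREA / die_yield(BIG_AREA)
silicon_split = 4 * CHIPLET_AREA / die_yield(CHIPLET_AREA)

print(f"monolithic: {silicon_mono:.0f} mm^2 of wafer per good die")   # ~1120
print(f"4 chiplets: {silicon_split:.0f} mm^2 of wafer per good set")  # ~611
```

Roughly a 45% silicon saving under these assumptions, and that margin is what the interposer cost, interconnect power, and any perf/mm2 regression all have to fit inside.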
Perhaps Vega's implementation of the Infinity Fabric is a first step, and we can read its non-progress in perf/mm2 and perf/W as what even that step costs, before taking the scaling hits of leaving the die. The CU area in AMD's marketing die shot is unambiguously much less dense than Polaris's, despite currently not delivering much more per CU.
Couldn't they put the shaders on the HBM2 dies as the bottom of the stack, run them slower, and make up for it with higher shader counts?
One of AMD's concepts had a big GPU die that connected to TOP-PIM or similar HBM stacks with mini-GPUs underneath them.
Workloads could try to leverage whichever silicon suited them best.
Granted, I think that was two or more of AMD's aspirational compute concepts ago.
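As for running wider and slower: the appeal is the usual voltage-frequency scaling argument. A toy sketch, assuming dynamic power goes as C·V²·f and a made-up linear V-f curve (none of these numbers correspond to a real part):

```python
# Toy dynamic-power model: P ~ C * V^2 * f, with voltage assumed to scale
# roughly linearly with frequency in the operating range. Illustrative only.

def dynamic_power(units, freq_ghz, v_at_1ghz=0.8, v_slope=0.3, c_per_unit=1.0):
    volts = v_at_1ghz + v_slope * (freq_ghz - 1.0)   # assumed linear V-f curve
    return units * c_per_unit * volts**2 * freq_ghz

# Same nominal throughput (units * freq held constant):
fast = dynamic_power(units=64, freq_ghz=1.5)   # fewer CUs, higher clock
wide = dynamic_power(units=96, freq_ghz=1.0)   # more CUs, lower clock

print(f"64 CU @ 1.5 GHz: {fast:.1f} (arb. units)")   # ~86.6
print(f"96 CU @ 1.0 GHz: {wide:.1f} (arb. units)")   # ~61.4

# The wider, slower configuration comes out lower, which is why designs with
# tight under-stack thermal limits push toward more, slower shaders.
```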