Not sure if there's a misunderstanding on my side, but all the 2D resources for a GPU are 2D tiled and completely separable. The metadata for depth/color compression is also tiled and separable. I see no problem with concurrent processing of different areas of the same resources by different chiplets, because all the access is through state-enforced locality. There is exactly one edge case, which is UAVs that overlap in time or location, and that's either restrictable by "Concurrent Access Views" or a tuned coherency protocol. It's not difficult to rethink algorithms to not utilize the most contentious and hazardous memory access pattern.
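To make the "state-enforced locality" point concrete, here's a minimal sketch of the kind of static tile-to-chiplet ownership I have in mind; the tile size, chiplet count and checkerboard split are all made-up illustration values, not anything AMD has described:

```python
# Hypothetical sketch: static ownership of 2D-tiled surface regions by chiplet.
# Tile size and chiplet count are illustration values, not actual hardware.

TILE_W, TILE_H = 64, 64      # assumed tile dimensions in pixels
NUM_CHIPLETS = 2

def owning_chiplet(x, y):
    """Map a pixel to the chiplet that owns its tile (checkerboard split)."""
    tile_x, tile_y = x // TILE_W, y // TILE_H
    return (tile_x + tile_y) % NUM_CHIPLETS

# Every access to a given tile resolves to exactly one chiplet, so two chiplets
# working on different screen regions never touch the same tile (or its metadata).
assert owning_chiplet(10, 10) == owning_chiplet(63, 63)   # same tile, same owner
assert owning_chiplet(10, 10) != owning_chiplet(70, 10)   # neighbouring tile, other owner
```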
The RAW hazard handling for resources that have associated metadata includes flushing the affected caches. Vega can avoid flushing its L2, if its alignment is respected, but the metadata cache would still be flushed. This is a heavier operation since it is a pipeline flush that includes among other things a microcode engine stall to avoid a race condition within the command processor. The operation and its front-end stall are not quite transparent in the single-GPU case to start with.
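As a toy model of that RAW hazard (not AMD's actual mechanism, just the ordering problem): a consumer can't read a compressed surface correctly until the producer's dirty metadata has been made visible, which is what the flush buys you at the cost of the stall described above:

```python
# Toy model of the RAW hazard on a compressed surface: the consumer only sees
# correct data once the producer's dirty metadata has been written back.
# Purely an illustration of the ordering problem, not the real hardware.

class MetadataCache:
    def __init__(self):
        self.dirty = {}          # tile -> compression state not yet in memory
        self.memory = {}         # tile -> compression state visible to readers

    def rop_write(self, tile, state):
        self.dirty[tile] = state # ROP updates metadata in its own cache

    def flush(self):             # the "heavy" step: write back and invalidate
        self.memory.update(self.dirty)
        self.dirty.clear()

    def texture_read(self, tile):
        return self.memory.get(tile, "uncompressed")

mdc = MetadataCache()
mdc.rop_write(tile=(3, 7), state="2:1 compressed")
print(mdc.texture_read((3, 7)))  # stale: "uncompressed" -> RAW hazard
mdc.flush()                      # the barrier the driver has to insert
print(mdc.texture_read((3, 7)))  # correct: "2:1 compressed"
```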
This is why Xenos worked quite well...
To clarify, is the chiplet stacked memory a part of the GPU's general memory pool, or is there more than one path to the stack? Xenos had a dedicated path for its GPU output, separate from and higher bandwidth than the DDR bus.
My interpretation of the stacked solution is that there would be one signal path to the stack, which the other proposals matched with the HBM's bandwidth.
The EDRAM was sized to be able to support a worst-case amplification of that 32GB/s payload to 256GB/s internally, although that included the choice to not use the compression techniques available at the time. That 8x amplification of export bus to ROP traffic might not hold as strongly once basic color sample compression or now DCC come into play.
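The arithmetic behind those figures, for what it's worth (the 2:1 compression ratio at the end is purely illustrative):

```python
# Back-of-envelope arithmetic for the Xenos figures quoted above.
export_bus_gbps = 32            # GPU -> EDRAM daughter die payload, GB/s
internal_gbps   = 256           # ROP <-> EDRAM bandwidth inside the daughter die, GB/s
amplification   = internal_gbps / export_bus_gbps
print(amplification)            # 8.0 -> the worst-case blow-up with no compression

# With even a modest average compression ratio on color/depth traffic, the
# internal bandwidth needed to keep up with the same export payload shrinks.
assumed_compression_ratio = 2.0  # illustrative value, not a measured number
print(internal_gbps / assumed_compression_ratio)  # 128 GB/s would suffice
```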
Not using all channels or not having all lanes marked valid could also be an opportunity for an external bus to compact or compress data, although it would make less sense as long as the export bus is on-die.
Cop-out time: I'm assuming that hierarchical-Z will disappear in Navi, in favour of DSBR. Depth at fragment precision compresses really well, so it's practical to fetch a tile of depth in the bin-preparation phase.
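A minimal sketch of why fragment-precision depth compresses so well, assuming simple plane-equation compression rather than whatever Vega/Navi actually implement: a tile covered by one triangle reconstructs from three coefficients, which is cheap to fetch during bin preparation:

```python
# Minimal sketch: a depth tile covered by a single triangle is exactly a plane,
# so three coefficients reconstruct every sample. Tile size and representation
# are illustrative, not the real hardware scheme.

TILE = 8   # 8x8 depth samples

def compress_planar_tile(depth):
    """Return (z00, dzdx, dzdy) if the tile is planar, else None (store raw)."""
    z00  = depth[0][0]
    dzdx = depth[0][1] - depth[0][0]
    dzdy = depth[1][0] - depth[0][0]
    for y in range(TILE):
        for x in range(TILE):
            if abs(depth[y][x] - (z00 + x * dzdx + y * dzdy)) > 1e-6:
                return None
    return (z00, dzdx, dzdy)     # 64 samples -> 3 values

tile = [[0.5 + 0.001 * x + 0.002 * y for x in range(TILE)] for y in range(TILE)]
print(compress_planar_tile(tile))  # ~(0.5, 0.001, 0.002): cheap to fetch per bin
```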
Something like the 32-pixel tiles mentioned in the Vega ISA for the shader engine coverage instruction?
I've long posited that L2 would be in the base logic die beside the ROPs in the "lightweight" PIM scenario...
Is there some other pool of L2 on the other side of the connection to the PIM, in order to maintain some amount of bandwidth amplification and coherence?
Initially, for a first iteration, I would expect something like 2x300mm2 on 7nm, with 2 stacks of HBM per chip. I think redundant silicon is overstated (initially); ultimately you need to build a desirable product. Being able to hit a performance tier others can't early in 7nm's life would be one of those desirable products.
That seems a touch large if we are to believe AMD's projection for cost/mm2 for 7nm on a 250mm2 die, assuming they aren't being overly pessimistic. If this is a Fiji-like arrangement (~600mm2 GPU, 4 stacks), it presumably has some additional improvements like a larger interposer or some other means of making the larger HBM2 stacks fit, or some kind of multi-exposure interposer like that used for GP100.
The bandwidth between chips would likely be proportional to the HBM interfaces, since it would favor a wide and slow connection, and the HBM stacks dominate the length of a die edge.
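Rough numbers, assuming HBM2-class stacks (widths and pin rates are illustrative): sizing the inter-chip link against the local stacks gives a target in the same ballpark as the HBM PHYs themselves:

```python
# Illustrative arithmetic only: if the chip-to-chip link is sized against the
# local HBM stacks, its bandwidth target follows directly from the stack specs.
stacks_per_chip = 2
bits_per_stack  = 1024           # HBM interface width per stack
gbps_per_pin    = 2.0            # HBM2-class signalling rate (illustrative)

hbm_gbs = stacks_per_chip * bits_per_stack * gbps_per_pin / 8
print(hbm_gbs)                   # 512.0 GB/s of local memory bandwidth

# A wide-and-slow external bus hitting the same target, e.g. at 2 Gb/s per wire:
link_width_bits = hbm_gbs * 8 / 2.0
print(link_width_bits)           # 2048 bits wide, comparable to the HBM PHYs
```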
None of this sounds particularly horrendous; if they are already incoherent, sharing them across 2 chiplets doesn't sound that bad. I assume that all coherent data is ultimately the responsibility of the L2.
Why the elements are incoherent and whether something else can use a given piece of data can determine how expensive they can be. A fair number of them assume they won't be shared, so tidying up the architecture would mean making sure the data stays private. Other cases, like the geometry setup process, have caches and queues that are global and visible, just with hardwired assumptions like the limits on shader engine counts, which we don't know how AMD intends to deal with in a scalable fashion.
The L2's coherence is based on a pretty simple assumption that a given entry can only exist in one specific L2 slice. Caching another chip's location would break that assumption, but addresses are usually interleaved between channels to get more even bandwidth utilization. One or the other would need adjustment.
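A simplified model of why those two assumptions collide, with made-up slice counts and a 256B interleave rather than Vega's real hash:

```python
# Simplified model of the assumption above: an address maps to exactly one L2
# slice, while channels are interleaved at a fine granularity for bandwidth.
# Slice/channel counts and the 256B interleave are illustrative.

NUM_CHIPS, SLICES_PER_CHIP, INTERLEAVE = 2, 16, 256

def home_slice(addr):
    """Which chip and slice 'own' a cache line, with blocks striped across chips."""
    block = addr // INTERLEAVE
    chip  = block % NUM_CHIPS                      # fine-grained striping across chips
    slice_id = (block // NUM_CHIPS) % SLICES_PER_CHIP
    return chip, slice_id

# Consecutive 256B blocks alternate between chips, so a chiplet working through a
# contiguous surface constantly needs lines whose home slice is on the other die.
for addr in range(0, 1024, 256):
    print(hex(addr), home_slice(addr))
```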
There are two things to care about, right: the average and the peak average. I assume the 2-3x for the internal structure is because it all runs on a lockstep clock and has to handle that peak average.
The L2's bandwidth is such because it is supposed to provide bandwidth amplification over what the memory bus can provide, and the many L1s are write-through and very small. The L1s are there to at most compensate for L2 latency and coalesce the most recent vector loads, rather than avoid consuming L1/L2 bandwidth.
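To put rough numbers on that amplification (all of these are illustrative, not official Vega specs):

```python
# Rough illustration of "bandwidth amplification": aggregate L2 slice bandwidth
# versus DRAM bandwidth. All numbers are illustrative, not official specs.
l2_slices           = 16
bytes_per_slice_clk = 64
core_clock_ghz      = 1.5

l2_gbs  = l2_slices * bytes_per_slice_clk * core_clock_ghz   # ~1536 GB/s on-chip
hbm_gbs = 484                                                 # Vega 64-class HBM2
print(l2_gbs / hbm_gbs)   # ~3.2x: roughly the 2-3x internal headroom discussed above
```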
Vega now has the L2 servicing the ROPs as well, which is a new client with an unclear impact at this point.
When going across chips there is going to have to be some form of boundary between the PHY/its buffers and the internal crossbar. At that point you could decouple and have a dynamic external frequency (I believe EPYC does this already); then that bandwidth becomes less of an issue and it's more about power consumption when it hits that peak bandwidth for sustained amounts of time.
The fabric is pretty consistent about memory and link bandwidth being equivalent with nice ratios of clock and link width. xGMI and its more stringent error margins are the outlier.
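The Zen-style ratios, quoted from memory so treat them as approximate: 32B per fabric clock against dual-channel DDR4 works out to the same figure by construction:

```python
# The "nice ratios" in practice (Zen-style numbers, approximate): the fabric moves
# 32B per MEMCLK, which lines up with what dual-channel DDR4 can deliver.
ddr4_mts   = 2666
memclk_mhz = ddr4_mts / 2                    # fabric clock tied to half the transfer rate
channels   = 2

ddr_gbs    = channels * 8 * ddr4_mts / 1000  # 64-bit channels -> ~42.7 GB/s
fabric_gbs = 32 * memclk_mhz / 1000          # 32B/cycle link  -> ~42.7 GB/s
print(ddr_gbs, fabric_gbs)                   # the two match by construction
```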
Lower activity might scale down some elements, but GPU loads are miss-prone and the caches themselves are not as good about reducing misses.
To me, going for 2 chiplets seems significantly easier than >2, and if this is the path AMD is going down, that should be exploited until it is no longer an advantage.
A specific solution that slots into a Fiji-like package would at least be somewhat plausible as a product. Just by opting for an interposer, it rules anything greater out and leaves the one-chip solution saddled with an interposer. AMD's estimates might point to this being somewhat Vega-like in cost for a single-chip implementation. It's not EPYC-like at this point, which I think we agree upon.
I've commented on this before, but AMD uses "chiplet" for a specific form of chip that cannot function on its own without a specialized interposer and secondary silicon either mounted alongside or operating in the interposer itself. It's even less like EPYC and even more far-out in terms of whether it is feasible for some time. The more alternatives to interposers gain success, the more there's a chance AMD's speculating down a wasted effort.