6. Since there is already ordered sync at the CU group level, perhaps a workgroup (which needs only unordered sync) could also be allowed to span CUs within the same group. That would require moving the LDSes a bit further from the CUs, though, so the path would not be as tightly pipelined as it is now.
Currently they have LDS and GDS, which are roughly equivalent hardware, but with GDS having additional sync capabilities. I don't see the LDS moving away as a problem. I've been envisioning them more as the memory controllers, part of the export pipeline when joined with other CUs, and synchronization mechanisms, with a significant amount of their workload moved to that proposed scalar. With a full crossbar it would be reasonably efficient at repacking data and handling some of the interpolation tasks. By my thinking it should work well for de/compression as well, without blocking SIMD execution.
Along the lines of the current scalar unit being more flexible, it could even do some of the cache work if a thread was designed around alternating scalar and vector pathways: SIMDs for the heavy lifting, scalar doing control flow and data sharing. That would be an extension of the current capabilities with more robust hardware and SGPRs moved into VGPRs, along the lines of pushing a workgroup onto a SIMD. Certainly not required, but an option to explore. It avoids register limitations, but some vector space is likely wasted.
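To make that division of labor concrete, here is a minimal, purely illustrative C++ sketch of the alternating pattern: a scalar path steering control flow and broadcasting shared operands, and a 64-wide vector path doing the arithmetic. None of this models real GCN hardware; the wave width and register names are assumptions.

```cpp
#include <array>
#include <cstdio>

// Hypothetical sketch of the alternating scalar/vector pattern: the
// "scalar" side owns control flow and shared data, the "vector" side
// does uniform per-lane work. Names are illustrative only.
constexpr int kWave = 64;

int main() {
    std::array<float, kWave> vgpr{};        // per-lane registers
    float shared = 1.0f;                    // scalar-held shared value

    for (int pass = 0; pass < 4; ++pass) {  // scalar: control flow
        // scalar: decide what the vector pass should do, broadcast it
        float scale = shared * (pass + 1);

        // vector: all lanes execute the same op on the broadcast operand
        for (int lane = 0; lane < kWave; ++lane)
            vgpr[lane] += scale * lane;

        // scalar: reduce a result back out for the next iteration
        shared = vgpr[0] + vgpr[kWave - 1];
    }
    std::printf("lane 63 = %f\n", vgpr[63]);
}
```

The point is only the alternation: the scalar side never touches per-lane data except to reduce or share it.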
So in the end, say with a 64 CU Greenland, it may have 16 CU groups, each of which has 4 CUs, 4 Color ROPs, 16 Z/Stencil ROPs, 128KB of L2 cache and 256KB of shared LDS.
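Multiplying out those per-group figures gives a quick sanity check of the chip-level totals such a layout would imply (the per-group numbers are the speculation above; nothing else is assumed):

```cpp
#include <cstdio>

// Quick arithmetic check of the hypothetical 64 CU organization above.
int main() {
    const int groups = 16, cusPerGroup = 4;
    const int colorRops = 4, zRops = 16;        // per group
    const int l2KB = 128, ldsKB = 256;          // per group
    std::printf("CUs:            %d\n", groups * cusPerGroup);  // 64
    std::printf("Color ROPs:     %d\n", groups * colorRops);    // 64
    std::printf("Z/Stencil ROPs: %d\n", groups * zRops);        // 256
    std::printf("L2 total:       %d KB\n", groups * l2KB);      // 2048 KB
    std::printf("LDS total:      %d KB\n", groups * ldsKB);     // 4096 KB
}
```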
In theory it also has 16 memory channels, which might line up well with that 16 CU group organization. It could be a baseline for extrapolating the organization, as only the HBM designs had that many channels.
Just to point out for future clarification, coherence wouldn't change the problem that OIT solutions seek to solve: wavefronts reaching their export phase for transparent data at varying times based on the dynamic behavior of the CUs involved, leading to inconsistent final output. Having multiple threads spread across different cores in a coherent multicore CPU would have similar problems in the absence of any additional synchronization.
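As an analogy to the multicore CPU case, here is a small C++ example of why the lack of ordering matters: "over" blending is not commutative, so the same two transparent fragments produce different pixels depending on which wavefront happens to export first. The colors are made up for illustration.

```cpp
#include <cstdio>

// Why unordered transparent exports are inconsistent: src-over blending
// gives different results for different arrival orders.
struct Rgba { float r, g, b, a; };

Rgba over(Rgba src, Rgba dst) {   // standard premultiplied src-over
    float ia = 1.0f - src.a;
    return { src.r + dst.r * ia, src.g + dst.g * ia,
             src.b + dst.b * ia, src.a + dst.a * ia };
}

int main() {
    Rgba red  = {0.5f, 0.0f, 0.0f, 0.5f};
    Rgba blue = {0.0f, 0.0f, 0.3f, 0.3f};
    Rgba ab = over(red, over(blue, {0, 0, 0, 0}));  // blue exports first
    Rgba ba = over(blue, over(red, {0, 0, 0, 0}));  // red exports first
    std::printf("blue-then-red: %.3f %.3f %.3f\n", ab.r, ab.g, ab.b);
    std::printf("red-then-blue: %.3f %.3f %.3f\n", ba.r, ba.g, ba.b);
}
```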
Raster-order views inject a stall if it is determined that a pixel in a ROP tile is currently being worked on by a primitive earlier in submission order.
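A minimal scoreboard sketch of that stall, assuming a hypothetical per-pixel set of in-flight primitive IDs; this is a toy model of the rule, not how the hardware actually tracks it:

```cpp
#include <map>
#include <set>
#include <cstdio>

// Toy ROV scoreboard: a fragment may proceed only once no primitive
// earlier in submission order is still in flight on the same pixel.
struct RovScoreboard {
    std::map<int, std::set<int>> inflight;  // pixel -> live primitive IDs

    bool canProceed(int pixel, int primId) const {
        auto it = inflight.find(pixel);
        if (it == inflight.end() || it->second.empty()) return true;
        // stall if an earlier primitive still owns this pixel
        return *it->second.begin() >= primId;
    }
    void begin(int pixel, int primId)  { inflight[pixel].insert(primId); }
    void retire(int pixel, int primId) { inflight[pixel].erase(primId); }
};

int main() {
    RovScoreboard sb;
    sb.begin(/*pixel=*/42, /*primId=*/7);
    std::printf("prim 9 may start: %d\n", sb.canProceed(42, 9)); // 0: stall
    sb.retire(42, 7);
    std::printf("prim 9 may start: %d\n", sb.canProceed(42, 9)); // 1: go
}
```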
Not directly, but I'm imagining a significant rework of the export pipeline along with ROP and LDS/GDS behavior with the mesh. At least with Navi, although I guess Vega could have it. OIT would be relevant when all the transparent samples got read back in and processed. This would be up to the programmer, if they desired unlimited samples or some sort of compression like PixelSync. My theory would likely break submission order in favor of binning and synchronization. ROVs with opaque samples, while not ideal from a performance standpoint, wouldn't necessarily care about submission order. It should address certain scaling concerns, although you may need to be careful reading back samples. Tiled rasterization, where all the geometry is known along with any unknown samples, could help here.
I'm having some difficulty parsing this, and what parts are different or the same as before.
The programmable CUs are the first stage already, if the definition of the second stage for the ROPs is basic blending and data manipulation.
Having the working cache compressed for the second stage would be different, but it's a change that would likely make parallel work significantly more painful to achieve.
I don't follow how QR is supposed to be conditional on the CUs doing something strange. It's either being used or it isn't. It's a little late to use it if it wasn't used and we then realize something strange happened. That sounds like some kind of very optimistic speculation with an intractable fallback case.
I'm still working this idea out myself. My definition of the second stage was more along the lines of aggregating results at a macro level: one ROP per memory channel arbitrating exports from each CU or cluster. It might be more than just a ROP, something along the lines of GDS/LDS, with the heavy lifting having occurred in the first stage. As I mentioned before, I was expecting some of these ideas to be related to Navi and the "scalability" on the roadmap.
The QR wouldn't be conditional, but an option for efficiently coalescing results within multiple CUs prior to export, possibly part of the solution for that first stage: partitioning cache resources, possibly in different locations, to a task allocated to a group of CUs. It would make more sense for a multi-chip solution, and with the paper presented it would seemingly be a better match to that timeframe. It's definitely optimistic speculation, as I wasn't expecting some of this to occur with Vega. It doesn't seem out of the realm of possibility though; you just need to rethink a lot of conventional wisdom on past designs.
As noted in that paper, ~2048 bits of interface per core was prohibitive in terms of die area lost to pad space. (The paper's projections are actually optimistic relative to the bump pitch we have currently.)
In that regard, could you clarify what you mean by having most of the interface disabled? These are physical objects with non-zero area, and that area doesn't become available just because they are turned off.
The physical implementation of the network spans both the multi-core processor die as well as the silicon interposer, with shorter core-to-core links routed across the multi-core processor die and the longer-distance indirect network links routed across the interposer. Selective concentration is employed to limit the area overheads of vertical connections (i.e., micro-bumps) between the multi-core processor and interposer layers.
Not all of those 2048 bits need pads, though. The vast majority would likely exist within the die, as in most FPGAs, since that is where most nodes on the mesh should reside, with selective grouping reducing the number further. The portion through the interposer would be reserved for multi-chip solutions, including the Zen APU. The interconnect I'm envisioning would be a full mesh, say 17 nodes all linked together through variably sized channels, with all global channels totaling 2048 bits (probably another channel or two for IO and communication). Local channels would likely increase that number within a fixed organization. For a point-to-point route, all but one link within a channel would be disabled; all links would be enabled if broadcasting or requiring a shared bus for whatever reason. Four nodes, for example, might share a channel for some purpose. The paper alluded to this as "indirect links on the interposer".
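For scale, the arithmetic of that 17-node full mesh, assuming the 2048 global bits split evenly across a node's 16 channels (the post allows variably sized channels, so the even split is just an assumption):

```cpp
#include <cstdio>

// Rough arithmetic for the hypothetical 17-node full mesh above.
int main() {
    const int nodes = 17;
    const int peers = nodes - 1;              // 16 channels per node
    const int links = nodes * peers / 2;      // unique pairwise links
    const int globalBits = 2048;              // per node, from the post
    std::printf("links in mesh:    %d\n", links);              // 136
    std::printf("bits per channel: %d\n", globalBits / peers); // 128
}
```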
Export is given to whichever wavefront is able to negotiate an export buffer over the export bus at the end of a thread group's life span. ROV has a more global view, which is why PixelSync stalls a group until earlier ones are able to finish their export. Without it, it's a race like any other concurrent access situation. At least with opaque geometry the constraint is what is closest, rather than the order of what was overdrawn/culled.
Two stage arbitration. LDS for local export and arbitration within a CU cluster, GDS (one per memory channel) arbitrating over global memory and IO resources from each cluster. PixelSync would only have to stall if the buffer was filled and blending/compression required. It would simply append the result, possibly opaque results, until read back and optimized. The ID buffer may stem from that as well if using it to bin geometry.
LDS should also be closely tied to a GDS unit and hopefully prioritize traffic to memory in order to spill cache. That would allow the LDS unit to reasonably efficiently coalesce results that won't fit into L2 before writing them out through a GDS that may have to arbitrate across a large number of channels and stalls.
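A toy model of that two-stage arbitration, with an LDS-level export queue per CU cluster feeding a round-robin GDS arbiter per memory channel; the structure here is entirely hypothetical:

```cpp
#include <cstdio>
#include <queue>
#include <vector>

// Toy two-stage export arbitration: stage 1 queues exports per cluster
// (the LDS role), stage 2 round-robins across clusters (the GDS role).
struct Export { int cu, addr; };

int main() {
    // Stage 1: each cluster's LDS coalesces exports from its CUs.
    std::vector<std::queue<Export>> cluster(4);
    cluster[0].push({0, 0x100});
    cluster[2].push({9, 0x140});

    // Stage 2: a per-channel GDS arbiter grants one export per cycle.
    for (int cycle = 0, next = 0; cycle < 8; ++cycle) {
        for (int i = 0; i < 4; ++i) {
            int c = (next + i) % 4;
            if (!cluster[c].empty()) {
                Export e = cluster[c].front(); cluster[c].pop();
                std::printf("cycle %d: cluster %d CU %d -> 0x%x\n",
                            cycle, c, e.cu, e.addr);
                next = c + 1;   // rotate priority for fairness
                break;
            }
        }
    }
}
```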
I'm assuming you mean with the proposed new scheme?
The current scheme is that no such reordering is done and the results can be unstable and inconsistent.
With the ID buffer added to the PS4 Pro and "future AMD architectures", some reordering seems likely. Tiling or binning of geometry makes a lot of sense, as demonstrated with Nvidia's solution. With the ROV capabilities described above and a theorized scalar that could efficiently sort the results, it makes even more sense. Scaling should be easier if order is less of a concern.
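A sketch of what that read-back-and-sort resolve could look like in the abstract: samples appended per pixel in arbitrary export order, then sorted by depth once at read-back, so the final blend no longer depends on submission order. This is an assumption about the mechanism, not a description of real hardware.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Order-independent resolve: sort unordered samples by depth, then blend.
struct Sample { float depth, color, alpha; };

int main() {
    // Samples for one pixel, appended in arbitrary (export) order.
    std::vector<Sample> px = {{0.8f, 0.2f, 0.5f},
                              {0.3f, 0.9f, 0.4f},
                              {0.5f, 0.6f, 0.6f}};
    // Sort back-to-front once all samples are in.
    std::sort(px.begin(), px.end(),
              [](const Sample& a, const Sample& b) { return a.depth > b.depth; });
    float out = 0.0f;                        // start from background
    for (const Sample& s : px)               // classic over blend
        out = s.color * s.alpha + out * (1.0f - s.alpha);
    std::printf("resolved pixel: %.3f\n", out);
}
```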