It's the programmable blending and OIT I was looking to address. At least the OIT should be coming with Vega, and ideally the blending, as they're similar. In theory that's somehow coherent, along with the interconnect.
Just to point out for future clarification, coherence wouldn't change the problem that OIT solutions seek to solve: wavefronts reaching their export phase for transparent data at varying times based on the dynamic behavior of the CUs involved, leading to inconsistent final output. Multiple threads spread across different cores in a coherent multicore CPU would have similar problems in the absence of any additional synchronization.
Raster-order views inject a stall if it is determined that a pixel in a ROP tile is currently being worked on by a primitive earlier in submission order.
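To make that concrete, here's a toy CUDA sketch of both halves: a non-commutative blend (why unordered exports race) and a software stand-in for the ROV-style stall. Everything in it (Fragment, next_seq, the precomputed per-pixel sequence number) is invented for illustration; it is not how the hardware does it.

struct Fragment { int x, y, seq; float4 color; };  // seq: this fragment's rank
                                                   // among fragments hitting its
                                                   // pixel, in submission order
                                                   // (assume a prior pass set it)

__device__ float4 blend_over(float4 src, float4 dst) {
    // Premultiplied-alpha "over": non-commutative, so the arrival order of
    // transparent fragments changes the result -- the race described above.
    return make_float4(src.x + dst.x * (1.0f - src.w),
                       src.y + dst.y * (1.0f - src.w),
                       src.z + dst.z * (1.0f - src.w),
                       src.w + dst.w * (1.0f - src.w));
}

__global__ void ordered_export(const Fragment* frags, int n,
                               float4* color, int* next_seq, int width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Fragment f = frags[i];
    int p = f.y * width + f.x;
    // Software analogue of the ROV/PixelSync stall: spin until every fragment
    // earlier in submission order has exported to this pixel. next_seq[] is
    // zero-initialized; assumes Volta-style independent thread scheduling so
    // the spin can't deadlock inside a warp.
    while (atomicAdd(&next_seq[p], 0) != f.seq) { }
    __threadfence();                            // acquire earlier writes
    color[p] = blend_over(f.color, color[p]);
    __threadfence();                            // publish the blend first
    atomicExch(&next_seq[p], f.seq + 1);        // release the next fragment
}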
My thinking was a two-stage design with a programmable first stage in the CU(s), as opposed to a programmable ROP elsewhere, and a second stage of basic blending or data-manipulation logic. Possibly actual data compression and indexing to conserve space, as opposed to just bandwidth. The QR would be for cases where programmable units in different CUs did something strange: computing a tile-wide value, perhaps, once the CUs simplified their results. Perhaps related to DCC or a compression scheme. It would ensure another CU didn't make any changes that would affect the outcome if several were working together. That would be at a rather coarse granularity.
I'm having some difficulty parsing this, and working out which parts are different from before and which are the same.
The programmable CUs are the first stage already, if the definition of the second stage for the ROPs is basic blending and data manipulation.
Having the working cache compressed for the second stage would be different, but it's a change that would likely make parallel work significantly more painful to achieve.
I don't follow how QR is supposed to be conditional on the CUs doing something strange. It's either being used or it isn't, and it's a little late to start using it after we realize something strange has already happened. That sounds like a very optimistic form of speculation with an intractable fallback case.
http://www.eecg.toronto.edu/~enright/micro14-interposer.pdf
Current thinking was a 2048b(?) interface on each CU or SE. Part of that would be tied directly to memory, another part remapped through an internal crossbar to a configurable network topology to create the mesh. The coherence would be a protocol over whatever topology was configured. Most of the interface would be disabled, since the matching lanes would be in use by other nodes, and enabling all of it would require a really large crossbar and the ability to consume that much data in each node.
As noted in that paper, ~2048 bits of interface per core was prohibitive in terms of die area lost to pad space. (The paper's bump-pitch projections are actually optimistic relative to what we have currently.)
In that regard, could you clarify what you mean by having most of the interface disabled? These are physical objects with non-zero area, and that area doesn't become available again if they are turned off.
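To put rough numbers on it, here's the lane-budget arithmetic as a trivial sketch. The struct and its fields are invented for illustration, with the 2048b figure taken from your post; the point is that disabled_lanes() counts pads that still occupy die area.

struct NodePhyConfig {
    int total_lanes;     // e.g. 2048 physical bits at the pad ring
    int memory_lanes;    // hardwired straight to a DRAM channel
    int lanes_per_link;  // lanes the crossbar steers to one mesh neighbor
    int active_links;    // neighbors actually wired up in this topology
};

int disabled_lanes(const NodePhyConfig& c) {
    // Whatever isn't mapped to memory or to an enabled mesh link sits dark,
    // but the pads and PHY for those lanes were still paid for in area.
    return c.total_lanes - c.memory_lanes - c.lanes_per_link * c.active_links;
}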
A few wild ideas:
1. Let's assume the rasterisers already own large interleaved partitions of screen space in Z order.
Currently, 2D screen space is partitioned into rectangles and each rasterizer gets a subset, which helps match up with the way the address space is striped to allow optimal utilization of the DRAM channels. A rasterizer can handle one primitive per clock and rasterizes 16 pixels per clock, so it takes 16 quads (64 pixels) to fill a 64-lane wavefront.
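For reference, the Z-order interleave in point 1 could look like the sketch below; the tile coordinates, rasterizer count, and modulo mapping are all hypothetical, not GCN's actual scheme.

__host__ __device__ unsigned part1by1(unsigned v) {
    // Spread the low 16 bits apart: 0b...dcba -> 0b...0d0c0b0a.
    v &= 0x0000ffffu;
    v = (v | (v << 8)) & 0x00ff00ffu;
    v = (v | (v << 4)) & 0x0f0f0f0fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

__host__ __device__ unsigned morton2d(unsigned tx, unsigned ty) {
    return part1by1(tx) | (part1by1(ty) << 1);  // Z-order code of tile (tx, ty)
}

__host__ __device__ int owning_rasterizer(unsigned tx, unsigned ty, int nrast) {
    // Low Morton bits interleave ownership finely across screen space --
    // with nrast = 4 this is a 2x2 checkerboard of tiles -- roughly
    // analogous to how the address space is striped across DRAM channels.
    return morton2d(tx, ty) % nrast;
}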
2. A CU group (let's say four of them) can further own a fixed subpartition in the logical space of their parent rasteriser.
Is this static ownership, or some kind of dynamically balanced ownership? There is a decent amount of static partitioning based on a shader engine having a rasterizer with a static mapping, but the programmable resources are more flexible within that subset to handle variability in demand.
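If the ownership were static, the fixed subpartition in point 2 could just be the next bits of the same Morton code (reusing morton2d from the sketch above; again purely illustrative, with four CU groups per rasterizer assumed):

__host__ __device__ int owning_cu_group(unsigned tx, unsigned ty, int nrast) {
    // After the low bits pick the rasterizer, the next bits statically pick
    // one of its four CU groups. A dynamically balanced scheme would replace
    // this fixed hash with a load-aware assignment.
    return (morton2d(tx, ty) / nrast) % 4;
}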
4. Exporting to the ROP, AFAIU, has the same API (re)ordering requirement as ROV, just that you can only write data out with a predefined atomic operation.
Export is granted to whichever wavefront is able to negotiate an export buffer over the export bus at the end of its thread group's life span. ROV has a more global view, hence why PixelSync stalls a group until earlier ones are able to finish their export. Without it, it's a race like any other concurrent-access situation. At least with opaque geometry the constraint is what is closest, rather than the order in which things were overdrawn/culled.
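For comparison, the usual software workaround for that race is to make the capture order-free and recover ordering in a separate resolve pass, e.g. per-pixel linked lists. A hedged CUDA sketch, with OitFrag/OitNode and the pool layout made up for illustration:

struct OitFrag { int x, y; float depth; float4 color; };
struct OitNode { float4 color; float depth; int next; };

__global__ void oit_capture(const OitFrag* frags, int n,
                            OitNode* pool, int* pool_count,
                            int* head, int width) {
    // head[] is pre-initialized to -1 (empty list) for every pixel.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    OitFrag f = frags[i];
    int slot = atomicAdd(pool_count, 1);   // grab a node in arbitrary order
    pool[slot].color = f.color;
    pool[slot].depth = f.depth;
    // Push onto the pixel's list. The capture order is a race, but a benign
    // one: a later resolve kernel sorts each list by depth before blending,
    // which is how this scheme recovers ordering without a PixelSync stall.
    pool[slot].next = atomicExch(&head[f.y * width + f.x], slot);
}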