Anarchist4000
Veteran
Best guess is foveated rendering and a larger implicit wave size. Adapting the scheduling to 1x/2x/4x wave size shouldn't be all that difficult and would reduce pressure on instruction buffers/caches and the scheduler.
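To make the wave-size idea concrete, here's a toy sketch of how a scheduler might pick between 1x/2x/4x implicit wave sizes. Everything here (function names, the divergence heuristic, the thresholds) is my own illustration, not anything from AMD's actual scheduler:

```python
# Toy model of 1x/2x/4x implicit wave-size selection. All names and
# thresholds are assumptions for illustration, not real hardware logic.

def pick_wave_size(divergence, base=32):
    """Pick a wider implicit wave when lanes are coherent.

    divergence: estimated fraction of branches where lanes disagree (0..1).
    Wider waves amortize one fetched instruction over more lanes, which is
    where the reduced pressure on instruction buffers/caches comes from.
    """
    if divergence < 0.05:
        return base * 4      # highly coherent: one fetch feeds 4x the lanes
    if divergence < 0.25:
        return base * 2
    return base              # divergent: narrow waves limit masked lanes

def fetches_needed(threads, wave_size):
    # One instruction stream per wave: fewer, wider waves -> fewer fetches
    # competing for the scheduler and instruction caches.
    return (threads + wave_size - 1) // wave_size
```

With 1024 threads, 4x waves cut the number of instruction streams from 32 to 8, which is the buffer/scheduler pressure argument in miniature.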
Also possible it implies some form of wave packing and MIMD behavior. Four(?) sequencers shared by all lanes, along with any scalars (lanes with more robust fetching, possibly inclusive). Technically a 64-lane SIMD and 3x scalars could execute simultaneously.
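A toy model of that wave-packing/MIMD idea: a small pool of sequencers (program counters) shared by all lanes, with lane groups bound to different sequencers executing different instruction streams in the same cycle. Purely a sketch under my own assumptions, not a claim about any real design:

```python
# Toy "wave packing" sketch: a few shared sequencers drive lane subsets,
# giving MIMD behavior across one physical SIMD. Illustrative only.

def step(sequencers, lane_to_seq, programs):
    """Advance each sequencer one instruction; every lane executes the
    instruction of whichever sequencer it is currently bound to."""
    issued = {}
    for seq_id, pc in sequencers.items():
        if pc < len(programs[seq_id]):
            issued[seq_id] = programs[seq_id][pc]
            sequencers[seq_id] = pc + 1
    # Lanes mapped to different sequencers see different instructions
    # this cycle: MIMD across the SIMD.
    return [issued.get(lane_to_seq[lane]) for lane in range(len(lane_to_seq))]
```

With four sequencers and 64 lanes, packing several small waves onto one SIMD this way would trade fetch hardware (one fetch per sequencer) for better lane occupancy on divergent work.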
There is probably some corner case involving successive overlapping geometry where OIT isn't sufficient or edge detection is involved, but that seems remote. You'd need a shader somehow reliant on the prior triangle affecting the outcome within the draw call, where OIT was insufficient or overly costly. Perhaps some sort of particle effect operating in screen space, or atomics? Even then you could probably composite the entire draw call into its own render target with HBCC and dynamic memory allocation, and then use TBDR to composite everything in order.
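The reason OIT covers most cases: if fragments are gathered per pixel and sorted by depth at resolve time, the result no longer depends on triangle submission order. A minimal per-pixel sketch (my own toy, assuming standard "over" blending):

```python
# Toy per-pixel order-independent transparency: collect fragments in any
# submission order, sort by depth at resolve. Illustrative sketch only.

def over(dst, color, alpha):
    # Standard "over" blend: src on top of current destination.
    return color * alpha + dst * (1.0 - alpha)

def resolve_oit(fragments, background=0.0):
    # fragments: (depth, color, alpha) tuples in arbitrary submission order.
    dst = background
    for depth, color, alpha in sorted(fragments, reverse=True):  # back to front
        dst = over(dst, color, alpha)
    return dst
```

Two scrambled submission orders of the same fragments resolve identically, which is exactly what breaks down only in the corner case above: a shader whose outcome depends on the *prior triangle's* result rather than on depth.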
Remove those ordering guarantees and it becomes a whole lot simpler. TBDR or OIT would allow you to defer the ordering, at the possible expense of some culling opportunities and of overdraw storing unnecessary samples. That expense could probably be reduced to cases involving successive geometry, and a compaction process could limit it to cache bandwidth. Defer Z culling to an L2-backed ROP export: you lose some execution efficiency from unculled/masked lanes, but that shouldn't overly affect off-chip memory accesses in most cases.

This allows the GPUs to provide more resources to hopefully speed up the world-space portion of the process, with a dedicated portion for maintaining ordering guarantees, broadcasting status and outputs, and making accelerated culling decisions about whether their local GPU will be handling a set of inputs or not. While there is a work distributor of sorts mentioned in recent AMD GPUs, the last part concerning culling seems to take over part of the culling duties of primitive shaders (possibly part of the first scenario in the patent, and perhaps of primitive shaders as we know them) and place the decision making in this dedicated logic stage.
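The deferred-Z idea above can be sketched as a tiny TBDR-style model: bin fragments per screen tile, then resolve each tile in one pass that keeps only the nearest fragment per pixel, so overdrawn samples cost storage/bandwidth rather than shading and export order. A toy under my own assumptions, not any real pipeline:

```python
# Toy TBDR-style deferral: bin fragments per tile, defer the depth test
# to a single resolve pass per tile. Illustrative sketch only.

from collections import defaultdict

def bin_fragments(fragments, tile=8):
    # fragments: (x, y, depth, color) in arbitrary submission order.
    bins = defaultdict(list)
    for x, y, depth, color in fragments:
        bins[(x // tile, y // tile)].append((x, y, depth, color))
    return bins

def resolve_tile(frags):
    # Deferred Z-cull: one pass picks the closest fragment per pixel.
    # Unculled samples sat in the bin (cache bandwidth), but only the
    # survivor is exported.
    nearest = {}
    for x, y, depth, color in frags:
        if (x, y) not in nearest or depth < nearest[(x, y)][0]:
            nearest[(x, y)] = (depth, color)
    return nearest
```

Because the resolve is a pure min over depth, it doesn't care what order fragments arrived in, which is exactly why dropping the API ordering guarantee makes the distribution problem so much simpler.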