This patent seems interesting. Any conclusions?
A quick layman's skim:
Two 'GPUs' with good old 16-lane SIMD and 4 'geometry engines', etc. Work could easily be split and done in parallel after the GS stage (which seems intuitive). Messing with vertices is harder, and with tessellation included it gets super-hard (and slow?)...
From my initial read, the patent appears to outline two scenarios for multiple GPUs working together to render to a screen space that is subdivided into zones, with each GPU responsible for one.
Both scenarios involve a generally identical stream of commands being fed to all the GPUs.
The first and most straightforward scenario uses language similar to other patents and disclosures that divide AMD's graphics pipeline into world-space work (input, transform, tessellation, etc.) and screen-space (pixel) work. Every GPU duplicates the world-space work, then passes along to the screen-space portion of its pipeline only the fragments that overlap the area of the screen it is responsible for. This sounds most like the culling work done by primitive shaders, which perform the same sort of coverage determination for the responsible shader engine and screen-space-tiled ROPs within a single GPU.
It is more straightforward and requires little cross-communication and synchronization, but it also does not leverage the extra hardware and resources very well before reaching the screen-space part of the process.
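The first scenario can be sketched in a few lines. This is only an illustration of the idea, not the patent's actual logic: the zone split, the `Zone` type, and the bounding-box overlap test are all my assumptions.

```python
# Hedged sketch of scenario one: every GPU duplicates all world-space
# work, then keeps only the primitives overlapping its screen zone.
from dataclasses import dataclass

@dataclass
class Zone:
    # Screen-space rectangle a GPU is responsible for (pixel coords,
    # half-open on the right/bottom edges).
    x0: int; y0: int; x1: int; y1: int

    def overlaps(self, bbox):
        bx0, by0, bx1, by1 = bbox
        return bx0 < self.x1 and bx1 > self.x0 and by0 < self.y1 and by1 > self.y0

def screen_space_pass(gpu_zone, primitives):
    """Called by each GPU after the (duplicated) world-space work.

    `primitives` is the identical post-transform list every GPU computed;
    only those touching this GPU's zone continue to rasterization."""
    return [p for p in primitives if gpu_zone.overlaps(p["bbox"])]

# Two GPUs splitting a 1920x1080 screen left/right:
zones = [Zone(0, 0, 960, 1080), Zone(960, 0, 1920, 1080)]
prims = [{"id": 0, "bbox": (100, 100, 200, 200)},
         {"id": 1, "bbox": (900, 500, 1000, 600)},   # straddles the split
         {"id": 2, "bbox": (1500, 300, 1600, 400)}]
kept_per_gpu = [[p["id"] for p in screen_space_pass(z, prims)] for z in zones]
# GPU 0 keeps prims 0 and 1; GPU 1 keeps prims 1 and 2.
```

Note that primitive 1 is processed by both GPUs in screen space, which is the price of the simple split, on top of all the duplicated world-space work.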
The second scenario seeks to distribute world-space work among the different front ends with a dedicated work distribution facility. Each GPU still receives mostly the same commands, but the input assembly, setup, and geometry portions are farmed out in chunks for each GPU to take on individually. The patent puts forward round-robin distribution of sections of the index buffers and groups of setup primitives as an example.
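The round-robin idea from the patent's example can be sketched deterministically: because every GPU sees the same index buffer and applies the same rule, each can pick out its own chunks without any negotiation. The function name, chunk size, and parameters here are illustrative assumptions, not the patent's.

```python
def round_robin_chunks(index_buffer, num_gpus, chunk_size, my_gpu):
    """Hypothetical sketch: deterministic round-robin over sections of
    an index buffer. Every GPU runs this same code on the same input,
    so GPU k independently selects every k-th chunk with no messages."""
    my_chunks = []
    for chunk_id, start in enumerate(range(0, len(index_buffer), chunk_size)):
        if chunk_id % num_gpus == my_gpu:
            my_chunks.append(index_buffer[start:start + chunk_size])
    return my_chunks

# Two GPUs splitting a 24-index buffer into 6-index chunks:
indices = list(range(24))
gpu0_chunks = round_robin_chunks(indices, 2, 6, 0)  # chunks 0 and 2
gpu1_chunks = round_robin_chunks(indices, 2, 6, 1)  # chunks 1 and 3
```

The same pattern would apply to groups of setup primitives; only the unit of distribution changes.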
In the absence of tessellation, a work distributor sits between the primitive setup and geometry/vertex shader phases; with tessellation, work distributors sit on the input and output ends of the tessellation unit.
Each GPU's work distributor internally tracks the global API submission order with incrementing counters, maintains a series of FIFOs for each geometry engine and tessellation block, and exchanges messages with the other work distributors about the status and ordering of the work items each GPU is responsible for.
A work distributor has a counter and FIFOs for each of its local geometry engines and shader launch pipes, as well as a series of FIFOs corresponding to the equivalent hardware in the other GPUs. Each distributor runs through the same evaluation process and then compares the calculated selection tags against what is available locally.
The distributors accelerate the distributed-work process by semi-independently incrementing the ordering count (each GPU derives its count from effectively the same command stream) and applying the same load-balancing rules, letting each rapidly pass data to its local engines or discard elements that another GPU (independently making the same calculations) will cover. A smaller number of updates about completion status and the output of setup stages is broadcast from each GPU to all the others, so that they keep a consistent view of the ordering and of what is in progress. Some output data from the geometry engines is broadcast to the FIFOs in the other GPUs; in other cases, a stage that expands the amount of data, like tessellation, might have the work distributors pass only the relevant ordering number to a GPU, which then fires up the selected tessellation unit to read in the control points and feed the next surface/geometry shader locally.
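The counter-plus-FIFOs mechanism described above can be sketched as follows. This is a minimal toy model under my own assumptions: the round-robin "load-balancing rule" is a stand-in for whatever the patent actually specifies, and the class and field names are invented.

```python
# Hedged sketch of the per-GPU work distributor: every GPU runs this
# same logic over (effectively) the same command stream, so all GPUs
# independently compute identical (gpu, engine) targets per work item.
from collections import deque

class WorkDistributor:
    def __init__(self, gpu_id, num_gpus, engines_per_gpu):
        self.gpu_id = gpu_id
        self.num_gpus = num_gpus
        self.engines_per_gpu = engines_per_gpu
        self.order = 0  # global API-order counter, advanced in lockstep
        # One FIFO per geometry engine, local and remote, so ordering
        # across GPUs can be reconstructed when status is broadcast.
        self.fifos = {(g, e): deque()
                      for g in range(num_gpus)
                      for e in range(engines_per_gpu)}

    def submit(self, work_item):
        tag = self.order          # selection tag in API order
        self.order += 1
        # Shared deterministic rule (assumed round-robin here): the
        # same arithmetic on every GPU, so no negotiation is needed.
        slot = tag % (self.num_gpus * self.engines_per_gpu)
        gpu, engine = divmod(slot, self.engines_per_gpu)
        if gpu == self.gpu_id:
            # Local work: enqueue the payload for a local engine.
            self.fifos[(gpu, engine)].append((tag, work_item))
        else:
            # Remote work: discard the payload, track only the ordering
            # slot until the owning GPU broadcasts its status/output.
            self.fifos[(gpu, engine)].append((tag, None))
        return gpu, engine

# Two GPUs with two geometry engines each, fed the same stream:
stream = ["draw0", "draw1", "draw2", "draw3"]
d0 = WorkDistributor(0, 2, 2)
d1 = WorkDistributor(1, 2, 2)
targets0 = [d0.submit(w) for w in stream]
targets1 = [d1.submit(w) for w in stream]
# Both distributors independently agree on who runs what.
```

The point of the sketch is the lack of a handshake: each distributor decides by itself, and only lightweight status/output broadcasts (not modeled here) keep the shared view consistent.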
This allows the GPUs to bring more resources to bear on the world-space portion of the process, with dedicated logic for maintaining ordering guarantees, broadcasting status and outputs, and making accelerated culling decisions about whether the local GPU will be handling a given set of inputs. While a work distributor of sorts is mentioned in recent AMD GPUs, the last part concerning culling seems to take some of the culling duties of primitive shaders (which might be part of the first scenario in the patent, and perhaps of primitive shaders as we know them) and place that decision-making in this dedicated logic stage.