OREO (Opaque Random Export Order) sounds interesting: it essentially replaces the re-order buffer (ROB) with a smaller skid buffer, allowing work to be received and executed in any order before its results are exported to the next stage in order.
So I think OREO is required to support distributed vertex shading combined with coarse rasterisation.
My theory:
Vertices are distributed by a central scheduler, in groups of hardware threads, to any WGP that's available. A cut-down vertex shader, which exports only position, is run first, and the resulting triangles are then coarse-rasterised. Only after this has been done, and the screen-space tiles covered by a triangle have been identified, is the full vertex shader evaluated for each triangle's vertices (to generate all relevant attributes).
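To make the coarse-rasterisation step concrete, here's a minimal sketch that reduces tile coverage to the triangle's screen-space bounding box. Real hardware would test tiles against the triangle's edge equations, and the 32-pixel tile size is my assumption, not a known RDNA 3 parameter:

```python
# A toy model of coarse rasterisation, reduced to a bounding-box test.
TILE = 32  # hypothetical tile edge length in pixels

def coarse_rasterise(tri):
    """Return the set of (tx, ty) screen-space tiles a triangle may cover.

    tri is three (x, y) positions in pixel coordinates, i.e. the output
    of the cut-down, position-only vertex shader.
    """
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    x0, x1 = int(min(xs)) // TILE, int(max(xs)) // TILE
    y0, y1 = int(min(ys)) // TILE, int(max(ys)) // TILE
    return {(tx, ty) for tx in range(x0, x1 + 1)
                     for ty in range(y0, y1 + 1)}

# A triangle straddling a tile boundary is assigned to both tiles:
print(coarse_rasterise([(20, 8), (44, 8), (30, 24)]))  # {(0, 0), (1, 0)}
```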
To perform the full evaluation of the vertex shader, each triangle is sent to the shader engine that owns the screen-space tile(s) it touches. So the shader engine has to construct hardware threads for the vertices it receives and assign them to WGPs.
If a triangle touches screen-space tiles owned by more than one shader engine, then each of those shader engines will separately evaluate the full vertex shader for the triangle's vertices.
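Here's a toy model of that routing step which shows the duplication. The interleaved tile-to-engine mapping and the engine count of four are invented for illustration, not AMD's actual scheme:

```python
# Route full-vertex-shade requests to the shader engines owning each tile.
NUM_ENGINES = 4  # hypothetical shader-engine count

def owning_engine(tile):
    tx, ty = tile
    return (tx + ty) % NUM_ENGINES  # assumed checkerboard-style interleave

def distribute(tri_id, tiles, vertices):
    """Yield (engine, tri_id, vertex) full-vertex-shade requests."""
    for engine in {owning_engine(t) for t in tiles}:
        for v in vertices:
            yield (engine, tri_id, v)  # same vertex may go to several engines

work = list(distribute(7, {(0, 0), (1, 0)}, ["v0", "v1", "v2"]))
print(len(work))  # 6: two engines each shade all three vertices
```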
Once each shader engine has evaluated the full vertex shader, the triangles can finally be assembled and fine-grained rasterised.
As a result of the varying workloads of shader engines, fully-assembled triangles will be pixel shaded in an order that no longer corresponds with developer intent. This is because adjacent or overlapping triangles may originally have been position-only shaded by any shader engine, arriving at the final shader engine for pixel shading after a journey that takes an indeterminate amount of time relative to other relevant triangles.
I believe this is the problem OREO solves: it allows the GPU to pixel shade triangles in an arbitrary order while the result in the render target (and depth buffer) remains in agreement with developer intent.
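As a software analogy for what I imagine the skid buffer does, here's a minimal sketch: pixel-shaded triangles complete in arbitrary order, but exports drain strictly in submission order. The buffer structure and drain policy are illustrative guesses, not AMD's hardware design:

```python
import heapq

class SkidBuffer:
    def __init__(self):
        self.next_seq = 0   # next sequence number allowed to export
        self.pending = []   # min-heap of (seq, result) completed out of order

    def complete(self, seq, result):
        """A triangle finished pixel shading; seq is its submission order."""
        heapq.heappush(self.pending, (seq, result))
        # Export every result whose turn has come, strictly in order.
        while self.pending and self.pending[0][0] == self.next_seq:
            seq, result = heapq.heappop(self.pending)
            print(f"export #{seq}: {result}")
            self.next_seq += 1

buf = SkidBuffer()
for seq in [2, 0, 3, 1]:            # completion order is arbitrary...
    buf.complete(seq, f"tri{seq}")  # ...but exports print as #0, #1, #2, #3
```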
All of this rests upon "next gen geometry" ("primitive shaders"), which has been confirmed for RDNA 3: the DirectX/OpenGL vertex processing pipeline is no longer executed as the set of shaders separated by fixed-function hardware that we've known for decades.
Naturally, this makes tessellation and geometry shading more complex, as both of these techniques generate vertices as output from shaders. AMD has solved that problem.
In theory, distributed final vertex shading takes us back to the old problem of multi-GPU rendering (alternate line, split frame, or screen-space tiled rendering): the vertex shader has to be run by multiple shader engines for some vertices, so there is an overhead to distributed final vertex shading when triangles span screen-space tiles owned by different shader engines.
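To put a hypothetical number on that overhead (the fractions below are invented for illustration), the redundancy factor follows directly from how many engines each triangle touches:

```python
# Every extra shader engine a triangle touches re-runs the full vertex
# shader for all three of its vertices (ignoring vertex sharing here).
def vs_work_factor(engines_touched_per_tri):
    baseline = 3 * len(engines_touched_per_tri)           # one run per vertex
    actual = sum(3 * n for n in engines_touched_per_tri)  # 3 runs per engine
    return actual / baseline

# If 90% of triangles stay on one engine and 10% straddle two:
print(vs_work_factor([1] * 90 + [2] * 10))  # 1.1x full-vertex-shading work
```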
Once you've got a combination of:
- next gen geometry shading
- vertex-position-only shading
- coarse-grained rasterisation
- multiple shader engines, each aligned to an exclusive set of screen-space tiles
- final vertex shading
- fine-grained rasterisation
- opaque random export order
You then have, in my opinion, all the ingredients required to support a GPU that consists of multiple compute chiplets, each functioning as a shader engine, each aligned with a set of screen-space tiles.