The filing times are an interesting example of how variable the tea leaves are for gauging timing. The likely precursor patents for Vega's rasterizer are significantly older, for example. In the CPU realm, there are some recently published patents concerning store-to-load forwarding with a memfile that align with recent optimization guidance about keeping loads and stores close together in the code and having them share the same base and index registers, unmodified in between. This allows a form of prediction using just a subset of the offset address, giving the renamer a path to forward a store through the rename registers rather than through the load/store unit.
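To make that guidance concrete, here's a minimal C++ sketch (my own illustration, not anything from the patents) of the access pattern it favors; register allocation is ultimately up to the compiler, so this only shows the shape of the code, not a guaranteed outcome:

```cpp
struct Particle { float x, y, z, w; };

void integrate(Particle* p, float dt, float vx) {
    // Store to [p + offsetof(Particle, x)]...
    p->x = p->x + vx * dt;
    // ...followed closely by a load from the same base register and
    // displacement, with the base left untouched in between. That is the
    // pattern a memfile-style predictor can match on a subset of the address
    // bits and forward through the rename registers instead of waiting on
    // the load/store unit.
    p->y = p->x * 0.5f;
}
```

Recomputing the address through a different register chain, or modifying the base between the store and the load, would break that match and fall back to conventional store-to-load forwarding.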
The primitive shader patent does seem to align with the aspirational long-term goals of the Vega whitepaper. However, that leaves open whether this patent represents the long-term direction, the full use of the NGG path Vega carried alongside its standard path, a non-buggy version of it, or possibly a case where Vega's culling-only offering is simply how much they could fit into the gaps of the standard hardware.
Perhaps it's also a cautionary tale for those that get hyped about patents. Sometimes even nifty ones can turn out to be meh.
Long ramble ahead:
There are other hints that are intriguing, though some raise further questions.
There is a claim concerning an opcode for accelerating screen-space partition coverage checks; the Vega ISA has such an instruction specifically for four-shader-engine GPUs.
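As a rough illustration only (the tile size and engine mapping below are placeholders I made up, not Vega's actual layout), this is the kind of test such an opcode would accelerate: building a per-primitive mask of which shader engines' screen partitions its bounding box touches.

```cpp
#include <cstdint>

constexpr int kTileSize = 32;  // placeholder tile dimension, not the real one

// Placeholder checkerboard: engine index = (tileX + 2 * tileY) & 3.
// Assumes non-negative screen coordinates; returns a 4-bit mask where bit i
// means shader engine i's rasterizer needs to see the primitive.
uint32_t partitionMask4SE(int minX, int minY, int maxX, int maxY) {
    uint32_t mask = 0;
    for (int ty = minY / kTileSize; ty <= maxY / kTileSize && mask != 0xF; ++ty)
        for (int tx = minX / kTileSize; tx <= maxX / kTileSize && mask != 0xF; ++tx)
            mask |= 1u << ((tx + 2 * ty) & 3);
    return mask;
}
```

Doing that per primitive in the shader is cheap arithmetic, but it's exactly the sort of repeated bit-fiddling a dedicated instruction would shortcut.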
Vega's supposedly fully working path and the patent both still have the rasterizer as a final culling point, although there is a vast gap in capability between the two primitive shader concepts.
The whitepaper hints at an architectural ability to shift from the parameter cache to the L2, though perhaps in an inflexible way. That raises questions: does Vega literally have two paths in parallel, or are parts of them overlapping and repurposed, like the path that chooses between the two cache types? Could the failure to produce a working implementation of even the first, culling-only step, with the compiler-driven model AMD initially promised, stem from bugs, or from the difficulty of getting an effective implementation out of an architecture that may be awkwardly straddling both sides?
There's discussion in the patent about assigning ordering identifiers, and the ISA has a POPS (primitive ordered pixel shading) counter mode.
Perhaps one notable change that's hard to tease out externally is the fate of the crossbar between primitive setup and the rasterizers, which the patent claims imposes a practical ceiling exactly where GCN has maxed out its shader engine count.
Vega, if it implemented the patent, shouldn't need that crossbar, yet it is structured as though it has one. Could it be that the GPU keeps it anyway for backwards compatibility, or could Vega be emulating the old functionality with the new? There are hints of new culling methods in the Vega ISA that might align with the patent's compaction steps, and the partition opcode is part of the export process. If it were emulating the old way with the new, what would it mean that nobody has noticed?
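For reference, a generic cull-and-compact pass of the sort a culling-only primitive shader performs might look like the sketch below. This is my own host-side illustration of the technique (zero-area/backface and trivial off-screen rejection, then index compaction), not Vega's actual NGG code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };

// Signed area of a screen-space triangle: zero means degenerate,
// negative (under this winding convention) means back-facing.
static float signedArea(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

std::vector<uint32_t> cullAndCompact(const std::vector<Vec2>& screenPos,
                                     const std::vector<uint32_t>& indices,
                                     float viewW, float viewH) {
    std::vector<uint32_t> out;
    out.reserve(indices.size());
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        Vec2 a = screenPos[indices[i]];
        Vec2 b = screenPos[indices[i + 1]];
        Vec2 c = screenPos[indices[i + 2]];
        if (signedArea(a, b, c) <= 0.0f)
            continue;  // back-facing or zero-area
        float minX = std::min({a.x, b.x, c.x}), maxX = std::max({a.x, b.x, c.x});
        float minY = std::min({a.y, b.y, c.y}), maxY = std::max({a.y, b.y, c.y});
        if (maxX < 0.0f || maxY < 0.0f || minX > viewW || minY > viewH)
            continue;  // trivially off-screen
        // Compact: only surviving triangles get exported downstream.
        out.insert(out.end(), {indices[i], indices[i + 1], indices[i + 2]});
    }
    return out;
}
```

On a GPU the compaction step is the interesting part: lanes whose triangles survive have to be packed together before export, which is presumably where new culling/compaction-flavored instructions would help.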
It would seem fully embracing this would give "scalability" for those looking ahead to future GPUs.
The patent's local data store is an odd duck. At times it acts like the LDS for parts of the process, but there's some cross-unit or global usage that doesn't mesh with that. This again calls back to the blurb about opting to use the new, larger parameter cache or just going for stream-out. If it's really the memory hierarchy, I'd question in some ways whether it's fair to characterize the patent as having abandoned crossbars, since the L2 itself uses a crossbar, one that has been under load before and is getting more use now that the ROPs are clients too (while people are left wondering why the memory subsystem can't leverage its bandwidth as well as they thought it should).
The combined shader stage patent aligns with the GFX9 driver changes for just that type of merge. I'm hazy on whether those are even optional for GFX9.
The touted efficiencies (more compact allocation, fewer intra-stage barriers, and less tracking overhead) haven't shown up in comparisons with GFX8, though.
It makes the individual shaders more complex, and there's some driver intervention to insert code that handles a mismatch between the vertex wavefront count and the later geometry-shader wave count, putting excess geometry-shader waves to sleep while vertex output is copied from one combined-shader wave to another.
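As a toy model of that mismatch handling (my own illustration of the scheduling shape, not GFX9's actual prolog code), the merged stage launches enough waves for whichever half needs more, and the surplus waves simply have nothing to do until the later phase:

```cpp
#include <cstdio>

constexpr int kWaveSize = 64;  // GCN wavefront width

int wavesFor(int items) { return (items + kWaveSize - 1) / kWaveSize; }

// Hypothetical helper: prints which waves are active in each phase of a
// merged vertex+geometry launch. On hardware, the "asleep" waves would wait
// while vertex outputs are staged for them (e.g., through the LDS).
void describeMergedLaunch(int vertexCount, int gsPrimCount) {
    int vsWaves = wavesFor(vertexCount);
    int gsWaves = wavesFor(gsPrimCount);
    int launched = vsWaves > gsWaves ? vsWaves : gsWaves;
    for (int wave = 0; wave < launched; ++wave) {
        std::printf("wave %d: vertex phase %s, geometry phase %s\n", wave,
                    wave < vsWaves ? "active" : "asleep",
                    wave < gsWaves ? "active" : "idle");
    }
}
```

For example, describeMergedLaunch(100, 400) launches seven waves, five of which sleep through the vertex phase and only wake for geometry work; that wake-up and copy logic is the extra bookkeeping the driver has to emit.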
The shifting granularities and masking point to a place where it would be nice if the hardware handled divergence better, something some other patents (which may never be used) have brought up. The tradeoffs for both patents, running on programmable hardware that has only mildly evolved from an era predating these new use cases, might look more complicated than they would on a GPU that committed fully to the newer concepts.