Large as in 10%? I consider that large. And that’s what AMD sees in some cases.
I think your performance improvement expectations of some individual features are wildly optimistic.
Really depends on the game, with more potential in less-optimized or immediate-mode titles. The improvements scale with how much overdraw currently exists in a given title. Not to mention the bandwidth savings Nvidia likes to tout with their compression, or, more appropriately, their better culling implementation. Maxwell got a bit more than 10% there.
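To make the overdraw argument concrete, here's a toy single-pixel sketch (hypothetical numbers, not measured data): count how many fragments pass a simple less-than depth test depending on submission order. Binning/culling effectively pushes titles toward the front-to-back case.

```python
def shade_counts(draw_order_depths):
    """Simulate one pixel: count fragments shaded under a simple
    less-than depth test, given per-draw depths in submission order."""
    shaded = 0
    nearest = float("inf")
    for depth in draw_order_depths:
        if depth < nearest:  # depth test passes -> fragment gets shaded
            nearest = depth
            shaded += 1
    return shaded

# Back-to-front submission shades every layer (worst-case overdraw)...
assert shade_counts([0.9, 0.5, 0.1]) == 3
# ...front-to-back (what binning/deferral approximates) shades only one.
assert shade_counts([0.1, 0.5, 0.9]) == 1
```

The spread between those two cases is the budget the DSBR and better culling are fighting over.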
Until FP16/RPM is more widely functional and primitive shaders become documented, I wouldn't consider the features fully functional. They're working at a limited capacity and not necessarily synergizing: RPM should improve primitive shaders, primitive shaders improve culling, culling improves binning by reducing clutter, and binning limits overdraw and fragment dispatch. It just seems there is a lot of software work still to be done.
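For readers unfamiliar with RPM (Rapid Packed Math): the throughput win comes from executing two FP16 ops per 32-bit lane. This sketch only shows the storage side of that, packing two IEEE-754 halfs into one 32-bit word, i.e. how a single 32-bit VGPR holds an FP16 pair; function names are mine, not AMD's.

```python
import struct

def pack_half2(a: float, b: float) -> bytes:
    """Pack two half-precision floats into one 4-byte (32-bit) word."""
    return struct.pack("<2e", a, b)  # 'e' = IEEE-754 binary16

def unpack_half2(word: bytes) -> tuple:
    """Recover the two halfs from a packed 32-bit word."""
    return struct.unpack("<2e", word)

word = pack_half2(1.5, -2.0)
assert len(word) == 4                      # same footprint as one FP32 value
assert unpack_half2(word) == (1.5, -2.0)   # exactly-representable halfs round-trip
```

Halving register footprint per value is also why RPM could ease register pressure for the shaders feeding the primitive pipeline.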
Is this back to equating the Draw Stream Binning Rasterizer to actual Tile-Based Deferred Rendering implementations?
Not equating, but moving much of the culling/overdraw savings from fragment shaders into the primitive pipeline, much like earlydepthstencil in less-than-optimal rendering orders. "Deferred Draw Stream Binning Rasterizer with Tile-Based Rendering" might make more sense. With async compute, having an execution gap between the primitive pipeline and the fragment shaders is less of an issue. The only catch is needing async compute, which is still somewhat limited. It's possible the effects only work well with DX12/Vulkan, but my thinking is that a new toolchain is required, and that's the huge rewrite that is still occurring. At least the public commits aren't what I would call stable, with major functionality still being added. A couple of months ago even FP16 wasn't working across the entire product stack.
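A toy sketch of the two halves being discussed, under my own assumptions (tile size, data layout, and function names are all hypothetical): first bin primitives to screen tiles, then defer per-tile work so fragments can be ordered front-to-back before shading.

```python
from collections import defaultdict

TILE = 16  # hypothetical tile size in pixels

def bin_primitives(prims, width, height):
    """Binning half: gather each primitive (id, bbox, depth) into every
    screen tile its bounding box touches, before any fragment work runs."""
    bins = defaultdict(list)
    for pid, (x0, y0, x1, y1), depth in prims:
        for ty in range(max(0, y0) // TILE, min(height - 1, y1) // TILE + 1):
            for tx in range(max(0, x0) // TILE, min(width - 1, x1) // TILE + 1):
                bins[(tx, ty)].append((pid, depth))
    return bins

def tile_shade_order(bin_prims):
    """Deferred half: sort a tile's primitives front-to-back so the depth
    test can reject occluded fragments before they are shaded."""
    return sorted(bin_prims, key=lambda p: p[1])

# Two overlapping quads on a 32x32 screen: a far one (depth 0.9) and a
# near one (depth 0.1) that spans four tiles.
prims = [(0, (0, 0, 15, 15), 0.9), (1, (8, 8, 31, 31), 0.1)]
bins = bin_primitives(prims, 32, 32)
assert set(bins) == {(0, 0), (1, 0), (0, 1), (1, 1)}
assert tile_shade_order(bins[(0, 0)])[0][0] == 1  # near quad shades first
```

The "deferred" label in my renaming above refers to that gap between binning geometry and dispatching fragments, which is exactly where async compute could hide the latency.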
Given the alleged roadmap of Intel's going to an EMIB-based Gen 12 and 13, it's also possible that AMD's custom chip is meant to temporarily fill a gap in Intel's product line due to the 10nm blowup sinking Intel's internally-sourced graphics efforts for 1-2 major product cycles.
Outside competition would factor into this, but also Intel's need to get something out even versus itself.
That's possible, but it wouldn't necessarily explain AMD not attaching a similar chip to Ryzen. AMD appears to have made APUs smaller, and possibly larger, but not in direct competition with this part. An 8-core Ryzen with a 32 CU Vega, HBM2, and a big 120mm cooler would dominate right now, in part because discrete parts have become scarce.
That would seem to run counter to the "virtual" component of the register file patent, and the timing appears off for it being applicable. VGPR indexing is software-visible, as it is used by the shader code, whose view of the register file is being spoofed by the virtual register file scheme.
The initial filing for what appears to have become the DSBR was 2013. There are games that can be played with when disclosures are filed, but there's a multi-year gap that does seem consistent with these two techniques being part of different designs.
Not counter as much as attacking the problem from different angles. VGPR spilling technically allows the larger register file size, just with unacceptable performance in most cases. The virtual RF would address that with a renaming and paging mechanism that should be transparent to the shader or DSBR model. It would be transparent to the original design, as it would be on par with simply providing a larger cache or register file and relaxing the bin-size requirements. Only begin rasterizing a bin when hitting a context limit, running out of geometry, or receiving a hint from a prior frame that all geometry is present. Actual bin size would be more complex to model, as register pressure could vary significantly with the shader.
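The renaming-and-paging idea can be sketched as a toy model (entirely hypothetical mechanics, not the patent's actual design): the shader indexes a large virtual VGPR space, while a small physical file pages cold registers out to a spill area, invisibly to the shader.

```python
class VirtualRegFile:
    """Toy virtual register file: shader code sees only virtual indices;
    a small physical file is backed by a spill area, with an oldest-first
    eviction policy standing in for real renaming hardware."""

    def __init__(self, physical_slots):
        self.physical = {}   # virtual index -> value (resident)
        self.spill = {}      # virtual index -> value (paged out)
        self.capacity = physical_slots
        self.order = []      # residency order, oldest first

    def _make_resident(self, vreg):
        if vreg in self.physical:
            return
        if len(self.physical) >= self.capacity:  # evict oldest resident
            victim = self.order.pop(0)
            self.spill[victim] = self.physical.pop(victim)
        self.physical[vreg] = self.spill.pop(vreg, 0)
        self.order.append(vreg)

    def write(self, vreg, value):
        self._make_resident(vreg)
        self.physical[vreg] = value

    def read(self, vreg):
        self._make_resident(vreg)  # transparently page back in if spilled
        return self.physical[vreg]

rf = VirtualRegFile(physical_slots=2)
rf.write(0, 10); rf.write(1, 11)
rf.write(2, 12)            # physical file full: v0 silently spills
assert rf.read(0) == 10    # paged back in; the "shader" never noticed
assert rf.read(2) == 12
```

The point of the sketch: the shader's register count is decoupled from physical capacity, which is what would relax the bin-size/register-pressure coupling described above.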
Try this interpretation:
It's overbuilt because, with only so few clients, IF is not yet pushed to its limits.
That would imply IF is a fixed configuration and there are nodes on the network that don't attach to anything, or are only active in server/pro scenarios. That could be the case if physically dividing the CUs into separate virtual hardware devices, but that runs counter to what AMD has been advertising, where the ACEs allow load-balancing many clients in a secure fashion. That's why I think the network was enlarged to accommodate additional IO for Vega10 in server/pro parts: extra space in the form of larger/additional PHYs, with internal routing for growing the network like Epyc. 32 PCIe lanes on a gaming part would be largely wasted, but practical on an SSG, a duo, or an APU if using the same part.