I guess that figure was pulled out of this.
The specific reference I saw for 300 was the cost, in ns, of a memory reference the command processor needs in order to handle an indirect draw. The system has limited ability to hide that latency when there are calls with 0 triangles.
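As a rough back-of-envelope sketch of why that latency matters, here is the arithmetic in Python. The 300 ns figure is the one quoted above. The function name, the fully serialized default, and the `overlap` knob are my own assumptions for illustration, not anything AMD has described.

```python
def serialized_cp_cost_ns(num_indirect_draws, latency_ns=300.0, overlap=0.0):
    """Estimate time spent waiting on indirect-draw argument fetches.

    Assumption: each indirect draw needs one memory reference of
    `latency_ns`, and a fraction `overlap` (0.0 to 1.0) of that latency
    is hidden behind other work. overlap=0.0 models the worst case where
    nothing is hidden, e.g. a run of draws with 0 triangles.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0.0 and 1.0")
    return num_indirect_draws * latency_ns * (1.0 - overlap)


# 10,000 empty indirect draws with nothing hidden: 3,000,000 ns = 3 ms
worst_case_ms = serialized_cp_cost_ns(10_000) / 1e6
```

Under those assumptions, ten thousand empty draws would cost about 3 ms of front-end time in a single frame, which is why a serial point like the CP stands out next to the back end.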
This does show that there are points in the architecture with massively less leeway to handle costs concurrently. One CP versus 2-4 DSBRs is still dwarfed by the parallel capacity of the back end. Given the Vega white paper's placement of primitive shaders, it would require time travel for an intra-draw element like a primitive shader to retroactively decide the command processor shouldn't have launched the vertex process that contains it.
The architectural descriptions and patents also usually describe the tiled deferred rasterization step as coming after primitive assembly, which, if mapped to the "everything is a primitive shader" claim, would create a shader spanning the VGT, the FIFO, the PA, and everything right up to the point where the SPI starts instantiating wavefronts.
The balancing act in managing the VGT to PA path is where I'm curious if there's an impact for Vega.
A primitive shader's code is inside the VGT path, and it is culling code that exists in addition to (or redundantly with) code or hardware that implements the VGT and standard culling portion of the process. AMD indicated that the standard pipeline is still there, which may explain some of the complexity in activating or exposing primitive shaders more generally, if it does involve balancing overproduction or starvation across elements like a limited number of inter-stage FIFOs.
I also note that the culling shaders do not try to check for all coverage scenarios for MSAA, which makes me wonder if a primitive shader would either.
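To illustrate how a gap like that can arise, here is a minimal sketch of a conservative small-primitive cull test in Python. It is not taken from any actual culling shader. The function names and the two-sample MSAA offsets are hypothetical, picked only to show a test that checks pixel centers alone rejecting a primitive that still covers an MSAA sample.

```python
import math


def spans_sample(lo, hi, offset):
    """True if some sample position k + offset (k an integer) lies in [lo, hi]."""
    return math.floor(hi - offset) >= math.ceil(lo - offset)


def cull_small_primitive(bbox_min, bbox_max, sample_offsets=((0.5, 0.5),)):
    """Return True if the screen-space bounding box misses every sample.

    The default sample_offsets checks pixel centers only. Passing more
    offsets models extra MSAA sample positions (hypothetical values here,
    not a real hardware sample pattern).
    """
    for ox, oy in sample_offsets:
        hits_x = spans_sample(bbox_min[0], bbox_max[0], ox)
        hits_y = spans_sample(bbox_min[1], bbox_max[1], oy)
        if hits_x and hits_y:
            return False  # the box may cover this sample, so keep the primitive
    return True


# A tiny box that sits between pixel centers
bbox_min, bbox_max = (10.1, 20.1), (10.4, 20.4)
culled_center_only = cull_small_primitive(bbox_min, bbox_max)
culled_with_msaa = cull_small_primitive(
    bbox_min, bbox_max, sample_offsets=((0.25, 0.25), (0.75, 0.75))
)
```

With the center-only test the box is culled, but with the extra sample offsets it is kept, so a shader that only implements the first form would drop geometry that is visible under MSAA.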