Right now, I'm using the following mental analogy for primitive shaders: I see the geometry stages in the pipeline as relatively small functions in a C program that are called in succession. Without an optimizing compiler, there's a lot of overhead just passing function parameters and results around, pushing and popping values on the stack, and so on. An optimizing compiler would simply inline the different subroutines, and all that overhead goes away.
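To make the analogy concrete, here's a toy C sketch; the stage names and the work they do are purely illustrative:

```c
#include <stdio.h>

/* Toy stand-ins for pipeline stages: each stage is a small function
   that receives inputs and returns results through memory, the way
   separate shader stages pass attributes through intermediate buffers. */
typedef struct { float x, y, z; } Vec3;

static Vec3 vertex_stage(Vec3 v)   { v.x *= 2.0f; return v; }  /* transform */
static Vec3 geometry_stage(Vec3 v) { v.y += v.x;  return v; }  /* per-prim work */

/* Un-inlined path: every hand-off spills a result and reloads it. */
static Vec3 pipeline_separate(Vec3 in) {
    Vec3 tmp = vertex_stage(in);   /* result written out...       */
    return geometry_stage(tmp);    /* ...then read back in again  */
}

/* Inlined path: the compiler fuses the stages, the intermediate
   value stays in registers, and the call/return overhead is gone. */
static Vec3 pipeline_fused(Vec3 in) {
    in.x *= 2.0f;
    in.y += in.x;
    return in;
}

int main(void) {
    Vec3 v = {1.0f, 1.0f, 1.0f};
    Vec3 a = pipeline_separate(v);
    Vec3 b = pipeline_fused(v);
    printf("separate: %.1f  fused: %.1f\n", a.y, b.y);  /* same answer */
    return 0;
}
```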
The merging of shader stages, the multiple combinations of API shader types, the different enablement states, dynamic state, and the DSBR (with its multiple modes) might explain why this optimizing-compiler approach hasn't yet yielded results good enough to enable it generally. Worse, if exposing some of the internals of those subroutines also exposed or created an x87-type oversight or bug, some nasty corner case could wind up pessimizing the overall critical path.
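A rough sketch of why that cross product gets ugly; every count below is an invented placeholder, not a hardware fact:

```c
#include <stdio.h>

/* Illustrative-only enumeration of the state a driver compiler would
   have to account for when deciding whether a fused variant is safe
   and profitable.  The categories come from the paragraph above; the
   counts are made up. */
enum { STAGE_COMBOS  = 5 };  /* e.g. VS, VS+GS, VS+HS+DS, VS+HS+DS+GS, ... */
enum { ENABLE_STATES = 4 };  /* which optional stages are active           */
enum { DYNAMIC_STATE = 8 };  /* state that can change per draw             */
enum { DSBR_MODES    = 3 };  /* binning off / batched / some hybrid        */

int main(void) {
    int variants = STAGE_COMBOS * ENABLE_STATES * DYNAMIC_STATE * DSBR_MODES;
    printf("variants to validate: %d\n", variants);  /* 480 in this toy case */
    return 0;
}
```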
The remaining transition points in the pipeline are also centered on hand-offs between hardware blocks with different amplification and reduction possibilities, so the fixed-function stages might be better characterized as independent threads with access to dedicated libraries, specialized formatting intrinsics, and a hardware-compressed, more complete representation of architectural behavior.
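A sketch of what an amplifying/reducing hand-off looks like in that framing; ff_clip_expand is a hypothetical stand-in for a fixed-function block, not a real interface:

```c
#include <stdio.h>

/* One input primitive may become zero outputs (culled) or several
   (e.g. clipped into multiple triangles), so the downstream block
   can't know its workload until the hand-off happens. */
typedef struct { int id; int visible; } Prim;

/* Hypothetical fixed-function helper: expands one primitive into
   0..2 outputs, with a data-dependent count. */
static int ff_clip_expand(Prim in, Prim out[2]) {
    if (!in.visible) return 0;   /* reduction: culled            */
    out[0] = in; out[1] = in;    /* amplification: split in two  */
    return (in.id % 2) ? 2 : 1;  /* count depends on the data    */
}

int main(void) {
    Prim stream[4] = {{0,1},{1,1},{2,0},{3,1}};
    int produced = 0;
    for (int i = 0; i < 4; i++) {
        Prim out[2];
        produced += ff_clip_expand(stream[i], out);
    }
    printf("in: 4, out: %d\n", produced);  /* 4 in, 5 out here */
    return 0;
}
```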
The primitive shaders are described as running serially with respect to a significant portion of the existing path, and as conservative (imperfect, and optional) representations of the optimized subroutines.
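"Conservative" here can be read as: the early pass may keep work the full path would reject, but must never reject work the full path would keep. A toy illustration, with an invented epsilon rather than any documented hardware value:

```c
#include <stdio.h>

typedef struct { float area; } Tri;

static int precise_cull(Tri t)      { return t.area <= 0.0f; }    /* full path  */
static int conservative_cull(Tri t) { return t.area <= -0.001f; } /* early, safe */

int main(void) {
    Tri tris[3] = {{1.0f}, {0.0f}, {-1.0f}};
    for (int i = 0; i < 3; i++) {
        /* Legal outcome: the early pass forwards a superset of what the
           precise path keeps; the precise path still catches the rest. */
        printf("tri %d: early-cull=%d precise-cull=%d\n",
               i, conservative_cull(tris[i]), precise_cull(tris[i]));
    }
    return 0;
}
```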
The optimizing compiler in this analogy may find that an optimized sequence of cost X translates into X+n on the critical path once it has to feed a pre-existing concurrent machine. The vertex processing path is generally characterized as being less able to hide latency, and the DSBR's related patents indicate a serial component and limited scope for concurrent batching.
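A toy cost model of the X versus X+n point, with invented numbers; the shader gets cheaper, but the non-overlappable portion lengthens the critical path anyway:

```c
#include <stdio.h>

int main(void) {
    int front_old = 100, back = 100;  /* fully overlapped: path = max = 100 */
    int front_new = 80;               /* optimized sequence: cost X          */
    int serial_n  = 30;               /* non-overlappable portion: n         */

    int path_old = front_old > back ? front_old : back;
    int overlap  = front_new - serial_n;  /* part that still hides behind 'back' */
    int path_new = serial_n + (overlap > back ? overlap : back);

    printf("old critical path: %d\n", path_old);  /* 100                          */
    printf("new critical path: %d\n", path_new);  /* 130: faster shader, slower path */
    return 0;
}
```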
The binning process might also have inflection points around batching: closing a batch sooner reduces latency, while a longer window leaves more opportunity for coalescing and culling.
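A toy model of that inflection, with invented curves; the only point is that cost falls and then rises as the window grows:

```c
#include <stdio.h>

int main(void) {
    int work = 1000;  /* raster work per batch, pre-cull */
    for (int batch = 1; batch <= 64; batch *= 2) {
        float culled  = 0.5f * batch / (batch + 8.0f);  /* saturating cull rate */
        float latency = 2.0f * batch;                   /* linear batching delay */
        float cost    = work * (1.0f - culled) + latency;
        printf("batch %2d: cost %.0f\n", batch, cost);  /* minimum near batch 32 */
    }
    return 0;
}
```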
The old way might have burned more resources and bandwidth, but that matters more for a small implementation like Raven Ridge. An implementation like Vega 64 finds that its scads of resources and larger power budget are less able to hide the incrementally higher serial overhead revealed by its much more parallel engine and broader fixed-function resources.
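The Amdahl-style arithmetic behind that, using the real CU counts (11 and 64) as stand-ins for width but an invented serial fraction:

```c
#include <stdio.h>

/* The same fixed serial overhead costs the wide machine proportionally
   more, because the parallel part shrinks faster than the serial part. */
int main(void) {
    float serial = 0.05f;       /* fraction that cannot overlap (made up) */
    int widths[2] = {11, 64};   /* Raven Ridge vs Vega 64 CU counts       */
    for (int i = 0; i < 2; i++) {
        float t = serial + (1.0f - serial) / widths[i];
        printf("width %2d: time %.4f, serial share %.0f%%\n",
               widths[i], t, 100.0f * serial / t);  /* ~37%% vs ~77%% */
    }
    return 0;
}
```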