Key difference being what used to be a driver optimization is becoming exposed to devs.
The cited statement is literally saying it is not being exposed to devs. It was noted that GFX9 merged several internal setup stages, seemingly combining the stages that generate the vertices needing position calculation with the stages that then process and set them up. That sounds like it has to do with primitive shaders, or is what marketing decided to call primitive shaders.
The prior shader stages didn't look like they were exposed to developers, so AMD might not be driven to expose the new internal stages all that quickly. Some of the work that occurs hooks into parts of the pipeline that have been protected from developers so far.
With the recent open-world craze and modding it will likely be used. Fallout 4, Skyrim, etc. can easily use all available memory as gamers load all sorts of models and texture packs, limited only by acceptable performance. HBCC is likely the key feature Bethesda was after, as it would aid old and new games alike: cases where the dev can't control the scene because a gamer plops objects everywhere, inevitably building some giant castle.
The stats I've seen for the gaming customer base may be out of date by now, but most systems they'd be selling to would be less well-endowed than a Vega RX. I do not think they are going to abandon those systems, and as such counting on a Vega-specific feature outside of coincidental use seems unwise.
I would like to see an interview or some statement about Bethesda pushing for a leading-edge feature like HBCC or similar tech, given that their open-world engine isn't from this decade.
I'm curious about the internal makeup of the primitive shader pipeline, such as whether the front end is running the position and attribute processing phases within the same shader invocation, or if it's like the triangle sieve compute shader + vertex shader customization that was done for the PS4.
The idea of using a compute-oriented code section to calculate position information and do things like back-face culling and frustum checking ahead of feeding into the graphics front end seems to be held in common.
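For reference, my mental model of the PS4-style sieve is something like the compute pass below: a rough CUDA-flavored sketch with made-up buffer layouts and helper names, not Sony's or AMD's actual code, just the structure of "position-only work, cull, write compacted indices for the real vertex pass".

```cuda
// Hypothetical triangle-sieve pass: run only the position half of the vertex
// work, reject back-facing and off-screen triangles, and emit a compacted
// index list for the normal vertex+attribute pass to consume.
#include <cuda_runtime.h>

__device__ float4 computeClipPosition(const float* vtx, const float* mvp)
{
    // Position-only subset of the vertex shader: one matrix multiply,
    // no attribute (UV/normal/tangent) work at all. mvp is column-major 4x4.
    float4 p;
    p.x = mvp[0]*vtx[0] + mvp[4]*vtx[1] + mvp[8]*vtx[2]  + mvp[12];
    p.y = mvp[1]*vtx[0] + mvp[5]*vtx[1] + mvp[9]*vtx[2]  + mvp[13];
    p.z = mvp[2]*vtx[0] + mvp[6]*vtx[1] + mvp[10]*vtx[2] + mvp[14];
    p.w = mvp[3]*vtx[0] + mvp[7]*vtx[1] + mvp[11]*vtx[2] + mvp[15];
    return p;
}

__device__ bool outsideFrustum(float4 a, float4 b, float4 c)
{
    // Conservative reject: only if all three vertices sit outside the same
    // clip plane (only the x planes shown here for brevity).
    if (a.x >  a.w && b.x >  b.w && c.x >  c.w) return true;
    if (a.x < -a.w && b.x < -b.w && c.x < -c.w) return true;
    return false;
}

__device__ bool backFacing(float4 a, float4 b, float4 c)
{
    // Simplified signed-area test in NDC (assumes w > 0); non-positive area
    // means back-facing or degenerate for a counter-clockwise convention.
    float area = (b.x/b.w - a.x/a.w) * (c.y/c.w - a.y/a.w)
               - (c.x/c.w - a.x/a.w) * (b.y/b.w - a.y/a.w);
    return area <= 0.0f;
}

// One thread per triangle: sieve the index buffer into a compacted list.
__global__ void triangleSieve(const unsigned* indices, unsigned numTris,
                              const float* positions,  // xyz per vertex
                              const float* mvp,
                              unsigned* outIndices, unsigned* outCount)
{
    unsigned tri = blockIdx.x * blockDim.x + threadIdx.x;
    if (tri >= numTris) return;

    unsigned i0 = indices[3*tri + 0];
    unsigned i1 = indices[3*tri + 1];
    unsigned i2 = indices[3*tri + 2];

    float4 p0 = computeClipPosition(&positions[3*i0], mvp);
    float4 p1 = computeClipPosition(&positions[3*i1], mvp);
    float4 p2 = computeClipPosition(&positions[3*i2], mvp);

    if (backFacing(p0, p1, p2) || outsideFrustum(p0, p1, p2)) return;

    // Survivors go into the intermediate buffer that the regular vertex
    // shader pass then reads; that buffer and the second pass are the extra
    // overhead discussed below.
    unsigned slot = atomicAdd(outCount, 3u);
    outIndices[slot + 0] = i0;
    outIndices[slot + 1] = i1;
    outIndices[slot + 2] = i2;
}
```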
Mark Cerny indicated that this was not always beneficial, since developers would need to do some preliminary testing to see if it made things better.
Potentially, the two parallel invocations and the intermediate buffer between them were a source of additional overhead.
If this isn't a two-workgroup solution, then AMD may have noted from analyzing the triangle-sieve method that it could take that shader and the regular vertex shader, throw them both into the same bucket, and then use compile-time analysis to hoist the position and culling paths out of the original position+attribute loop.
Perhaps a single shader removes overhead and makes it more likely to be universally beneficial than it was with the PS4. If not, then going by the Linux patches it may be that it's not optional like it was for the PS4.
Analysis of the changes may explain what other trade-offs there are. Making a single shader out of two shaders may save one portion of the overhead and possibly reuse results that would otherwise have to be recalculated. However, it could also be that while it's smaller than two shaders, it's still an individually more unwieldy shader in terms of occupancy and hardware hazards, and it is exposed to sources of undesirable complexity from both sides. The statement that it calculates positions and then moves to attributes may also point to a potentially longer serial component. From the sounds of things, this is an optimization of some of the programmable-to-fixed-function transition points, but the fixed-function element is necessarily still there, since the preliminary culling must be conservative.
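To make the single-shader contrast concrete, here is the same toy sieve folded into one invocation that does positions first and attributes only for survivors, reusing the helpers from the sketch above. Again, this is purely illustrative with invented names and layouts, not AMD's actual primitive shader.

```cuda
// Hypothetical merged "positions, then cull, then attributes" pass. The
// intermediate index buffer and second launch from the two-pass sieve go
// away, at the cost of one larger invocation doing both halves of the work.
__global__ void mergedPositionThenAttributes(const unsigned* indices, unsigned numTris,
                                             const float* positions,
                                             const float* attributes,   // e.g. UVs, 2 floats per vertex
                                             const float* mvp,
                                             float4* outClipPos, float2* outUV,
                                             unsigned* outIndices, unsigned* outCount)
{
    unsigned tri = blockIdx.x * blockDim.x + threadIdx.x;
    if (tri >= numTris) return;

    unsigned i0 = indices[3*tri + 0], i1 = indices[3*tri + 1], i2 = indices[3*tri + 2];

    // Phase 1: the hoisted position/cull path. Only position math runs here,
    // so a rejected triangle never pays for attribute work.
    float4 p0 = computeClipPosition(&positions[3*i0], mvp);
    float4 p1 = computeClipPosition(&positions[3*i1], mvp);
    float4 p2 = computeClipPosition(&positions[3*i2], mvp);
    if (backFacing(p0, p1, p2) || outsideFrustum(p0, p1, p2)) return;

    // Phase 2: attribute work, serialized behind phase 1 for the survivors.
    // Positions computed above are reused instead of recalculated, but both
    // phases now share one invocation's register budget.
    unsigned slot = atomicAdd(outCount, 3u);
    unsigned ids[3] = { i0, i1, i2 };
    float4   ps[3]  = { p0, p1, p2 };
    for (int v = 0; v < 3; ++v) {
        outClipPos[slot + v]  = ps[v];
        outUV[slot + v]       = make_float2(attributes[2*ids[v]], attributes[2*ids[v] + 1]);
        outIndices[slot + v]  = slot + v;
    }
}
```

Even in this toy form, the serial "positions, then attributes" shape and the fatter single invocation are visible, which is the trade-off I mean above.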