I think the idea of a dumb driver has merit. There are allusions to multiple kinds of higher-level shading in the whitepaper: primitive, surface, deferred attribute, etc.
There are higher-level concepts, although whether they would be specific forms of complex shaders or overall renderer algorithms that become more practical with them isn't clear to me. The exception would be the surface shader, which AMD gives as the name for a position in the pipeline matching the merged LS-HS stages. Until there's more information, I am wary of reading too much into what may be more aspirational or version-2.0 possibilities. The merged stages are consistent with making the output of the general Vertex stage and its variants more flexible, but it looks like stages remain separated at junctions with fixed-function hardware and/or changes in item count--be it amplification or decimation.
The programmable stages do not appear able to fully encapsulate the tessellation hardware or the geometry shader stage without at least some recognition of the discontinuity, although the GS stage seems closer to being generalized, aside from the VS pass-through stage.
The VS to PS transition point is an area of fixed-function hardware and amplification, potentially covering the DSBR's post-batch path through scan conversion and on through the wavefront initialization done by the SPI.
A hypothetical splitting of parameter cache writes from other primitive shader functions, such that only parameters that feed visible pixels are written, would require straddling the VS to DSBR to PS path, which has some interestingly complex behaviors to traverse.
As I described earlier, a non-compute shader looks to the GPU like a set of metadata defining the inputs, the shader program itself, and metadata for the output.
I think it also relies on the GPU to implicitly maintain semantics, ordering, and tighter consistency than the more programmable shader array provides on its own. The graphics domain's tracking of internal pipelines, protection, and higher-level behavior provides a certain amount of overarching state that the programs are often only dimly aware of, like the hierarchical culling and compression methods. It's not entirely alien to some of the meta-state handled for speculative reasons in CPU pipelines (branch prediction, stack management, prefetching, etc.), or to their respective forms of context switching and kernel management.
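To make the "metadata in, shader, metadata out" view above a bit more concrete, here's a minimal C++ sketch. Every name in it is hypothetical--it isn't any driver's actual layout, just the shape of the idea:

```cpp
// Purely illustrative sketch of the "input metadata + shader + output metadata"
// view of a non-compute shader. Hypothetical structs, not real driver state;
// actual hardware is programmed through context registers and descriptors.
#include <cstdint>
#include <vector>

// Where the shader's inputs come from (vertex buffers, attributes, ...).
struct InputDecl {
    uint32_t location;   // binding slot / attribute index
    uint32_t format;     // e.g. an enum standing in for R32G32B32_FLOAT
    uint32_t stride;     // bytes between consecutive elements
};

// What the shader exports (position, parameters, render targets).
struct OutputDecl {
    uint32_t semantic;   // e.g. position, parameter slot, MRT index
    uint32_t components; // how many components are actually written
};

// The shader as the GPU "sees" it in this simplified view. The surrounding
// fixed-function pipeline supplies the ordering and consistency guarantees
// that the program itself never states explicitly.
struct GraphicsShader {
    std::vector<InputDecl>  inputs;
    std::vector<uint8_t>    isa;      // compiled shader binary
    std::vector<OutputDecl> outputs;
};

int main() {
    GraphicsShader vs;
    vs.inputs  = { {0, 1, 12} };  // one position stream
    vs.outputs = { {0, 4} };      // exports a 4-component position
    (void)vs;
}
```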
The idea of having side-band hardware functions is also not unique to the GPU domain and is seeing ongoing adoption for non-legacy reasons. Compression is making its way into CPUs, where autonomous meta-work opportunistically saves some critical resource without injecting extra work into the instruction stream that benefits from it.
In effect the entire pipeline configuration has, itself, become fully programmable:
It might not quite be there yet, since load-balancing pain points seem to be part of where the old divisions remain, and the VS to PS division persists for some reason despite them being ostensibly similarly programmable.
Hence the talk of "low level assembly" to make it work. That idea reminds me of the hoop-jumping required to use GDS in your own algorithms and to make LDS persist across kernel invocations.
There are implicit constraints in current recommendations, like the amount of ALU work between position and parameter exports, in an attempt to juggle internal cache thrashing against pipeline starvation. It's part of why I'm leery of exposing byzantine internal details to software. Low-level details bleeding into higher levels of abstraction have a tendency to obligate future revisions to honor them, or to force developers to juggle variations. It's an area where having abstractions, or a smarter driver, would stop today's hardware from strangling generation N+1.
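As a rough illustration of that export-scheduling recommendation, here's the shape involved. export_position() and export_parameter() are hypothetical stand-ins for the hardware export points, not a real API:

```cpp
#include <cstdio>

struct Vec4 { float x, y, z, w; };

// Hypothetical stand-ins: on real hardware these would be export instructions
// handing data to the fixed-function front end and the parameter cache.
static void export_position(const Vec4& p)            { std::printf("pos   %.1f\n", p.x); }
static void export_parameter(int slot, const Vec4& v) { std::printf("param %d %.1f\n", slot, v.x); }

static void vertex_main(const Vec4& in_pos, const Vec4& in_normal)
{
    // Do the position math first and export it early, so the fixed-function
    // front end (culling, setup) isn't left waiting on this wave.
    Vec4 pos = { in_pos.x * 2.0f, in_pos.y * 2.0f, in_pos.z, 1.0f };
    export_position(pos);

    // The tuning advice is about how much ALU work lands *here*: enough math
    // between the two exports to hide latency, but not so much that
    // parameter-cache space stays tied up while later waves starve.
    Vec4 n = { in_normal.x, in_normal.y, in_normal.z, 0.0f };

    export_parameter(0, n);
}

int main() { vertex_main({1, 2, 3, 1}, {0, 1, 0, 0}); }
```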
Of course Larrabee is still a useful reference point in all this talk of a software defined GPU - you could write your own pipeline instead of using what came out of the box courtesy of the driver.
It would be interesting to see how some of the initial conditions would be revisited. The default rasterization scheme was willing to accept some amount of the front-end work being serialized to a core, whereas today's GPUs have since become more parallel at various points along that path.
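For reference, a minimal sketch of that structure, assuming a Larrabee-style sort-middle split: binning serialized on one thread as the front end, per-tile work fanned out to workers. Illustrative only, not how Larrabee's pipeline was actually written:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct Triangle { float x0, y0, x1, y1, x2, y2; };

constexpr int kTilesX = 8, kTilesY = 8, kTileSize = 64;

int main() {
    std::vector<Triangle> tris(10000, {10, 10, 500, 10, 200, 300});
    std::vector<std::vector<Triangle>> bins(kTilesX * kTilesY);

    // Serial front end: bin each triangle by its bounding box.
    // This is the portion the default scheme accepted as serialized-to-a-core.
    for (const Triangle& t : tris) {
        int tx0 = std::clamp(int(std::min({t.x0, t.x1, t.x2})) / kTileSize, 0, kTilesX - 1);
        int tx1 = std::clamp(int(std::max({t.x0, t.x1, t.x2})) / kTileSize, 0, kTilesX - 1);
        int ty0 = std::clamp(int(std::min({t.y0, t.y1, t.y2})) / kTileSize, 0, kTilesY - 1);
        int ty1 = std::clamp(int(std::max({t.y0, t.y1, t.y2})) / kTileSize, 0, kTilesY - 1);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * kTilesX + tx].push_back(t);
    }

    // Parallel back end: each worker grabs whole tiles and "shades" them.
    std::atomic<int> nextTile{0};
    auto worker = [&] {
        for (int i; (i = nextTile.fetch_add(1)) < kTilesX * kTilesY; ) {
            volatile size_t work = bins[i].size();  // stands in for raster + pixel work
            (void)work;
        }
    };
    std::vector<std::thread> pool;
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();

    std::printf("binned %zu triangles into %d tiles\n", tris.size(), kTilesX * kTilesY);
}
```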
Maybe ... or maybe something more focussed on producer-consumer workloads, which isn't a use-case that caches support directly (cache-line locking is one thing needed to make caches support producer-consumer).
Line locking comes up as an elegant solution to specific loads, but much mightier coherent system architectures have taken a look at it, and designers consistently shoot it down for behaviors more complex than briefly holding a cache line for atomics and the like. There are implicit costs to impinging on a coherent space that have stymied such measures for longer than GPUs have been a concept.
More advanced synchronization proposals tend towards registering information with some form of arbiter or specialized context like transactions or synchronization monitors.
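To ground that a little, below is a minimal single-producer/single-consumer ring buffer on plain atomics, which is roughly what software is left doing today: the head/tail lines and the payload just bounce between the two cores' caches while both sides spin, which is the gap that line locking, arbiters, or monitor-style schemes aim at. The queue itself is a generic sketch, not any particular proposal:

```cpp
#include <array>
#include <atomic>
#include <cstdio>
#include <thread>

constexpr size_t kSlots = 1024;  // power of two

struct SpscQueue {
    std::array<int, kSlots> slots{};
    // Separate cache lines so the indices don't false-share, but the lines
    // still migrate between the producer's and consumer's cores on every handoff.
    alignas(64) std::atomic<size_t> head{0};  // written only by the producer
    alignas(64) std::atomic<size_t> tail{0};  // written only by the consumer

    bool push(int v) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == kSlots) return false;  // full
        slots[h % kSlots] = v;
        head.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(int& v) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (head.load(std::memory_order_acquire) == t) return false;  // empty
        v = slots[t % kSlots];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};

int main() {
    SpscQueue q;
    constexpr int kItems = 1'000'000;

    std::thread producer([&] {
        for (int i = 0; i < kItems; ++i)
            while (!q.push(i)) { /* spin: pure coherence traffic, no arbiter to park on */ }
    });

    long long sum = 0;
    for (int received = 0, v = 0; received < kItems; ) {
        if (q.pop(v)) { sum += v; ++received; }
        // A hardware queue or monitor could block or hand off here instead of spinning.
    }
    producer.join();
    std::printf("sum = %lld\n", sum);
}
```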
I still think we could have a viable Larrabee-like graphics processor now. The available compute in these latest GPUs is barely growing (50% growth in two years from AMD - that is actually horrifying to me), which I think proves that we're truly in the era of smarter graphics (software-defined on-chip pipelines) rather than the nonsense of wrestling with fixed-function TS, RS and OM (what else?). All the transistors spent on anything but cache and ALUs just seem wasted most of the time in modern rendering algorithms.
Increasing programmability does not lessen the dependence on compute, which apparently has slowed for AMD during its laurel-resting period. It's also potentially not a straightforward comparison between Larrabee's x86 cores and GCN. The memory hierarchies remain quite different, and the threading models differ. GCN's domain-specific hardware still performs functions the CUs are not tasked with--functions Larrabee's cores can/must handle themselves.
Without actual implementations, we wouldn't know how many truths of the time would hold up now.
It's not even necessarily just the ALUs and caches we need to worry about, given the growing networks of DVFS sensors, data fabrics, interconnects, and all the transistors AMD said it spent on wire delay and clock driving. Sensors, controllers, offload engines, and wires seem to be showing up more and more.