With the new rumors about Fiji's improved scalar unit (memory stores, full instruction set, one scalar unit per CU), the GCN architecture seems to be moving even closer to throughput-oriented in-order CPUs with wide SIMD.
I am aware of the scalar memory store capability, as that was added in Tonga and is part of the latest GCN ISA document. The other two items I have not seen mooted for Fiji, although we should know rather soon what tweaks, if any, it has over Tonga. A full instruction set I've only seen raised as a theoretical possibility, and one scalar unit per CU is already the case. Did you mean something like the data cache (maybe?) being replicated per CU?
Knights Landing has a simple in-order scalar pipeline that handles branching and control flow (and uniform integer math and uniform loads/stores can be offloaded to it), 512-bit AVX (16-wide for 32-bit float), and 4-way hyperthreading (GCN is 10-way). The similarities are striking.
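For concreteness on the "16-wide for 32-bit float" part, here is a minimal AVX-512 sketch (nothing KNL-specific, just plain AVX-512F intrinsics; needs a capable CPU and something like -mavx512f to build):

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    // A single 512-bit zmm register holds 16 single-precision floats.
    __m512 a = _mm512_set1_ps(1.0f);
    __m512 b = _mm512_set1_ps(2.0f);
    __m512 c = _mm512_add_ps(a, b);   // 16 FP32 adds in one instruction

    alignas(64) float out[16];
    _mm512_store_ps(out, c);          // store all 16 lanes
    printf("%zu lanes, lane 0 = %.1f\n", sizeof(out) / sizeof(out[0]), out[0]);
    return 0;
}
```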
From what I've seen, Knights Landing is a superscalar OoOE processor with speculative execution and memory pipelines, a far stronger memory model, and precise exceptions for at least the integer domain, and it meets the level of rigor that permits each core to function as a host processor.
I am curious whether Knights Landing's vector sections are listed separately due to some kind of separate scheduling or memory domain, which might provide something closer to a design parallel. The other design parallels, such as SIMD width and the presence of more than one hardware thread, do not seem much closer between the two than between other SMT and SIMD implementations.
A 256-bit (32-byte) resource descriptor is a significant amount of data. However, you also need to send 64 UVs: 64 * 2 * sizeof(float) = 512 bytes, so the resource descriptor is only 6.25% of the data. Some sampling instructions also need a mip level or gradients (further reducing the resource descriptor's share). It is not that bad a design call, and it gives AMD lots of flexibility in the future.
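Putting those numbers in one place (a quick C++ sketch of the arithmetic above, nothing more):

```cpp
#include <cstdio>

int main() {
    // One wavefront-wide sample request: a 256-bit descriptor alongside
    // per-lane UV coordinates (64 lanes, 2 floats each), as in the post above.
    const int descriptor_bytes = 256 / 8;              // 32 bytes
    const int uv_bytes = 64 * 2 * (int)sizeof(float);  // 512 bytes

    printf("descriptor: %d bytes, UVs: %d bytes\n", descriptor_bytes, uv_bytes);
    printf("descriptor relative to UV payload: %.2f%%\n",
           100.0 * descriptor_bytes / uv_bytes);        // 6.25%
    return 0;
}
```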
If I may weigh in, one possible factor, going back to 2011 and the early days of GCN, was that AMD was actively trying to reduce the amount of hidden state in the CU's execution context. For me, Google only brings up a few references to this, with one of my posts being one of them, oddly enough.
The texture path is an area where they had not fully exposed the CU's execution context to software, possibly related to the phase in the texturing process where texture accesses that require multiple samples are cracked into separate cache accesses and eventually returned as filtered values.
Explicit passing of descriptors to the vector memory path exposes what was once a separate collection of internal states. The GCN scalar unit itself might be more notable in that it is software-exposed, but there would have been a hardware analog running in the shadows before the clause model was abandoned.
In light of this goal, much less moves independently of the shader context for a compute architecture that had compute context-switching and pervasive virtualization as design targets. Pointer passing would mean there would be an engine doing who knows what if a context switch were ordered, whereas literal data passing doesn't need to worry about maintaining virtualization, since that was already handled by the explicit virtual memory system before the data made it to the scalar registers.
A TMU descriptor engine without the necessary synchronization or translation hardware would be potentially destructive, whilst having one that elaborate would be more expensive. One potential reason for the way things are is that the scalar unit might have been such an engine prior to being exposed in GCN.
For AMD's specific compute goals, passing the descriptor data itself may have been necessary for their implementation needs on the path towards the FSA-now-HSA model they wanted.
Programmatically generating resource descriptors may have been a consequence they noted, although Mantle's choice not to go down that route may point to a level of disinterest or to pitfalls in the technology.
It would be very different system behavior, which they may not have been able to validate, and which may have been too exotic compared to other architectures to get broad buy-in. There may be driver behaviors in Mantle that assume too much of the old model persists for shaders to be allowed to generate another layer of dynamic state that could interact with existing state.
Another factor, going back to what Knights Corner has that GCN does not, is that GCN has at least some FP exception tracking, but many other, non-vector faults are very imprecise. Basing Mantle 1.0 on programmatic descriptors built on hardware with a blind spot over its binding implementation may have been premature, or possibly too binding at a low level for maintaining compatibility or changing implementations.
Don't get fooled by the maximum number of compute queues (shown by some review sites).
I still don't get where the controversy is coming from on this. AMD's description of queues and the processors that manage them seems straightforward to me.
I was thinking about a shader that stores the resource descriptors in the instruction stream. This ensures that the resource descriptors never miss the cache (as the instruction stream is prefetched linearly, and the resource descriptor and the sampling instruction share the same scalar/instruction cache line).
The SALU's immediate handling might allow this, although the payload efficiency is not great and it might take some additional fiddling to prepare the destination registers.
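To put a rough number on that payload efficiency (a sketch; the four-byte-opcode-plus-four-byte-literal cost per scalar move is my reading of the GCN encoding sizes, so treat it as an assumption):

```cpp
#include <cstdio>

int main() {
    // Materializing a 256-bit (8-dword) descriptor from the instruction
    // stream with one scalar move plus a 32-bit literal per dword.
    const int descriptor_dwords = 8;
    const int bytes_per_literal_move = 4 + 4;  // assumed: opcode dword + literal dword

    const int payload_bytes     = descriptor_dwords * 4;                      // 32
    const int instruction_bytes = descriptor_dwords * bytes_per_literal_move; // 64

    printf("instruction bytes spent: %d\n", instruction_bytes);
    printf("descriptor bytes carried: %d\n", payload_bytes);
    printf("payload efficiency: %.0f%%\n",
           100.0 * payload_bytes / instruction_bytes);                        // 50%
    return 0;
}
```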
It may not work with programmatically generated values. The GCN ISA docs don't seem to offer a clear avenue for dynamically changing code.