I like GCN compute units. The design is elegant. I definitely do not want OoO. I like that GCN is a memory-based architecture: all the resources are stored in memory and cached by general-purpose cache hardware.
The cache hierarchy could stand some improvement.
The 2011-era implementation GCN still has is a step above the incoherent read-only pipeline it had before, but its behavior is still too primitive to mesh with the CPU (Onion+ bypasses it), and its method of operation moves a lot of bits over significant distances.
The increase in channel count and in the number of requestors for the L2 means more data moving over longer distances, which costs power. The way GCN enforces coherence, by forcing misses to the L2 or to memory, costs power as well.
Changes like more tightly linking parts of the GPU to the more compact HBM channels and going writeback between the last-level cache and the CUs could reduce this, but not without redesigning the cache.
While we are increasing storage locality, why not copy Nvidia and provide a small set of registers or a register cache for hot register accesses, rather than going over greater distances to the main register file?
If not ripping off Nvidia, AMD could revive a form of the clause temporary registers and explicit slot forwarding it had in its VLIW GPUs to get the same effect.
Surprisingly many instructions could be offloaded to the scalar unit if the compiler were better and the scalar unit supported the full instruction set. This would be a good way for AMD to improve performance, reduce register pressure, and save power, but the strategy hinges on having a very good compiler.
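To make "offloadable" concrete: any value that is uniform across a wavefront (per-block bases, scales, loop bounds) is a candidate for the scalar unit. Here is a minimal CUDA-style sketch of the pattern (kernel and variable names are invented; the idea is that on GCN a compiler could keep the uniform parts in SGPRs and evaluate them once on the SALU instead of per lane):

```cuda
#include <cuda_runtime.h>

// 'base' and 'scale' depend only on block-level values, so they are uniform
// across a wavefront; only 'col' differs per lane.
__global__ void scale_rows(const float *in, float *out, int row_len) {
    int   base  = blockIdx.x * row_len;            // uniform: scalar-unit candidate
    float scale = 1.0f / (float)(blockIdx.x + 1);  // uniform: scalar-unit candidate

    int col = threadIdx.x;                         // divergent: one value per lane
    if (col < row_len)
        out[base + col] = in[base + col] * scale;
}

int main() {
    const int rows = 4, row_len = 256;
    float *in, *out;
    cudaMalloc(&in,  rows * row_len * sizeof(float));
    cudaMalloc(&out, rows * row_len * sizeof(float));
    scale_rows<<<rows, row_len>>>(in, out, row_len);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The catch is that the compiler has to prove uniformity before it can place anything on the scalar path, which is why compiler quality dominates here.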
There was a paper on promoting the scalar unit to support scalar variants of VALU instructions, which showed some benefits.
Tonga's GCN variant did promote the scalar memory pipeline to support writes, at least.
The CUs are rather loose domains, as exposed in GCN. The number of manual wait states for operations that cross between the different pipeline types means software is made aware that there are semi-independent pipelines that require hand-holding to behave correctly.
This has gotten worse with the introduction of flat addressing, which readily admits a race condition between the LDS and vector memory pipes, and in each of these cases the program is forced to set waitcnts of 0 at fracture points in the architecture. The scalar memory pipeline cannot even guarantee that it will return values in the order accesses were issued, which seems like a nice thing to nail down when the transistor budget doubles at 16nm.
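To make the flat-addressing hazard concrete: a flat (generic) pointer may resolve to LDS or to global memory only at runtime, so the compiler cannot know which pipeline will service an access and has to order the pipes conservatively. A hedged CUDA sketch of the pattern (names are invented; CUDA's generic pointers stand in for GCN's flat addresses here):

```cuda
#include <cuda_runtime.h>

// 'p' may point into shared memory (LDS) or into global memory depending on a
// runtime flag, so the store and the dependent load below may be serviced by
// different pipelines; the compiler must wait on both to guarantee the load
// observes the store.
__global__ void flat_alias_demo(int *global_buf, int use_shared) {
    __shared__ int lds_buf[64];
    int tid = threadIdx.x;

    int *p = use_shared ? &lds_buf[tid] : &global_buf[tid];

    *p = tid;               // store through the flat pointer
    int v = *p + 1;         // load must see the store, whichever pipe ran it
    global_buf[tid] = v;
}

int main() {
    int *buf;
    cudaMalloc(&buf, 64 * sizeof(int));
    flat_alias_demo<<<1, 64>>>(buf, 1);
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```

The conservative answer on GCN is the waitcnt-of-0 hand-holding described above, even when the two addresses never actually alias.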
Another question is whether GCN can evolve to express dependency information to the scheduling hardware. It currently has 10 hardware threads per SIMD waking up and being evaluated every cycle. Perhaps some of that work could be skipped if the hardware knew that certain wavefronts could steam ahead without waking up all the arbitration hardware.
This is another thing it could borrow from Nvidia, or again from the VLIW architectures GCN replaced.
The most important thing with 16-bit registers is the reduced register file (GPR) usage. If we also get double-speed execution for 16-bit types, then even better (but that is less relevant than saving GPRs).
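For a sense of what the GPR savings look like in source, here is a minimal CUDA sketch using packed 16-bit types (illustrative only; it assumes hardware with native fp16 support, and the same idea maps onto GCN-style packed math): two fp16 values share one 32-bit register, and one packed FMA retires two operations.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Packed-fp16 axpy: each __half2 holds two 16-bit values in a single 32-bit
// register, halving register-file footprint; on double-rate fp16 hardware one
// __hfma2 performs two multiply-adds.
__global__ void haxpy2(float a, const __half2 *x, __half2 *y, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        __half2 a2 = __float2half2_rn(a);   // broadcast the scalar to both halves
        y[i] = __hfma2(a2, x[i], y[i]);     // two fp16 FMAs per instruction
    }
}

int main() {
    const int n2 = 1 << 20;                 // number of packed fp16 pairs
    __half2 *x, *y;
    cudaMalloc(&x, n2 * sizeof(__half2));
    cudaMalloc(&y, n2 * sizeof(__half2));
    haxpy2<<<(n2 + 255) / 256, 256>>>(2.0f, x, y, n2);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Even if the math units only run this at single rate, the kernel's live values occupy half the registers they would as fp32, which is the part that helps occupancy.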
The most recent GCN variation also includes instructions for pulling 8-bit fields out of registers, and some use cases in signal analysis can take advantage of them.
16-bit fields are sufficient for much of machine learning.
That's three datum lengths for differing workloads, not including the 64-bit types that a number of HPC targets like. The way these data paths are split across separate domains on a GPU is not conducive to their flexible use, but these chips do have regions of hardware that handle and manipulate data at differing fractional precisions.
A few possible future questions for APUs are whether a granularity finer than 64 work items, and a page granularity below 64K, could make interoperability work better.
It's a scalar instruction set that operates on a VLIW unit.
The instruction word is one operation long in GCN; that seems like a SIMD implementation rather than VLIW.