Do we believe the "Super SIMD" / "VLIW2" patent is applicable to Navi? It doesn't feel like a huge departure from GCN to me, so it seems very plausible that it was at least considered for Navi (whether it made it into the final design is another question; patents don't always result in end products).
It would take an encoding change for the architecturally single-issue ISA, since the compiler needs to determine whether instructions can go to the primary or core slots instead of the hardware checking for dependences.
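As a rough sketch of what the compiler would have to verify statically, something like the hazard check below (a minimal sketch; the instruction fields and slot names are my own invention, not from the patent):

```python
# Minimal sketch of the static pairing check a compiler would need if the
# hardware no longer checks dependences at issue. Instruction fields and
# slot names are hypothetical, not from the patent or the ISA docs.

class Inst:
    def __init__(self, opcode, dst, srcs):
        self.opcode = opcode
        self.dst = dst      # destination VGPR
        self.srcs = srcs    # source VGPRs

def can_pair(primary, core):
    """True if `core` may be encoded to issue alongside `primary`."""
    # RAW: core reads what primary writes in the same cycle.
    if primary.dst in core.srcs:
        return False
    # WAW: both write the same register; result order would be ambiguous.
    if primary.dst == core.dst:
        return False
    # WAR is fine assuming reads happen before the paired writes commit.
    return True

a = Inst("v_fma_f32", dst=0, srcs=[1, 2, 3])
b = Inst("v_add_f32", dst=4, srcs=[0, 5])  # reads v0 -> RAW hazard
print(can_pair(a, b))  # False: must be encoded as two single-issue slots
```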
Perhaps that's a matter of yet another instruction format, which has precedent for GCN. Whether that can fit within the current instruction lengths, or whether it would require a new length, is unclear. (It could be company for Volta, which has gone to 128-bit instructions per the "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking" paper:
https://arxiv.org/pdf/1804.06826.pdf)
After that, it would seem as if the CU would treat the two lanes as separate instruction queues that are drained in the same fashion as before.
The data output cache on the other side might be a somewhat larger departure. It seems like the hardware does a bit more checking in order to detect a hit in the cache, or perhaps the decode and queuing process in the front end keeps a small table of cache slots and last-used architectural registers to override the source operands of subsequent instructions.
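If it's the latter, the table might look something like this at decode (purely illustrative; the slot count and replacement policy are guesses):

```python
# Illustrative model of a decode-stage table mapping recently written
# architectural registers to destination-cache slots. Slot count and
# replacement policy are guesses, not anything stated in the patent.

from collections import OrderedDict

class DestOperandCache:
    def __init__(self, slots=4):
        self.slots = slots
        self.live = OrderedDict()  # VGPR -> True, in writeback order

    def on_writeback(self, dst_reg):
        # The newest result claims a slot; evict the least recently written.
        self.live.pop(dst_reg, None)
        if len(self.live) >= self.slots:
            self.live.popitem(last=False)
        self.live[dst_reg] = True

    def redirect_sources(self, srcs):
        # At decode, sources that hit the table read the cache instead of
        # the register file, overriding the encoded operand location.
        return [("cache" if r in self.live else "rf") for r in srcs]

c = DestOperandCache()
c.on_writeback(7)
c.on_writeback(9)
print(c.redirect_sources([7, 3, 9]))  # ['cache', 'rf', 'cache']
```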
One thing I'm still confused about with GCN, and my Google-fu is failing me (asking console devs on Twitter might be the easiest route, but hopefully someone here knows as well): transcendental/special-function instructions are 1/4 rate on GCN, but do they stall the entire pipeline for 4 cycles, or can FMAs be issued in parallel for some of those cycles?
Everything I've found implies that they stall the pipeline for 4 cycles, which is pretty bad (speaking from experience on mobile workloads *sigh*; maybe not as bad on PC-level workloads) and compares poorly with NVIDIA. On Volta/Turing, SFU instructions can be co-issued essentially for free: there's spare decode and spare register bandwidth, and the warp is descheduled until the result is ready, so SFU ops don't stall the pipeline unless they're the overall bottleneck. Obviously they can't co-issue FP+INT+SFU, but FP+SFU and INT+SFU are fine.
From the ISA docs, I don't see any reference to wait states for transcendental operations. If such an instruction actually did require 4 vector cycles to fully output results for the whole wavefront, presumably co-issue would introduce the risk of a subsequent fast instruction sourcing the slow instruction's output register several cycles ahead of the writeback.
The architecture has various other places where it does not interlock, and the wait counts do not control for within-VALU dependences. The vestigial references to a VALUCNT in old docs may point to a time when the possibility was brought up but discarded. The more straightforward method, and the one consistent with the ISA, is that the architecture simply won't issue the next instruction until one of these longer-duration instructions has completed.
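A back-of-the-envelope issue model makes the cost of stalling concrete (the 4-cycle transcendental and 1 FMA/cycle are assumptions, and register dependences are ignored):

```python
# Toy issue model for an instruction stream, comparing a VALU that stalls
# for the full 4-cycle transcendental against one that keeps issuing FMAs
# underneath it. Rates are assumptions, not measured numbers.

def cycles(stream, coissue):
    t = 0
    sfu_busy_until = 0
    for op in stream:
        if op == "sfu":
            t = max(t, sfu_busy_until)    # one SFU op in flight at a time
            sfu_busy_until = t + 4
            t += 1 if coissue else 4      # issue-and-continue vs. stall
        else:  # "fma"
            t += 1
    return max(t, sfu_busy_until)

stream = ["sfu", "fma", "fma", "fma", "sfu", "fma", "fma", "fma"]
print(cycles(stream, coissue=False))  # 14 cycles: every sfu stalls for 4
print(cycles(stream, coissue=True))   # 8 cycles: the fmas hide sfu latency
```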
It feels to me like at this point, 1 NVIDIA "CUDA core" delivers quite a bit more "effective flops" than an AMD ALU. It's not just the SFU but also interpolation, cubemap instructions, etc. We can examine other parts of the architecture in as much detail as we want, but I suspect the lower effective ALU throughput is a significant part of the performance difference at this point... unlike the Kepler days, when NVIDIA was a lot less efficient per claimed flop than they are today.
It's been some time since Kepler, but my recollection is that the impression of AMD's architectures consuming more general-purpose FLOPs in mixed-use scenarios goes at least as far back as Tahiti, and possibly Cayman. (edit: VLIW5 had an AMD FLOP vs Nvidia FLOP debate as well.) The question was whether AMD's chip would have enough spare FLOPs to overcome the impact of the higher-cost special-function instructions.
Other than perhaps Fermi's hobbled start, the impression with the VLIW GPUs was that AMD FLOPs weren't as meaningful for graphics as Nvidia FLOPs, and that's mostly held true for GCN.
EDIT: Also this would allow a "64 CU" chip to have 2x as many flops/clock as today's Vega 10 without having to scale the rest of the architecture (for better or worse). It feels like 8192 ALUs with 256-bit GDDR6 and better memory compression could be a very impressive mainstream GPU.
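For the arithmetic (ALU counts as above; the clock is just an illustrative number, not a prediction):

```python
# Quick flops/clock arithmetic. ALU counts are from the post above; the
# clock speed is illustrative only.

fma_flops = 2                       # one FMA = multiply + add
vega10 = 64 * 64 * fma_flops        # 64 CUs x 64 ALUs = 8192 flops/clock
super_simd = 64 * 128 * fma_flops   # same 64 CUs, 2x ALUs = 16384 flops/clock
clock_ghz = 1.5
print(vega10 * clock_ghz / 1e3, "TFLOPS")      # ~12.3
print(super_simd * clock_ghz / 1e3, "TFLOPS")  # ~24.6
```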
The exemplar image in the patent doesn't draw enough operand-delivery paths from the register file, with just 4 reads overall for the ALUs and vector IO. If this is combined with the bandwidth from the destination operand cache, the ALU section sees a possible peak of 6 operands, sufficient for 2 FMAs. It doesn't seem unreasonable to consider this close enough to 2x peak, given that many CPUs have needed the bypass network to compensate for a register file with too few ports for all the ALUs, and Nvidia's operand reuse cache likewise compensates for cases where its vector register bandwidth cannot be fully used.
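Counting operands per cycle (the 4 register file reads and 2 cache operands are my reading of the figure, not stated numbers):

```python
# Operand supply vs. demand per cycle, as I read the patent figure.

rf_reads = 4              # read paths drawn from the register file
cache_operands = 2        # forwarded from the destination operand cache
supply = rf_reads + cache_operands

fma_srcs = 3              # a*b + c
demand = 2 * fma_srcs     # two FMAs per cycle

print(supply, demand, supply >= demand)  # 6 6 True: exactly enough for 2 FMAs
```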
One wrinkle, going from the patent, is that the operand cache interjects itself in the path of the forwarding network to the vector IO bus, potentially requiring some extra tracking or wait states, since the cache does not feed into the bus used by the ALU and IO sections. It feeds only a local ALU bus or a writeback to the register file, so an export or memory read dependent on an operand in the cache may force an immediate writeback or require additional checks of the mapped register list. There are already short wait states for some register hazards like this, though this cache may make for longer explicit delays without pipeline interlocking.
Everything's a trade-off and clearly AMD went strongly in the direction of doing more on the general-function FMA units compared to their VLIW4/VLIW5 architectures and compared to NVIDIA. It's not obvious to me whether that has actually paid off for them...
If I recall correctly, VLIW4 is where the T-unit was broken up and its special-function elements were distributed among the remaining four ALUs. An operation would cascade from one lane to the next over four cycles, with successive approximations or lookups occurring each time. GCN's lane orientation flipped things by 90 degrees, but it's possible that what it does for special instructions comes from that lineage. The quad-based arrangement and the 4-way crossbar available between the 4 ALUs in a quad for some instructions may fit with GCN acting like VLIW4 here. Otherwise, every lane would need the full complement of lookup tables and miscellaneous hardware, incurring an area cost without realizing the significantly higher throughput that a full transcendental unit in every lane could offer.
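For reciprocal, the cascade could look something like a coarse lookup followed by one refinement per step (an illustrative Newton-Raphson sketch; the real hardware's tables and iteration counts aren't public):

```python
# Illustrative 4-step reciprocal: a seed approximation followed by three
# Newton-Raphson refinements, one step per lane/cycle as in the cascade
# described above. The seed and iteration split are made up for the example.

def rcp_cascade(x):
    # Step 0 (lane 0): seed standing in for a small lookup table; this is
    # the classic linear estimate for 1/x with x normalized to [0.5, 1.0).
    y = 48.0 / 17.0 - (32.0 / 17.0) * x
    # Steps 1-3 (lanes 1-3): Newton-Raphson, y' = y * (2 - x*y), roughly
    # doubling the number of correct bits each step.
    for _ in range(3):
        y = y * (2.0 - x * y)
    return y

x = 0.73
print(rcp_cascade(x), 1.0 / x)  # both ~1.369863 after three refinements
```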
Like the patent's side ALUs, the VLIW5 T-unit didn't have a corresponding set of hookups into the operand network, requiring unused operand cycles or operands shared with neighboring ALU slots. Unlike the T-unit, the side ALUs lack a multiplier and so cannot perform complex operations on their own; they're less generalist than the units that preceded them. Instead, it seems the patent has two core ALUs with FMA capability, and in the more complex scenarios a side ALU pairs with one of them to perform instructions that require a full ALU.
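Slot assignment under that reading might look like the following (the opcode groupings and the pairing rule are guesses from the patent text, not documented behavior):

```python
# Guessed slot-assignment rule: side ALUs have no multiplier, so anything
# with a multiply needs a core slot, and the most complex ops gang a core
# ALU with its neighboring side ALU. Opcode groupings are hypothetical.

SIDE_OK   = {"v_add_f32", "v_min_f32", "v_max_f32", "v_mov_b32"}  # no multiply
CORE_ONLY = {"v_mul_f32", "v_fma_f32", "v_mac_f32"}               # needs multiplier
PAIRED    = {"v_fma_f64"}  # hypothetical: core + side ALU working together

def slots_needed(opcode):
    if opcode in PAIRED:
        return ("core", "side")  # occupies both units in the pair
    if opcode in CORE_ONLY:
        return ("core",)
    if opcode in SIDE_OK:
        return ("side",)         # leaves the core ALU free for co-issue
    return ("core",)             # conservative default

for op in ("v_add_f32", "v_fma_f32", "v_fma_f64"):
    print(op, slots_needed(op))
```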