The paper makes reference to the scalar being able to execute its own instruction stream. What if, however, that second stream weren't a scalar, but in fact a hybrid scalar/vector, akin to ALU and FPU instructions in the same stream?
That would reduce the second thread to what GCN already does with scalar and vector instructions in the same stream.
Perhaps more distinct from GCN is the helper-thread model, where there is a semi-independent thread that is still kept in sync with the main one.
Nothing spelled out at the GCN ISA level provides for getting a specific thread co-resident with another, though it might be something that could be hacked in software or handled at a higher level via the various graphics pipeline stages.
Some of the descriptions for Polaris' instruction prefetch indicate that it happens when wavefronts are co-resident in a given CU or CU group, so there might be hints of that kind of context information filtering down.
The scalars already break the cadence with scalar code.
In the proposed scheme, or in GCN? With GCN currently, they do not.
AMD's current scalar unit unfortunately only has an integer instruction set, so ALU offloading is basically limited to address calculation (of coherent memory reads) and branch/jump-related code (jumps are always wave-coherent, as there's only one program counter per wave). AMD added scalar memory stores in GCN3. I was hoping for a full float instruction set. The scalar unit is tiny compared to the SIMDs, so adding float support wouldn't cost much, but it would allow much better scalar offloading. This would mean increased performance and reduced power usage. I don't know why they haven't done this already. If they do any scalar-unit-related improvements, I would expect full float support to be high on the list.
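The offloading opportunity can be sketched with a toy model (Python stand-in, not GCN ISA; the names `vector_calc` and `scalar_calc` are illustrative): a wave-coherent address is identical in every lane of the wavefront, so one scalar op can stand in for 64 redundant vector ops.

```python
# Toy model of scalar offloading, not actual hardware or ISA.
WAVE_SIZE = 64  # GCN wavefront width

def vector_calc(descriptor_base, offset):
    # Without offloading: all 64 lanes compute the same address,
    # burning a full vector-ALU issue on wave-uniform work.
    return [descriptor_base + offset for _ in range(WAVE_SIZE)]

def scalar_calc(descriptor_base, offset):
    # With offloading: the address of a coherent read is identical in
    # every lane, so the scalar unit computes it once. Integer-only
    # today; float support would widen what qualifies for this path.
    return descriptor_base + offset

# One scalar result substitutes for 64 identical vector results.
assert vector_calc(0x1000, 16) == [scalar_calc(0x1000, 16)] * WAVE_SIZE
```

The same argument applies to branch code: with one program counter per wave, the branch decision is wave-uniform by construction, which is why it already lives on the scalar side.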
One possibility I have been pondering lately is whether handling FP in the scalar domain would have added too much complexity at an acceptable level of support, and not enough upside if kept simple enough to implement.
This goes to my earlier musing about how different the command stream appears at the instruction buffer, or in the sequencing after it. The CU seems to be structured into multiple independently sequenced pipelines with varying behaviors, with internal command words that either carry over multiple cycles or are possibly issued as separate internal ops per cycle.
The instruction buffers have some amount of sequencing logic, while the decode path after them would have domain-specific logic and sequencing.
The way the CU distributes its resources, the FP path has some unknown amount of control hardware and control logic per SIMD, which means the designers saw fit to give it the resources for 4 paths/buffers/sequencers while allowing individual instructions to spread that overhead over 4 cycles.
The scalar unit is shared between all SIMDs, so its sequencing and control pipeline has 4x the demand and needs to support different threads/commands every cycle. This could mean the cost of expanding what the scalar unit can do is higher than it seems, given the larger representation for each cycle's command, the fact that it is shared and active every cycle, and the scalar domain's interaction with centrally important contexts while providing operands to distant parts of the CU.
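The asymmetry here can be made concrete with a toy issue model (my own sketch, not AMD documentation): the four SIMDs take turns issuing, so each SIMD's own sequencing logic only has to produce a new command every fourth cycle, while the single shared scalar unit can be handed work from a different wave on every cycle.

```python
# Toy round-robin issue model of a GCN-style CU. Assumptions (mine):
# one vector issue slot rotates among 4 SIMDs, and the shared scalar
# unit may receive a command on every cycle.
NUM_SIMDS = 4

def simulate(cycles):
    simd_issues = [0] * NUM_SIMDS   # vector issues per SIMD
    scalar_issues = 0               # commands seen by the shared scalar unit
    for cycle in range(cycles):
        owner = cycle % NUM_SIMDS   # round-robin issue slot
        simd_issues[owner] += 1     # that SIMD starts one vector op
        scalar_issues += 1          # scalar unit can get work every cycle
    return simd_issues, scalar_issues

simds, scalar = simulate(16)
# Each SIMD issued 4 times over 16 cycles; the scalar control path ran
# 16 times, i.e. 4x the per-cycle demand of any single SIMD sequencer.
```

That 4x duty cycle is one way to read why beefing up the scalar unit costs more than its small area suggests.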
Perhaps the way the complexity is distributed between the various sections is due for a rebalancing, although it's not clear if the patent represents how AMD will actually do it.
GCN vector<->scalar register moves always need waitcnt afterwards, as the scalar and vector units are not running in lockstep. Waitcnt blocks execution based on a counter value.
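The counter mechanism can be sketched as a toy model (my simplification: a single lgkmcnt-style counter with in-order completion; real GCN tracks several counters with their own completion rules): issuing an operation increments an outstanding-op counter, completions decrement it, and an `s_waitcnt N` stalls the wave until the counter is at or below N.

```python
# Simplified model of GCN's s_waitcnt mechanism; not cycle-accurate.
from collections import deque

class WaitCntModel:
    def __init__(self):
        self.lgkmcnt = 0        # outstanding-op counter
        self.pending = deque()  # in-flight ops, completed in order

    def issue(self, op):
        # Each issued op (e.g. the transfer behind a v<->s move)
        # counts as outstanding until it completes.
        self.pending.append(op)
        self.lgkmcnt += 1

    def complete_one(self):
        self.lgkmcnt -= 1
        return self.pending.popleft()

    def s_waitcnt(self, n):
        # Stalls (here: drains) until at most n ops remain outstanding.
        while self.lgkmcnt > n:
            self.complete_one()

m = WaitCntModel()
m.issue("v->s transfer")
m.issue("another transfer")
m.s_waitcnt(0)  # wait for all outstanding ops before consuming the data
assert m.lgkmcnt == 0
```

Since the scalar and vector pipes aren't in lockstep, the counter is what lets the consumer know the producer's result has actually landed.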
Am I correct in assuming this is using a wait count for a memory write at one pipe and then another wait count on the memory pipe reading it back in? I'm not sure what other wait counts would help with that scenario.
You could, for example, have one scalar unit per SIMD instead of one per CU. I would be positively surprised if Vega had changes like these. The scalar unit is a unique strength of AMD's architecture, and they haven't exploited it fully yet.
The patent puts forward the possibility of high-performance scalar units that can take on wavefronts that are mostly predicated off, although it doesn't quite state that those are the same as the scalar unit. One possible implementation cited had 4 such units, which might keep the full complexity of supporting vector ops safely on the SIMD portions rather than in the scalar pipe.
There are a number of assumptions built into the model of GCN as we know it that might not hold with the patent.
There are various interpretations that could upset expectations on the instructions issued per cycle, operations per instruction, SIMD length, wavefront size, the relationship between SIMD count and issue units, forwarding latency, issue latency, the number of cycles in the cadence, whether there is necessarily a fixed cadence, and the number of SIMDs per CU.
Some of the complexities brought up by this might be a reason to be uncertain how much of it is going to happen with Vega, or whether it will happen at all.
It'd be much more interesting if it did, particularly if some of the other patents and rendering schemes were synthesized with it.
Something like a visibility buffer running on an architecture with a hybrid tiled/deferred front end with enhanced culling, wavefront compaction, and variably sized wavefronts.
However, why couldn't it be Polaris+tweaks+bugfixes on TSMC 16nm FF+ with HBM?