PERFORMANCE AND ENERGY EFFICIENT COMPUTE UNIT
So this seems kinda crazy: vary the voltage per lane in a SIMD so that slight variations in completion time between individual lanes are ironed out (effectively clocking the lanes of the SIMD at slightly varying frequencies!).
Also seems to be a way to run a SIMD more efficiently.
The patent mentions being derived from a Department of Energy contract, possibly one of the exascale projects.
This may actually mesh with AMD's most recent HPC GPU chiplet proposal, where asynchronous techniques are applied to the ALUs and crossbars of the SIMD units. It might target a Navi or post-Navi architecture, however.
For reference:
http://www.computermachines.org/joe/publications/pdfs/hpca2017_exascale_apu.pdf
As described, the lanes do still interface with a clock at some point, but the pipeline may have become increasingly decoupled in terms of instruction and operand transfer to the lanes.
This potentially creates physically dynamic inter-wavefront issue behavior, which already has more explicit forms in things like the variable DP rate.
The patent is being cagey on how exactly the lanes wind up running very different instructions. It's more readily true with the current SIMD arrangement if the comparison is between a fast lane in SIMD 0 (an off lane, or an 8-bit add) and a slower op in SIMD 1 (an FMA with some mix of bits and exponents requiring the maximum switching activity and shifting). Potentially, predication or specific corner cases may short-circuit evaluation, or lanes could be set to forward data unmodified or flushed to a fixed value. The SIMDs would still need to care about each other's delays, since the CU's clock applies the longest delay to all.
This, coupled with the claimed use of near-threshold computing, would be a significant source of timing variation.
It may be as if the SIMD in this mode operates in a constant state of Vdroop compensation with dynamic clocking, except that where AMD's existing method extends clock cycles dynamically at a more global level, this one also adjusts in the positive direction, bumping clock and voltage if one lane shows that it is experiencing cumulatively more delay than the others. The patent mentions the possibility of linking voltage to clock, which sounds like an extension of that compensation method.
How the excessive delay would be measured (or predicted, in the other variation) would be interesting to see. This would seemingly allow voltage and timing slack to be exploited, rather than sizing a more rigid voltage level and circuit cadence closer to the worst case. Some scenarios may also be explicitly targeted, such as cross-lane ops that broadcast one lane's value on the next clock, where that one lane is more timing-critical and may incur a longer delay setting up the broadcast in a near-threshold environment.
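As a rough illustration of what that compensation loop might look like (entirely my own sketch; the nominal period, slip budget, and delay-vs-voltage relation are made-up numbers, not the patent's), each lane accumulates slip past a nominal period, and a controller bumps the voltage of any lane that chronically falls behind:

```python
# Toy model of per-lane slack compensation (my sketch, not the patent's
# actual mechanism). Each lane's op delay rises as its voltage falls; the
# controller bumps the voltage of any lane whose cumulative slip past the
# nominal period exceeds a budget.

NOMINAL_PERIOD_NS = 1.0
SLIP_BUDGET_NS = 0.2   # hypothetical: how far behind a lane may drift
V_STEP = 0.01          # hypothetical voltage bump per adjustment

def op_delay_ns(base_delay_ns, voltage):
    # Crude 1/V relation; near threshold the real curve is far steeper.
    return base_delay_ns / voltage

def run_cycle(voltages, slips, base_delays):
    for lane, base in enumerate(base_delays):
        delay = op_delay_ns(base, voltages[lane])
        slips[lane] += max(0.0, delay - NOMINAL_PERIOD_NS)
        if slips[lane] > SLIP_BUDGET_NS:
            voltages[lane] += V_STEP   # boost the chronically slow lane
            slips[lane] = 0.0          # restart its slack accounting

# Lane 2 draws a heavier op mix and gradually earns a voltage bump.
voltages = [0.60, 0.60, 0.60, 0.60]
slips = [0.0] * 4
for _ in range(100):
    run_cycle(voltages, slips, base_delays=[0.55, 0.55, 0.62, 0.55])
print(voltages)
```

The interesting property is that the boost converges: once the slow lane's delay fits inside the nominal period, its slip stops accumulating and the voltage settles.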
The patent's reference to a "few nanoseconds" of delay was initially confusing to me.
I think this may be cumulative over some number of execution cycles. One specific wave may only take a fraction of a nanosecond longer per op, but it can accumulate. As a physical/electrical phenomenon, there may be a level of correlation within a lane--especially if it is riding the edge in terms of voltage and timing. If a lane's switching activity versus its current power delivery results in voltage droop, its next cycle may start at a slightly worse voltage level in addition to its delayed start, causing the next operation to insert even more delay, and so on.
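A toy model of that compounding (again my own illustration, with invented numbers) shows how fraction-of-a-nanosecond slips per op can add up to something on the order of the patent's "few nanoseconds" within a short burst:

```python
# Toy compounding-droop model (my illustration, not the patent's numbers).
# Heavy switching droops the lane's local voltage a little each op; the next
# op starts at the worse voltage, so its delay grows and the slip accumulates.

V_NOMINAL = 0.60         # hypothetical near-threshold operating point (volts)
DROOP_PER_OP = 0.005     # hypothetical droop from heavy switching activity
RECOVERY_PER_OP = 0.003  # power delivery claws some voltage back each op

v = V_NOMINAL
total_slip_ns = 0.0
for op in range(50):
    delay_ns = 0.5 / v                            # crude delay-vs-voltage relation
    total_slip_ns += delay_ns - (0.5 / V_NOMINAL) # slip versus the nominal delay
    v = min(V_NOMINAL, v - DROOP_PER_OP + RECOVERY_PER_OP)

print(f"cumulative slip after 50 ops: {total_slip_ns:.3f} ns")  # ~3 ns
```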
I took this to mean that different paths through a multiplier, for example, could result in fractionally different completion times - but I've not come across such a multiplier.
In modern x86 CPUs, division latency depends on the actual values of the operands, although that may be too complex an operation for this scenario.
The example does have a CU with multiple SIMDs, but talks about relative delay between lanes--possibly not in the same SIMD. That scenario makes differing execution times more plausible, particularly if one SIMD is handling low-precision math and the other heavy DP arithmetic.
They'd still indirectly interact, since the rest of the CU may be synchronous and its arbitration cycles would otherwise be bound by the worst-case time of one of them.
If running things more asynchronously, and if running at a low voltage, there may be delays even within a SIMD.
If AMD introduces a dynamically variable-precision ALU that gates off more sections based on how many bits it really needs, the timing could get stranger.
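For instance, a crude delay model for a precision-gated adder (my illustration, not AMD's design) would have delay track the widest live operand, since gated-off upper sections contribute no carry-propagation time:

```python
# Toy delay model for a precision-gated adder (my illustration, not AMD's
# design): delay tracks the widest live operand, since gated-off upper
# sections add no carry-propagation time.

def live_bits(x):
    return max(1, int(x).bit_length())

def gated_add_delay_ps(a, b, ps_per_bit=5.0):
    width = max(live_bits(a), live_bits(b)) + 1   # +1 for the carry out
    return width * ps_per_bit                     # crude linear carry chain

print(gated_add_delay_ps(7, 5))            # narrow operands: short chain
print(gated_add_delay_ps(2**30, 2**30))    # wide operands: long chain
```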
The prior FIFO would be an operand collector (as seen in NVidia's designs).
It could be a straightforward FIFO of command signals, with reads from the register file using the same logic as before. AMD's HPC proposal did not say that the register file would be asynchronous, and it explicitly stated the SRAM would not be running at near-threshold voltage. Buffering multiple operations ahead of the pipeline, giving them a rough time budget, and buffering any writeback could let the GPU complete work at a lower voltage and without as much wall-clock time wasted in a long cycle--usually.
The idea of having a FIFO at the top and a FIFO at the bottom may allow for monitoring of delay and readiness of the asynchronous (isochronous?) lanes, and may also serve to coalesce feedback into the synchronous and higher-voltage regions of the CU. The discrepancy between the upper and lower FIFOs may also give an idea of which lanes need a boost, or if the CU needs to insert an actual stall.
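A sketch of how that discrepancy check might work (my reading of the idea, not the patent's actual logic; the thresholds are hypothetical): the gap between how many ops a lane has consumed from the upper FIFO and how many results it has delivered to the lower one marks it as a candidate for a boost or, past a larger gap, forces a stall:

```python
# Sketch of using the gap between a lane's input and output FIFOs to spot
# lagging lanes (my reading of the idea, not the patent's actual logic).

class Lane:
    def __init__(self):
        self.issued = 0    # ops popped from the upper (input) FIFO
        self.retired = 0   # results pushed to the lower (output) FIFO

    def in_flight(self):
        return self.issued - self.retired

def classify(lanes, boost_at=3, stall_at=6):   # thresholds are hypothetical
    actions = {}
    for i, lane in enumerate(lanes):
        gap = lane.in_flight()
        if gap >= stall_at:
            actions[i] = "stall"   # CU has to insert an actual stall
        elif gap >= boost_at:
            actions[i] = "boost"   # candidate for a voltage/clock bump
    return actions

lanes = [Lane() for _ in range(4)]
lanes[0].issued, lanes[0].retired = 8, 7   # keeping up
lanes[3].issued, lanes[3].retired = 8, 4   # falling behind
print(classify(lanes))   # {3: 'boost'}
```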
The complexity of the behavior makes me wonder about some of the wait states in current GCN. Parts of the CU are going to be more aware of delays than before, and perhaps some of these paths would now have interlocks. An alternate possibility is power-aware instruction scheduling, with additional wait states exposed so the compiler can actively target filling the FIFO with dependent instructions, possibly allowing results to forward within the lanes and letting the output FIFO elide some writes to SRAM if the same destination shows up (see the sketch below). However, to get the most out of this, GCN's rather switch-happy threading may need to be curtailed in order to give a single wavefront more chances to issue into the FIFO without interruption. Perhaps the VALU count value may find a use here?
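The write-elision part could look something like this (my own sketch): when a newer result in the output FIFO targets the same destination register as an older, not-yet-drained entry, the older write is dead and can be dropped before it ever touches the SRAM:

```python
# Sketch of eliding dead writes in the output FIFO (my illustration): if a
# newer entry targets the same destination register as an older, undrained
# entry, the older write never needs to reach the register-file SRAM.

def push_writeback(fifo, dest_reg, value):
    # fifo is a list of (dest_reg, value) pairs awaiting an SRAM write.
    fifo[:] = [(d, v) for (d, v) in fifo if d != dest_reg]  # drop dead writes
    fifo.append((dest_reg, value))

fifo = []
push_writeback(fifo, "v0", 1)
push_writeback(fifo, "v1", 2)
push_writeback(fifo, "v0", 3)   # elides the earlier v0 write
print(fifo)                     # [('v1', 2), ('v0', 3)]
```

Note this is only safe if dependent reads forward from the FIFO or within the lane rather than from the SRAM, which lines up with the in-lane forwarding mentioned above.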
The current GCN 4-cycle cadence results in a simple and efficient 16-lane split register-file design: no bank conflicts at all, and registers can sit very close to the execution units.
While the exact width may not be known, one way to split the difference, as AMD has indicated for its exascale proposal, is to keep the SRAM at a higher voltage level and possibly on a synchronous clock. The patent may mean things are less tightly linked: the ALU lanes would buffer work and writeback in various FIFOs, so that actual interactions with the register file would only occur after all the variable timing has been resolved.
That may mean some kind of internal operand caching, which might let the lanes go further before having to sync with the outside world.
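Something like a tiny per-lane operand cache (my illustration; nothing this specific is in the patent) could serve that role, letting dependent ops skip a register-file read and keep running ahead of the synchronous SRAM:

```python
# Sketch of a small per-lane operand cache (my illustration, not from the
# patent): recent results live beside the lane so dependent ops can skip a
# trip to the register file.

from collections import OrderedDict

class OperandCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()   # reg name -> value, oldest first

    def lookup(self, reg):
        return self.entries.get(reg)   # None means an SRAM read is needed

    def fill(self, reg, value):
        if reg in self.entries:
            self.entries.move_to_end(reg)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the oldest entry
        self.entries[reg] = value

cache = OperandCache()
cache.fill("v4", 42)
print(cache.lookup("v4"))   # hit: no trip to the register file
print(cache.lookup("v9"))   # miss: None, read the SRAM
```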