Jawed
> How would the register files be organized in these dynamic sized wavefront systems?

The register file is really 256x 2048-bit registers and wouldn't need to change. That's why I'm so impressed by this idea, because all the shenanigans involved in traditional dynamic wavefront formation relating to context manipulation and fragmentation are entirely obviated.
As I understand it, the current design uses three cycles to read operands (64 operands from address A, then address B and finally address C in FMA A * B + C) and a fourth to write resultants.
The 64 operand slots need to be "swizzled in time" in order to feed them into the SIMD: A, B and C for lanes 0-15, then A, B and C for lanes 16-31, etc. So GCN already has an "operand collector" to support this (not exactly a FIFO, but the lanes read from it as though it were a per-lane FIFO per operand).
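To make the time-swizzle concrete, here's a rough C++ mock-up of the idea (names like OperandCollector and the 16-lanes-per-cycle grouping are my own framing of the above, not anything from AMD documentation):

```cpp
// Sketch of the "swizzled in time" operand delivery described above.
#include <array>
#include <cstdint>
#include <cstdio>

constexpr int kLanes = 64;      // one wavefront = 64 work items
constexpr int kSimdWidth = 16;  // physical SIMD handles 16 lanes per cycle
using VReg = std::array<uint32_t, kLanes>;  // one 2048-bit vector register

struct OperandCollector {
    VReg a, b, c;  // each filled by one full-width register read (one cycle each)

    // Cycle t (0..3) hands the SIMD the A/B/C operands for lanes 16*t .. 16*t+15.
    void issue_cycle(int t, uint32_t (&out_a)[kSimdWidth],
                            uint32_t (&out_b)[kSimdWidth],
                            uint32_t (&out_c)[kSimdWidth]) const {
        for (int i = 0; i < kSimdWidth; ++i) {
            int lane = t * kSimdWidth + i;
            out_a[i] = a[lane];
            out_b[i] = b[lane];
            out_c[i] = c[lane];
        }
    }
};

int main() {
    OperandCollector oc{};
    for (int lane = 0; lane < kLanes; ++lane) {
        oc.a[lane] = lane; oc.b[lane] = 2; oc.c[lane] = 1;  // dummy FMA inputs
    }
    // Four issue cycles: lanes 0-15, 16-31, 32-47, 48-63.
    for (int t = 0; t < 4; ++t) {
        uint32_t a[kSimdWidth], b[kSimdWidth], c[kSimdWidth];
        oc.issue_cycle(t, a, b, c);
        printf("cycle %d: first lane fma = %u\n", t, a[0] * b[0] + c[0]);
    }
}
```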
That per-lane "FIFO" read, modulated by the execution mask, is what chooses which operands are delivered in the temporal predication scheme. If, after the four cycles of register operations (three reads, one write), the "FIFO" is ready to deliver operands to the instruction that's about to start but a given lane only holds two operands, then that lane runs at half speed.
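A sketch of how that per-lane rate selection might look, assuming (my assumption, for illustration) that physical lane i owns work items i, i+16, i+32 and i+48 across the four passes:

```cpp
// Per-lane rate selection from the execution mask (illustrative mapping only).
#include <cstdint>
#include <cstdio>

constexpr int kPhysLanes = 16;

// Count of active work items owned by physical lane `lane` under `exec_mask`.
int active_items(uint64_t exec_mask, int lane) {
    int n = 0;
    for (int pass = 0; pass < 4; ++pass)
        n += (exec_mask >> (lane + 16 * pass)) & 1;
    return n;
}

int main() {
    uint64_t exec_mask = 0x00000000FFFF00FFULL;  // example: work items 0-7 and 16-31 active

    for (int lane = 0; lane < kPhysLanes; ++lane) {
        int n = active_items(exec_mask, lane);
        // n == 4 -> full rate, n == 2 -> half rate ("runs at half speed"),
        // n == 0 -> the lane could be gated off entirely.
        printf("lane %2d: %d/4 operands -> %s\n", lane, n,
               n == 0 ? "gated" : n == 4 ? "full rate" : "reduced rate");
    }
}
```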
In GCN as it currently exists there has to be some kind of resultant collector, so that a single write cycle can send all 64 resultants to the register file. The resultants arrive in this collector over four cycles, but need to be written coherently as a single operation in a single cycle to a single address (with masking for resultants that should not be written - GCN already has masking for resultant writes, completely distinct from predication).
Similarly, in the temporal predication scheme, the resultant "FIFO" (it's just a buffer, but the document talks about a FIFO), modulated by the execution mask, enables delivery of resultants to the correct subset of registers. This would be a minor change to the existing design.
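Roughly, the resultant side could be mocked up like this (again my own illustrative structure, not AMD's):

```cpp
// Resultant collector with masked write-back: 64 resultants accumulate over
// four cycles, then land in one register row in a single operation, with the
// write mask (distinct from the execution mask) suppressing unwanted writes.
#include <array>
#include <cstdint>

constexpr int kLanes = 64;
using VReg = std::array<uint32_t, kLanes>;

struct ResultCollector {
    VReg buffer{};  // filled 16 resultants at a time over four cycles

    void collect(int cycle, const uint32_t (&results)[16]) {
        for (int i = 0; i < 16; ++i) buffer[cycle * 16 + i] = results[i];
    }

    // One write cycle: the whole row goes to the register file at one address,
    // masking out elements that must not be updated.
    void write_back(VReg& dest, uint64_t write_mask) const {
        for (int lane = 0; lane < kLanes; ++lane)
            if ((write_mask >> lane) & 1) dest[lane] = buffer[lane];
    }
};

int main() {
    ResultCollector rc;
    VReg v0{};  // destination register row
    uint32_t partial[16];
    for (int cycle = 0; cycle < 4; ++cycle) {
        for (int i = 0; i < 16; ++i) partial[i] = cycle * 16 + i;  // dummy resultants
        rc.collect(cycle, partial);
    }
    rc.write_back(v0, 0x0000FFFFFFFF0000ULL);  // only lanes 16..47 get updated
}
```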
Couple these two collectors with a forwarding network, so that resultants can be consumed by the immediately successive instruction, and you have a drop-in replacement for the existing pipeline, which already uses two collectors and a forwarding network.
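The forwarding part, sketched in the same hypothetical style: if the next instruction sources the register the previous one just produced, the operand can be taken from the resultant buffer instead of re-reading the register file.

```cpp
// Forwarding sketch: bypass the register file read when the source register
// matches the destination the previous instruction just wrote.
#include <array>
#include <cstdint>

constexpr int kLanes = 64;
using VReg = std::array<uint32_t, kLanes>;

struct Forwarding {
    int last_dest = -1;   // register index the previous instruction wrote
    VReg last_result{};   // its resultants, still sitting in the collector

    VReg fetch(int src, const std::array<VReg, 256>& register_file) const {
        return (src == last_dest) ? last_result : register_file[src];
    }
};

int main() {
    static std::array<VReg, 256> register_file{};  // 256x 2048-bit registers, as above
    Forwarding fwd;
    fwd.last_dest = 3;
    fwd.last_result.fill(42);
    VReg operand = fwd.fetch(3, register_file);  // comes from the collector, not the file
    return operand[0] == 42 ? 0 : 1;
}
```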
> It is easy to see the potential improvements of utilization (in branchy code).

It's not merely branchy code: the scheme also helps wavefronts where not all lanes have meaningful data, e.g. quads with fewer than four active fragments.
> However (if I understood properly) the register files consume more power than ALUs.

I don't understand that conclusion. The register file is pretty dumb. One could argue that fetching a 2048-bit register when only, say, four disparate 32-bit operands are required is wasteful, but that's a different class of problem.
Operand and resultant collection is already a separate workload in GCN. There's more work in temporal predication, because the execution mask (or its proxy: the count of actual operands) has to modulate lane clocking and resultant routing.
In the end all that's actually sought is a net gain in power efficiency.
The "null" scenario, where the execution mask is "-1", would lead to the regulators setting up all lanes to "full speed" and the resultant write operation working without any "shuffle". That's pretty close to "no cost". But there is still temporal predication functionality (transistors) sat there "idle", consuming power, etc.Would this new design consume significantly more power in simple (non branchy) code?
The voltage regulation would appear to add power consumption regardless of whether predication is active. 4096 VALU lanes, each with private voltage regulation, looks like about an order of magnitude more density in voltage regulation than seen anywhere else in computing (an uneducated guess, to be fair). On the other hand, contemporary super-efficient designs appear to feature high-density voltage regulation merely to work.
A coarse-grained voltage regulation architecture is going to suffer power losses simply because it's coarse-grained: it's harder to distribute regulated voltages than unregulated ones (impedance and latency are enemies over wide areas), it's slower to react, and it leaves more of the chip supplied with the wrong voltage (or running at the wrong clock).
In general you spend transistors and increase architectural complexity because it results in more compute efficiency. See DCC as an example.
I can't see how we can quantify the pros and cons here. The proof will be in whether a GPU using this scheme is actually built.