The key word is "wire". When trying to drive signals on a wire, its resistance and capacitance determine how quickly the transition can be made and how much energy is needed to meet the timing target.

That's an important distinction. My intuition is that it takes a lot of energy not so much because the power requirements are high, but because it takes time to access the registers. This would explain how having a closer register cache might help.
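A rough sketch of the energy side of that argument: the driver has to charge the wire's capacitance on every transition, so switching energy scales linearly with wire length at a given voltage. The constants below are assumed, illustrative values, not numbers from any real process node.

```python
# Why driving a long wire costs energy: each transition charges the
# wire's capacitance, E = 0.5 * C * V^2. Constants are assumed.

C_PER_MM = 0.2e-12  # wire capacitance per mm (farads), illustrative
VDD = 1.0           # supply voltage (volts), illustrative

def switch_energy(length_mm):
    """Energy (joules) to charge a wire of this length through one transition."""
    return 0.5 * (C_PER_MM * length_mm) * VDD ** 2

# A ~1 mm trip to the main register file vs a ~0.1 mm trip to a near cache:
far = switch_energy(1.0)    # ~0.1 pJ per transition
near = switch_energy(0.1)   # ~0.01 pJ, 10x cheaper just from distance
```

The point is only that the ratio tracks distance: cut the wire run by 10x and the switching energy on that wire drops by 10x, before counting any savings in the SRAM array itself.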
Wires at this level of geometry have had very poor scaling for several nodes now.
A wire's delay, or the effort needed to make it meet a given delay, goes up quadratically with its length. That's why long wire runs are broken up with repeaters. These small transistor blocks break a wire of length N into two sections of length N/2. Each repeater adds a slight delay and some power consumption, but the wire's own delay and energy cost remain far more dominant until you get to very small dimensions.
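The arithmetic behind that: with a distributed-RC (Elmore-style) model, delay goes as the square of length, so two N/2 sections cost roughly half of one length-N run, plus the repeater's own delay. A minimal sketch, with assumed, illustrative constants rather than real process numbers:

```python
# Why repeaters help, under a simple distributed-RC delay model.
# All constants are assumed, illustrative values.

R_PER_MM = 1000.0   # wire resistance per mm (ohms), illustrative
C_PER_MM = 0.2e-12  # wire capacitance per mm (farads), illustrative
T_REP = 20e-12      # delay added by one repeater (seconds), illustrative

def wire_delay(length_mm):
    """Distributed RC delay: 0.5 * R_total * C_total, quadratic in length."""
    return 0.5 * (R_PER_MM * length_mm) * (C_PER_MM * length_mm)

def repeated_delay(length_mm, segments):
    """Split the wire into equal segments; each segment's delay is quadratic
    in its shorter length, plus one repeater at each boundary."""
    seg = length_mm / segments
    return segments * wire_delay(seg) + (segments - 1) * T_REP

L = 2.0                        # a 2 mm run
one_wire = wire_delay(L)       # ~400 ps with these constants
two_halves = repeated_delay(L, 2)  # ~220 ps: half the wire delay + one repeater
```

With these numbers, one repeater turns 400 ps of wire delay into roughly 220 ps; the quadratic term is what makes the trade worthwhile despite the repeater's own cost.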
The closer RFC cuts wire length down, and its smaller size may also mean shorter word lines and perhaps a different design for the sense amps on the SRAM banks.
The difference shows that a cache can be heavily multiported and still save power, so long as it is small and close by.
The FMA operation itself is energetically cheaper despite the larger number of transistors and wires because the distances are so tiny.
The register files for GPUs are sized like caches, so this idea is coming up faster for them.

I haven't read the paper yet, just the abstract and a few posts here, but doesn't this evolution strike anyone as absurd?
I mean, the registers were supposed to hide latency, and now they themselves are burning so much power that an L1 cache for the RF is being proposed. What's next? Three levels of cache for the RF?
The need to save power is what is driving this, because power scaling is lagging so much.
To the former, the L1 caches of active CPUs glow hot on thermal images. It's active SRAM, and the wire distances in question are as long as or longer than the millimeter discussed in the paper. This is a general problem for all designs.

LRB1 might have failed, but it had a good idea in allocating registers out of general-purpose cache and providing instructions to modify the cache replacement policy.
To the latter, I have not seen confirmation on the ability to change the replacement policy, just some ambiguous wording.
The Larrabee software rasterizer does create a second level of scheduling, essentially a software version of the batch/warp scheduling GPUs use when faced with texture latency. The distances involved are not necessarily better, depending on how far the data migrates.

Also, a combination of hw multithreading and sw multithreading had some good ideas behind it, just like a two-level hw scheduler.
As the paper noted, multilevel scheduling has been considered for CPU designs.