GCN (like the VLIW chips) has scratch registers: registers that live in global memory. It's unclear whether they're cached. But the performance with them sucks horribly.
At least in this proposal, this tier of registers is physically adjacent to the ALUs, which gives it as much or possibly more bandwidth than the primary register file.
The wiring at that juncture might be too congested to get fancy enough to augment vector register bandwidth, something that a number of CPUs have done to make up for having fewer ports than their ALUs could consume at peak.
Besides power, there could be some nice side benefits if there were a way to do this.
If one wavefront could successfully hide its accesses in the cache, the register file itself might be freed up for miscellaneous operations that need register access (LDS to VREG bypass, exports from VREG, etc.). The wavefront itself would not notice, but the CU overall might see better concurrency in servicing the other instruction queue types. Values from other domains, like scalar registers, might also be sourced more often after a move to the register cache/extended bypass.
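To make the "hide its accesses in the cache" idea concrete, here is a toy sketch in Python of a small near-ALU operand cache absorbing register reads before they reach the main register file. Everything here is made up for illustration: the LRU policy, the 8-entry capacity, and the operand trace are assumptions, not GCN's (or this proposal's) actual mechanism. The point is only that a tight loop body reusing a handful of registers has enough temporal locality that almost all operand reads can be served by the cache, leaving main-RF ports idle for other work.

```python
from collections import OrderedDict

class RegisterOperandCache:
    """Toy LRU cache standing in for a near-ALU register cache tier.

    A hit is served without touching the main register file; a miss
    counts as one main-RF read. Capacity and policy are illustrative.
    """
    def __init__(self, entries=8):
        self.entries = entries
        self.cache = OrderedDict()   # register index -> cached
        self.main_rf_reads = 0

    def read(self, reg):
        if reg in self.cache:
            self.cache.move_to_end(reg)   # LRU refresh; hit uses no RF port
            return
        self.main_rf_reads += 1           # miss: fetch from main register file
        self.cache[reg] = True
        if len(self.cache) > self.entries:
            self.cache.popitem(last=False)  # evict least-recently used

# A tight loop body cycling over four registers (high temporal locality):
trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 0, 1, 3] * 100
rc = RegisterOperandCache(entries=8)
for r in trace:
    rc.read(r)

# Only the four cold misses ever reach the main register file.
print(rc.main_rf_reads, "main-RF reads for", len(trace), "operand reads")
```

Under these (invented) numbers, 1200 operand reads cost just 4 main-register-file accesses; the rest hit the cache, which is the scenario where the main RF could be borrowed for bypasses and exports without the wavefront noticing.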
You could start here and work backwards:
Xbox One November SDK Leaked
One question I have is that if the off-chip mode is not an Xbox-specific feature, why has that option not been exercised? There are definitely clear disparities in bandwidth between solutions that provide rather close benchmark numbers.
GCN's memory pipeline is a philosophical example of the on-chip vs. off-chip dichotomy: an advanced-functionality case that works trivially thanks to an unadventurous physical fixation, and an expensive, un-evolved fallback.
This may stem from the insistence that the CU arrays be so heavily decoupled; movement to and from the fixed-function domain is more of a straw than the compute domain is used to.
Nvidia implemented an interconnect that distributed this more freely. Possibly, their implementation is able to spawn DS instances and clone the necessary parameters and contexts, while being able to provide a stream from the tessellator to the cloned instances.
AMD does not seem to have this readily available, unless the DS CU is made to write out all that data, at which point the ostensibly elegant memory pipeline becomes the distributor. And then we find that this conventional memory system does not "push" data well, and the less-advanced cache and memory hierarchy can no longer be hidden.
It's common to see comments about how Nvidia and AMD (and Intel, back when people still cared about CPUs) are still tuning clocks etc. very shortly before release. And inevitably it's about how they can still go up.
Here's my take on this. First: I've never seen silicon speeds go up after the first weeks of bring-up. They always go down: corner silicon doesn't perform as expected, false paths rear their ugly heads on some samples, etc.
And second: going to mass production is a very drastic step with a lot of red tape. You do an initial trial production run and a larger volume trial run and you analyze all the failures. And, most important, you don't touch a single parameter. Definitely not clocks.
So always take those comments about clocks not being final with a big grain of salt: it's very likely all in the imagination of a writer who has no clue. Especially 2 weeks before launch, when all parameters should have been locked for many weeks.
In other cases, it may be that the source is operating at the end of a grapevine, where the rumor sites breathlessly report as breaking news events that have long since been resolved.
Whether this GPU will be considered mass-produced for the X SKU or not, all speculation has pointed to a solution running on the edge of its power budget.
AMD could be tweaking its turbo bins on silicon it has already validated across a range, or fixing its firmware. It may be that the silicon never physically reaches what is hoped for, but the complexity of the DVFS implementation (and possibly AMD flubbing this again: Jaguar to Kabini, Trinity to Richland, 7970 to 7970 GHz Edition, Kaveri to Godavari, probably something in the 3xx-series rebrand stack) could leave a lot of slack below that point.
Possibly "working on clocks" means gauging the highest speed bin AMD can yield in sufficient volume.