Last time I checked, GPUs had massive register files (256 kB per VLIW SIMD engine on Radeons, which is space for 64k 32-bit floats).
You think that's a good thing? That massive register file is shared by a very large number of strands, leaving only a modest number of registers per strand. When you execute strands more slowly, you need more of them in flight to reach the same throughput, which means you'd need an even larger register file. At the same time, the software is getting more complex, demanding even more registers. They can't keep sacrificing die space for that. Instead, some simple forms of out-of-order execution and superscalar issue can increase ILP and lower the storage demand.
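Just to put some rough numbers on that (the strand and wavefront counts below are my own assumptions, not vendor figures), here's the back-of-the-envelope math:

/* Back-of-the-envelope: how many 32-bit registers each strand gets when a
 * shared register file is divided among the strands needed to hide latency.
 * All counts below are illustrative assumptions, not vendor specifications. */
#include <stdio.h>

int main(void) {
    const int regfile_bytes   = 256 * 1024;          /* 256 kB per SIMD engine, as above               */
    const int lanes           = 64;                  /* strands per wavefront (assumed)                */
    const int wavefronts      = 24;                  /* wavefronts in flight to hide latency (assumed) */
    const int strands         = lanes * wavefronts;  /* 1536 strands sharing the file                  */
    const int regs_per_strand = regfile_bytes / (strands * 4);  /* 4 bytes per float register          */

    printf("%d strands -> %d float registers per strand\n", strands, regs_per_strand);
    /* Tolerating twice the latency means twice the strands in flight, which
     * halves the per-strand register budget unless the file grows too. */
    return 0;
}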
Note that it's not just about registers. If the combined working set of all strands doesn't fit in the L1 cache most of the time, you get a very high percentage of misses, which results in higher bandwidth usage and higher latency. Ironically, higher latency means you need more strands, which again means more register and cache pressure...
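The same kind of arithmetic applies to the cache (again, every size here is an assumption, chosen only to show the ratio):

/* Combined working set of all strands in flight versus L1 capacity.
 * Sizes are assumptions chosen only to illustrate the ratio. */
#include <stdio.h>

int main(void) {
    const int strands          = 1536;       /* strands in flight on one SIMD engine (assumed)   */
    const int bytes_per_strand = 64;         /* per-strand working set, one cache line (assumed) */
    const int l1_bytes         = 16 * 1024;  /* L1 data cache size (assumed)                     */

    const int working_set = strands * bytes_per_strand;
    printf("working set: %d kB vs. L1: %d kB\n", working_set / 1024, l1_bytes / 1024);
    /* 96 kB competing for 16 kB of L1: most accesses miss, burning bandwidth
     * and adding latency, which in turn calls for even more strands in flight. */
    return 0;
}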
Either they'll attempt to reduce the pipeline depth, use more dynamic scheduling, or they'll need supermassive register files and caches. It might be a combination, but merely increasing the storage seems like a waste of die space to me.
Seriously? You're reading that kind of stuff into a marketing-driven pictogram? It can mean almost anything and definitely doesn't give away much about the actual implementation.
It doesn't mean just anything:
Merging CPUs and GPUs.
"You can expect to talk to the GPU via extensions to the x86 ISA, and the GPU will have its own register file (much like FP and integer units each have their own register files). Elements of the architecture will be shared, especially things like the cache hierarchy, which will prove useful when running applications that require both CPU and GPU power."
So at least initially, when they bought ATI, they envisioned combining the flexibility of the CPU with the throughput of a GPU. It looks like Bulldozer and GCN can still be part of this long-term plan, but I wonder what the next steps will be.
Not entirely surprisingly, it looks like AVX2 and Larrabee put Intel one step closer to a fully converged architecture. It can't be a coincidence that they've already reserved an encoding bit for 512-bit and 1024-bit AVX (they could instead have reserved it for an undetermined feature). It's also quite interesting that Intel paid NVIDIA $1.5 billion for access to patents which they might need to implement the sequencing logic that executes AVX-1024 on 256-bit execution units in a power-efficient manner.
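Purely to illustrate what I mean by sequencing (this is my own sketch, not anything Intel has disclosed; AVX-1024 doesn't exist, and the v1024 type and helper below are made up, while only __m256 and _mm256_add_ps are real AVX):

/* Illustration only: a hypothetical 1024-bit vector add cracked into four
 * back-to-back 256-bit AVX operations, the way sequencing logic could pump a
 * wider instruction through narrower execution units. */
#include <immintrin.h>

typedef struct { __m256 part[4]; } v1024;   /* stand-in for a 1024-bit register */

static inline v1024 add_ps_1024(v1024 a, v1024 b) {
    v1024 r;
    /* One instruction from the front end, four passes through the 256-bit unit:
     * fetch/decode/schedule work is amortized over four times the data, which
     * is where the power saving would have to come from. */
    for (int i = 0; i < 4; ++i)
        r.part[i] = _mm256_add_ps(a.part[i], b.part[i]);
    return r;
}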