I don't understand the distinction you're making.
Moving to a fully load/store ISA means that instructions performing computation are explicitly separate from memory-access instructions. As far as software is concerned, register memory is more distinct from the other memory pools on Fermi than it was before, and the more robust memory model of modern GPUs would make it expensive to integrate into every operand access.
This is where you get into a nebulous argument over whether the memory in the operand collector (which holds operands for multiple cycles until a warp's worth of operands are all populated) is really the register file.
In this model the "registers", the constant cache, the shared memory and global memory are all just addressable memories.
The register file is the big collection of SRAM holding operands that resides on one side of the ALUs. It is physically distinct, and distinct in how it is treated.
I don't see why the operand collector needs to care about memory at all. The ISA is load/store, so all it needs to track is the readiness of the destination register of a given load. No instruction other than the memory access instructions would know of an address, which is much simpler to handle.
The operand collector would be wasting its time tracking the memory addresses.
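To put that concretely, here's a toy sketch (Python, with everything invented for illustration) of the only bookkeeping a load/store scoreboard actually needs: per-register readiness bits, with the load's address left entirely to the memory pipeline.

```python
# Toy scoreboard for a load/store ISA: the collector tracks only
# per-register readiness; addresses are the memory pipeline's problem.
class Scoreboard:
    def __init__(self, num_regs):
        self.ready = [True] * num_regs   # all registers start valid

    def issue_load(self, dest_reg):
        # A load in flight marks only its destination as pending;
        # the address it carries never reaches the collector.
        self.ready[dest_reg] = False

    def load_retired(self, dest_reg):
        self.ready[dest_reg] = True

    def can_issue_alu(self, src_regs):
        # An ALU op waits only on register readiness -- no addresses,
        # no page faults, no memory ordering to consider.
        return all(self.ready[r] for r in src_regs)

sb = Scoreboard(num_regs=8)
sb.issue_load(dest_reg=2)           # ld  r2, [addr]   (addr not tracked here)
print(sb.can_issue_alu([1, 2]))     # add r3, r1, r2 must wait
sb.load_retired(dest_reg=2)
print(sb.can_issue_alu([1, 2]))     # now it can issue
```

The whole structure is a bit vector plus wakeup, which is exactly why keeping addresses out of it is cheap.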
Many of these fancy new algorithms (or re-discovered supercomputing principles) push repeatedly on the gather/scatter button at the ALU instruction level.
No new high-performance ISA puts memory operands at the ALU instruction level.
x86 internally splits the ALU work off from the memory access precisely because memory is such a problem: register accesses do not generate page faults or access violations, and never require paging memory in.
G80->G92->GT200 saw progressively increasing register capacity and/or increasing work-items per SIMD. Fermi actually reverses things a little, I think. In other words, it seems to me NVIDIA hasn't really settled on anything.
What needs to be settled? The canonical example of an architecture whose complex ALU instructions could source multiple operands directly from memory is the VAX.
Obviously this discussion would be easier with Larrabee to play with. But I trust the Intel engineers to the extent that the originally-presented chip wasn't fundamentally broken in this respect. Though I still strongly suspect texturing is a black hole they've been struggling with.
The x86 core is what it is. Plenty of other architectures don't try to combine memory loads with ALU work, and the P55C core internally cracks the instructions apart anyway.
One could argue that texturing is still so massively important that it steers GPUs towards large RFs, that the ALU-centric gather/scatter argument is merely a distraction, and that texturing is Intel's real stumbling block.
I think it's mostly a distraction. The register file does very well on its own. The failings GPUs have with spills are more a product of their design. Other designs that degrade more gracefully just spill with loads and stores. It's cheaper and faster than trying to drive a full cache or make the internal scheduling hardware capable of handling memory faults.
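A toy illustration of what "just spill with loads and stores" means in practice (Python; the mnemonics, spill slots, and LRU policy are all made up for the sketch, not any real allocator):

```python
# Toy spill logic: keep at most num_regs values resident; anything
# evicted is spilled with an ordinary store and refilled with an
# ordinary load. No fault-capable hardware in the operand path.
def allocate(values_needed, num_regs):
    """Return the instruction stream for touching `values_needed` in order."""
    code, in_regs = [], []
    for v in values_needed:
        if v in in_regs:                  # already resident: no traffic
            in_regs.remove(v)
            in_regs.append(v)             # mark most-recently-used
            continue
        if len(in_regs) >= num_regs:      # out of registers:
            victim = in_regs.pop(0)       # evict least-recently-used
            code.append(f"st [mem_{victim}], r_{victim}")
        code.append(f"ld r_{v}, [mem_{v}]")
        in_regs.append(v)
    return code

# Context fits: only the initial loads.
print(allocate(["a", "b", "a"], num_regs=2))
# Context one value too big: a few extra ordinary stores and reloads.
print(allocate(["a", "b", "c", "a"], num_regs=2))
```

When everything fits, the only memory traffic is the initial loads; when it doesn't, the degradation is a handful of plain loads and stores rather than a cache hierarchy or fault handling wired into every operand access.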
That's all very well. But GPU performance falls off a cliff if the context doesn't fit into the RF (I don't know how successfully GF100 tackles this). So what we're looking for is an architecture that degrades gracefully in the face of an increasing context.
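The cliff is easy to see with a back-of-the-envelope occupancy calculation (Python; the 32K-registers-per-SM, 48-warp, and 32-lane figures are assumed GF100-class numbers, ignoring allocation granularity):

```python
# Back-of-envelope: resident warps per SM as the per-thread register
# "context" grows. Assumed GF100-class figures: 32768 registers per SM,
# 48 resident warps max, 32 threads per warp.
REGS_PER_SM, MAX_WARPS, WARP = 32768, 48, 32

def resident_warps(regs_per_thread):
    return min(MAX_WARPS, REGS_PER_SM // (regs_per_thread * WARP))

for r in (16, 21, 32, 40, 64):
    print(r, "regs/thread ->", resident_warps(r), "warps")
```

Past about 21 registers per thread, every extra register costs resident warps; once warp count drops, the latency hiding a throughput design relies on goes with it, and beyond the hardware cap the compiler has to start spilling on top of that.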
Memory operands save little here.
The difference between an x86 instruction with a memory operand and its load/store equivalent is a load followed by the ALU instruction (which x86 does implicitly anyway).
It saves a bit of instruction decode bandwidth and Icache pressure, but that is far from the limiting factor for GPU workloads, and decode bandwidth isn't considered a limiting factor unless you have an aggressively OoO speculative processor.
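That internal split can be caricatured in a few lines (Python; the mnemonics and temp-register name are invented, and no real decoder looks like this):

```python
# Caricature of micro-op cracking: an ALU instruction with a memory
# operand becomes a load uop plus a pure register-register ALU uop.
def crack(instr):
    op, dst, src = instr
    if isinstance(src, str) and src.startswith("["):    # memory operand?
        tmp = "t0"                    # hypothetical internal temp register
        return [("load", tmp, src),   # only this uop can fault or miss
                (op, dst, tmp)]       # the ALU uop now touches registers only
    return [instr]

print(crack(("add", "r1", "[rsp+8]")))   # memory form: two uops
print(crack(("add", "r1", "r2")))        # register form: unchanged
```

The memory-operand encoding buys nothing at execution time; the machine runs the load/store form either way.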
The question is: can register files either keep growing, or at least retain their current size, in the face of ever more complex workloads?
That is subject to speed and size constraints. It's not better with caches. L1s have stagnated and even begun to shrink.
What happens when GPUs have to support multiple true heavyweight contexts, all providing real-time responsiveness? The stuff we take for granted on CPUs?
Should they?
If you want latency-optimized performance, you don't design a throughput processor.