Nvidia has moved further away from unifying register and memory pools. Fermi's ISA is described as having moved to a more fully load/store architecture, whereas its immediate predecessor had memory operands.
I don't understand the distinction you're making.
Why expose every operand access to possible TLB fills and memory faults, or why have the additional complexity in hardware to do this, and then avoid using it most of the time?
This is where you get into a nebulous argument over whether the memory in the operand collector, which holds operands for multiple cycles until a warp's worth of operands are all populated, is really the register file.
In this model the "registers", the constant cache, the shared memory and global memory are all just addressable memories.
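To make that concrete, here's a rough CUDA-style sketch (mine, names invented, nothing definitive) with the four spaces sitting side by side - in this model each one is just memory you address, the difference being where it lives and how it's reached:

__constant__ float scale;                    // constant cache

__global__ void spaces_demo(const float *in, float *out, int n)
{
    __shared__ float tile[256];              // shared memory: on-chip SRAM (assumes blockDim.x == 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // global memory -> shared memory
    __syncthreads();

    float r = tile[threadIdx.x] * scale;     // r is the "register": just a compiler-managed value
    if (i < n)
        out[i] = r;                          // and back out to global memory
}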
The thing that makes these GPUs different from CPUs is that gather/scatter is essentially a first-class instruction. Or, at least in the future, it is. There's no choice when the whole thing is a SIMD. Historically GPU ALUs have avoided the gather/scatter problem because pixel shading doesn't expose the ALUs to it - the pipeline has been designed to farm out texture-mapping gather and pixel-blending scatter operations.
Many of these fancy new algorithms (or re-discovered supercomputing principles) push repeatedly on the gather/scatter button at the ALU instruction level.
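By which I mean something like this (again just a hedged CUDA sketch, all names invented): every lane computes its own address on both the read and the write, and the hardware has to service all of them.

__global__ void gather_scatter(const float *src, float *dst,
                               const int *gidx, const int *sidx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = src[gidx[i]];   // gather: per-lane read address
        dst[sidx[i]] = v;         // scatter: per-lane write address
    }
}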
If it weren't for the x86 core, the x86 hardware thread context, the comparatively minuscule reg file and its reg/memory operand legacy, I wonder if its designers would have skipped over that "feature".
G80->G92->GT200 saw progressively increasing register capacity and/or increasing work-items per SIMD. Fermi actually reverses things a little, I think. In other words it seems to me NVidia hasn't really settled on anything.
Obviously this discussion would be easier with Larrabee to play with. But I trust the Intel engineers to the extent that the originally-presented chip wasn't fundamentally broken in this respect. Though I still strongly suspect texturing is a black hole they've been struggling with.
One could argue that texturing is still so massively important that it steers GPUs towards large RFs, and that the ALU-gather/scatter-centric argument is merely a distraction - as well as Intel's stumbling block.
A memory access is not as cheap as a register file access, for various reasons. It is a much more complex case to get right, and getting it wrong has much bigger consequences for the system in general. The load/store and execution pipelines of even the P55 core are at least somewhat more complex because of this.
That's all very well. But GPU performance falls off a cliff if the context doesn't fit into the RF (don't know how successfully GF100 tackles this). So, what we're looking for is an architecture that degrades gracefully in the face of an increasing context.
The question is: can register files either keep growing, or at the very least retain their current size, in the face of ever more complex workloads?
What happens when GPUs have to support multiple true heavyweight contexts, all providing real-time responsiveness? The stuff we take for granted on CPUs?
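To put the cliff in concrete terms (my sketch, nothing official): give each thread a context that's too big for the register file and the compiler starts spilling to local memory, which on these chips is off-chip DRAM. Compile something like the kernel below with nvcc --ptxas-options=-v and ptxas reports the register count plus any "spill stores"/"spill loads"; capping registers with -maxrregcount makes the drop-off easy to provoke. How much it actually spills is compiler-dependent, of course.

#define CTX 64                              // per-thread "context" size, picked arbitrarily

__global__ void big_context(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float ctx[CTX];                         // the per-thread working set
    for (int k = 0; k < CTX; ++k)
        ctx[k] = in[i] * (float)k;

    float acc = 0.0f;
    for (int k = 0; k < CTX; ++k)           // keep the whole context live at once
        acc += ctx[k] * ctx[CTX - 1 - k];

    out[i] = acc;
}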
I wouldn't mind accessing that pool of SRAM, perhaps with some kind of linear line access that skips the TLB and fault-handling parts of the pipeline, but those are usually an integral part of the pipeline and not totally removable.
I would be curious whether Nvidia's configurable L1 somehow converts accesses to the shared memory region into something addressed to the physical lines of the cache, or whether it is just some kind of creative page mapping where the cache logic doesn't bother to keep it coherent.
NVidia has a gather unit (the operand collector) that essentially hides a load of mess there (and a store queue). I'm presuming the cache is just coherent+bankset aligned accesses to banked shared memory.
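For what it's worth, the split itself is software-visible through the CUDA runtime - something like the sketch below (kernel name invented) is how you'd request it - but that says nothing about what the cache logic does underneath, physical line addressing or creative page mapping.

#include <cuda_runtime.h>

__global__ void my_kernel()                 // stand-in kernel
{
    __shared__ float tile[256];             // touches the shared memory carve-out
    tile[threadIdx.x] = (float)threadIdx.x;
}

int main()
{
    // Ask for the 48KB shared / 16KB L1 split for this kernel (or the reverse).
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

    my_kernel<<<1, 256>>>();
    cudaDeviceSynchronize();
    return 0;
}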
I would potentially disagree, if I knew more of the implementation. It's possible that Larrabee would already have store queues as explicit parts of its memory pipeline.
The L1 and registers are just what they are, and whatever hierarchy they implement is something any other fully fleshed-out memory pipeline can provide with proper software usage.
Sorry, I wasn't trying to say that L1/registers replace a conventional memory interface for the ALUs - I'm simply saying that the way Larrabee is designed, gather/scatter is built upon the workings of L1/registers. This comes back to the SIMD architecture and first-class gather/scatter. Gotta wait to see it in action...
Jawed