From the CUDA docs,
"The compiler and thread scheduler schedule the instructions as optimally as possible to avoid register memory bank conflicts. They achieve best results when the number of threads per block is a multiple of 64."
Yikes! A banked register file. Of course, they do have the threads to tolerate bank conflicts, but this still sounds pretty nasty (both from a performance point of view and from a design complexity point of view). I'm sure they've found reasonable engineering solutions to minimize the impact, and it probably isn't *that* bad.
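To make the conflict issue concrete, here is a toy model of a banked register file. The striping scheme (round-robin, four banks) and the numbers are illustrative assumptions on my part, not the actual G80 design, which NVIDIA doesn't document:

```python
# Toy model of a banked register file: registers are striped across
# NUM_BANKS banks, and an instruction's source operands conflict when
# two of them land in the same bank, serializing the reads.
# Round-robin striping across 4 banks is an assumption, not G80's scheme.

NUM_BANKS = 4

def bank_of(reg: int) -> int:
    """Map a register index to a bank (simple round-robin striping)."""
    return reg % NUM_BANKS

def read_cycles(operands: list[int]) -> int:
    """Cycles needed to read all operands: the worst-loaded bank."""
    loads = [0] * NUM_BANKS
    for r in operands:
        loads[bank_of(r)] += 1
    return max(loads)

# A 3-operand instruction (e.g. a multiply-add) whose operands sit in
# distinct banks reads in one cycle...
print(read_cycles([0, 1, 2]))   # -> 1
# ...but if two operands fall in the same bank, the reads serialize.
print(read_cycles([0, 4, 2]))   # -> 2 (r0 and r4 both map to bank 0)
```

This is presumably the sort of conflict the compiler and scheduler are working to avoid with their register-allocation tricks.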
However, an SRAM memory with a banked interface is going to be less dense than a single-ported SRAM. The extra wires for the additional address bits and such aren't free. Of course, all SRAMs of any size are internally banked (to avoid long bit and word lines), but actually allowing an SRAM to take in multiple addresses and spit out multiple data words is going to result in a somewhat larger structure.
Of course, the real question is how does such a banked register file (G80) compare to a multi-port register file (presumably what Larrabee will use). I would say that the banking is likely a bit cheaper (in terms of area), but having, say, three ports (two read, one write) isn't that expensive either (you can overlay the wires over the SRAM cells in many cases).
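As a rough back-of-envelope way to see why a few ports aren't that expensive: in a wire-limited cell, each port adds roughly one wordline in one dimension and one bitline (or bitline pair) in the other, so cell area scales roughly with the square of the wire count per cell. The constants below are purely illustrative assumptions, not process data:

```python
# Back-of-envelope model of a wire-limited SRAM cell: each port adds
# roughly one horizontal wire (wordline) and one vertical wire (bitline),
# so cell area scales roughly as (base_wires + read_ports + write_ports)^2.
# The base wire count of 2 is an illustrative assumption.

def rel_cell_area(read_ports: int, write_ports: int, base: float = 2.0) -> float:
    pitch = base + read_ports + write_ports
    return pitch * pitch

single_port = rel_cell_area(1, 1)   # 1R1W cell
three_port = rel_cell_area(2, 1)    # 2R1W, as discussed above

print(f"2R1W cell vs. 1R1W cell: {three_port / single_port:.2f}x")  # -> 1.56x
```

Under this (admittedly crude) model, going from 1R1W to 2R1W costs around 50-60% in cell area, not a doubling or tripling, which is consistent with the intuition that a modest number of ports is affordable.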
Just thinking out loud, another option would be for Larrabee to use a multi-SRAM cell register file. This was used in the
IBM RS64-IV two-way multithreaded processor from the late 1990s. The observation was that on any given cycle, the processor would always read register values from the same thread (it was switch-on-cache-miss multithreaded). So, in this case, you don't need a full bit-line for each bit in the register file. You only need one bit-line for each pair of bits. I don't recall all the details, but this allowed them to build a register file with twice the bits (for two threads) without doubling the area. As multi-ported register files are often wire-limited anyway, adding the extra bits doesn't need to cost that much, but YMMV.
I found the ISSCC paper from 1998 by Storino et al. that describes it: "To accomplish a dual-thread operation the register file must have dual storage elements for each bit. The natural inclination would be to have multiplicity in write ports, read ports, and storage elements, significantly enlarging the area and lowering the performance... Given the orthogonal nature of threads, it is not necessary to read or write identical word locations of the separate threads in the same cycle. Through this observation the hardware implementation is reduced significantly because both write and read ports are shared. By sharing ports, only duplicate memory elements were required. No extra decoders are needed because there is only one additional bit for thread selection."
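Functionally, the scheme from the quote can be sketched like this: duplicate storage elements per bit, one shared set of ports, and a thread-select bit choosing which copy a port sees. The class and method names here are my own illustration, not from the paper:

```python
# Sketch of the RS64-IV-style dual-thread register file: duplicate
# storage elements per register, but one shared set of read/write ports.
# A thread-select bit picks which copy a port sees, so only one thread's
# registers are accessible per cycle -- fine for switch-on-miss threading.
# Interface names are illustrative, not from Storino et al.

class DualThreadRegFile:
    def __init__(self, num_regs: int = 32):
        # One storage array per thread: the "duplicate memory elements".
        self.storage = [[0] * num_regs, [0] * num_regs]

    def read(self, thread: int, reg: int) -> int:
        # Same decoder and port for both threads; the extra thread bit
        # just selects which copy of the cell drives the bit-line.
        return self.storage[thread][reg]

    def write(self, thread: int, reg: int, value: int) -> None:
        self.storage[thread][reg] = value

rf = DualThreadRegFile()
rf.write(0, 5, 123)   # thread 0's r5
rf.write(1, 5, 456)   # thread 1's r5 lives in the duplicate cell
print(rf.read(0, 5), rf.read(1, 5))   # -> 123 456
```

The key constraint, as the paper says, is that a single cycle never mixes reads or writes from both threads, which is exactly what port sharing buys you.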
I'm not saying that GPUs can't (or don't) play such tricks. They likely do. My main point is that trying to reason (and argue) about the area of various sorts of register files, caches, scratch memories, and secondary caches is more subtle than it might appear.