I know how SIMD works logically. I'm talking about the physical implementation. And there you have a few boundary conditions. The important one in this context is that you don't want to build four huge 64kB register banks sitting far away from the individual ALUs. Instead you split them up and place them very close to the ALUs. This pays off twofold: it reduces the access latency and it saves power. As no SIMD lane can see the registers of other lanes anyway, there is no point in sharing the register files. It's simpler, faster and even more power efficient.

I've tried thinking about this for a while... I'm pretty tired so I might make some obvious mistake, please bear with me :/
The virtue of SIMD is that you can treat the entire vector like one big register and access it via a single, very wide register file port. So the amount of individual addressing you need shouldn't scale with the number of lanes (VLIW groups) in the SIMD. Of course this will be limited if the SIMD is broken up over multiple blocks.
And it actually enables a trick a single shared register file couldn't do easily. Each SIMD lane can add an offset to the index into the register file, so each lane can in fact address a different register with the single instruction shared by all lanes (look it up in the ISA manual under "relative addressing", with the offset given in the AR register set by the MOVA_* instruction). Afaik, GCN got rid of this messy stuff.
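To make that concrete, here is a minimal C sketch of how I picture the indexing: the instruction carries one base GPR index shared by everybody, and with relative addressing each lane adds its own AR value on top, so the lanes can end up in different registers. The names and the simple index math are my own illustration, not taken from the manual.

#include <stdio.h>

#define LANES 4   /* just a handful of lanes for the example */

int main(void)
{
    int base_gpr = 5;                 /* GPR index encoded once in the shared instruction */
    int ar[LANES] = {0, 1, 2, 3};     /* per-lane offset, set earlier by a MOVA_* instruction */

    for (int lane = 0; lane < LANES; ++lane) {
        int effective_gpr = base_gpr + ar[lane];   /* each lane may land on a different GPR */
        printf("lane %d reads GPR %d\n", lane, effective_gpr);
    }
    return 0;
}

Without relative addressing all the ar[] entries are effectively zero and the whole SIMD reads the same GPR index through one wide port, which is exactly the cheap case from above.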
I know all the ISA manuals from R600, R700, Evergreen, HD6900, and Southern Islands (that's what AMD called the GPUs in the manuals).

If using the R700 series as a base, this document may be helpful: http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf
The register addressing is really awful for the VLIW architectures. But anyway:

Section 4.7.4 is helpful. One read each of GPR.X, GPR.Y, GPR.W, and GPR.Z can happen per cycle, and GPR.Trans takes a read port from one of these. There's also a constant file which I think can supply one read per cycle (over 4 cycles?). I'm not sure where writes fit in though, whether they can share this too (and if so, whether the SRAMs need to be both read and write ported). So a VLIW SIMD block may only need something like 5-6 SRAM banks dedicated to it.
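Just to put numbers on that guess, a back-of-the-envelope tally in C under those assumptions (the bank split is my assumption, not something the manual states):

#include <stdio.h>

int main(void)
{
    int gpr_component_banks = 4;   /* GPR.X, GPR.Y, GPR.W, GPR.Z; GPR.Trans only borrows a port */
    int constant_file_banks = 1;   /* one read per cycle, if I read the doc right */
    int extra_write_bank    = 1;   /* only needed if writes can't share the read SRAMs */

    printf("lower bound: %d banks\n", gpr_component_banks + constant_file_banks);
    printf("upper bound: %d banks\n", gpr_component_banks + constant_file_banks + extra_write_bank);
    return 0;
}

That's where the 5-6 figure comes from.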
Let's forget for a moment that each lane has to work on 4 new elements every 4 cycles. Then there is the task of reading up to 12 operands over 3 cycles (some bank swizzling encoded in the instruction takes care of the distribution to the VLIW slots; we don't talk about that here, it's another mess. We also don't consider constants or literals, which don't come from the reg file, nor the PS and PV "pipeline registers", which serve as a kind of compiler-directed data forwarding because the results of the previous instruction are not yet written back to the reg file). For that you clearly need 4 banks. Those banks are not only physically separate, but already logically separate in the ISA. The remaining 4th cycle is used to write back the four results of the previous instruction, by the way. So all four cycles could be fully used up.
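As a toy model of that 4-cycle cadence for a single element in a single lane (ignoring swizzling, constants/literals and PS/PV, as said; the code is just my sketch of the schedule described above):

#include <stdio.h>

int main(void)
{
    const int banks = 4;

    for (int cycle = 0; cycle < 4; ++cycle) {
        if (cycle < 3)
            printf("cycle %d: read %d operands, one per bank (%d of 12 so far)\n",
                   cycle, banks, (cycle + 1) * banks);
        else
            printf("cycle %d: write back the %d results of the previous instruction\n",
                   cycle, banks);
    }
    return 0;
}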
But now we come back to the fact that each SIMD lane actually executes 4 elements of the wavefront over four cycles in a pipelined fashion. That would mean during the first cycle one reads four 32-bit values from the 4 banks for the first element. In the next cycle one has to read both the next four operands for the first element and the first four operands for the second element. And so on. In the end this means one has to constantly read 12 values per cycle and also write 4 values per cycle. That basically needs 4 banks with 4 ports each (3 read, one write; fetching data is done in separate clauses and may steal the necessary cycles when changing clauses).
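The same toy model with 4 elements in flight per lane, each shifted by one cycle, shows the steady state: three elements are always in a read phase (3 x 4 = 12 reads) while one is writing back (4 writes), every single cycle. Again just my sketch of the argument, not how the hardware is actually scheduled.

#include <stdio.h>

int main(void)
{
    const int banks = 4;

    for (int cycle = 0; cycle < 4; ++cycle) {
        int reads = 0, writes = 0;
        for (int elem = 0; elem < 4; ++elem) {
            int phase = (cycle + elem) % 4;   /* where this element is in its 4-cycle cadence */
            if (phase < 3) reads  += banks;   /* operand fetch cycle */
            else           writes += banks;   /* write-back cycle */
        }
        printf("cycle %d: %d reads, %d writes\n", cycle, reads, writes);
    }
    return 0;
}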
But one could have the idea that one simply fetches and writes 128-bit values (4x32 bit) for all 4 elements processed in that lane together, and everything would be fine. The relative addressing of GPRs would destroy this, as the 4 elements can access different GPRs. But I strongly suspect that it uses the same mechanism as a GPR-indexed access of constants, called waterfalling (it basically serializes the 4 elements in each lane; not as bad as serializing all 64 elements in a wavefront for a GPR-indexed constant file access). I really doubt they implemented 4 times the register ports/banks just for this basically never used feature. As I mentioned, GCN dropped support for this (and I guess nobody will notice).
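A last sketch of how I imagine that waterfalling fallback (the structure is my guess, borrowing the waterfalling idea mentioned above for GPR-indexed constant reads): if the 4 elements of a lane end up with the same effective GPR index, one wide 128-bit access covers all of them; as soon as relative addressing makes them diverge, the accesses get serialized element by element.

#include <stdio.h>

int main(void)
{
    int base_gpr = 7;
    int ar[4] = {0, 0, 2, 0};   /* per-element offsets; element 2 diverges */

    int common  = base_gpr + ar[0];
    int uniform = 1;
    for (int e = 1; e < 4; ++e)
        if (base_gpr + ar[e] != common)
            uniform = 0;

    if (uniform) {
        printf("one 128-bit access to GPR %d for all 4 elements\n", common);
    } else {
        for (int e = 0; e < 4; ++e)   /* waterfall: serialize the elements */
            printf("separate access to GPR %d for element %d\n", base_gpr + ar[e], e);
    }
    return 0;
}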