Can you expand a bit more on "The key to GPRs in ATI is that their data is private to the owning ALU"?
http://www.research.ibm.com/people/h/hind/pldi08-tutorial_files/GPGPU.pdf
You can see how the register files are disjoint.
http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf
Section 2.6, Figure 2.1 is a logical view of the architecture, showing that each of the 64 processors in a SIMD has 256 128-bit registers.
The "global" shared registers are described in 2.6.2.1, which is at pains to point out that each such GPR is only available to "threads" (ALUs) on that lane. This restriction wouldn't apply if register data could be arbitrarily shared across ALUs. Also note that Figure 2.2 shows how the register file is really a big pool shared by all wavefronts and split amongst global shared registers, clause temporary registers and bog-standard registers.
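To put rough numbers on that pool, here is a minimal Python sketch. The 64 lanes, 256 registers and 128 bits come from the figures quoted above; everything else (the function and class names, and the idea that shared and clause temporary registers are simply carved out once before the remainder is divided between wavefronts) is my own simplification for illustration, not something the ISA document states in this form.

```python
# Figures quoted in the post: 64 lanes per SIMD, 256 registers per lane, 128 bits each.
LANES_PER_SIMD = 64
REGS_PER_LANE = 256
BYTES_PER_REG = 128 // 8


def simd_register_bytes():
    """Total register storage in one SIMD, in bytes."""
    return LANES_PER_SIMD * REGS_PER_LANE * BYTES_PER_REG


def max_wavefronts(gprs_per_thread, shared_gprs=0, clause_temps=0):
    """Wavefronts that fit in one lane's pool (hypothetical partitioning model).

    Assumption: shared_gprs and clause_temps are reserved once, and what is
    left is divided between wavefronts needing gprs_per_thread registers each.
    """
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    available = REGS_PER_LANE - shared_gprs - clause_temps
    if available < 0:
        raise ValueError("reserved registers exceed the pool")
    return available // gprs_per_thread


class SharedRegisters:
    """Global shared GPRs: visible to every wavefront, but only within one lane."""

    def __init__(self, count):
        # One private row per lane; there is no path from one row to another.
        self._rows = [[0] * count for _ in range(LANES_PER_SIMD)]

    def write(self, lane, index, value):
        self._rows[lane][index] = value

    def read(self, lane, index):
        return self._rows[lane][index]
```

Under this model a kernel using 16 GPRs per thread leaves room for 16 wavefronts, and a value written to a shared register on lane 3 is simply not there when lane 5 reads the same register index.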
Additionally, of course, LDS is an entirely separate structure, distinct from the register files. It also has very low bandwidth compared with the GPRs. Evergreen's LDS has better bandwidth as a direct result of having twice as many banks, presumably at an increased bus cost for getting that data to/from the GPRs.
Technically it appears possible, even likely, that each VLIW ALU (set of x,y,z,w,t) has four 256-high x 128-bit register files. The old register file patents refer to 256-high blocks of memory, if I remember right. The whole thing is knitted together in a staggered timing cycle of 8 clocks (in two sets of 4 clocks) to ensure that all GPR clients get their share, centred upon the execution pipeline's 8-cycle interval; see slide 9:
http://gpgpu.org/wp/wp-content/uploads/2009/09/E1-OpenCL-Architecture.pdf
Page 10, too:
http://sa09.idav.ucdavis.edu/docs/SA09_AMD_IHV.pdf
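As a rough illustration of the 8-clock interval in two sets of 4, here is a hedged Python sketch. It assumes, and this is my assumption rather than anything stated in the text above, that a 64-thread wavefront is issued as four groups of 16 threads on consecutive clocks, with two wavefront slots alternating to fill the 8 clocks. The names are hypothetical and it says nothing about how the other GPR clients are actually arbitrated.

```python
WAVEFRONT_SIZE = 64
PHYSICAL_LANES = 16  # assumption: threads issued per clock
CLOCKS_PER_SET = WAVEFRONT_SIZE // PHYSICAL_LANES  # 4 clocks per wavefront
WINDOW = 2 * CLOCKS_PER_SET  # 8 clocks: two sets of 4


def slot_for_clock(clock):
    """Map a clock to (wavefront slot, first thread, last thread) in this model."""
    phase = clock % WINDOW
    wavefront_slot = phase // CLOCKS_PER_SET  # 0 for the first set of 4, 1 for the second
    quarter = phase % CLOCKS_PER_SET          # which 16-thread group of the wavefront
    first_thread = quarter * PHYSICAL_LANES
    last_thread = first_thread + PHYSICAL_LANES - 1
    return wavefront_slot, first_thread, last_thread


if __name__ == "__main__":
    for clock in range(WINDOW):
        print(clock, slot_for_clock(clock))
```

The pattern repeats every 8 clocks, so in this model each wavefront slot gets a fixed 4-clock turn in every interval.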
That should be all you need.
Jawed