Thanks for your nice drawings! However, what is the point of skewing the components across the banks?
To provide the flexibility needed to read different combinations of data. This is all guesswork. It's part of my theory about improving ALU execution efficiency by re-ordering pixel scheduling depending on the number of components of an instruction.
For example, in a simplistic 4-way SIMD ALU, MUL r1.rg, r0.rg, r2.rg wastes half the ALU. If you could schedule two pixels (each running the same instruction) through the same ALU, with the second pixel using the blue and alpha lanes (i.e. temporarily translating ".rg" into ".ba"), then you waste nothing.
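A toy sketch of that packing idea, in Python rather than anything resembling real hardware. The function names (`mul4`, `mul_rg_packed`, `utilisation`) and the dict layout for a pixel are made up for illustration; the only thing taken from the example above is that pixel A keeps ".rg" while pixel B's ".rg" is temporarily routed to ".ba".

```python
def mul4(a, b):
    """One pass through a hypothetical 4-wide SIMD multiplier (lane-wise a*b)."""
    assert len(a) == len(b) == 4
    return [x * y for x, y in zip(a, b)]


def mul_rg_packed(pixel_a, pixel_b):
    """MUL r1.rg, r0.rg, r2.rg for two pixels in a single 4-wide pass.

    Each pixel is a dict holding the 2-component operands 'r0' and 'r2'.
    Pixel A occupies lanes r,g; pixel B's .rg is mapped onto lanes b,a.
    Returns the r1.rg result for pixel A and for pixel B.
    """
    src0 = pixel_a["r0"] + pixel_b["r0"]   # lanes: A.r, A.g, B.r, B.g
    src1 = pixel_a["r2"] + pixel_b["r2"]
    out = mul4(src0, src1)
    return out[:2], out[2:]


def utilisation(active_lanes, passes, width=4):
    """Fraction of ALU lanes doing useful work over a number of passes."""
    return active_lanes / (passes * width)


a = {"r0": [1.0, 2.0], "r2": [3.0, 4.0]}
b = {"r0": [5.0, 6.0], "r2": [7.0, 8.0]}
res_a, res_b = mul_rg_packed(a, b)

unpacked = utilisation(active_lanes=4, passes=2)  # one pixel per pass
packed = utilisation(active_lanes=4, passes=1)    # two pixels share a pass
```

With one pixel per pass, half the lanes sit idle; with two pixels sharing a pass, all four lanes are busy. The sketch says nothing about what it costs to route the operands into the right lanes, which is the whole register file question.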
Additionally, as I've shown, this flexibility also supports packing threads to improve ALU utilisation for dynamic branching.
You can see my earlier discussion here (using 8-wide banks instead of 4-wide):
http://www.beyond3d.com/forum/showthread.php?p=900211#post900211
which assumes a fully symmetric 32-component ALU, i.e. special functions (like SIN or RSQ) are handled by all components of the ALU.
You assumed that you can choose the address (row) per bank. Is this feasible?
Ha, well what do I know about register file design? How does any register file support dual, triple etc. read ports?... How do you guarantee that all your read ports can always fetch the operands you require? What happens with co-issue operand fetching?
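To make the per-bank row question a bit more concrete, here's a toy model of one possible skewed layout. It's purely an assumption for illustration (4 banks, component c of register n stored in bank (c + n) mod 4, row number equal to register number) and not necessarily the layout in the drawings. The helper names are invented too.

```python
NUM_BANKS = 4  # assumption: one bank per component lane


def bank_of(reg, comp, skew=True):
    """Bank holding component `comp` (0..3 = r,g,b,a) of register `reg`."""
    return (comp + reg) % NUM_BANKS if skew else comp


def conflict_free(reads, skew=True):
    """True if every (reg, comp) read lands in a different bank."""
    banks = [bank_of(r, c, skew) for r, c in reads]
    return len(set(banks)) == len(banks)


def rows_needed(reads):
    """Row address each bank must be driven with (row == register number here)."""
    return {bank_of(r, c): r for r, c in reads}


# Red component of four different registers: r0.r, r1.r, r2.r, r3.r
red_reads = [(0, 0), (1, 0), (2, 0), (3, 0)]

# All four components of a single register: r2.rgba
whole_reg = [(2, c) for c in range(4)]

flat_ok = conflict_free(red_reads, skew=False)  # everything piles into bank 0
skew_ok = conflict_free(red_reads, skew=True)   # spread over banks 0..3
rows = rows_needed(red_reads)                   # each bank wants a different row
```

In this model the skew is what lets the four red components be read in one go, but only if each bank can be handed its own row address. Reading a whole register still works with a single shared row. Whether independent row addressing per bank is practical is exactly the open question.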
Trying to find much concrete stuff about register files in GPUs is extremely hard. One patent application I've got talks about implementing a register file (as an aside, not the main point of the patent) where every location exists twice; that's how dual-porting is "implemented" (it's a suggestion, nothing more). Now, that seems sorta unbelievable to me, actually loony.
So, in short, there's no way I can back any of this up. I'm hopeful that there'll be a discussion, that's all...
Jawed