Not knowing the bit length of the current x86 chips' internal ops, I can't make a direct comparison of contemporary architectures.
In comparison to the P6 at 118 bits per instruction, the biggest packet in the internal instruction format of R600 would be 56 bytes in length. Five P6 micro-ops would be ~74 bytes.
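A quick sanity check of those two figures, as a minimal sketch. The packet layout used here (five 8-byte scalar ops plus up to four 4-byte literals per instruction slot) is an assumption chosen to reproduce the 56-byte figure, not something taken from R600 documentation, and all the names are made up for illustration.

```python
# Back-of-envelope packet sizes. Layout figures are assumptions, not documented facts.
P6_UOP_BITS = 118        # bits per P6 micro-op, as quoted in the post
R600_OP_BYTES = 8        # assumed: 64 bits per scalar op
R600_OPS_PER_SLOT = 5    # assumed: up to 5 scalar ops per instruction slot
R600_LITERAL_BYTES = 4   # assumed: 32-bit literals
R600_MAX_LITERALS = 4    # assumed: up to 4 literals per instruction slot


def r600_max_packet_bytes():
    """Largest packet: every op slot filled plus the maximum number of literals."""
    return R600_OPS_PER_SLOT * R600_OP_BYTES + R600_MAX_LITERALS * R600_LITERAL_BYTES


def p6_uops_bytes(n_uops):
    """Size in bytes of n P6 micro-ops at 118 bits each."""
    return n_uops * P6_UOP_BITS / 8


print(r600_max_packet_bytes())  # 56
print(p6_uops_bytes(5))         # 73.75
```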
How many bits would the matching external shader instruction be in comparison, I wonder?
Each of the 3 operands and the resultant in an SM4 shader instruction can relate to any one of 4096 vec4 registers - so in terms of R600 that would be any one in 16384 scalar operands. Multiply that by the number of threads supported by the architecture and you get...
From the assembly code I've looked at, it appears R600 provides for 128 vec4 registers per thread, actually implemented as 512 scalars. In order to support the resulting 16384-scalar register file, there'd presumably be some kind of abstracted register file paging mechanism never seen by the ALU pipeline, along with paging of the shader code into distinct programs that address subsets of the thread's register set.
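To put numbers on the gap between the two register spaces, here is a rough sketch. The "page" here is purely hypothetical: it just treats the 512 scalars that a thread can apparently see at once as the unit of the guessed-at paging mechanism.

```python
# Register-space arithmetic. "Pages" are a hypothetical unit, not a known R600 feature.
SCALARS_PER_VEC4 = 4
SM4_VEC4_REGS = 4096          # vec4 registers addressable by an SM4 instruction
R600_VEC4_PER_THREAD = 128    # vec4 registers per thread seen in R600 assembly

sm4_scalars = SM4_VEC4_REGS * SCALARS_PER_VEC4            # full SM4 register set
r600_scalars = R600_VEC4_PER_THREAD * SCALARS_PER_VEC4    # visible per thread
pages_needed = sm4_scalars // r600_scalars                # hypothetical page count

print(sm4_scalars)    # 16384
print(r600_scalars)   # 512
print(pages_needed)   # 32
```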
A curiosity of the architecture is that the compiler assigns registers in two sequences: r0-upwards and r127-downwards. As far as I can tell, the r127-downwards registers are clause-temporaries, i.e. their lifetime is not that of the shader program, but the execution of the clause.
R600 breaks shader programs up into clauses of at most 32 instruction slots or 128 scalar ops, whichever comes first (128 scalar ops can be compiled into 26 instruction slots). So, to run a 400-slot shader, R600 runs a sequence of clauses from that shader, each at most 32 slots or 128 ops long. The op count is the harder limit, corresponding to 1024 bytes of code (128 * 8 bytes). (If the maximum number of literals per instruction is allocated, then presumably only 18 instruction slots could be compiled per clause.)
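The clause limits can be checked with a few lines. This is only a sketch: the 5 ops per slot, 8 bytes per op and 56-byte maximum packet are the same guesses used elsewhere in this post, and the function name is invented.

```python
import math

# Clause limits as described in the post; sizes per op and per packet are assumptions.
MAX_OPS_PER_CLAUSE = 128
OPS_PER_SLOT = 5           # assumed: up to 5 scalar ops per instruction slot
OP_BYTES = 8               # assumed: 64 bits per scalar op
MAX_PACKET_BYTES = 56      # assumed: 5 ops plus the maximum number of literals


def slots_for_ops(n_ops):
    """Fewest instruction slots that can hold n_ops scalar ops."""
    return math.ceil(n_ops / OPS_PER_SLOT)


clause_code_bytes = MAX_OPS_PER_CLAUSE * OP_BYTES
slots_with_max_literals = clause_code_bytes // MAX_PACKET_BYTES

print(slots_for_ops(MAX_OPS_PER_CLAUSE))  # 26
print(clause_code_bytes)                  # 1024
print(slots_with_max_literals)            # 18
```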
So it would appear likely that R600 has a subsidiary register file that supports only clause-temporaries (r127, r126 etc.), in addition to the register file that supports "shader state", r0, r1, r2 etc.
The subsidiary register file would only need to support 2 threads at a time. If a clause can be a maximum of 128 operations, that's 128 scalar resultants required in the subsidiary register file - or 32 vec4s. So it would appear that the subsidiary register file needs to be 64 objects * 2 threads * 128 scalars * 4 bytes = 64KB. This couldn't actually be true because a clause that consisted of 128 resultants that were all junked once the clause completed would, in effect, do nothing. So that's not the whole answer...
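The 64KB figure is just this product, shown here as a sketch. Every factor is a guess from the paragraph above, and the constant names are made up.

```python
# Size of the guessed-at subsidiary register file. All factors are assumptions.
OBJECTS_PER_THREAD = 64     # objects (e.g. pixels) processed together in one thread
THREADS_IN_FLIGHT = 2       # threads the subsidiary file would hold at once
SCALARS_PER_CLAUSE = 128    # one scalar resultant per op, 128 ops per clause
BYTES_PER_SCALAR = 4        # 32-bit scalars

subsidiary_bytes = (OBJECTS_PER_THREAD * THREADS_IN_FLIGHT
                    * SCALARS_PER_CLAUSE * BYTES_PER_SCALAR)

print(subsidiary_bytes)          # 65536
print(subsidiary_bytes // 1024)  # 64
```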
So, instead of this, I can imagine that in setting up a thread's clause for execution by a SIMD, the relevant registers are copied from the register file into the subsidiary register file. So, this is a different organisation from what I've just proposed. Instead of the ALUs consuming operands from both the register file and a subsidiary register file (as well as writing to both), I'm guessing that the ALUs can only work against the subsidiary register file.
At the end of execution of a clause the updated shader state registers (r0 upwards) would need to be written back to the register file, while the clause-temporaries (r127 downwards) are junked. In order to pipeline these copy operations, I guess "triple buffering" is required. Sorta similar to the way in which Cell SPE programs often use triple buffering and DMA to split LS into "incoming", "active" and "outgoing" stages. Double buffering could work I suppose if the register copy operations are "fast".
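To illustrate the triple-buffering idea, here is a toy sketch of how three buffers could rotate through the "incoming", "active" and "outgoing" roles, one step per clause. It models nothing about real R600 hardware. The buffer names and the function are hypothetical.

```python
from collections import deque


def rotate_roles(n_clauses):
    """Return, for each clause step, which buffer plays which role.

    Each step, the buffer that was being filled becomes the one the ALUs work on,
    the one the ALUs worked on is written back, and the written-back one is refilled.
    """
    buffers = deque(["A", "B", "C"])  # hypothetical buffer names
    schedule = []
    for _ in range(n_clauses):
        incoming, active, outgoing = buffers
        schedule.append({"incoming": incoming, "active": active, "outgoing": outgoing})
        buffers.rotate(1)  # outgoing -> incoming, incoming -> active, active -> outgoing
    return schedule


for step in rotate_roles(3):
    print(step)
```

With three buffers the roles repeat every three clauses, so the copy-in and copy-out for neighbouring clauses can overlap with execution of the current one.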
If this is the case then operand addressing at the operation level in R600 would only need to cater for 512 scalar addresses (128 operations per clause, each operation having a maximum of 3 operands and 1 resultant), 9 bits per operand, 36 bits per operation, out of the budget of 64 bits per operation identified earlier. So out of the remaining 28 bits, perhaps 6-8 bits for opcode? Then various bits for masking and predication...
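The bit budget works out as follows, as a minimal sketch. The 64 bits per operation and the 3 operands plus 1 resultant are the guesses from the paragraph above. How the leftover bits are actually used is unknown.

```python
# Operand-addressing bit budget under the post's assumptions.
OPS_PER_CLAUSE = 128
FIELDS_PER_OP = 4     # 3 source operands + 1 resultant
OP_BITS = 64          # assumed total bits per scalar op

addresses = OPS_PER_CLAUSE * FIELDS_PER_OP      # distinct scalar addresses needed
bits_per_field = (addresses - 1).bit_length()   # bits to encode 0..addresses-1
address_bits = bits_per_field * FIELDS_PER_OP
remaining_bits = OP_BITS - address_bits         # left for opcode, masks, predication

print(addresses)       # 512
print(bits_per_field)  # 9
print(address_bits)    # 36
print(remaining_bits)  # 28
```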
The other issue is that the ALU pipeline needs to be able to address the memory cache (streamout, memexport or just general inter-thread data swapping). It's 8KB and can be either read or written with a granularity of 32 bits (I presume). I guess this "mov" instruction would have a unique destination/source formatting, re-using bits I've already described...
Gah, I wonder if there's "CTM" documentation for R600 out there...
Later in that PDF it says, "1MB of GPR space for fast register access."
For what it's worth various patent documents describe a "low capacity" register file which has been puzzling me for absolutely ages - I guess this tallies with the idea of the "subsidiary register file" I've been talking about. If this is true then the bulk "1MB register file" is actually just a "register file cache", virtualising the full 16384-scalars per thread "register set" that R600 supports. The actual register file implemented per SIMD is a small block of memory:
SIMD processor and addressing method
Here a 10-bit address is provided for a 1KB register file - able to address individual bytes, rather than the scalars (4 bytes) that I've been describing.
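For comparison with the scalar addressing I've been describing, a small sketch of byte versus scalar addressing in a 1KB file. The helper name is invented and the 4-byte scalar size is my assumption, not something from the patent.

```python
# Byte vs. scalar addressing for a 1KB register file.
ADDRESS_BITS = 10
BYTES_PER_SCALAR = 4   # assumed 32-bit scalars

file_bytes = 2 ** ADDRESS_BITS                  # bytes reachable with a 10-bit address
scalars = file_bytes // BYTES_PER_SCALAR        # 32-bit scalars that fit in the file
scalar_index_bits = (scalars - 1).bit_length()  # bits needed if addressing whole scalars


def scalar_to_byte_address(scalar_index):
    """Byte address of the first byte of a given scalar (hypothetical helper)."""
    return scalar_index * BYTES_PER_SCALAR


print(file_bytes)                    # 1024
print(scalars)                       # 256
print(scalar_index_bits)             # 8
print(scalar_to_byte_address(255))   # 1020
```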
---
How big would all the registers for, say, 24 cores of Larrabee be, scalar + vector?
Jawed