Larrabee: Samples in Late 08, Products in 2H09/1H10

It appears R600 does try to reduce the amount of register data moves, even if it should complicate addressing.

Is the R600 ISA document part of the SDK, or can it be found separately?
 
It appears R600 does try to reduce the amount of register data moves, even if it should complicate addressing.
I think it's more relevant to see the clause temporaries as enabling significant savings in per-thread register allocations, increasing the number of in-flight threads.

Is the R600 ISA document part of the SDK, or can it be found separately?
As far as I can tell it is only available as part of the SDK, which is annoying and short-sighted. I was able to install the SDK even though my driver is out of date (7.7 or 7.8 I think). Dunno what would happen if you don't have an ATI GPU.

Jawed
 
How are the clause temporaries addressed compared to standard registers?

Hard-wiring the highest 4 vec4 register addresses to be temporaries would be the simplest way to encode things, as in it changes nothing, but that would leave it up to the compiler/coder to make sure that switching out the clause doesn't wreck everything.
Leaving correctness up to the precise timing of the code sequence would be very old-school VLIW, if that were the case.

On the other hand, that would limit future expandability, as any further increase of scratch space capacity would eat into the normal register space.

Interestingly, 4 vec4 registers would translate to 16 32-bit registers (like another ISA has for general-purpose registers). If the temp references are encoded differently and ALUs can reference the temp and permanent sections, it would almost be like an x86 mem-reg operation, though hopefully entirely on chip and deterministic at the time of reference.
 
How are the clause temporaries addressed compared to standard registers?
I haven't read enough about addressing to give you an answer yet. Quoting table 2-6 on page 15:

GPRs:
  • number per thread : 127 minus 2 times Clause-Temporary GPRs
  • Each thread has access to up to 127 GPRs, minus two times the number of Clause-Temporary GPRs. Four GPRs are reserved as Clause-Temporary GPRs that persist only for one ALU clause (and therefore are not accessible to fetch and export units). GPRs may hold data in one of several formats: the ALU can work with 32-bit IEEE floats (S23E8 format with special values), 32-bit unsigned integers, and 32-bit signed integers.
Clause-Temporary GPRs:
  • number per thread : 4
  • GPRs containing clause-temporary variables. The number of clause-temporary GPRs used by each thread reduces the total number of GPRs available to the thread, as described immediately above.
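
To make the arithmetic in that table entry concrete, here is a minimal sketch in Python. It only restates the "127 minus 2 times Clause-Temporary GPRs" rule quoted above; the function name and the constant names are made up for illustration and do not come from the ISA document.

```python
# Per-thread GPR ceiling, as quoted from table 2-6.
MAX_GPRS_PER_THREAD = 127
# Each clause-temporary GPR costs two entries, per the same table.
COST_PER_CLAUSE_TEMP = 2


def gprs_available(clause_temp_count: int) -> int:
    """Return the ordinary GPRs left to a thread (hypothetical helper)."""
    if clause_temp_count < 0:
        raise ValueError("clause_temp_count must be non-negative")
    return MAX_GPRS_PER_THREAD - COST_PER_CLAUSE_TEMP * clause_temp_count


if __name__ == "__main__":
    # The table gives 4 clause temporaries per thread.
    for count in range(5):
        print(count, "CTs ->", gprs_available(count), "GPRs")
```

With the 4 clause temporaries the table lists, that leaves 119 ordinary GPRs per thread.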
Hard-wiring the highest 4 vec4 register addresses to be temporaries would be the simplest way to encode things, as in it changes nothing, but that would leave it up to the compiler/coder to make sure that switching out the clause doesn't wreck everything.
Leaving correctness up to the precise timing of the code sequence would be very old-school VLIW, if that were the case.
The compiler, as far as I can tell, explicitly encodes for regular versus CT registers.

On the other hand, that would limit future expandability, as any further increase of scratch space capacity would eat into the normal register space.
The rate at which CTs consume register file space is very low, since it's only the number of objects per thread * the number of threads active in the ALU pipeline (in R600 this is 64*2*CT-count). As I commented earlier, I suspect RV670 has more capacity for CTs, but that's only based on GPUSA output that I don't fully understand.
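
As a rough worked version of that 64*2*CT-count figure, here is a small Python sketch. The constant and function names are hypothetical, and the byte figure assumes each CT is a vec4 of 32-bit values (the "4 vec4 registers = 16 32-bit registers" reading from earlier in the thread), so treat it as an estimate rather than a documented number.

```python
OBJECTS_PER_THREAD = 64        # objects per thread on R600, per the post
THREADS_IN_ALU_PIPELINE = 2    # threads active in the ALU pipeline, per the post
BYTES_PER_VEC4 = 4 * 4         # assumption: four 32-bit components per register


def ct_register_entries(ct_count: int) -> int:
    """vec4 register-file entries consumed by clause temporaries."""
    return OBJECTS_PER_THREAD * THREADS_IN_ALU_PIPELINE * ct_count


def ct_register_bytes(ct_count: int) -> int:
    """Same cost in bytes, under the vec4-of-32-bit assumption."""
    return ct_register_entries(ct_count) * BYTES_PER_VEC4


if __name__ == "__main__":
    print(ct_register_entries(4), "entries,", ct_register_bytes(4), "bytes")
```

For 4 CTs that works out to 512 vec4 entries, or 8192 bytes under the stated assumption.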

Interestingly, 4 vec4 registers would translate to 16 32-bit registers (like another ISA has for general-purpose registers). If the temp references are encoded differently and ALUs can reference the temp and permanent sections, it would almost be like an x86 mem-reg operation, though hopefully entirely on chip and deterministic at the time of reference.
If you look at GPUSA output you'll see quite clearly that GPRs and CTs are mixed "freely". I've discovered that there are restrictions on the sequence of operands issued in a 5-op ALU Instruction Group, but I don't understand them yet...

Code:
    212  x: ADD  R0.x,  PV(211).z,  C2.z      
         y: MAX  R123.y,  PV(211).y,  0.0f      
         z: ADD  R5.z,  PV(211).x,  C2.z      
         w: ADD  R123.w,  R2.z, -PV(211).w      
         t: MUL  R122.z,  R124.x,  C2.w      
    213  x: MUL  R124.x,  R127.w,  C2.w      
         y: ADD  R14.y,  PS(212).x,  R3.y      
         z: ADD  R11.z,  PV(212).w,  C2.z      
         w: MUL  R123.w,  PV(212).y,  PV(212).y      
         t: ADD  R14.x,  R126.w,  R3.w

GPRs:
  • R0
  • R2
  • R3
  • R5
  • R11
  • R14
CTs:
  • R122
  • R123
  • R124
  • R126
"Previous" registers (what I've called pipeline registers in the past). Note that the index refers to the instruction that produced the result, which is always the immediately preceding instruction:
  • PV(211) "V" refers to the vector of four lane resultants X, Y, Z, W
  • PV(212)
  • PS(212) "S" refers to the scalar T resultant (confusingly always referred to as "X")
The CT assignment is suspect, by the way, because the entire clause also refers to R125 and R127 (6 CTs in total), which is more than the supposed limit of 4. Sigh.
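
For anyone who wants to repeat this kind of tally on their own GPUSA output, here is a minimal Python sketch that sorts the operands of the two instruction groups above into GPRs, CTs and "previous" registers. The cut-off `CT_FIRST_INDEX = 122` is purely an assumption inferred from the registers named in this post, not something taken from the ISA document, and the function name is hypothetical.

```python
import re

# The two ALU instruction groups from the listing above.
LISTING = """\
212  x: ADD  R0.x,  PV(211).z,  C2.z
     y: MAX  R123.y,  PV(211).y,  0.0f
     z: ADD  R5.z,  PV(211).x,  C2.z
     w: ADD  R123.w,  R2.z, -PV(211).w
     t: MUL  R122.z,  R124.x,  C2.w
213  x: MUL  R124.x,  R127.w,  C2.w
     y: ADD  R14.y,  PS(212).x,  R3.y
     z: ADD  R11.z,  PV(212).w,  C2.z
     w: MUL  R123.w,  PV(212).y,  PV(212).y
     t: ADD  R14.x,  R126.w,  R3.w
"""

# Assumption: registers numbered 122 and up are clause temporaries.
CT_FIRST_INDEX = 122


def classify_registers(listing: str):
    """Return (gprs, cts, previous) found anywhere in the listing."""
    gprs, cts, previous = set(), set(), set()
    # Rn.c operands, as either destination or source.
    for match in re.finditer(r"\bR(\d+)\.[xyzw]", listing):
        index = int(match.group(1))
        (cts if index >= CT_FIRST_INDEX else gprs).add(index)
    # PV(n) / PS(n) "previous" results.
    for match in re.finditer(r"\b(P[VS])\((\d+)\)", listing):
        previous.add(f"{match.group(1)}({match.group(2)})")
    return sorted(gprs), sorted(cts), sorted(previous)


if __name__ == "__main__":
    gprs, cts, previous = classify_registers(LISTING)
    print("GPRs:", gprs)
    print("CTs: ", cts)
    print("Prev:", previous)
```

Under that assumed cut-off, these two groups alone already touch five high-numbered registers, since R127.w appears as a source in instruction 213.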

Jawed
 