A reduction in register pressure by ~4x in a common use case?
I'm sure this is a trivial question but why does VLIW inherently reduce register pressure?
AMD has little to prove in this regard IMO.
In games perhaps not, otherwise yes.
Yes, that's correct, but the VLIW registers are also 5 times larger in ATI's case than an NVIDIA 1D register. Assuming a near-optimal mapping to VLIW for a shader, an equivalent SIMD implementation has to run 4 times more "instances" of the shader, with 4 times the average variable lifetime (because it's 4 times slower; this of course assumes there are no texture accesses in between) and 4 times less data per variable. 4 * 4 / 4 = 4.
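That arithmetic can be written out as a quick sanity check. This just restates the assumed ratios from the argument above (it is not measured data):

```python
# Rough register-pressure comparison under the assumptions above:
# a VLIW4 machine finishes a shader "instance" 4x faster than a scalar
# SIMD machine, so the SIMD machine keeps 4x more instances in flight,
# each variable lives 4x longer, but each register holds 4x less data.

vliw_width = 4                            # assumed: 4 scalar slots per VLIW register
instances_ratio = vliw_width              # SIMD runs 4x more instances...
lifetime_ratio = vliw_width               # ...each alive 4x longer...
data_per_register_ratio = 1 / vliw_width  # ...holding 4x less data per register

pressure_ratio = instances_ratio * lifetime_ratio * data_per_register_ratio
print(pressure_ratio)  # 4.0 -> the ~4x register-pressure reduction claimed
```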
A float4 HLSL statement, you mean? Yes. So you mean that for the same statement the NVIDIA compiler needs more registers than ATI does?
ATI is doing exactly the opposite: control flow is SIMD, but the instructions are VLIW.
Registers in ATI are vec4:

72 x: MULADD ____, R6.x, R8.x, PV71.x
   y: MULADD R4.y, R5.w, R11.w, PV71.y
   z: MULADD ____, R6.x, R8.z, PV71.z
   w: MULADD ____, R5.y, R7.y, R32.y
   t: MULADD T0.x, R10.x, R8.x, PS71
73 x: MULADD ____, R6.z, R9.x, PV72.x
   y: MULADD ____, R6.z, R9.w, T0.y VEC_021
   z: MULADD ____, R6.z, R9.z, PV72.z
   w: MULADD ____, R5.x, R8.y, PV72.w VEC_201
   t: MULADD ____, R10.y, R7.w, R15.w

I don't know, but I would assume ATI even needs slightly more register space, because they surely have to waste space in the 5D registers sometimes.
Yes, that's what I meant. One instruction packet is issued per cycle; each instruction element of the packet applies to multiple data items.
Those aren't registers; it's the forwarding network. AMD can use it because the instruction scheduling inside a clause is completely predictable, so a value can be pulled straight out of the forwarding network without having to be written to a register. The fact that it is presented in the assembler code as pseudo-registers is just a convenient abstraction. The pipeline itself has its own set of registers; these hold the result of an instruction until an ALU lane overwrites them.
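To illustrate how the PVnn/PSnn pseudo-registers in the listing above behave, here is a toy model. The class and names are my own invention, not anything from the actual hardware or toolchain; it only captures the idea that reading PV/PS skips the register file:

```python
# Toy model of the result-forwarding path exposed by the R600-style ISA.
# PV holds the previous instruction group's x/y/z/w results, PS the
# previous transcendental (t-lane) result.

class ForwardingNetwork:
    def __init__(self):
        self.pv = {}    # lane name -> last result of the previous VLIW group
        self.ps = None  # last result of the t (transcendental) lane

    def retire_group(self, lane_results, t_result=None):
        """Latch one VLIW group's outputs; they become PV/PS for the next group."""
        self.pv = dict(lane_results)
        if t_result is not None:
            self.ps = t_result

    def read(self, operand):
        # e.g. "PV.x" reads lane x of the previous group, "PS" reads the t lane
        if operand == "PS":
            return self.ps
        lane = operand.split(".")[1]
        return self.pv[lane]

fw = ForwardingNetwork()
fw.retire_group({"x": 1.0, "y": 2.0}, t_result=5.0)
print(fw.read("PV.x"), fw.read("PS"))  # 1.0 5.0
```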
If the instruction scheduling is completely static and predictable, then it's not a 'trick', it's a natural consequence of the hardware design. The ISA is actually exposing the fact that you can read your operands right out of the forwarding network in a predictable manner instead of reading them from the register file. Personally, I would call them registers as well... when I hear "register bypass/forwarding network" I think of runtime tricks inside the processor, not something exposed by the ISA.
It's worth noting that an operand need not be consumed on the next "cycle". It persists as long as the lane that produces results is "masked out" (or NOP, erm, not sure now) on later cycles.

A method and apparatus for avoiding latency in a processing system that includes a memory for storing intermediate results is presented. The processing system stores results produced by an operation unit in memory, where the results may be used by subsequent dependent operations. In order to avoid the latency of the memory, the output of the operation unit may be routed directly back into the operation unit as a subsequent operand. Furthermore, one or more memory bypass registers are included such that the results produced by the operation unit during recent operations that have not yet satisfied the latency requirements of the memory are also available. A first memory bypass register may thus provide the result of an operation that completed one cycle earlier, a second memory bypass register may provide the result of an operation that completed two cycles earlier, etc.
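The bypass scheme described in that passage can be sketched as a small shift chain. This is only my reading of the text, with invented names, assuming one bypass register per cycle of memory write latency:

```python
from collections import deque

# Sketch of the memory-bypass registers: results that have not yet
# satisfied the memory's write latency stay readable from a short chain
# (result from 1 cycle ago, 2 cycles ago, ...), instead of stalling.

class BypassChain:
    def __init__(self, latency_cycles):
        # one bypass register per cycle of memory write latency
        self.chain = deque([None] * latency_cycles, maxlen=latency_cycles)

    def clock(self, new_result):
        # Advance one cycle: the newest result enters the first bypass
        # register, the oldest becomes visible in memory and drops out.
        self.chain.appendleft(new_result)

    def read(self, cycles_ago):
        # cycles_ago=1 -> result of the operation that completed one
        # cycle earlier, cycles_ago=2 -> two cycles earlier, etc.
        return self.chain[cycles_ago - 1]

bp = BypassChain(latency_cycles=2)
bp.clock("r0")
bp.clock("r1")
print(bp.read(1), bp.read(2))  # r1 r0
```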