A reduction in register pressure by ~4x in a common use case?
I'm sure this is a trivial question but why does VLIW inherently reduce register pressure?
AMD has little to prove in this regard IMO.
In games perhaps not, otherwise yes.
Yes, that's correct, but the VLIW registers are also 5 times larger in ATI's case than an NVIDIA 1D register. Assuming a near-optimal mapping to VLIW for a shader, an equivalent SIMD implementation has to run 4 times more "instances" of the shader, with 4 times the average variable lifetime (because it's 4 times slower; this of course assumes there are no texture accesses in between) and 4 times less data per variable. 4 * 4 / 4 = 4.
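That arithmetic can be written out as a quick sanity check. This just restates the assumed ratios from the argument above (it is not measured data):

```python
# Rough register-pressure comparison under the assumptions above:
# a VLIW4 machine finishes a shader "instance" 4x faster than a scalar
# SIMD machine, so the SIMD machine keeps 4x more instances in flight,
# each variable lives 4x longer, but each register holds 4x less data.

vliw_width = 4                            # assumed: 4 scalar slots per VLIW register
instances_ratio = vliw_width              # SIMD runs 4x more instances...
lifetime_ratio = vliw_width               # ...each alive 4x longer...
data_per_register_ratio = 1 / vliw_width  # ...holding 4x less data per register

pressure_ratio = instances_ratio * lifetime_ratio * data_per_register_ratio
print(pressure_ratio)  # 4.0 -> the ~4x register-pressure reduction claimed
```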
A float4 HLSL statement, you mean? Yes. So you mean that for the same statement the NVIDIA compiler needs more registers than ATI does?
ATI is doing exactly the opposite: control flow is SIMD, but the instructions are VLIW.
Registers in ATI are vec4:

72 x: MULADD ____, R6.x, R8.x, PV71.x
   y: MULADD R4.y, R5.w, R11.w, PV71.y
   z: MULADD ____, R6.x, R8.z, PV71.z
   w: MULADD ____, R5.y, R7.y, R32.y
   t: MULADD T0.x, R10.x, R8.x, PS71
73 x: MULADD ____, R6.z, R9.x, PV72.x
   y: MULADD ____, R6.z, R9.w, T0.y VEC_021
   z: MULADD ____, R6.z, R9.z, PV72.z
   w: MULADD ____, R5.x, R8.y, PV72.w VEC_201
   t: MULADD ____, R10.y, R7.w, R15.w

I don't know, but I would assume ATI even needs slightly more register space, because they surely have to waste space in the 5D registers sometimes.
Yes, that's what I meant. One instruction packet is issued per cycle; each instruction element of the packet applies to multiple data items.
Those aren't registers; it's the forwarding network. AMD can use it because the instruction scheduling inside a clause is completely predictable, so a value can be pulled straight out of the forwarding network without having to be written to a register. The fact that it is presented in the assembler code as pseudo-registers is just a convenient abstraction. The pipeline itself has its own set of registers; these hold the result of an instruction until an ALU lane overwrites them.
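To illustrate how the PVnn/PSnn pseudo-registers in the listing above behave, here is a toy model. The class and names are my own invention, not anything from the actual hardware or toolchain; it only captures the idea that reading PV/PS skips the register file:

```python
# Toy model of the result-forwarding path exposed by the R600-style ISA.
# PV holds the previous instruction group's x/y/z/w results, PS the
# previous transcendental (t-lane) result.

class ForwardingNetwork:
    def __init__(self):
        self.pv = {}    # lane name -> last result of the previous VLIW group
        self.ps = None  # last result of the t (transcendental) lane

    def retire_group(self, lane_results, t_result=None):
        """Latch one VLIW group's outputs; they become PV/PS for the next group."""
        self.pv = dict(lane_results)
        if t_result is not None:
            self.ps = t_result

    def read(self, operand):
        # e.g. "PV.x" reads lane x of the previous group, "PS" reads the t lane
        if operand == "PS":
            return self.ps
        lane = operand.split(".")[1]
        return self.pv[lane]

fw = ForwardingNetwork()
fw.retire_group({"x": 1.0, "y": 2.0}, t_result=5.0)
print(fw.read("PV.x"), fw.read("PS"))  # 1.0 5.0
```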
If the instruction scheduling is completely static and predictable, then it's not a 'trick', it's a natural consequence of the hardware design. The ISA is actually exposing the fact that you can read your operands right out of the forwarding network in a predictable manner instead of reading them from the register file. Personally, I would call them registers as well... when I hear "register bypass/forwarding network" I think of runtime tricks inside the processor, not something exposed by the ISA.
It's worth noting that an operand need not be consumed on the next "cycle". It persists as long as the lane that produces results is "masked out" (or NOP, erm, not sure now) on later cycles.

A method and apparatus for avoiding latency in a processing system that includes a memory for storing intermediate results is presented. The processing system stores results produced by an operation unit in memory, where the results may be used by subsequent dependent operations. In order to avoid the latency of the memory, the output of the operation unit may be routed directly back into the operation unit as a subsequent operand. Furthermore, one or more memory bypass registers are included such that the results produced by the operation unit during recent operations that have not yet satisfied the latency requirements of the memory are also available. A first memory bypass register may thus provide the result of an operation that completed one cycle earlier, a second memory bypass register may provide the result of an operation that completed two cycles earlier, etc.
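The bypass scheme described in that passage can be sketched as a small shift chain. This is only my reading of the text, with invented names, assuming one bypass register per cycle of memory write latency:

```python
from collections import deque

# Sketch of the memory-bypass registers: results that have not yet
# satisfied the memory's write latency stay readable from a short chain
# (result from 1 cycle ago, 2 cycles ago, ...), instead of stalling.

class BypassChain:
    def __init__(self, latency_cycles):
        # one bypass register per cycle of memory write latency
        self.chain = deque([None] * latency_cycles, maxlen=latency_cycles)

    def clock(self, new_result):
        # Advance one cycle: the newest result enters the first bypass
        # register, the oldest becomes visible in memory and drops out.
        self.chain.appendleft(new_result)

    def read(self, cycles_ago):
        # cycles_ago=1 -> result of the operation that completed one
        # cycle earlier, cycles_ago=2 -> two cycles earlier, etc.
        return self.chain[cycles_ago - 1]

bp = BypassChain(latency_cycles=2)
bp.clock("r0")
bp.clock("r1")
print(bp.read(1), bp.read(2))  # r1 r0
```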