AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to upcoming ATI-RV870 lineup GPUs

  • Within 1 or 2 weeks

    Votes: 1 0.6%
  • Within a month

    Votes: 5 3.2%
  • Within couple months

    Votes: 28 18.1%
  • Very late this year

    Votes: 52 33.5%
  • Not until next year

    Votes: 69 44.5%

  • Total voters: 155
  • Poll closed.
Assuming a near-optimal mapping to VLIW for a shader, an equivalent SIMD implementation has to run 4 times more "instances" of the shader, with 4 times the average variable lifetime (because it's 4 times slower; this of course assumes there are no texture accesses in between), and with 4 times less data per variable. 4*4/4 = 4.

If average variable lifetime is almost entirely determined by texture sampling latency, this is of course irrelevant ... but then I didn't say it was always relevant.
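The arithmetic above can be sketched as a toy model. The ×4 factors are the ones assumed in this post for a 4-wide VLIW packet, not measured values:

```python
# Toy model of the register-pressure argument above, assuming a
# 4-wide VLIW packet mapped onto a scalar SIMD design: the SIMD
# version runs 4x more shader instances, each variable lives 4x
# longer (no texture accesses in between), but each variable is
# only 1/4 the width.

def relative_register_pressure(vliw_width=4):
    instances = vliw_width              # 4x more instances in flight
    lifetime = vliw_width               # each variable lives 4x longer
    data_per_variable = 1 / vliw_width  # 4x less data per variable
    return instances * lifetime * data_per_variable

print(relative_register_pressure())  # 4*4/4 = 4.0
```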
 
A VLIW instruction packet's behavior cannot be any different than an equivalent SIMD instruction sequence.

The SIMD solution only takes longer if we assume the SIMD design can only issue one SIMD instruction at a time.

From the register point of view, the two have to be equivalent.
 
Assuming a near-optimal mapping to VLIW for a shader, an equivalent SIMD implementation has to run 4 times more "instances" of the shader, with 4 times the average variable lifetime (because it's 4 times slower; this of course assumes there are no texture accesses in between), and with 4 times less data per variable. 4*4/4 = 4.
Yes, that's correct, but in ATI's case the VLIW registers are also 5 times larger than an NVIDIA 1D register.

I don't know, but I would assume ATI even needs slightly more register space, because they surely have to waste space in the 5D registers sometimes.
 
There is no VLIW/SIMD dichotomy.

Itanium uses a VLIW-type scheme, and it can have multimedia (SIMD) instructions inside of its instruction packets.
 
What is important is the average variable lifetime. A lot of the time it will be dominated by texture sampling, so the fact that the SIMD implementation takes longer than a VLIW+SIMD implementation for a given "instance" of the HLSL shader doesn't really matter.
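The cancellation being claimed here can be made explicit with assumed numbers: if variable lifetime is pinned by texture latency rather than ALU rate, the SIMD design's 4x instance count is cancelled by its 4x-narrower variables (the 200-cycle latency below is illustrative, not a real figure):

```python
# Toy comparison for the texture-latency-dominated case described
# above. When lifetime ~= texture latency for both designs, total
# register pressure (instances x lifetime x variable width) comes
# out the same for VLIW and SIMD.

def pressure(instances, lifetime_cycles, width):
    return instances * lifetime_cycles * width

tex_latency = 200  # hypothetical latency in cycles

vliw = pressure(instances=1, lifetime_cycles=tex_latency, width=4)
simd = pressure(instances=4, lifetime_cycles=tex_latency, width=1)
print(vliw, simd)  # equal: 800 800
```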
 
I don't know, but I would assume ATI even needs slightly more register space, because they surely have to waste space in the 5D registers sometimes.
Registers in ATI are vec4:

http://www.research.ibm.com/people/h/hind/pldi08-tutorial_files/GPGPU.pdf

Fetches from registers can be scalar or anything larger, up to vec4 from the same register. The rules relating to the addressing/sequencing/timing of register fetches are headache-inducing.

The pipeline itself has its own set of registers. These registers hold the result of an instruction until an ALU lane overwrites them. This means that there are, effectively, even more register operands available to any given instruction slot:

Code:
         72  x: MULADD      ____,  R6.x,  R8.x,  PV71.x      
             y: MULADD      R4.y,  R5.w,  R11.w,  PV71.y      
             z: MULADD      ____,  R6.x,  R8.z,  PV71.z      
             w: MULADD      ____,  R5.y,  R7.y,  R32.y      
             t: MULADD      T0.x,  R10.x,  R8.x,  PS71      
         73  x: MULADD      ____,  R6.z,  R9.x,  PV72.x      
             y: MULADD      ____,  R6.z,  R9.w,  T0.y      VEC_021 
             z: MULADD      ____,  R6.z,  R9.z,  PV72.z      
             w: MULADD      ____,  R5.x,  R8.y,  PV72.w      VEC_201 
             t: MULADD      ____,  R10.y,  R7.w,  R15.w

R and T are register file entries (Ts are special temporaries). PV and PS are the pipeline registers. "____" as the resultant indicates that the result of an instruction is only consumed by a succeeding instruction using the PV/PS nomenclature. e.g. PV72.x, .z and .w in instruction 73 are all "____" in instruction 72.
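The "____"/PV convention described above can be sketched as a tiny interpreter (hypothetical and heavily simplified; only the one-group PV lifetime matters here):

```python
# Minimal sketch of how PV pipeline registers carry "____" results
# from one ALU instruction group to the next, as described above.
# A lane result written to "____" never reaches the register file;
# it only exists as PV.<lane> for the following group.

def execute_groups(groups):
    """Each group maps lane -> (dest, fn), where dest is a register
    name or "____" and fn(pv) computes the lane result from the
    previous group's PV values."""
    regs, pv = {}, {}
    for group in groups:
        new_pv = {}
        for lane, (dest, fn) in group.items():
            result = fn(pv)
            new_pv[lane] = result      # always forwarded as PV.<lane>
            if dest != "____":
                regs[dest] = result    # only named dests hit the file
        pv = new_pv                    # old PV values are gone now
    return regs, pv

# Like instruction 72 producing x as "____", and instruction 73
# consuming it via PV72.x:
groups = [
    {"x": ("____", lambda pv: 2 * 3)},
    {"x": ("R4.x", lambda pv: pv["x"] + 1)},
]
regs, _ = execute_groups(groups)
print(regs)  # {'R4.x': 7}
```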

Jawed
 
That's not significantly different from what Itanium does.

Its instruction packets just make this SIMD application more selective.
An instruction slot can go to a SIMD unit, or it can be dispersed to a scalar one, depending on the packet.
 
The pipeline, itself, has its own set of registers. These registers hold the result of an instruction until an ALU lane overwrites.
Those aren't registers; it's the forwarding network. AMD can use it because the instruction scheduling inside a clause is completely predictable, so a value can be pulled straight out of the forwarding network without being written to a register. The fact that it is presented in the assembler code as pseudo-registers is just a convenient abstraction.
 
Semantics ...

Personally, I would call them registers as well ... when I hear register bypass/forwarding network I think about runtime tricks inside the processor not something exposed by the ISA.
 
Personally, I would call them registers as well ... when I hear register bypass/forwarding network I think about runtime tricks inside the processor not something exposed by the ISA.
If the instruction scheduling is completely static and predictable then it's not a 'trick', it's a natural consequence of the hardware design. The ISA is actually exposing the fact that you can read your operands right out of the forwarding network in a predictable manner instead of reading them from the register file.
 
http://v3.espacenet.com/publication...20040427&NR=6728869B1&locale=en_GB&CC=US&FT=D

A method and apparatus for avoiding latency in a processing system that includes a memory for storing intermediate results is presented. The processing system stores results produced by an operation unit in memory, where the results may be used by subsequent dependent operations. In order to avoid the latency of the memory, the output for the operation unit may be routed directly back into the operation unit as a subsequent operand. Furthermore, one or more memory bypass registers are included such that the results produced by the operation unit during recent operations that have not yet satisfied the latency requirements of the memory are also available. A first memory bypass register may thus provide the result of an operation that completed one cycle earlier, a second memory bypass register may provide the result of an operation that completed two cycles earlier, etc.
It's worth noting that an operand need not be consumed on the next "cycle". It persists as long as the lane that produces results is "masked out" (or NOP, erm, not sure now) on later cycles.

It's also worth remembering that the pipeline length of the ALUs is 8, i.e. an operand stored in-pipeline actually persists for multiples of 8 cycles.
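The patent's scheme of bypass registers for results that haven't yet satisfied the memory's write latency can be sketched like this (all names and the 2-cycle latency are illustrative, not from the patent):

```python
from collections import deque

# Hedged sketch of the "memory bypass registers" from the patent
# abstract quoted above: results that have not yet satisfied the
# memory's write latency remain readable from a short chain of
# bypass registers (the result completed 1 cycle earlier, 2 cycles
# earlier, ...), so dependent operations avoid the memory latency.

class BypassedResultStore:
    def __init__(self, write_latency=2):
        self.memory = {}          # latent backing store
        self.bypass = deque()     # newest result first
        self.write_latency = write_latency

    def write(self, addr, value):
        if len(self.bypass) == self.write_latency:
            # Oldest bypassed result has now satisfied the latency:
            # retire it into memory.
            old_addr, old_value = self.bypass.pop()
            self.memory[old_addr] = old_value
        self.bypass.appendleft((addr, value))

    def read(self, addr):
        for a, v in self.bypass:  # forwarding path checked first
            if a == addr:
                return v
        return self.memory[addr]  # latency already satisfied

store = BypassedResultStore(write_latency=2)
store.write("r0", 10)
print(store.read("r0"))  # 10, served from the bypass chain
store.write("r1", 20)
store.write("r2", 30)    # r0 retires into memory
print(store.read("r0"))  # 10, now read from memory
```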

Jawed
 