AMD: R9xx Speculation

mczak · Oct 18, 2010

Sontin said:
Don't forget the Front-End with the 25% higher clock rate.

And things like internal L2-L1 bandwidth (if the bus stays the same, that is).

DarthShader · Oct 18, 2010

nApoleon says 1120sp are official for XT: http://www.chiphell.com/thread-130363-1-1.html

Meanwhile pclab.pl posted more pics of XFX cards, saying the 6870 has 1280 shaders: http://pclab.pl/news43574.html . Funnily, they are unsure about the 6850's final clocks, either 725 or 775 MHz.

Speculation:

Barts chip has indeed 1280sp's, but the 6870 and 6850 are salvage parts with parts of the chip deactivated. They seem to be "good enough" anyways, while the full Barts chips can be saved for either a 6890 or a dual card (Antilles?). That would explain the architecture quirks too and would explain all this confusion.

jimbo75 · Oct 18, 2010

Honestly AMD must be pissing themselves with laughter.

Gipsel · Oct 18, 2010

Anyone asking what happened to the rumored 4 slot VLIW units? Guess you will see some ISA code for that tonight

An RCP will use 3 slots, the same as a SQRT or an RSQ:

Code:

x: RCP_sat     R2.x,  R1.y      
y: RCP_sat     ____,  R1.y      
z: RCP_sat     ____,  R1.y

A sine is done by a multiplication with 1/(2PI) followed by another 3 slot instruction (why the hell is the rescaling with 2PI necessary?)

Code:

          3  x: MUL         ____,  R1.x,  (0x3E22F983, 0.1591549367f).x
          4  x: SIN         R2.x,  PV3.x      
             y: SIN         ____,  PV3.x      
             z: SIN         ____,  PV3.x

a 32bit integer MUL takes the full 4 slots:

Code:

x: MULLO_INT   R2.x,  R1.x,  R1.x      
y: MULLO_INT   ____,  R1.x,  R1.x      
z: MULLO_INT   ____,  R1.x,  R1.x      
w: MULLO_INT   ____,  R1.x,  R1.x

a 24bit integer MUL takes a single slot as expected

Code:

          3  x: MUL_INT24__NI  R2.x,  R1.x,  R1.x
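
As a quick sanity check on the literal in the sine listing above: the hex word 0x3E22F983 is the IEEE-754 single-precision bit pattern that the disassembler prints as 0.1591549367f, and it matches 1/(2PI) to float precision. That suggests the SIN instruction expects its argument in units of full periods rather than radians, although that reading is an inference from the listing and not something stated in the thread. A minimal Python sketch of the decoding (the helper name `bits_to_float` is made up for illustration):

```python
import math
import struct

def bits_to_float(word):
    """Reinterpret a 32-bit integer as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", word))[0]

# Literal from the MUL that feeds the SIN in the listing above.
inv_two_pi = bits_to_float(0x3E22F983)

# Print the decoded literal next to the double-precision value of 1/(2*pi).
print(inv_two_pi, 1.0 / (2.0 * math.pi))
```

The same helper can be pointed at the other literals in the thread, for example 0x3FB8AA3B and 0x3F317218 from the longer listings.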

Jawed · Oct 18, 2010

You sneaky chappy! Hmm, very interesting.

Gipsel · Oct 18, 2010

Jawed said:
You sneaky chappy! Hmm, very interesting.

Any wishes? You can send me IL code and I will show you what it will look like in ISA.

Btw., it really looks like you were right.

Kaotik · Oct 18, 2010

DarthShader said:
Barts chip has indeed 1280sp's, but the 6870 and 6850 are salvage parts with parts of the chip deactivated. They seem to be "good enough" anyways, while the full Barts chips can be saved for either a 6890 or a dual card (Antilles?). That would explain the architecture quirks too and would explain all this confusion.

The drivers still state clearly that Antilles = Cayman based, not Barts.

For someone who doesn't understand crap about shader code etc., what exactly does Gipsel's post mean?

DarthShader · Oct 18, 2010

Kaotik said:
For someone who doesn't understand crap about shader code etc., what exactly does Gipsel's post mean?

That's what I am wondering too.

But I think I got it: RCP = reciprocal. And if it takes three slots, starting from x, then it means the shaders have indeed been buffed, but not in the way I'd like them to be. Still, it could be pretty efficient if there's a 4D VLIW structure.

Rebel44 · Oct 18, 2010

LordEC911 said:
On a slight tangent- Was helping out a Fry's Electronics employee at my work, started chatting about his job, he does sales in computer accessories. Just asked him if he had heard about AMD releasing a new product. He said that the last week or so employees kept seeing them in the back and putting them out on the shelf so they finally locked them up. I was like damn, could have gotten one early.

I'll have to try and find one of those clueless employees a week or two before Cayman is released.

As long as the product isn't really locked up and it already has a bar code, it's easier to bribe some random employee with a few $ and he will bring it to you - at least that is my experience from retail.

Gipsel · Oct 19, 2010

Kaotik said:
For someone who doesn't understand crap about shader code etc., what exactly does Gipsel's post mean?

It means that there is one member of a future GPU line from AMD which indeed has 4 slot VLIW units. The t slot got deleted and the other ones are beefed up a bit. Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (as is already the case for double precision). A reciprocal is done by 3 slots.

And as expected, double precision is 1/4 of SP, at least for FMA and MUL, for ADD it's 1/2.

mczak · Oct 19, 2010

Gipsel said:
It means that there is one member of a future GPU line from AMD which indeed has 4 slot VLIW units. The t slot got deleted and the other ones are beefed up a bit. Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (as is already the case for double precision). A reciprocal is done by 3 slots.

So if that's VLIW-4, how many SIMDs?
If even RCP is taking 3 slots, are the other transcendentals taking all 4 (given that RCP should be the cheapest one)?

Arty · Oct 19, 2010

Gipsel said:
It means that there is one member of a future GPU line from AMD which indeed has 4 slot VLIW units. The t slot got deleted and the other ones are beefed up a bit. Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (as is already the case for double precision). A reciprocal is done by 3 slots.

And as expected, double precision is 1/4 of SP, at least for FMA and MUL, for ADD it's 1/2.

DarthShader · Oct 19, 2010

Gipsel said:
Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (as is already the case for double precision). A reciprocal is done by 3 slots.

How is this different from the situation where the t-unit used other lanes for help? I can only see a benefit if a transcendental takes up only two slots, so that two could be performed in parallel.

And as expected, double precision is 1/4 of SP, at least for FMA and MUL, for ADD it's 1/2.

Double precision for Barts enabled? Would be big news for me!

Or do you mean "one member of a future GPU line from AMD"?

Jawed · Oct 19, 2010

Gipsel said:
Any wishes? You can send me IL code and I will show you what it will look like in ISA.

Ooh, thanks.

Here's something old that I've tweaked for extra transcendentals:

Code:

il_ps_2_0
dcldef_x(*)_y(*)_z(*)_w(*) r0
def c0, 0.6931471825, 1.442695022, 0.0, 0.0
dclpin_usage(color)_usageIndex(10)_x(*)_y(*)_z(*)_w(*)_centroid vPixIn0
dclpin_usage(color)_usageIndex(26)_x(*)_y(*)_z(*)_w(*)_centroid vPixIn1
mov r0, vPixIn0
add r1, r0, vPixIn1
log_zeroop(fltmax) r2.x___, r1.x_abs
log_zeroop(fltmax) r2._y__, r1.y_abs
log_zeroop(fltmax) r2.__z_, r1.z_abs
log_zeroop(fltmax) r2.___w, r1.w_abs
mul r1, r2, c0.x
mul r0, r0, vPixIn1
mul r0, r0, c0.y
exp r0.x___, r0.x
rcp_zeroop(infinity) r2.x___, r0.x
exp r0.x___, r0.y
rcp_zeroop(infinity) r2._y__, r0.x
exp r0.x___, r0.z
exp r0._y__, r0.w
rcp_zeroop(infinity) r2.___w, r0.y
rcp_zeroop(infinity) r2.__z_, r0.x
mul r0, r1, r2
dp4 r1.x___, r0, r0
rsq_zeroop(infinity) r1.x___, r1.x_abs
mul r37, r0, r1.x
colorclamp oC0, r37
end

A comparison of VLIW-5 and VLIW-4 would be fairly interesting.

Btw., it really looks like you were right.

I'm dead chuffed. Maybe I should retire, got something right for a change.

Gipsel · Oct 19, 2010

Arty said:

Maybe I should say at least one. Unfortunately I cannot connect names to it right now.

3dilettante · Oct 19, 2010

What is the throughput of RCP when issued through the current T lane?

I'm trying to think of what sequence of operations can be done across three lanes.
A successive approximation of an RCP would take 5 iterations to yield a single-precision result if using plain math ops. Perhaps a more elaborate method is used?
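
For reference, the successive approximation in question can be sketched with the textbook Newton-Raphson reciprocal step x' = x*(2 - d*x), which squares the relative error on every iteration. This Python sketch only illustrates why a crude starting guess needs about five iterations; the function name, the input and the starting guess are assumptions picked for the example, and nothing here is a claim about what the hardware does.

```python
def newton_reciprocal(d, x0, iterations):
    """Return the successive Newton-Raphson estimates of 1/d, starting from x0."""
    estimates = [x0]
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - d * x)  # each step needs only multiplies and a subtract
        estimates.append(x)
    return estimates

d = 2.0
# x0 = 0.25 gives a relative error of 0.5, i.e. a guess good to about one bit.
steps = newton_reciprocal(d, x0=0.25, iterations=5)
for n, x in enumerate(steps):
    # The relative error 1 - d*x squares with every iteration.
    print(n, x, abs(1.0 - d * x))
```

With a one-bit starting guess the error after four iterations is still around 2^-16, and only the fifth iteration gets past the 24-bit single-precision mantissa. A better initial estimate (from a small table, say) would cut the iteration count.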

Jawed · Oct 19, 2010

3dilettante said:
What is the throughput of RCP when issued through the current T lane?

1 cycle, for one scalar per cycle.

I'm trying to think of what sequence of operations can be done across three lanes.
A successive approximation of an RCP would take 5 iterations to yield a single-precision result if using plain math ops. Perhaps a more elaborate method is used?

My original proposal was to take the look-up tables and spread them amongst the four lanes. Then use the single-cycle dependent math capability of the 4 lanes, working together, to perform all the math (Lagrange polynomials).

http://forum.beyond3d.com/showthread.php?p=1417026#post1417026

Back then the question became one of throughput. Also, it's worth bearing in mind that for graphics the error can be quite large (it's single precision for a start).

I didn't put any trig in the shader earlier, because that gets swamped by operand-normalisation (or at least it does some of the time - not sure, something I haven't studied).
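
The look-up-table idea can be illustrated in software. The Python sketch below approximates 1/m on [1, 2) with a quadratic Lagrange polynomial through three tabulated points per segment. The segment count, the three-point layout and the function name are all assumptions made for the example; the thread does not confirm that the hardware works this way or how the tables would be split across lanes.

```python
SEGMENTS = 64  # assumed table size, chosen only for illustration

def table_reciprocal(m):
    """Approximate 1/m for m in [1, 2) by quadratic Lagrange interpolation
    through three tabulated points of the segment containing m."""
    h = 1.0 / SEGMENTS
    seg = min(int((m - 1.0) * SEGMENTS), SEGMENTS - 1)
    x0 = 1.0 + seg * h
    x1 = x0 + 0.5 * h
    x2 = x0 + h
    # In hardware these three values would be read from a look-up table.
    y0, y1, y2 = 1.0 / x0, 1.0 / x1, 1.0 / x2
    # Lagrange basis polynomials for the nodes x0, x1, x2.
    l0 = (m - x1) * (m - x2) / ((x0 - x1) * (x0 - x2))
    l1 = (m - x0) * (m - x2) / ((x1 - x0) * (x1 - x2))
    l2 = (m - x0) * (m - x1) / ((x2 - x0) * (x2 - x1))
    return y0 * l0 + y1 * l1 + y2 * l2

# Worst relative error over a grid of mantissa values in [1, 2).
worst = max(abs(table_reciprocal(1.0 + i / 4096.0) * (1.0 + i / 4096.0) - 1.0)
            for i in range(4096))
print(worst)
```

Unlike an iterative scheme, the polynomial is a fixed, short chain of multiplies and adds, which is the kind of work that could in principle be spread over cooperating lanes.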

Gipsel · Oct 19, 2010

Jawed said:
Ooh, thanks.

Here's something old that I've tweaked for extra transcendentals:
[..]
A comparison of VLIW-5 and VLIW-4 would be fairly interesting.

There was a slight problem with it: I didn't get it to compile with the "def c0" in it. The error message wasn't telling too much (unsupported opcode, without saying which one), so I started to rip out everything (and it worked after exchanging c0 with a literal). What was left is this (edit: I just see a bit of other stuff got pasted in, but outside of the loop, so it shouldn't matter):

Code:

il_ps_2_0
dcl_literal l1, 0.6931471825, 1.442695022, 0.0, 0.0
dcl_literal l0, 0x10000, 0xffffffff, 2.1, 0.01
mov r13.x, l0.x ; loop counter
mov r1.xyzw,l0.zzzz ; arbitrary values, r1 = 2*2-2 = 2
mov r2.xyzw,l0.wwww ; r2 = 0*0+0 = 0
mov r0, l0
add r1, r0, l0.wzyx
whileloop
break_logicalz r13 ; while(r13 > 0)
log_zeroop(fltmax) r2.x___, r1.x_abs
log_zeroop(fltmax) r2._y__, r1.y_abs
log_zeroop(fltmax) r2.__z_, r1.z_abs
log_zeroop(fltmax) r2.___w, r1.w_abs
mul r1, r2, l1.x
mul r0, r0, l0.wzyx
mul r0, r0, l1.y
exp r0.x___, r0.x
rcp_zeroop(infinity) r2.x___, r0.x
exp r0.x___, r0.y
rcp_zeroop(infinity) r2._y__, r0.x
exp r0.x___, r0.z
exp r0._y__, r0.w
rcp_zeroop(infinity) r2.___w, r0.y
rcp_zeroop(infinity) r2.__z_, r0.x
mul r0, r1, r2
dp4 r1.x___, r0, r0
rsq_zeroop(infinity) r1.x___, r1.x_abs
mul r1, r0, r1.x
iadd r13.x, r13.x, l0.y ; counter--
endloop
mov g[0], r1
end

Still okay?
On Cypress the ISA looks like this:

Code:

; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(13) 
      0  x: MOV         R0.x,  (0x00010000, 9.183549616e-41f).x      
         y: MOV         R0.y,  (0xFFFFFFFF, -1.#QNANf).y      
         z: MOV         R0.z,  (0x40066666, 2.099999905f).z      
         w: MOV         R0.w,  (0x3C23D70A, 0.009999999776f).w      
         t: MOV         R1.x,  (0x3C23D70A, 0.009999999776f).w      
      1  x: MOV         R2.x,  (0x00010000, 9.183549616e-41f).x      
         y: MOV         R1.y,  (0xFFFFFFFF, -1.#QNANf).y      
         z: MOV         R1.z,  (0xFFFFFFFF, -1.#QNANf).y      
         w: MOV         R1.w,  (0x3C23D70A, 0.009999999776f).z      
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX 
    02 ALU_BREAK: ADDR(45) CNT(1) 
          2  x: PREDNE_INT  ____,  R2.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
    03 ALU: ADDR(46) CNT(44) 
          3  x: MUL         ____,  R0.z,  (0xFFFFFFFF, -1.#QNANf).x      
             y: MUL         ____,  R0.y,  (0x40066666, 2.099999905f).y      
             z: MUL         ____,  R0.x,  (0x3C23D70A, 0.009999999776f).z      
             w: MUL         ____,  R0.w,  (0x00010000, 9.183549616e-41f).w      
             t: LOG_sat     T0.z,  |R1.x|      
          4  x: MUL         T0.x,  PV3.x,  (0x3FB8AA3B, 1.442695022f).x      
             y: MUL         T0.y,  PV3.y,  (0x3FB8AA3B, 1.442695022f).x      
             z: MUL         T1.z,  PV3.z,  (0x3FB8AA3B, 1.442695022f).x      
             w: MUL         T0.w,  PV3.w,  (0x3FB8AA3B, 1.442695022f).x      
             t: LOG_sat     ____,  |R1.y|      
          5  x: ADD_INT     R2.x,  -1,  R2.x      
             y: MUL         T1.y,  PS4,  (0x3F317218, 0.6931471825f).x      
             z: MUL         T0.z,  T0.z,  (0x3F317218, 0.6931471825f).x      
             t: LOG_sat     ____,  |R1.z|      
          6  x: MUL         T2.x,  PS5,  (0x3F317218, 0.6931471825f).x      
             t: LOG_sat     ____,  |R1.w|      
          7  w: MUL         T1.w,  PS6,  (0x3F317218, 0.6931471825f).x      
             t: EXP_e       T1.z,  T1.z      
          8  t: EXP_e       T1.x,  T0.y      
          9  t: EXP_e       T2.z,  T0.x      
         10  t: EXP_e       T0.y,  T0.w      
         11  t: RCP_e       ____,  T1.z      
         12  x: MUL         R0.x,  T0.z,  PS11      
             t: RCP_e       ____,  T1.x      
         13  y: MUL         R0.y,  T1.y,  PS12      
             t: RCP_e       T1.x,  T0.y      
         14  t: RCP_e       ____,  T2.z      
         15  z: MUL         R0.z,  T2.x,  PS14      
             w: MUL         R0.w,  T1.w,  T1.x      
         16  x: DOT4        ____,  R0.x,  R0.x      
             y: DOT4        ____,  R0.y,  R0.y      
             z: DOT4        ____,  PV15.z,  PV15.z      
             w: DOT4        ____,  PV15.w,  PV15.w      
         17  t: RSQ_e       ____,  |PV16.x|      
         18  x: MUL         R1.x,  R0.x,  PS17      
             y: MUL         R1.y,  R0.y,  PS17      
             z: MUL         R1.z,  R0.z,  PS17      
             w: MUL         R1.w,  R0.w,  PS17      
04 ENDLOOP i0 PASS_JUMP_ADDR(2) 
05 MEM_EXPORT_WRITE: DWORD_PTR[0], R1, ELEM_SIZE(3) 
06 ALU: ADDR(90) CNT(4) 
     19  x: MOV         R0.x,  0.0f      
         y: MOV         R0.y,  0.0f      
         z: MOV         R0.z,  0.0f      
         w: MOV         R0.w,  0.0f      
07 EXP_DONE: PIX0, R0
END_OF_PROGRAM

On that future 4 slot VLIW architecture it looks like this:

Code:

; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(13) 
      0  x: MOV         R0.x,  (0x00010000, 9.183549616e-41f).x      
         y: MOV         R0.y,  (0xFFFFFFFF, -1.#QNANf).y      
         z: MOV         R0.z,  (0x40066666, 2.099999905f).z      
         w: MOV         R0.w,  (0x3C23D70A, 0.009999999776f).w      
      1  x: MOV         R1.x,  (0x3C23D70A, 0.009999999776f).x      
         y: MOV         R1.y,  (0xFFFFFFFF, -1.#QNANf).y      
         z: MOV         R1.z,  (0xFFFFFFFF, -1.#QNANf).y      
         w: MOV         R1.w,  (0x3C23D70A, 0.009999999776f).x      
      2  x: MOV         R2.x,  (0x00010000, 9.183549616e-41f).x      
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX 
    02 ALU: ADDR(45) CNT(1) 
          3  x: PREDNE_INT  ____,  R2.x,  0.0f      UPDATE_EXEC_MASK BREAK UPDATE_PRED 
    03 ALU: ADDR(46) CNT(71) 
          4  x: MUL         ____,  R0.z,  (0xFFFFFFFF, -1.#QNANf).x      
             y: MUL         ____,  R0.y,  (0x40066666, 2.099999905f).y      
             z: MUL         ____,  R0.x,  (0x3C23D70A, 0.009999999776f).z      
             w: MUL         ____,  R0.w,  (0x00010000, 9.183549616e-41f).w      
          5  x: MUL         T0.x,  PV4.x,  (0x3FB8AA3B, 1.442695022f).x      
             y: MUL         T0.y,  PV4.w,  (0x3FB8AA3B, 1.442695022f).x      
             z: MUL         T0.z,  PV4.z,  (0x3FB8AA3B, 1.442695022f).x      
             w: MUL         T0.w,  PV4.y,  (0x3FB8AA3B, 1.442695022f).x      
          6  x: LOG_sat     ____,  |R1.x|      
             y: LOG_sat     ____,  |R1.x|      
             z: LOG_sat     ____,  |R1.x|      
          7  x: LOG_sat     ____,  |R1.y|      
             y: LOG_sat     ____,  |R1.y|      
             z: LOG_sat     ____,  |R1.y|      
             w: MUL         T1.w,  PV6.x,  (0x3F317218, 0.6931471825f).x      
          8  x: LOG_sat     ____,  |R1.z|      
             y: LOG_sat     ____,  |R1.z|      
             z: LOG_sat     ____,  |R1.z|      
             w: MUL         T2.w,  PV7.z,  (0x3F317218, 0.6931471825f).x      
          9  x: LOG_sat     ____,  |R1.w|      
             y: LOG_sat     ____,  |R1.w|      
             z: LOG_sat     ____,  |R1.w|      
             w: MUL         T3.w,  PV8.y,  (0x3F317218, 0.6931471825f).x      
         10  x: EXP_e       ____,  T0.z      
             y: EXP_e       T1.y,  T0.z      
             z: EXP_e       ____,  T0.z      
             w: MUL         R0.w,  PV9.x,  (0x3F317218, 0.6931471825f).x      
         11  x: EXP_e       ____,  T0.w      
             y: EXP_e       ____,  T0.w      
             z: EXP_e       T0.z,  T0.w      
         12  x: EXP_e       T0.x,  T0.x      
             y: EXP_e       ____,  T0.x      
             z: EXP_e       ____,  T0.x      
         13  x: EXP_e       ____,  T0.y      
             y: EXP_e       ____,  T0.y      
             z: EXP_e       T1.z,  T0.y      
         14  x: RCP_e       T1.x,  T1.y      
             y: RCP_e       ____,  T1.y      
             z: RCP_e       ____,  T1.y      
         15  x: RCP_e       ____,  T0.z      
             y: RCP_e       T1.y,  T0.z      
             z: RCP_e       ____,  T0.z      
         16  x: RCP_e       ____,  T1.z      
             y: RCP_e       T0.y,  T1.z      
             z: RCP_e       ____,  T1.z      
         17  x: RCP_e       ____,  T0.x      
             y: RCP_e       ____,  T0.x      
             z: RCP_e       ____,  T0.x      
         18  x: MUL         R0.x,  T1.w,  T1.x      
             y: MUL         R0.y,  T2.w,  T1.y      VEC_120 
             z: MUL         R0.z,  T3.w,  PV17.x      VEC_201 
         19  x: ADD_INT     R2.x,  -1,  R2.x      
             w: MUL         R0.w,  R0.w,  T0.y      
         20  x: DOT4        ____,  R0.x,  R0.x      
             y: DOT4        ____,  R0.y,  R0.y      
             z: DOT4        ____,  R0.z,  R0.z      
             w: DOT4        ____,  PV19.w,  PV19.w      
         21  x: RSQ_e       ____,  |PV20.x|      
             y: RSQ_e       ____,  |PV20.x|      
             z: RSQ_e       ____,  |PV20.x|      
         22  x: MUL         R1.x,  R0.x,  PV21.y      
             y: MUL         R1.y,  R0.y,  PV21.y      
             z: MUL         R1.z,  R0.z,  PV21.y      
             w: MUL         R1.w,  R0.w,  PV21.y      
04 ENDLOOP i0 PASS_JUMP_ADDR(2) 
05 MEM_EXPORT_WRITE: DWORD_PTR[0], R1, ELEM_SIZE(3)  VPM 
06 ALU: ADDR(117) CNT(4) 
     23  x: MOV         R0.x,  0.0f      
         y: MOV         R0.y,  0.0f      
         z: MOV         R0.z,  0.0f      
         w: MOV         R0.w,  0.0f      
07 EXP_DONE: PIX0, R0
08 END 
END_OF_PROGRAM

So the throughput for code with a lot of transcendentals suffers a bit, just as I was expecting in the discussion back then.
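
To put a rough number on that, one can count the instruction groups inside the loop body of the two listings above (groups 3 to 18 on Cypress, groups 4 to 22 on the 4 slot part, leaving out the one-instruction break clause in both). The per-group slot counts in this Python sketch were transcribed by hand from the disassembly, and treating one group as one issue cycle is an assumption that ignores clause overhead and latency.

```python
# Occupied slots per instruction group in the loop body, read off the listings.
cypress_groups = [5, 5, 4, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 4, 1, 4]          # groups 3..18
vliw4_groups = [4, 4, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 2, 4, 3, 4]   # groups 4..22

def summarize(groups, width):
    """Return (group count, occupied slots, slot utilization) for a loop body."""
    used = sum(groups)
    return len(groups), used, used / (len(groups) * width)

print("Cypress (5 slots):", summarize(cypress_groups, 5))
print("4 slot VLIW:      ", summarize(vliw4_groups, 4))
print("group ratio:", len(vliw4_groups) / len(cypress_groups))
```

By this count the 4 slot listing needs 19 groups against 16 on Cypress for this transcendental-heavy loop. Note that the occupied-slot figure for the 4 slot part includes the extra lanes each transcendental ties up, so it overstates the useful work done per group.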

DavidGraham · Oct 19, 2010

DarthShader said:
Speculation:

Barts chip has indeed 1280sp's, but the 6870 and 6850 are salvage parts with parts of the chip deactivated. They seem to be "good enough" anyways, while the full Barts chips can be saved for either a 6890 or a dual card (Antilles?). That would explain the architecture quirks too and would explain all this confusion.

Still, that isn't consistent with the wavefront size, which would be 80 too.

Gipsel said:
Anyone asking what happened to the rumored 4 slot VLIW units? Guess you will see some ISA code for that tonight

An RCP will use 3 slots, the same as a SQRT or an RSQ:

Code:

x: RCP_sat     R2.x,  R1.y
y: RCP_sat     ____,  R1.y
z: RCP_sat     ____,  R1.y

A sine is done by a multiplication with 1/(2PI) followed by another 3 slot instruction (why the hell is the rescaling with 2PI necessary?)

Code:

          3  x: MUL         ____,  R1.x,  (0x3E22F983, 0.1591549367f).x
          4  x: SIN         R2.x,  PV3.x
             y: SIN         ____,  PV3.x
             z: SIN         ____,  PV3.x

a 32bit integer MUL takes the full 4 slots:

Code:

x: MULLO_INT   R2.x,  R1.x,  R1.x
y: MULLO_INT   ____,  R1.x,  R1.x
z: MULLO_INT   ____,  R1.x,  R1.x
w: MULLO_INT   ____,  R1.x,  R1.x

a 24bit integer MUL takes a single slot as expected

Code:

3 x: MUL_INT24__NI R2.x, R1.x, R1.x

Excellent work Mr. Gipsel, thx for the insight!

a 32bit integer MUL takes the full 4 slots:

Why the hell would that happen? I mean, aren't two slots enough for that? The only way I see that happening is if each slot performs 1/4 of the operation (8 bits), possibly utilizing the mantissa portion?
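
One way to picture the "each slot does part of the operation" idea is to assemble the low 32 bits of the product from narrower partial products. The Python sketch below uses 16-bit halves purely as an illustration; the split width and the function name are assumptions, and the thread does not say how the hardware actually divides the work between the four slots.

```python
MASK32 = 0xFFFFFFFF

def mullo_from_halves(a, b):
    """Low 32 bits of a*b built from 16x16-bit partial products."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF
    low = a_lo * b_lo                  # contributes to bits 0..31
    cross = a_lo * b_hi + a_hi * b_lo  # contributes from bit 16 upwards
    # a_hi * b_hi only affects bits 32 and above, so a low multiply can skip it.
    return (low + (cross << 16)) & MASK32

print(hex(mullo_from_halves(0x12345678, 0x9ABCDEF0)))
```

Even in this simple split, three partial products plus the adds are needed for the low word, so a multiplier narrower than 32 bits in each lane would plausibly keep several lanes busy.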

mczak · Oct 19, 2010

Gipsel said:
So the throughput for code with a lot of transcendentals suffers a bit, just as I was expecting in the discussion back then.

So all transcendentals are using x,y,z? No special tricks the w lane can do?
Who handles the float to int conversion stuff?
