AMD: R9xx Speculation

Discussion in 'Architecture and Products' started by Lukfi, Oct 5, 2009.

  1. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,020
    Likes Received:
    115
    And things like internal l2-l1 bandwidth (if the bus stays the same that is).
     
  2. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Likes Received:
    0
    Location:
    Land of Mu
    nApoleon says 1120sp are official for XT: http://www.chiphell.com/thread-130363-1-1.html

    Meanwhile pclab.pl has posted more pics of XFX cards, saying the 6870 has 1280 shaders: http://pclab.pl/news43574.html . Funnily, they are unsure about the 6850's final clocks, either 725 or 775 MHz.

    Speculation:

    The Barts chip does indeed have 1280 SPs, but the 6870 and 6850 are salvage parts with parts of the chip deactivated. They seem to be "good enough" anyway, while the full Barts chips can be saved for either a 6890 or a dual card (Antilles?). That would explain the architecture quirks and all this confusion.
     
  3. jimbo75

    Veteran

    Joined:
    Jan 17, 2010
    Messages:
    1,211
    Likes Received:
    0
    Honestly AMD must be pissing themselves with laughter.
     
  4. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Anyone asking what happened to the rumored 4 slot VLIW units? Guess you will see some ISA code for that tonight :lol:

    a RCP will use 3 slots, the same as a SQRT or a RSQ:
    Code:
    x: RCP_sat     R2.x,  R1.y      
    y: RCP_sat     ____,  R1.y      
    z: RCP_sat     ____,  R1.y
    a Sine is done as a multiplication by 1/(2*PI) followed by another 3 slot instruction (why the hell is the rescaling by 2*PI necessary?)
    Code:
              3  x: MUL         ____,  R1.x,  (0x3E22F983, 0.1591549367f).x
              4  x: SIN         R2.x,  PV3.x      
                 y: SIN         ____,  PV3.x      
                 z: SIN         ____,  PV3.x
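    The literal in that MUL can be checked directly. A minimal Python sketch (the helper name bits_to_float is made up for illustration) reinterprets 0x3E22F983 as an IEEE-754 single and compares it with 1/(2*PI). Presumably the SIN instruction takes its argument in full turns rather than radians, which would explain the rescaling, but that reading is an assumption and not something confirmed in this thread.

```python
import math
import struct

def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Literal taken from the MUL that precedes the SIN in the listing.
SIN_SCALE = bits_to_float(0x3E22F983)

print(SIN_SCALE)               # value stored in the shader literal
print(1.0 / (2.0 * math.pi))   # exact 1/(2*PI) in double precision
```

    The two values agree to within single-precision rounding, so the MUL really is a multiplication by 1/(2*PI).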
    a 32bit integer MUL takes the full 4 slots:
    Code:
    x: MULLO_INT   R2.x,  R1.x,  R1.x      
    y: MULLO_INT   ____,  R1.x,  R1.x      
    z: MULLO_INT   ____,  R1.x,  R1.x      
    w: MULLO_INT   ____,  R1.x,  R1.x
    a 24bit integer MUL takes a single slot as expected
    Code:
              3  x: MUL_INT24__NI  R2.x,  R1.x,  R1.x
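
    To make the slot counts above concrete, here is a rough Python sketch that packs a sequence of ops into 4 slot bundles using only those costs. The in-order packing rule and the function name pack_in_order are assumptions for illustration; a real compiler also has to respect data dependencies and lane restrictions, which this ignores.

```python
# Slot costs as listed in the post; everything else here is illustrative.
SLOT_COST = {
    "RCP": 3,
    "SQRT": 3,
    "RSQ": 3,
    "SIN": 3,
    "MULLO_INT": 4,
    "MUL_INT24": 1,
}

def pack_in_order(ops, width=4):
    """Greedily pack ops into bundles of `width` slots without reordering."""
    bundles, current, used = [], [], 0
    for op in ops:
        cost = SLOT_COST[op]
        if used + cost > width:
            # The op does not fit any more, so close the current bundle.
            bundles.append(current)
            current, used = [], 0
        current.append(op)
        used += cost
    if current:
        bundles.append(current)
    return bundles

bundles = pack_in_order(["RCP", "MUL_INT24", "MULLO_INT", "SIN", "SIN"])
for b in bundles:
    print(b)
```

    Under these assumptions a 3 slot transcendental leaves room for exactly one single slot op, while two transcendentals can never share a bundle.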
     
    #3404 Gipsel, Oct 18, 2010
    Last edited by a moderator: Oct 19, 2010
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,286
    Likes Received:
    1,551
    Location:
    London
    You sneaky chappy! Hmm, very interesting.
     
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Any wishes? You can send me IL code and I will show you what it looks like in ISA.

    Btw., it really looks like you were right.
     
  7. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,792
    Likes Received:
    3,959
    Location:
    Finland
    The drivers still state clearly that Antilles = Cayman based, not Barts.

    For someone who doesn't understand crap about shader code etc., what exactly does Gipsel's post mean?
     
  8. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Likes Received:
    0
    Location:
    Land of Mu
    That's what I am wondering too. :)

    But I think I got it: RCP = reciprocal. And if it takes three slots, starting from x, then it means the shaders have indeed been buffed, but not in the way I'd like them to be. Still, it could be pretty efficient if there's a 4D VLIW structure.
     
  9. Rebel44

    Newcomer

    Joined:
    Feb 7, 2007
    Messages:
    65
    Likes Received:
    0
    As long as the product isn't really locked away and it already has a bar code, it's easier to bribe some random employee with a few $ and he will bring it to you - at least that is my experience from retail.
     
  10. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    It means that there is one member of a future GPU line from AMD which does indeed have 4 slot VLIW units. The t slot got deleted and the other ones are beefed up a bit. Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but by several slots working together (as is already the case for double precision). A reciprocal is done by 3 slots.

    And as expected, double precision is 1/4 of SP, at least for FMA and MUL, for ADD it's 1/2.
     
  11. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,020
    Likes Received:
    115
    So if that's vliw-4 how many simds?
    If even RCP is taking 3 slots, are the other transcendentals taking all 4 (given that RCP should be the cheapest one)?
     
  12. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    :sad:
     
  13. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Likes Received:
    0
    Location:
    Land of Mu
    How is this different from the situation where the t-unit used other lanes for help? I can only see a benefit if a transcendental takes up only two slots, so two could be performed in parallel.

    Double precision for Barts enabled? Would be big news for me! :shock: Or do you mean "one member of a future GPU line from AMD"? ;)
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,286
    Likes Received:
    1,551
    Location:
    London
    Ooh, thanks.

    Here's something old that I've tweaked for extra transcendentals:
    Code:
    il_ps_2_0
    dcldef_x(*)_y(*)_z(*)_w(*) r0
    def c0, 0.6931471825, 1.442695022, 0.0, 0.0
    dclpin_usage(color)_usageIndex(10)_x(*)_y(*)_z(*)_w(*)_centroid vPixIn0
    dclpin_usage(color)_usageIndex(26)_x(*)_y(*)_z(*)_w(*)_centroid vPixIn1
    mov r0, vPixIn0
    add r1, r0, vPixIn1
    log_zeroop(fltmax) r2.x___, r1.x_abs
    log_zeroop(fltmax) r2._y__, r1.y_abs
    log_zeroop(fltmax) r2.__z_, r1.z_abs
    log_zeroop(fltmax) r2.___w, r1.w_abs
    mul r1, r2, c0.x
    mul r0, r0, vPixIn1
    mul r0, r0, c0.y
    exp r0.x___, r0.x
    rcp_zeroop(infinity) r2.x___, r0.x
    exp r0.x___, r0.y
    rcp_zeroop(infinity) r2._y__, r0.x
    exp r0.x___, r0.z
    exp r0._y__, r0.w
    rcp_zeroop(infinity) r2.___w, r0.y
    rcp_zeroop(infinity) r2.__z_, r0.x
    mul r0, r1, r2
    dp4 r1.x___, r0, r0
    rsq_zeroop(infinity) r1.x___, r1.x_abs
    mul r37, r0, r1.x
    colorclamp oC0, r37
    end
    A comparison of VLIW-5 and VLIW-4 would be fairly interesting.

    I'm dead chuffed. Maybe I should retire, got something right for a change :razz:
     
  15. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Maybe I should say at least one. Unfortunately I cannot connect names to it right now.
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,553
    Likes Received:
    4,719
    Location:
    Well within 3d
    What is the throughput of RCP when issued through the current T lane?

    I'm trying to think of what sequence of operations can be done across three lanes.
    A successive approximation of an RCP would take 5 iterations to yield a single-precision result if using plain math ops. Perhaps a more elaborate method is used?
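
    For reference, the "5 iterations" figure can be reproduced with plain Newton-Raphson, x' = x*(2 - d*x), whose relative error squares at every step. This is only a sketch under assumptions that are not from the thread: the operand is taken to be pre-scaled into [0.5, 1), the seed is a constant 1.0, and "single precision" is read as a relative error of at most 2^-24. It says nothing about what the hardware actually does.

```python
def newton_rcp_steps(d, x0=1.0, tol=2.0 ** -24, max_steps=32):
    """Return (approximation of 1/d, number of Newton-Raphson steps used)."""
    x = x0
    steps = 0
    # The relative error of the current guess is |1 - d*x|; each step squares it.
    while abs(1.0 - d * x) > tol and steps < max_steps:
        x = x * (2.0 - d * x)
        steps += 1
    return x, steps

# Worst case for a seed of 1.0 on [0.5, 1): the initial error is 0.5.
approx, steps = newton_rcp_steps(0.5)
print(approx, steps)
```

    Starting from an error of 1/2, the error goes 1/4, 1/16, 2^-8, 2^-16, 2^-32, so the fifth step is the first one that gets below 2^-24. A better seed (for example from a small table) would cut the step count.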
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,286
    Likes Received:
    1,551
    Location:
    London
    1 cycle, for one scalar per cycle.

    My original proposal was to take the look-up tables and spread them amongst the four lanes. Then use the single-cycle dependent math capability of the 4 lanes, working together, to perform all the math (Lagrange polynomials).

    http://forum.beyond3d.com/showthread.php?p=1417026#post1417026

    Back then the question became about throughput. Also, it's worth bearing in mind that for graphics the error can be quite large (it's single precision for a start).

    I didn't put any trig in the shader earlier, because that gets swamped by operand-normalisation (or at least it does some of the time - not sure, something I haven't studied).
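
    A rough numerical sketch of the table-plus-Lagrange idea, in Python. The table size (64 intervals), the polynomial degree (quadratic through three neighbouring entries) and the restriction to 1 <= x < 2 are all assumptions picked for illustration; nothing here is taken from real hardware.

```python
N = 64  # hypothetical number of table intervals over [1, 2]
TABLE = [1.0 / (1.0 + i / N) for i in range(N + 1)]

def rcp_lagrange(x):
    """Approximate 1/x for 1 <= x < 2 from three neighbouring table entries."""
    i = min(int((x - 1.0) * N), N - 2)   # first of the three nodes used
    t = (x - 1.0) * N - i                # offset in node spacings, 0 <= t <= 2
    y0, y1, y2 = TABLE[i], TABLE[i + 1], TABLE[i + 2]
    # Quadratic Lagrange basis polynomials for nodes at t = 0, 1, 2.
    l0 = (t - 1.0) * (t - 2.0) / 2.0
    l1 = t * (2.0 - t)
    l2 = t * (t - 1.0) / 2.0
    return y0 * l0 + y1 * l1 + y2 * l2

# Largest absolute error over a dense sweep of the input range.
max_err = max(
    abs(rcp_lagrange(1.0 + k / 10000.0) - 1.0 / (1.0 + k / 10000.0))
    for k in range(10000)
)
print(max_err)
```

    With these parameters the error stays in the low 1e-6 range, which is short of full single precision; a larger table or a higher degree would be needed to close the gap, and that trade-off is where the throughput question comes in.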
     
  18. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    There was a slight problem with it: I couldn't get it to compile with the "def c0" in it. The error message didn't tell me much (unsupported opcode, without saying which one :roll:), so I started to rip out everything (and it worked after exchanging c0 with a literal). What was left is this (edit: I just noticed a bit of other stuff got pasted in, but outside of the loop, so it shouldn't matter):
    Code:
    il_ps_2_0
    dcl_literal l1, 0.6931471825, 1.442695022, 0.0, 0.0
    dcl_literal l0, 0x10000, 0xffffffff, 2.1, 0.01
    mov r13.x, l0.x ; loop counter
    mov r1.xyzw,l0.zzzz ; arbitrary values, r1 = 2*2-2 = 2
    mov r2.xyzw,l0.wwww ; r2 = 0*0+0 = 0
    mov r0, l0
    add r1, r0, l0.wzyx
    whileloop
    break_logicalz r13 ; while(r13 > 0)
    log_zeroop(fltmax) r2.x___, r1.x_abs
    log_zeroop(fltmax) r2._y__, r1.y_abs
    log_zeroop(fltmax) r2.__z_, r1.z_abs
    log_zeroop(fltmax) r2.___w, r1.w_abs
    mul r1, r2, l1.x
    mul r0, r0, l0.wzyx
    mul r0, r0, l1.y
    exp r0.x___, r0.x
    rcp_zeroop(infinity) r2.x___, r0.x
    exp r0.x___, r0.y
    rcp_zeroop(infinity) r2._y__, r0.x
    exp r0.x___, r0.z
    exp r0._y__, r0.w
    rcp_zeroop(infinity) r2.___w, r0.y
    rcp_zeroop(infinity) r2.__z_, r0.x
    mul r0, r1, r2
    dp4 r1.x___, r0, r0
    rsq_zeroop(infinity) r1.x___, r1.x_abs
    mul r1, r0, r1.x
    iadd r13.x, r13.x, l0.y ; counter--
    endloop
    mov g[0], r1
    end
    Still okay?
    On Cypress the ISA looks like that:
    Code:
    ; --------  Disassembly --------------------
    00 ALU: ADDR(32) CNT(13) 
          0  x: MOV         R0.x,  (0x00010000, 9.183549616e-41f).x      
             y: MOV         R0.y,  (0xFFFFFFFF, -1.#QNANf).y      
             z: MOV         R0.z,  (0x40066666, 2.099999905f).z      
             w: MOV         R0.w,  (0x3C23D70A, 0.009999999776f).w      
             t: MOV         R1.x,  (0x3C23D70A, 0.009999999776f).w      
          1  x: MOV         R2.x,  (0x00010000, 9.183549616e-41f).x      
             y: MOV         R1.y,  (0xFFFFFFFF, -1.#QNANf).y      
             z: MOV         R1.z,  (0xFFFFFFFF, -1.#QNANf).y      
             w: MOV         R1.w,  (0x3C23D70A, 0.009999999776f).z      
    01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX 
        02 ALU_BREAK: ADDR(45) CNT(1) 
              2  x: PREDNE_INT  ____,  R2.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
        03 ALU: ADDR(46) CNT(44) 
              3  x: MUL         ____,  R0.z,  (0xFFFFFFFF, -1.#QNANf).x      
                 y: MUL         ____,  R0.y,  (0x40066666, 2.099999905f).y      
                 z: MUL         ____,  R0.x,  (0x3C23D70A, 0.009999999776f).z      
                 w: MUL         ____,  R0.w,  (0x00010000, 9.183549616e-41f).w      
                 t: LOG_sat     T0.z,  |R1.x|      
              4  x: MUL         T0.x,  PV3.x,  (0x3FB8AA3B, 1.442695022f).x      
                 y: MUL         T0.y,  PV3.y,  (0x3FB8AA3B, 1.442695022f).x      
                 z: MUL         T1.z,  PV3.z,  (0x3FB8AA3B, 1.442695022f).x      
                 w: MUL         T0.w,  PV3.w,  (0x3FB8AA3B, 1.442695022f).x      
                 t: LOG_sat     ____,  |R1.y|      
              5  x: ADD_INT     R2.x,  -1,  R2.x      
                 y: MUL         T1.y,  PS4,  (0x3F317218, 0.6931471825f).x      
                 z: MUL         T0.z,  T0.z,  (0x3F317218, 0.6931471825f).x      
                 t: LOG_sat     ____,  |R1.z|      
              6  x: MUL         T2.x,  PS5,  (0x3F317218, 0.6931471825f).x      
                 t: LOG_sat     ____,  |R1.w|      
              7  w: MUL         T1.w,  PS6,  (0x3F317218, 0.6931471825f).x      
                 t: EXP_e       T1.z,  T1.z      
              8  t: EXP_e       T1.x,  T0.y      
              9  t: EXP_e       T2.z,  T0.x      
             10  t: EXP_e       T0.y,  T0.w      
             11  t: RCP_e       ____,  T1.z      
             12  x: MUL         R0.x,  T0.z,  PS11      
                 t: RCP_e       ____,  T1.x      
             13  y: MUL         R0.y,  T1.y,  PS12      
                 t: RCP_e       T1.x,  T0.y      
             14  t: RCP_e       ____,  T2.z      
             15  z: MUL         R0.z,  T2.x,  PS14      
                 w: MUL         R0.w,  T1.w,  T1.x      
             16  x: DOT4        ____,  R0.x,  R0.x      
                 y: DOT4        ____,  R0.y,  R0.y      
                 z: DOT4        ____,  PV15.z,  PV15.z      
                 w: DOT4        ____,  PV15.w,  PV15.w      
             17  t: RSQ_e       ____,  |PV16.x|      
             18  x: MUL         R1.x,  R0.x,  PS17      
                 y: MUL         R1.y,  R0.y,  PS17      
                 z: MUL         R1.z,  R0.z,  PS17      
                 w: MUL         R1.w,  R0.w,  PS17      
    04 ENDLOOP i0 PASS_JUMP_ADDR(2) 
    05 MEM_EXPORT_WRITE: DWORD_PTR[0], R1, ELEM_SIZE(3) 
    06 ALU: ADDR(90) CNT(4) 
         19  x: MOV         R0.x,  0.0f      
             y: MOV         R0.y,  0.0f      
             z: MOV         R0.z,  0.0f      
             w: MOV         R0.w,  0.0f      
    07 EXP_DONE: PIX0, R0
    END_OF_PROGRAM
    On that future 4 slot VLIW architecture it looks like this:
    Code:
    ; --------  Disassembly --------------------
    00 ALU: ADDR(32) CNT(13) 
          0  x: MOV         R0.x,  (0x00010000, 9.183549616e-41f).x      
             y: MOV         R0.y,  (0xFFFFFFFF, -1.#QNANf).y      
             z: MOV         R0.z,  (0x40066666, 2.099999905f).z      
             w: MOV         R0.w,  (0x3C23D70A, 0.009999999776f).w      
          1  x: MOV         R1.x,  (0x3C23D70A, 0.009999999776f).x      
             y: MOV         R1.y,  (0xFFFFFFFF, -1.#QNANf).y      
             z: MOV         R1.z,  (0xFFFFFFFF, -1.#QNANf).y      
             w: MOV         R1.w,  (0x3C23D70A, 0.009999999776f).x      
          2  x: MOV         R2.x,  (0x00010000, 9.183549616e-41f).x      
    01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX 
        02 ALU: ADDR(45) CNT(1) 
              3  x: PREDNE_INT  ____,  R2.x,  0.0f      UPDATE_EXEC_MASK BREAK UPDATE_PRED 
        03 ALU: ADDR(46) CNT(71) 
              4  x: MUL         ____,  R0.z,  (0xFFFFFFFF, -1.#QNANf).x      
                 y: MUL         ____,  R0.y,  (0x40066666, 2.099999905f).y      
                 z: MUL         ____,  R0.x,  (0x3C23D70A, 0.009999999776f).z      
                 w: MUL         ____,  R0.w,  (0x00010000, 9.183549616e-41f).w      
              5  x: MUL         T0.x,  PV4.x,  (0x3FB8AA3B, 1.442695022f).x      
                 y: MUL         T0.y,  PV4.w,  (0x3FB8AA3B, 1.442695022f).x      
                 z: MUL         T0.z,  PV4.z,  (0x3FB8AA3B, 1.442695022f).x      
                 w: MUL         T0.w,  PV4.y,  (0x3FB8AA3B, 1.442695022f).x      
              6  x: LOG_sat     ____,  |R1.x|      
                 y: LOG_sat     ____,  |R1.x|      
                 z: LOG_sat     ____,  |R1.x|      
              7  x: LOG_sat     ____,  |R1.y|      
                 y: LOG_sat     ____,  |R1.y|      
                 z: LOG_sat     ____,  |R1.y|      
                 w: MUL         T1.w,  PV6.x,  (0x3F317218, 0.6931471825f).x      
              8  x: LOG_sat     ____,  |R1.z|      
                 y: LOG_sat     ____,  |R1.z|      
                 z: LOG_sat     ____,  |R1.z|      
                 w: MUL         T2.w,  PV7.z,  (0x3F317218, 0.6931471825f).x      
              9  x: LOG_sat     ____,  |R1.w|      
                 y: LOG_sat     ____,  |R1.w|      
                 z: LOG_sat     ____,  |R1.w|      
                 w: MUL         T3.w,  PV8.y,  (0x3F317218, 0.6931471825f).x      
             10  x: EXP_e       ____,  T0.z      
                 y: EXP_e       T1.y,  T0.z      
                 z: EXP_e       ____,  T0.z      
                 w: MUL         R0.w,  PV9.x,  (0x3F317218, 0.6931471825f).x      
             11  x: EXP_e       ____,  T0.w      
                 y: EXP_e       ____,  T0.w      
                 z: EXP_e       T0.z,  T0.w      
             12  x: EXP_e       T0.x,  T0.x      
                 y: EXP_e       ____,  T0.x      
                 z: EXP_e       ____,  T0.x      
             13  x: EXP_e       ____,  T0.y      
                 y: EXP_e       ____,  T0.y      
                 z: EXP_e       T1.z,  T0.y      
             14  x: RCP_e       T1.x,  T1.y      
                 y: RCP_e       ____,  T1.y      
                 z: RCP_e       ____,  T1.y      
             15  x: RCP_e       ____,  T0.z      
                 y: RCP_e       T1.y,  T0.z      
                 z: RCP_e       ____,  T0.z      
             16  x: RCP_e       ____,  T1.z      
                 y: RCP_e       T0.y,  T1.z      
                 z: RCP_e       ____,  T1.z      
             17  x: RCP_e       ____,  T0.x      
                 y: RCP_e       ____,  T0.x      
                 z: RCP_e       ____,  T0.x      
             18  x: MUL         R0.x,  T1.w,  T1.x      
                 y: MUL         R0.y,  T2.w,  T1.y      VEC_120 
                 z: MUL         R0.z,  T3.w,  PV17.x      VEC_201 
             19  x: ADD_INT     R2.x,  -1,  R2.x      
                 w: MUL         R0.w,  R0.w,  T0.y      
             20  x: DOT4        ____,  R0.x,  R0.x      
                 y: DOT4        ____,  R0.y,  R0.y      
                 z: DOT4        ____,  R0.z,  R0.z      
                 w: DOT4        ____,  PV19.w,  PV19.w      
             21  x: RSQ_e       ____,  |PV20.x|      
                 y: RSQ_e       ____,  |PV20.x|      
                 z: RSQ_e       ____,  |PV20.x|      
             22  x: MUL         R1.x,  R0.x,  PV21.y      
                 y: MUL         R1.y,  R0.y,  PV21.y      
                 z: MUL         R1.z,  R0.z,  PV21.y      
                 w: MUL         R1.w,  R0.w,  PV21.y      
    04 ENDLOOP i0 PASS_JUMP_ADDR(2) 
    05 MEM_EXPORT_WRITE: DWORD_PTR[0], R1, ELEM_SIZE(3)  VPM 
    06 ALU: ADDR(117) CNT(4) 
         23  x: MOV         R0.x,  0.0f      
             y: MOV         R0.y,  0.0f      
             z: MOV         R0.z,  0.0f      
             w: MOV         R0.w,  0.0f      
    07 EXP_DONE: PIX0, R0
    08 END 
    END_OF_PROGRAM
    So the throughput for code with a lot of transcendentals suffers a bit, just as I was expecting in the discussion back then.
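
    To put rough numbers on that, the Python sketch below counts instruction groups and filled slots per loop iteration, transcribed by hand from the two disassemblies above. Treating every instruction group as one issue step is an assumption; clause switching and anything outside the loop are ignored.

```python
# Filled slots per instruction group inside the loop, read off the listings.
# The first entry in each list is the PREDNE_INT loop-exit test.
CYPRESS_VLIW5 = [1, 5, 5, 4, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 4, 1, 4]
FUTURE_VLIW4 = [1, 4, 4, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 2, 4, 3, 4]

def summarize(groups, width):
    """Return (group count, filled slots, slot utilization) for one loop body."""
    bundles = len(groups)
    filled = sum(groups)
    return bundles, filled, filled / (bundles * width)

v5 = summarize(CYPRESS_VLIW5, 5)
v4 = summarize(FUTURE_VLIW4, 4)
print("VLIW-5: groups=%d filled=%d utilization=%.3f" % v5)
print("VLIW-4: groups=%d filled=%d utilization=%.3f" % v4)
print("group ratio VLIW-4 / VLIW-5: %.3f" % (v4[0] / v5[0]))
```

    That gives 17 groups per iteration on Cypress against 20 on the 4 slot design. The 65 filled slots on the 4 slot side include the 13 transcendentals at 3 slots each, so the logical op count is the same 39 on both.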
     
    #3418 Gipsel, Oct 19, 2010
    Last edited by a moderator: Oct 19, 2010
  19. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,511
    Likes Received:
    4,129
    Still, that isn't consistent with wavefront size, which would be 80 too.

    Excellent work Mr. Gipsel, thx for the insight! :grin:

    Why the hell would that happen? I mean, isn't two slots enough for that? The only way I see that happening is if each slot performs 1/4 of the operation (8 bits), possibly utilizing the mantissa portion?
     
  20. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,020
    Likes Received:
    115
    So all transcendentals are using x,y,z? No special tricks the w lane can do?
    Who handles the float to int conversion stuff?
     