Faster dense matrix-matrix products on ATi hardware

Discussion in 'GPGPU Technology & Programming' started by prunedtree, Aug 18, 2009.

  1. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    This is incorrect. You may be confused by how scheduling is prioritized - namely, "common" instructions will first be assigned to the "vector" ALUs (x,y,z,w) and only if those are occupied will they be assigned to the transcendental unit as well. Of course, transcendental ops (or stuff like INT MUL/DIV, for example) get scheduled to the trans ALU implicitly. There are also some GPR read port restrictions in place, which end up not always allowing an instruction to be scheduled there. But it does MADs just fine, and quite often, really.
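
    For what it's worth, the listing prunedtree posts further down shows exactly that, with plenty of MULADDs landing in the t slot. As a rough illustration (plain C, my own sketch, nothing to do with the actual shader compiler), any bundle of five independent multiply-accumulates is the kind of work that lets a VLIW-5 machine fill x, y, z, w and t at once:

    Code:
    /* My own illustration, not code from this thread: five independent
     * multiply-accumulates per iteration give the compiler something to
     * put in every slot, the t unit included. */
    void mad5(int n, const float *a, const float *b, float acc[5])
    {
        for (int i = 0; i + 4 < n; i += 5) {
            acc[0] += a[i + 0] * b[i + 0];  /* each statement is independent */
            acc[1] += a[i + 1] * b[i + 1];  /* of the others, so none of     */
            acc[2] += a[i + 2] * b[i + 2];  /* them has to wait for a slot   */
            acc[3] += a[i + 3] * b[i + 3];
            acc[4] += a[i + 4] * b[i + 4];
        }
    }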
     
  2. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Average utilization doesn't really indicate how often the t unit is being used. If you have a bunch of very scalar code, utilization may go down, but you may find the rest of the code is fully utilizing all slots.
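
    A made-up example of why the average alone can mislead (numbers invented purely for illustration):

    Code:
    /* Invented numbers: if half the instruction groups are fully packed
     * (5 slots) and half are purely scalar (1 slot), average utilization
     * drops to 60%, yet the t slot is still busy in every one of the
     * fully packed groups. */
    #include <stdio.h>

    int main(void)
    {
        const double full = 0.5, scalar = 0.5;  /* fraction of groups */
        const double avg = (full * 5.0 + scalar * 1.0) / 5.0;
        printf("average slot utilization: %.0f%%\n", avg * 100.0);   /* 60% */
        printf("groups using the t slot:  %.0f%%\n", full * 100.0);  /* 50% */
        return 0;
    }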
     
  3. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Which instruction is CNDE, btw?
     
  4. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    I believe it checks whether a number is equal to 0: if so, it chooses one of the operands; if not, it chooses the other. I don't have the specs in front of me, but Jawed posted a link to the instruction set specs recently.

    Edit: Did you mean CNDGE? I believe that checks if a number is greater than or equal to 0 with similar behavior to what I posted above.

    Edit again: Compare to the cmp instruction in the Direct3D instruction specs.
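
    For what it's worth, in C terms those descriptions come out roughly as below (my paraphrase with my own helper names, not wording taken from the ISA or Direct3D docs):

    Code:
    /* Rough C equivalents of the conditional moves discussed above,
     * paraphrased from the descriptions in this thread. */
    float cnde (float c, float a, float b) { return (c == 0.0f) ? a : b; }  /* CNDE          */
    float cndge(float c, float a, float b) { return (c >= 0.0f) ? a : b; }  /* CNDGE         */
    float d3cmp(float c, float a, float b) { return (c >= 0.0f) ? a : b; }  /* D3D asm "cmp" */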
     
  5. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Likes Received:
    0
    Well, given that double precision multiply-adds are four (five if you count the `t' unit) times slower but only require twice the bandwidth, it's much easier to achieve high ALU utilization. ATi's implementation is almost optimal, over 200 Gflop/s (out of 240 Gflop/s peak).
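
    Back of the envelope, with figures I'm assuming for an HD 4870-class chip (160 VLIW units at 750 MHz, one DP MAD per unit per clock; the thread doesn't spell these out here): the MAD rate drops ~5x going to double precision while each element only doubles in size, so the flop:byte ratio the fetch units must sustain drops by ~2.5x.

    Code:
    /* Assumed figures, not data from this thread. */
    #include <stdio.h>

    int main(void)
    {
        const double clk = 750e6;
        const double sp_peak = 160 * 5 * 2 * clk;  /* 5 SP MADs/clk, 2 flops each -> 1200 Gflop/s */
        const double dp_peak = 160 * 1 * 2 * clk;  /* 1 DP MAD/clk                ->  240 Gflop/s */
        const double flop_ratio = sp_peak / dp_peak;  /* 5x fewer flops available           */
        const double byte_ratio = 8.0 / 4.0;          /* but only 2x more bytes per element */
        printf("SP peak %.0f Gflop/s, DP peak %.0f Gflop/s\n", sp_peak / 1e9, dp_peak / 1e9);
        printf("required flop:byte ratio drops by %.1fx\n", flop_ratio / byte_ratio);
        return 0;
    }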

    No, I didn't measure more than ~444 GB/s even with all threads fetching the same value(s) over and over. Running ATi's various synthetic tests (among the samples in the SDK) gives similar results. Since texture fetches are the bottleneck, it's actually impressive that the hardware manages to lose only 1% of efficiency with a more complex access pattern.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    Code:
    t: CNDE_INT    R6.w,  R3.x,  R4.z,  PV35.z
    Table 4.4 in the ISA Guide says it's a conditional move based on the first operand being equal to 0.0 (that's for the floating point version; the description of the integer version is very loose), so it looks like it chooses between operand 2 and operand 3 to put into the result.

    Jawed
     
  7. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Likes Received:
    0
    The 8x8 block kernel in the original post uses only 26 float4 registers. There's clearly plenty of margin, so how much further can we go? Well, it's possible to fit 8x10 blocks by using the entire register file. In theory this is 11% faster; in practice it achieves 980 Gflop/s, over 4 multiply-adds per cycle on average.
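
    For anyone who'd rather see the shape of the thing than raw disassembly, an 8x10 register block boils down to something like the plain C sketch below (my own sketch with my own names, not the actual kernel):

    Code:
    /* Register-blocking sketch: each "thread" keeps an 8x10 tile of C in
     * registers and, for every k, performs 8*10 = 80 multiply-adds for
     * 8 + 10 = 18 loaded values. Not the kernel shown below, just the idea. */
    enum { BM = 8, BN = 10 };

    static void gemm_block(int K, const float *A, int lda,
                                  const float *B, int ldb,
                                  float *C, int ldc)
    {
        float acc[BM][BN] = {{0.0f}};  /* the per-thread accumulators */

        for (int k = 0; k < K; ++k) {
            float a[BM], b[BN];
            for (int i = 0; i < BM; ++i) a[i] = A[i * lda + k];  /* a column of the A tile */
            for (int j = 0; j < BN; ++j) b[j] = B[k * ldb + j];  /* a row of the B tile    */
            for (int i = 0; i < BM; ++i)
                for (int j = 0; j < BN; ++j)
                    acc[i][j] += a[i] * b[j];  /* the MULADDs that fill the slots */
        }
        for (int i = 0; i < BM; ++i)
            for (int j = 0; j < BN; ++j)
                C[i * ldc + j] = acc[i][j];
    }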

    However, the clarity of the code suffers a little ^^;;

    Code:
    00 ALU: ADDR(64) CNT(87) 
          0  x: MOV         R13.x,  0.0f      
             y: MOV         R13.y,  0.0f      
             z: AND_INT     T0.z,  R0.x,  (0x0000003F, 8.828180325e-44f).x      
             w: LSHR        T0.w,  R0.x,  (0x00000006, 8.407790786e-45f).y      
             t: MOV         R13.z,  0.0f      
          1  x: MOV         R11.x,  0.0f      
             y: MOV         R11.y,  0.0f      
             z: MOV         R11.z,  0.0f      
             w: MOV         R13.w,  0.0f      
             t: MOV         R11.w,  0.0f      
          2  x: MOV         R10.x,  0.0f      
             y: MOV         R10.y,  0.0f      
             z: MOV         R10.z,  0.0f      
             w: MOV         R10.w,  0.0f      
             t: MOV         R9.x,  0.0f      
          3  x: MOV         R8.x,  0.0f      
             y: MOV         R9.y,  0.0f      
             z: MOV         R9.z,  0.0f      
             w: MOV         R9.w,  0.0f      
             t: MOV         R22.z,  0.0f      
          4  x: MOV         R7.x,  0.0f      
             y: MOV         R8.y,  0.0f      
             z: MOV         R8.z,  0.0f      
             w: MOV         R22.w,  0.0f      
             t: MOV         R8.w,  0.0f      
          5  x: MOV         R29.x,  0.0f      
             y: MOV         R7.y,  0.0f      
             z: MOV         R7.z,  0.0f      
             w: MOV         R7.w,  0.0f      
             t: MOV         R29.y,  0.0f      
          6  x: MOV         R6.x,  0.0f      
             y: MOV         R6.y,  0.0f      
             z: MOV         R29.z,  0.0f      
             w: MOV         R29.w,  0.0f      
             t: MOV         R6.z,  0.0f      
          7  x: MOV         R21.x,  0.0f      
             y: MOV         R21.y,  0.0f      
             z: MOV         R21.z,  0.0f      
             w: MOV         R6.w,  0.0f      
             t: MOV         R21.w,  0.0f      
          8  x: MOV         R28.x,  0.0f      
             y: MOV         R28.y,  0.0f      
             z: MOV         R28.z,  0.0f      
             w: MOV         R28.w,  0.0f      
             t: MOV         R20.x,  0.0f      
          9  x: MOV         R5.x,  0.0f      
             y: MOV         R20.y,  0.0f      
             z: MOV         R20.z,  0.0f      
             w: MOV         R20.w,  0.0f      
             t: MOV         R5.y,  0.0f      
         10  x: MOV         R19.x,  0.0f      
             y: MOV         R19.y,  0.0f      
             z: MOV         R5.z,  0.0f      
             w: MOV         R5.w,  0.0f      
             t: MOV         R19.z,  0.0f      
         11  x: MOV         R4.x,  0.0f      
             y: MOV         R4.y,  0.0f      
             z: MOV         R4.z,  0.0f      
             w: MOV         R19.w,  0.0f      
             t: MOV         R4.w,  0.0f      
         12  x: MOV         R18.x,  0.0f      
             y: MOV         R18.y,  0.0f      
             z: MOV         R18.z,  0.0f      
             w: MOV         R18.w,  0.0f      
             t: MOV         R17.x,  0.0f      
         13  x: MOV         R16.x,  0.0f      
             y: MOV         R17.y,  0.0f      
             z: MOV         R17.z,  0.0f      
             w: MOV         R17.w,  0.0f      
             t: MOV         R16.y,  0.0f      
         14  x: MOV         R15.x,  0.0f      
             y: MOV         R15.y,  0.0f      
             z: MOV         R16.z,  0.0f      
             w: MOV         R16.w,  0.0f      
             t: MOV         R15.z,  0.0f      
         15  x: MOV         R14.x,  0.0f      
             y: MOV         R14.y,  0.0f      
             z: MOV         R14.z,  0.0f      
             w: MOV         R15.w,  0.0f      
             t: MOV         R14.w,  0.0f      
         16  x: MOV         R12.x,  0.0f      
             y: MOV         R12.y,  0.0f      
             z: MOV         R12.z,  0.0f      
             w: MOV         R12.w,  0.0f      
             t: I_TO_F      R0.x,  T0.z      
         17  t: I_TO_F      R0.y,  T0.w      
    01 TEX: ADDR(880) CNT(1) 
         18  SAMPLE R22.xy__, R0.xyxx, t8, s8  UNNORM(XYZW) 
    02 LOOP_DX10 i0 FAIL_JUMP_ADDR(33) 
        03 ALU_BREAK: ADDR(151) CNT(1) KCACHE0(CB0:0-15) 
             19  x: PREDGT      ____,  KC0[0].x,  R22.z      UPDATE_EXEC_MASK UPDATE_PRED 
        04 ALU: ADDR(152) CNT(2) 
             20  z: ADD         R22.z,  R22.w,  1.0f      
                 w: ADD         R22.w,  R22.w,  1.0f      
        05 TEX: ADDR(882) CNT(8) 
             21  SAMPLE R1, R22.xwxx, t0, s0  UNNORM(XYZW) 
             22  SAMPLE R23, R22.xwxx, t1, s1  UNNORM(XYZW) 
             23  SAMPLE R24, R22.xwxx, t2, s2  UNNORM(XYZW) 
             24  SAMPLE R25, R22.xwxx, t3, s3  UNNORM(XYZW) 
             25  SAMPLE R0, R22.yzyy, t4, s4  UNNORM(XYZW) 
             26  SAMPLE R2, R22.yzyy, t5, s5  UNNORM(XYZW) 
             27  SAMPLE R26, R22.yzyy, t6, s6  UNNORM(XYZW) 
             28  SAMPLE R27, R22.yzyy, t7, s7  UNNORM(XYZW) 
        06 ALU_PUSH_BEFORE: ADDR(154) CNT(81) KCACHE0(CB0:0-15) 
             29  x: MULADD      R29.x,  R1.x,  R0.x,  R29.x      
                 y: MULADD      R29.y,  R1.x,  R0.y,  R29.y      
                 z: MULADD      R29.z,  R1.x,  R0.z,  R29.z      
                 w: MULADD      R29.w,  R1.x,  R0.w,  R29.w      
             30  x: MULADD      R21.x,  R1.x,  R2.x,  R21.x      
                 y: MULADD      R21.y,  R1.x,  R2.y,  R21.y      
                 z: MULADD      R21.z,  R1.x,  R2.z,  R21.z      
                 w: MULADD      R21.w,  R1.x,  R2.w,  R21.w      
             31  x: MULADD      R20.x,  R1.y,  R0.x,  R20.x      VEC_210 
                 y: MULADD      R20.y,  R1.y,  R0.y,  R20.y      VEC_201 
                 z: MULADD      R20.z,  R1.y,  R0.z,  R20.z      VEC_201 
                 w: MULADD      R20.w,  R1.y,  R0.w,  R20.w      VEC_201 
                 t: MULADD      R18.x,  R1.z,  R0.x,  R18.x      VEC_120 
             32  x: MULADD      R19.x,  R1.y,  R2.x,  R19.x      VEC_210 
                 y: MULADD      R19.y,  R1.y,  R2.y,  R19.y      VEC_201 
                 z: MULADD      R19.z,  R1.y,  R2.z,  R19.z      VEC_201 
                 w: MULADD      R19.w,  R1.y,  R2.w,  R19.w      VEC_201 
                 t: MULADD      R17.x,  R1.z,  R2.x,  R17.x      VEC_120 
             33  x: MULADD      R16.x,  R1.w,  R0.x,  R16.x      VEC_201 
                 y: MULADD      R18.y,  R1.z,  R0.y,  R18.y      VEC_210 
                 z: MULADD      R18.z,  R1.z,  R0.z,  R18.z      VEC_201 
                 w: MULADD      R18.w,  R1.z,  R0.w,  R18.w      VEC_201 
                 t: MULADD      R16.y,  R1.w,  R0.y,  R16.y      VEC_120 
             34  x: MULADD      R15.x,  R1.w,  R2.x,  R15.x      VEC_201 
                 y: MULADD      R17.y,  R1.z,  R2.y,  R17.y      VEC_210 
                 z: MULADD      R17.z,  R1.z,  R2.z,  R17.z      VEC_201 
                 w: MULADD      R17.w,  R1.z,  R2.w,  R17.w      VEC_201 
                 t: MULADD      R15.y,  R1.w,  R2.y,  R15.y      VEC_120 
             35  x: MULADD      R14.x,  R23.x,  R0.x,  R14.x      VEC_201 
                 y: MULADD      R14.y,  R23.x,  R0.y,  R14.y      VEC_201 
                 z: MULADD      R16.z,  R1.w,  R0.z,  R16.z      
                 w: MULADD      R16.w,  R1.w,  R0.w,  R16.w      
                 t: MULADD      R14.z,  R23.x,  R0.z,  R14.z      
             36  x: MULADD      R12.x,  R23.x,  R2.x,  R12.x      VEC_201 
                 y: MULADD      R12.y,  R23.x,  R2.y,  R12.y      VEC_201 
                 z: MULADD      R15.z,  R1.w,  R2.z,  R15.z      
                 w: MULADD      R15.w,  R1.w,  R2.w,  R15.w      
                 t: MULADD      R12.z,  R23.x,  R2.z,  R12.z      
             37  x: MULADD      R13.x,  R23.y,  R0.x,  R13.x      VEC_201 
                 y: MULADD      R13.y,  R23.y,  R0.y,  R13.y      VEC_201 
                 z: MULADD      R13.z,  R23.y,  R0.z,  R13.z      VEC_201 
                 w: MULADD      R14.w,  R23.x,  R0.w,  R14.w      VEC_210 
                 t: MULADD      R13.w,  R23.y,  R0.w,  R13.w      VEC_120 
             38  x: MULADD      R11.x,  R23.y,  R2.x,  R11.x      VEC_201 
                 y: MULADD      R11.y,  R23.y,  R2.y,  R11.y      VEC_201 
                 z: MULADD      R11.z,  R23.y,  R2.z,  R11.z      VEC_201 
                 w: MULADD      R12.w,  R23.x,  R2.w,  R12.w      VEC_210 
                 t: MULADD      R11.w,  R23.y,  R2.w,  R11.w      VEC_120 
             39  x: MULADD      R10.x,  R23.z,  R0.x,  R10.x      VEC_210 
                 y: MULADD      R10.y,  R23.z,  R0.y,  R10.y      VEC_201 
                 z: MULADD      R10.z,  R23.z,  R0.z,  R10.z      VEC_201 
                 w: MULADD      R10.w,  R23.z,  R0.w,  R10.w      VEC_201 
                 t: MULADD      R8.x,  R23.w,  R0.x,  R8.x      VEC_120 
             40  x: MULADD      R9.x,  R23.z,  R2.x,  R9.x      VEC_210 
                 y: MULADD      R9.y,  R23.z,  R2.y,  R9.y      VEC_201 
                 z: MULADD      R9.z,  R23.z,  R2.z,  R9.z      VEC_201 
                 w: MULADD      R9.w,  R23.z,  R2.w,  R9.w      VEC_201 
                 t: MULADD      R7.x,  R23.w,  R2.x,  R7.x      VEC_120 
             41  x: MULADD      R6.x,  R24.x,  R0.x,  R6.x      VEC_201 
                 y: MULADD      R8.y,  R23.w,  R0.y,  R8.y      
                 z: MULADD      R8.z,  R23.w,  R0.z,  R8.z      
                 w: MULADD      R8.w,  R23.w,  R0.w,  R8.w      
                 t: MULADD      R6.y,  R24.x,  R0.y,  R6.y      
             42  x: MULADD      R3.x,  R24.x,  R2.x,  R28.x      VEC_201 
                 y: MULADD      R7.y,  R23.w,  R2.y,  R7.y      
                 z: MULADD      R7.z,  R23.w,  R2.z,  R7.z      
                 w: MULADD      R7.w,  R23.w,  R2.w,  R7.w      
                 t: MULADD      R3.y,  R24.x,  R2.y,  R28.y      
             43  x: MULADD      R5.x,  R24.y,  R0.x,  R5.x      VEC_201 
                 y: MULADD      R5.y,  R24.y,  R0.y,  R5.y      VEC_201 
                 z: MULADD      R6.z,  R24.x,  R0.z,  R6.z      VEC_210 
                 w: MULADD      R6.w,  R24.x,  R0.w,  R6.w      VEC_201 
                 t: MULADD      R5.z,  R24.y,  R0.z,  R5.z      VEC_120 
             44  x: MULADD      R4.x,  R24.y,  R2.x,  R4.x      VEC_201 
                 y: MULADD      R4.y,  R24.y,  R2.y,  R4.y      VEC_201 
                 z: MULADD      R3.z,  R24.x,  R2.z,  R28.z      VEC_210 
                 w: MULADD      R5.w,  R24.y,  R0.w,  R5.w      VEC_201 
                 t: MULADD      R4.z,  R24.y,  R2.z,  R4.z      VEC_120 
             45  w: MULADD      R3.w,  R24.x,  R2.w,  R28.w      
                 t: MULADD      R4.w,  R24.y,  R2.w,  R4.w      
             46  x: PREDE_INT   ____,  KC0[1].y,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
        07 JUMP  POP_CNT(1) ADDR(9) 
        08 ALU_POP_AFTER: ADDR(235) CNT(48) 
             47  x: MULADD      R29.x,  R24.z,  R26.x,  R29.x      VEC_210 
                 y: MULADD      R29.y,  R24.z,  R26.y,  R29.y      VEC_201 
                 z: MULADD      R29.z,  R24.z,  R26.z,  R29.z      VEC_201 
                 w: MULADD      R29.w,  R24.z,  R26.w,  R29.w      VEC_201 
                 t: MULADD      R20.x,  R24.w,  R26.x,  R20.x      VEC_120 
             48  x: MULADD      R21.x,  R24.z,  R27.x,  R21.x      VEC_210 
                 y: MULADD      R21.y,  R24.z,  R27.y,  R21.y      VEC_201 
                 z: MULADD      R21.z,  R24.z,  R27.z,  R21.z      VEC_201 
                 w: MULADD      R21.w,  R24.z,  R27.w,  R21.w      VEC_201 
                 t: MULADD      R19.x,  R24.w,  R27.x,  R19.x      VEC_120 
             49  x: MULADD      R18.x,  R25.x,  R26.x,  R18.x      VEC_201 
                 y: MULADD      R20.y,  R24.w,  R26.y,  R20.y      
                 z: MULADD      R20.z,  R24.w,  R26.z,  R20.z      
                 w: MULADD      R20.w,  R24.w,  R26.w,  R20.w      
                 t: MULADD      R18.y,  R25.x,  R26.y,  R18.y      
             50  x: MULADD      R17.x,  R25.x,  R27.x,  R17.x      VEC_201 
                 y: MULADD      R19.y,  R24.w,  R27.y,  R19.y      
                 z: MULADD      R19.z,  R24.w,  R27.z,  R19.z      
                 w: MULADD      R19.w,  R24.w,  R27.w,  R19.w      
                 t: MULADD      R17.y,  R25.x,  R27.y,  R17.y      
             51  x: MULADD      R16.x,  R25.y,  R26.x,  R16.x      VEC_201 
                 y: MULADD      R16.y,  R25.y,  R26.y,  R16.y      VEC_201 
                 z: MULADD      R18.z,  R25.x,  R26.z,  R18.z      VEC_210 
                 w: MULADD      R18.w,  R25.x,  R26.w,  R18.w      VEC_201 
                 t: MULADD      R16.z,  R25.y,  R26.z,  R16.z      VEC_120 
             52  x: MULADD      R15.x,  R25.y,  R27.x,  R15.x      VEC_201 
                 y: MULADD      R15.y,  R25.y,  R27.y,  R15.y      VEC_201 
                 z: MULADD      R17.z,  R25.x,  R27.z,  R17.z      VEC_210 
                 w: MULADD      R17.w,  R25.x,  R27.w,  R17.w      VEC_201 
                 t: MULADD      R15.z,  R25.y,  R27.z,  R15.z      VEC_120 
             53  x: MULADD      R14.x,  R25.z,  R26.x,  R14.x      VEC_201 
                 y: MULADD      R14.y,  R25.z,  R26.y,  R14.y      VEC_201 
                 z: MULADD      R14.z,  R25.z,  R26.z,  R14.z      VEC_201 
                 w: MULADD      R16.w,  R25.y,  R26.w,  R16.w      VEC_210 
                 t: MULADD      R14.w,  R25.z,  R26.w,  R14.w      VEC_120 
             54  x: MULADD      R12.x,  R25.z,  R27.x,  R12.x      VEC_201 
                 y: MULADD      R12.y,  R25.z,  R27.y,  R12.y      VEC_201 
                 z: MULADD      R12.z,  R25.z,  R27.z,  R12.z      VEC_201 
                 w: MULADD      R15.w,  R25.y,  R27.w,  R15.w      VEC_210 
                 t: MULADD      R12.w,  R25.z,  R27.w,  R12.w      VEC_120 
             55  x: MULADD      R13.x,  R25.w,  R26.x,  R13.x      
                 y: MULADD      R13.y,  R25.w,  R26.y,  R13.y      
                 z: MULADD      R13.z,  R25.w,  R26.z,  R13.z      
                 w: MULADD      R13.w,  R25.w,  R26.w,  R13.w      
             56  x: MULADD      R11.x,  R25.w,  R27.x,  R11.x      
                 y: MULADD      R11.y,  R25.w,  R27.y,  R11.y      
                 z: MULADD      R11.z,  R25.w,  R27.z,  R11.z      
                 w: MULADD      R11.w,  R25.w,  R27.w,  R11.w      
        09 ALU_PUSH_BEFORE: ADDR(283) CNT(3) KCACHE0(CB0:0-15) 
             57  z: ADD         R22.z,  R22.z,  1.0f      
                 w: ADD         R22.w,  R22.z,  1.0f      
             58  x: PREDE_INT   ____,  KC0[1].w,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
        10 JUMP  POP_CNT(1) ADDR(16) 
        11 TEX: ADDR(898) CNT(8) 
             59  SAMPLE R1, R22.xwxx, t0, s0  UNNORM(XYZW) 
             60  SAMPLE R23, R22.xwxx, t1, s1  UNNORM(XYZW) 
             61  SAMPLE R24, R22.xwxx, t2, s2  UNNORM(XYZW) 
             62  SAMPLE R25, R22.xwxx, t3, s3  UNNORM(XYZW) 
             63  SAMPLE R0, R22.yzyy, t4, s4  UNNORM(XYZW) 
             64  SAMPLE R2, R22.yzyy, t5, s5  UNNORM(XYZW) 
             65  SAMPLE R28, R22.yzyy, t6, s6  UNNORM(XYZW) 
             66  SAMPLE R30, R22.yzyy, t7, s7  UNNORM(XYZW) 
        12 ALU_PUSH_BEFORE: ADDR(286) CNT(33) KCACHE0(CB0:0-15) 
             67  x: MULADD      R10.x,  R1.x,  R26.x,  R10.x      
                 y: MULADD      R10.y,  R1.x,  R26.y,  R10.y      
                 z: MULADD      R10.z,  R1.x,  R26.z,  R10.z      
                 w: MULADD      R10.w,  R1.x,  R26.w,  R10.w      
             68  x: MULADD      R9.x,  R1.x,  R27.x,  R9.x      
                 y: MULADD      R9.y,  R1.x,  R27.y,  R9.y      
                 z: MULADD      R9.z,  R1.x,  R27.z,  R9.z      
                 w: MULADD      R9.w,  R1.x,  R27.w,  R9.w      
             69  x: MULADD      R8.x,  R1.y,  R26.x,  R8.x      VEC_210 
                 y: MULADD      R8.y,  R1.y,  R26.y,  R8.y      VEC_201 
                 z: MULADD      R8.z,  R1.y,  R26.z,  R8.z      VEC_201 
                 w: MULADD      R8.w,  R1.y,  R26.w,  R8.w      VEC_201 
                 t: MULADD      R6.x,  R1.z,  R26.x,  R6.x      VEC_120 
             70  x: MULADD      R7.x,  R1.y,  R27.x,  R7.x      VEC_210 
                 y: MULADD      R7.y,  R1.y,  R27.y,  R7.y      VEC_201 
                 z: MULADD      R7.z,  R1.y,  R27.z,  R7.z      VEC_201 
                 w: MULADD      R7.w,  R1.y,  R27.w,  R7.w      VEC_201 
                 t: MULADD      R3.x,  R1.z,  R27.x,  R3.x      VEC_120 
             71  x: MULADD      R5.x,  R1.w,  R26.x,  R5.x      VEC_201 
                 y: MULADD      R6.y,  R1.z,  R26.y,  R6.y      VEC_210 
                 z: MULADD      R6.z,  R1.z,  R26.z,  R6.z      VEC_201 
                 w: MULADD      R6.w,  R1.z,  R26.w,  R6.w      VEC_201 
                 t: MULADD      R5.y,  R1.w,  R26.y,  R5.y      VEC_120 
             72  x: MULADD      R4.x,  R1.w,  R27.x,  R4.x      VEC_201 
                 y: MULADD      R3.y,  R1.z,  R27.y,  R3.y      VEC_210 
                 z: MULADD      R5.z,  R1.w,  R26.z,  R5.z      VEC_201 
                 w: MULADD      R5.w,  R1.w,  R26.w,  R5.w      VEC_201 
                 t: MULADD      R4.y,  R1.w,  R27.y,  R4.y      VEC_120 
             73  z: MULADD      R3.z,  R1.z,  R27.z,  R3.z      
                 w: MULADD      R3.w,  R1.z,  R27.w,  R3.w      
             74  z: MULADD      R4.z,  R1.w,  R27.z,  R4.z      
                 w: MULADD      R4.w,  R1.w,  R27.w,  R4.w      
             75  x: PREDE_INT   ____,  KC0[1].y,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
        13 JUMP  POP_CNT(1) ADDR(15) 
        14 ALU_POP_AFTER: ADDR(319) CNT(80) 
             76  x: MULADD      R29.x,  R23.x,  R0.x,  R29.x      
                 y: MULADD      R29.y,  R23.x,  R0.y,  R29.y      
                 z: MULADD      R29.z,  R23.x,  R0.z,  R29.z      
                 w: MULADD      R29.w,  R23.x,  R0.w,  R29.w      
             77  x: MULADD      R21.x,  R23.x,  R2.x,  R21.x      
                 y: MULADD      R21.y,  R23.x,  R2.y,  R21.y      
                 z: MULADD      R21.z,  R23.x,  R2.z,  R21.z      
                 w: MULADD      R21.w,  R23.x,  R2.w,  R21.w      
             78  x: MULADD      R20.x,  R23.y,  R0.x,  R20.x      VEC_210 
                 y: MULADD      R20.y,  R23.y,  R0.y,  R20.y      VEC_201 
                 z: MULADD      R20.z,  R23.y,  R0.z,  R20.z      VEC_201 
                 w: MULADD      R20.w,  R23.y,  R0.w,  R20.w      VEC_201 
                 t: MULADD      R18.x,  R23.z,  R0.x,  R18.x      VEC_120 
             79  x: MULADD      R19.x,  R23.y,  R2.x,  R19.x      VEC_210 
                 y: MULADD      R19.y,  R23.y,  R2.y,  R19.y      VEC_201 
                 z: MULADD      R19.z,  R23.y,  R2.z,  R19.z      VEC_201 
                 w: MULADD      R19.w,  R23.y,  R2.w,  R19.w      VEC_201 
                 t: MULADD      R17.x,  R23.z,  R2.x,  R17.x      VEC_120 
             80  x: MULADD      R16.x,  R23.w,  R0.x,  R16.x      VEC_201 
                 y: MULADD      R18.y,  R23.z,  R0.y,  R18.y      VEC_210 
                 z: MULADD      R18.z,  R23.z,  R0.z,  R18.z      VEC_201 
                 w: MULADD      R18.w,  R23.z,  R0.w,  R18.w      VEC_201 
                 t: MULADD      R16.y,  R23.w,  R0.y,  R16.y      VEC_120 
             81  x: MULADD      R15.x,  R23.w,  R2.x,  R15.x      VEC_201 
                 y: MULADD      R17.y,  R23.z,  R2.y,  R17.y      VEC_210 
                 z: MULADD      R17.z,  R23.z,  R2.z,  R17.z      VEC_201 
                 w: MULADD      R17.w,  R23.z,  R2.w,  R17.w      VEC_201 
                 t: MULADD      R15.y,  R23.w,  R2.y,  R15.y      VEC_120 
             82  x: MULADD      R14.x,  R24.x,  R0.x,  R14.x      VEC_201 
                 y: MULADD      R14.y,  R24.x,  R0.y,  R14.y      VEC_201 
                 z: MULADD      R16.z,  R23.w,  R0.z,  R16.z      
                 w: MULADD      R16.w,  R23.w,  R0.w,  R16.w      
                 t: MULADD      R14.z,  R24.x,  R0.z,  R14.z      
             83  x: MULADD      R12.x,  R24.x,  R2.x,  R12.x      VEC_201 
                 y: MULADD      R12.y,  R24.x,  R2.y,  R12.y      VEC_201 
                 z: MULADD      R15.z,  R23.w,  R2.z,  R15.z      
                 w: MULADD      R15.w,  R23.w,  R2.w,  R15.w      
                 t: MULADD      R12.z,  R24.x,  R2.z,  R12.z      
             84  x: MULADD      R13.x,  R24.y,  R0.x,  R13.x      VEC_201 
                 y: MULADD      R13.y,  R24.y,  R0.y,  R13.y      VEC_201 
                 z: MULADD      R13.z,  R24.y,  R0.z,  R13.z      VEC_201 
                 w: MULADD      R14.w,  R24.x,  R0.w,  R14.w      VEC_210 
                 t: MULADD      R13.w,  R24.y,  R0.w,  R13.w      VEC_120 
             85  x: MULADD      R11.x,  R24.y,  R2.x,  R11.x      VEC_201 
                 y: MULADD      R11.y,  R24.y,  R2.y,  R11.y      VEC_201 
                 z: MULADD      R11.z,  R24.y,  R2.z,  R11.z      VEC_201 
                 w: MULADD      R12.w,  R24.x,  R2.w,  R12.w      VEC_210 
                 t: MULADD      R11.w,  R24.y,  R2.w,  R11.w      VEC_120 
             86  x: MULADD      R10.x,  R24.z,  R0.x,  R10.x      VEC_210 
                 y: MULADD      R10.y,  R24.z,  R0.y,  R10.y      VEC_201 
                 z: MULADD      R10.z,  R24.z,  R0.z,  R10.z      VEC_201 
                 w: MULADD      R10.w,  R24.z,  R0.w,  R10.w      VEC_201 
                 t: MULADD      R8.x,  R24.w,  R0.x,  R8.x      VEC_120 
             87  x: MULADD      R9.x,  R24.z,  R2.x,  R9.x      VEC_210 
                 y: MULADD      R9.y,  R24.z,  R2.y,  R9.y      VEC_201 
                 z: MULADD      R9.z,  R24.z,  R2.z,  R9.z      VEC_201 
                 w: MULADD      R9.w,  R24.z,  R2.w,  R9.w      VEC_201 
                 t: MULADD      R7.x,  R24.w,  R2.x,  R7.x      VEC_120 
             88  x: MULADD      R6.x,  R25.x,  R0.x,  R6.x      VEC_201 
                 y: MULADD      R8.y,  R24.w,  R0.y,  R8.y      
                 z: MULADD      R8.z,  R24.w,  R0.z,  R8.z      
                 w: MULADD      R8.w,  R24.w,  R0.w,  R8.w      
                 t: MULADD      R6.y,  R25.x,  R0.y,  R6.y      
             89  x: MULADD      R3.x,  R25.x,  R2.x,  R3.x      VEC_201 
                 y: MULADD      R7.y,  R24.w,  R2.y,  R7.y      
                 z: MULADD      R7.z,  R24.w,  R2.z,  R7.z      
                 w: MULADD      R7.w,  R24.w,  R2.w,  R7.w      
                 t: MULADD      R3.y,  R25.x,  R2.y,  R3.y      
             90  x: MULADD      R5.x,  R25.y,  R0.x,  R5.x      VEC_201 
                 y: MULADD      R5.y,  R25.y,  R0.y,  R5.y      VEC_201 
                 z: MULADD      R6.z,  R25.x,  R0.z,  R6.z      VEC_210 
                 w: MULADD      R6.w,  R25.x,  R0.w,  R6.w      VEC_201 
                 t: MULADD      R5.z,  R25.y,  R0.z,  R5.z      VEC_120 
             91  x: MULADD      R4.x,  R25.y,  R2.x,  R4.x      VEC_201 
                 y: MULADD      R4.y,  R25.y,  R2.y,  R4.y      VEC_201 
                 z: MULADD      R3.z,  R25.x,  R2.z,  R3.z      VEC_210 
                 w: MULADD      R5.w,  R25.y,  R0.w,  R5.w      VEC_201 
                 t: MULADD      R4.z,  R25.y,  R2.z,  R4.z      VEC_120 
             92  w: MULADD      R3.w,  R25.x,  R2.w,  R3.w      
                 t: MULADD      R4.w,  R25.y,  R2.w,  R4.w      
        15 ALU_POP_AFTER: ADDR(399) CNT(25) 
             93  x: MULADD      R29.x,  R25.z,  R28.x,  R29.x      VEC_210 
                 y: MULADD      R29.y,  R25.z,  R28.y,  R29.y      VEC_201 
                 z: MULADD      R29.z,  R25.z,  R28.z,  R29.z      VEC_201 
                 w: MULADD      R29.w,  R25.z,  R28.w,  R29.w      VEC_201 
                 t: MULADD      R20.x,  R25.w,  R28.x,  R20.x      VEC_120 
             94  x: MULADD      R21.x,  R25.z,  R30.x,  R21.x      VEC_210 
                 y: MULADD      R20.y,  R25.w,  R28.y,  R20.y      VEC_201 
                 z: MULADD      R20.z,  R25.w,  R28.z,  R20.z      VEC_201 
                 w: MULADD      R20.w,  R25.w,  R28.w,  R20.w      VEC_201 
                 t: MULADD      R19.x,  R25.w,  R30.x,  R19.x      VEC_120 
             95  y: MULADD      R21.y,  R25.z,  R30.y,  R21.y      VEC_210 
                 z: MULADD      R21.z,  R25.z,  R30.z,  R21.z      VEC_201 
                 w: MULADD      R21.w,  R25.z,  R30.w,  R21.w      VEC_201 
                 t: MULADD      R19.y,  R25.w,  R30.y,  R19.y      VEC_120 
             96  z: MULADD      R19.z,  R25.w,  R30.z,  R19.z      VEC_201 
                 w: MULADD      R19.w,  R25.w,  R30.w,  R19.w      VEC_201 
                 t: ADD         R22.w,  R22.z,  1.0f      
             97  x: MOV         R26.x,  R28.x      
                 y: MOV         R26.y,  R28.y      
                 z: MOV         R26.z,  R28.z      
                 w: MOV         R26.w,  R28.w      
             98  x: MOV         R27.x,  R30.x      
                 y: MOV         R27.y,  R30.y      
                 z: MOV         R27.z,  R30.z      
                 w: MOV         R27.w,  R30.w      
        16 ALU_PUSH_BEFORE: ADDR(424) CNT(1) KCACHE0(CB0:0-15) 
             99  x: PREDE_INT   ____,  KC0[1].w,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
        17 JUMP  POP_CNT(1) ADDR(20) 
        18 TEX: ADDR(914) CNT(4) 
            100  SAMPLE R1, R22.xwxx, t0, s0  UNNORM(XYZW) 
            101  SAMPLE R23, R22.xwxx, t1, s1  UNNORM(XYZW) 
            102  SAMPLE R24, R22.xwxx, t2, s2  UNNORM(XYZW) 
            103  SAMPLE R25, R22.xwxx, t3, s3  UNNORM(XYZW) 
        19 ALU_POP_AFTER: ADDR(425) CNT(66) 
            104  x: MULADD      R18.x,  R1.x,  R26.x,  R18.x      VEC_201 
                 y: MULADD      R18.y,  R1.x,  R26.y,  R18.y      VEC_201 
                 z: MULADD      R18.z,  R1.x,  R26.z,  R18.z      VEC_201 
                 w: MULADD      R18.w,  R1.x,  R26.w,  R18.w      VEC_201 
                 t: ADD         R22.z,  R22.w,  1.0f      
            105  x: MULADD      R17.x,  R1.x,  R27.x,  R17.x      VEC_201 
                 y: MULADD      R17.y,  R1.x,  R27.y,  R17.y      VEC_201 
                 z: MULADD      R17.z,  R1.x,  R27.z,  R17.z      VEC_201 
                 w: MULADD      R17.w,  R1.x,  R27.w,  R17.w      VEC_201 
                 t: ADD         R22.w,  R22.w,  1.0f      
            106  x: MULADD      R16.x,  R1.y,  R26.x,  R16.x      VEC_210 
                 y: MULADD      R16.y,  R1.y,  R26.y,  R16.y      VEC_201 
                 z: MULADD      R16.z,  R1.y,  R26.z,  R16.z      VEC_201 
                 w: MULADD      R16.w,  R1.y,  R26.w,  R16.w      VEC_201 
                 t: MULADD      R14.x,  R1.z,  R26.x,  R14.x      VEC_120 
            107  x: MULADD      R15.x,  R1.y,  R27.x,  R15.x      VEC_210 
                 y: MULADD      R15.y,  R1.y,  R27.y,  R15.y      VEC_201 
                 z: MULADD      R15.z,  R1.y,  R27.z,  R15.z      VEC_201 
                 w: MULADD      R15.w,  R1.y,  R27.w,  R15.w      VEC_201 
                 t: MULADD      R12.x,  R1.z,  R27.x,  R12.x      VEC_120 
            108  x: MULADD      R13.x,  R1.w,  R26.x,  R13.x      VEC_201 
                 y: MULADD      R14.y,  R1.z,  R26.y,  R14.y      VEC_210 
                 z: MULADD      R14.z,  R1.z,  R26.z,  R14.z      VEC_201 
                 w: MULADD      R14.w,  R1.z,  R26.w,  R14.w      VEC_201 
                 t: MULADD      R13.y,  R1.w,  R26.y,  R13.y      VEC_120 
            109  x: MULADD      R11.x,  R1.w,  R27.x,  R11.x      VEC_201 
                 y: MULADD      R12.y,  R1.z,  R27.y,  R12.y      VEC_210 
                 z: MULADD      R12.z,  R1.z,  R27.z,  R12.z      VEC_201 
                 w: MULADD      R12.w,  R1.z,  R27.w,  R12.w      VEC_201 
                 t: MULADD      R11.y,  R1.w,  R27.y,  R11.y      VEC_120 
            110  x: MULADD      R10.x,  R23.x,  R26.x,  R10.x      VEC_201 
                 y: MULADD      R10.y,  R23.x,  R26.y,  R10.y      VEC_201 
                 z: MULADD      R13.z,  R1.w,  R26.z,  R13.z      
                 w: MULADD      R13.w,  R1.w,  R26.w,  R13.w      
                 t: MULADD      R10.z,  R23.x,  R26.z,  R10.z      
            111  x: MULADD      R9.x,  R23.x,  R27.x,  R9.x      VEC_201 
                 y: MULADD      R9.y,  R23.x,  R27.y,  R9.y      VEC_201 
                 z: MULADD      R11.z,  R1.w,  R27.z,  R11.z      
                 w: MULADD      R11.w,  R1.w,  R27.w,  R11.w      
                 t: MULADD      R9.z,  R23.x,  R27.z,  R9.z      
            112  x: MULADD      R8.x,  R23.y,  R26.x,  R8.x      VEC_201 
                 y: MULADD      R8.y,  R23.y,  R26.y,  R8.y      VEC_201 
                 z: MULADD      R8.z,  R23.y,  R26.z,  R8.z      VEC_201 
                 w: MULADD      R10.w,  R23.x,  R26.w,  R10.w      VEC_210 
                 t: MULADD      R8.w,  R23.y,  R26.w,  R8.w      VEC_120 
            113  x: MULADD      R7.x,  R23.y,  R27.x,  R7.x      VEC_201 
                 y: MULADD      R7.y,  R23.y,  R27.y,  R7.y      VEC_201 
                 z: MULADD      R7.z,  R23.y,  R27.z,  R7.z      VEC_201 
                 w: MULADD      R9.w,  R23.x,  R27.w,  R9.w      VEC_210 
                 t: MULADD      R7.w,  R23.y,  R27.w,  R7.w      VEC_120 
            114  x: MULADD      R6.x,  R23.z,  R26.x,  R6.x      VEC_210 
                 y: MULADD      R6.y,  R23.z,  R26.y,  R6.y      VEC_201 
                 z: MULADD      R6.z,  R23.z,  R26.z,  R6.z      VEC_201 
                 w: MULADD      R6.w,  R23.z,  R26.w,  R6.w      VEC_201 
                 t: MULADD      R5.x,  R23.w,  R26.x,  R5.x      VEC_120 
            115  x: MULADD      R3.x,  R23.z,  R27.x,  R3.x      VEC_210 
                 y: MULADD      R5.y,  R23.w,  R26.y,  R5.y      VEC_201 
                 z: MULADD      R5.z,  R23.w,  R26.z,  R5.z      VEC_201 
                 w: MULADD      R5.w,  R23.w,  R26.w,  R5.w      VEC_201 
                 t: MULADD      R4.x,  R23.w,  R27.x,  R4.x      VEC_120 
            116  y: MULADD      R3.y,  R23.z,  R27.y,  R3.y      VEC_210 
                 z: MULADD      R3.z,  R23.z,  R27.z,  R3.z      VEC_201 
                 w: MULADD      R3.w,  R23.z,  R27.w,  R3.w      VEC_201 
                 t: MULADD      R4.y,  R23.w,  R27.y,  R4.y      VEC_120 
            117  z: MULADD      R4.z,  R23.w,  R27.z,  R4.z      
                 w: MULADD      R4.w,  R23.w,  R27.w,  R4.w      
        20 ALU_PUSH_BEFORE: ADDR(491) CNT(1) KCACHE0(CB0:0-15) 
            118  x: PREDE_INT   ____,  KC0[1].w,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
        21 JUMP  POP_CNT(1) ADDR(24) 
        22 TEX: ADDR(922) CNT(8) 
            119  SAMPLE R1, R22.xwxx, t0, s0  UNNORM(XYZW) 
            120  SAMPLE R23, R22.xwxx, t1, s1  UNNORM(XYZW) 
            121  SAMPLE R28, R22.xwxx, t2, s2  UNNORM(XYZW) 
            122  SAMPLE R30, R22.xwxx, t3, s3  UNNORM(XYZW) 
            123  SAMPLE R0, R22.yzyy, t4, s4  UNNORM(XYZW) 
            124  SAMPLE R2, R22.yzyy, t5, s5  UNNORM(XYZW) 
            125  SAMPLE R26, R22.yzyy, t6, s6  UNNORM(XYZW) 
            126  SAMPLE R27, R22.yzyy, t7, s7  UNNORM(XYZW) 
        23 ALU_POP_AFTER: ADDR(492) CNT(88) 
            127  x: MULADD      R29.x,  R24.x,  R0.x,  R29.x      
                 y: MULADD      R29.y,  R24.x,  R0.y,  R29.y      
                 z: MULADD      R29.z,  R24.x,  R0.z,  R29.z      
                 w: MULADD      R29.w,  R24.x,  R0.w,  R29.w      
            128  x: MULADD      R21.x,  R24.x,  R2.x,  R21.x      
                 y: MULADD      R21.y,  R24.x,  R2.y,  R21.y      
                 z: MULADD      R21.z,  R24.x,  R2.z,  R21.z      
                 w: MULADD      R21.w,  R24.x,  R2.w,  R21.w      
            129  x: MULADD      R20.x,  R24.y,  R0.x,  R20.x      VEC_210 
                 y: MULADD      R20.y,  R24.y,  R0.y,  R20.y      VEC_201 
                 z: MULADD      R20.z,  R24.y,  R0.z,  R20.z      VEC_201 
                 w: MULADD      R20.w,  R24.y,  R0.w,  R20.w      VEC_201 
                 t: MULADD      R18.x,  R24.z,  R0.x,  R18.x      VEC_120 
            130  x: MULADD      R19.x,  R24.y,  R2.x,  R19.x      VEC_210 
                 y: MULADD      R19.y,  R24.y,  R2.y,  R19.y      VEC_201 
                 z: MULADD      R19.z,  R24.y,  R2.z,  R19.z      VEC_201 
                 w: MULADD      R19.w,  R24.y,  R2.w,  R19.w      VEC_201 
                 t: MULADD      R17.x,  R24.z,  R2.x,  R17.x      VEC_120 
            131  x: MULADD      R16.x,  R24.w,  R0.x,  R16.x      VEC_201 
                 y: MULADD      R18.y,  R24.z,  R0.y,  R18.y      VEC_210 
                 z: MULADD      R18.z,  R24.z,  R0.z,  R18.z      VEC_201 
                 w: MULADD      R18.w,  R24.z,  R0.w,  R18.w      VEC_201 
                 t: MULADD      R16.y,  R24.w,  R0.y,  R16.y      VEC_120 
            132  x: MULADD      R15.x,  R24.w,  R2.x,  R15.x      VEC_201 
                 y: MULADD      R17.y,  R24.z,  R2.y,  R17.y      VEC_210 
                 z: MULADD      R17.z,  R24.z,  R2.z,  R17.z      VEC_201 
                 w: MULADD      R17.w,  R24.z,  R2.w,  R17.w      VEC_201 
                 t: MULADD      R15.y,  R24.w,  R2.y,  R15.y      VEC_120 
            133  x: MULADD      R14.x,  R25.x,  R0.x,  R14.x      VEC_201 
                 y: MULADD      R14.y,  R25.x,  R0.y,  R14.y      VEC_201 
                 z: MULADD      R16.z,  R24.w,  R0.z,  R16.z      
                 w: MULADD      R16.w,  R24.w,  R0.w,  R16.w      
                 t: MULADD      R14.z,  R25.x,  R0.z,  R14.z      
            134  x: MULADD      R12.x,  R25.x,  R2.x,  R12.x      VEC_201 
                 y: MULADD      R12.y,  R25.x,  R2.y,  R12.y      VEC_201 
                 z: MULADD      R15.z,  R24.w,  R2.z,  R15.z      
                 w: MULADD      R15.w,  R24.w,  R2.w,  R15.w      
                 t: MULADD      R12.z,  R25.x,  R2.z,  R12.z      
            135  x: MULADD      R13.x,  R25.y,  R0.x,  R13.x      VEC_201 
                 y: MULADD      R13.y,  R25.y,  R0.y,  R13.y      VEC_201 
                 z: MULADD      R13.z,  R25.y,  R0.z,  R13.z      VEC_201 
                 w: MULADD      R14.w,  R25.x,  R0.w,  R14.w      VEC_210 
                 t: MULADD      R13.w,  R25.y,  R0.w,  R13.w      VEC_120 
            136  x: MULADD      R11.x,  R25.y,  R2.x,  R11.x      VEC_201 
                 y: MULADD      R11.y,  R25.y,  R2.y,  R11.y      VEC_201 
                 z: MULADD      R11.z,  R25.y,  R2.z,  R11.z      VEC_201 
                 w: MULADD      R12.w,  R25.x,  R2.w,  R12.w      VEC_210 
                 t: MULADD      R11.w,  R25.y,  R2.w,  R11.w      VEC_120 
            137  x: MULADD      R10.x,  R25.z,  R0.x,  R10.x      VEC_210 
                 y: MULADD      R10.y,  R25.z,  R0.y,  R10.y      VEC_201 
                 z: MULADD      R10.z,  R25.z,  R0.z,  R10.z      VEC_201 
                 w: MULADD      R10.w,  R25.z,  R0.w,  R10.w      VEC_201 
                 t: MULADD      R8.x,  R25.w,  R0.x,  R8.x      VEC_120 
            138  x: MULADD      R9.x,  R25.z,  R2.x,  R9.x      VEC_210 
                 y: MULADD      R9.y,  R25.z,  R2.y,  R9.y      VEC_201 
                 z: MULADD      R9.z,  R25.z,  R2.z,  R9.z      VEC_201 
                 w: MULADD      R9.w,  R25.z,  R2.w,  R9.w      VEC_201 
                 t: MULADD      R7.x,  R25.w,  R2.x,  R7.x      VEC_120 
            139  x: MULADD      R6.x,  R1.x,  R0.x,  R6.x      VEC_201 
                 y: MULADD      R8.y,  R25.w,  R0.y,  R8.y      
                 z: MULADD      R8.z,  R25.w,  R0.z,  R8.z      
                 w: MULADD      R8.w,  R25.w,  R0.w,  R8.w      
                 t: MULADD      R6.y,  R1.x,  R0.y,  R6.y      
            140  x: MULADD      R3.x,  R1.x,  R2.x,  R3.x      VEC_201 
                 y: MULADD      R7.y,  R25.w,  R2.y,  R7.y      
                 z: MULADD      R7.z,  R25.w,  R2.z,  R7.z      
                 w: MULADD      R7.w,  R25.w,  R2.w,  R7.w      
                 t: MULADD      R3.y,  R1.x,  R2.y,  R3.y      
            141  x: MULADD      R5.x,  R1.y,  R0.x,  R5.x      VEC_201 
                 y: MULADD      R5.y,  R1.y,  R0.y,  R5.y      VEC_201 
                 z: MULADD      R6.z,  R1.x,  R0.z,  R6.z      VEC_210 
                 w: MULADD      R6.w,  R1.x,  R0.w,  R6.w      VEC_201 
                 t: MULADD      R5.z,  R1.y,  R0.z,  R5.z      VEC_120 
            142  x: MULADD      R4.x,  R1.y,  R2.x,  R4.x      VEC_201 
                 y: MULADD      R4.y,  R1.y,  R2.y,  R4.y      VEC_201 
                 z: MULADD      R3.z,  R1.x,  R2.z,  R3.z      VEC_210 
                 w: MULADD      R5.w,  R1.y,  R0.w,  R5.w      VEC_201 
                 t: MULADD      R4.z,  R1.y,  R2.z,  R4.z      VEC_120 
            143  w: MULADD      R3.w,  R1.x,  R2.w,  R3.w      
                 t: MULADD      R4.w,  R1.y,  R2.w,  R4.w      
            144  x: MOV         R24.x,  R28.x      
                 y: MOV         R24.y,  R28.y      
                 z: MOV         R24.z,  R28.z      
                 w: MOV         R24.w,  R28.w      
            145  x: MOV         R25.x,  R30.x      
                 y: MOV         R25.y,  R30.y      
                 z: MOV         R25.z,  R30.z      
                 w: MOV         R25.w,  R30.w      
        24 ALU_PUSH_BEFORE: ADDR(580) CNT(1) KCACHE0(CB0:0-15) 
            146  x: PREDE_INT   ____,  KC0[1].y,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
        25 JUMP  POP_CNT(1) ADDR(27) 
        26 ALU_POP_AFTER: ADDR(581) CNT(66) 
            147  x: MULADD      R29.x,  R1.z,  R2.x,  R29.x      VEC_210 
                 y: MULADD      R29.y,  R1.z,  R2.y,  R29.y      VEC_201 
                 z: MULADD      R29.z,  R1.z,  R2.z,  R29.z      VEC_201 
                 w: MULADD      R29.w,  R1.z,  R2.w,  R29.w      VEC_201 
                 t: MULADD      R20.x,  R1.w,  R2.x,  R20.x      VEC_120 
            148  x: MULADD      R21.x,  R1.z,  R26.x,  R21.x      VEC_210 
                 y: MULADD      R21.y,  R1.z,  R26.y,  R21.y      VEC_201 
                 z: MULADD      R21.z,  R1.z,  R26.z,  R21.z      VEC_201 
                 w: MULADD      R21.w,  R1.z,  R26.w,  R21.w      VEC_201 
                 t: MULADD      R19.x,  R1.w,  R26.x,  R19.x      VEC_120 
            149  x: MULADD      R18.x,  R23.x,  R2.x,  R18.x      VEC_201 
                 y: MULADD      R20.y,  R1.w,  R2.y,  R20.y      
                 z: MULADD      R20.z,  R1.w,  R2.z,  R20.z      
                 w: MULADD      R20.w,  R1.w,  R2.w,  R20.w      
                 t: MULADD      R18.y,  R23.x,  R2.y,  R18.y      
            150  x: MULADD      R17.x,  R23.x,  R26.x,  R17.x      VEC_201 
                 y: MULADD      R19.y,  R1.w,  R26.y,  R19.y      
                 z: MULADD      R19.z,  R1.w,  R26.z,  R19.z      
                 w: MULADD      R19.w,  R1.w,  R26.w,  R19.w      
                 t: MULADD      R17.y,  R23.x,  R26.y,  R17.y      
            151  x: MULADD      R16.x,  R23.y,  R2.x,  R16.x      VEC_201 
                 y: MULADD      R16.y,  R23.y,  R2.y,  R16.y      VEC_201 
                 z: MULADD      R18.z,  R23.x,  R2.z,  R18.z      VEC_210 
                 w: MULADD      R18.w,  R23.x,  R2.w,  R18.w      VEC_201 
                 t: MULADD      R16.z,  R23.y,  R2.z,  R16.z      VEC_120 
            152  x: MULADD      R15.x,  R23.y,  R26.x,  R15.x      VEC_201 
                 y: MULADD      R15.y,  R23.y,  R26.y,  R15.y      VEC_201 
                 z: MULADD      R17.z,  R23.x,  R26.z,  R17.z      VEC_210 
                 w: MULADD      R17.w,  R23.x,  R26.w,  R17.w      VEC_201 
                 t: MULADD      R15.z,  R23.y,  R26.z,  R15.z      VEC_120 
            153  x: MULADD      R14.x,  R23.z,  R2.x,  R14.x      VEC_201 
                 y: MULADD      R14.y,  R23.z,  R2.y,  R14.y      VEC_201 
                 z: MULADD      R14.z,  R23.z,  R2.z,  R14.z      VEC_201 
                 w: MULADD      R16.w,  R23.y,  R2.w,  R16.w      VEC_210 
                 t: MULADD      R14.w,  R23.z,  R2.w,  R14.w      VEC_120 
            154  x: MULADD      R12.x,  R23.z,  R26.x,  R12.x      VEC_201 
                 y: MULADD      R12.y,  R23.z,  R26.y,  R12.y      VEC_201 
                 z: MULADD      R12.z,  R23.z,  R26.z,  R12.z      VEC_201 
                 w: MULADD      R15.w,  R23.y,  R26.w,  R15.w      VEC_210 
                 t: MULADD      R12.w,  R23.z,  R26.w,  R12.w      VEC_120 
            155  x: MULADD      R13.x,  R23.w,  R2.x,  R13.x      VEC_210 
                 y: MULADD      R13.y,  R23.w,  R2.y,  R13.y      VEC_201 
                 z: MULADD      R13.z,  R23.w,  R2.z,  R13.z      VEC_201 
                 w: MULADD      R13.w,  R23.w,  R2.w,  R13.w      VEC_201 
                 t: MULADD      R6.x,  R24.z,  R2.x,  R6.x      VEC_120 
            156  x: MULADD      R11.x,  R23.w,  R26.x,  R11.x      VEC_210 
                 y: MULADD      R11.y,  R23.w,  R26.y,  R11.y      VEC_201 
                 z: MULADD      R11.z,  R23.w,  R26.z,  R11.z      VEC_201 
                 w: MULADD      R11.w,  R23.w,  R26.w,  R11.w      VEC_201 
                 t: MULADD      R3.x,  R24.z,  R26.x,  R3.x      VEC_120 
            157  x: MULADD      R5.x,  R24.w,  R2.x,  R5.x      VEC_201 
                 y: MULADD      R6.y,  R24.z,  R2.y,  R6.y      VEC_210 
                 z: MULADD      R6.z,  R24.z,  R2.z,  R6.z      VEC_201 
                 w: MULADD      R6.w,  R24.z,  R2.w,  R6.w      VEC_201 
                 t: MULADD      R5.y,  R24.w,  R2.y,  R5.y      VEC_120 
            158  x: MULADD      R4.x,  R24.w,  R26.x,  R4.x      VEC_201 
                 y: MULADD      R3.y,  R24.z,  R26.y,  R3.y      VEC_210 
                 z: MULADD      R5.z,  R24.w,  R2.z,  R5.z      VEC_201 
                 w: MULADD      R5.w,  R24.w,  R2.w,  R5.w      VEC_201 
                 t: MULADD      R4.y,  R24.w,  R26.y,  R4.y      VEC_120 
            159  z: MULADD      R3.z,  R24.z,  R26.z,  R3.z      VEC_201 
                 w: MULADD      R3.w,  R24.z,  R26.w,  R3.w      VEC_201 
                 t: ADD         R22.z,  R22.w,  1.0f      
            160  z: MULADD      R4.z,  R24.w,  R26.z,  R4.z      
                 w: MULADD      R4.w,  R24.w,  R26.w,  R4.w      
            161  w: ADD         R22.w,  R22.w,  1.0f      
        27 ALU_PUSH_BEFORE: ADDR(647) CNT(1) KCACHE0(CB0:0-15) 
            162  x: PREDE_INT   ____,  KC0[1].w,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
        28 JUMP  POP_CNT(1) ADDR(31) 
        29 TEX: ADDR(938) CNT(8) 
            163  SAMPLE R0, R22.xwxx, t0, s0  UNNORM(XYZW) 
            164  SAMPLE R23, R22.xwxx, t1, s1  UNNORM(XYZW) 
            165  SAMPLE R24, R22.xwxx, t2, s2  UNNORM(XYZW) 
            166  SAMPLE R28, R22.xwxx, t3, s3  UNNORM(XYZW) 
            167  SAMPLE R1, R22.yzyy, t4, s4  UNNORM(XYZW) 
            168  SAMPLE R2, R22.yzyy, t5, s5  UNNORM(XYZW) 
            169  SAMPLE R26, R22.yzyy, t6, s6  UNNORM(XYZW) 
            170  SAMPLE R27, R22.yzyy, t7, s7  UNNORM(XYZW) 
        30 ALU_POP_AFTER: ADDR(648) CNT(84) 
            171  x: MULADD      R29.x,  R25.x,  R1.x,  R29.x      
                 y: MULADD      R29.y,  R25.x,  R1.y,  R29.y      
                 z: MULADD      R29.z,  R25.x,  R1.z,  R29.z      
                 w: MULADD      R29.w,  R25.x,  R1.w,  R29.w      
            172  x: MULADD      R21.x,  R25.x,  R2.x,  R21.x      
                 y: MULADD      R21.y,  R25.x,  R2.y,  R21.y      
                 z: MULADD      R21.z,  R25.x,  R2.z,  R21.z      
                 w: MULADD      R21.w,  R25.x,  R2.w,  R21.w      
            173  x: MULADD      R20.x,  R25.y,  R1.x,  R20.x      VEC_210 
                 y: MULADD      R20.y,  R25.y,  R1.y,  R20.y      VEC_201 
                 z: MULADD      R20.z,  R25.y,  R1.z,  R20.z      VEC_201 
                 w: MULADD      R20.w,  R25.y,  R1.w,  R20.w      VEC_201 
                 t: MULADD      R18.x,  R25.z,  R1.x,  R18.x      VEC_120 
            174  x: MULADD      R19.x,  R25.y,  R2.x,  R19.x      VEC_210 
                 y: MULADD      R19.y,  R25.y,  R2.y,  R19.y      VEC_201 
                 z: MULADD      R19.z,  R25.y,  R2.z,  R19.z      VEC_201 
                 w: MULADD      R19.w,  R25.y,  R2.w,  R19.w      VEC_201 
                 t: MULADD      R17.x,  R25.z,  R2.x,  R17.x      VEC_120 
            175  x: MULADD      R16.x,  R25.w,  R1.x,  R16.x      VEC_201 
                 y: MULADD      R18.y,  R25.z,  R1.y,  R18.y      VEC_210 
                 z: MULADD      R18.z,  R25.z,  R1.z,  R18.z      VEC_201 
                 w: MULADD      R18.w,  R25.z,  R1.w,  R18.w      VEC_201 
                 t: MULADD      R16.y,  R25.w,  R1.y,  R16.y      VEC_120 
            176  x: MULADD      R15.x,  R25.w,  R2.x,  R15.x      VEC_201 
                 y: MULADD      R17.y,  R25.z,  R2.y,  R17.y      VEC_210 
                 z: MULADD      R17.z,  R25.z,  R2.z,  R17.z      VEC_201 
                 w: MULADD      R17.w,  R25.z,  R2.w,  R17.w      VEC_201 
                 t: MULADD      R15.y,  R25.w,  R2.y,  R15.y      VEC_120 
            177  x: MULADD      R14.x,  R0.x,  R1.x,  R14.x      VEC_201 
                 y: MULADD      R14.y,  R0.x,  R1.y,  R14.y      VEC_201 
                 z: MULADD      R16.z,  R25.w,  R1.z,  R16.z      
                 w: MULADD      R16.w,  R25.w,  R1.w,  R16.w      
                 t: MULADD      R14.z,  R0.x,  R1.z,  R14.z      
            178  x: MULADD      R12.x,  R0.x,  R2.x,  R12.x      VEC_201 
                 y: MULADD      R12.y,  R0.x,  R2.y,  R12.y      VEC_201 
                 z: MULADD      R15.z,  R25.w,  R2.z,  R15.z      
                 w: MULADD      R15.w,  R25.w,  R2.w,  R15.w      
                 t: MULADD      R12.z,  R0.x,  R2.z,  R12.z      
            179  x: MULADD      R13.x,  R0.y,  R1.x,  R13.x      VEC_201 
                 y: MULADD      R13.y,  R0.y,  R1.y,  R13.y      VEC_201 
                 z: MULADD      R13.z,  R0.y,  R1.z,  R13.z      VEC_201 
                 w: MULADD      R14.w,  R0.x,  R1.w,  R14.w      VEC_210 
                 t: MULADD      R13.w,  R0.y,  R1.w,  R13.w      VEC_120 
            180  x: MULADD      R11.x,  R0.y,  R2.x,  R11.x      VEC_201 
                 y: MULADD      R11.y,  R0.y,  R2.y,  R11.y      VEC_201 
                 z: MULADD      R11.z,  R0.y,  R2.z,  R11.z      VEC_201 
                 w: MULADD      R12.w,  R0.x,  R2.w,  R12.w      VEC_210 
                 t: MULADD      R11.w,  R0.y,  R2.w,  R11.w      VEC_120 
            181  x: MULADD      R10.x,  R0.z,  R1.x,  R10.x      VEC_210 
                 y: MULADD      R10.y,  R0.z,  R1.y,  R10.y      VEC_201 
                 z: MULADD      R10.z,  R0.z,  R1.z,  R10.z      VEC_201 
                 w: MULADD      R10.w,  R0.z,  R1.w,  R10.w      VEC_201 
                 t: MULADD      R8.x,  R0.w,  R1.x,  R8.x      VEC_120 
            182  x: MULADD      R9.x,  R0.z,  R2.x,  R9.x      VEC_210 
                 y: MULADD      R9.y,  R0.z,  R2.y,  R9.y      VEC_201 
                 z: MULADD      R9.z,  R0.z,  R2.z,  R9.z      VEC_201 
                 w: MULADD      R9.w,  R0.z,  R2.w,  R9.w      VEC_201 
                 t: MULADD      R7.x,  R0.w,  R2.x,  R7.x      VEC_120 
            183  x: MULADD      R6.x,  R23.x,  R1.x,  R6.x      VEC_201 
                 y: MULADD      R8.y,  R0.w,  R1.y,  R8.y      
                 z: MULADD      R8.z,  R0.w,  R1.z,  R8.z      
                 w: MULADD      R8.w,  R0.w,  R1.w,  R8.w      
                 t: MULADD      R6.y,  R23.x,  R1.y,  R6.y      
            184  x: MULADD      R3.x,  R23.x,  R2.x,  R3.x      VEC_201 
                 y: MULADD      R7.y,  R0.w,  R2.y,  R7.y      
                 z: MULADD      R7.z,  R0.w,  R2.z,  R7.z      
                 w: MULADD      R7.w,  R0.w,  R2.w,  R7.w      
                 t: MULADD      R3.y,  R23.x,  R2.y,  R3.y      
            185  x: MULADD      R5.x,  R23.y,  R1.x,  R5.x      VEC_201 
                 y: MULADD      R5.y,  R23.y,  R1.y,  R5.y      VEC_201 
                 z: MULADD      R6.z,  R23.x,  R1.z,  R6.z      VEC_210 
                 w: MULADD      R6.w,  R23.x,  R1.w,  R6.w      VEC_201 
                 t: MULADD      R5.z,  R23.y,  R1.z,  R5.z      VEC_120 
            186  x: MULADD      R4.x,  R23.y,  R2.x,  R4.x      VEC_201 
                 y: MULADD      R4.y,  R23.y,  R2.y,  R4.y      VEC_201 
                 z: MULADD      R3.z,  R23.x,  R2.z,  R3.z      VEC_210 
                 w: MULADD      R5.w,  R23.y,  R1.w,  R5.w      VEC_201 
                 t: MULADD      R4.z,  R23.y,  R2.z,  R4.z      VEC_120 
            187  w: MULADD      R3.w,  R23.x,  R2.w,  R3.w      
                 t: MULADD      R4.w,  R23.y,  R2.w,  R4.w      
            188  x: MOV         R25.x,  R28.x      
                 y: MOV         R25.y,  R28.y      
                 z: MOV         R25.z,  R28.z      
                 w: MOV         R25.w,  R28.w      
        31 ALU: ADDR(732) CNT(80) 
            189  x: MULADD      R29.x,  R23.z,  R26.x,  R29.x      VEC_210 
                 y: MULADD      R29.y,  R23.z,  R26.y,  R29.y      VEC_201 
                 z: MULADD      R29.z,  R23.z,  R26.z,  R29.z      VEC_201 
                 w: MULADD      R29.w,  R23.z,  R26.w,  R29.w      VEC_201 
                 t: MULADD      R20.x,  R23.w,  R26.x,  R20.x      VEC_120 
            190  x: MULADD      R21.x,  R23.z,  R27.x,  R21.x      VEC_210 
                 y: MULADD      R21.y,  R23.z,  R27.y,  R21.y      VEC_201 
                 z: MULADD      R21.z,  R23.z,  R27.z,  R21.z      VEC_201 
                 w: MULADD      R21.w,  R23.z,  R27.w,  R21.w      VEC_201 
                 t: MULADD      R19.x,  R23.w,  R27.x,  R19.x      VEC_120 
            191  x: MULADD      R18.x,  R24.x,  R26.x,  R18.x      VEC_201 
                 y: MULADD      R20.y,  R23.w,  R26.y,  R20.y      
                 z: MULADD      R20.z,  R23.w,  R26.z,  R20.z      
                 w: MULADD      R20.w,  R23.w,  R26.w,  R20.w      
                 t: MULADD      R18.y,  R24.x,  R26.y,  R18.y      
            192  x: MULADD      R17.x,  R24.x,  R27.x,  R17.x      VEC_201 
                 y: MULADD      R19.y,  R23.w,  R27.y,  R19.y      
                 z: MULADD      R19.z,  R23.w,  R27.z,  R19.z      
                 w: MULADD      R19.w,  R23.w,  R27.w,  R19.w      
                 t: MULADD      R17.y,  R24.x,  R27.y,  R17.y      
            193  x: MULADD      R16.x,  R24.y,  R26.x,  R16.x      VEC_201 
                 y: MULADD      R16.y,  R24.y,  R26.y,  R16.y      VEC_201 
                 z: MULADD      R18.z,  R24.x,  R26.z,  R18.z      VEC_210 
                 w: MULADD      R18.w,  R24.x,  R26.w,  R18.w      VEC_201 
                 t: MULADD      R16.z,  R24.y,  R26.z,  R16.z      VEC_120 
            194  x: MULADD      R15.x,  R24.y,  R27.x,  R15.x      VEC_201 
                 y: MULADD      R15.y,  R24.y,  R27.y,  R15.y      VEC_201 
                 z: MULADD      R17.z,  R24.x,  R27.z,  R17.z      VEC_210 
                 w: MULADD      R17.w,  R24.x,  R27.w,  R17.w      VEC_201 
                 t: MULADD      R15.z,  R24.y,  R27.z,  R15.z      VEC_120 
            195  x: MULADD      R14.x,  R24.z,  R26.x,  R14.x      VEC_201 
                 y: MULADD      R14.y,  R24.z,  R26.y,  R14.y      VEC_201 
                 z: MULADD      R14.z,  R24.z,  R26.z,  R14.z      VEC_201 
                 w: MULADD      R16.w,  R24.y,  R26.w,  R16.w      VEC_210 
                 t: MULADD      R14.w,  R24.z,  R26.w,  R14.w      VEC_120 
            196  x: MULADD      R12.x,  R24.z,  R27.x,  R12.x      VEC_201 
                 y: MULADD      R12.y,  R24.z,  R27.y,  R12.y      VEC_201 
                 z: MULADD      R12.z,  R24.z,  R27.z,  R12.z      VEC_201 
                 w: MULADD      R15.w,  R24.y,  R27.w,  R15.w      VEC_210 
                 t: MULADD      R12.w,  R24.z,  R27.w,  R12.w      VEC_120 
            197  x: MULADD      R13.x,  R24.w,  R26.x,  R13.x      VEC_210 
                 y: MULADD      R13.y,  R24.w,  R26.y,  R13.y      VEC_201 
                 z: MULADD      R13.z,  R24.w,  R26.z,  R13.z      VEC_201 
                 w: MULADD      R13.w,  R24.w,  R26.w,  R13.w      VEC_201 
                 t: MULADD      R8.x,  R25.y,  R26.x,  R8.x      VEC_120 
            198  x: MULADD      R11.x,  R24.w,  R27.x,  R11.x      VEC_210 
                 y: MULADD      R11.y,  R24.w,  R27.y,  R11.y      VEC_201 
                 z: MULADD      R11.z,  R24.w,  R27.z,  R11.z      VEC_201 
                 w: MULADD      R11.w,  R24.w,  R27.w,  R11.w      VEC_201 
                 t: MULADD      R7.x,  R25.y,  R27.x,  R7.x      VEC_120 
            199  x: MULADD      R10.x,  R25.x,  R26.x,  R10.x      
                 y: MULADD      R10.y,  R25.x,  R26.y,  R10.y      
                 z: MULADD      R10.z,  R25.x,  R26.z,  R10.z      
                 w: MULADD      R10.w,  R25.x,  R26.w,  R10.w      
            200  x: MULADD      R9.x,  R25.x,  R27.x,  R9.x      
                 y: MULADD      R9.y,  R25.x,  R27.y,  R9.y      
                 z: MULADD      R9.z,  R25.x,  R27.z,  R9.z      
                 w: MULADD      R9.w,  R25.x,  R27.w,  R9.w      
            201  x: MULADD      R6.x,  R25.z,  R26.x,  R6.x      VEC_210 
                 y: MULADD      R8.y,  R25.y,  R26.y,  R8.y      VEC_201 
                 z: MULADD      R8.z,  R25.y,  R26.z,  R8.z      VEC_201 
                 w: MULADD      R8.w,  R25.y,  R26.w,  R8.w      VEC_201 
                 t: MULADD      R5.x,  R25.w,  R26.x,  R5.x      VEC_120 
            202  x: MULADD      R28.x,  R25.z,  R27.x,  R3.x      VEC_210 
                 y: MULADD      R7.y,  R25.y,  R27.y,  R7.y      VEC_201 
                 z: MULADD      R7.z,  R25.y,  R27.z,  R7.z      VEC_201 
                 w: MULADD      R7.w,  R25.y,  R27.w,  R7.w      VEC_201 
                 t: MULADD      R4.x,  R25.w,  R27.x,  R4.x      VEC_120 
            203  y: MULADD      R6.y,  R25.z,  R26.y,  R6.y      VEC_210 
                 z: MULADD      R6.z,  R25.z,  R26.z,  R6.z      VEC_201 
                 w: MULADD      R6.w,  R25.z,  R26.w,  R6.w      VEC_201 
                 t: MULADD      R5.y,  R25.w,  R26.y,  R5.y      VEC_120 
            204  y: MULADD      R28.y,  R25.z,  R27.y,  R3.y      VEC_210 
                 z: MULADD      R5.z,  R25.w,  R26.z,  R5.z      VEC_201 
                 w: MULADD      R5.w,  R25.w,  R26.w,  R5.w      VEC_201 
                 t: MULADD      R4.y,  R25.w,  R27.y,  R4.y      VEC_120 
            205  z: MULADD      R28.z,  R25.z,  R27.z,  R3.z      
                 w: MULADD      R28.w,  R25.z,  R27.w,  R3.w      
            206  z: MULADD      R4.z,  R25.w,  R27.z,  R4.z      
                 w: MULADD      R4.w,  R25.w,  R27.w,  R4.w      
    32 ENDLOOP i0 PASS_JUMP_ADDR(3) 
    33 ALU: ADDR(812) CNT(20) KCACHE0(CB0:0-15) 
        207  t: MULLO_INT   T0.z,  R22.x,  KC0[0].z      
        208  t: MULLO_INT   ____,  R22.y,  KC0[0].w      
        209  w: ADD_INT     ____,  T0.z,  PS208      
        210  x: ADD_INT     T0.x,  PV209.w,  (0x00000003, 4.203895393e-45f).x      
             y: ADD_INT     ____,  PV209.w,  1      
             z: ADD_INT     ____,  PV209.w,  0.0f      
             w: ADD_INT     T0.w,  PV209.w,  (0x00000002, 2.802596929e-45f).y      
        211  x: LSHL        R0.x,  PV210.z,  (0x00000002, 2.802596929e-45f).x      
             y: ADD_INT     R0.y,  PV210.z,  (0x00000004, 5.605193857e-45f).y      
             z: ADD_INT     R0.z,  PV210.w,  (0x00000004, 5.605193857e-45f).y      
             w: ADD_INT     R0.w,  PV210.y,  (0x00000004, 5.605193857e-45f).y      
             t: LSHL        R1.x,  PV210.y,  (0x00000002, 2.802596929e-45f).x      
        212  x: LSHL        R2.x,  T0.w,  (0x00000002, 2.802596929e-45f).x      
             y: ADD_INT     R1.y,  T0.x,  (0x00000004, 5.605193857e-45f).y      
             z: ADD_INT     R1.z,  PV211.w,  (0x00000004, 5.605193857e-45f).y      
             w: ADD_INT     R1.w,  PV211.y,  (0x00000004, 5.605193857e-45f).y      
             t: LSHL        R3.x,  T0.x,  (0x00000002, 2.802596929e-45f).x      
    34 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R29, ELEM_SIZE(3) 
    35 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R21, ELEM_SIZE(3) 
    36 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R20, ELEM_SIZE(3) 
    37 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R19, ELEM_SIZE(3) 
    38 ALU: ADDR(832) CNT(12) 
        213  x: LSHL        R3.x,  R0.y,  (0x00000002, 2.802596929e-45f).x      
             y: ADD_INT     R0.y,  R0.z,  (0x00000004, 5.605193857e-45f).y      
             z: ADD_INT     R2.z,  R1.w,  (0x00000004, 5.605193857e-45f).y      
             w: ADD_INT     R0.w,  R1.y,  (0x00000004, 5.605193857e-45f).y      VEC_120 
             t: LSHL        R2.x,  R0.w,  (0x00000002, 2.802596929e-45f).x      
        214  x: LSHL        R1.x,  R0.z,  (0x00000002, 2.802596929e-45f).x      
             y: ADD_INT     R1.y,  R1.z,  (0x00000004, 5.605193857e-45f).y      VEC_120 
             z: ADD_INT     R0.z,  PV213.w,  (0x00000004, 5.605193857e-45f).y      
             w: ADD_INT     R2.w,  PV213.y,  (0x00000004, 5.605193857e-45f).y      
             t: LSHL        R0.x,  R1.y,  (0x00000002, 2.802596929e-45f).x      
    39 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R18, ELEM_SIZE(3) 
    40 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R17, ELEM_SIZE(3) 
    41 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R16, ELEM_SIZE(3) 
    42 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R15, ELEM_SIZE(3) 
    43 ALU: ADDR(844) CNT(10) 
        215  x: LSHL        R0.x,  R1.w,  (0x00000002, 2.802596929e-45f).x      
             y: ADD_INT     R2.y,  R2.z,  (0x00000004, 5.605193857e-45f).y      
             z: ADD_INT     R1.z,  R2.w,  (0x00000004, 5.605193857e-45f).y      VEC_120 
             w: ADD_INT     R1.w,  R1.y,  (0x00000004, 5.605193857e-45f).y      
             t: LSHL        R1.x,  R1.z,  (0x00000002, 2.802596929e-45f).x      
        216  x: LSHL        R2.x,  R0.y,  (0x00000002, 2.802596929e-45f).x      
             y: ADD_INT     R0.y,  R0.z,  (0x00000004, 5.605193857e-45f).y      
             t: LSHL        R3.x,  R0.w,  (0x00000002, 2.802596929e-45f).x      
    44 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R14, ELEM_SIZE(3) 
    45 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R12, ELEM_SIZE(3) 
    46 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R13, ELEM_SIZE(3) 
    47 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R11, ELEM_SIZE(3) 
    48 ALU: ADDR(854) CNT(6) 
        217  x: LSHL        R3.x,  R2.z,  (0x00000002, 2.802596929e-45f).x      
             t: LSHL        R2.x,  R1.y,  (0x00000002, 2.802596929e-45f).x      
        218  x: LSHL        R1.x,  R2.w,  (0x00000002, 2.802596929e-45f).x      
             t: LSHL        R0.x,  R0.z,  (0x00000002, 2.802596929e-45f).x      
    49 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R10, ELEM_SIZE(3) 
    50 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R9, ELEM_SIZE(3) 
    51 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R8, ELEM_SIZE(3) 
    52 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R7, ELEM_SIZE(3) 
    53 ALU: ADDR(860) CNT(6) 
        219  x: LSHL        R0.x,  R2.y,  (0x00000002, 2.802596929e-45f).x      
             t: LSHL        R1.x,  R1.w,  (0x00000002, 2.802596929e-45f).x      
        220  x: LSHL        R2.x,  R1.z,  (0x00000002, 2.802596929e-45f).x      
             t: LSHL        R3.x,  R0.y,  (0x00000002, 2.802596929e-45f).x      
    54 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R6, ELEM_SIZE(3) 
    55 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R28, ELEM_SIZE(3) 
    56 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R5, ELEM_SIZE(3) 
    57 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R4, ELEM_SIZE(3) 
    END_OF_PROGRAM
    
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    :razz: OK, now you're just showing off. That's absurd! :shock:

    I count 31 registers, which means you should have 8 wavefronts per SIMD (8*31*64 = 15,872 registers, plus 4 clause temporaries * 64 strands * 2 wavefronts = 512, making 16,384 registers in total).

    I get an ALU:TEX ratio of 152:36, i.e. 4.22:1, and 87% ALU utilisation. The 20 MOVs (5 clocks) look like wastage, effectively making utilisation 84%.
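
    Spelling that arithmetic out, purely as a sanity check (a quick host-side C sketch; the 640 MADs per iteration - eight 8x10 outer products - is my own inference from the earlier posts, not something read out of the dump):

    Code:
    #include <stdio.h>

    int main(void)
    {
        /* GPR budget: 16384 float4 registers per SIMD, 64 strands per wavefront */
        int used = 8 * 31 * 64      /* 8 wavefronts x 31 GPRs x 64 strands         */
                 + 4 * 64 * 2;      /* 4 clause temporaries x 64 strands x 2 waves */
        printf("GPRs used: %d of 16384\n", used);

        /* Utilisation, assuming 640 MADs per loop iteration (eight 8x10 outer
           products) spread over 152 ALU cycles of 5 slots, 5 cycles being MOVs */
        printf("ALU:TEX = %.2f:1\n", 152.0 / 36.0);
        printf("utilisation: %.0f%% (%.0f%% counting the MOV cycles)\n",
               100.0 * 640.0 / (147 * 5), 100.0 * 640.0 / (152 * 5));
        return 0;
    }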

    Overall I guess HD3870 is roughly as fast as GTX285. Ouch.

    Jawed
     
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    No really, you are actually showing off!!! :grin:
    Maybe you should talk to AMD about licensing your code. You might end up having something tangible to show off then. :lol:

    GREAT job though.
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I assume your calculation of gigaflops here follows the 10^9 = 1 Giga rule. If so, how long is it going to be before you treat us to a 1 Tflop matrix multiplication running on a ~$200 GPU? :) :)
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    If he had an HD4890 at stock clocks it should do it without any trouble.
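
    A rough projection, assuming the 980 Gflop/s result scales linearly with core clock (both the ALUs and the L1 run at engine clock) and taking 750 MHz / 850 MHz as the stock HD4870 / HD4890 clocks:

    Code:
    #include <stdio.h>

    int main(void)
    {
        /* Scale the measured HD4870 figure by the ratio of stock core clocks */
        double hd4870_gflops = 980.0;
        double projected = hd4870_gflops * 850.0 / 750.0;   /* HD4890 / HD4870 */
        printf("projected HD4890 (stock): ~%.0f Gflop/s\n", projected);
        return 0;
    }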

    Jawed
     
  12. MicahVillmow

    Newcomer

    Joined:
    Aug 25, 2009
    Messages:
    2
    Likes Received:
    0
    prunedtree,
    Congratulations on improving on our algorithm for dense mat-mat mul. It is very impressive to see people take code we developed on older hardware and improve it past its original design. The original code was developed on the R600, hence the 8x4 design; it was later optimized for RV670, but the original design didn't change.
     
  13. riza.guntur

    Newcomer

    Joined:
    Aug 26, 2009
    Messages:
    8
    Likes Received:
    0
    Impressive.
    Congratulations, prunedtree.
    Somebody please try it on a 4890 at 1 GHz :grin:
     
  14. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Likes Received:
    0
    Actually, these MOVs are critical in order to save registers.
    If I understand you right, this kernel could achieve ~1008 Gflop/s if it were not bottlenecked by L1 bandwidth?

    The problem with the 8x10 layout is that we have to ensure we don't waste any fetches, as otherwise there would be no improvement. However, each outer product needs 10+8 scalars, which doesn't map onto float4 fetches well. So I spread the loads over (20+16) float4 fetches: 36*4 = 144 scalars = 8*(10+8). The 8x10 block takes 20 registers, and we need a register for texture indexing, which leaves 10 registers. It's thus impossible to hold 36 temporary float4, which is why the code is scheduled to start outer products as early as possible, and needs two float4 registers to juggle fetches around.

    The pattern looks like this:

    Code:
        10a | 6a 4b | 10b | 2b 8c | [8c] 2d | 10d | [4d] 6e | 10e
         8a   [8a]     8b    [8b]      8d      8d      8e      8e
    
    The five texture clauses are labeled a-e, and brackets show the registers that are kept through a fetch. There are eight 8x10 outer products in total.
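
    To make the counting concrete, a rough sketch of the budget (host-side C, just restating the figures above; the split into 20 accumulator registers plus one index register is taken from this post, nothing else is implied about the real kernel):

    Code:
    #include <stdio.h>

    int main(void)
    {
        int acc     = 8 * 10 / 4;      /* 8x10 block of C as float4 accumulators: 20 */
        int scalars = 8 * (10 + 8);    /* eight outer products, 10+8 scalars each    */
        int fetches = scalars / 4;     /* 144 scalars = 36 float4 fetches            */
        int spare   = 31 - acc - 1;    /* 31 GPRs minus accumulators and the index   */
        printf("%d accumulators, %d float4 fetches per iteration, %d GPRs spare\n",
               acc, fetches, spare);   /* far fewer than 36, hence the juggling      */
        return 0;
    }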

    Indeed. I thought for a moment that the ~450 GB/s limit on L1 bandwidth I'm measuring was due to a 2^30 = 1 Giga mistake, but that is not the case.
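
    (For reference, the reason a units mix-up was a plausible first suspect - pure arithmetic, nothing measured:)

    Code:
    #include <stdio.h>

    int main(void)
    {
        /* 480 GB/s quoted in powers of ten, accidentally divided by 2^30,
           would read as roughly the ~450 figure I keep measuring */
        printf("480e9 B/s / 2^30 = %.1f\n", 480e9 / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }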

    Well, it would be cute if that were possible (1000 Gflop/s is symbolic, after all), but given the 450 GB/s limit and the 80/9 bandwidth reduction that 8x10 blocks provide (i.e. 80/9 flops per fetched scalar), this kernel can achieve at most 1000 Gflop/s. As it's already using the whole register file, I'm afraid it's difficult to do considerably better with a similar approach.
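
    Spelling the bound out (back-of-the-envelope only, taking the measured ~450 GB/s and 4-byte scalars as given):

    Code:
    #include <stdio.h>

    int main(void)
    {
        /* One 8x10 outer product: 80 MADs = 160 flops for 10+8 fetched scalars */
        double flops_per_scalar = 160.0 / 18.0;     /* = 80/9                   */
        double scalars_per_sec  = 450e9 / 4.0;      /* ~450 GB/s, 4-byte floats */
        printf("upper bound: %.0f Gflop/s\n",
               flops_per_scalar * scalars_per_sec / 1e9);
        return 0;
    }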

    That's tempting... even a very mild overclock (3%) on HD4870 could do the trick...

    Code:
    fuu@hydra:~$ aticonfig --adapter=3 --odgc --odgt
    
    Adapter 3 - ATI Radeon HD 4870 X2
                                Core (MHz)    Memory (MHz)
               Current Clocks :    507           500
                 Current Peak :    770           900
      Configurable Peak Range : [507-778]     [500-980]
                     GPU load :    0%
    
    Adapter 3 - ATI Radeon HD 4870 X2
                Sensor 0: Temperature - 55.50 C
    
    And there it is: 1 Teraflop/s SGEMM ^^
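
    The margin is tiny: scaling the 980 Gflop/s stock result linearly with engine clock (a simplification, but both the ALUs and the L1 run at core clock) gives the clock needed:

    Code:
    #include <stdio.h>

    int main(void)
    {
        /* Core clock needed to stretch 980 Gflop/s at 750 MHz to 1000 Gflop/s */
        double needed = 750.0 * 1000.0 / 980.0;
        printf("~%.0f MHz (the configurable peak above tops out at 778 MHz)\n",
               needed);
        return 0;
    }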

    Thanks for the details. Can you tell us where this 450 GB/s L1 bandwidth limit (instead of the expected 480 GB/s) comes from? Is it a scheduler bottleneck?
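
    For context, the usual way to arrive at the 480 GB/s figure (assuming the commonly quoted RV770 layout: 10 SIMDs, 4 texture units each, one 16-byte fetch per unit per clock at 750 MHz):

    Code:
    #include <stdio.h>

    int main(void)
    {
        /* 10 SIMDs x 4 texture units x 16 bytes per fetch x 750 MHz */
        double l1_peak = 10.0 * 4.0 * 16.0 * 750e6;
        printf("expected L1 fetch bandwidth: %.0f GB/s\n", l1_peak / 1e9);
        return 0;
    }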
     
  15. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I wonder how you partitioned a 4096x4096 matrix into 8x10 blocks. 8x8 I can understand. Did you use a different size for this?
     
  16. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    Yes, you are pushing the limits of the scheduler and the hardware, excellent work. ;-)

    There may be issues hiding all of the latency, since you are dropping wavefront count as you increase register usage. This may be one cause of the L1 drop-off, or it could be that you aren't quite covering the L2 latency and stalls are happening there. Remember that you will have some cold misses into the cache, which will drop utilization, and it's possible there are some conflict misses in the chain as well. If you write a simple texture cache throughput benchmark, for example everyone fetching from texel 0, you will minimize accesses outside the L1 and should get very close to peak.
     
  17. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Likes Received:
    0
    You can always pad with zeros, using 4100x4096 matrices. With large matrices and small blocks the padding overhead is negligible, so you can use any block size. The difficult part here is to ensure we don't waste float4 fetches.

    I've never obtained more than 450 GB/s no matter how I try, even with highly synthetic tests all fetching from the same location (and plenty of variations to avoid potential bank conflicts I wouldn't know of...) or with the samples in the SDK, which is why I'm asking.
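
    A minimal sketch of that padding arithmetic (illustrative only; the real kernel addresses everything through float4 textures, and which dimension gets the 10-wide tiling is an assumption here):

    Code:
    #include <stdio.h>

    /* Round n up to the next multiple of b */
    static int round_up(int n, int b) { return (n + b - 1) / b * b; }

    int main(void)
    {
        int n = 4096;
        int dim10 = round_up(n, 10);   /* 4100: needs four lines of zero padding */
        int dim8  = round_up(n, 8);    /* 4096: already a multiple of 8          */
        printf("padded output: %d x %d -> %d x %d blocks of 10x8\n",
               dim10, dim8, dim10 / 10, dim8 / 8);
        return 0;
    }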
     
  18. riza.guntur

    Newcomer

    Joined:
    Aug 26, 2009
    Messages:
    8
    Likes Received:
    0
    I wonder how it compares to a Core i7 running ATLAS :)
     
  19. riza.guntur

    Newcomer

    Joined:
    Aug 26, 2009
    Messages:
    8
    Likes Received:
    0
    I wonder if you could help me port my Brook+ program to CAL.
    If you can use the global buffer I'm sure it could blast off to a thousand times faster :grin:
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    I had a rummage and it seems I had a working Brook+ 64-scalars-output MM a few months back (pure gather/scatter, but not in CS mode). It still seems to work, as it verifies OK. I've broken something, though: only the debug version compiles, so I can't produce a .exe (and I can only run on CPU anyway). I gave up because the assembly it outputs is fragmented due to a storm of IFs - though it looks like I could rearrange the code to get rid of a lot of them. Seems I got bored/cheesed-off or decided it was a blind alley and just abandoned it :razz:

    Jawed
     