Faster dense matrix-matrix products on ATi hardware

Discussion in 'GPGPU Technology & Programming' started by prunedtree, Aug 18, 2009.

  1. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Dense matrix-matrix multiplication, a basic building block of dense linear algebra, is rather straightforward in terms of multiply-add operations and very friendly to memory hierarchies, which makes it an interesting benchmark for throughput-oriented hardware: the performance of matrix-matrix multiply is often much more relevant than the sum of all ALU operations that can be completed in a second.

    Contrast for instance the peak performance claimed by both major IHVs:

    nVidia GT200 (GTX280) : 933 Gflop/s
    ATi RV770 (HD4870) : 1200 Gflop/s

    ...to the current fastest matrix-matrix product implementations:

    CUBLAS 2.0 on GTX280[1] achieves 375 Gflop/s
    and on HD4870[2] ATi reckons 540 Gflop/s

    However, there is a significant difference: Volkov's implementation achieves the peak multiply-add rate with one operand from shared memory, while ATi's implementation is limited by the speed of the texture units. Like many others, I thought that higher performance could be possible on ATi boards, using some mechanism to avoid the memory bottleneck. According to ATi, however, this is not possible[3], and they do not want to disclose details of the hardware.

    Thus, I experimented with all the ideas I could come up with. And I've hit these limitations ATi knew about, one after another: Shared memory (LDS in ATi parlance) is no faster than texture fetches that hit L1 (30 billion float4 per second, 480 GB/s, for both). Shared memory broadcasting requires impractical amounts of ALU work in order to put addresses into registers (sigh). Shared registers have half the peak bandwidth of local registers, giving us a limit of 480 Gflop/s...

    ATi's claim checks out: The limited features of their hardware do not really offer any help. But is their implementation really optimal ? The bandwidth intensity of a matrix-matrix product implementation is directly related to the size of the blocks in the destination matrix, and `simple_matmult' uses 8x4 blocks (this is also the maximum for their pixel shader approach). RV770's texture units can deliver 120 billion single precision values per second and we need two input values for each multiply-add operation. With 8x4 blocks, the bandwidth reduction is ~5, and thus we obtain a peak of 600 Gflop/s.
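
    The block-size argument can be checked with a few lines of Python. This is only a sketch: it assumes the reduction factor for an m x n block is 2mn/(m+n) (2mn values needed without reuse, m+n fetched with it), which reproduces the factor of 8 for 8x8 blocks. The function names are made up for illustration.

```python
# Sketch: texture-fetch-bound peak for an m x n output block.
# Assumption: the reduction factor is 2*m*n / (m + n); names are illustrative.

TEX_VALUES_PER_SEC = 120e9  # single-precision values per second, from the post


def bandwidth_reduction(m, n):
    """Operand reuse factor of one m x n outer-product step.

    Without blocking, m*n multiply-adds need 2*m*n input values;
    with blocking, only m + n values are fetched.
    """
    return 2.0 * m * n / (m + n)


def fetch_bound_gflops(m, n, values_per_sec=TEX_VALUES_PER_SEC):
    """Peak Gflop/s if texture fetches are the only bottleneck."""
    madds_per_sec = values_per_sec / 2.0 * bandwidth_reduction(m, n)
    return 2.0 * madds_per_sec / 1e9  # one multiply-add counts as 2 flops


for m, n in [(4, 4), (8, 4), (8, 8)]:
    print(f"{m}x{n}: reduction {bandwidth_reduction(m, n):.2f}, "
          f"peak {fetch_bound_gflops(m, n):.0f} Gflop/s")
```

    Under this formula the exact factor for 8x4 is 16/3, about 5.33 (640 Gflop/s); the ~5 and 600 Gflop/s above are the rounded-down version of the same estimate.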

    Using 8x8 blocks would bring the bandwidth reduction to 8, for a peak of 960 Gflop/s. However, the obvious limitation to larger block sizes is that we need enough space in the register file to store them. The size of the register file (1024 scalars) on RV770 may seem impressive, but with 180 cycles of latency for L1 hits, you need 8 threads (wavefronts in ATi parlance) to hide one texture clause behind 30 cycles of computation (128 multiply-adds). This gives us less than ~120 scalars in order to compute, for instance, two 8x8 outer products per loop: A single texture clause can load the four float8 inputs, and the two outer products amount to 128 multiply-add instructions.
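
    As a rough sanity check of the register budget, here is a small Python sketch. The register file size and thread count are the figures quoted above; the `overhead` of 8 scalars for loop index and texture addresses is a guess, and all names are illustrative.

```python
# Sketch: per-thread register budget versus what an 8x8 kernel needs.
# The overhead value is an assumption, not a figure from the post.

REGISTER_FILE_SCALARS = 1024  # register file size quoted in the post
THREADS_IN_FLIGHT = 8         # threads needed to hide L1 latency, per the post


def scalars_per_thread(total=REGISTER_FILE_SCALARS, threads=THREADS_IN_FLIGHT):
    """Scalars available to each thread when the file is split evenly."""
    return total // threads


def kernel_scalars(m, n, outer_products_per_loop, overhead=8):
    """Scalars needed: accumulators + fetched inputs + bookkeeping."""
    accumulators = m * n                         # one per output element
    fetched = (m + n) * outer_products_per_loop  # inputs of one texture clause
    return accumulators + fetched + overhead


budget = scalars_per_thread()
need = kernel_scalars(8, 8, outer_products_per_loop=2)
print(f"budget {budget} scalars per thread, 8x8 kernel needs about {need}")
```

    With these numbers the kernel needs 104 scalars against an even split of 128, so two 8x8 outer products per loop fit with a little room to spare.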

    How much register space would we need ? 64 scalars for the output block, the 32 values that are fetched by the texture units, and some registers for the loop index and texture addresses... a hundred scalars. This looks quite reasonable, so I implemented it. The major difficulty is to trick ATi's horrible compiler (which reflects the current quality of their `GPU computing' software stack well) into producing decent machine code. Here's what it looks like:

    Code:
    00 ALU: ADDR(32) CNT(71) 
          0  x: LSHR        T0.x,  R0.x,  (0x00000006, 8.407790786e-45f).x      
             y: MOV         R23.y,  0.0f      
             z: MOV         R23.z,  0.0f      
             w: AND_INT     T0.w,  R0.x,  (0x0000003F, 8.828180325e-44f).y      
             t: MOV         R23.x,  0.0f      
          1  x: MOV         R22.x,  0.0f      
             y: MOV         R22.y,  0.0f      
             z: MOV         R22.z,  0.0f      
             w: MOV         R23.w,  0.0f      
             t: MOV         R22.w,  0.0f      
          2  x: MOV         R21.x,  0.0f      
             y: MOV         R21.y,  0.0f      
             z: MOV         R21.z,  0.0f      
             w: MOV         R21.w,  0.0f      
             t: MOV         R20.x,  0.0f      
          3  x: MOV         R19.x,  0.0f      
             y: MOV         R20.y,  0.0f      
             z: MOV         R20.z,  0.0f      
             w: MOV         R20.w,  0.0f      
             t: MOV         R4.z,  (0xC0000000, -2.0f).x      
          4  x: MOV         R18.x,  0.0f      
             y: MOV         R19.y,  0.0f      
             z: MOV         R19.z,  0.0f      
             w: MOV         R19.w,  0.0f      
             t: MOV         R18.y,  0.0f      
          5  x: MOV         R17.x,  0.0f      
             y: MOV         R17.y,  0.0f      
             z: MOV         R18.z,  0.0f      
             w: MOV         R18.w,  0.0f      
             t: MOV         R17.z,  0.0f      
          6  x: MOV         R16.x,  0.0f      
             y: MOV         R16.y,  0.0f      
             z: MOV         R16.z,  0.0f      
             w: MOV         R17.w,  0.0f      
             t: MOV         R16.w,  0.0f      
          7  x: MOV         R15.x,  0.0f      
             y: MOV         R15.y,  0.0f      
             z: MOV         R15.z,  0.0f      
             w: MOV         R15.w,  0.0f      
             t: MOV         R13.x,  0.0f      
          8  x: MOV         R14.x,  0.0f      
             y: MOV         R13.y,  0.0f      
             z: MOV         R13.z,  0.0f      
             w: MOV         R13.w,  0.0f      
             t: MOV         R14.y,  0.0f      
          9  x: MOV         R12.x,  0.0f      
             y: MOV         R12.y,  0.0f      
             z: MOV         R14.z,  0.0f      
             w: MOV         R14.w,  0.0f      
             t: MOV         R12.z,  0.0f      
         10  x: MOV         R11.x,  0.0f      
             y: MOV         R11.y,  0.0f      
             z: MOV         R11.z,  0.0f      
             w: MOV         R12.w,  0.0f      
             t: MOV         R11.w,  0.0f      
         11  x: MOV         R9.x,  0.0f      
             y: MOV         R9.y,  0.0f      
             z: MOV         R9.z,  0.0f      
             w: MOV         R9.w,  0.0f      
             t: MOV         R10.x,  0.0f      
         12  x: MOV         R8.x,  0.0f      
             y: MOV         R10.y,  0.0f      
             z: MOV         R10.z,  0.0f      
             w: MOV         R10.w,  0.0f      
             t: MOV         R8.y,  0.0f      
         13  z: MOV         R8.z,  0.0f      
             w: MOV         R8.w,  0.0f      
             t: I_TO_F      R0.x,  T0.w      
         14  t: I_TO_F      R0.y,  T0.x      
    01 TEX: ADDR(288) CNT(1) 
         15  SAMPLE R5.xyz_, R0.xyxx, t4, s4  UNNORM(XYZW) 
    02 ALU: ADDR(103) CNT(2) 
         16  x: MOV         R4.x,  R5.x      
             y: MOV         R4.y,  R5.y      
    03 LOOP_DX10 i0 FAIL_JUMP_ADDR(11) 
        04 ALU_BREAK: ADDR(105) CNT(3) KCACHE0(CB0:0-15) 
             17  z: ADD         R4.z,  R4.z,  (0x40000000, 2.0f).x      
             18  x: PREDGT      ____,  KC0[0].x,  R4.z      UPDATE_EXEC_MASK UPDATE_PRED 
        05 ALU: ADDR(108) CNT(3) KCACHE0(CB0:0-15) 
             19  x: ADD         R4.x,  R4.x,  KC0[0].y      
                 y: ADD         R4.y,  R4.y,  KC0[0].y      
                 w: ADD         R4.w,  R4.z,  1.0f      
        06 TEX: ADDR(290) CNT(8) 
             20  SAMPLE R0, R4.xzxx, t0, s0  UNNORM(XYZW) 
             21  SAMPLE R2, R4.xzxx, t1, s1  UNNORM(XYZW) 
             22  SAMPLE R1, R4.yzyy, t2, s2  UNNORM(XYZW) 
             23  SAMPLE R3, R4.yzyy, t3, s3  UNNORM(XYZW) 
             24  SAMPLE R6, R4.xwxx, t0, s0  UNNORM(XYZW) 
             25  SAMPLE R7, R4.xwxx, t1, s1  UNNORM(XYZW) 
             26  SAMPLE R24, R4.ywyy, t2, s2  UNNORM(XYZW) 
             27  SAMPLE R25, R4.ywyy, t3, s3  UNNORM(XYZW) 
        07 ALU_PUSH_BEFORE: ADDR(111) CNT(65) KCACHE0(CB0:0-15) 
             28  x: MULADD      R23.x,  R0.x,  R1.x,  R23.x      
                 y: MULADD      R23.y,  R0.x,  R1.y,  R23.y      
                 z: MULADD      R23.z,  R0.x,  R1.z,  R23.z      
                 w: MULADD      R23.w,  R0.x,  R1.w,  R23.w      
             29  x: MULADD      R22.x,  R0.x,  R3.x,  R22.x      
                 y: MULADD      R22.y,  R0.x,  R3.y,  R22.y      
                 z: MULADD      R22.z,  R0.x,  R3.z,  R22.z      
                 w: MULADD      R22.w,  R0.x,  R3.w,  R22.w      
             30  x: MULADD      R21.x,  R0.y,  R1.x,  R21.x      VEC_210 
                 y: MULADD      R21.y,  R0.y,  R1.y,  R21.y      VEC_201 
                 z: MULADD      R21.z,  R0.y,  R1.z,  R21.z      VEC_201 
                 w: MULADD      R21.w,  R0.y,  R1.w,  R21.w      VEC_201 
                 t: MULADD      R19.x,  R0.z,  R1.x,  R19.x      VEC_120 
             31  x: MULADD      R20.x,  R0.y,  R3.x,  R20.x      VEC_210 
                 y: MULADD      R20.y,  R0.y,  R3.y,  R20.y      VEC_201 
                 z: MULADD      R20.z,  R0.y,  R3.z,  R20.z      VEC_201 
                 w: MULADD      R20.w,  R0.y,  R3.w,  R20.w      VEC_201 
                 t: MULADD      R18.x,  R0.z,  R3.x,  R18.x      VEC_120 
             32  x: MULADD      R17.x,  R0.w,  R1.x,  R17.x      VEC_201 
                 y: MULADD      R19.y,  R0.z,  R1.y,  R19.y      VEC_210 
                 z: MULADD      R19.z,  R0.z,  R1.z,  R19.z      VEC_201 
                 w: MULADD      R19.w,  R0.z,  R1.w,  R19.w      VEC_201 
                 t: MULADD      R17.y,  R0.w,  R1.y,  R17.y      VEC_120 
             33  x: MULADD      R16.x,  R0.w,  R3.x,  R16.x      VEC_201 
                 y: MULADD      R18.y,  R0.z,  R3.y,  R18.y      VEC_210 
                 z: MULADD      R18.z,  R0.z,  R3.z,  R18.z      VEC_201 
                 w: MULADD      R18.w,  R0.z,  R3.w,  R18.w      VEC_201 
                 t: MULADD      R16.y,  R0.w,  R3.y,  R16.y      VEC_120 
             34  x: MULADD      R15.x,  R2.x,  R1.x,  R15.x      VEC_201 
                 y: MULADD      R15.y,  R2.x,  R1.y,  R15.y      VEC_201 
                 z: MULADD      R17.z,  R0.w,  R1.z,  R17.z      
                 w: MULADD      R17.w,  R0.w,  R1.w,  R17.w      
                 t: MULADD      R15.z,  R2.x,  R1.z,  R15.z      
             35  x: MULADD      R13.x,  R2.x,  R3.x,  R13.x      VEC_201 
                 y: MULADD      R13.y,  R2.x,  R3.y,  R13.y      VEC_201 
                 z: MULADD      R16.z,  R0.w,  R3.z,  R16.z      
                 w: MULADD      R16.w,  R0.w,  R3.w,  R16.w      
                 t: MULADD      R13.z,  R2.x,  R3.z,  R13.z      
             36  x: MULADD      R14.x,  R2.y,  R1.x,  R14.x      VEC_201 
                 y: MULADD      R14.y,  R2.y,  R1.y,  R14.y      VEC_201 
                 z: MULADD      R14.z,  R2.y,  R1.z,  R14.z      VEC_201 
                 w: MULADD      R15.w,  R2.x,  R1.w,  R15.w      VEC_210 
                 t: MULADD      R14.w,  R2.y,  R1.w,  R14.w      VEC_120 
             37  x: MULADD      R12.x,  R2.y,  R3.x,  R12.x      VEC_201 
                 y: MULADD      R12.y,  R2.y,  R3.y,  R12.y      VEC_201 
                 z: MULADD      R12.z,  R2.y,  R3.z,  R12.z      VEC_201 
                 w: MULADD      R13.w,  R2.x,  R3.w,  R13.w      VEC_210 
                 t: MULADD      R12.w,  R2.y,  R3.w,  R12.w      VEC_120 
             38  x: MULADD      R11.x,  R2.z,  R1.x,  R11.x      VEC_210 
                 y: MULADD      R11.y,  R2.z,  R1.y,  R11.y      VEC_201 
                 z: MULADD      R11.z,  R2.z,  R1.z,  R11.z      VEC_201 
                 w: MULADD      R11.w,  R2.z,  R1.w,  R11.w      VEC_201 
                 t: MULADD      R10.x,  R2.w,  R1.x,  R10.x      VEC_120 
             39  x: MULADD      R9.x,  R2.z,  R3.x,  R9.x      VEC_210 
                 y: MULADD      R10.y,  R2.w,  R1.y,  R10.y      VEC_201 
                 z: MULADD      R10.z,  R2.w,  R1.z,  R10.z      VEC_201 
                 w: MULADD      R10.w,  R2.w,  R1.w,  R10.w      VEC_201 
                 t: MULADD      R8.x,  R2.w,  R3.x,  R8.x      VEC_120 
             40  y: MULADD      R9.y,  R2.z,  R3.y,  R9.y      VEC_210 
                 z: MULADD      R9.z,  R2.z,  R3.z,  R9.z      VEC_201 
                 w: MULADD      R9.w,  R2.z,  R3.w,  R9.w      VEC_201 
                 t: MULADD      R8.y,  R2.w,  R3.y,  R8.y      VEC_120 
             41  z: MULADD      R8.z,  R2.w,  R3.z,  R8.z      
                 w: MULADD      R8.w,  R2.w,  R3.w,  R8.w      
             42  x: PREDE_INT   ____,  KC0[0].y,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
        08 JUMP  POP_CNT(1) ADDR(10) 
        09 ALU_POP_AFTER: ADDR(176) CNT(64) 
             43  x: MULADD      R23.x,  R6.x,  R24.x,  R23.x      
                 y: MULADD      R23.y,  R6.x,  R24.y,  R23.y      
                 z: MULADD      R23.z,  R6.x,  R24.z,  R23.z      
                 w: MULADD      R23.w,  R6.x,  R24.w,  R23.w      
             44  x: MULADD      R22.x,  R6.x,  R25.x,  R22.x      
                 y: MULADD      R22.y,  R6.x,  R25.y,  R22.y      
                 z: MULADD      R22.z,  R6.x,  R25.z,  R22.z      
                 w: MULADD      R22.w,  R6.x,  R25.w,  R22.w      
             45  x: MULADD      R20.x,  R6.y,  R24.x,  R20.x      VEC_210 
                 y: MULADD      R20.y,  R6.y,  R24.y,  R20.y      VEC_201 
                 z: MULADD      R20.z,  R6.y,  R24.z,  R20.z      VEC_201 
                 w: MULADD      R20.w,  R6.y,  R24.w,  R20.w      VEC_201 
                 t: MULADD      R19.x,  R6.z,  R24.x,  R19.x      VEC_120 
             46  x: MULADD      R21.x,  R6.y,  R25.x,  R21.x      VEC_210 
                 y: MULADD      R21.y,  R6.y,  R25.y,  R21.y      VEC_201 
                 z: MULADD      R21.z,  R6.y,  R25.z,  R21.z      VEC_201 
                 w: MULADD      R21.w,  R6.y,  R25.w,  R21.w      VEC_201 
                 t: MULADD      R18.x,  R6.z,  R25.x,  R18.x      VEC_120 
             47  x: MULADD      R17.x,  R6.w,  R24.x,  R17.x      VEC_201 
                 y: MULADD      R19.y,  R6.z,  R24.y,  R19.y      VEC_210 
                 z: MULADD      R19.z,  R6.z,  R24.z,  R19.z      VEC_201 
                 w: MULADD      R19.w,  R6.z,  R24.w,  R19.w      VEC_201 
                 t: MULADD      R17.y,  R6.w,  R24.y,  R17.y      VEC_120 
             48  x: MULADD      R16.x,  R6.w,  R25.x,  R16.x      VEC_201 
                 y: MULADD      R18.y,  R6.z,  R25.y,  R18.y      VEC_210 
                 z: MULADD      R18.z,  R6.z,  R25.z,  R18.z      VEC_201 
                 w: MULADD      R18.w,  R6.z,  R25.w,  R18.w      VEC_201 
                 t: MULADD      R16.y,  R6.w,  R25.y,  R16.y      VEC_120 
             49  x: MULADD      R15.x,  R7.x,  R24.x,  R15.x      VEC_201 
                 y: MULADD      R15.y,  R7.x,  R24.y,  R15.y      VEC_201 
                 z: MULADD      R17.z,  R6.w,  R24.z,  R17.z      
                 w: MULADD      R17.w,  R6.w,  R24.w,  R17.w      
                 t: MULADD      R15.z,  R7.x,  R24.z,  R15.z      
             50  x: MULADD      R13.x,  R7.x,  R25.x,  R13.x      VEC_201 
                 y: MULADD      R13.y,  R7.x,  R25.y,  R13.y      VEC_201 
                 z: MULADD      R16.z,  R6.w,  R25.z,  R16.z      
                 w: MULADD      R16.w,  R6.w,  R25.w,  R16.w      
                 t: MULADD      R13.z,  R7.x,  R25.z,  R13.z      
             51  x: MULADD      R14.x,  R7.y,  R24.x,  R14.x      VEC_201 
                 y: MULADD      R14.y,  R7.y,  R24.y,  R14.y      VEC_201 
                 z: MULADD      R14.z,  R7.y,  R24.z,  R14.z      VEC_201 
                 w: MULADD      R15.w,  R7.x,  R24.w,  R15.w      VEC_210 
                 t: MULADD      R14.w,  R7.y,  R24.w,  R14.w      VEC_120 
             52  x: MULADD      R12.x,  R7.y,  R25.x,  R12.x      VEC_201 
                 y: MULADD      R12.y,  R7.y,  R25.y,  R12.y      VEC_201 
                 z: MULADD      R12.z,  R7.y,  R25.z,  R12.z      VEC_201 
                 w: MULADD      R13.w,  R7.x,  R25.w,  R13.w      VEC_210 
                 t: MULADD      R12.w,  R7.y,  R25.w,  R12.w      VEC_120 
             53  x: MULADD      R11.x,  R7.z,  R24.x,  R11.x      VEC_210 
                 y: MULADD      R11.y,  R7.z,  R24.y,  R11.y      VEC_201 
                 z: MULADD      R11.z,  R7.z,  R24.z,  R11.z      VEC_201 
                 w: MULADD      R11.w,  R7.z,  R24.w,  R11.w      VEC_201 
                 t: MULADD      R10.x,  R7.w,  R24.x,  R10.x      VEC_120 
             54  x: MULADD      R9.x,  R7.z,  R25.x,  R9.x      VEC_210 
                 y: MULADD      R10.y,  R7.w,  R24.y,  R10.y      VEC_201 
                 z: MULADD      R10.z,  R7.w,  R24.z,  R10.z      VEC_201 
                 w: MULADD      R10.w,  R7.w,  R24.w,  R10.w      VEC_201 
                 t: MULADD      R8.x,  R7.w,  R25.x,  R8.x      VEC_120 
             55  y: MULADD      R9.y,  R7.z,  R25.y,  R9.y      VEC_210 
                 z: MULADD      R9.z,  R7.z,  R25.z,  R9.z      VEC_201 
                 w: MULADD      R9.w,  R7.z,  R25.w,  R9.w      VEC_201 
                 t: MULADD      R8.y,  R7.w,  R25.y,  R8.y      VEC_120 
             56  z: MULADD      R8.z,  R7.w,  R25.z,  R8.z      
                 w: MULADD      R8.w,  R7.w,  R25.w,  R8.w      
    10 ENDLOOP i0 PASS_JUMP_ADDR(4) 
    11 ALU: ADDR(240) CNT(29) 
         57  x: ADD_INT     T0.x,  R5.z,  (0x00000003, 4.203895393e-45f).x      
             y: ADD_INT     ____,  R5.z,  0.0f      
             z: ADD_INT     T0.z,  R5.z,  (0x00000002, 2.802596929e-45f).y      
             w: ADD_INT     ____,  R5.z,  1      
         58  x: LSHL        R0.x,  PV57.y,  (0x00000002, 2.802596929e-45f).x      
             y: ADD_INT     T0.y,  PV57.z,  (0x00000004, 5.605193857e-45f).y      
             z: ADD_INT     T1.z,  PV57.w,  (0x00000004, 5.605193857e-45f).y      
             w: ADD_INT     T0.w,  PV57.y,  (0x00000004, 5.605193857e-45f).y      
             t: LSHL        R1.x,  PV57.w,  (0x00000002, 2.802596929e-45f).x      
         59  x: LSHL        R2.x,  T0.z,  (0x00000002, 2.802596929e-45f).x      
             y: ADD_INT     R0.y,  PV58.z,  (0x00000004, 5.605193857e-45f).y      
             z: ADD_INT     R0.z,  PV58.w,  (0x00000004, 5.605193857e-45f).y      
             w: ADD_INT     T1.w,  T0.x,  (0x00000004, 5.605193857e-45f).y      
             t: LSHL        R3.x,  T0.x,  (0x00000002, 2.802596929e-45f).x      
         60  x: LSHL        R4.x,  T0.w,  (0x00000002, 2.802596929e-45f).x      
             y: ADD_INT     R1.y,  PV59.z,  (0x00000004, 5.605193857e-45f).y      
             z: ADD_INT     R1.z,  PV59.w,  (0x00000004, 5.605193857e-45f).y      
             w: ADD_INT     R0.w,  T0.y,  (0x00000004, 5.605193857e-45f).y      
             t: LSHL        R5.x,  T1.z,  (0x00000002, 2.802596929e-45f).x      
         61  x: LSHL        R6.x,  T0.y,  (0x00000002, 2.802596929e-45f).x      
             y: ADD_INT     R2.y,  PV60.z,  (0x00000004, 5.605193857e-45f).y      
             z: ADD_INT     R2.z,  PV60.w,  (0x00000004, 5.605193857e-45f).y      
             w: ADD_INT     R1.w,  R0.y,  (0x00000004, 5.605193857e-45f).y      VEC_120 
             t: LSHL        R7.x,  T1.w,  (0x00000002, 2.802596929e-45f).x      
    12 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R23, ELEM_SIZE(3) 
    13 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R22, ELEM_SIZE(3) 
    14 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R21, ELEM_SIZE(3) 
    15 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R20, ELEM_SIZE(3) 
    16 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R19, ELEM_SIZE(3) 
    17 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R5.x], R18, ELEM_SIZE(3) 
    18 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R17, ELEM_SIZE(3) 
    19 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R7.x], R16, ELEM_SIZE(3) 
    20 ALU: ADDR(269) CNT(12) 
         62  x: LSHL        R7.x,  R0.z,  (0x00000002, 2.802596929e-45f).x      
             t: LSHL        R6.x,  R0.y,  (0x00000002, 2.802596929e-45f).x      
         63  x: LSHL        R5.x,  R0.w,  (0x00000002, 2.802596929e-45f).x      
             t: LSHL        R4.x,  R1.z,  (0x00000002, 2.802596929e-45f).x      
         64  x: LSHL        R3.x,  R1.y,  (0x00000002, 2.802596929e-45f).x      
             t: LSHL        R2.x,  R1.w,  (0x00000002, 2.802596929e-45f).x      
         65  x: LSHL        R1.x,  R2.z,  (0x00000002, 2.802596929e-45f).x      
             t: LSHL        R0.x,  R2.y,  (0x00000002, 2.802596929e-45f).x      
    21 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R7.x], R15, ELEM_SIZE(3) 
    22 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R13, ELEM_SIZE(3) 
    23 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R5.x], R14, ELEM_SIZE(3) 
    24 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R12, ELEM_SIZE(3) 
    25 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R11, ELEM_SIZE(3) 
    26 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R9, ELEM_SIZE(3) 
    27 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R10, ELEM_SIZE(3) 
    28 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R8, ELEM_SIZE(3) 
    END_OF_PROGRAM
    
    The result ? I measure 880 Gflop/s for 4096x4096 dense matrix-matrix products. That makes a pair of HD4870x2 boards faster than nine GTX280s ^^
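
    For readers who do not read R700 assembly, the loop structure of the kernel can be sketched in plain Python. This is not the author's code: it ignores the texture layout, the tiled addressing and everything GPU-specific, and only shows one 8x8 output block being accumulated with two outer products per loop iteration.

```python
# Sketch: C = A * B computed one 8x8 output block at a time, with two
# rank-1 (outer-product) updates per loop iteration.
# Plain Python lists; M and N must be multiples of the block size, K even.

BLOCK = 8


def matmul_blocked(a, b, block=BLOCK):
    m, k, n = len(a), len(b), len(b[0])
    assert m % block == 0 and n % block == 0 and k % 2 == 0
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            # 64 accumulators: on the GPU these stay in registers
            acc = [[0.0] * block for _ in range(block)]
            for p in range(0, k, 2):
                for q in (p, p + 1):  # two outer products per iteration
                    col = [a[i0 + i][q] for i in range(block)]  # 8 values of A
                    row = b[q][j0:j0 + block]                   # 8 values of B
                    for i in range(block):
                        for j in range(block):
                            acc[i][j] += col[i] * row[j]        # one multiply-add
            # write the finished block back to the result matrix
            for i in range(block):
                c[i0 + i][j0:j0 + block] = acc[i]
    return c
```

    Each iteration of the `p` loop reads 32 input values and performs 128 multiply-adds, which is the ratio the bandwidth estimate relies on.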

    EDIT: 1000 Gflop/s later in this thread

    References:
    [1] V. Volkov, J. W. Demmel: Benchmarking GPUs to tune dense linear algebra. Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008
    http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf

    [2] `What we see on our optimized MM kernel is ~540 gflops in IL.'
    Micah Villmow, AMD. Replying to vvolkov on the ATi Stream section of the AMD Developer Forums
    http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=105221

    [3] `The simple_matmult example that we have is pretty much optimal for our hardware'
    Micah Villmow, AMD. Replying to sgratton on the ATi Stream section of the AMD Developer Forums
    http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=102771
     
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    I have always believed that the mul on NV and the t unit on AMD's GPUs should be ignored when calculating the peak throughput. Volkov's paper does the same.

    Err..., you quote vendor's peak claims and calculate your fraction of flops from your own estimate of peak flops???:shock:

    GREAT JOB, nevertheless. :mrgreen::mrgreen::runaway:

    I assume that you are using a 4870. Right?
     
  3. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Yes, it's more realistic to expect to achieve such rates in ALU-bound situations.

    No, if you look at the code, you will see that the arithmetic peak is actually a bit over 960 Gflop/s (31 cycles for the loop if there's no overhead, around 4.13 multiply-accumulates per cycle on average, or 990 Gflop/s).
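
    The cycle-count arithmetic can be reproduced as follows. The sketch assumes 160 five-wide VLIW units (800 stream processors) at 750 MHz, which is not stated in the thread but is consistent with the 1200 Gflop/s vendor peak quoted in the first post; the names are illustrative.

```python
# Sketch: ALU-bound peak of the inner loop from its cycle count.
# Assumptions: 160 VLIW units at 750 MHz (5 * 2 * 160 * 0.75 = 1200 Gflop/s).

VLIW_UNITS = 160
CLOCK_GHZ = 0.75
MADDS_PER_LOOP = 128  # two 8x8 outer products per loop iteration


def alu_peak_gflops(cycles_per_loop, madds=MADDS_PER_LOOP):
    """Gflop/s if the loop runs back to back with no stalls."""
    madds_per_cycle = madds / cycles_per_loop
    return 2.0 * madds_per_cycle * VLIW_UNITS * CLOCK_GHZ


for cycles in (31, 32):
    print(f"{cycles} cycles: {MADDS_PER_LOOP / cycles:.2f} MAD/cycle, "
          f"{alu_peak_gflops(cycles):.0f} Gflop/s")
```

    A 32-cycle loop gives exactly 4 multiply-adds per cycle and 960 Gflop/s; counting 31 cycles gives roughly 4.13 per cycle and about 990 Gflop/s.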

    My peak estimate is from the L1 bandwidth, which I assume to be the bottleneck. 100% would be unlikely as there's no cache prefetching.

    Thanks. I'm using a pair of HD4870x2, but my numbers are for a single device (one RV770 and its gigabyte of GDDR5) which is equivalent to a HD4870 board (in theory PCIe is not a bottleneck for sustained SGEMM computation).

    Given that it's essentially lots of calls to the SGEMM kernel, it could be fun to try to achieve 3500 Gflop/s in single precision LU factorization (LINPACK benchmark) using Volkov's approach for multi-GPU computation.
     
  4. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Why would you ignore the t unit? Can you not see from the example given that it's being utilized in the majority of the slots? In fact, all 5 units are used in most of the shader.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
    Ooh, very impressive.

    Vasily Volkov and I discussed some of this stuff:

    http://forum.beyond3d.com/showthread.php?p=1290019#post1290019

    I bashed my head against this for a while, mostly non-LDS, but focussed too much on maintaining cache locality for maximum throughput. And got somewhat confused :???: Not actually having a GPU to test on also puts the dampener on things.

    I like the fact you're ignoring cache locality - that makes me chuckle.

    So I'm very impressed, 880Gflops is quite something. Did you write this in IL? Looking at the assembly it appears it's possible to write this in Brook+ (scatter stream not standard stream output). I'm a bit puzzled why the break from the loop is in the middle - need to think about that some more.

    Will you be writing a paper or presenting your results formally some time? You need a shiny graph of performance at various matrix sizes, 128x128, 256x256, etc. :razz:

    I do disagree on the 5th MAD in ATI. Your code is clearly doing 5 MADs per cycle most of the time!

    By the way the loop is 32 ALU cycles, 960GFLOPs peak.

    Jawed
     
  6. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    15,633
    Location:
    Winfield, IN USA
    Pfft, some newb posting up a troll thread. :roll:



































    ;) Hey Prune! :D
     
  7. Rys

    Rys Tiled
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    3,893
    Location:
    Beyond3D HQ
    Think it was Factor + bindings for Stream, if I remember rightly. Fancy sharing the app code, prunedtree? :grin:
     
  8. prunedtree

    Newcomer

    Joined:
    Aug 8, 2009
    Messages:
    27
    Cache locality is quite important: I get only 720 Gflop/s with 1024x1024 matrices (and performance crumbles over that size) with a naive scanline ordering, and 840 Gflop/s with tiling. The texture fetch at the start of the shader loads precomputed tiled addresses.
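
    To illustrate why the visiting order matters, here is a hedged Python sketch that builds a tiled order for the output blocks and counts how many distinct input strips a group of consecutive threads touches. The tile size of 8x8 blocks and the group of 64 threads are assumptions for illustration; the actual tiling used for the numbers above is not given.

```python
# Sketch: scanline versus tiled visiting order of the output blocks.
# Tile shape and group size are made-up parameters, not the author's values.


def scanline_order(blocks_x, blocks_y):
    """Row-major order: (x, y) block coordinates, one full row at a time."""
    return [(x, y) for y in range(blocks_y) for x in range(blocks_x)]


def tiled_order(blocks_x, blocks_y, tile=8):
    """Visit the blocks tile by tile, row-major inside each tile."""
    order = []
    for ty in range(0, blocks_y, tile):
        for tx in range(0, blocks_x, tile):
            for y in range(ty, min(ty + tile, blocks_y)):
                for x in range(tx, min(tx + tile, blocks_x)):
                    order.append((x, y))
    return order


def strips_touched(order, group):
    """Distinct block rows plus block columns read by the first `group` threads."""
    head = order[:group]
    return len({y for _, y in head}) + len({x for x, _ in head})


scan = scanline_order(64, 64)
tiled = tiled_order(64, 64, tile=8)
print("scanline:", strips_touched(scan, 64), "tiled:", strips_touched(tiled, 64))
```

    In this toy model 64 consecutive threads read 65 distinct strips of the inputs in scanline order but only 16 when tiled, so far more of their fetches can be served from cache. A table like `tiled` is the sort of thing that could be precomputed on the host and fetched once per thread.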

    Given how difficult it is to produce decent code with this framework, I don't think you could go far with Brook+. The weird things you might notice in the code are just junk to coerce the compiler into sanity.

    Well, I wouldn't mind helping to write something similar to Volkov's paper, but I do not have much use for dense linear algebra myself. My work involves mostly boring memory-bound kernels, and this was an opportunity to have some fun.

    Yes, I am using bindings for ATi CAL in Factor, but it's tied to some proprietary code and fairly incomplete for now. However, I do plan to release it at some point. The original post contains the high-level method anyway. This ended up being the simplest of all the approaches I tried (sigh)

    By the way, it seems I can't extract more than 444 GB/s out of the texture units (even in synthetic tests with extreme locality) so that gives an upper bound of 888 Gflop/s for this kernel.
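
    A quick way to see how close the measured figure is to each bound, as a sketch only: it assumes 4-byte floats and the reuse factor of 8 for 8x8 blocks, and the names are illustrative.

```python
# Sketch: measured throughput as a fraction of several upper bounds.
# Assumptions: 4-byte single-precision values, reuse factor 8 (8x8 blocks).

MEASURED_GFLOPS = 880.0  # figure from the first post


def fetch_bound_gflops(bandwidth_gb_s, reduction=8.0, bytes_per_value=4):
    """Fetch-limited Gflop/s for a given L1 bandwidth in GB/s.

    values/s, divided by 2 inputs per multiply-add, times the reuse factor,
    times 2 flops per multiply-add: the two factors of 2 cancel.
    """
    return bandwidth_gb_s / bytes_per_value * reduction


bounds = {
    "measured L1 bandwidth, 444 GB/s": fetch_bound_gflops(444),
    "nominal L1 bandwidth, 480 GB/s": fetch_bound_gflops(480),
    "vendor ALU peak": 1200.0,
}
for name, bound in bounds.items():
    print(f"{name}: {bound:.0f} Gflop/s, "
          f"measured is {MEASURED_GFLOPS / bound:.1%} of it")
```

    Against the 888 Gflop/s bound the kernel sits at about 99%, against 960 Gflop/s at about 92%, and against the 1200 Gflop/s vendor figure at about 73%.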
     
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Of course the t unit is being used here, and is one of the reasons why this code is able to reach such high speeds. And yet, the t unit and the mul are accidents. You would almost never be able to exploit them. This code is perhaps an exception.

    However, I find that calculating the fraction of the peak from other bottlenecks (assumed or otherwise), is not the best way. Calculating the peak from actual max throughput is better as it can expose other deficiencies/bottlenecks.
     
  10. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    The t unit is readily used in many shaders, I don't know where you are coming from.
    Different shaders have different performance profiles. If the shader is ALU limited, then likely it will be making good use of the t unit.
     
  11. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    You mean it's used as a fma unit in many shaders, and not as an sfu?

    Real-world shaders, even when heavily optimized by the compiler, typically average 3.5-4 ALU slots per instruction.
     
  12. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Yes, it can be used as both. See the example posted in this very thread.
    What compiler and what shaders?
     
  13. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    I know it can be used as both, I am wondering if you are saying that t unit is used as a fma unit in shaders?

    It is, 90% of the time, not.

    ATI jit compiler. Bioshock has 3.5, iirc.
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
    Why wouldn't it?

    Jawed
     
  15. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Oh dear, I should have made it clear upfront. I mean it is not used explicitly and implicitly, there is a good chance that it will be used only as an sfu. This gemm is an exception. In fact, this is the first exception I have seen in this regard.
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
    I just loaded up GSA, which happens to have the NVidia Horizon Based AO shader loaded as the last thing I looked at, and there's a whole pile of Ts doing MADs, MULs, ADDs and CNDEs as well as various transcendentals.

    Jawed
     
  17. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Can you calculate the avg slot occupancy in that shader?
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
    OK, so what you're saying is cache locality is important in allowing the 8 instructions in the TEX clause to run at full speed (or close). 4:1 ALU:TEX, in this case, doesn't provide the leeway to enable "sloppy" access patterns.

    I guess the scanline access pattern ends up with L2 filled with data it junks, which increases the number of fetches into L2 to fulfil the 8 TEX instructions.

    Guess I'll leave fiddling with it until an idle moment. I can't test performance anyway, but I want to think about your tiling and striding.

    Onto the double-precision version!

    Hmm, perhaps full cache bandwidth only comes with the same values being fetched multiple times.

    Jawed
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
    Code:
     
    x   67
    y   59
    z   56
    w   61
    t   51
    total ALU instructions 101
    utilisation = 58%
    
    Jawed
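
    The utilisation figure follows directly from the slot counts. A small sketch, assuming that "total ALU instructions" means five-slot instruction groups; the names are illustrative.

```python
# Sketch: VLIW slot utilisation from per-slot instruction counts.
# Assumption: each ALU instruction group offers 5 slots (x, y, z, w, t).

slot_counts = {"x": 67, "y": 59, "z": 56, "w": 61, "t": 51}
instruction_groups = 101


def utilisation(counts, groups, slots_per_group=5):
    """Fraction of all available slots that hold an operation."""
    return sum(counts.values()) / (groups * slots_per_group)


def average_slots(counts, groups):
    """Average number of filled slots per instruction group."""
    return sum(counts.values()) / groups


print(f"utilisation {utilisation(slot_counts, instruction_groups):.0%}, "
      f"{average_slots(slot_counts, instruction_groups):.2f} slots per group")
```

    That is 294 filled slots out of 505, or about 2.9 slots per instruction group, with the t slot filled in roughly half of the groups.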
     
  20. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    It really isn't that uncommon to see the t unit used all over. Here's an example I am looking at currently:
    Code:
    ; --------  Disassembly --------------------
    00 TEX: ADDR(656) CNT(1) VALID_PIX 
          0  SAMPLE R6, R1.xyxx, t3, s3
    01 ALU_PUSH_BEFORE: ADDR(64) CNT(101) 
          1  x: MULADD      T1.x,  C34.x,  R6.x, -1.0f      
             y: MULADD      T0.y,  C34.x,  R6.y, -1.0f      
             z: MULADD      T0.z,  C34.x,  R6.z, -1.0f      
             w: MULADD      T1.w,  C34.x,  R6.w, -1.0f      
             t: MULADD      T2.y,  C34.x,  R6.y, -1.0f      VEC_021 
          2  x: MUL         ____,  PV1.z,  PV1.z      
             y: MUL         ____,  R5.z,  R5.z      
             z: MULADD      T1.z,  C34.x,  R6.z, -1.0f      
             w: MUL         T0.w,  R2.z,  R2.z      VEC_201 
             t: ADD         R11.z, -C23.x,  C24.x      
          3  x: DOT4        ____,  T1.x,  T1.x      
             y: DOT4        ____,  T0.y,  T0.y      
             z: DOT4        ____,  PV2.x,  1.0f      
             w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
             t: MULADD      T0.x,  R5.y,  R5.y,  PV2.y      
          4  x: DOT4        ____,  T1.w,  T1.w      
             y: DOT4        ____,  T2.y,  T2.y      
             z: DOT4        ____,  T1.z,  T1.z      
             w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
             t: RSQ_sat     T1.y,  PV3.x      
          5  x: MUL         T0.x,  T1.x,  PS4      
             y: MULADD      ____,  R5.x,  R5.x,  T0.x      VEC_102 
             z: MULADD      ____,  R2.y,  R2.y,  T0.w      
             w: MUL         T0.w,  T0.z,  PS4      
             t: RSQ_sat     T2.x,  PV4.x      
          6  x: MUL         T1.x,  T1.z,  PS5      
             y: MULADD      ____,  R2.x,  R2.x,  PV5.z      
             z: MUL         ____,  T1.w,  PS5      
             w: MUL         T2.w,  T0.y,  T1.y      
             t: RSQ_sat     ____,  PV5.y      
          7  x: CNDGE       T0.x, -C22.x,  T0.x,  PV6.z      
             y: MUL         T0.y,  R5.z,  PS6      
             z: MUL         T1.z,  R5.x,  PS6      
             w: MUL         T1.w,  R5.y,  PS6      
             t: RSQ_sat     T3.x,  PV6.y      
          8  x: DOT4        ____,  R4.x,  R4.x      
             y: DOT4        T1.y,  R4.y,  R4.y      
             z: DOT4        ____,  R4.z,  R4.z      
             w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
             t: CNDGE       T0.w, -C22.x,  T0.w,  T1.x      VEC_021 
          9  x: MUL         T0.x,  T0.x,  T1.z      
             y: MUL         T0.y,  T0.x,  T0.y      
             z: MUL         ____,  T0.x,  T1.w      
             w: MUL         ____,  T2.y,  T2.x      VEC_021 
             t: MUL         ____,  R2.y,  T3.x      
         10  x: CNDGE       T2.x, -C22.x,  T2.w,  PV9.w      
             y: MUL         T1.y,  R2.z,  T3.x      
             z: MUL         ____,  R2.x,  T3.x      
             w: MULADD      T2.w,  PS9,  T0.w,  PV9.z      VEC_021 
             t: RSQ_sat     T3.x,  T1.y      
         11  x: DOT4        ____,  R3.x,  R3.x      VEC_120 
             y: DOT4        ____,  R3.y,  R3.y      
             z: DOT4        ____,  R3.z,  R3.z      
             w: DOT4        R11.w,  (0x80000000, 0.0f).x,  0.0f      
             t: MULADD      T0.x,  PV10.z,  T0.w,  T0.x      
         12  x: MUL         ____,  R4.y,  T3.x      
             y: MUL         ____,  R4.z,  T3.x      
             z: MUL         ____,  R4.x,  T3.x      
             w: MULADD      ____,  T1.y,  T0.w,  T0.y      VEC_102 
             t: RSQ_e       T1.z,  |PV11.x|      
         13  x: MULADD      T2.x,  T2.x,  PV12.z,  T0.x      
             y: MULADD      T0.y,  T2.x,  PV12.x,  T2.w      
             z: MULADD      T0.z,  T2.x,  PV12.y,  PV12.w      
             w: MULADD      T2.w,  R3.x,  PS12, -C29.x      VEC_120 
             t: MULADD      T2.y,  R3.y,  PS12, -C29.y      
         14  x: MUL         ____,  PV13.z,  PV13.z      
             z: MULADD      T1.z,  R3.z,  T1.z, -C29.z      
             t: RCP_e       ____,  T1.z      
         15  x: DOT4        T0.x,  T2.x,  T2.x      
             y: DOT4        ____,  T0.y,  T0.y      
             z: DOT4        ____,  PV14.x,  1.0f      
             w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
             t: ADD         ____,  PS14, -C28.x      
         16  x: DOT4        ____,  T2.w,  T2.w      
             y: DOT4        T1.y,  T2.y,  T2.y      
             z: DOT4        ____,  T1.z,  T1.z      
             w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
             t: MUL         R1.w,  PS15,  C28.y      CLAMP 
         17  x: ADD         ____,  PS16, -1.0f      
             t: RSQ_sat     ____,  T0.x      
         18  x: MUL         R12.x,  T2.x,  PS17      
             y: MUL         R11.y,  T0.y,  PS17      
             z: MUL         R12.z,  T0.z,  PS17      
             w: CNDGE       R2.w,  PV17.x,  0.0f,  1.0f      
             t: RSQ_sat     ____,  T1.y      
         19  x: MUL         ____,  T2.w,  PS18      
             y: MUL         ____,  T2.y,  PS18      
             z: MUL         ____,  T1.z,  PS18      
         20  x: DOT4        ____,  R12.x,  PV19.x      
             y: DOT4        ____,  R11.y,  PV19.y      
             z: DOT4        ____,  R12.z,  PV19.z      
             w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         21  w: MAX         R12.w,  PV20.x,  C38.z      
         22  x: PREDNE      ____,  R2.w, -R2.w      UPDATE_EXEC_MASK UPDATE_PRED 
    02 JUMP  POP_CNT(1) ADDR(36) VALID_PIX 
    03 ALU: ADDR(165) CNT(121) 
         23  x: ADD         R5.x, -R3.x,  C21.x      
             y: ADD         R3.y, -R3.y,  C21.y      
             z: ADD         R1.z, -R3.z,  C21.z      
             w: MOV         R6.w,  1.0f      
             t: MOV         R2.z,  C35.z      
         24  x: DOT4        ____,  PV23.x,  C18.x      
             y: DOT4        ____,  PV23.y,  C18.y      
             z: DOT4        T0.z,  PV23.z,  C18.z      
             w: DOT4        ____,  PV23.w,  C18.w      
             t: MOV         R3.x,  0.0f      
         25  x: MUL         R2.x,  C32.z, -1.0f      
             y: MUL         R2.y,  C32.w, -1.0f      
             z: ADD         ____,  PV24.x,  R2.z      
             w: ADD         ____,  PV24.x, -C32.y      
             t: ADD         T0.w,  PV24.x, -C32.z      
         26  x: ADD         ____,  T0.z, -C32.w      
             y: ADD         ____,  T0.z,  PV25.y      
             z: CNDGE       T1.z,  PV25.z,  0.0f,  1.0f      
             w: ADD         ____,  T0.z,  PV25.x      
             t: CNDGE       T0.y,  PV25.w,  1.0f,  0.0f      
         27  x: CNDGE       ____,  PV26.w,  0.0f,  1.0f      
             y: CNDGE       ____,  T0.w,  1.0f,  0.0f      
             z: CNDGE       ____,  PV26.y,  0.0f,  1.0f      
             w: CNDGE       ____,  PV26.x,  1.0f,  0.0f      
             t: ADD         ____,  T0.z,  C36.x      
         28  x: MUL         ____,  PV27.x,  T0.y      
             y: MUL         ____,  PV27.z,  PV27.y      
             z: MUL         ____,  T1.z,  PV27.w      
             t: MUL         R6.x,  PS27,  C36.y      CLAMP 
         29  x: DOT4        R8.x,  PV28.x,  1.0f      
             y: DOT4        ____,  PV28.y,  C33.y      
             z: DOT4        ____,  PV28.z,  C33.z      
             w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
             t: ADD         T0.y,  PS28, -1.0f      
         30  x: ADD         ____,  PV29.x, -1.0f      
             y: ADD         R2.y,  PV29.x, -1.0f      
             z: ADD         ____,  PV29.x, -C33.w      
             w: ADD         ____,  PV29.x, -C33.y      
             t: ADD         R2.z,  PV29.x, -C33.y      
         31  x: CNDGE       R4.x,  PV30.w,  PV30.w, -PS30      
             y: CNDGE       R5.y,  PV30.x,  PV30.x, -PV30.y      
             z: CNDGE       T3.z,  PV30.z,  PV30.z, -R8.x      
             w: ADD         R2.w,  R8.x, -C33.z      
             t: ADD         ____,  R8.x, -C33.z      
         32  x: CNDGE       ____, -PV31.z,  C3.z,  0.0f      
             y: CNDGE       ____, -PV31.z,  C3.y,  0.0f      
             z: CNDGE       ____, -PV31.z,  C3.x,  0.0f      
             w: CNDGE       ____, -PV31.z,  C3.w,  0.0f      
             t: CNDGE       R5.w,  PS31,  PS31, -PV31.w      
         33  x: CNDGE       T0.x, -R5.y,  C7.z,  PV32.x      
             y: CNDGE       T1.y, -R5.y,  C7.y,  PV32.y      
             z: CNDGE       T1.z, -R5.y,  C7.x,  PV32.z      
             w: CNDGE       T0.w, -R5.y,  C7.w,  PV32.w      
             t: MUL         R2.z,  R8.x,  (0x3E800000, 0.25f).x      
         34  x: CNDGE       T1.x, -T3.z,  C0.z,  0.0f      
             y: CNDGE       T0.y, -T3.z,  C0.y,  0.0f      
             z: CNDGE       T0.z, -T3.z,  C0.x,  0.0f      
             w: CNDGE       T1.w, -T3.z,  C0.w,  0.0f      
             t: CNDGE       R3.z,  T0.y,  0.0f,  1.0f      
         35  x: CNDGE       T2.x, -T3.z,  C1.x,  0.0f      
             y: CNDGE       T2.y, -T3.z,  C1.w,  0.0f      
             z: CNDGE       T2.z, -T3.z,  C1.y,  0.0f      
             w: CNDGE       T2.w, -T3.z,  C1.z,  0.0f      
         36  x: CNDGE       T0.x, -R4.x,  C11.z,  T0.x      
             y: CNDGE       T1.y, -R4.x,  C11.y,  T1.y      
             z: CNDGE       T1.z, -R4.x,  C11.x,  T1.z      
             w: CNDGE       T0.w, -R4.x,  C11.w,  T0.w      
         37  x: CNDGE       T1.x, -R5.y,  C4.z,  T1.x      
             y: CNDGE       T0.y, -R5.y,  C4.y,  T0.y      
             z: CNDGE       T0.z, -R5.y,  C4.x,  T0.z      
             w: CNDGE       T1.w, -R5.y,  C4.w,  T1.w      
         38  x: CNDGE       T2.x, -R5.y,  C5.x,  T2.x      
             y: CNDGE       T2.y, -R5.y,  C5.w,  T2.y      
             z: CNDGE       T2.z, -R5.y,  C5.y,  T2.z      
             w: CNDGE       T2.w, -R5.y,  C5.z,  T2.w      
         39  x: CNDGE       T0.x, -R5.w,  C15.x,  T1.z      
             y: CNDGE       T1.y, -R5.w,  C15.y,  T1.y      
             z: CNDGE       T1.z, -R5.w,  C15.z,  T0.x      
             w: CNDGE       ____, -R5.w,  C15.w,  T0.w      
         40  x: CNDGE       T1.x, -R4.x,  C8.z,  T1.x      
             y: CNDGE       T0.y, -R4.x,  C8.y,  T0.y      
             z: CNDGE       T0.z, -R4.x,  C8.x,  T0.z      
             w: CNDGE       T1.w, -R4.x,  C8.w,  T1.w      VEC_021 
             t: MUL         ____,  R6.w,  PV39.w      
         41  x: CNDGE       T2.x, -R4.x,  C9.x,  T2.x      
             y: CNDGE       T2.y, -R4.x,  C9.w,  T2.y      
             z: CNDGE       T1.z, -R4.x,  C9.y,  T2.z      VEC_120 
             w: CNDGE       T2.w, -R4.x,  C9.z,  T2.w      
             t: MULADD      ____,  R1.z,  T1.z,  PS40      
         42  x: DOT4        ____,  R5.x,  T0.x      
             y: DOT4        ____,  R3.y,  T1.y      
             z: DOT4        ____,  PS41,  1.0f      
             w: DOT4        ____,  0.0f,  0.0f      
             t: CNDGE       T0.x, -R5.w,  C12.x,  T0.z      VEC_021 
         43  x: CNDGE       T1.x, -R5.w,  C13.x,  T2.x      
             y: CNDGE       T0.y, -R5.w,  C12.y,  T0.y      
             z: CNDGE       T0.z, -R5.w,  C12.z,  T1.x      VEC_021 
             w: CNDGE       ____, -R5.w,  C12.w,  T1.w      
             t: RCP_e       R13.w,  PV42.x      
         44  x: CNDGE       R2.x, -T3.z,  C2.x,  0.0f      
             y: CNDGE       T2.y, -R5.w,  C13.y,  T1.z      
             z: CNDGE       T1.z, -R5.w,  C13.z,  T2.w      VEC_021 
             w: CNDGE       T2.w, -R5.w,  C13.w,  T2.y      
             t: MUL         ____,  R6.w,  PV43.w      
         45  x: DOT4        ____,  R5.x,  T0.x      
             y: DOT4        R2.y,  R3.y,  T0.y      
             z: DOT4        ____,  R1.z,  T0.z      
             w: DOT4        ____,  PS44,  1.0f      
             t: CNDGE       R4.z, -T3.z,  C2.y,  0.0f      
         46  x: DOT4        ____,  R5.x,  T1.x      
             y: DOT4        ____,  R3.y,  T2.y      
             z: DOT4        ____,  R1.z,  T1.z      
             w: DOT4        ____,  R6.w,  T2.w      
             t: MUL         R9.y,  R13.w,  PV45.x      
         47  x: MOV         R7.x,  PS46      
             y: CNDGE       R4.y, -T3.z,  C2.z,  0.0f      
             z: MUL         R5.z,  R13.w,  PV46.x      
             w: ADD         R8.w,  PS46,  C31.z      
             t: CNDGE       R6.y, -T3.z,  C2.w,  0.0f      
    04 ALU: ADDR(286) CNT(11) 
         48  x: MULADD      R9.x,  R2.y,  R13.w,  C31.z      
             y: CNDGE       ____, -R5.y,  C6.x,  R2.x      VEC_120 
             z: CNDGE       ____, -R5.y,  C6.y,  R4.z      VEC_120 
             w: MULADD      R9.w,  R5.z,  (0x3E800000, 0.25f).x,  R2.z      VEC_102 
             t: CNDGE       R2.w, -R5.y,  C6.z,  R4.y      VEC_021 
         49  x: CNDGE       R2.x, -R5.y,  C6.w,  R6.y      
             y: ADD         R7.y,  PV48.w,  C31.w      
             z: CNDGE       R2.z, -R4.x,  C10.x,  PV48.y      
             w: CNDGE       R4.w, -R4.x,  C10.y,  PV48.z      
             t: ADD         R8.y,  PV48.w,  C31.w      
    05 TEX: ADDR(658) CNT(6) VALID_PIX 
         50  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
         51  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
         52  SAMPLE_G R4.__x_, R7.xyxx, t0, s0
         53  SAMPLE_G R8.___x, R8.wyww, t0, s0
         54  SAMPLE_G R8._x__, R9.xwxx, t0, s0
         55  SAMPLE_G R7.x___, R9.ywyy, t0, s0
    06 ALU_PUSH_BEFORE: ADDR(297) CNT(29) 
         56  x: CNDGE       T1.x, -R5.w,  C14.x,  R2.z      
             y: CNDGE       ____, -R4.x,  C10.w,  R2.x      
             z: CNDGE       T3.z, -R5.w,  C14.y,  R4.w      
             w: CNDGE       ____, -R4.x,  C10.z,  R2.w      VEC_021 
         57  x: MUL         ____,  R9.y,  C31.x      
             y: MUL         T2.y,  R9.w,  C31.y      
             z: CNDGE       ____, -R5.w,  C14.z,  PV56.w      VEC_120 
             w: CNDGE       ____, -R5.w,  C14.w,  PV56.y      VEC_120 
         58  x: DOT4        ____,  R5.x,  T1.x      
             y: DOT4        ____,  R3.y,  T3.z      
             z: DOT4        R1.z,  R1.z,  PV57.z      
             w: DOT4        ____,  R6.w,  PV57.w      
             t: FRACT       T0.y,  PV57.x      
         59  x: MULADD      ____,  PV58.x,  R13.w, -R7.x      
             y: MULADD      ____,  PV58.x,  R13.w, -R8.w      
             z: MULADD      ____,  PV58.x,  R13.w, -R4.z      
             w: MULADD      ____,  PV58.x,  R13.w, -R8.y      VEC_201 
             t: FRACT       T2.w,  T2.y      
         60  x: CNDGE       T1.x,  PV59.x,  0.0f,  1.0f      
             y: CNDGE       ____,  PV59.y,  0.0f,  1.0f      
             z: CNDGE       T3.z,  PV59.z,  0.0f,  1.0f      
             w: CNDGE       ____,  PV59.w,  0.0f,  1.0f      
         61  y: ADD         ____, -PV60.z,  PV60.y      
             z: ADD         ____, -PV60.x,  PV60.w      
         62  x: MULADD      T1.x,  PV61.z,  T0.y,  T1.x      
             w: MULADD      ____,  PV61.y,  T0.y,  T3.z      
         63  x: ADD         ____, -PV62.x,  PV62.w      
         64  z: MULADD      R8.z,  PV63.x,  T2.w,  T1.x      
         65  x: PREDNE      ____,  R3.z, -R3.z      UPDATE_EXEC_MASK UPDATE_PRED 
    07 JUMP  POP_CNT(1) ADDR(35) VALID_PIX 
    08 ALU: ADDR(326) CNT(20) 
         66  x: MULADD      T0.x, -R6.x,  C31.z,  C31.z      
             y: MULADD      T0.y, -R6.x,  C31.w,  C31.w      
             z: ADD         ____,  R8.x, -1.0f      VEC_120 
             w: ADD         ____,  R8.x,  0.0f      VEC_120 
         67  y: MOV         ____, -|PV66.w|      
             z: MOV         T0.z, -|PV66.z|      
         68  x: CNDGE       ____,  PV67.y,  C19.x,  0.0f      
             w: CNDGE       ____,  PV67.y,  C19.y,  0.0f      
         69  x: CNDGE       ____,  T0.z,  C20.y,  PV68.w      
             y: CNDGE       ____,  T0.z,  C20.x,  PV68.x      
         70  y: MUL         R3.y,  PV69.y,  T0.x      
             z: MUL         R3.z,  PV69.x,  T0.y      
         71  y: MULADD      R4.y,  PV70.y, -C33.z,  R9.y      
             z: MULADD      R4.z,  PV70.z, -C33.z,  R9.w      
             t: MULADD      R6.y,  PV70.y,  C36.z,  R9.y      VEC_021 
         72  x: ADD         R2.x,  PV71.y,  C31.z      
             y: ADD         R2.y,  PV71.z,  C31.w      
             z: MUL         R5.z,  PV71.y,  C31.x      
             w: ADD         R4.w,  PV71.z,  C31.w      
             t: ADD         R4.x,  PV71.y,  C31.z      
    09 TEX: ADDR(670) CNT(6) VALID_PIX 
         73  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
         74  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
         75  SAMPLE_G R2.___x, R2.xyxx, t0, s0
         76  SAMPLE_G R2.x___, R4.yzyy, t0, s0
         77  SAMPLE_G R2._x__, R4.xzxx, t0, s0
         78  SAMPLE_G R2.__x_, R4.ywyy, t0, s0
    10 ALU: ADDR(346) CNT(25) 
         79  x: MUL         ____,  R4.z,  C31.y      VEC_120 
             y: MULADD      ____,  R1.z,  R13.w, -R2.y      VEC_210 
             z: MULADD      ____,  R1.z,  R13.w, -R2.x      VEC_210 
             w: MULADD      ____,  R1.z,  R13.w, -R2.z      VEC_210 
             t: MULADD      T0.x,  R1.z,  R13.w, -R2.w      
         80  x: CNDGE       ____,  PV79.y,  0.0f,  1.0f      
             y: FRACT       R4.y,  PV79.x      
             z: FRACT       T0.z,  R5.z      
             w: CNDGE       T0.w,  PV79.z,  0.0f,  1.0f      
             t: CNDGE       T1.z,  PV79.w,  0.0f,  1.0f      
         81  x: ADD         R2.x,  R6.y,  C31.z      
             y: CNDGE       ____,  T0.x,  0.0f,  1.0f      
             z: MULADD      R6.z,  R3.z,  C36.w,  R9.w      
             w: ADD         ____, -PV80.w,  PV80.x      
             t: ADD         R6.x,  R6.y,  C31.z      
         82  x: ADD         ____, -T1.z,  PV81.y      
             y: ADD         R2.y,  PV81.z,  C31.w      
             z: MULADD      R2.z,  PV81.w,  T0.z,  T0.w      
             w: ADD         R6.w,  PV81.z,  C31.w      
             t: MUL         ____,  R6.y,  C31.x      
         83  x: FRACT       R4.x,  PS82      
             y: MULADD      R7.y,  R3.y,  C36.w,  R9.y      
             z: MUL         R5.z,  R6.z,  C31.y      
             w: MULADD      R4.w,  PV82.x,  T0.z,  T1.z      
             t: MULADD      R8.y,  R3.y,  C33.z,  R9.y      VEC_021 
    11 TEX: ADDR(682) CNT(6) VALID_PIX 
         84  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
         85  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
         86  SAMPLE_G R2.___x, R2.xyxx, t0, s0
         87  SAMPLE_G R2.x___, R6.yzyy, t0, s0
         88  SAMPLE_G R2._x__, R6.xzxx, t0, s0
         89  SAMPLE_G R6.__x_, R6.ywyy, t0, s0
    12 ALU: ADDR(371) CNT(35) 
         90  x: MULADD      ____,  R1.z,  R13.w, -R2.x      VEC_021 
             y: MULADD      ____,  R1.z,  R13.w, -R6.z      VEC_021 
             z: MULADD      ____,  R1.z,  R13.w, -R2.w      VEC_021 
             w: MULADD      ____,  R1.z,  R13.w, -R2.y      VEC_021 
             t: FRACT       T0.w,  R5.z      
         91  x: CNDGE       T0.x,  PV90.y,  0.0f,  1.0f      
             y: CNDGE       T0.y,  PV90.x,  0.0f,  1.0f      
             z: CNDGE       ____,  PV90.w,  0.0f,  1.0f      
             w: CNDGE       ____,  PV90.z,  0.0f,  1.0f      
             t: ADD         ____, -R2.z,  R4.w      
         92  x: ADD         ____, -PV91.y,  PV91.z      
             y: MUL         T1.y,  R7.y,  C31.x      
             z: MULADD      ____,  PS91,  R4.y,  R2.z      
             w: ADD         ____, -PV91.x,  PV91.w      
             t: MULADD      R7.z,  R3.z,  C36.z,  R9.w      VEC_021 
         93  x: ADD         T0.x,  R8.z,  PV92.z      
             y: MULADD      T0.y,  PV92.x,  R4.x,  T0.y      
             z: MULADD      ____,  PV92.w,  R4.x,  T0.x      
             w: ADD         R4.w,  R7.y,  C31.z      
             t: ADD         R4.y,  PS92,  C31.w      
         94  x: ADD         R7.x,  R7.y,  C31.z      
             y: MUL         ____,  R7.z,  C31.y      
             z: FRACT       R5.z,  T1.y      VEC_120 
             w: ADD         ____, -PV93.y,  PV93.z      
             t: ADD         R7.w,  R7.z,  C31.w      
         95  x: FRACT       R6.x,  PV94.y      
             y: MULADD      ____,  PV94.w,  T0.w,  T0.y      VEC_021 
             z: MULADD      R8.z,  R3.z,  C33.z,  R9.w      VEC_021 
             w: ADD         R2.w,  R8.y,  C31.z      
             t: ADD         R8.x,  R8.y,  C31.z      
         96  x: MUL         R2.x,  R8.y,  C31.x      
             y: ADD         R2.y,  PV95.z,  C31.w      
             z: ADD         R6.z,  T0.x,  PV95.y      
             w: ADD         R8.w,  PV95.z,  C31.w      
             t: MUL         R2.z,  PV95.z,  C31.y      
    13 TEX: ADDR(694) CNT(7) VALID_PIX 
         97  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
         98  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
         99  SAMPLE_G R4.___x, R4.wyww, t0, s0
        100  SAMPLE_G R2.___x, R2.wyww, t0, s0
        101  SAMPLE_G R4.x___, R7.yzyy, t0, s0
        102  SAMPLE_G R2._x__, R7.xzxx, t0, s0
        103  SAMPLE_G R7.__x_, R7.ywyy, t0, s0
    14 ALU: ADDR(406) CNT(16) 
        104  x: MULADD      ____,  R1.z,  R13.w, -R4.w      VEC_210 
             y: MULADD      ____,  R1.z,  R13.w, -R2.y      VEC_210 
             z: MULADD      ____,  R1.z,  R13.w, -R4.x      VEC_210 
             w: MULADD      ____,  R1.z,  R13.w, -R7.z      VEC_210 
             t: MULADD      T0.z,  R1.z,  R13.w, -R2.w      VEC_120 
        105  x: CNDGE       ____,  PV104.y,  0.0f,  1.0f      
             y: CNDGE       ____,  PV104.x,  0.0f,  1.0f      
             z: CNDGE       T1.z,  PV104.w,  0.0f,  1.0f      
             w: CNDGE       T0.w,  PV104.z,  0.0f,  1.0f      
             t: FRACT       R4.x,  R2.x      
        106  x: ADD         ____, -PV105.z,  PV105.y      
             y: FRACT       R7.y,  R2.z      
             z: CNDGE       R2.z,  T0.z,  0.0f,  1.0f      VEC_120 
             w: ADD         ____, -PV105.w,  PV105.x      
        107  z: MULADD      R5.z,  PV106.w,  R5.z,  T0.w      
             w: MULADD      R2.w,  PV106.x,  R5.z,  T1.z      
    15 TEX: ADDR(708) CNT(5) VALID_PIX 
        108  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        109  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        110  SAMPLE_G R2.x___, R8.yzyy, t0, s0
        111  SAMPLE_G R2._x__, R8.xzxx, t0, s0
        112  SAMPLE_G R8.__x_, R8.ywyy, t0, s0
    16 ALU_PUSH_BEFORE: ADDR(422) CNT(18) 
        113  x: MULADD      ____,  R1.z,  R13.w, -R2.x      
             y: MULADD      ____,  R1.z,  R13.w, -R8.z      
             z: ADD         ____, -R5.z,  R2.w      VEC_120 
             w: MULADD      ____,  R1.z,  R13.w, -R2.y      
        114  x: CNDGE       ____,  PV113.w,  0.0f,  1.0f      
             y: CNDGE       T0.y,  PV113.x,  0.0f,  1.0f      
             z: MULADD      ____,  PV113.z,  R6.x,  R5.z      
             w: CNDGE       T0.w,  PV113.y,  0.0f,  1.0f      
        115  x: ADD         T0.x,  R6.z,  PV114.z      
             y: ADD         ____, -PV114.y,  PV114.x      
             w: ADD         ____, -PV114.w,  R2.z      
        116  y: MULADD      T0.y,  PV115.y,  R4.x,  T0.y      
             z: MULADD      ____,  PV115.w,  R4.x,  T0.w      
        117  w: ADD         ____, -PV116.y,  PV116.z      
        118  y: MULADD      ____,  PV117.w,  R7.y,  T0.y      
        119  z: ADD         R8.z,  T0.x,  PV118.y      
        120  x: CNDGE       R2.x, -PV119.z,  0.0f,  1.0f      
        121  x: PREDNE      ____,  R2.x, -R2.x      UPDATE_EXEC_MASK UPDATE_PRED 
    17 ALU_PUSH_BEFORE: ADDR(440) CNT(3) 
        122  y: ADD         ____,  R8.z,  C37.x      
        123  x: CNDGE       R2.x,  PV122.y,  1.0f,  0.0f      
        124  x: PREDNE      ____,  R2.x, -R2.x      UPDATE_EXEC_MASK UPDATE_PRED 
    18 JUMP  ADDR(20) VALID_PIX 
    19 ALU: ADDR(443) CNT(1) 
        125  z: MOV         R8.z,  1.0f      
    20 ELSE POP_CNT(1) ADDR(34) VALID_PIX 
    21 ALU: ADDR(444) CNT(8) 
        126  y: MULADD      R4.y,  R3.y,  C38.x,  R9.y      
             z: MULADD      R4.z,  R3.z,  C38.y,  R9.w      
             t: MULADD      R5.y,  R3.y,  C37.y,  R9.y      VEC_021 
        127  x: ADD         R2.x,  PV126.y,  C31.z      
             y: ADD         R2.y,  PV126.z,  C31.w      
             z: MUL         R6.z,  PV126.y,  C31.x      
             w: ADD         R4.w,  PV126.z,  C31.w      
             t: ADD         R4.x,  PV126.y,  C31.z      
    22 TEX: ADDR(718) CNT(6) VALID_PIX 
        128  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        129  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        130  SAMPLE_G R2.___x, R2.xyxx, t0, s0
        131  SAMPLE_G R2.x___, R4.yzyy, t0, s0
        132  SAMPLE_G R2._x__, R4.xzxx, t0, s0
        133  SAMPLE_G R2.__x_, R4.ywyy, t0, s0
    23 ALU: ADDR(452) CNT(15) 
        134  x: MULADD      ____,  R1.z,  R13.w, -R2.x      VEC_021 
             y: MULADD      ____,  R1.z,  R13.w, -R2.w      VEC_021 
             z: MULADD      ____,  R1.z,  R13.w, -R2.z      VEC_021 
             w: MULADD      ____,  R1.z,  R13.w, -R2.y      VEC_021 
             t: MUL         R4.w,  R4.z,  C31.y      
        135  x: CNDGE       R4.x,  PV134.x,  0.0f,  1.0f      
             y: CNDGE       R4.y,  PV134.y,  0.0f,  1.0f      
             z: CNDGE       R7.z,  PV134.z,  0.0f,  1.0f      
             w: CNDGE       R6.w,  PV134.w,  0.0f,  1.0f      
             t: MULADD      R5.z,  R3.z,  C37.z,  R9.w      VEC_021 
        136  x: ADD         R2.x,  R5.y,  C31.z      
             y: ADD         R2.y,  PS135,  C31.w      
             z: MUL         R4.z,  R5.y,  C31.x      
             w: ADD         R5.w,  PS135,  C31.w      
             t: ADD         R5.x,  R5.y,  C31.z      
    24 TEX: ADDR(730) CNT(6) VALID_PIX 
        137  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        138  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        139  SAMPLE_G R2.___x, R2.xyxx, t0, s0
        140  SAMPLE_G R2.x___, R5.yzyy, t0, s0
        141  SAMPLE_G R2._x__, R5.xzxx, t0, s0
        142  SAMPLE_G R2.__x_, R5.ywyy, t0, s0
    25 ALU: ADDR(467) CNT(39) 
        143  x: MULADD      ____,  R1.z,  R13.w, -R2.y      VEC_210 
             y: MULADD      ____,  R1.z,  R13.w, -R2.x      VEC_210 
             z: MULADD      ____,  R1.z,  R13.w, -R2.z      VEC_210 
             w: MUL         ____,  R5.z,  C31.y      VEC_120 
             t: MULADD      T0.w,  R1.z,  R13.w, -R2.w      
        144  x: FRACT       T0.x,  PV143.w      
             y: FRACT       T0.y,  R4.z      
             z: CNDGE       T0.z,  PV143.y,  0.0f,  1.0f      
             w: CNDGE       ____,  PV143.x,  0.0f,  1.0f      
             t: CNDGE       T1.y,  PV143.z,  0.0f,  1.0f      
        145  x: CNDGE       ____,  T0.w,  0.0f,  1.0f      
             y: ADD         ____, -PV144.z,  PV144.w      
             z: FRACT       T1.z,  R6.z      
             w: FRACT       R7.w,  R4.w      VEC_201 
             t: ADD         ____, -R4.x,  R6.w      
        146  x: MULADD      R11.x,  PS145,  PV145.z,  R4.x      
             y: ADD         ____, -T1.y,  PV145.x      
             z: MULADD      T0.z,  PV145.y,  T0.y,  T0.z      
             w: ADD         ____, -R7.z,  R4.y      VEC_021 
             t: MULADD      R6.z,  R3.z,  C40.y,  R9.w      VEC_021 
        147  x: MULADD      R8.x,  PV146.w,  T1.z,  R7.z      
             y: ADD         R7.y,  PS146,  C31.w      
             z: MUL         ____,  PS146,  C31.y      
             w: MULADD      ____,  PV146.y,  T0.y,  T1.y      
             t: ADD         R6.w,  PS146,  C31.w      
        148  x: FRACT       R2.x,  PV147.z      
             y: ADD         ____, -T0.z,  PV147.w      
             z: MULADD      R5.z,  R3.z,  C40.w,  R9.w      VEC_120 
             t: MULADD      R6.y,  R3.y,  C40.x,  R9.y      VEC_021 
        149  x: MULADD      ____,  PV148.y,  T0.x,  T0.z      
             y: MULADD      R5.y,  R3.y,  C40.z,  R9.y      
             z: ADD         R7.z,  PS148,  C31.z      
             w: MUL         ____,  PS148,  C31.x      
             t: ADD         R6.x,  PS148,  C31.z      
        150  x: ADD         R4.x,  PV149.y,  C31.z      
             y: FRACT       R4.y,  PV149.w      
             z: ADD         R4.z,  R8.z,  PV149.x      
             w: ADD         R4.w,  R5.z,  C31.w      VEC_120 
             t: ADD         R5.x,  PV149.y,  C31.z      
    26 TEX: ADDR(742) CNT(7) VALID_PIX 
        151  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        152  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        153  SAMPLE_G R2.___x, R7.zyzz, t0, s0
        154  SAMPLE_G R4.___x, R4.xwxx, t0, s0
        155  SAMPLE_G R4.x___, R6.yzyy, t0, s0
        156  SAMPLE_G R7._x__, R6.xzxx, t0, s0
        157  SAMPLE_G R6.__x_, R6.ywyy, t0, s0
    27 ALU: ADDR(506) CNT(25) 
        158  x: MULADD      ____,  R1.z,  R13.w, -R7.y      VEC_021 
             y: MULADD      ____,  R1.z,  R13.w, -R4.x      VEC_021 
             z: MULADD      ____,  R1.z,  R13.w, -R2.w      VEC_021 
             w: MULADD      ____,  R1.z,  R13.w, -R6.z      VEC_021 
             t: ADD         R5.w,  R5.z,  C31.w      
        159  x: CNDGE       ____,  PV158.z,  0.0f,  1.0f      
             y: CNDGE       T0.y,  PV158.w,  0.0f,  1.0f      
             z: CNDGE       ____,  PV158.x,  0.0f,  1.0f      
             w: CNDGE       T0.w,  PV158.y,  0.0f,  1.0f      
             t: MUL         ____,  R5.y,  C31.x      
        160  x: MUL         ____,  R5.z,  C31.y      
             y: MULADD      ____,  R1.z,  R13.w, -R4.w      VEC_120 
             z: ADD         ____, -PV159.y,  PV159.x      
             w: ADD         ____, -PV159.w,  PV159.z      
             t: FRACT       R4.w,  PS159      
        161  x: FRACT       R7.x,  PV160.x      
             y: MULADD      R4.y,  PV160.w,  R4.y,  T0.w      VEC_021 
             z: CNDGE       R6.z,  PV160.y,  0.0f,  1.0f      
             w: MULADD      ____,  PV160.z,  R4.y,  T0.y      VEC_021 
             t: MULADD      R10.z,  R3.z,  C39.y,  R9.w      VEC_021 
        162  x: ADD         R4.x, -PV161.y,  PV161.w      
             y: MULADD      R10.y,  R3.y,  C39.x,  R9.y      
             z: ADD         R9.z,  PS161,  C31.w      
             w: ADD         R10.w,  PS161,  C31.w      
             t: MUL         R7.z,  PS161,  C31.y      
    28 TEX: ADDR(756) CNT(6) VALID_PIX 
        163  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        164  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        165  SAMPLE_G R6.x___, R5.yzyy, t0, s0
        166  SAMPLE_G R7._x__, R5.xzxx, t0, s0
        167  SAMPLE_G R5.__x_, R5.ywyy, t0, s0
        168  SAMPLE_G R5.x___, R10.yzyy, t0, s0
    29 ALU: ADDR(531) CNT(33) 
        169  x: MULADD      ____,  R1.z,  R13.w, -R5.z      
             y: MULADD      ____,  R1.z,  R13.w, -R7.y      VEC_021 
             z: MULADD      ____,  R4.x,  R2.x,  R4.y      VEC_120 
             w: MULADD      ____,  R1.z,  R13.w, -R6.x      VEC_120 
             t: ADD         R9.x,  R10.y,  C31.z      
        170  x: CNDGE       T0.x,  PV169.w,  0.0f,  1.0f      
             y: CNDGE       T0.y,  PV169.x,  0.0f,  1.0f      
             z: CNDGE       ____,  PV169.y,  0.0f,  1.0f      
             w: ADD         T0.w,  R4.z,  PV169.z      
             t: ADD         R10.x,  R10.y,  C31.z      
        171  x: MUL         ____,  R10.y,  C31.x      
             y: ADD         ____, -PV170.y,  R6.z      
             z: MULADD      ____,  R1.z,  R13.w, -R5.x      
             w: ADD         ____, -PV170.x,  PV170.z      
             t: FRACT       R2.x,  R7.z      
        172  x: MULADD      T0.x,  PV171.w,  R4.w,  T0.x      
             y: FRACT       R7.y,  PV171.x      
             z: MULADD      ____,  PV171.y,  R4.w,  T0.y      VEC_120 
             w: CNDGE       R2.w,  PV171.z,  0.0f,  1.0f      
             t: MULADD      R6.y,  R3.y,  C39.z,  R9.y      VEC_021 
        173  x: ADD         R5.x,  PS172,  C31.z      
             y: ADD         ____, -PV172.x,  PV172.z      
             z: MULADD      R6.z,  R3.z,  C39.w,  R9.w      
             w: MUL         ____,  PS172,  C31.x      
             t: ADD         R6.x,  PS172,  C31.z      
        174  x: MULADD      ____,  PV173.y,  R7.x,  T0.x      
             y: ADD         R5.y,  PV173.z,  C31.w      
             z: MUL         ____,  PV173.z,  C31.y      
             w: ADD         R6.w,  PV173.z,  C31.w      
             t: FRACT       R8.w,  PV173.w      
        175  x: ADD         R8.x, -R11.x,  R8.x      
             y: FRACT       R2.y,  PV174.z      
             z: ADD         R7.z,  T0.w,  PV174.x      
    30 TEX: ADDR(768) CNT(6) VALID_PIX 
        176  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        177  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        178  SAMPLE_G R4.___x, R9.xzxx, t0, s0
        179  SAMPLE_G R4._x__, R10.xzxx, t0, s0
        180  SAMPLE_G R5.___x, R5.xyxx, t0, s0
        181  SAMPLE_G R10.__x_, R10.ywyy, t0, s0
    31 ALU: ADDR(564) CNT(13) 
        182  x: MULADD      ____,  R1.z,  R13.w, -R4.y      
             y: MULADD      ____,  R1.z,  R13.w, -R10.z      
             z: MULADD      ____,  R1.z,  R13.w, -R4.w      
             w: MULADD      R4.w,  R8.x,  R7.w,  R11.x      VEC_102 
        183  x: CNDGE       ____,  PV182.z,  0.0f,  1.0f      
             y: CNDGE       T0.y,  PV182.y,  0.0f,  1.0f      
             z: CNDGE       ____,  PV182.x,  0.0f,  1.0f      
             w: MULADD      ____,  R1.z,  R13.w, -R5.w      
        184  y: CNDGE       R10.y,  PV183.w,  0.0f,  1.0f      
             z: ADD         ____, -PV183.y,  PV183.x      
             w: ADD         ____, -R2.w,  PV183.z      
        185  y: MULADD      R4.y,  PV184.w,  R7.y,  R2.w      
             w: MULADD      R2.w,  PV184.z,  R7.y,  T0.y      
    32 TEX: ADDR(780) CNT(5) VALID_PIX 
        186  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        187  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
        188  SAMPLE_G R8.x___, R6.yzyy, t0, s0
        189  SAMPLE_G R7._x__, R6.xzxx, t0, s0
        190  SAMPLE_G R6.__x_, R6.ywyy, t0, s0
    33 ALU_POP_AFTER: ADDR(577) CNT(18) 
        191  x: MULADD      ____,  R1.z,  R13.w, -R6.z      
             y: MULADD      ____,  R1.z,  R13.w, -R7.y      
             z: ADD         ____, -R4.y,  R2.w      VEC_021 
             w: MULADD      ____,  R1.z,  R13.w, -R8.x      
        192  x: CNDGE       T0.x,  PV191.w,  0.0f,  1.0f      
             y: CNDGE       ____,  PV191.y,  0.0f,  1.0f      
             z: MULADD      ____,  PV191.z,  R2.x,  R4.y      
             w: CNDGE       T0.w,  PV191.x,  0.0f,  1.0f      
        193  x: ADD         ____, -PV192.x,  PV192.y      
             y: ADD         ____, -PV192.w,  R10.y      
             w: ADD         T1.w,  R7.z,  PV192.z      
        194  x: MULADD      T0.x,  PV193.x,  R8.w,  T0.x      
             z: MULADD      ____,  PV193.y,  R8.w,  T0.w      
        195  y: ADD         ____, -PV194.x,  PV194.z      
        196  x: MULADD      ____,  PV195.y,  R2.y,  T0.x      
        197  y: ADD         ____,  T1.w,  PV196.x      
        198  x: ADD         ____,  R4.w,  PV197.y      
        199  z: MUL         R8.z,  PV198.x,  C37.w      
    34 POP (2) ADDR(35) 
    35 ALU_POP_AFTER: ADDR(595) CNT(2) 
        200  x: ADD         ____,  R3.w, -R8.z      
        201  w: MULADD      R3.w,  R1.w,  PV200.x,  R8.z      
    36 TEX: ADDR(790) CNT(2) VALID_PIX 
        202  SAMPLE R2, R1.xyxx, t2, s2
        203  SAMPLE R1, R1.xyxx, t1, s1
    37 ALU: ADDR(597) CNT(48) 
        204  x: DOT4        ____,  R12.x, -C29.x      
             y: DOT4        ____,  R11.y, -C29.y      
             z: DOT4        ____,  R12.z, -C29.z      
             w: DOT4        T1.w,  (0x80000000, 0.0f).x,  0.0f      
             t: MULADD      T0.w,  R2.w,  R11.z,  C23.x      
        205  x: MUL         ____,  C27.x,  C27.x      
             w: MAX         ____,  PV204.x,  0.0f      
             t: LOG_sat     ____,  |R12.w|      
        206  x: MUL         ____,  PV205.w,  R1.y      
             y: MUL         ____,  T0.w,  PS205      
             z: MUL         ____,  PV205.w,  R1.x      
             w: MUL         ____,  PV205.w,  R1.z      
             t: RCP_e       ____,  PV205.x      
        207  x: MUL         ____,  PV206.z,  C30.x      
             y: MUL         T1.y,  R11.w,  PS206      CLAMP 
             z: MUL         T0.z,  PV206.w,  C30.z      
             w: MUL         ____,  PV206.x,  C30.y      
             t: EXP_e       ____,  PV206.y      
        208  x: MUL         ____,  R2.z,  PS207      
             y: MUL         ____,  R2.y,  PS207      
             z: MUL         ____,  R2.x,  PS207      
             w: MUL         ____,  R3.w,  PV207.x      
             t: MUL         T0.y,  R3.w,  PV207.w      
        209  x: MUL         ____,  R3.w,  T0.z      
             y: MUL         ____,  PV208.x,  C30.z      
             z: MUL         ____,  PV208.y,  C30.y      
             w: MUL         ____,  PV208.z,  C30.x      
             t: MULADD      T0.w,  R1.x,  R0.x,  PV208.w      
        210  x: MUL         ____,  PV209.w,  C25.x      
             y: MULADD      T0.y,  R1.z,  R0.z,  PV209.x      
             z: MULADD      T0.z,  R1.y,  R0.y,  T0.y      
             w: MUL         ____,  PV209.z,  C25.x      
             t: MUL         ____,  PV209.y,  C25.x      
        211  x: MUL         T0.x,  T1.y,  C27.y      
             y: MULADD      ____,  PS210,  R3.w,  PV210.y      
             z: MULADD      ____,  PV210.w,  R3.w,  PV210.z      
             w: MULADD      ____,  PV210.x,  R3.w,  T0.w      
        212  y: CNDGE       T0.y, -T1.w,  T0.y,  PV211.y      
             z: CNDGE       T0.z, -T1.w,  T0.z,  PV211.z      
             w: CNDGE       T0.w, -T1.w,  T0.w,  PV211.w      
        213  x: ADD         ____, -PV212.y,  C26.z      
             y: ADD         ____, -PV212.z,  C26.y      
             z: ADD         ____, -PV212.w,  C26.x      
             w: MUL         R0.w,  R0.w,  R1.w      
        214  x: MULADD      R0.x,  T0.x,  PV213.z,  T0.w      
             y: MULADD      R0.y,  T0.x,  PV213.y,  T0.z      
             z: MULADD      R0.z,  T0.x,  PV213.x,  T0.y      
    38 EXP_DONE: PIX0, R0
    
    Overall ALU slot utilization is about 80%. The shortfall is not caused by the t unit sitting idle; it comes from chains of dependent scalar operations that leave the other slots empty.
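
    To make a figure like this checkable, here is a minimal sketch that counts filled ALU slots in a listing of the format shown above. It assumes that "utilization" means filled slots divided by 5 slots (x, y, z, w, t) per instruction group; that definition, and the names `slot_utilization`, `SLOT_LINE` and `SAMPLE`, are mine and not from the disassembler. The sample is clause 31 from the dump, so its result describes that one clause and not the whole-shader number.

```python
import re

# One ALU slot line. Group 1 holds the instruction-group index when the line
# opens a new group (e.g. "182  x: ..."), and is None on continuation lines.
SLOT_LINE = re.compile(r"^\s*(\d+)?\s*([xyzwt]):\s")


def slot_utilization(listing, slots_per_group=5):
    """Return (filled_slots, groups, utilization) for an ISA listing.

    Assumption: utilization = filled slots / (slots_per_group * groups).
    Clause headers, TEX instructions and blank lines are ignored.
    """
    filled = 0
    groups = 0
    for line in listing.splitlines():
        match = SLOT_LINE.match(line)
        if match is None:
            continue  # not an ALU slot line
        if match.group(1) is not None:
            groups += 1  # a numbered line starts a new instruction group
        filled += 1
    utilization = filled / (slots_per_group * groups) if groups else 0.0
    return filled, groups, utilization


# Clause 31 copied from the listing above (instruction groups 182 to 185).
SAMPLE = """
    31 ALU: ADDR(564) CNT(13)
        182  x: MULADD      ____,  R1.z,  R13.w, -R4.y
             y: MULADD      ____,  R1.z,  R13.w, -R10.z
             z: MULADD      ____,  R1.z,  R13.w, -R4.w
             w: MULADD      R4.w,  R8.x,  R7.w,  R11.x      VEC_102
        183  x: CNDGE       ____,  PV182.z,  0.0f,  1.0f
             y: CNDGE       T0.y,  PV182.y,  0.0f,  1.0f
             z: CNDGE       ____,  PV182.x,  0.0f,  1.0f
             w: MULADD      ____,  R1.z,  R13.w, -R5.w
        184  y: CNDGE       R10.y,  PV183.w,  0.0f,  1.0f
             z: ADD         ____, -PV183.y,  PV183.x
             w: ADD         ____, -R2.w,  PV183.z
        185  y: MULADD      R4.y,  PV184.w,  R7.y,  R2.w
             w: MULADD      R2.w,  PV184.z,  R7.y,  T0.y
"""

filled, groups, utilization = slot_utilization(SAMPLE)
print(f"{filled} of {5 * groups} slots filled ({utilization:.0%})")
```

    Short clauses with long dependency chains, such as groups 195 to 199 above with one operation each, pull the number down; running the function over the full listing gives the overall figure under the same assumption.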
     
