Old 18-Aug-2009, 17:18   #1
prunedtree
Regular
 
Join Date: Aug 2009
Posts: 27
Default Faster dense matrix-matrix products on ATi hardware

The dense matrix-matrix product, a basic building block of dense linear algebra, is rather straightforward in terms of multiply-add operations and very friendly to memory hierarchies, which makes it an interesting benchmark for throughput-oriented hardware: the performance of matrix-matrix multiply is often a much more relevant measure than the sum of all ALU operations that can be completed in a second.

Contrast for instance the peak performance claimed by both major IHVs:

nVidia GT200 (GTX280) : 933 Gflop/s
ATi RV770 (HD4870) : 1200 Gflop/s

...to the current fastest matrix-matrix product implementations:

CUBLAS 2.0 on GTX280[1] achieves 375 Gflop/s
and on HD4870[2] ATi reckons 540 Gflop/s

However, there is a significant difference: Volkov's implementation achieves the peak multiply-add rate with one operand from shared memory, while ATi's implementation is limited by the speed of the texture units. Like many others, I thought that higher performance could be possible on ATi boards, using some mechanism to avoid the memory bottleneck. According to ATi, however, this is not possible[3], and they do not want to disclose details on the hardware.

Thus, I experimented with all the ideas I could come up with. And I've hit these limitations ATi knew about, one after another: Shared memory (LDS in ATi parlance) is no faster than texture fetches that hit L1 (30 billion float4 per second, 480 GB/s, for both). Shared memory broadcasting requires impractical amounts of ALU work in order to put addresses into registers (sigh). Shared registers have half the peak bandwidth of local registers, giving us a limit of 480 Gflop/s...

ATi's claim checks out: the limited features of their hardware do not really offer any help. But is their implementation really optimal? The bandwidth intensity of a matrix-matrix product implementation is directly related to the size of the blocks in the destination matrix, and `simple_matmult' uses 8x4 blocks (this is also the maximum for their pixel shader approach). RV770's texture units can deliver 120 billion single precision values per second and we need two input values for each multiply-add operation. With 8x4 blocks, the bandwidth reduction is ~5, and thus we obtain a peak of 600 Gflop/s.
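The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes that one step on an m x n block fetches m + n values and performs m * n multiply-adds, against 2 fetched values per multiply-add without blocking.

```python
def bandwidth_reduction(m: int, n: int) -> float:
    """Fetches saved by an m x n output block relative to no blocking.

    Unblocked: 2 fetched values per multiply-add.
    Blocked: m + n fetched values feed m * n multiply-adds.
    """
    return 2.0 * m * n / (m + n)


def texture_bound_gflops(m: int, n: int, gvalues_per_s: float = 120.0) -> float:
    """Texture-limited peak in Gflop/s, counting one multiply-add as 2 flops."""
    unblocked_gmads = gvalues_per_s / 2.0   # 2 values per multiply-add
    return 2.0 * unblocked_gmads * bandwidth_reduction(m, n)


for m, n in [(8, 4), (8, 8)]:
    print(f"{m}x{n}: reduction {bandwidth_reduction(m, n):.2f}, "
          f"peak {texture_bound_gflops(m, n):.0f} Gflop/s")
```

Under these assumptions the exact ratio for 8x4 blocks is 5.33 (640 Gflop/s); rounding it down to ~5 gives the 600 Gflop/s quoted above. For 8x8 blocks the ratio is exactly 8.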

Using 8x8 blocks would bring the bandwidth reduction to 8, for a peak of 960 Gflop/s. However, the obvious limitation to larger block sizes is that we need enough space in the register file to store them. The size of the register file (1024 scalars) on RV770 may seem impressive, but with 180 cycles of latency for L1 hits, you need 8 threads (wavefronts in ATi parlance) to hide one texture clause behind 30 cycles of computation (128 multiply-add). This leaves us fewer than ~120 scalars per thread in order to compute, for instance, two 8x8 outer products per loop: a single texture clause can load the four float8 inputs, and the two outer products amount to 128 multiply-add instructions.
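A rough budget check for the numbers in this paragraph, as a Python sketch. The helper names are hypothetical, and the 5 ALU lanes per cycle (the x, y, z, w and t slots) are an assumption read off the ISA listing further down.

```python
import math

REGISTER_FILE_SCALARS = 1024   # register file size quoted above
THREADS_IN_FLIGHT = 8          # threads needed to hide the texture latency
ALU_LANES = 5                  # assumed: x, y, z, w, t slots per cycle


def scalars_per_thread(register_file: int, threads: int) -> int:
    """Upper limit on registers each thread can own."""
    return register_file // threads


def mads_per_iteration(block_m: int, block_n: int, outer_products: int) -> int:
    """Multiply-adds needed for a number of block_m x block_n outer products."""
    return block_m * block_n * outer_products


def min_alu_cycles(mads: int, lanes: int = ALU_LANES) -> int:
    """Cycle count if every lane issued a multiply-add on every cycle."""
    return math.ceil(mads / lanes)


budget = scalars_per_thread(REGISTER_FILE_SCALARS, THREADS_IN_FLIGHT)
mads = mads_per_iteration(8, 8, 2)
print(budget, mads, min_alu_cycles(mads))
```

The 128-scalar ceiling sits just above the "fewer than ~120" usable scalars mentioned above, and the 26-cycle figure is only a lower bound with every lane full, so about 30 cycles per loop is plausible.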

How much register space would we need? 64 scalars for the output block, the 32 values that are fetched by the texture units, and some registers for the loop index and texture addresses... a hundred scalars. This looks quite reasonable, so I implemented it. The major difficulty is to trick ATi's horrible compiler (which reflects the current quality of their `GPU computing' software stack well) into producing decent machine code. Here's what it looks like:

Code:
00 ALU: ADDR(32) CNT(71) 
      0  x: LSHR        T0.x,  R0.x,  (0x00000006, 8.407790786e-45f).x      
         y: MOV         R23.y,  0.0f      
         z: MOV         R23.z,  0.0f      
         w: AND_INT     T0.w,  R0.x,  (0x0000003F, 8.828180325e-44f).y      
         t: MOV         R23.x,  0.0f      
      1  x: MOV         R22.x,  0.0f      
         y: MOV         R22.y,  0.0f      
         z: MOV         R22.z,  0.0f      
         w: MOV         R23.w,  0.0f      
         t: MOV         R22.w,  0.0f      
      2  x: MOV         R21.x,  0.0f      
         y: MOV         R21.y,  0.0f      
         z: MOV         R21.z,  0.0f      
         w: MOV         R21.w,  0.0f      
         t: MOV         R20.x,  0.0f      
      3  x: MOV         R19.x,  0.0f      
         y: MOV         R20.y,  0.0f      
         z: MOV         R20.z,  0.0f      
         w: MOV         R20.w,  0.0f      
         t: MOV         R4.z,  (0xC0000000, -2.0f).x      
      4  x: MOV         R18.x,  0.0f      
         y: MOV         R19.y,  0.0f      
         z: MOV         R19.z,  0.0f      
         w: MOV         R19.w,  0.0f      
         t: MOV         R18.y,  0.0f      
      5  x: MOV         R17.x,  0.0f      
         y: MOV         R17.y,  0.0f      
         z: MOV         R18.z,  0.0f      
         w: MOV         R18.w,  0.0f      
         t: MOV         R17.z,  0.0f      
      6  x: MOV         R16.x,  0.0f      
         y: MOV         R16.y,  0.0f      
         z: MOV         R16.z,  0.0f      
         w: MOV         R17.w,  0.0f      
         t: MOV         R16.w,  0.0f      
      7  x: MOV         R15.x,  0.0f      
         y: MOV         R15.y,  0.0f      
         z: MOV         R15.z,  0.0f      
         w: MOV         R15.w,  0.0f      
         t: MOV         R13.x,  0.0f      
      8  x: MOV         R14.x,  0.0f      
         y: MOV         R13.y,  0.0f      
         z: MOV         R13.z,  0.0f      
         w: MOV         R13.w,  0.0f      
         t: MOV         R14.y,  0.0f      
      9  x: MOV         R12.x,  0.0f      
         y: MOV         R12.y,  0.0f      
         z: MOV         R14.z,  0.0f      
         w: MOV         R14.w,  0.0f      
         t: MOV         R12.z,  0.0f      
     10  x: MOV         R11.x,  0.0f      
         y: MOV         R11.y,  0.0f      
         z: MOV         R11.z,  0.0f      
         w: MOV         R12.w,  0.0f      
         t: MOV         R11.w,  0.0f      
     11  x: MOV         R9.x,  0.0f      
         y: MOV         R9.y,  0.0f      
         z: MOV         R9.z,  0.0f      
         w: MOV         R9.w,  0.0f      
         t: MOV         R10.x,  0.0f      
     12  x: MOV         R8.x,  0.0f      
         y: MOV         R10.y,  0.0f      
         z: MOV         R10.z,  0.0f      
         w: MOV         R10.w,  0.0f      
         t: MOV         R8.y,  0.0f      
     13  z: MOV         R8.z,  0.0f      
         w: MOV         R8.w,  0.0f      
         t: I_TO_F      R0.x,  T0.w      
     14  t: I_TO_F      R0.y,  T0.x      
01 TEX: ADDR(288) CNT(1) 
     15  SAMPLE R5.xyz_, R0.xyxx, t4, s4  UNNORM(XYZW) 
02 ALU: ADDR(103) CNT(2) 
     16  x: MOV         R4.x,  R5.x      
         y: MOV         R4.y,  R5.y      
03 LOOP_DX10 i0 FAIL_JUMP_ADDR(11) 
    04 ALU_BREAK: ADDR(105) CNT(3) KCACHE0(CB0:0-15) 
         17  z: ADD         R4.z,  R4.z,  (0x40000000, 2.0f).x      
         18  x: PREDGT      ____,  KC0[0].x,  R4.z      UPDATE_EXEC_MASK UPDATE_PRED 
    05 ALU: ADDR(108) CNT(3) KCACHE0(CB0:0-15) 
         19  x: ADD         R4.x,  R4.x,  KC0[0].y      
             y: ADD         R4.y,  R4.y,  KC0[0].y      
             w: ADD         R4.w,  R4.z,  1.0f      
    06 TEX: ADDR(290) CNT(8) 
         20  SAMPLE R0, R4.xzxx, t0, s0  UNNORM(XYZW) 
         21  SAMPLE R2, R4.xzxx, t1, s1  UNNORM(XYZW) 
         22  SAMPLE R1, R4.yzyy, t2, s2  UNNORM(XYZW) 
         23  SAMPLE R3, R4.yzyy, t3, s3  UNNORM(XYZW) 
         24  SAMPLE R6, R4.xwxx, t0, s0  UNNORM(XYZW) 
         25  SAMPLE R7, R4.xwxx, t1, s1  UNNORM(XYZW) 
         26  SAMPLE R24, R4.ywyy, t2, s2  UNNORM(XYZW) 
         27  SAMPLE R25, R4.ywyy, t3, s3  UNNORM(XYZW) 
    07 ALU_PUSH_BEFORE: ADDR(111) CNT(65) KCACHE0(CB0:0-15) 
         28  x: MULADD      R23.x,  R0.x,  R1.x,  R23.x      
             y: MULADD      R23.y,  R0.x,  R1.y,  R23.y      
             z: MULADD      R23.z,  R0.x,  R1.z,  R23.z      
             w: MULADD      R23.w,  R0.x,  R1.w,  R23.w      
         29  x: MULADD      R22.x,  R0.x,  R3.x,  R22.x      
             y: MULADD      R22.y,  R0.x,  R3.y,  R22.y      
             z: MULADD      R22.z,  R0.x,  R3.z,  R22.z      
             w: MULADD      R22.w,  R0.x,  R3.w,  R22.w      
         30  x: MULADD      R21.x,  R0.y,  R1.x,  R21.x      VEC_210 
             y: MULADD      R21.y,  R0.y,  R1.y,  R21.y      VEC_201 
             z: MULADD      R21.z,  R0.y,  R1.z,  R21.z      VEC_201 
             w: MULADD      R21.w,  R0.y,  R1.w,  R21.w      VEC_201 
             t: MULADD      R19.x,  R0.z,  R1.x,  R19.x      VEC_120 
         31  x: MULADD      R20.x,  R0.y,  R3.x,  R20.x      VEC_210 
             y: MULADD      R20.y,  R0.y,  R3.y,  R20.y      VEC_201 
             z: MULADD      R20.z,  R0.y,  R3.z,  R20.z      VEC_201 
             w: MULADD      R20.w,  R0.y,  R3.w,  R20.w      VEC_201 
             t: MULADD      R18.x,  R0.z,  R3.x,  R18.x      VEC_120 
         32  x: MULADD      R17.x,  R0.w,  R1.x,  R17.x      VEC_201 
             y: MULADD      R19.y,  R0.z,  R1.y,  R19.y      VEC_210 
             z: MULADD      R19.z,  R0.z,  R1.z,  R19.z      VEC_201 
             w: MULADD      R19.w,  R0.z,  R1.w,  R19.w      VEC_201 
             t: MULADD      R17.y,  R0.w,  R1.y,  R17.y      VEC_120 
         33  x: MULADD      R16.x,  R0.w,  R3.x,  R16.x      VEC_201 
             y: MULADD      R18.y,  R0.z,  R3.y,  R18.y      VEC_210 
             z: MULADD      R18.z,  R0.z,  R3.z,  R18.z      VEC_201 
             w: MULADD      R18.w,  R0.z,  R3.w,  R18.w      VEC_201 
             t: MULADD      R16.y,  R0.w,  R3.y,  R16.y      VEC_120 
         34  x: MULADD      R15.x,  R2.x,  R1.x,  R15.x      VEC_201 
             y: MULADD      R15.y,  R2.x,  R1.y,  R15.y      VEC_201 
             z: MULADD      R17.z,  R0.w,  R1.z,  R17.z      
             w: MULADD      R17.w,  R0.w,  R1.w,  R17.w      
             t: MULADD      R15.z,  R2.x,  R1.z,  R15.z      
         35  x: MULADD      R13.x,  R2.x,  R3.x,  R13.x      VEC_201 
             y: MULADD      R13.y,  R2.x,  R3.y,  R13.y      VEC_201 
             z: MULADD      R16.z,  R0.w,  R3.z,  R16.z      
             w: MULADD      R16.w,  R0.w,  R3.w,  R16.w      
             t: MULADD      R13.z,  R2.x,  R3.z,  R13.z      
         36  x: MULADD      R14.x,  R2.y,  R1.x,  R14.x      VEC_201 
             y: MULADD      R14.y,  R2.y,  R1.y,  R14.y      VEC_201 
             z: MULADD      R14.z,  R2.y,  R1.z,  R14.z      VEC_201 
             w: MULADD      R15.w,  R2.x,  R1.w,  R15.w      VEC_210 
             t: MULADD      R14.w,  R2.y,  R1.w,  R14.w      VEC_120 
         37  x: MULADD      R12.x,  R2.y,  R3.x,  R12.x      VEC_201 
             y: MULADD      R12.y,  R2.y,  R3.y,  R12.y      VEC_201 
             z: MULADD      R12.z,  R2.y,  R3.z,  R12.z      VEC_201 
             w: MULADD      R13.w,  R2.x,  R3.w,  R13.w      VEC_210 
             t: MULADD      R12.w,  R2.y,  R3.w,  R12.w      VEC_120 
         38  x: MULADD      R11.x,  R2.z,  R1.x,  R11.x      VEC_210 
             y: MULADD      R11.y,  R2.z,  R1.y,  R11.y      VEC_201 
             z: MULADD      R11.z,  R2.z,  R1.z,  R11.z      VEC_201 
             w: MULADD      R11.w,  R2.z,  R1.w,  R11.w      VEC_201 
             t: MULADD      R10.x,  R2.w,  R1.x,  R10.x      VEC_120 
         39  x: MULADD      R9.x,  R2.z,  R3.x,  R9.x      VEC_210 
             y: MULADD      R10.y,  R2.w,  R1.y,  R10.y      VEC_201 
             z: MULADD      R10.z,  R2.w,  R1.z,  R10.z      VEC_201 
             w: MULADD      R10.w,  R2.w,  R1.w,  R10.w      VEC_201 
             t: MULADD      R8.x,  R2.w,  R3.x,  R8.x      VEC_120 
         40  y: MULADD      R9.y,  R2.z,  R3.y,  R9.y      VEC_210 
             z: MULADD      R9.z,  R2.z,  R3.z,  R9.z      VEC_201 
             w: MULADD      R9.w,  R2.z,  R3.w,  R9.w      VEC_201 
             t: MULADD      R8.y,  R2.w,  R3.y,  R8.y      VEC_120 
         41  z: MULADD      R8.z,  R2.w,  R3.z,  R8.z      
             w: MULADD      R8.w,  R2.w,  R3.w,  R8.w      
         42  x: PREDE_INT   ____,  KC0[0].y,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
    08 JUMP  POP_CNT(1) ADDR(10) 
    09 ALU_POP_AFTER: ADDR(176) CNT(64) 
         43  x: MULADD      R23.x,  R6.x,  R24.x,  R23.x      
             y: MULADD      R23.y,  R6.x,  R24.y,  R23.y      
             z: MULADD      R23.z,  R6.x,  R24.z,  R23.z      
             w: MULADD      R23.w,  R6.x,  R24.w,  R23.w      
         44  x: MULADD      R22.x,  R6.x,  R25.x,  R22.x      
             y: MULADD      R22.y,  R6.x,  R25.y,  R22.y      
             z: MULADD      R22.z,  R6.x,  R25.z,  R22.z      
             w: MULADD      R22.w,  R6.x,  R25.w,  R22.w      
         45  x: MULADD      R20.x,  R6.y,  R24.x,  R20.x      VEC_210 
             y: MULADD      R20.y,  R6.y,  R24.y,  R20.y      VEC_201 
             z: MULADD      R20.z,  R6.y,  R24.z,  R20.z      VEC_201 
             w: MULADD      R20.w,  R6.y,  R24.w,  R20.w      VEC_201 
             t: MULADD      R19.x,  R6.z,  R24.x,  R19.x      VEC_120 
         46  x: MULADD      R21.x,  R6.y,  R25.x,  R21.x      VEC_210 
             y: MULADD      R21.y,  R6.y,  R25.y,  R21.y      VEC_201 
             z: MULADD      R21.z,  R6.y,  R25.z,  R21.z      VEC_201 
             w: MULADD      R21.w,  R6.y,  R25.w,  R21.w      VEC_201 
             t: MULADD      R18.x,  R6.z,  R25.x,  R18.x      VEC_120 
         47  x: MULADD      R17.x,  R6.w,  R24.x,  R17.x      VEC_201 
             y: MULADD      R19.y,  R6.z,  R24.y,  R19.y      VEC_210 
             z: MULADD      R19.z,  R6.z,  R24.z,  R19.z      VEC_201 
             w: MULADD      R19.w,  R6.z,  R24.w,  R19.w      VEC_201 
             t: MULADD      R17.y,  R6.w,  R24.y,  R17.y      VEC_120 
         48  x: MULADD      R16.x,  R6.w,  R25.x,  R16.x      VEC_201 
             y: MULADD      R18.y,  R6.z,  R25.y,  R18.y      VEC_210 
             z: MULADD      R18.z,  R6.z,  R25.z,  R18.z      VEC_201 
             w: MULADD      R18.w,  R6.z,  R25.w,  R18.w      VEC_201 
             t: MULADD      R16.y,  R6.w,  R25.y,  R16.y      VEC_120 
         49  x: MULADD      R15.x,  R7.x,  R24.x,  R15.x      VEC_201 
             y: MULADD      R15.y,  R7.x,  R24.y,  R15.y      VEC_201 
             z: MULADD      R17.z,  R6.w,  R24.z,  R17.z      
             w: MULADD      R17.w,  R6.w,  R24.w,  R17.w      
             t: MULADD      R15.z,  R7.x,  R24.z,  R15.z      
         50  x: MULADD      R13.x,  R7.x,  R25.x,  R13.x      VEC_201 
             y: MULADD      R13.y,  R7.x,  R25.y,  R13.y      VEC_201 
             z: MULADD      R16.z,  R6.w,  R25.z,  R16.z      
             w: MULADD      R16.w,  R6.w,  R25.w,  R16.w      
             t: MULADD      R13.z,  R7.x,  R25.z,  R13.z      
         51  x: MULADD      R14.x,  R7.y,  R24.x,  R14.x      VEC_201 
             y: MULADD      R14.y,  R7.y,  R24.y,  R14.y      VEC_201 
             z: MULADD      R14.z,  R7.y,  R24.z,  R14.z      VEC_201 
             w: MULADD      R15.w,  R7.x,  R24.w,  R15.w      VEC_210 
             t: MULADD      R14.w,  R7.y,  R24.w,  R14.w      VEC_120 
         52  x: MULADD      R12.x,  R7.y,  R25.x,  R12.x      VEC_201 
             y: MULADD      R12.y,  R7.y,  R25.y,  R12.y      VEC_201 
             z: MULADD      R12.z,  R7.y,  R25.z,  R12.z      VEC_201 
             w: MULADD      R13.w,  R7.x,  R25.w,  R13.w      VEC_210 
             t: MULADD      R12.w,  R7.y,  R25.w,  R12.w      VEC_120 
         53  x: MULADD      R11.x,  R7.z,  R24.x,  R11.x      VEC_210 
             y: MULADD      R11.y,  R7.z,  R24.y,  R11.y      VEC_201 
             z: MULADD      R11.z,  R7.z,  R24.z,  R11.z      VEC_201 
             w: MULADD      R11.w,  R7.z,  R24.w,  R11.w      VEC_201 
             t: MULADD      R10.x,  R7.w,  R24.x,  R10.x      VEC_120 
         54  x: MULADD      R9.x,  R7.z,  R25.x,  R9.x      VEC_210 
             y: MULADD      R10.y,  R7.w,  R24.y,  R10.y      VEC_201 
             z: MULADD      R10.z,  R7.w,  R24.z,  R10.z      VEC_201 
             w: MULADD      R10.w,  R7.w,  R24.w,  R10.w      VEC_201 
             t: MULADD      R8.x,  R7.w,  R25.x,  R8.x      VEC_120 
         55  y: MULADD      R9.y,  R7.z,  R25.y,  R9.y      VEC_210 
             z: MULADD      R9.z,  R7.z,  R25.z,  R9.z      VEC_201 
             w: MULADD      R9.w,  R7.z,  R25.w,  R9.w      VEC_201 
             t: MULADD      R8.y,  R7.w,  R25.y,  R8.y      VEC_120 
         56  z: MULADD      R8.z,  R7.w,  R25.z,  R8.z      
             w: MULADD      R8.w,  R7.w,  R25.w,  R8.w      
10 ENDLOOP i0 PASS_JUMP_ADDR(4) 
11 ALU: ADDR(240) CNT(29) 
     57  x: ADD_INT     T0.x,  R5.z,  (0x00000003, 4.203895393e-45f).x      
         y: ADD_INT     ____,  R5.z,  0.0f      
         z: ADD_INT     T0.z,  R5.z,  (0x00000002, 2.802596929e-45f).y      
         w: ADD_INT     ____,  R5.z,  1      
     58  x: LSHL        R0.x,  PV57.y,  (0x00000002, 2.802596929e-45f).x      
         y: ADD_INT     T0.y,  PV57.z,  (0x00000004, 5.605193857e-45f).y      
         z: ADD_INT     T1.z,  PV57.w,  (0x00000004, 5.605193857e-45f).y      
         w: ADD_INT     T0.w,  PV57.y,  (0x00000004, 5.605193857e-45f).y      
         t: LSHL        R1.x,  PV57.w,  (0x00000002, 2.802596929e-45f).x      
     59  x: LSHL        R2.x,  T0.z,  (0x00000002, 2.802596929e-45f).x      
         y: ADD_INT     R0.y,  PV58.z,  (0x00000004, 5.605193857e-45f).y      
         z: ADD_INT     R0.z,  PV58.w,  (0x00000004, 5.605193857e-45f).y      
         w: ADD_INT     T1.w,  T0.x,  (0x00000004, 5.605193857e-45f).y      
         t: LSHL        R3.x,  T0.x,  (0x00000002, 2.802596929e-45f).x      
     60  x: LSHL        R4.x,  T0.w,  (0x00000002, 2.802596929e-45f).x      
         y: ADD_INT     R1.y,  PV59.z,  (0x00000004, 5.605193857e-45f).y      
         z: ADD_INT     R1.z,  PV59.w,  (0x00000004, 5.605193857e-45f).y      
         w: ADD_INT     R0.w,  T0.y,  (0x00000004, 5.605193857e-45f).y      
         t: LSHL        R5.x,  T1.z,  (0x00000002, 2.802596929e-45f).x      
     61  x: LSHL        R6.x,  T0.y,  (0x00000002, 2.802596929e-45f).x      
         y: ADD_INT     R2.y,  PV60.z,  (0x00000004, 5.605193857e-45f).y      
         z: ADD_INT     R2.z,  PV60.w,  (0x00000004, 5.605193857e-45f).y      
         w: ADD_INT     R1.w,  R0.y,  (0x00000004, 5.605193857e-45f).y      VEC_120 
         t: LSHL        R7.x,  T1.w,  (0x00000002, 2.802596929e-45f).x      
12 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R23, ELEM_SIZE(3) 
13 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R22, ELEM_SIZE(3) 
14 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R21, ELEM_SIZE(3) 
15 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R20, ELEM_SIZE(3) 
16 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R19, ELEM_SIZE(3) 
17 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R5.x], R18, ELEM_SIZE(3) 
18 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R17, ELEM_SIZE(3) 
19 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R7.x], R16, ELEM_SIZE(3) 
20 ALU: ADDR(269) CNT(12) 
     62  x: LSHL        R7.x,  R0.z,  (0x00000002, 2.802596929e-45f).x      
         t: LSHL        R6.x,  R0.y,  (0x00000002, 2.802596929e-45f).x      
     63  x: LSHL        R5.x,  R0.w,  (0x00000002, 2.802596929e-45f).x      
         t: LSHL        R4.x,  R1.z,  (0x00000002, 2.802596929e-45f).x      
     64  x: LSHL        R3.x,  R1.y,  (0x00000002, 2.802596929e-45f).x      
         t: LSHL        R2.x,  R1.w,  (0x00000002, 2.802596929e-45f).x      
     65  x: LSHL        R1.x,  R2.z,  (0x00000002, 2.802596929e-45f).x      
         t: LSHL        R0.x,  R2.y,  (0x00000002, 2.802596929e-45f).x      
21 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R7.x], R15, ELEM_SIZE(3) 
22 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R13, ELEM_SIZE(3) 
23 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R5.x], R14, ELEM_SIZE(3) 
24 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R12, ELEM_SIZE(3) 
25 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R11, ELEM_SIZE(3) 
26 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R9, ELEM_SIZE(3) 
27 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R10, ELEM_SIZE(3) 
28 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R8, ELEM_SIZE(3) 
END_OF_PROGRAM
The result? I measure 880 Gflop/s for 4096x4096 dense matrix-matrix products. That makes a pair of HD4870x2 boards faster than nine GTX280s ^^
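As a reading aid for the listing above, the same blocking scheme can be sketched in plain Python. This is an illustration of the method (8x8 output blocks accumulated from outer products), not the shader itself: the list-of-lists layout and the name `blocked_matmul` are assumptions, and the real kernel additionally unrolls two outer products per loop iteration.

```python
def blocked_matmul(a, b, block=8):
    """C = A * B with block x block output tiles built from outer products.

    a is n x k and b is k x m (lists of lists of floats);
    n and m must be multiples of block.
    """
    n, k, m = len(a), len(b), len(b[0])
    assert n % block == 0 and m % block == 0
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # one "thread" per output block
        for j0 in range(0, m, block):
            # block * block accumulators stay in registers for the whole loop
            acc = [[0.0] * block for _ in range(block)]
            for p in range(k):
                col = [a[i0 + i][p] for i in range(block)]   # 8 values of A
                row = b[p][j0:j0 + block]                    # 8 values of B
                for i in range(block):     # rank-1 update: 64 multiply-adds
                    for j in range(block):
                        acc[i][j] += col[i] * row[j]
            for i in range(block):         # write the finished block once
                c[i0 + i][j0:j0 + block] = acc[i]
    return c
```

Each step of the inner loop reads 16 values and performs 64 multiply-adds, which is where the bandwidth reduction of 8 comes from.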

EDIT: 1000 Gflop/s later in this thread

References:
[1] V. Volkov, J. W. Demmel: Benchmarking GPUs to tune dense linear algebra. Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008
http://mc.stanford.edu/cgi-bin/image...Volkov_GPU.pdf

[2] `What we see on our optimized MM kernel is ~540 gflops in IL.'
Micah Villmow, AMD. Answering vvolkov on the ATi Stream section of the AMD Developer Forums
http://forums.amd.com/forum/messagev...hreadid=105221

[3] `The simple_matmult example that we have is pretty much optimal for our hardware'
Micah Villmow, AMD. Answering sgratton on the ATi Stream section of the AMD Developer Forums
http://forums.amd.com/forum/messagev...hreadid=102771
Old 18-Aug-2009, 17:54   #2
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,237
Thumbs up

Quote:
Originally Posted by prunedtree View Post
The dense matrix-matrix product, a basic building block of dense linear algebra, is rather straightforward in terms of multiply-add operations and very friendly to memory hierarchies, which makes it an interesting benchmark for throughput-oriented hardware: the performance of matrix-matrix multiply is often a much more relevant measure than the sum of all ALU operations that can be completed in a second.

Contrast for instance the peak performance claimed by both major IHVs:

nVidia GT200 (GTX280) : 933 Gflop/s
ATi RV770 (HD4870) : 1200 Gflop/s
I have always believed that the mul on nv and the t unit on amd's gpu's should be ignored while calculating the peak throughput. Volkov's paper does the same too.

Err..., you quote vendor's peak claims and calculate your fraction of flops from your own estimate of peak flops???

Quote:
The result ? I measure 880 Gflop/s (92% peak) for 4096x4096 dense matrix-matrix products.
GREAT JOB, nevertheless.

I assume that you are using a 4870. Right?
Old 18-Aug-2009, 18:22   #3
prunedtree
Regular
 
Join Date: Aug 2009
Posts: 27
Default

Quote:
Originally Posted by rpg.314 View Post
I have always believed that the mul on nv and the t unit on amd's gpu's should be ignored while calculating the peak throughput. Volkov's paper does the same too.
Yes, it's more realistic to expect to achieve such rates in ALU-bound situations.

Quote:
Originally Posted by rpg.314 View Post
Err..., you quote vendor's peak claims and calculate your fraction of flops from your own estimate of peak flops???
No, if you look at the code, you will see that the arithmetic peak is actually a bit over 960 Gflop/s (31 cycles for the loop if there's no overhead, around 4.13 multiply-accumulate per cycle in average, or 990 Gflop/s)
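The ALU-side figure can be reproduced from the numbers already in the thread. A small Python sketch, assuming the 1200 Gflop/s vendor peak corresponds to 5 multiply-adds per cycle on every unit (the constant and function names are illustrative only):

```python
VENDOR_PEAK_GFLOPS = 1200.0   # RV770 figure quoted in the first post
ALU_LANES = 5                 # assumed: x, y, z, w, t all issuing a multiply-add
MADS_PER_ITERATION = 128      # two 8x8 outer products per loop


def alu_bound_gflops(loop_cycles: int) -> float:
    """Arithmetic peak of the kernel for a given loop length in ALU cycles."""
    mads_per_cycle = MADS_PER_ITERATION / loop_cycles
    return VENDOR_PEAK_GFLOPS * mads_per_cycle / ALU_LANES


for cycles in (31, 32):
    print(f"{cycles} cycles: {MADS_PER_ITERATION / cycles:.2f} MAD/cycle, "
          f"{alu_bound_gflops(cycles):.0f} Gflop/s")
```

With 31 cycles this gives about 4.13 multiply-adds per cycle and roughly 990 Gflop/s; a 32-cycle loop would give exactly 960 Gflop/s.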

My peak estimate is from the L1 bandwidth, which I assume to be the bottleneck. 100% would be unlikely as there's no cache prefetching.

Quote:
Originally Posted by rpg.314 View Post
GREAT JOB, nevertheless.

I assume that you are using a 4870. Right?
Thanks. I'm using a pair of HD4870x2, but my numbers are for a single device (one RV770 and its gigabyte of GDDR5) which is equivalent to a HD4870 board (in theory PCIe is not a bottleneck for sustained SGEMM computation).

Given that it's essentially lots of calls to the SGEMM kernel, it could be funny to try to achieve 3500 Gflop/s in single precision LU factorization (LINPACK benchmark) using Volkov's approach for multi-GPU computation.
Old 18-Aug-2009, 19:19   #4
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,333
Default

Quote:
Originally Posted by rpg.314 View Post
I have always believed that the mul on nv and the t unit on amd's gpu's should be ignored while calculating the peak throughput. Volkov's paper does the same too.
Why would you ignore the t unit? Can you not see from the example given that it's being utilized in the majority of the slots? In fact, all 5 units are used in most of the shader.
__________________
I speak only for myself.
Old 18-Aug-2009, 19:42   #5
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,949
Default

Ooh, very impressive.

Vasily Volkov and I discussed some of this stuff:

http://forum.beyond3d.com/showthread...19#post1290019

I bashed my head against this for a while, mostly non-LDS, but focussed too much on maintaining cache locality for maximum throughput, and got somewhat confused. Not actually having a GPU to test on also puts a dampener on things.

I like the fact you're ignoring cache locality - that makes me chuckle.

So I'm very impressed, 880Gflops is quite something. Did you write this in IL? Looking at the assembly it appears it's possible to write this in Brook+ (scatter stream not standard stream output). I'm a bit puzzled why the break from the loop is in the middle - need to think about that some more.

Will you be writing a paper or presenting your results formally some time? You need a shiny graph of performance at various matrix sizes, 128x128, 256x256, etc.

I do disagree on the 5th MAD in ATI. Your code is clearly doing 5 MADs per cycle most of the time!

By the way the loop is 32 ALU cycles, 960GFLOPs peak.

Jawed
__________________
Can it play WoW?
Old 18-Aug-2009, 19:58   #6
digitalwanderer
Dangerously Mirthful
 
Join Date: Feb 2002
Location: Winfield, IN USA
Posts: 15,339
Default

Pfft, some newb posting up a troll thread.

Hey Prune!
Old 18-Aug-2009, 20:57   #7
Rys
Tiled
 
Join Date: Oct 2003
Location: Abbots Langley, UK
Posts: 2,745
Default

Quote:
Originally Posted by Jawed View Post
Did you write this in IL? Looking at the assembly it appears it's possible to write this in Brook+ (scatter stream not standard stream output).
Think it was Factor + bindings for Stream, if I remember rightly. Fancy sharing the app code, prunedtree?
__________________
Mr. Popples!
Old 19-Aug-2009, 05:09   #8
prunedtree
Regular
 
Join Date: Aug 2009
Posts: 27
Default

Quote:
Originally Posted by Jawed View Post
I like the fact you're ignoring cache locality - that makes me chuckle.
Cache locality is quite important: I get only 720 Gflop/s with 1024x1024 matrices (and performance crumbles over that size) with a naive scanline ordering, and 840 Gflop/s with tiling. The texture fetch at the start of the shader loads precomputed tiled addresses.
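To illustrate the difference between the two orderings, here is a Python sketch. The 4x4 tile of blocks, the helper names and the locality metric are all assumptions for illustration; the actual tiling used for the precomputed addresses is not described here.

```python
def scanline_order(blocks_x: int, blocks_y: int):
    """Threads walk the output blocks row by row."""
    return [(x, y) for y in range(blocks_y) for x in range(blocks_x)]


def tiled_order(blocks_x: int, blocks_y: int, tile: int = 4):
    """Threads walk tile x tile groups of output blocks, one group at a time."""
    order = []
    for ty in range(0, blocks_y, tile):
        for tx in range(0, blocks_x, tile):
            for y in range(ty, min(ty + tile, blocks_y)):
                for x in range(tx, min(tx + tile, blocks_x)):
                    order.append((x, y))
    return order


def distinct_strips(order, window: int) -> int:
    """Distinct block-columns plus block-rows touched by the first `window` threads.

    Each output block (x, y) reads one strip of each input matrix, so fewer
    distinct strips means more sharing between neighbouring threads.
    """
    xs = {x for x, _ in order[:window]}
    ys = {y for _, y in order[:window]}
    return len(xs) + len(ys)


print(distinct_strips(scanline_order(16, 16), 16))  # 17
print(distinct_strips(tiled_order(16, 16), 16))     # 8
```

In this toy model, 16 consecutive threads touch 17 input strips in scanline order but only 8 with tiling, which is the kind of reuse the texture cache can exploit.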

Quote:
Originally Posted by Jawed View Post
So I'm very impressed, 880Gflops is quite something. Did you write this in IL? Looking at the assembly it appears it's possible to write this in Brook+ (scatter stream not standard stream output). I'm a bit puzzled why the break from the loop is in the middle - need to think about that some more.
Given how difficult it is to produce decent code with this framework, I don't think you could go far with Brook+. The weird things you might notice in the code are just junk to coerce the compiler into sanity.

Quote:
Originally Posted by Jawed View Post
Will you be writing a paper or presenting your results formally some time? You need a shiny graph of performance at various matrix sizes, 128x128, 256x256, etc.
Well, I wouldn't mind helping to write something similar to Volkov's paper, but I do not have much use for dense linear algebra myself. My work involves mostly boring memory-bound kernels, and this was an opportunity to have some fun.

Quote:
Originally Posted by Rys View Post
Think it was Factor + bindings for Stream, if I remember rightly. Fancy sharing the app code, prunedtree?
Yes, I am using bindings for ATi CAL in Factor, but it's tied to some proprietary code and fairly incomplete for now. However, I do plan to release it at some point. The original post contains the high-level method anyway. This ended up being the simplest of all the approaches I tried (sigh)

By the way, it seems I can't extract more than 444 GB/s out of the texture units (even in synthetic tests with extreme locality) so that gives an upper bound of 888 Gflop/s for this kernel.
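The bound follows from the measured bandwidth and the 8x8 block size. A minimal Python sketch, assuming 4-byte floats and 16 fetched values per 64 multiply-adds (the function name is illustrative):

```python
BYTES_PER_FLOAT = 4


def gflops_from_texture_bandwidth(gbytes_per_s: float,
                                  block_m: int = 8,
                                  block_n: int = 8) -> float:
    """Texture-limited Gflop/s for a block_m x block_n outer-product kernel."""
    gvalues_per_s = gbytes_per_s / BYTES_PER_FLOAT
    mads_per_value = block_m * block_n / (block_m + block_n)
    return 2.0 * gvalues_per_s * mads_per_value   # 2 flops per multiply-add


bound = gflops_from_texture_bandwidth(444.0)
print(f"bound {bound:.0f} Gflop/s, measured 880 is {880.0 / bound:.1%} of it")
```

The same function returns 960 Gflop/s for the nominal 480 GB/s, so the measured 880 Gflop/s is within about 1% of what the texture units appear able to sustain.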
Old 19-Aug-2009, 06:17   #9
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,237
Default

Of course the t unit is being used here, and is one of the reasons why this code is able to reach such high speeds. And yet, the t unit and the mul are accidents. You would almost never be able to exploit them. This code is perhaps an exception.

However, I find that calculating the fraction of the peak from other bottlenecks (assumed or otherwise), is not the best way. Calculating the peak from actual max throughput is better as it can expose other deficiencies/bottlenecks.
Old 19-Aug-2009, 07:54   #10
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,333
Default

Quote:
Originally Posted by rpg.314 View Post
Of course the t unit is being used here, and is one of the reasons why this code is able to reach such high speeds. And yet, the t unit and the mul are accidents. You would almost never be able to exploit them. This code is perhaps an exception.
The t unit is readily used in many shaders, I don't know where you are coming from.
Quote:
However, I find that calculating the fraction of the peak from other bottlenecks (assumed or otherwise), is not the best way. Calculating the peak from actual max throughput is better as it can expose other deficiencies/bottlenecks.
Different shaders have different performance profiles. If the shader is ALU limited, then likely it will be making good use of the t unit.
__________________
I speak only for myself.
Old 19-Aug-2009, 07:59   #11
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,237
Default

Quote:
Originally Posted by OpenGL guy View Post
The t unit is readily used in many shaders, I don't know where you are coming from.
You mean it's used as a fma unit in many shaders, and not as an sfu?

Real-world shaders, heavily optimized by the compiler, typically average 3.5-4 ALU slots per instruction.
Old 19-Aug-2009, 08:39   #12
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,333
Default

Quote:
Originally Posted by rpg.314 View Post
You mean it's used as a fma unit in many shaders, and not as an sfu?
Yes, it can be used as both. See the example posted in this very thread.
Quote:
Real-world shaders, heavily optimized by the compiler, typically average 3.5-4 ALU slots per instruction.
What compiler and what shaders?
__________________
I speak only for myself.
Old 19-Aug-2009, 11:26   #13
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,237
Default

Quote:
Originally Posted by OpenGL guy View Post
Yes, it can be used as both. See the example posted in this very thread.
I know it can be used as both, I am wondering if you are saying that t unit is used as a fma unit in shaders?

It is, 90% of the time, not.

Quote:
What compiler and what shaders?
ATI jit compiler. Bioshock has 3.5, iirc.
Old 19-Aug-2009, 14:11   #14
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,949

Quote:
Originally Posted by rpg.314 View Post
I know it can be used as both; I am wondering whether you are saying that the t unit is used as an FMA unit in shaders?
Why wouldn't it?

Jawed
__________________
Can it play WoW?
Old 19-Aug-2009, 15:15   #15
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,237

Oh dear, I should have made it clear upfront. I mean it is not used explicitly, and implicitly there is a good chance that it will be used only as an SFU. This gemm is an exception. In fact, this is the first exception I have seen in this regard.
Old 19-Aug-2009, 16:31   #16
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,949

Quote:
Originally Posted by rpg.314 View Post
Oh dear, I should have made it clear upfront. I mean it is not used explicitly, and implicitly there is a good chance that it will be used only as an SFU. This gemm is an exception. In fact, this is the first exception I have seen in this regard.
I just loaded up GSA (GPU ShaderAnalyzer), which still had the NVidia Horizon Based AO shader loaded from the last time I used it, and there's a whole pile of Ts doing MADs, MULs, ADDs and CNDEs as well as various transcendentals.

Jawed
Old 19-Aug-2009, 16:43   #17
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,237

Can you calculate the avg slot occupancy in that shader?
Old 19-Aug-2009, 17:24   #18
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,949

Quote:
Originally Posted by prunedtree View Post
Cache locality is quite important: I get only 720 Gflop/s with 1024x1024 matrices (and performance crumbles over that size) with a naive scanline ordering, and 840 Gflop/s with tiling. The texture fetch at the start of the shader loads precomputed tiled addresses.
OK, so what you're saying is cache locality is important in allowing the 8 instructions in the TEX clause to run at full speed (or close). 4:1 ALU:TEX, in this case, doesn't provide the leeway to enable "sloppy" access patterns.

I guess the scanline access pattern ends up filling L2 with data that gets junked before it is reused, which increases the number of fetches into L2 needed to fulfil the 8 TEX instructions.
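
For anyone following along, a minimal sketch of the two orderings, in Python rather than shader code. The 8x8 tile, the 128x128 grid of output blocks and the batch of 64 consecutive threads are all made-up numbers, and the function names are hypothetical; none of this is prunedtree's actual address table. Output block (x, y) needs row y of A and column x of B, so counting distinct rows plus columns per batch gives a crude idea of how much data each ordering drags through the cache.

```python
def scanline_order(width, height):
    """Visit output blocks row by row (the naive ordering)."""
    return [(x, y) for y in range(height) for x in range(width)]


def tiled_order(width, height, tile=8):
    """Visit output blocks tile by tile; `tile` is a hypothetical size."""
    order = []
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    order.append((x, y))
    return order


def footprint(order, batch=64):
    """Average count of distinct A rows plus B columns needed by each
    batch of consecutive output blocks (lower means more reuse)."""
    totals = []
    for i in range(0, len(order), batch):
        chunk = order[i:i + batch]
        rows = {y for _, y in chunk}   # rows of A this batch reads
        cols = {x for x, _ in chunk}   # columns of B this batch reads
        totals.append(len(rows) + len(cols))
    return sum(totals) / len(totals)


scan = scanline_order(128, 128)
tiled = tiled_order(128, 128)
print(footprint(scan), footprint(tiled))
```

With these made-up sizes a scanline batch needs 65 rows and columns while a tiled batch needs 16, which is the sort of gap that could explain the naive ordering churning through the cache.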

Quote:
Given the difficulty of producing decent code with this framework, I don't think you could go far with Brook+. The weird things you might notice in the code are just junk to coerce the compiler into sanity.
Guess I'll leave fiddling with it until an idle moment. I can't test performance anyway, but I want to think about your tiling and striding.

Quote:
Well, I wouldn't mind helping to write something similar to Volkov's paper, but I do not have much use for dense linear algebra myself. My work involves mostly boring memory-bound kernels, and this was an opportunity to have some fun.
Onto the double-precision version!

Quote:
By the way, it seems I can't extract more than 444 GB/s out of the texture units (even in synthetic tests with extreme locality) so that gives an upper bound of 888 Gflop/s for this kernel.
Hmm, perhaps full cache bandwidth only comes with the same values being fetched multiple times.
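
The arithmetic behind that bound, as a quick sketch. The 2 flops per fetched byte is an assumption: it is simply the ratio implied by the quoted 444 GB/s and 888 Gflop/s, not something I have measured. The 840 and 720 Gflop/s figures are the tiled and scanline results quoted earlier in this post, and `gflops_bound` is a hypothetical helper name.

```python
FLOAT4_BYTES = 16  # one float4 fetch = four 32-bit floats

# Assumption: the kernel does 2 flops per byte fetched from the texture units.
FLOPS_PER_BYTE = 2.0


def gflops_bound(bandwidth_gb_s, flops_per_byte=FLOPS_PER_BYTE):
    """Upper bound on Gflop/s for a kernel limited by fetch bandwidth."""
    return bandwidth_gb_s * flops_per_byte


measured_bw = 444.0                       # GB/s, as quoted
bound = gflops_bound(measured_bw)         # Gflop/s
fetch_rate = measured_bw / FLOAT4_BYTES   # billions of float4 per second

print(f"bound: {bound:.0f} Gflop/s, {fetch_rate:.2f} G float4/s")
for label, gflops in (("tiled", 840.0), ("scanline", 720.0)):
    print(f"{label}: {gflops / bound:.1%} of the fetch-limited bound")
```

Seen this way, the tiled kernel is already close to what the texture units can feed it, so further gains would have to come from fetch bandwidth rather than from the ALUs.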

Jawed
Old 19-Aug-2009, 17:37   #19
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,949

Quote:
Originally Posted by rpg.314 View Post
Can you calculate the avg slot occupancy in that shader?
Code:
 
x   67
y   59
z   56
w   61
t   51
total ALU instructions 101
utilisation = 58%
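
For reference, a small sketch of how a figure like this can be worked out. It reads "total ALU instructions" as the number of 5-slot bundles, which is the reading that reproduces 58%. `count_slots` and `utilisation` are hypothetical helpers, and the regular expression is only a guess at the layout of the ALU disassembly text (one `x:`/`y:`/`z:`/`w:`/`t:` line per occupied slot, with a leading group number on the first line of each bundle), not an official parser.

```python
import re

SLOTS = "xyzwt"          # four vector slots plus the t unit
SLOT_RE = re.compile(r"^\s*(\d+\s+)?([xyzwt]):\s+\S")


def count_slots(disassembly):
    """Count occupied slots per unit and the number of bundles."""
    counts = {s: 0 for s in SLOTS}
    bundles = 0
    for line in disassembly.splitlines():
        m = SLOT_RE.match(line)
        if not m:
            continue              # clause headers, TEX lines, blanks
        if m.group(1):            # a leading group number starts a bundle
            bundles += 1
        counts[m.group(2)] += 1
    return counts, bundles


def utilisation(counts, bundles):
    """Fraction of the 5 slots per bundle that are actually filled."""
    return sum(counts.values()) / (bundles * len(SLOTS))


# The per-slot counts posted above.
posted = {"x": 67, "y": 59, "z": 56, "w": 61, "t": 51}
print(sum(posted.values()), f"{utilisation(posted, 101):.0%}")

# A made-up four-line fragment in the same style, to exercise the parser.
sample = """
01 ALU: ADDR(64) CNT(4)
      1  x: MULADD      T1.x,  C34.x,  R6.x, -1.0f
         y: MULADD      T0.y,  C34.x,  R6.y, -1.0f
         t: MULADD      T2.y,  C34.x,  R6.y, -1.0f
      2  x: MUL         ____,  PV1.z,  PV1.z
"""
sample_counts, sample_bundles = count_slots(sample)
print(sample_counts, sample_bundles)
```

The posted counts add up to 294 filled slots out of 505, or about 2.9 slots per bundle on average.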
Jawed
Old 19-Aug-2009, 19:08   #20
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,333

Quote:
Originally Posted by rpg.314 View Post
Oh dear, I should have made it clear upfront. I mean it is not used explicitly, and implicitly there is a good chance that it will be used only as an SFU. This gemm is an exception. In fact, this is the first exception I have seen in this regard.
It really isn't that uncommon to see the t unit used all over. Here's an example I am looking at currently:
Code:
; --------  Disassembly --------------------
00 TEX: ADDR(656) CNT(1) VALID_PIX 
      0  SAMPLE R6, R1.xyxx, t3, s3
01 ALU_PUSH_BEFORE: ADDR(64) CNT(101) 
      1  x: MULADD      T1.x,  C34.x,  R6.x, -1.0f      
         y: MULADD      T0.y,  C34.x,  R6.y, -1.0f      
         z: MULADD      T0.z,  C34.x,  R6.z, -1.0f      
         w: MULADD      T1.w,  C34.x,  R6.w, -1.0f      
         t: MULADD      T2.y,  C34.x,  R6.y, -1.0f      VEC_021 
      2  x: MUL         ____,  PV1.z,  PV1.z      
         y: MUL         ____,  R5.z,  R5.z      
         z: MULADD      T1.z,  C34.x,  R6.z, -1.0f      
         w: MUL         T0.w,  R2.z,  R2.z      VEC_201 
         t: ADD         R11.z, -C23.x,  C24.x      
      3  x: DOT4        ____,  T1.x,  T1.x      
         y: DOT4        ____,  T0.y,  T0.y      
         z: DOT4        ____,  PV2.x,  1.0f      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         t: MULADD      T0.x,  R5.y,  R5.y,  PV2.y      
      4  x: DOT4        ____,  T1.w,  T1.w      
         y: DOT4        ____,  T2.y,  T2.y      
         z: DOT4        ____,  T1.z,  T1.z      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         t: RSQ_sat     T1.y,  PV3.x      
      5  x: MUL         T0.x,  T1.x,  PS4      
         y: MULADD      ____,  R5.x,  R5.x,  T0.x      VEC_102 
         z: MULADD      ____,  R2.y,  R2.y,  T0.w      
         w: MUL         T0.w,  T0.z,  PS4      
         t: RSQ_sat     T2.x,  PV4.x      
      6  x: MUL         T1.x,  T1.z,  PS5      
         y: MULADD      ____,  R2.x,  R2.x,  PV5.z      
         z: MUL         ____,  T1.w,  PS5      
         w: MUL         T2.w,  T0.y,  T1.y      
         t: RSQ_sat     ____,  PV5.y      
      7  x: CNDGE       T0.x, -C22.x,  T0.x,  PV6.z      
         y: MUL         T0.y,  R5.z,  PS6      
         z: MUL         T1.z,  R5.x,  PS6      
         w: MUL         T1.w,  R5.y,  PS6      
         t: RSQ_sat     T3.x,  PV6.y      
      8  x: DOT4        ____,  R4.x,  R4.x      
         y: DOT4        T1.y,  R4.y,  R4.y      
         z: DOT4        ____,  R4.z,  R4.z      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         t: CNDGE       T0.w, -C22.x,  T0.w,  T1.x      VEC_021 
      9  x: MUL         T0.x,  T0.x,  T1.z      
         y: MUL         T0.y,  T0.x,  T0.y      
         z: MUL         ____,  T0.x,  T1.w      
         w: MUL         ____,  T2.y,  T2.x      VEC_021 
         t: MUL         ____,  R2.y,  T3.x      
     10  x: CNDGE       T2.x, -C22.x,  T2.w,  PV9.w      
         y: MUL         T1.y,  R2.z,  T3.x      
         z: MUL         ____,  R2.x,  T3.x      
         w: MULADD      T2.w,  PS9,  T0.w,  PV9.z      VEC_021 
         t: RSQ_sat     T3.x,  T1.y      
     11  x: DOT4        ____,  R3.x,  R3.x      VEC_120 
         y: DOT4        ____,  R3.y,  R3.y      
         z: DOT4        ____,  R3.z,  R3.z      
         w: DOT4        R11.w,  (0x80000000, 0.0f).x,  0.0f      
         t: MULADD      T0.x,  PV10.z,  T0.w,  T0.x      
     12  x: MUL         ____,  R4.y,  T3.x      
         y: MUL         ____,  R4.z,  T3.x      
         z: MUL         ____,  R4.x,  T3.x      
         w: MULADD      ____,  T1.y,  T0.w,  T0.y      VEC_102 
         t: RSQ_e       T1.z,  |PV11.x|      
     13  x: MULADD      T2.x,  T2.x,  PV12.z,  T0.x      
         y: MULADD      T0.y,  T2.x,  PV12.x,  T2.w      
         z: MULADD      T0.z,  T2.x,  PV12.y,  PV12.w      
         w: MULADD      T2.w,  R3.x,  PS12, -C29.x      VEC_120 
         t: MULADD      T2.y,  R3.y,  PS12, -C29.y      
     14  x: MUL         ____,  PV13.z,  PV13.z      
         z: MULADD      T1.z,  R3.z,  T1.z, -C29.z      
         t: RCP_e       ____,  T1.z      
     15  x: DOT4        T0.x,  T2.x,  T2.x      
         y: DOT4        ____,  T0.y,  T0.y      
         z: DOT4        ____,  PV14.x,  1.0f      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         t: ADD         ____,  PS14, -C28.x      
     16  x: DOT4        ____,  T2.w,  T2.w      
         y: DOT4        T1.y,  T2.y,  T2.y      
         z: DOT4        ____,  T1.z,  T1.z      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         t: MUL         R1.w,  PS15,  C28.y      CLAMP 
     17  x: ADD         ____,  PS16, -1.0f      
         t: RSQ_sat     ____,  T0.x      
     18  x: MUL         R12.x,  T2.x,  PS17      
         y: MUL         R11.y,  T0.y,  PS17      
         z: MUL         R12.z,  T0.z,  PS17      
         w: CNDGE       R2.w,  PV17.x,  0.0f,  1.0f      
         t: RSQ_sat     ____,  T1.y      
     19  x: MUL         ____,  T2.w,  PS18      
         y: MUL         ____,  T2.y,  PS18      
         z: MUL         ____,  T1.z,  PS18      
     20  x: DOT4        ____,  R12.x,  PV19.x      
         y: DOT4        ____,  R11.y,  PV19.y      
         z: DOT4        ____,  R12.z,  PV19.z      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
     21  w: MAX         R12.w,  PV20.x,  C38.z      
     22  x: PREDNE      ____,  R2.w, -R2.w      UPDATE_EXEC_MASK UPDATE_PRED 
02 JUMP  POP_CNT(1) ADDR(36) VALID_PIX 
03 ALU: ADDR(165) CNT(121) 
     23  x: ADD         R5.x, -R3.x,  C21.x      
         y: ADD         R3.y, -R3.y,  C21.y      
         z: ADD         R1.z, -R3.z,  C21.z      
         w: MOV         R6.w,  1.0f      
         t: MOV         R2.z,  C35.z      
     24  x: DOT4        ____,  PV23.x,  C18.x      
         y: DOT4        ____,  PV23.y,  C18.y      
         z: DOT4        T0.z,  PV23.z,  C18.z      
         w: DOT4        ____,  PV23.w,  C18.w      
         t: MOV         R3.x,  0.0f      
     25  x: MUL         R2.x,  C32.z, -1.0f      
         y: MUL         R2.y,  C32.w, -1.0f      
         z: ADD         ____,  PV24.x,  R2.z      
         w: ADD         ____,  PV24.x, -C32.y      
         t: ADD         T0.w,  PV24.x, -C32.z      
     26  x: ADD         ____,  T0.z, -C32.w      
         y: ADD         ____,  T0.z,  PV25.y      
         z: CNDGE       T1.z,  PV25.z,  0.0f,  1.0f      
         w: ADD         ____,  T0.z,  PV25.x      
         t: CNDGE       T0.y,  PV25.w,  1.0f,  0.0f      
     27  x: CNDGE       ____,  PV26.w,  0.0f,  1.0f      
         y: CNDGE       ____,  T0.w,  1.0f,  0.0f      
         z: CNDGE       ____,  PV26.y,  0.0f,  1.0f      
         w: CNDGE       ____,  PV26.x,  1.0f,  0.0f      
         t: ADD         ____,  T0.z,  C36.x      
     28  x: MUL         ____,  PV27.x,  T0.y      
         y: MUL         ____,  PV27.z,  PV27.y      
         z: MUL         ____,  T1.z,  PV27.w      
         t: MUL         R6.x,  PS27,  C36.y      CLAMP 
     29  x: DOT4        R8.x,  PV28.x,  1.0f      
         y: DOT4        ____,  PV28.y,  C33.y      
         z: DOT4        ____,  PV28.z,  C33.z      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         t: ADD         T0.y,  PS28, -1.0f      
     30  x: ADD         ____,  PV29.x, -1.0f      
         y: ADD         R2.y,  PV29.x, -1.0f      
         z: ADD         ____,  PV29.x, -C33.w      
         w: ADD         ____,  PV29.x, -C33.y      
         t: ADD         R2.z,  PV29.x, -C33.y      
     31  x: CNDGE       R4.x,  PV30.w,  PV30.w, -PS30      
         y: CNDGE       R5.y,  PV30.x,  PV30.x, -PV30.y      
         z: CNDGE       T3.z,  PV30.z,  PV30.z, -R8.x      
         w: ADD         R2.w,  R8.x, -C33.z      
         t: ADD         ____,  R8.x, -C33.z      
     32  x: CNDGE       ____, -PV31.z,  C3.z,  0.0f      
         y: CNDGE       ____, -PV31.z,  C3.y,  0.0f      
         z: CNDGE       ____, -PV31.z,  C3.x,  0.0f      
         w: CNDGE       ____, -PV31.z,  C3.w,  0.0f      
         t: CNDGE       R5.w,  PS31,  PS31, -PV31.w      
     33  x: CNDGE       T0.x, -R5.y,  C7.z,  PV32.x      
         y: CNDGE       T1.y, -R5.y,  C7.y,  PV32.y      
         z: CNDGE       T1.z, -R5.y,  C7.x,  PV32.z      
         w: CNDGE       T0.w, -R5.y,  C7.w,  PV32.w      
         t: MUL         R2.z,  R8.x,  (0x3E800000, 0.25f).x      
     34  x: CNDGE       T1.x, -T3.z,  C0.z,  0.0f      
         y: CNDGE       T0.y, -T3.z,  C0.y,  0.0f      
         z: CNDGE       T0.z, -T3.z,  C0.x,  0.0f      
         w: CNDGE       T1.w, -T3.z,  C0.w,  0.0f      
         t: CNDGE       R3.z,  T0.y,  0.0f,  1.0f      
     35  x: CNDGE       T2.x, -T3.z,  C1.x,  0.0f      
         y: CNDGE       T2.y, -T3.z,  C1.w,  0.0f      
         z: CNDGE       T2.z, -T3.z,  C1.y,  0.0f      
         w: CNDGE       T2.w, -T3.z,  C1.z,  0.0f      
     36  x: CNDGE       T0.x, -R4.x,  C11.z,  T0.x      
         y: CNDGE       T1.y, -R4.x,  C11.y,  T1.y      
         z: CNDGE       T1.z, -R4.x,  C11.x,  T1.z      
         w: CNDGE       T0.w, -R4.x,  C11.w,  T0.w      
     37  x: CNDGE       T1.x, -R5.y,  C4.z,  T1.x      
         y: CNDGE       T0.y, -R5.y,  C4.y,  T0.y      
         z: CNDGE       T0.z, -R5.y,  C4.x,  T0.z      
         w: CNDGE       T1.w, -R5.y,  C4.w,  T1.w      
     38  x: CNDGE       T2.x, -R5.y,  C5.x,  T2.x      
         y: CNDGE       T2.y, -R5.y,  C5.w,  T2.y      
         z: CNDGE       T2.z, -R5.y,  C5.y,  T2.z      
         w: CNDGE       T2.w, -R5.y,  C5.z,  T2.w      
     39  x: CNDGE       T0.x, -R5.w,  C15.x,  T1.z      
         y: CNDGE       T1.y, -R5.w,  C15.y,  T1.y      
         z: CNDGE       T1.z, -R5.w,  C15.z,  T0.x      
         w: CNDGE       ____, -R5.w,  C15.w,  T0.w      
     40  x: CNDGE       T1.x, -R4.x,  C8.z,  T1.x      
         y: CNDGE       T0.y, -R4.x,  C8.y,  T0.y      
         z: CNDGE       T0.z, -R4.x,  C8.x,  T0.z      
         w: CNDGE       T1.w, -R4.x,  C8.w,  T1.w      VEC_021 
         t: MUL         ____,  R6.w,  PV39.w      
     41  x: CNDGE       T2.x, -R4.x,  C9.x,  T2.x      
         y: CNDGE       T2.y, -R4.x,  C9.w,  T2.y      
         z: CNDGE       T1.z, -R4.x,  C9.y,  T2.z      VEC_120 
         w: CNDGE       T2.w, -R4.x,  C9.z,  T2.w      
         t: MULADD      ____,  R1.z,  T1.z,  PS40      
     42  x: DOT4        ____,  R5.x,  T0.x      
         y: DOT4        ____,  R3.y,  T1.y      
         z: DOT4        ____,  PS41,  1.0f      
         w: DOT4        ____,  0.0f,  0.0f      
         t: CNDGE       T0.x, -R5.w,  C12.x,  T0.z      VEC_021 
     43  x: CNDGE       T1.x, -R5.w,  C13.x,  T2.x      
         y: CNDGE       T0.y, -R5.w,  C12.y,  T0.y      
         z: CNDGE       T0.z, -R5.w,  C12.z,  T1.x      VEC_021 
         w: CNDGE       ____, -R5.w,  C12.w,  T1.w      
         t: RCP_e       R13.w,  PV42.x      
     44  x: CNDGE       R2.x, -T3.z,  C2.x,  0.0f      
         y: CNDGE       T2.y, -R5.w,  C13.y,  T1.z      
         z: CNDGE       T1.z, -R5.w,  C13.z,  T2.w      VEC_021 
         w: CNDGE       T2.w, -R5.w,  C13.w,  T2.y      
         t: MUL         ____,  R6.w,  PV43.w      
     45  x: DOT4        ____,  R5.x,  T0.x      
         y: DOT4        R2.y,  R3.y,  T0.y      
         z: DOT4        ____,  R1.z,  T0.z      
         w: DOT4        ____,  PS44,  1.0f      
         t: CNDGE       R4.z, -T3.z,  C2.y,  0.0f      
     46  x: DOT4        ____,  R5.x,  T1.x      
         y: DOT4        ____,  R3.y,  T2.y      
         z: DOT4        ____,  R1.z,  T1.z      
         w: DOT4        ____,  R6.w,  T2.w      
         t: MUL         R9.y,  R13.w,  PV45.x      
     47  x: MOV         R7.x,  PS46      
         y: CNDGE       R4.y, -T3.z,  C2.z,  0.0f      
         z: MUL         R5.z,  R13.w,  PV46.x      
         w: ADD         R8.w,  PS46,  C31.z      
         t: CNDGE       R6.y, -T3.z,  C2.w,  0.0f      
04 ALU: ADDR(286) CNT(11) 
     48  x: MULADD      R9.x,  R2.y,  R13.w,  C31.z      
         y: CNDGE       ____, -R5.y,  C6.x,  R2.x      VEC_120 
         z: CNDGE       ____, -R5.y,  C6.y,  R4.z      VEC_120 
         w: MULADD      R9.w,  R5.z,  (0x3E800000, 0.25f).x,  R2.z      VEC_102 
         t: CNDGE       R2.w, -R5.y,  C6.z,  R4.y      VEC_021 
     49  x: CNDGE       R2.x, -R5.y,  C6.w,  R6.y      
         y: ADD         R7.y,  PV48.w,  C31.w      
         z: CNDGE       R2.z, -R4.x,  C10.x,  PV48.y      
         w: CNDGE       R4.w, -R4.x,  C10.y,  PV48.z      
         t: ADD         R8.y,  PV48.w,  C31.w      
05 TEX: ADDR(658) CNT(6) VALID_PIX 
     50  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
     51  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
     52  SAMPLE_G R4.__x_, R7.xyxx, t0, s0
     53  SAMPLE_G R8.___x, R8.wyww, t0, s0
     54  SAMPLE_G R8._x__, R9.xwxx, t0, s0
     55  SAMPLE_G R7.x___, R9.ywyy, t0, s0
06 ALU_PUSH_BEFORE: ADDR(297) CNT(29) 
     56  x: CNDGE       T1.x, -R5.w,  C14.x,  R2.z      
         y: CNDGE       ____, -R4.x,  C10.w,  R2.x      
         z: CNDGE       T3.z, -R5.w,  C14.y,  R4.w      
         w: CNDGE       ____, -R4.x,  C10.z,  R2.w      VEC_021 
     57  x: MUL         ____,  R9.y,  C31.x      
         y: MUL         T2.y,  R9.w,  C31.y      
         z: CNDGE       ____, -R5.w,  C14.z,  PV56.w      VEC_120 
         w: CNDGE       ____, -R5.w,  C14.w,  PV56.y      VEC_120 
     58  x: DOT4        ____,  R5.x,  T1.x      
         y: DOT4        ____,  R3.y,  T3.z      
         z: DOT4        R1.z,  R1.z,  PV57.z      
         w: DOT4        ____,  R6.w,  PV57.w      
         t: FRACT       T0.y,  PV57.x      
     59  x: MULADD      ____,  PV58.x,  R13.w, -R7.x      
         y: MULADD      ____,  PV58.x,  R13.w, -R8.w      
         z: MULADD      ____,  PV58.x,  R13.w, -R4.z      
         w: MULADD      ____,  PV58.x,  R13.w, -R8.y      VEC_201 
         t: FRACT       T2.w,  T2.y      
     60  x: CNDGE       T1.x,  PV59.x,  0.0f,  1.0f      
         y: CNDGE       ____,  PV59.y,  0.0f,  1.0f      
         z: CNDGE       T3.z,  PV59.z,  0.0f,  1.0f      
         w: CNDGE       ____,  PV59.w,  0.0f,  1.0f      
     61  y: ADD         ____, -PV60.z,  PV60.y      
         z: ADD         ____, -PV60.x,  PV60.w      
     62  x: MULADD      T1.x,  PV61.z,  T0.y,  T1.x      
         w: MULADD      ____,  PV61.y,  T0.y,  T3.z      
     63  x: ADD         ____, -PV62.x,  PV62.w      
     64  z: MULADD      R8.z,  PV63.x,  T2.w,  T1.x      
     65  x: PREDNE      ____,  R3.z, -R3.z      UPDATE_EXEC_MASK UPDATE_PRED 
07 JUMP  POP_CNT(1) ADDR(35) VALID_PIX 
08 ALU: ADDR(326) CNT(20) 
     66  x: MULADD      T0.x, -R6.x,  C31.z,  C31.z      
         y: MULADD      T0.y, -R6.x,  C31.w,  C31.w      
         z: ADD         ____,  R8.x, -1.0f      VEC_120 
         w: ADD         ____,  R8.x,  0.0f      VEC_120 
     67  y: MOV         ____, -|PV66.w|      
         z: MOV         T0.z, -|PV66.z|      
     68  x: CNDGE       ____,  PV67.y,  C19.x,  0.0f      
         w: CNDGE       ____,  PV67.y,  C19.y,  0.0f      
     69  x: CNDGE       ____,  T0.z,  C20.y,  PV68.w      
         y: CNDGE       ____,  T0.z,  C20.x,  PV68.x      
     70  y: MUL         R3.y,  PV69.y,  T0.x      
         z: MUL         R3.z,  PV69.x,  T0.y      
     71  y: MULADD      R4.y,  PV70.y, -C33.z,  R9.y      
         z: MULADD      R4.z,  PV70.z, -C33.z,  R9.w      
         t: MULADD      R6.y,  PV70.y,  C36.z,  R9.y      VEC_021 
     72  x: ADD         R2.x,  PV71.y,  C31.z      
         y: ADD         R2.y,  PV71.z,  C31.w      
         z: MUL         R5.z,  PV71.y,  C31.x      
         w: ADD         R4.w,  PV71.z,  C31.w      
         t: ADD         R4.x,  PV71.y,  C31.z      
09 TEX: ADDR(670) CNT(6) VALID_PIX 
     73  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
     74  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
     75  SAMPLE_G R2.___x, R2.xyxx, t0, s0
     76  SAMPLE_G R2.x___, R4.yzyy, t0, s0
     77  SAMPLE_G R2._x__, R4.xzxx, t0, s0
     78  SAMPLE_G R2.__x_, R4.ywyy, t0, s0
10 ALU: ADDR(346) CNT(25) 
     79  x: MUL         ____,  R4.z,  C31.y      VEC_120 
         y: MULADD      ____,  R1.z,  R13.w, -R2.y      VEC_210 
         z: MULADD      ____,  R1.z,  R13.w, -R2.x      VEC_210 
         w: MULADD      ____,  R1.z,  R13.w, -R2.z      VEC_210 
         t: MULADD      T0.x,  R1.z,  R13.w, -R2.w      
     80  x: CNDGE       ____,  PV79.y,  0.0f,  1.0f      
         y: FRACT       R4.y,  PV79.x      
         z: FRACT       T0.z,  R5.z      
         w: CNDGE       T0.w,  PV79.z,  0.0f,  1.0f      
         t: CNDGE       T1.z,  PV79.w,  0.0f,  1.0f      
     81  x: ADD         R2.x,  R6.y,  C31.z      
         y: CNDGE       ____,  T0.x,  0.0f,  1.0f      
         z: MULADD      R6.z,  R3.z,  C36.w,  R9.w      
         w: ADD         ____, -PV80.w,  PV80.x      
         t: ADD         R6.x,  R6.y,  C31.z      
     82  x: ADD         ____, -T1.z,  PV81.y      
         y: ADD         R2.y,  PV81.z,  C31.w      
         z: MULADD      R2.z,  PV81.w,  T0.z,  T0.w      
         w: ADD         R6.w,  PV81.z,  C31.w      
         t: MUL         ____,  R6.y,  C31.x      
     83  x: FRACT       R4.x,  PS82      
         y: MULADD      R7.y,  R3.y,  C36.w,  R9.y      
         z: MUL         R5.z,  R6.z,  C31.y      
         w: MULADD      R4.w,  PV82.x,  T0.z,  T1.z      
         t: MULADD      R8.y,  R3.y,  C33.z,  R9.y      VEC_021 
11 TEX: ADDR(682) CNT(6) VALID_PIX 
     84  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
     85  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
     86  SAMPLE_G R2.___x, R2.xyxx, t0, s0
     87  SAMPLE_G R2.x___, R6.yzyy, t0, s0
     88  SAMPLE_G R2._x__, R6.xzxx, t0, s0
     89  SAMPLE_G R6.__x_, R6.ywyy, t0, s0
12 ALU: ADDR(371) CNT(35) 
     90  x: MULADD      ____,  R1.z,  R13.w, -R2.x      VEC_021 
         y: MULADD      ____,  R1.z,  R13.w, -R6.z      VEC_021 
         z: MULADD      ____,  R1.z,  R13.w, -R2.w      VEC_021 
         w: MULADD      ____,  R1.z,  R13.w, -R2.y      VEC_021 
         t: FRACT       T0.w,  R5.z      
     91  x: CNDGE       T0.x,  PV90.y,  0.0f,  1.0f      
         y: CNDGE       T0.y,  PV90.x,  0.0f,  1.0f      
         z: CNDGE       ____,  PV90.w,  0.0f,  1.0f      
         w: CNDGE       ____,  PV90.z,  0.0f,  1.0f      
         t: ADD         ____, -R2.z,  R4.w      
     92  x: ADD         ____, -PV91.y,  PV91.z      
         y: MUL         T1.y,  R7.y,  C31.x      
         z: MULADD      ____,  PS91,  R4.y,  R2.z      
         w: ADD         ____, -PV91.x,  PV91.w      
         t: MULADD      R7.z,  R3.z,  C36.z,  R9.w      VEC_021 
     93  x: ADD         T0.x,  R8.z,  PV92.z      
         y: MULADD      T0.y,  PV92.x,  R4.x,  T0.y      
         z: MULADD      ____,  PV92.w,  R4.x,  T0.x      
         w: ADD         R4.w,  R7.y,  C31.z      
         t: ADD         R4.y,  PS92,  C31.w      
     94  x: ADD         R7.x,  R7.y,  C31.z      
         y: MUL         ____,  R7.z,  C31.y      
         z: FRACT       R5.z,  T1.y      VEC_120 
         w: ADD         ____, -PV93.y,  PV93.z      
         t: ADD         R7.w,  R7.z,  C31.w      
     95  x: FRACT       R6.x,  PV94.y      
         y: MULADD      ____,  PV94.w,  T0.w,  T0.y      VEC_021 
         z: MULADD      R8.z,  R3.z,  C33.z,  R9.w      VEC_021 
         w: ADD         R2.w,  R8.y,  C31.z      
         t: ADD         R8.x,  R8.y,  C31.z      
     96  x: MUL         R2.x,  R8.y,  C31.x      
         y: ADD         R2.y,  PV95.z,  C31.w      
         z: ADD         R6.z,  T0.x,  PV95.y      
         w: ADD         R8.w,  PV95.z,  C31.w      
         t: MUL         R2.z,  PV95.z,  C31.y      
13 TEX: ADDR(694) CNT(7) VALID_PIX 
     97  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
     98  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
     99  SAMPLE_G R4.___x, R4.wyww, t0, s0
    100  SAMPLE_G R2.___x, R2.wyww, t0, s0
    101  SAMPLE_G R4.x___, R7.yzyy, t0, s0
    102  SAMPLE_G R2._x__, R7.xzxx, t0, s0
    103  SAMPLE_G R7.__x_, R7.ywyy, t0, s0
14 ALU: ADDR(406) CNT(16) 
    104  x: MULADD      ____,  R1.z,  R13.w, -R4.w      VEC_210 
         y: MULADD      ____,  R1.z,  R13.w, -R2.y      VEC_210 
         z: MULADD      ____,  R1.z,  R13.w, -R4.x      VEC_210 
         w: MULADD      ____,  R1.z,  R13.w, -R7.z      VEC_210 
         t: MULADD      T0.z,  R1.z,  R13.w, -R2.w      VEC_120 
    105  x: CNDGE       ____,  PV104.y,  0.0f,  1.0f      
         y: CNDGE       ____,  PV104.x,  0.0f,  1.0f      
         z: CNDGE       T1.z,  PV104.w,  0.0f,  1.0f      
         w: CNDGE       T0.w,  PV104.z,  0.0f,  1.0f      
         t: FRACT       R4.x,  R2.x      
    106  x: ADD         ____, -PV105.z,  PV105.y      
         y: FRACT       R7.y,  R2.z      
         z: CNDGE       R2.z,  T0.z,  0.0f,  1.0f      VEC_120 
         w: ADD         ____, -PV105.w,  PV105.x      
    107  z: MULADD      R5.z,  PV106.w,  R5.z,  T0.w      
         w: MULADD      R2.w,  PV106.x,  R5.z,  T1.z      
15 TEX: ADDR(708) CNT(5) VALID_PIX 
    108  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    109  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    110  SAMPLE_G R2.x___, R8.yzyy, t0, s0
    111  SAMPLE_G R2._x__, R8.xzxx, t0, s0
    112  SAMPLE_G R8.__x_, R8.ywyy, t0, s0
16 ALU_PUSH_BEFORE: ADDR(422) CNT(18) 
    113  x: MULADD      ____,  R1.z,  R13.w, -R2.x      
         y: MULADD      ____,  R1.z,  R13.w, -R8.z      
         z: ADD         ____, -R5.z,  R2.w      VEC_120 
         w: MULADD      ____,  R1.z,  R13.w, -R2.y      
    114  x: CNDGE       ____,  PV113.w,  0.0f,  1.0f      
         y: CNDGE       T0.y,  PV113.x,  0.0f,  1.0f      
         z: MULADD      ____,  PV113.z,  R6.x,  R5.z      
         w: CNDGE       T0.w,  PV113.y,  0.0f,  1.0f      
    115  x: ADD         T0.x,  R6.z,  PV114.z      
         y: ADD         ____, -PV114.y,  PV114.x      
         w: ADD         ____, -PV114.w,  R2.z      
    116  y: MULADD      T0.y,  PV115.y,  R4.x,  T0.y      
         z: MULADD      ____,  PV115.w,  R4.x,  T0.w      
    117  w: ADD         ____, -PV116.y,  PV116.z      
    118  y: MULADD      ____,  PV117.w,  R7.y,  T0.y      
    119  z: ADD         R8.z,  T0.x,  PV118.y      
    120  x: CNDGE       R2.x, -PV119.z,  0.0f,  1.0f      
    121  x: PREDNE      ____,  R2.x, -R2.x      UPDATE_EXEC_MASK UPDATE_PRED 
17 ALU_PUSH_BEFORE: ADDR(440) CNT(3) 
    122  y: ADD         ____,  R8.z,  C37.x      
    123  x: CNDGE       R2.x,  PV122.y,  1.0f,  0.0f      
    124  x: PREDNE      ____,  R2.x, -R2.x      UPDATE_EXEC_MASK UPDATE_PRED 
18 JUMP  ADDR(20) VALID_PIX 
19 ALU: ADDR(443) CNT(1) 
    125  z: MOV         R8.z,  1.0f      
20 ELSE POP_CNT(1) ADDR(34) VALID_PIX 
21 ALU: ADDR(444) CNT(8) 
    126  y: MULADD      R4.y,  R3.y,  C38.x,  R9.y      
         z: MULADD      R4.z,  R3.z,  C38.y,  R9.w      
         t: MULADD      R5.y,  R3.y,  C37.y,  R9.y      VEC_021 
    127  x: ADD         R2.x,  PV126.y,  C31.z      
         y: ADD         R2.y,  PV126.z,  C31.w      
         z: MUL         R6.z,  PV126.y,  C31.x      
         w: ADD         R4.w,  PV126.z,  C31.w      
         t: ADD         R4.x,  PV126.y,  C31.z      
22 TEX: ADDR(718) CNT(6) VALID_PIX 
    128  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    129  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    130  SAMPLE_G R2.___x, R2.xyxx, t0, s0
    131  SAMPLE_G R2.x___, R4.yzyy, t0, s0
    132  SAMPLE_G R2._x__, R4.xzxx, t0, s0
    133  SAMPLE_G R2.__x_, R4.ywyy, t0, s0
23 ALU: ADDR(452) CNT(15) 
    134  x: MULADD      ____,  R1.z,  R13.w, -R2.x      VEC_021 
         y: MULADD      ____,  R1.z,  R13.w, -R2.w      VEC_021 
         z: MULADD      ____,  R1.z,  R13.w, -R2.z      VEC_021 
         w: MULADD      ____,  R1.z,  R13.w, -R2.y      VEC_021 
         t: MUL         R4.w,  R4.z,  C31.y      
    135  x: CNDGE       R4.x,  PV134.x,  0.0f,  1.0f      
         y: CNDGE       R4.y,  PV134.y,  0.0f,  1.0f      
         z: CNDGE       R7.z,  PV134.z,  0.0f,  1.0f      
         w: CNDGE       R6.w,  PV134.w,  0.0f,  1.0f      
         t: MULADD      R5.z,  R3.z,  C37.z,  R9.w      VEC_021 
    136  x: ADD         R2.x,  R5.y,  C31.z      
         y: ADD         R2.y,  PS135,  C31.w      
         z: MUL         R4.z,  R5.y,  C31.x      
         w: ADD         R5.w,  PS135,  C31.w      
         t: ADD         R5.x,  R5.y,  C31.z      
24 TEX: ADDR(730) CNT(6) VALID_PIX 
    137  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    138  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    139  SAMPLE_G R2.___x, R2.xyxx, t0, s0
    140  SAMPLE_G R2.x___, R5.yzyy, t0, s0
    141  SAMPLE_G R2._x__, R5.xzxx, t0, s0
    142  SAMPLE_G R2.__x_, R5.ywyy, t0, s0
25 ALU: ADDR(467) CNT(39) 
    143  x: MULADD      ____,  R1.z,  R13.w, -R2.y      VEC_210 
         y: MULADD      ____,  R1.z,  R13.w, -R2.x      VEC_210 
         z: MULADD      ____,  R1.z,  R13.w, -R2.z      VEC_210 
         w: MUL         ____,  R5.z,  C31.y      VEC_120 
         t: MULADD      T0.w,  R1.z,  R13.w, -R2.w      
    144  x: FRACT       T0.x,  PV143.w      
         y: FRACT       T0.y,  R4.z      
         z: CNDGE       T0.z,  PV143.y,  0.0f,  1.0f      
         w: CNDGE       ____,  PV143.x,  0.0f,  1.0f      
         t: CNDGE       T1.y,  PV143.z,  0.0f,  1.0f      
    145  x: CNDGE       ____,  T0.w,  0.0f,  1.0f      
         y: ADD         ____, -PV144.z,  PV144.w      
         z: FRACT       T1.z,  R6.z      
         w: FRACT       R7.w,  R4.w      VEC_201 
         t: ADD         ____, -R4.x,  R6.w      
    146  x: MULADD      R11.x,  PS145,  PV145.z,  R4.x      
         y: ADD         ____, -T1.y,  PV145.x      
         z: MULADD      T0.z,  PV145.y,  T0.y,  T0.z      
         w: ADD         ____, -R7.z,  R4.y      VEC_021 
         t: MULADD      R6.z,  R3.z,  C40.y,  R9.w      VEC_021 
    147  x: MULADD      R8.x,  PV146.w,  T1.z,  R7.z      
         y: ADD         R7.y,  PS146,  C31.w      
         z: MUL         ____,  PS146,  C31.y      
         w: MULADD      ____,  PV146.y,  T0.y,  T1.y      
         t: ADD         R6.w,  PS146,  C31.w      
    148  x: FRACT       R2.x,  PV147.z      
         y: ADD         ____, -T0.z,  PV147.w      
         z: MULADD      R5.z,  R3.z,  C40.w,  R9.w      VEC_120 
         t: MULADD      R6.y,  R3.y,  C40.x,  R9.y      VEC_021 
    149  x: MULADD      ____,  PV148.y,  T0.x,  T0.z      
         y: MULADD      R5.y,  R3.y,  C40.z,  R9.y      
         z: ADD         R7.z,  PS148,  C31.z      
         w: MUL         ____,  PS148,  C31.x      
         t: ADD         R6.x,  PS148,  C31.z      
    150  x: ADD         R4.x,  PV149.y,  C31.z      
         y: FRACT       R4.y,  PV149.w      
         z: ADD         R4.z,  R8.z,  PV149.x      
         w: ADD         R4.w,  R5.z,  C31.w      VEC_120 
         t: ADD         R5.x,  PV149.y,  C31.z      
26 TEX: ADDR(742) CNT(7) VALID_PIX 
    151  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    152  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    153  SAMPLE_G R2.___x, R7.zyzz, t0, s0
    154  SAMPLE_G R4.___x, R4.xwxx, t0, s0
    155  SAMPLE_G R4.x___, R6.yzyy, t0, s0
    156  SAMPLE_G R7._x__, R6.xzxx, t0, s0
    157  SAMPLE_G R6.__x_, R6.ywyy, t0, s0
27 ALU: ADDR(506) CNT(25) 
    158  x: MULADD      ____,  R1.z,  R13.w, -R7.y      VEC_021 
         y: MULADD      ____,  R1.z,  R13.w, -R4.x      VEC_021 
         z: MULADD      ____,  R1.z,  R13.w, -R2.w      VEC_021 
         w: MULADD      ____,  R1.z,  R13.w, -R6.z      VEC_021 
         t: ADD         R5.w,  R5.z,  C31.w      
    159  x: CNDGE       ____,  PV158.z,  0.0f,  1.0f      
         y: CNDGE       T0.y,  PV158.w,  0.0f,  1.0f      
         z: CNDGE       ____,  PV158.x,  0.0f,  1.0f      
         w: CNDGE       T0.w,  PV158.y,  0.0f,  1.0f      
         t: MUL         ____,  R5.y,  C31.x      
    160  x: MUL         ____,  R5.z,  C31.y      
         y: MULADD      ____,  R1.z,  R13.w, -R4.w      VEC_120 
         z: ADD         ____, -PV159.y,  PV159.x      
         w: ADD         ____, -PV159.w,  PV159.z      
         t: FRACT       R4.w,  PS159      
    161  x: FRACT       R7.x,  PV160.x      
         y: MULADD      R4.y,  PV160.w,  R4.y,  T0.w      VEC_021 
         z: CNDGE       R6.z,  PV160.y,  0.0f,  1.0f      
         w: MULADD      ____,  PV160.z,  R4.y,  T0.y      VEC_021 
         t: MULADD      R10.z,  R3.z,  C39.y,  R9.w      VEC_021 
    162  x: ADD         R4.x, -PV161.y,  PV161.w      
         y: MULADD      R10.y,  R3.y,  C39.x,  R9.y      
         z: ADD         R9.z,  PS161,  C31.w      
         w: ADD         R10.w,  PS161,  C31.w      
         t: MUL         R7.z,  PS161,  C31.y      
28 TEX: ADDR(756) CNT(6) VALID_PIX 
    163  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    164  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    165  SAMPLE_G R6.x___, R5.yzyy, t0, s0
    166  SAMPLE_G R7._x__, R5.xzxx, t0, s0
    167  SAMPLE_G R5.__x_, R5.ywyy, t0, s0
    168  SAMPLE_G R5.x___, R10.yzyy, t0, s0
29 ALU: ADDR(531) CNT(33) 
    169  x: MULADD      ____,  R1.z,  R13.w, -R5.z      
         y: MULADD      ____,  R1.z,  R13.w, -R7.y      VEC_021 
         z: MULADD      ____,  R4.x,  R2.x,  R4.y      VEC_120 
         w: MULADD      ____,  R1.z,  R13.w, -R6.x      VEC_120 
         t: ADD         R9.x,  R10.y,  C31.z      
    170  x: CNDGE       T0.x,  PV169.w,  0.0f,  1.0f      
         y: CNDGE       T0.y,  PV169.x,  0.0f,  1.0f      
         z: CNDGE       ____,  PV169.y,  0.0f,  1.0f      
         w: ADD         T0.w,  R4.z,  PV169.z      
         t: ADD         R10.x,  R10.y,  C31.z      
    171  x: MUL         ____,  R10.y,  C31.x      
         y: ADD         ____, -PV170.y,  R6.z      
         z: MULADD      ____,  R1.z,  R13.w, -R5.x      
         w: ADD         ____, -PV170.x,  PV170.z      
         t: FRACT       R2.x,  R7.z      
    172  x: MULADD      T0.x,  PV171.w,  R4.w,  T0.x      
         y: FRACT       R7.y,  PV171.x      
         z: MULADD      ____,  PV171.y,  R4.w,  T0.y      VEC_120 
         w: CNDGE       R2.w,  PV171.z,  0.0f,  1.0f      
         t: MULADD      R6.y,  R3.y,  C39.z,  R9.y      VEC_021 
    173  x: ADD         R5.x,  PS172,  C31.z      
         y: ADD         ____, -PV172.x,  PV172.z      
         z: MULADD      R6.z,  R3.z,  C39.w,  R9.w      
         w: MUL         ____,  PS172,  C31.x      
         t: ADD         R6.x,  PS172,  C31.z      
    174  x: MULADD      ____,  PV173.y,  R7.x,  T0.x      
         y: ADD         R5.y,  PV173.z,  C31.w      
         z: MUL         ____,  PV173.z,  C31.y      
         w: ADD         R6.w,  PV173.z,  C31.w      
         t: FRACT       R8.w,  PV173.w      
    175  x: ADD         R8.x, -R11.x,  R8.x      
         y: FRACT       R2.y,  PV174.z      
         z: ADD         R7.z,  T0.w,  PV174.x      
30 TEX: ADDR(768) CNT(6) VALID_PIX 
    176  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    177  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    178  SAMPLE_G R4.___x, R9.xzxx, t0, s0
    179  SAMPLE_G R4._x__, R10.xzxx, t0, s0
    180  SAMPLE_G R5.___x, R5.xyxx, t0, s0
    181  SAMPLE_G R10.__x_, R10.ywyy, t0, s0
31 ALU: ADDR(564) CNT(13) 
    182  x: MULADD      ____,  R1.z,  R13.w, -R4.y      
         y: MULADD      ____,  R1.z,  R13.w, -R10.z      
         z: MULADD      ____,  R1.z,  R13.w, -R4.w      
         w: MULADD      R4.w,  R8.x,  R7.w,  R11.x      VEC_102 
    183  x: CNDGE       ____,  PV182.z,  0.0f,  1.0f      
         y: CNDGE       T0.y,  PV182.y,  0.0f,  1.0f      
         z: CNDGE       ____,  PV182.x,  0.0f,  1.0f      
         w: MULADD      ____,  R1.z,  R13.w, -R5.w      
    184  y: CNDGE       R10.y,  PV183.w,  0.0f,  1.0f      
         z: ADD         ____, -PV183.y,  PV183.x      
         w: ADD         ____, -R2.w,  PV183.z      
    185  y: MULADD      R4.y,  PV184.w,  R7.y,  R2.w      
         w: MULADD      R2.w,  PV184.z,  R7.y,  T0.y      
32 TEX: ADDR(780) CNT(5) VALID_PIX 
    186  SET_GRADIENTS_H ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    187  SET_GRADIENTS_V ____, R3.xxxx, t0, s0  WHOLE_QUAD 
    188  SAMPLE_G R8.x___, R6.yzyy, t0, s0
    189  SAMPLE_G R7._x__, R6.xzxx, t0, s0
    190  SAMPLE_G R6.__x_, R6.ywyy, t0, s0
33 ALU_POP_AFTER: ADDR(577) CNT(18) 
    191  x: MULADD      ____,  R1.z,  R13.w, -R6.z      
         y: MULADD      ____,  R1.z,  R13.w, -R7.y      
         z: ADD         ____, -R4.y,  R2.w      VEC_021 
         w: MULADD      ____,  R1.z,  R13.w, -R8.x      
    192  x: CNDGE       T0.x,  PV191.w,  0.0f,  1.0f      
         y: CNDGE       ____,  PV191.y,  0.0f,  1.0f      
         z: MULADD      ____,  PV191.z,  R2.x,  R4.y      
         w: CNDGE       T0.w,  PV191.x,  0.0f,  1.0f      
    193  x: ADD         ____, -PV192.x,  PV192.y      
         y: ADD         ____, -PV192.w,  R10.y      
         w: ADD         T1.w,  R7.z,  PV192.z      
    194  x: MULADD      T0.x,  PV193.x,  R8.w,  T0.x      
         z: MULADD      ____,  PV193.y,  R8.w,  T0.w      
    195  y: ADD         ____, -PV194.x,  PV194.z      
    196  x: MULADD      ____,  PV195.y,  R2.y,  T0.x      
    197  y: ADD         ____,  T1.w,  PV196.x      
    198  x: ADD         ____,  R4.w,  PV197.y      
    199  z: MUL         R8.z,  PV198.x,  C37.w      
34 POP (2) ADDR(35) 
35 ALU_POP_AFTER: ADDR(595) CNT(2) 
    200  x: ADD         ____,  R3.w, -R8.z      
    201  w: MULADD      R3.w,  R1.w,  PV200.x,  R8.z      
36 TEX: ADDR(790) CNT(2) VALID_PIX 
    202  SAMPLE R2, R1.xyxx, t2, s2
    203  SAMPLE R1, R1.xyxx, t1, s1
37 ALU: ADDR(597) CNT(48) 
    204  x: DOT4        ____,  R12.x, -C29.x      
         y: DOT4        ____,  R11.y, -C29.y      
         z: DOT4        ____,  R12.z, -C29.z      
         w: DOT4        T1.w,  (0x80000000, 0.0f).x,  0.0f      
         t: MULADD      T0.w,  R2.w,  R11.z,  C23.x      
    205  x: MUL         ____,  C27.x,  C27.x      
         w: MAX         ____,  PV204.x,  0.0f      
         t: LOG_sat     ____,  |R12.w|      
    206  x: MUL         ____,  PV205.w,  R1.y      
         y: MUL         ____,  T0.w,  PS205      
         z: MUL         ____,  PV205.w,  R1.x      
         w: MUL         ____,  PV205.w,  R1.z      
         t: RCP_e       ____,  PV205.x      
    207  x: MUL         ____,  PV206.z,  C30.x      
         y: MUL         T1.y,  R11.w,  PS206      CLAMP 
         z: MUL         T0.z,  PV206.w,  C30.z      
         w: MUL         ____,  PV206.x,  C30.y      
         t: EXP_e       ____,  PV206.y      
    208  x: MUL         ____,  R2.z,  PS207      
         y: MUL         ____,  R2.y,  PS207      
         z: MUL         ____,  R2.x,  PS207      
         w: MUL         ____,  R3.w,  PV207.x      
         t: MUL         T0.y,  R3.w,  PV207.w      
    209  x: MUL         ____,  R3.w,  T0.z      
         y: MUL         ____,  PV208.x,  C30.z      
         z: MUL         ____,  PV208.y,  C30.y      
         w: MUL         ____,  PV208.z,  C30.x      
         t: MULADD      T0.w,  R1.x,  R0.x,  PV208.w      
    210  x: MUL         ____,  PV209.w,  C25.x      
         y: MULADD      T0.y,  R1.z,  R0.z,  PV209.x      
         z: MULADD      T0.z,  R1.y,  R0.y,  T0.y      
         w: MUL         ____,  PV209.z,  C25.x      
         t: MUL         ____,  PV209.y,  C25.x      
    211  x: MUL         T0.x,  T1.y,  C27.y      
         y: MULADD      ____,  PS210,  R3.w,  PV210.y      
         z: MULADD      ____,  PV210.w,  R3.w,  PV210.z      
         w: MULADD      ____,  PV210.x,  R3.w,  T0.w      
    212  y: CNDGE       T0.y, -T1.w,  T0.y,  PV211.y      
         z: CNDGE       T0.z, -T1.w,  T0.z,  PV211.z      
         w: CNDGE       T0.w, -T1.w,  T0.w,  PV211.w      
    213  x: ADD         ____, -PV212.y,  C26.z      
         y: ADD         ____, -PV212.z,  C26.y      
         z: ADD         ____, -PV212.w,  C26.x      
         w: MUL         R0.w,  R0.w,  R1.w      
    214  x: MULADD      R0.x,  T0.x,  PV213.z,  T0.w      
         y: MULADD      R0.y,  T0.x,  PV213.y,  T0.z      
         z: MULADD      R0.z,  T0.x,  PV213.x,  T0.y      
38 EXP_DONE: PIX0, R0
The overall utilization is about 80%, but that is due to some scalar dependencies rather than to the t unit sitting idle.
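For anyone who wants to check slot utilization on a listing like this, here is a minimal sketch. It assumes the dump format shown above, where each ALU slot line starts with a slot letter and a leading number marks a new instruction group. The helper name `slot_stats` and the excerpt in `SAMPLE` are illustrative only; the excerpt is a few groups copied from the listing, not the whole shader, so its figure will not match the overall number.

```python
import re

# One ALU slot line: optional group number, then a slot letter (x, y, z, w or t).
SLOT = re.compile(r"^\s*(\d+\s+)?([xyzwt]):\s")

def slot_stats(dump):
    """Return (slots_filled, group_count, groups_using_t) for an ISA dump."""
    groups = []
    for line in dump.splitlines():
        m = SLOT.match(line)
        if m is None:
            continue  # clause headers, TEX instructions, blank lines
        if m.group(1) or not groups:
            groups.append([])  # a leading number starts a new instruction group
        groups[-1].append(m.group(2))
    filled = sum(len(g) for g in groups)
    with_t = sum(1 for g in groups if "t" in g)
    return filled, len(groups), with_t

# A few instruction groups excerpted from the listing above.
SAMPLE = """\
27 ALU: ADDR(506) CNT(25)
    158  x: MULADD      ____,  R1.z,  R13.w, -R7.y      VEC_021
         y: MULADD      ____,  R1.z,  R13.w, -R4.x      VEC_021
         z: MULADD      ____,  R1.z,  R13.w, -R2.w      VEC_021
         w: MULADD      ____,  R1.z,  R13.w, -R6.z      VEC_021
         t: ADD         R5.w,  R5.z,  C31.w
31 ALU: ADDR(564) CNT(13)
    184  y: CNDGE       R10.y,  PV183.w,  0.0f,  1.0f
         z: ADD         ____, -PV183.y,  PV183.x
         w: ADD         ____, -R2.w,  PV183.z
    185  y: MULADD      R4.y,  PV184.w,  R7.y,  R2.w
         w: MULADD      R2.w,  PV184.z,  R7.y,  T0.y
"""

filled, group_count, with_t = slot_stats(SAMPLE)
utilization = filled / (5 * group_count)  # 5 slots available per group
print(f"{filled} slots in {group_count} groups: {utilization:.0%} utilization, "
      f"t used in {with_t} group(s)")
```

Running the same function over the complete listing would give the whole-shader figure, and the `with_t` count shows separately how often the t slot is actually occupied.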
__________________
I speak only for myself.
Old 19-Aug-2009, 19:52   #21
AlexV
Heteroscedasticitate
 
Join Date: Mar 2005
Posts: 2,433
Default

Quote:
Originally Posted by rpg.314 View Post
Oh dear, I should have made it clear upfront. I mean it is not used explicitly, and implicitly there is a good chance that it will be used only as an SFU. This gemm is an exception. In fact, this is the first exception I have seen in this regard.
This is incorrect. You may be confused by how scheduling is prioritized - namely, "common" instructions will first be assigned to the "vector" ALUs (x,y,z,w) and only if those are occupied will they be assigned to the transcendental unit as well. Of course, transcendental ops (or stuff like INT MUL/DIV, for example) get scheduled to the trans ALU implicitly. There are also some GPR read port restrictions in place, which end up not always allowing an instruction to be scheduled there. But it does MADs just fine, and quite often, really.
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do.
Old 19-Aug-2009, 21:36   #22
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,333
Default

Quote:
Originally Posted by rpg.314 View Post
Real-world shaders, heavily optimized by the compiler, typically average 3.5-4 ALU slots per instruction.
Average utilization doesn't really indicate how often the t unit is being used. If you have a bunch of very scalar code, utilization may go down, but you may find the rest of the code is fully utilizing all slots.
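A quick worked example of that point, as a sketch with made-up proportions: assume a shader is a mix of fully packed groups (all five slots x, y, z, w, t) and purely scalar groups (one slot). The function name and the mix are hypothetical, chosen only to show that an average in the 3.5-4 range is compatible with the t slot being busy in most groups.

```python
def average_slots(full_fraction, full_slots=5, scalar_slots=1):
    """Average filled slots per group for a mix of full and scalar groups."""
    return full_fraction * full_slots + (1.0 - full_fraction) * scalar_slots

# Hypothetical mixes: the fraction of groups that are fully packed.
for full_fraction in (0.625, 0.7, 0.75):
    avg = average_slots(full_fraction)
    print(f"{full_fraction:.1%} full groups -> {avg:.2f} slots/group, "
          f"t busy in {full_fraction:.1%} of groups")
```

Under these assumptions an average of 3.5 to 4 slots corresponds to the t slot being used in 62.5% to 75% of all groups, so a low average alone says little about t.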
Old 20-Aug-2009, 05:58   #23
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,237
Default

Which instruction is a CNDE btw?
Old 20-Aug-2009, 06:16   #24
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,333
Default

Quote:
Originally Posted by rpg.314 View Post
Which instruction is a CNDE btw?
I believe it checks if a number is equal to 0. If so, it chooses one of the operands; if not, it chooses the other. I don't have the specs in front of me, but Jawed posted a link to the instruction set specs recently.

Edit: Did you mean CNDGE? I believe that checks if a number is greater than or equal to 0 with similar behavior to what I posted above.

Edit again: Compare to the cmp instruction in the Direct3D instruction specs.
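As a sketch of the behavior described above (not taken from the spec, so the operand order is an assumption modeled on the D3D cmp instruction and should be checked against the ISA document):

```python
def cnde(src0, src1, src2):
    """Assumed CNDE semantics: src1 if src0 == 0, else src2."""
    return src1 if src0 == 0.0 else src2

def cndge(src0, src1, src2):
    """Assumed CNDGE semantics: src1 if src0 >= 0, else src2 (like D3D cmp)."""
    return src1 if src0 >= 0.0 else src2

# Under this reading, "CNDGE ____, PV158.z, 0.0f, 1.0f" in the listing
# acts as a step function: 0.0 for non-negative inputs, 1.0 for negative ones.
for x in (-0.5, 0.0, 0.5):
    print(x, cndge(x, 0.0, 1.0))
```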
Old 20-Aug-2009, 08:52   #25
prunedtree
Regular
 
Join Date: Aug 2009
Posts: 27
Default

Quote:
Originally Posted by Jawed View Post
Onto the double-precision version!
Well, given that double precision multiply-adds are four (five if you count the `t' unit) times slower but only require twice the bandwidth, it's much easier to achieve high ALU utilization. ATi's implementation is almost optimal, over 200 Gflop/s (out of 240 Gflop/s peak).
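The arithmetic behind those figures, as a rough sketch. The 1200 Gflop/s single-precision peak is the number quoted in the first post; the factor of five assumes, as above, that a double-precision multiply-add takes the place of all five single-precision slots, and 200 Gflop/s is used as a round lower bound for "over 200".

```python
SP_PEAK_GFLOPS = 1200.0  # HD4870 single-precision peak, from the first post
DP_SLOWDOWN = 5          # counting the t unit, as described above
DP_BYTES_RATIO = 2       # a double is twice the size of a float

dp_peak_gflops = SP_PEAK_GFLOPS / DP_SLOWDOWN
achieved_gflops = 200.0  # lower bound for "over 200 Gflop/s"
efficiency = achieved_gflops / dp_peak_gflops

# Bandwidth needed at peak rate, relative to single precision at peak rate.
relative_bandwidth = DP_BYTES_RATIO / DP_SLOWDOWN

print(f"DP peak: {dp_peak_gflops:.0f} Gflop/s")
print(f"ALU utilization at 200 Gflop/s: {efficiency:.1%}")
print(f"Bandwidth demand relative to SP: {relative_bandwidth:.0%}")
```

With these assumptions the double-precision kernel needs well under half the fetch bandwidth of the single-precision one at peak rate, which is why the texture units stop being the limit.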

Quote:
Originally Posted by Jawed View Post
Hmm, perhaps full cache bandwidth only comes with the same values being fetched multiple times.
No, I didn't measure more than ~444 GB/s even with all threads fetching the same value(s) over and over. Running ATi's various synthetic tests (among the samples in the SDK) gives similar results. As texture fetches are the bottleneck, it's actually impressive that the hardware manages to lose only 1% of efficiency with a more complex access pattern.
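For reference, a small sketch relating the measured figure to the nominal peak from the first post (30 billion float4 fetches per second). The variable names are only for illustration.

```python
FETCHES_PER_SECOND = 30e9  # float4 fetches per second, from the first post
BYTES_PER_FLOAT4 = 4 * 4   # four 32-bit floats

peak_gb_s = FETCHES_PER_SECOND * BYTES_PER_FLOAT4 / 1e9
measured_gb_s = 444.0      # best rate measured, as stated above
fraction = measured_gb_s / peak_gb_s

print(f"peak {peak_gb_s:.0f} GB/s, measured {measured_gb_s:.0f} GB/s ({fraction:.1%})")
```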