Faster dense matrix-matrix products on ATi hardware
prunedtree
18-Aug-2009, 17:18
The dense matrix-matrix product, a basic building block of dense linear algebra, is rather straightforward in terms of multiply-add operations and very friendly to memory hierarchies, which makes it an interesting benchmark for throughput-oriented hardware: the performance of matrix-matrix multiply is often much more relevant to real-world performance than the sum of all ALU operations that can be completed in a second.
Contrast for instance the peak performance claimed by both major IHVs:
nVidia GT200 (GTX280) : 933 Gflop/s
ATi RV770 (HD4870) : 1200 Gflop/s
...to the current fastest matrix-matrix product implementations:
CUBLAS 2.0 on GTX280[1] achieves 375 Gflop/s
and on HD4870[2] ATi reckons 540 Gflop/s
However, there is a significant difference: Volkov's implementation achieves the peak multiply-add rate with one operand from shared memory, while ATi's implementation is limited by the speed of the texture units. Like many others, I thought that higher performance could be possible on ATi boards, using some mechanism to avoid the memory bottleneck. According to ATi, this is not possible[3], but they do not want to disclose details on the hardware.
Thus, I experimented with all the ideas I could come up with, and I hit the limitations ATi knew about, one after another: Shared memory (LDS in ATi parlance) is no faster than texture fetches that hit L1 (30 billion float4 per second, 480 GB/s, for both). Shared memory broadcasting requires impractical amounts of ALU work in order to put addresses into registers (sigh). Shared registers have half the peak bandwidth of local registers, giving us a limit of 480 Gflop/s...
ATi's claim checks out: the limited features of their hardware do not really offer any help. But is their implementation really optimal? The bandwidth intensity of a matrix-matrix product implementation is directly related to the size of the blocks in the destination matrix, and `simple_matmult' uses 8x4 blocks (this is also the maximum for their pixel shader approach). RV770's texture units can deliver 120 billion single precision values per second and we need two input values for each multiply-add operation. With 8x4 blocks, the bandwidth reduction is ~5, and thus we obtain a peak of roughly 600 Gflop/s.
Using 8x8 blocks would bring the bandwidth reduction to 8, for a peak of 960 Gflop/s. However, the obvious limitation to higher block sizes is that we need enough space in the register file to store them. The size of the register file (1024 scalars) on RV770 may seem impressive, but with 180 cycles of latency for L1 hits, you need 8 threads (wavefronts in ATi parlance) to hide one texture clause behind 30 cycles of computation (128 multiply-adds). This gives us less than ~120 scalars per thread in order to compute, for instance, two 8x8 outer products per loop: a single texture clause can load the four float8 inputs, and the two outer products amount to 128 multiply-add instructions.
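The two estimates above (texture-bound peak and register budget) can be reproduced with a short back-of-the-envelope sketch. The constants are the figures quoted in the post; the function names are made up for illustration. Note that the exact reduction for 8x4 blocks is 2*32/12 = 5.33, which gives 640 Gflop/s; the post rounds this down to ~5, hence 600 Gflop/s.

```python
# Rough model of the texture-bandwidth bound and the per-thread register budget.
# All constants come from the post; the helper names are hypothetical.

TEXTURE_VALUES_PER_SECOND = 120e9   # single-precision values/s from the texture units
REGISTER_FILE_SCALARS = 1024        # register file size quoted in the post
THREADS_IN_FLIGHT = 8               # threads needed to hide L1 latency


def bandwidth_reduction(rows, cols):
    """Fetch savings of a rows x cols block relative to 2 fetches per multiply-add."""
    # One rank-1 update fetches rows + cols values and performs rows * cols multiply-adds.
    return 2.0 * rows * cols / (rows + cols)


def texture_bound_gflops(rows, cols):
    """Peak Gflop/s if the texture units are the only bottleneck."""
    madds_per_second = TEXTURE_VALUES_PER_SECOND / 2.0 * bandwidth_reduction(rows, cols)
    return 2.0 * madds_per_second / 1e9  # one multiply-add counts as 2 flops


def scalars_per_thread():
    """Register budget of one thread when the file is split evenly."""
    return REGISTER_FILE_SCALARS // THREADS_IN_FLIGHT


def scalars_needed(rows, cols, outer_products_per_loop):
    """Accumulators plus fetched inputs; loop index and addresses come on top."""
    return rows * cols + (rows + cols) * outer_products_per_loop


for rows, cols in [(8, 4), (8, 8)]:
    print(rows, cols, bandwidth_reduction(rows, cols), texture_bound_gflops(rows, cols))
print("budget:", scalars_per_thread(), "needed:", scalars_needed(8, 8, 2))
```

The budget works out to 128 scalars per thread before any overhead, against 96 for the accumulators and inputs of two 8x8 outer products, so the 8x8 scheme fits with a little room for addressing.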
How much register space would we need? 64 scalars for the output block, the 32 values that are fetched by the texture units, and some registers for the loop index and texture addresses... about a hundred scalars. This looks quite reasonable, so I implemented it. The major difficulty is to trick ATi's horrible compiler (which reflects the current quality of their `GPU computing' software stack well) into producing decent machine code. Here's what it looks like:
00 ALU: ADDR(32) CNT(71)
0 x: LSHR T0.x, R0.x, (0x00000006, 8.407790786e-45f).x
y: MOV R23.y, 0.0f
z: MOV R23.z, 0.0f
w: AND_INT T0.w, R0.x, (0x0000003F, 8.828180325e-44f).y
t: MOV R23.x, 0.0f
1 x: MOV R22.x, 0.0f
y: MOV R22.y, 0.0f
z: MOV R22.z, 0.0f
w: MOV R23.w, 0.0f
t: MOV R22.w, 0.0f
2 x: MOV R21.x, 0.0f
y: MOV R21.y, 0.0f
z: MOV R21.z, 0.0f
w: MOV R21.w, 0.0f
t: MOV R20.x, 0.0f
3 x: MOV R19.x, 0.0f
y: MOV R20.y, 0.0f
z: MOV R20.z, 0.0f
w: MOV R20.w, 0.0f
t: MOV R4.z, (0xC0000000, -2.0f).x
4 x: MOV R18.x, 0.0f
y: MOV R19.y, 0.0f
z: MOV R19.z, 0.0f
w: MOV R19.w, 0.0f
t: MOV R18.y, 0.0f
5 x: MOV R17.x, 0.0f
y: MOV R17.y, 0.0f
z: MOV R18.z, 0.0f
w: MOV R18.w, 0.0f
t: MOV R17.z, 0.0f
6 x: MOV R16.x, 0.0f
y: MOV R16.y, 0.0f
z: MOV R16.z, 0.0f
w: MOV R17.w, 0.0f
t: MOV R16.w, 0.0f
7 x: MOV R15.x, 0.0f
y: MOV R15.y, 0.0f
z: MOV R15.z, 0.0f
w: MOV R15.w, 0.0f
t: MOV R13.x, 0.0f
8 x: MOV R14.x, 0.0f
y: MOV R13.y, 0.0f
z: MOV R13.z, 0.0f
w: MOV R13.w, 0.0f
t: MOV R14.y, 0.0f
9 x: MOV R12.x, 0.0f
y: MOV R12.y, 0.0f
z: MOV R14.z, 0.0f
w: MOV R14.w, 0.0f
t: MOV R12.z, 0.0f
10 x: MOV R11.x, 0.0f
y: MOV R11.y, 0.0f
z: MOV R11.z, 0.0f
w: MOV R12.w, 0.0f
t: MOV R11.w, 0.0f
11 x: MOV R9.x, 0.0f
y: MOV R9.y, 0.0f
z: MOV R9.z, 0.0f
w: MOV R9.w, 0.0f
t: MOV R10.x, 0.0f
12 x: MOV R8.x, 0.0f
y: MOV R10.y, 0.0f
z: MOV R10.z, 0.0f
w: MOV R10.w, 0.0f
t: MOV R8.y, 0.0f
13 z: MOV R8.z, 0.0f
w: MOV R8.w, 0.0f
t: I_TO_F R0.x, T0.w
14 t: I_TO_F R0.y, T0.x
01 TEX: ADDR(288) CNT(1)
15 SAMPLE R5.xyz_, R0.xyxx, t4, s4 UNNORM(XYZW)
02 ALU: ADDR(103) CNT(2)
16 x: MOV R4.x, R5.x
y: MOV R4.y, R5.y
03 LOOP_DX10 i0 FAIL_JUMP_ADDR(11)
04 ALU_BREAK: ADDR(105) CNT(3) KCACHE0(CB0:0-15)
17 z: ADD R4.z, R4.z, (0x40000000, 2.0f).x
18 x: PREDGT ____, KC0[0].x, R4.z UPDATE_EXEC_MASK UPDATE_PRED
05 ALU: ADDR(108) CNT(3) KCACHE0(CB0:0-15)
19 x: ADD R4.x, R4.x, KC0[0].y
y: ADD R4.y, R4.y, KC0[0].y
w: ADD R4.w, R4.z, 1.0f
06 TEX: ADDR(290) CNT(8)
20 SAMPLE R0, R4.xzxx, t0, s0 UNNORM(XYZW)
21 SAMPLE R2, R4.xzxx, t1, s1 UNNORM(XYZW)
22 SAMPLE R1, R4.yzyy, t2, s2 UNNORM(XYZW)
23 SAMPLE R3, R4.yzyy, t3, s3 UNNORM(XYZW)
24 SAMPLE R6, R4.xwxx, t0, s0 UNNORM(XYZW)
25 SAMPLE R7, R4.xwxx, t1, s1 UNNORM(XYZW)
26 SAMPLE R24, R4.ywyy, t2, s2 UNNORM(XYZW)
27 SAMPLE R25, R4.ywyy, t3, s3 UNNORM(XYZW)
07 ALU_PUSH_BEFORE: ADDR(111) CNT(65) KCACHE0(CB0:0-15)
28 x: MULADD R23.x, R0.x, R1.x, R23.x
y: MULADD R23.y, R0.x, R1.y, R23.y
z: MULADD R23.z, R0.x, R1.z, R23.z
w: MULADD R23.w, R0.x, R1.w, R23.w
29 x: MULADD R22.x, R0.x, R3.x, R22.x
y: MULADD R22.y, R0.x, R3.y, R22.y
z: MULADD R22.z, R0.x, R3.z, R22.z
w: MULADD R22.w, R0.x, R3.w, R22.w
30 x: MULADD R21.x, R0.y, R1.x, R21.x VEC_210
y: MULADD R21.y, R0.y, R1.y, R21.y VEC_201
z: MULADD R21.z, R0.y, R1.z, R21.z VEC_201
w: MULADD R21.w, R0.y, R1.w, R21.w VEC_201
t: MULADD R19.x, R0.z, R1.x, R19.x VEC_120
31 x: MULADD R20.x, R0.y, R3.x, R20.x VEC_210
y: MULADD R20.y, R0.y, R3.y, R20.y VEC_201
z: MULADD R20.z, R0.y, R3.z, R20.z VEC_201
w: MULADD R20.w, R0.y, R3.w, R20.w VEC_201
t: MULADD R18.x, R0.z, R3.x, R18.x VEC_120
32 x: MULADD R17.x, R0.w, R1.x, R17.x VEC_201
y: MULADD R19.y, R0.z, R1.y, R19.y VEC_210
z: MULADD R19.z, R0.z, R1.z, R19.z VEC_201
w: MULADD R19.w, R0.z, R1.w, R19.w VEC_201
t: MULADD R17.y, R0.w, R1.y, R17.y VEC_120
33 x: MULADD R16.x, R0.w, R3.x, R16.x VEC_201
y: MULADD R18.y, R0.z, R3.y, R18.y VEC_210
z: MULADD R18.z, R0.z, R3.z, R18.z VEC_201
w: MULADD R18.w, R0.z, R3.w, R18.w VEC_201
t: MULADD R16.y, R0.w, R3.y, R16.y VEC_120
34 x: MULADD R15.x, R2.x, R1.x, R15.x VEC_201
y: MULADD R15.y, R2.x, R1.y, R15.y VEC_201
z: MULADD R17.z, R0.w, R1.z, R17.z
w: MULADD R17.w, R0.w, R1.w, R17.w
t: MULADD R15.z, R2.x, R1.z, R15.z
35 x: MULADD R13.x, R2.x, R3.x, R13.x VEC_201
y: MULADD R13.y, R2.x, R3.y, R13.y VEC_201
z: MULADD R16.z, R0.w, R3.z, R16.z
w: MULADD R16.w, R0.w, R3.w, R16.w
t: MULADD R13.z, R2.x, R3.z, R13.z
36 x: MULADD R14.x, R2.y, R1.x, R14.x VEC_201
y: MULADD R14.y, R2.y, R1.y, R14.y VEC_201
z: MULADD R14.z, R2.y, R1.z, R14.z VEC_201
w: MULADD R15.w, R2.x, R1.w, R15.w VEC_210
t: MULADD R14.w, R2.y, R1.w, R14.w VEC_120
37 x: MULADD R12.x, R2.y, R3.x, R12.x VEC_201
y: MULADD R12.y, R2.y, R3.y, R12.y VEC_201
z: MULADD R12.z, R2.y, R3.z, R12.z VEC_201
w: MULADD R13.w, R2.x, R3.w, R13.w VEC_210
t: MULADD R12.w, R2.y, R3.w, R12.w VEC_120
38 x: MULADD R11.x, R2.z, R1.x, R11.x VEC_210
y: MULADD R11.y, R2.z, R1.y, R11.y VEC_201
z: MULADD R11.z, R2.z, R1.z, R11.z VEC_201
w: MULADD R11.w, R2.z, R1.w, R11.w VEC_201
t: MULADD R10.x, R2.w, R1.x, R10.x VEC_120
39 x: MULADD R9.x, R2.z, R3.x, R9.x VEC_210
y: MULADD R10.y, R2.w, R1.y, R10.y VEC_201
z: MULADD R10.z, R2.w, R1.z, R10.z VEC_201
w: MULADD R10.w, R2.w, R1.w, R10.w VEC_201
t: MULADD R8.x, R2.w, R3.x, R8.x VEC_120
40 y: MULADD R9.y, R2.z, R3.y, R9.y VEC_210
z: MULADD R9.z, R2.z, R3.z, R9.z VEC_201
w: MULADD R9.w, R2.z, R3.w, R9.w VEC_201
t: MULADD R8.y, R2.w, R3.y, R8.y VEC_120
41 z: MULADD R8.z, R2.w, R3.z, R8.z
w: MULADD R8.w, R2.w, R3.w, R8.w
42 x: PREDE_INT ____, KC0[0].y, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
08 JUMP POP_CNT(1) ADDR(10)
09 ALU_POP_AFTER: ADDR(176) CNT(64)
43 x: MULADD R23.x, R6.x, R24.x, R23.x
y: MULADD R23.y, R6.x, R24.y, R23.y
z: MULADD R23.z, R6.x, R24.z, R23.z
w: MULADD R23.w, R6.x, R24.w, R23.w
44 x: MULADD R22.x, R6.x, R25.x, R22.x
y: MULADD R22.y, R6.x, R25.y, R22.y
z: MULADD R22.z, R6.x, R25.z, R22.z
w: MULADD R22.w, R6.x, R25.w, R22.w
45 x: MULADD R20.x, R6.y, R24.x, R20.x VEC_210
y: MULADD R20.y, R6.y, R24.y, R20.y VEC_201
z: MULADD R20.z, R6.y, R24.z, R20.z VEC_201
w: MULADD R20.w, R6.y, R24.w, R20.w VEC_201
t: MULADD R19.x, R6.z, R24.x, R19.x VEC_120
46 x: MULADD R21.x, R6.y, R25.x, R21.x VEC_210
y: MULADD R21.y, R6.y, R25.y, R21.y VEC_201
z: MULADD R21.z, R6.y, R25.z, R21.z VEC_201
w: MULADD R21.w, R6.y, R25.w, R21.w VEC_201
t: MULADD R18.x, R6.z, R25.x, R18.x VEC_120
47 x: MULADD R17.x, R6.w, R24.x, R17.x VEC_201
y: MULADD R19.y, R6.z, R24.y, R19.y VEC_210
z: MULADD R19.z, R6.z, R24.z, R19.z VEC_201
w: MULADD R19.w, R6.z, R24.w, R19.w VEC_201
t: MULADD R17.y, R6.w, R24.y, R17.y VEC_120
48 x: MULADD R16.x, R6.w, R25.x, R16.x VEC_201
y: MULADD R18.y, R6.z, R25.y, R18.y VEC_210
z: MULADD R18.z, R6.z, R25.z, R18.z VEC_201
w: MULADD R18.w, R6.z, R25.w, R18.w VEC_201
t: MULADD R16.y, R6.w, R25.y, R16.y VEC_120
49 x: MULADD R15.x, R7.x, R24.x, R15.x VEC_201
y: MULADD R15.y, R7.x, R24.y, R15.y VEC_201
z: MULADD R17.z, R6.w, R24.z, R17.z
w: MULADD R17.w, R6.w, R24.w, R17.w
t: MULADD R15.z, R7.x, R24.z, R15.z
50 x: MULADD R13.x, R7.x, R25.x, R13.x VEC_201
y: MULADD R13.y, R7.x, R25.y, R13.y VEC_201
z: MULADD R16.z, R6.w, R25.z, R16.z
w: MULADD R16.w, R6.w, R25.w, R16.w
t: MULADD R13.z, R7.x, R25.z, R13.z
51 x: MULADD R14.x, R7.y, R24.x, R14.x VEC_201
y: MULADD R14.y, R7.y, R24.y, R14.y VEC_201
z: MULADD R14.z, R7.y, R24.z, R14.z VEC_201
w: MULADD R15.w, R7.x, R24.w, R15.w VEC_210
t: MULADD R14.w, R7.y, R24.w, R14.w VEC_120
52 x: MULADD R12.x, R7.y, R25.x, R12.x VEC_201
y: MULADD R12.y, R7.y, R25.y, R12.y VEC_201
z: MULADD R12.z, R7.y, R25.z, R12.z VEC_201
w: MULADD R13.w, R7.x, R25.w, R13.w VEC_210
t: MULADD R12.w, R7.y, R25.w, R12.w VEC_120
53 x: MULADD R11.x, R7.z, R24.x, R11.x VEC_210
y: MULADD R11.y, R7.z, R24.y, R11.y VEC_201
z: MULADD R11.z, R7.z, R24.z, R11.z VEC_201
w: MULADD R11.w, R7.z, R24.w, R11.w VEC_201
t: MULADD R10.x, R7.w, R24.x, R10.x VEC_120
54 x: MULADD R9.x, R7.z, R25.x, R9.x VEC_210
y: MULADD R10.y, R7.w, R24.y, R10.y VEC_201
z: MULADD R10.z, R7.w, R24.z, R10.z VEC_201
w: MULADD R10.w, R7.w, R24.w, R10.w VEC_201
t: MULADD R8.x, R7.w, R25.x, R8.x VEC_120
55 y: MULADD R9.y, R7.z, R25.y, R9.y VEC_210
z: MULADD R9.z, R7.z, R25.z, R9.z VEC_201
w: MULADD R9.w, R7.z, R25.w, R9.w VEC_201
t: MULADD R8.y, R7.w, R25.y, R8.y VEC_120
56 z: MULADD R8.z, R7.w, R25.z, R8.z
w: MULADD R8.w, R7.w, R25.w, R8.w
10 ENDLOOP i0 PASS_JUMP_ADDR(4)
11 ALU: ADDR(240) CNT(29)
57 x: ADD_INT T0.x, R5.z, (0x00000003, 4.203895393e-45f).x
y: ADD_INT ____, R5.z, 0.0f
z: ADD_INT T0.z, R5.z, (0x00000002, 2.802596929e-45f).y
w: ADD_INT ____, R5.z, 1
58 x: LSHL R0.x, PV57.y, (0x00000002, 2.802596929e-45f).x
y: ADD_INT T0.y, PV57.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT T1.z, PV57.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT T0.w, PV57.y, (0x00000004, 5.605193857e-45f).y
t: LSHL R1.x, PV57.w, (0x00000002, 2.802596929e-45f).x
59 x: LSHL R2.x, T0.z, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R0.y, PV58.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT R0.z, PV58.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT T1.w, T0.x, (0x00000004, 5.605193857e-45f).y
t: LSHL R3.x, T0.x, (0x00000002, 2.802596929e-45f).x
60 x: LSHL R4.x, T0.w, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R1.y, PV59.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT R1.z, PV59.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT R0.w, T0.y, (0x00000004, 5.605193857e-45f).y
t: LSHL R5.x, T1.z, (0x00000002, 2.802596929e-45f).x
61 x: LSHL R6.x, T0.y, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R2.y, PV60.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT R2.z, PV60.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT R1.w, R0.y, (0x00000004, 5.605193857e-45f).y VEC_120
t: LSHL R7.x, T1.w, (0x00000002, 2.802596929e-45f).x
12 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R23, ELEM_SIZE(3)
13 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R22, ELEM_SIZE(3)
14 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R21, ELEM_SIZE(3)
15 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R20, ELEM_SIZE(3)
16 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R19, ELEM_SIZE(3)
17 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R5.x], R18, ELEM_SIZE(3)
18 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R17, ELEM_SIZE(3)
19 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R7.x], R16, ELEM_SIZE(3)
20 ALU: ADDR(269) CNT(12)
62 x: LSHL R7.x, R0.z, (0x00000002, 2.802596929e-45f).x
t: LSHL R6.x, R0.y, (0x00000002, 2.802596929e-45f).x
63 x: LSHL R5.x, R0.w, (0x00000002, 2.802596929e-45f).x
t: LSHL R4.x, R1.z, (0x00000002, 2.802596929e-45f).x
64 x: LSHL R3.x, R1.y, (0x00000002, 2.802596929e-45f).x
t: LSHL R2.x, R1.w, (0x00000002, 2.802596929e-45f).x
65 x: LSHL R1.x, R2.z, (0x00000002, 2.802596929e-45f).x
t: LSHL R0.x, R2.y, (0x00000002, 2.802596929e-45f).x
21 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R7.x], R15, ELEM_SIZE(3)
22 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R13, ELEM_SIZE(3)
23 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R5.x], R14, ELEM_SIZE(3)
24 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R12, ELEM_SIZE(3)
25 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R11, ELEM_SIZE(3)
26 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R9, ELEM_SIZE(3)
27 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R10, ELEM_SIZE(3)
28 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R8, ELEM_SIZE(3)
END_OF_PROGRAM
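For readers who do not want to trace the disassembly, here is a minimal CPU reference model of what the loop appears to compute: each thread keeps an 8x8 block of accumulators and applies two rank-1 (outer product) updates per loop iteration. This is a sketch only. The function name and the row-major list-of-lists layout are assumptions, and it says nothing about how the real kernel lays out its textures or tiles its threads.

```python
def blocked_matmul(a, b, block=8, unroll=2):
    """C = A * B using block x block accumulator tiles and `unroll` rank-1 updates per loop.

    Hypothetical reference model: `a` is n x k and `b` is k x m, both lists of rows.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    assert n % block == 0 and m % block == 0 and k_dim % unroll == 0
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):            # one "thread" per output block
        for j0 in range(0, m, block):
            # The 64 accumulators that live in registers in the shader.
            acc = [[0.0] * block for _ in range(block)]
            for k0 in range(0, k_dim, unroll):       # one trip through the loop body
                for k in range(k0, k0 + unroll):     # two outer products per trip
                    col_a = [a[i0 + i][k] for i in range(block)]  # 8 values of A
                    row_b = b[k][j0:j0 + block]                   # 8 values of B
                    for i in range(block):
                        for j in range(block):
                            acc[i][j] += col_a[i] * row_b[j]      # one multiply-add
            for i in range(block):                   # write the block out once
                c[i0 + i][j0:j0 + block] = acc[i]
    return c
```

With the defaults, each loop trip reads 32 input values and performs 128 multiply-adds, matching the eight SAMPLE instructions and the two MULADD clauses in the listing.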
The result? I measure 880 Gflop/s for 4096x4096 dense matrix-matrix products. That makes a pair of HD4870x2 boards faster than nine GTX280s ^^
EDIT: 1000 Gflop/s later in this thread
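The arithmetic behind these figures can be checked with a short sketch. It assumes the usual convention of 2*N^3 flops for an NxN product, and it assumes perfect scaling across the four RV770 chips on a pair of HD4870x2 boards, which is an idealisation and not a measurement. The names are hypothetical.

```python
RV770_MEASURED_GFLOPS = 880.0   # measured above on a single RV770
GTX280_CUBLAS_GFLOPS = 375.0    # CUBLAS 2.0 figure quoted above


def sgemm_gflops(n, seconds):
    """Gflop/s of an n x n x n product, counting 2 flops per multiply-add."""
    return 2.0 * n ** 3 / seconds / 1e9


def seconds_at_rate(n, gflops):
    """Run time of one n x n x n product at a sustained rate."""
    return 2.0 * n ** 3 / (gflops * 1e9)


# Two HD4870x2 boards carry four RV770 chips (assuming ideal scaling).
four_rv770 = 4 * RV770_MEASURED_GFLOPS
nine_gtx280 = 9 * GTX280_CUBLAS_GFLOPS

print("time per 4096^3 product (s):", seconds_at_rate(4096, RV770_MEASURED_GFLOPS))
print("4 x RV770:", four_rv770, "9 x GTX280:", nine_gtx280)
```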
References:
[1] V. Volkov, J. W. Demmel: Benchmarking GPUs to tune dense linear algebra. Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008
http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf
[2] `What we see on our optimized MM kernel is ~540 gflops in IL.'
Micah Villmow, AMD. Replying to vvolkov in the ATi Stream section of the AMD Developer Forums
http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=105221
[3] `The simple_matmult example that we have is pretty much optimal for our hardware'
Micah Villmow, AMD. Replying to sgratton in the ATi Stream section of the AMD Developer Forums
http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=102771
rpg.314
18-Aug-2009, 17:54
The dense matrix-matrix product, a basic building block of dense linear algebra, is rather straightforward in terms of multiply-add operations and very friendly to memory hierarchies, which makes it an interesting benchmark for throughput-oriented hardware: the performance of matrix-matrix multiply is often much more relevant to real-world performance than the sum of all ALU operations that can be completed in a second.
Contrast for instance the peak performance claimed by both major IHVs:
nVidia GT200 (GTX280) : 933 Gflop/s
ATi RV770 (HD4870) : 1200 Gflop/s
I have always believed that the mul on nv and the t unit on amd's gpu's should be ignored while calculating the peak throughput. Volkov's paper does the same too.
Err..., you quote vendor's peak claims and calculate your fraction of flops from your own estimate of peak flops???:shock:
The result ? I measure 880 Gflop/s (92% peak) for 4096x4096 dense matrix-matrix products.
GREAT JOB, nevertheless. :mrgreen::mrgreen::runaway:
I assume that you are using a 4870. Right?
prunedtree
18-Aug-2009, 18:22
I have always believed that the mul on nv and the t unit on amd's gpu's should be ignored while calculating the peak throughput. Volkov's paper does the same too.
Yes, it's more realistic to expect to achieve such rates in ALU-bound situations.
Err..., you quote vendor's peak claims and calculate your fraction of flops from your own estimate of peak flops???:shock:
No, if you look at the code, you will see that the arithmetic peak is actually a bit over 960 Gflop/s (31 cycles for the loop if there's no overhead, around 4.13 multiply-accumulates per cycle on average, or 990 Gflop/s).
My peak estimate is from the L1 bandwidth, which I assume to be the bottleneck. 100% would be unlikely as there's no cache prefetching.
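The ALU-side numbers can be reproduced as follows. The 128 multiply-adds per loop and the cycle counts come from the thread; the 160 five-wide units and the 750 MHz clock are the commonly published HD4870 specifications (800 stream processors), not something stated here, so treat them as assumptions. They do agree with the vendor's 1200 Gflop/s claim.

```python
VLIW_UNITS = 160        # assumed: 800 stream processors / 5 slots per unit
CLOCK_GHZ = 0.75        # assumed: HD4870 core clock
SLOTS_PER_UNIT = 5      # x, y, z, w, t
MADDS_PER_LOOP = 128    # two 8x8 outer products


def alu_bound_gflops(cycles_per_loop):
    """Gflop/s if the loop issues back to back on every unit."""
    madds_per_cycle = MADDS_PER_LOOP / cycles_per_loop
    return 2.0 * madds_per_cycle * VLIW_UNITS * CLOCK_GHZ


vendor_peak = 2.0 * SLOTS_PER_UNIT * VLIW_UNITS * CLOCK_GHZ

print("vendor peak:", vendor_peak)
print("31-cycle loop:", alu_bound_gflops(31))
print("32-cycle loop:", alu_bound_gflops(32))
```

A 31-cycle loop gives roughly 991 Gflop/s, and a 32-cycle loop gives exactly 960 Gflop/s under these assumptions.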
GREAT JOB, nevertheless. :mrgreen::mrgreen::runaway:
I assume that you are using a 4870. Right?
Thanks. I'm using a pair of HD4870x2, but my numbers are for a single device (one RV770 and its gigabyte of GDDR5) which is equivalent to a HD4870 board (in theory PCIe is not a bottleneck for sustained SGEMM computation).
Given that it's essentially lots of calls to the SGEMM kernel, it could be fun to try to achieve 3500 Gflop/s in single precision LU factorization (LINPACK benchmark) using Volkov's approach for multi-GPU computation.
OpenGL guy
18-Aug-2009, 19:19
I have always believed that the mul on nv and the t unit on amd's gpu's should be ignored while calculating the peak throughput. Volkov's paper does the same too.
Why would you ignore the t unit? Can you not see from the example given that it's being utilized in the majority of the slots? In fact, all 5 units are used in most of the shader.
Ooh, very impressive.
Vasily Volkov and I discussed some of this stuff:
http://forum.beyond3d.com/showthread.php?p=1290019#post1290019
I bashed my head against this for a while, mostly non-LDS, but focussed too much on maintaining cache locality for maximum throughput. And got somewhat confused :???: Not actually having a GPU to test on also puts the dampener on things.
I like the fact you're ignoring cache locality - that makes me chuckle.
So I'm very impressed, 880Gflops is quite something. Did you write this in IL? Looking at the assembly it appears it's possible to write this in Brook+ (scatter stream not standard stream output). I'm a bit puzzled why the break from the loop is in the middle - need to think about that some more.
Will you be writing a paper or presenting your results formally some time? You need a shiny graph of performance at various matrix sizes, 128x128, 256x256, etc. :razz:
I do disagree on the 5th MAD in ATI. Your code is clearly doing 5 MADs per cycle most of the time!
By the way the loop is 32 ALU cycles, 960GFLOPs peak.
Jawed
digitalwanderer
18-Aug-2009, 19:58
Pfft, some newb posting up a troll thread. :roll:
;) Hey Prune! :D
Did you write this in IL? Looking at the assembly it appears it's possible to write this in Brook+ (scatter stream not standard stream output).
Think it was Factor + bindings for Stream, if I remember rightly. Fancy sharing the app code, prunedtree? :grin:
prunedtree
19-Aug-2009, 05:09
I like the fact you're ignoring cache locality - that makes me chuckle.
Cache locality is quite important: I get only 720 Gflop/s with 1024x1024 matrices (and performance crumbles over that size) with a naive scanline ordering, and 840 Gflop/s with tiling. The texture fetch at the start of the shader loads precomputed tiled addresses.
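The actual tiling scheme is not disclosed in the thread, so the following is only an illustrative sketch of the general idea: precompute a table that maps each linear thread index to an output-block coordinate, visiting small square groups of blocks instead of whole rows. The 4x4 group size, the 64-block-wide grid and all names are hypothetical. Counting how many distinct input panels a run of consecutive threads touches shows why this helps the cache.

```python
def scanline_order(blocks_x, blocks_y):
    """Output-block coordinates in plain row-by-row order."""
    return [(bx, by) for by in range(blocks_y) for bx in range(blocks_x)]


def tiled_order(blocks_x, blocks_y, tile=4):
    """Output-block coordinates visited one tile x tile group at a time."""
    order = []
    for ty in range(0, blocks_y, tile):
        for tx in range(0, blocks_x, tile):
            for by in range(ty, min(ty + tile, blocks_y)):
                for bx in range(tx, min(tx + tile, blocks_x)):
                    order.append((bx, by))
    return order


def panels_touched(coords):
    """Distinct input panels (one per block column plus one per block row) read by these threads."""
    return len({bx for bx, _ in coords}) + len({by for _, by in coords})


# Compare the first 16 consecutive threads of each ordering on a 64 x 64 grid of blocks.
print("scanline:", panels_touched(scanline_order(64, 64)[:16]))
print("tiled:   ", panels_touched(tiled_order(64, 64, 4)[:16]))
```

Fewer distinct panels per group of neighbouring threads means more reuse of whatever is already in the texture cache.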
So I'm very impressed, 880Gflops is quite something. Did you write this in IL? Looking at the assembly it appears it's possible to write this in Brook+ (scatter stream not standard stream output). I'm a bit puzzled why the break from the loop is in the middle - need to think about that some more.
Given the difficulty of producing decent code with this framework, I don't think you could go far with Brook+. The weird things you might notice in the code are just junk to coerce the compiler into sanity.
Will you be writing a paper or presenting your results formally some time? You need a shiny graph of performance at various matrix sizes, 128x128, 256x256, etc. :razz:
Well, I wouldn't mind helping to write something similar to Volkov's paper, but I do not have much use for dense linear algebra myself. My work involves mostly boring memory-bound kernels, and this was an opportunity to have some fun.
Think it was Factor + bindings for Stream, if I remember rightly. Fancy sharing the app code, prunedtree? :grin:
Yes, I am using bindings for ATi CAL in Factor, but it's tied to some proprietary code and fairly incomplete for now. However, I do plan to release it at some point. The original post contains the high-level method anyway. This ended up being the simplest of all the approaches I tried (sigh)
By the way, it seems I can't extract more than 444 GB/s out of the texture units (even in synthetic tests with extreme locality) so that gives an upper bound of 888 Gflop/s for this kernel.
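The conversion from a measured texture bandwidth to a bound for this kernel is a one-liner, sketched here under the assumption of 4-byte floats and 8x8 output blocks (16 values fetched per 64 multiply-adds). The function name is made up.

```python
BYTES_PER_FLOAT = 4


def kernel_bound_gflops(gigabytes_per_second, rows=8, cols=8):
    """Gflop/s ceiling of a rows x cols blocked kernel fed at the given bandwidth."""
    values_per_second = gigabytes_per_second * 1e9 / BYTES_PER_FLOAT
    madds_per_second = values_per_second * rows * cols / (rows + cols)
    return 2.0 * madds_per_second / 1e9


measured_bound = kernel_bound_gflops(444.0)     # bandwidth measured in synthetic tests
theoretical_bound = kernel_bound_gflops(480.0)  # 30 billion float4 per second

print("bound at 444 GB/s:", measured_bound)
print("bound at 480 GB/s:", theoretical_bound)
print("880 Gflop/s as a fraction of each:", 880.0 / measured_bound, 880.0 / theoretical_bound)
```

Against the 960 Gflop/s theoretical bound, 880 Gflop/s is about 92%; against the measured 888 Gflop/s ceiling it is about 99%.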
rpg.314
19-Aug-2009, 06:17
Of course the t unit is being used here, and is one of the reasons why this code is able to reach such high speeds. And yet, the t unit and the mul are accidents. You would almost never be able to exploit them. This code is perhaps an exception.
However, I find that calculating the fraction of the peak from other bottlenecks (assumed or otherwise), is not the best way. Calculating the peak from actual max throughput is better as it can expose other deficiencies/bottlenecks.
OpenGL guy
19-Aug-2009, 07:54
Of course the t unit is being used here, and is one of the reasons why this code is able to reach such high speeds. And yet, the t unit and the mul are accidents. You would almost never be able to exploit them. This code is perhaps an exception.
The t unit is readily used in many shaders, I don't know where you are coming from.
However, I find that calculating the fraction of the peak from other bottlenecks (assumed or otherwise), is not the best way. Calculating the peak from actual max throughput is better as it can expose other deficiencies/bottlenecks.
Different shaders have different performance profiles. If the shader is ALU limited, then likely it will be making good use of the t unit.
rpg.314
19-Aug-2009, 07:59
The t unit is readily used in many shaders, I don't know where you are coming from.
You mean it's used as a fma unit in many shaders, and not as an sfu?
Real-world shaders, heavily optimized by the compiler, typically average 3.5-4 ALU slots per instruction.
OpenGL guy
19-Aug-2009, 08:39
You mean it's used as a fma unit in many shaders, and not as an sfu?
Yes, it can be used as both. See the example posted in this very thread.
Real-world shaders, heavily optimized by the compiler, typically average 3.5-4 ALU slots per instruction.
What compiler and what shaders?
rpg.314
19-Aug-2009, 11:26
Yes, it can be used as both. See the example posted in this very thread.
I know it can be used as both, I am wondering if you are saying that t unit is used as a fma unit in shaders?
It is, 90% of the time, not.
What compiler and what shaders?
ATI jit compiler. Bioshock has 3.5, iirc.
I know it can be used as both, I am wondering if you are saying that t unit is used as a fma unit in shaders?
Why wouldn't it?
Jawed
rpg.314
19-Aug-2009, 15:15
Oh dear, I should have made it clear upfront. I mean it is not used explicitly and implicitly, there is a good chance that it will be used only as an sfu. This gemm is an exception. In fact, this is the first exception I have seen in this regard.
Oh dear, I should have made it clear upfront. I mean it is not used explicitly and implicitly, there is a good chance that it will be used only as an sfu. This gemm is an exception. In fact, this is the first exception I have seen in this regard.
I just loaded up GSA, which happens to have the NVidia Horizon Based AO shader loaded as the last thing I looked at, and there's a whole pile of Ts doing MADs, MULs, ADDs and CNDEs as well as various transcendentals.
Jawed
rpg.314
19-Aug-2009, 16:43
Can you calculate the avg slot occupancy in that shader?
Cache locality is quite important: I get only 720 Gflop/s with 1024x1024 matrices (and performance crumbles over that size) with a naive scanline ordering, and 840 Gflop/s with tiling. The texture fetch at the start of the shader loads precomputed tiled addresses.
OK, so what you're saying is cache locality is important in allowing the 8 instructions in the TEX clause to run at full speed (or close). 4:1 ALU:TEX, in this case, doesn't provide the leeway to enable "sloppy" access patterns.
I guess the scanline access pattern ends up with L2 filled with data it junks, which increases the number of fetches into L2 to fulfil the 8 TEX instructions.
Given the difficulty of producing decent code with this framework, I don't think you could go far with Brook+. The weird things you might notice in the code are just junk to coerce the compiler into sanity.
Guess I'll leave fiddling with it until an idle moment. I can't test performance anyway, but I want to think about your tiling and striding.
Well, I wouldn't mind helping to write something similar to Volkov's paper, but I do not have much use for dense linear algebra myself. My work involves mostly boring memory-bound kernels, and this was an opportunity to have some fun.
Onto the double-precision version!
By the way, it seems I can't extract more than 444 GB/s out of the texture units (even in synthetic tests with extreme locality) so that gives an upper bound of 888 Gflop/s for this kernel.
Hmm, perhaps full cache bandwidth only comes with the same values being fetched multiple times.
Jawed
Can you calculate the avg slot occupancy in that shader?
x 67
y 59
z 56
w 61
t 51
total ALU instructions 101
utilisation = 58%
Jawed
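For reference, the 58% figure follows directly from the per-slot counts listed above: the filled slots divided by five slots per instruction group. A small sketch, with hypothetical names:

```python
# Filled slots per lane and number of instruction groups, as listed above.
slot_counts = {"x": 67, "y": 59, "z": 56, "w": 61, "t": 51}
instruction_groups = 101


def utilisation(counts, groups):
    """Fraction of ALU slots that hold an operation."""
    return sum(counts.values()) / (len(counts) * groups)


def average_slots_per_group(counts, groups):
    """Mean number of filled slots per instruction group."""
    return sum(counts.values()) / groups


print("utilisation:", utilisation(slot_counts, instruction_groups))
print("slots per group:", average_slots_per_group(slot_counts, instruction_groups))
```

That is 294 filled slots out of 505, or a little under 3 slots per instruction group for that shader.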
OpenGL guy
19-Aug-2009, 19:08
Oh dear, I should have made it clear upfront. I mean it is not used explicitly and implicitly, there is a good chance that it will be used only as an sfu. This gemm is an exception. In fact, this is the first exception I have seen in this regard.
It really isn't that uncommon to see the t unit used all over. Here's an example I am looking at currently:
; -------- Disassembly --------------------
00 TEX: ADDR(656) CNT(1) VALID_PIX
0 SAMPLE R6, R1.xyxx, t3, s3
01 ALU_PUSH_BEFORE: ADDR(64) CNT(101)
1 x: MULADD T1.x, C34.x, R6.x, -1.0f
y: MULADD T0.y, C34.x, R6.y, -1.0f
z: MULADD T0.z, C34.x, R6.z, -1.0f
w: MULADD T1.w, C34.x, R6.w, -1.0f
t: MULADD T2.y, C34.x, R6.y, -1.0f VEC_021
2 x: MUL ____, PV1.z, PV1.z
y: MUL ____, R5.z, R5.z
z: MULADD T1.z, C34.x, R6.z, -1.0f
w: MUL T0.w, R2.z, R2.z VEC_201
t: ADD R11.z, -C23.x, C24.x
3 x: DOT4 ____, T1.x, T1.x
y: DOT4 ____, T0.y, T0.y
z: DOT4 ____, PV2.x, 1.0f
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: MULADD T0.x, R5.y, R5.y, PV2.y
4 x: DOT4 ____, T1.w, T1.w
y: DOT4 ____, T2.y, T2.y
z: DOT4 ____, T1.z, T1.z
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: RSQ_sat T1.y, PV3.x
5 x: MUL T0.x, T1.x, PS4
y: MULADD ____, R5.x, R5.x, T0.x VEC_102
z: MULADD ____, R2.y, R2.y, T0.w
w: MUL T0.w, T0.z, PS4
t: RSQ_sat T2.x, PV4.x
6 x: MUL T1.x, T1.z, PS5
y: MULADD ____, R2.x, R2.x, PV5.z
z: MUL ____, T1.w, PS5
w: MUL T2.w, T0.y, T1.y
t: RSQ_sat ____, PV5.y
7 x: CNDGE T0.x, -C22.x, T0.x, PV6.z
y: MUL T0.y, R5.z, PS6
z: MUL T1.z, R5.x, PS6
w: MUL T1.w, R5.y, PS6
t: RSQ_sat T3.x, PV6.y
8 x: DOT4 ____, R4.x, R4.x
y: DOT4 T1.y, R4.y, R4.y
z: DOT4 ____, R4.z, R4.z
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: CNDGE T0.w, -C22.x, T0.w, T1.x VEC_021
9 x: MUL T0.x, T0.x, T1.z
y: MUL T0.y, T0.x, T0.y
z: MUL ____, T0.x, T1.w
w: MUL ____, T2.y, T2.x VEC_021
t: MUL ____, R2.y, T3.x
10 x: CNDGE T2.x, -C22.x, T2.w, PV9.w
y: MUL T1.y, R2.z, T3.x
z: MUL ____, R2.x, T3.x
w: MULADD T2.w, PS9, T0.w, PV9.z VEC_021
t: RSQ_sat T3.x, T1.y
11 x: DOT4 ____, R3.x, R3.x VEC_120
y: DOT4 ____, R3.y, R3.y
z: DOT4 ____, R3.z, R3.z
w: DOT4 R11.w, (0x80000000, 0.0f).x, 0.0f
t: MULADD T0.x, PV10.z, T0.w, T0.x
12 x: MUL ____, R4.y, T3.x
y: MUL ____, R4.z, T3.x
z: MUL ____, R4.x, T3.x
w: MULADD ____, T1.y, T0.w, T0.y VEC_102
t: RSQ_e T1.z, |PV11.x|
13 x: MULADD T2.x, T2.x, PV12.z, T0.x
y: MULADD T0.y, T2.x, PV12.x, T2.w
z: MULADD T0.z, T2.x, PV12.y, PV12.w
w: MULADD T2.w, R3.x, PS12, -C29.x VEC_120
t: MULADD T2.y, R3.y, PS12, -C29.y
14 x: MUL ____, PV13.z, PV13.z
z: MULADD T1.z, R3.z, T1.z, -C29.z
t: RCP_e ____, T1.z
15 x: DOT4 T0.x, T2.x, T2.x
y: DOT4 ____, T0.y, T0.y
z: DOT4 ____, PV14.x, 1.0f
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: ADD ____, PS14, -C28.x
16 x: DOT4 ____, T2.w, T2.w
y: DOT4 T1.y, T2.y, T2.y
z: DOT4 ____, T1.z, T1.z
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: MUL R1.w, PS15, C28.y CLAMP
17 x: ADD ____, PS16, -1.0f
t: RSQ_sat ____, T0.x
18 x: MUL R12.x, T2.x, PS17
y: MUL R11.y, T0.y, PS17
z: MUL R12.z, T0.z, PS17
w: CNDGE R2.w, PV17.x, 0.0f, 1.0f
t: RSQ_sat ____, T1.y
19 x: MUL ____, T2.w, PS18
y: MUL ____, T2.y, PS18
z: MUL ____, T1.z, PS18
20 x: DOT4 ____, R12.x, PV19.x
y: DOT4 ____, R11.y, PV19.y
z: DOT4 ____, R12.z, PV19.z
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
21 w: MAX R12.w, PV20.x, C38.z
22 x: PREDNE ____, R2.w, -R2.w UPDATE_EXEC_MASK UPDATE_PRED
02 JUMP POP_CNT(1) ADDR(36) VALID_PIX
03 ALU: ADDR(165) CNT(121)
23 x: ADD R5.x, -R3.x, C21.x
y: ADD R3.y, -R3.y, C21.y
z: ADD R1.z, -R3.z, C21.z
w: MOV R6.w, 1.0f
t: MOV R2.z, C35.z
24 x: DOT4 ____, PV23.x, C18.x
y: DOT4 ____, PV23.y, C18.y
z: DOT4 T0.z, PV23.z, C18.z
w: DOT4 ____, PV23.w, C18.w
t: MOV R3.x, 0.0f
25 x: MUL R2.x, C32.z, -1.0f
y: MUL R2.y, C32.w, -1.0f
z: ADD ____, PV24.x, R2.z
w: ADD ____, PV24.x, -C32.y
t: ADD T0.w, PV24.x, -C32.z
26 x: ADD ____, T0.z, -C32.w
y: ADD ____, T0.z, PV25.y
z: CNDGE T1.z, PV25.z, 0.0f, 1.0f
w: ADD ____, T0.z, PV25.x
t: CNDGE T0.y, PV25.w, 1.0f, 0.0f
27 x: CNDGE ____, PV26.w, 0.0f, 1.0f
y: CNDGE ____, T0.w, 1.0f, 0.0f
z: CNDGE ____, PV26.y, 0.0f, 1.0f
w: CNDGE ____, PV26.x, 1.0f, 0.0f
t: ADD ____, T0.z, C36.x
28 x: MUL ____, PV27.x, T0.y
y: MUL ____, PV27.z, PV27.y
z: MUL ____, T1.z, PV27.w
t: MUL R6.x, PS27, C36.y CLAMP
29 x: DOT4 R8.x, PV28.x, 1.0f
y: DOT4 ____, PV28.y, C33.y
z: DOT4 ____, PV28.z, C33.z
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: ADD T0.y, PS28, -1.0f
30 x: ADD ____, PV29.x, -1.0f
y: ADD R2.y, PV29.x, -1.0f
z: ADD ____, PV29.x, -C33.w
w: ADD ____, PV29.x, -C33.y
t: ADD R2.z, PV29.x, -C33.y
31 x: CNDGE R4.x, PV30.w, PV30.w, -PS30
y: CNDGE R5.y, PV30.x, PV30.x, -PV30.y
z: CNDGE T3.z, PV30.z, PV30.z, -R8.x
w: ADD R2.w, R8.x, -C33.z
t: ADD ____, R8.x, -C33.z
32 x: CNDGE ____, -PV31.z, C3.z, 0.0f
y: CNDGE ____, -PV31.z, C3.y, 0.0f
z: CNDGE ____, -PV31.z, C3.x, 0.0f
w: CNDGE ____, -PV31.z, C3.w, 0.0f
t: CNDGE R5.w, PS31, PS31, -PV31.w
33 x: CNDGE T0.x, -R5.y, C7.z, PV32.x
y: CNDGE T1.y, -R5.y, C7.y, PV32.y
z: CNDGE T1.z, -R5.y, C7.x, PV32.z
w: CNDGE T0.w, -R5.y, C7.w, PV32.w
t: MUL R2.z, R8.x, (0x3E800000, 0.25f).x
34 x: CNDGE T1.x, -T3.z, C0.z, 0.0f
y: CNDGE T0.y, -T3.z, C0.y, 0.0f
z: CNDGE T0.z, -T3.z, C0.x, 0.0f
w: CNDGE T1.w, -T3.z, C0.w, 0.0f
t: CNDGE R3.z, T0.y, 0.0f, 1.0f
35 x: CNDGE T2.x, -T3.z, C1.x, 0.0f
y: CNDGE T2.y, -T3.z, C1.w, 0.0f
z: CNDGE T2.z, -T3.z, C1.y, 0.0f
w: CNDGE T2.w, -T3.z, C1.z, 0.0f
36 x: CNDGE T0.x, -R4.x, C11.z, T0.x
y: CNDGE T1.y, -R4.x, C11.y, T1.y
z: CNDGE T1.z, -R4.x, C11.x, T1.z
w: CNDGE T0.w, -R4.x, C11.w, T0.w
37 x: CNDGE T1.x, -R5.y, C4.z, T1.x
y: CNDGE T0.y, -R5.y, C4.y, T0.y
z: CNDGE T0.z, -R5.y, C4.x, T0.z
w: CNDGE T1.w, -R5.y, C4.w, T1.w
38 x: CNDGE T2.x, -R5.y, C5.x, T2.x
y: CNDGE T2.y, -R5.y, C5.w, T2.y
z: CNDGE T2.z, -R5.y, C5.y, T2.z
w: CNDGE T2.w, -R5.y, C5.z, T2.w
39 x: CNDGE T0.x, -R5.w, C15.x, T1.z
y: CNDGE T1.y, -R5.w, C15.y, T1.y
z: CNDGE T1.z, -R5.w, C15.z, T0.x
w: CNDGE ____, -R5.w, C15.w, T0.w
40 x: CNDGE T1.x, -R4.x, C8.z, T1.x
y: CNDGE T0.y, -R4.x, C8.y, T0.y
z: CNDGE T0.z, -R4.x, C8.x, T0.z
w: CNDGE T1.w, -R4.x, C8.w, T1.w VEC_021
t: MUL ____, R6.w, PV39.w
41 x: CNDGE T2.x, -R4.x, C9.x, T2.x
y: CNDGE T2.y, -R4.x, C9.w, T2.y
z: CNDGE T1.z, -R4.x, C9.y, T2.z VEC_120
w: CNDGE T2.w, -R4.x, C9.z, T2.w
t: MULADD ____, R1.z, T1.z, PS40
42 x: DOT4 ____, R5.x, T0.x
y: DOT4 ____, R3.y, T1.y
z: DOT4 ____, PS41, 1.0f
w: DOT4 ____, 0.0f, 0.0f
t: CNDGE T0.x, -R5.w, C12.x, T0.z VEC_021
43 x: CNDGE T1.x, -R5.w, C13.x, T2.x
y: CNDGE T0.y, -R5.w, C12.y, T0.y
z: CNDGE T0.z, -R5.w, C12.z, T1.x VEC_021
w: CNDGE ____, -R5.w, C12.w, T1.w
t: RCP_e R13.w, PV42.x
44 x: CNDGE R2.x, -T3.z, C2.x, 0.0f
y: CNDGE T2.y, -R5.w, C13.y, T1.z
z: CNDGE T1.z, -R5.w, C13.z, T2.w VEC_021
w: CNDGE T2.w, -R5.w, C13.w, T2.y
t: MUL ____, R6.w, PV43.w
45 x: DOT4 ____, R5.x, T0.x
y: DOT4 R2.y, R3.y, T0.y
z: DOT4 ____, R1.z, T0.z
w: DOT4 ____, PS44, 1.0f
t: CNDGE R4.z, -T3.z, C2.y, 0.0f
46 x: DOT4 ____, R5.x, T1.x
y: DOT4 ____, R3.y, T2.y
z: DOT4 ____, R1.z, T1.z
w: DOT4 ____, R6.w, T2.w
t: MUL R9.y, R13.w, PV45.x
47 x: MOV R7.x, PS46
y: CNDGE R4.y, -T3.z, C2.z, 0.0f
z: MUL R5.z, R13.w, PV46.x
w: ADD R8.w, PS46, C31.z
t: CNDGE R6.y, -T3.z, C2.w, 0.0f
04 ALU: ADDR(286) CNT(11)
48 x: MULADD R9.x, R2.y, R13.w, C31.z
y: CNDGE ____, -R5.y, C6.x, R2.x VEC_120
z: CNDGE ____, -R5.y, C6.y, R4.z VEC_120
w: MULADD R9.w, R5.z, (0x3E800000, 0.25f).x, R2.z VEC_102
t: CNDGE R2.w, -R5.y, C6.z, R4.y VEC_021
49 x: CNDGE R2.x, -R5.y, C6.w, R6.y
y: ADD R7.y, PV48.w, C31.w
z: CNDGE R2.z, -R4.x, C10.x, PV48.y
w: CNDGE R4.w, -R4.x, C10.y, PV48.z
t: ADD R8.y, PV48.w, C31.w
05 TEX: ADDR(658) CNT(6) VALID_PIX
50 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
51 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
52 SAMPLE_G R4.__x_, R7.xyxx, t0, s0
53 SAMPLE_G R8.___x, R8.wyww, t0, s0
54 SAMPLE_G R8._x__, R9.xwxx, t0, s0
55 SAMPLE_G R7.x___, R9.ywyy, t0, s0
06 ALU_PUSH_BEFORE: ADDR(297) CNT(29)
56 x: CNDGE T1.x, -R5.w, C14.x, R2.z
y: CNDGE ____, -R4.x, C10.w, R2.x
z: CNDGE T3.z, -R5.w, C14.y, R4.w
w: CNDGE ____, -R4.x, C10.z, R2.w VEC_021
57 x: MUL ____, R9.y, C31.x
y: MUL T2.y, R9.w, C31.y
z: CNDGE ____, -R5.w, C14.z, PV56.w VEC_120
w: CNDGE ____, -R5.w, C14.w, PV56.y VEC_120
58 x: DOT4 ____, R5.x, T1.x
y: DOT4 ____, R3.y, T3.z
z: DOT4 R1.z, R1.z, PV57.z
w: DOT4 ____, R6.w, PV57.w
t: FRACT T0.y, PV57.x
59 x: MULADD ____, PV58.x, R13.w, -R7.x
y: MULADD ____, PV58.x, R13.w, -R8.w
z: MULADD ____, PV58.x, R13.w, -R4.z
w: MULADD ____, PV58.x, R13.w, -R8.y VEC_201
t: FRACT T2.w, T2.y
60 x: CNDGE T1.x, PV59.x, 0.0f, 1.0f
y: CNDGE ____, PV59.y, 0.0f, 1.0f
z: CNDGE T3.z, PV59.z, 0.0f, 1.0f
w: CNDGE ____, PV59.w, 0.0f, 1.0f
61 y: ADD ____, -PV60.z, PV60.y
z: ADD ____, -PV60.x, PV60.w
62 x: MULADD T1.x, PV61.z, T0.y, T1.x
w: MULADD ____, PV61.y, T0.y, T3.z
63 x: ADD ____, -PV62.x, PV62.w
64 z: MULADD R8.z, PV63.x, T2.w, T1.x
65 x: PREDNE ____, R3.z, -R3.z UPDATE_EXEC_MASK UPDATE_PRED
07 JUMP POP_CNT(1) ADDR(35) VALID_PIX
08 ALU: ADDR(326) CNT(20)
66 x: MULADD T0.x, -R6.x, C31.z, C31.z
y: MULADD T0.y, -R6.x, C31.w, C31.w
z: ADD ____, R8.x, -1.0f VEC_120
w: ADD ____, R8.x, 0.0f VEC_120
67 y: MOV ____, -|PV66.w|
z: MOV T0.z, -|PV66.z|
68 x: CNDGE ____, PV67.y, C19.x, 0.0f
w: CNDGE ____, PV67.y, C19.y, 0.0f
69 x: CNDGE ____, T0.z, C20.y, PV68.w
y: CNDGE ____, T0.z, C20.x, PV68.x
70 y: MUL R3.y, PV69.y, T0.x
z: MUL R3.z, PV69.x, T0.y
71 y: MULADD R4.y, PV70.y, -C33.z, R9.y
z: MULADD R4.z, PV70.z, -C33.z, R9.w
t: MULADD R6.y, PV70.y, C36.z, R9.y VEC_021
72 x: ADD R2.x, PV71.y, C31.z
y: ADD R2.y, PV71.z, C31.w
z: MUL R5.z, PV71.y, C31.x
w: ADD R4.w, PV71.z, C31.w
t: ADD R4.x, PV71.y, C31.z
09 TEX: ADDR(670) CNT(6) VALID_PIX
73 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
74 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
75 SAMPLE_G R2.___x, R2.xyxx, t0, s0
76 SAMPLE_G R2.x___, R4.yzyy, t0, s0
77 SAMPLE_G R2._x__, R4.xzxx, t0, s0
78 SAMPLE_G R2.__x_, R4.ywyy, t0, s0
10 ALU: ADDR(346) CNT(25)
79 x: MUL ____, R4.z, C31.y VEC_120
y: MULADD ____, R1.z, R13.w, -R2.y VEC_210
z: MULADD ____, R1.z, R13.w, -R2.x VEC_210
w: MULADD ____, R1.z, R13.w, -R2.z VEC_210
t: MULADD T0.x, R1.z, R13.w, -R2.w
80 x: CNDGE ____, PV79.y, 0.0f, 1.0f
y: FRACT R4.y, PV79.x
z: FRACT T0.z, R5.z
w: CNDGE T0.w, PV79.z, 0.0f, 1.0f
t: CNDGE T1.z, PV79.w, 0.0f, 1.0f
81 x: ADD R2.x, R6.y, C31.z
y: CNDGE ____, T0.x, 0.0f, 1.0f
z: MULADD R6.z, R3.z, C36.w, R9.w
w: ADD ____, -PV80.w, PV80.x
t: ADD R6.x, R6.y, C31.z
82 x: ADD ____, -T1.z, PV81.y
y: ADD R2.y, PV81.z, C31.w
z: MULADD R2.z, PV81.w, T0.z, T0.w
w: ADD R6.w, PV81.z, C31.w
t: MUL ____, R6.y, C31.x
83 x: FRACT R4.x, PS82
y: MULADD R7.y, R3.y, C36.w, R9.y
z: MUL R5.z, R6.z, C31.y
w: MULADD R4.w, PV82.x, T0.z, T1.z
t: MULADD R8.y, R3.y, C33.z, R9.y VEC_021
11 TEX: ADDR(682) CNT(6) VALID_PIX
84 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
85 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
86 SAMPLE_G R2.___x, R2.xyxx, t0, s0
87 SAMPLE_G R2.x___, R6.yzyy, t0, s0
88 SAMPLE_G R2._x__, R6.xzxx, t0, s0
89 SAMPLE_G R6.__x_, R6.ywyy, t0, s0
12 ALU: ADDR(371) CNT(35)
90 x: MULADD ____, R1.z, R13.w, -R2.x VEC_021
y: MULADD ____, R1.z, R13.w, -R6.z VEC_021
z: MULADD ____, R1.z, R13.w, -R2.w VEC_021
w: MULADD ____, R1.z, R13.w, -R2.y VEC_021
t: FRACT T0.w, R5.z
91 x: CNDGE T0.x, PV90.y, 0.0f, 1.0f
y: CNDGE T0.y, PV90.x, 0.0f, 1.0f
z: CNDGE ____, PV90.w, 0.0f, 1.0f
w: CNDGE ____, PV90.z, 0.0f, 1.0f
t: ADD ____, -R2.z, R4.w
92 x: ADD ____, -PV91.y, PV91.z
y: MUL T1.y, R7.y, C31.x
z: MULADD ____, PS91, R4.y, R2.z
w: ADD ____, -PV91.x, PV91.w
t: MULADD R7.z, R3.z, C36.z, R9.w VEC_021
93 x: ADD T0.x, R8.z, PV92.z
y: MULADD T0.y, PV92.x, R4.x, T0.y
z: MULADD ____, PV92.w, R4.x, T0.x
w: ADD R4.w, R7.y, C31.z
t: ADD R4.y, PS92, C31.w
94 x: ADD R7.x, R7.y, C31.z
y: MUL ____, R7.z, C31.y
z: FRACT R5.z, T1.y VEC_120
w: ADD ____, -PV93.y, PV93.z
t: ADD R7.w, R7.z, C31.w
95 x: FRACT R6.x, PV94.y
y: MULADD ____, PV94.w, T0.w, T0.y VEC_021
z: MULADD R8.z, R3.z, C33.z, R9.w VEC_021
w: ADD R2.w, R8.y, C31.z
t: ADD R8.x, R8.y, C31.z
96 x: MUL R2.x, R8.y, C31.x
y: ADD R2.y, PV95.z, C31.w
z: ADD R6.z, T0.x, PV95.y
w: ADD R8.w, PV95.z, C31.w
t: MUL R2.z, PV95.z, C31.y
13 TEX: ADDR(694) CNT(7) VALID_PIX
97 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
98 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
99 SAMPLE_G R4.___x, R4.wyww, t0, s0
100 SAMPLE_G R2.___x, R2.wyww, t0, s0
101 SAMPLE_G R4.x___, R7.yzyy, t0, s0
102 SAMPLE_G R2._x__, R7.xzxx, t0, s0
103 SAMPLE_G R7.__x_, R7.ywyy, t0, s0
14 ALU: ADDR(406) CNT(16)
104 x: MULADD ____, R1.z, R13.w, -R4.w VEC_210
y: MULADD ____, R1.z, R13.w, -R2.y VEC_210
z: MULADD ____, R1.z, R13.w, -R4.x VEC_210
w: MULADD ____, R1.z, R13.w, -R7.z VEC_210
t: MULADD T0.z, R1.z, R13.w, -R2.w VEC_120
105 x: CNDGE ____, PV104.y, 0.0f, 1.0f
y: CNDGE ____, PV104.x, 0.0f, 1.0f
z: CNDGE T1.z, PV104.w, 0.0f, 1.0f
w: CNDGE T0.w, PV104.z, 0.0f, 1.0f
t: FRACT R4.x, R2.x
106 x: ADD ____, -PV105.z, PV105.y
y: FRACT R7.y, R2.z
z: CNDGE R2.z, T0.z, 0.0f, 1.0f VEC_120
w: ADD ____, -PV105.w, PV105.x
107 z: MULADD R5.z, PV106.w, R5.z, T0.w
w: MULADD R2.w, PV106.x, R5.z, T1.z
15 TEX: ADDR(708) CNT(5) VALID_PIX
108 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
109 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
110 SAMPLE_G R2.x___, R8.yzyy, t0, s0
111 SAMPLE_G R2._x__, R8.xzxx, t0, s0
112 SAMPLE_G R8.__x_, R8.ywyy, t0, s0
16 ALU_PUSH_BEFORE: ADDR(422) CNT(18)
113 x: MULADD ____, R1.z, R13.w, -R2.x
y: MULADD ____, R1.z, R13.w, -R8.z
z: ADD ____, -R5.z, R2.w VEC_120
w: MULADD ____, R1.z, R13.w, -R2.y
114 x: CNDGE ____, PV113.w, 0.0f, 1.0f
y: CNDGE T0.y, PV113.x, 0.0f, 1.0f
z: MULADD ____, PV113.z, R6.x, R5.z
w: CNDGE T0.w, PV113.y, 0.0f, 1.0f
115 x: ADD T0.x, R6.z, PV114.z
y: ADD ____, -PV114.y, PV114.x
w: ADD ____, -PV114.w, R2.z
116 y: MULADD T0.y, PV115.y, R4.x, T0.y
z: MULADD ____, PV115.w, R4.x, T0.w
117 w: ADD ____, -PV116.y, PV116.z
118 y: MULADD ____, PV117.w, R7.y, T0.y
119 z: ADD R8.z, T0.x, PV118.y
120 x: CNDGE R2.x, -PV119.z, 0.0f, 1.0f
121 x: PREDNE ____, R2.x, -R2.x UPDATE_EXEC_MASK UPDATE_PRED
17 ALU_PUSH_BEFORE: ADDR(440) CNT(3)
122 y: ADD ____, R8.z, C37.x
123 x: CNDGE R2.x, PV122.y, 1.0f, 0.0f
124 x: PREDNE ____, R2.x, -R2.x UPDATE_EXEC_MASK UPDATE_PRED
18 JUMP ADDR(20) VALID_PIX
19 ALU: ADDR(443) CNT(1)
125 z: MOV R8.z, 1.0f
20 ELSE POP_CNT(1) ADDR(34) VALID_PIX
21 ALU: ADDR(444) CNT(8)
126 y: MULADD R4.y, R3.y, C38.x, R9.y
z: MULADD R4.z, R3.z, C38.y, R9.w
t: MULADD R5.y, R3.y, C37.y, R9.y VEC_021
127 x: ADD R2.x, PV126.y, C31.z
y: ADD R2.y, PV126.z, C31.w
z: MUL R6.z, PV126.y, C31.x
w: ADD R4.w, PV126.z, C31.w
t: ADD R4.x, PV126.y, C31.z
22 TEX: ADDR(718) CNT(6) VALID_PIX
128 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
129 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
130 SAMPLE_G R2.___x, R2.xyxx, t0, s0
131 SAMPLE_G R2.x___, R4.yzyy, t0, s0
132 SAMPLE_G R2._x__, R4.xzxx, t0, s0
133 SAMPLE_G R2.__x_, R4.ywyy, t0, s0
23 ALU: ADDR(452) CNT(15)
134 x: MULADD ____, R1.z, R13.w, -R2.x VEC_021
y: MULADD ____, R1.z, R13.w, -R2.w VEC_021
z: MULADD ____, R1.z, R13.w, -R2.z VEC_021
w: MULADD ____, R1.z, R13.w, -R2.y VEC_021
t: MUL R4.w, R4.z, C31.y
135 x: CNDGE R4.x, PV134.x, 0.0f, 1.0f
y: CNDGE R4.y, PV134.y, 0.0f, 1.0f
z: CNDGE R7.z, PV134.z, 0.0f, 1.0f
w: CNDGE R6.w, PV134.w, 0.0f, 1.0f
t: MULADD R5.z, R3.z, C37.z, R9.w VEC_021
136 x: ADD R2.x, R5.y, C31.z
y: ADD R2.y, PS135, C31.w
z: MUL R4.z, R5.y, C31.x
w: ADD R5.w, PS135, C31.w
t: ADD R5.x, R5.y, C31.z
24 TEX: ADDR(730) CNT(6) VALID_PIX
137 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
138 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
139 SAMPLE_G R2.___x, R2.xyxx, t0, s0
140 SAMPLE_G R2.x___, R5.yzyy, t0, s0
141 SAMPLE_G R2._x__, R5.xzxx, t0, s0
142 SAMPLE_G R2.__x_, R5.ywyy, t0, s0
25 ALU: ADDR(467) CNT(39)
143 x: MULADD ____, R1.z, R13.w, -R2.y VEC_210
y: MULADD ____, R1.z, R13.w, -R2.x VEC_210
z: MULADD ____, R1.z, R13.w, -R2.z VEC_210
w: MUL ____, R5.z, C31.y VEC_120
t: MULADD T0.w, R1.z, R13.w, -R2.w
144 x: FRACT T0.x, PV143.w
y: FRACT T0.y, R4.z
z: CNDGE T0.z, PV143.y, 0.0f, 1.0f
w: CNDGE ____, PV143.x, 0.0f, 1.0f
t: CNDGE T1.y, PV143.z, 0.0f, 1.0f
145 x: CNDGE ____, T0.w, 0.0f, 1.0f
y: ADD ____, -PV144.z, PV144.w
z: FRACT T1.z, R6.z
w: FRACT R7.w, R4.w VEC_201
t: ADD ____, -R4.x, R6.w
146 x: MULADD R11.x, PS145, PV145.z, R4.x
y: ADD ____, -T1.y, PV145.x
z: MULADD T0.z, PV145.y, T0.y, T0.z
w: ADD ____, -R7.z, R4.y VEC_021
t: MULADD R6.z, R3.z, C40.y, R9.w VEC_021
147 x: MULADD R8.x, PV146.w, T1.z, R7.z
y: ADD R7.y, PS146, C31.w
z: MUL ____, PS146, C31.y
w: MULADD ____, PV146.y, T0.y, T1.y
t: ADD R6.w, PS146, C31.w
148 x: FRACT R2.x, PV147.z
y: ADD ____, -T0.z, PV147.w
z: MULADD R5.z, R3.z, C40.w, R9.w VEC_120
t: MULADD R6.y, R3.y, C40.x, R9.y VEC_021
149 x: MULADD ____, PV148.y, T0.x, T0.z
y: MULADD R5.y, R3.y, C40.z, R9.y
z: ADD R7.z, PS148, C31.z
w: MUL ____, PS148, C31.x
t: ADD R6.x, PS148, C31.z
150 x: ADD R4.x, PV149.y, C31.z
y: FRACT R4.y, PV149.w
z: ADD R4.z, R8.z, PV149.x
w: ADD R4.w, R5.z, C31.w VEC_120
t: ADD R5.x, PV149.y, C31.z
26 TEX: ADDR(742) CNT(7) VALID_PIX
151 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
152 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
153 SAMPLE_G R2.___x, R7.zyzz, t0, s0
154 SAMPLE_G R4.___x, R4.xwxx, t0, s0
155 SAMPLE_G R4.x___, R6.yzyy, t0, s0
156 SAMPLE_G R7._x__, R6.xzxx, t0, s0
157 SAMPLE_G R6.__x_, R6.ywyy, t0, s0
27 ALU: ADDR(506) CNT(25)
158 x: MULADD ____, R1.z, R13.w, -R7.y VEC_021
y: MULADD ____, R1.z, R13.w, -R4.x VEC_021
z: MULADD ____, R1.z, R13.w, -R2.w VEC_021
w: MULADD ____, R1.z, R13.w, -R6.z VEC_021
t: ADD R5.w, R5.z, C31.w
159 x: CNDGE ____, PV158.z, 0.0f, 1.0f
y: CNDGE T0.y, PV158.w, 0.0f, 1.0f
z: CNDGE ____, PV158.x, 0.0f, 1.0f
w: CNDGE T0.w, PV158.y, 0.0f, 1.0f
t: MUL ____, R5.y, C31.x
160 x: MUL ____, R5.z, C31.y
y: MULADD ____, R1.z, R13.w, -R4.w VEC_120
z: ADD ____, -PV159.y, PV159.x
w: ADD ____, -PV159.w, PV159.z
t: FRACT R4.w, PS159
161 x: FRACT R7.x, PV160.x
y: MULADD R4.y, PV160.w, R4.y, T0.w VEC_021
z: CNDGE R6.z, PV160.y, 0.0f, 1.0f
w: MULADD ____, PV160.z, R4.y, T0.y VEC_021
t: MULADD R10.z, R3.z, C39.y, R9.w VEC_021
162 x: ADD R4.x, -PV161.y, PV161.w
y: MULADD R10.y, R3.y, C39.x, R9.y
z: ADD R9.z, PS161, C31.w
w: ADD R10.w, PS161, C31.w
t: MUL R7.z, PS161, C31.y
28 TEX: ADDR(756) CNT(6) VALID_PIX
163 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
164 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
165 SAMPLE_G R6.x___, R5.yzyy, t0, s0
166 SAMPLE_G R7._x__, R5.xzxx, t0, s0
167 SAMPLE_G R5.__x_, R5.ywyy, t0, s0
168 SAMPLE_G R5.x___, R10.yzyy, t0, s0
29 ALU: ADDR(531) CNT(33)
169 x: MULADD ____, R1.z, R13.w, -R5.z
y: MULADD ____, R1.z, R13.w, -R7.y VEC_021
z: MULADD ____, R4.x, R2.x, R4.y VEC_120
w: MULADD ____, R1.z, R13.w, -R6.x VEC_120
t: ADD R9.x, R10.y, C31.z
170 x: CNDGE T0.x, PV169.w, 0.0f, 1.0f
y: CNDGE T0.y, PV169.x, 0.0f, 1.0f
z: CNDGE ____, PV169.y, 0.0f, 1.0f
w: ADD T0.w, R4.z, PV169.z
t: ADD R10.x, R10.y, C31.z
171 x: MUL ____, R10.y, C31.x
y: ADD ____, -PV170.y, R6.z
z: MULADD ____, R1.z, R13.w, -R5.x
w: ADD ____, -PV170.x, PV170.z
t: FRACT R2.x, R7.z
172 x: MULADD T0.x, PV171.w, R4.w, T0.x
y: FRACT R7.y, PV171.x
z: MULADD ____, PV171.y, R4.w, T0.y VEC_120
w: CNDGE R2.w, PV171.z, 0.0f, 1.0f
t: MULADD R6.y, R3.y, C39.z, R9.y VEC_021
173 x: ADD R5.x, PS172, C31.z
y: ADD ____, -PV172.x, PV172.z
z: MULADD R6.z, R3.z, C39.w, R9.w
w: MUL ____, PS172, C31.x
t: ADD R6.x, PS172, C31.z
174 x: MULADD ____, PV173.y, R7.x, T0.x
y: ADD R5.y, PV173.z, C31.w
z: MUL ____, PV173.z, C31.y
w: ADD R6.w, PV173.z, C31.w
t: FRACT R8.w, PV173.w
175 x: ADD R8.x, -R11.x, R8.x
y: FRACT R2.y, PV174.z
z: ADD R7.z, T0.w, PV174.x
30 TEX: ADDR(768) CNT(6) VALID_PIX
176 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
177 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
178 SAMPLE_G R4.___x, R9.xzxx, t0, s0
179 SAMPLE_G R4._x__, R10.xzxx, t0, s0
180 SAMPLE_G R5.___x, R5.xyxx, t0, s0
181 SAMPLE_G R10.__x_, R10.ywyy, t0, s0
31 ALU: ADDR(564) CNT(13)
182 x: MULADD ____, R1.z, R13.w, -R4.y
y: MULADD ____, R1.z, R13.w, -R10.z
z: MULADD ____, R1.z, R13.w, -R4.w
w: MULADD R4.w, R8.x, R7.w, R11.x VEC_102
183 x: CNDGE ____, PV182.z, 0.0f, 1.0f
y: CNDGE T0.y, PV182.y, 0.0f, 1.0f
z: CNDGE ____, PV182.x, 0.0f, 1.0f
w: MULADD ____, R1.z, R13.w, -R5.w
184 y: CNDGE R10.y, PV183.w, 0.0f, 1.0f
z: ADD ____, -PV183.y, PV183.x
w: ADD ____, -R2.w, PV183.z
185 y: MULADD R4.y, PV184.w, R7.y, R2.w
w: MULADD R2.w, PV184.z, R7.y, T0.y
32 TEX: ADDR(780) CNT(5) VALID_PIX
186 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
187 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
188 SAMPLE_G R8.x___, R6.yzyy, t0, s0
189 SAMPLE_G R7._x__, R6.xzxx, t0, s0
190 SAMPLE_G R6.__x_, R6.ywyy, t0, s0
33 ALU_POP_AFTER: ADDR(577) CNT(18)
191 x: MULADD ____, R1.z, R13.w, -R6.z
y: MULADD ____, R1.z, R13.w, -R7.y
z: ADD ____, -R4.y, R2.w VEC_021
w: MULADD ____, R1.z, R13.w, -R8.x
192 x: CNDGE T0.x, PV191.w, 0.0f, 1.0f
y: CNDGE ____, PV191.y, 0.0f, 1.0f
z: MULADD ____, PV191.z, R2.x, R4.y
w: CNDGE T0.w, PV191.x, 0.0f, 1.0f
193 x: ADD ____, -PV192.x, PV192.y
y: ADD ____, -PV192.w, R10.y
w: ADD T1.w, R7.z, PV192.z
194 x: MULADD T0.x, PV193.x, R8.w, T0.x
z: MULADD ____, PV193.y, R8.w, T0.w
195 y: ADD ____, -PV194.x, PV194.z
196 x: MULADD ____, PV195.y, R2.y, T0.x
197 y: ADD ____, T1.w, PV196.x
198 x: ADD ____, R4.w, PV197.y
199 z: MUL R8.z, PV198.x, C37.w
34 POP (2) ADDR(35)
35 ALU_POP_AFTER: ADDR(595) CNT(2)
200 x: ADD ____, R3.w, -R8.z
201 w: MULADD R3.w, R1.w, PV200.x, R8.z
36 TEX: ADDR(790) CNT(2) VALID_PIX
202 SAMPLE R2, R1.xyxx, t2, s2
203 SAMPLE R1, R1.xyxx, t1, s1
37 ALU: ADDR(597) CNT(48)
204 x: DOT4 ____, R12.x, -C29.x
y: DOT4 ____, R11.y, -C29.y
z: DOT4 ____, R12.z, -C29.z
w: DOT4 T1.w, (0x80000000, 0.0f).x, 0.0f
t: MULADD T0.w, R2.w, R11.z, C23.x
205 x: MUL ____, C27.x, C27.x
w: MAX ____, PV204.x, 0.0f
t: LOG_sat ____, |R12.w|
206 x: MUL ____, PV205.w, R1.y
y: MUL ____, T0.w, PS205
z: MUL ____, PV205.w, R1.x
w: MUL ____, PV205.w, R1.z
t: RCP_e ____, PV205.x
207 x: MUL ____, PV206.z, C30.x
y: MUL T1.y, R11.w, PS206 CLAMP
z: MUL T0.z, PV206.w, C30.z
w: MUL ____, PV206.x, C30.y
t: EXP_e ____, PV206.y
208 x: MUL ____, R2.z, PS207
y: MUL ____, R2.y, PS207
z: MUL ____, R2.x, PS207
w: MUL ____, R3.w, PV207.x
t: MUL T0.y, R3.w, PV207.w
209 x: MUL ____, R3.w, T0.z
y: MUL ____, PV208.x, C30.z
z: MUL ____, PV208.y, C30.y
w: MUL ____, PV208.z, C30.x
t: MULADD T0.w, R1.x, R0.x, PV208.w
210 x: MUL ____, PV209.w, C25.x
y: MULADD T0.y, R1.z, R0.z, PV209.x
z: MULADD T0.z, R1.y, R0.y, T0.y
w: MUL ____, PV209.z, C25.x
t: MUL ____, PV209.y, C25.x
211 x: MUL T0.x, T1.y, C27.y
y: MULADD ____, PS210, R3.w, PV210.y
z: MULADD ____, PV210.w, R3.w, PV210.z
w: MULADD ____, PV210.x, R3.w, T0.w
212 y: CNDGE T0.y, -T1.w, T0.y, PV211.y
z: CNDGE T0.z, -T1.w, T0.z, PV211.z
w: CNDGE T0.w, -T1.w, T0.w, PV211.w
213 x: ADD ____, -PV212.y, C26.z
y: ADD ____, -PV212.z, C26.y
z: ADD ____, -PV212.w, C26.x
w: MUL R0.w, R0.w, R1.w
214 x: MULADD R0.x, T0.x, PV213.z, T0.w
y: MULADD R0.y, T0.x, PV213.y, T0.z
z: MULADD R0.z, T0.x, PV213.x, T0.y
38 EXP_DONE: PIX0, R0
The overall utilization is about 80%, but that is not because the t unit goes unused; it is because of some scalar dependencies.
Oh dear, I should have made it clear upfront. I mean that it is not used explicitly, and implicitly there is a good chance that it will be used only as an SFU. This gemm is an exception. In fact, it is the first exception I have seen in this regard.
This is incorrect. You may be confused by how scheduling is prioritized - namely, "common" instructions will first be assigned to the "vector" ALUs (x,y,z,w) and only if those are occupied will they be assigned to the transcendental unit as well. Of course, transcendental ops (or stuff like INT MUL/DIV, for example) get scheduled to the trans ALU implicitly. There are also some GPR read port restrictions in place, which end up not always allowing an instruction to be scheduled there. But it does MADs just fine, and quite often, really.
OpenGL guy
19-Aug-2009, 21:36
Real-world shaders, heavily optimized by the compiler, typically average 3.5-4 ALU slots per instruction.
Average utilization doesn't really indicate how often the t unit is being used. If you have a bunch of very scalar code, utilization may go down, but you may find the rest of the code is fully utilizing all slots.
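For anyone who wants to put numbers on this, here is a rough sketch of how a disassembly like the ones posted in this thread could be tallied. It assumes the listing format seen here (an optional instruction-group number, then a slot letter x/y/z/w/t and a colon); `slot_stats` and `SAMPLE` are hypothetical names made up for the example, not part of any ATi tool.

```python
import re

# Matches one ALU slot line, e.g. "29 x: DOT4 ..." or " y: DOT4 ...".
# Clause headers ("05 TEX: ...") and texture instructions do not match.
SLOT_RE = re.compile(r"^\s*(?:(\d+)\s+)?([xyzwt]):\s+\S")

def slot_stats(listing):
    """Tally ALU slots per instruction group in a disassembly listing."""
    groups = []  # one list of slot letters per instruction group
    for line in listing.splitlines():
        m = SLOT_RE.match(line)
        if m is None:
            continue  # not an ALU slot line
        number, slot = m.groups()
        if number is not None or not groups:
            groups.append([])  # a leading group number opens a new group
        groups[-1].append(slot)
    total = sum(len(g) for g in groups)
    n = len(groups)
    return {
        "groups": n,
        "avg_slots": total / n if n else 0.0,
        # fraction of groups that issue something on the t unit
        "t_share": sum("t" in g for g in groups) / n if n else 0.0,
    }

SAMPLE = """\
29 x: DOT4 R8.x, PV28.x, 1.0f
 y: DOT4 ____, PV28.y, C33.y
 z: DOT4 ____, PV28.z, C33.z
 w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
 t: ADD T0.y, PS28, -1.0f
05 TEX: ADDR(658) CNT(6) VALID_PIX
50 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
61 y: ADD ____, -PV60.z, PV60.y
 z: ADD ____, -PV60.x, PV60.w
62 x: MULADD T1.x, PV61.z, T0.y, T1.x
 w: MULADD ____, PV61.y, T0.y, T3.z
"""

stats = slot_stats(SAMPLE)
print(stats)
```

Reporting the share of groups that use the t slot separately from the average slot count keeps the two questions apart: a run of scalar groups lowers the average without saying anything about how often the t unit is picked up elsewhere.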
rpg.314
20-Aug-2009, 05:58
Which instruction is a CNDE btw?
OpenGL guy
20-Aug-2009, 06:16
Which instruction is a CNDE btw?
I believe it checks if a number is equal to 0. If so, chooses one of the operands, if not, chooses the other. I don't have the specs in front of me but Jawed posted a link to the instruction set specs recently.
Edit: Did you mean CNDGE? I believe that checks if a number is greater than or equal to 0 with similar behavior to what I posted above.
Edit again: Compare to the cmp instruction in the Direct3D instruction specs.
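As a minimal sketch of the behavior described above, assuming the first operand is the condition and the other two are the candidates (NaN handling and the integer variants are ignored here); `cnde`, `cndge` and `step_negative` are illustrative names only:

```python
def cnde(src0, src1, src2):
    """Select src1 when src0 equals zero, otherwise src2."""
    return src1 if src0 == 0.0 else src2

def cndge(src0, src1, src2):
    """Select src1 when src0 is greater than or equal to zero, otherwise src2."""
    return src1 if src0 >= 0.0 else src2

def step_negative(x):
    """The recurring 'CNDGE dst, x, 0.0f, 1.0f' pattern: 0.0 for x >= 0, else 1.0."""
    return cndge(x, 0.0, 1.0)

print(cnde(0.0, "second", "third"))   # condition is zero
print(cndge(-2.5, "second", "third")) # condition is negative
print(step_negative(3.0), step_negative(-3.0))
```

Read this way, the long CNDGE chains in the listing are just nested selects with no branching involved.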
prunedtree
20-Aug-2009, 08:52
Onto the double-precision version!
Well, given that double precision multiply-adds are four (five if you count the `t' unit) times slower but only require twice the bandwidth, it's much easier to achieve high ALU utilization. ATi's implementation is almost optimal, over 200 Gflop/s (out of 240 Gflop/s peak).
Hmm, perhaps full cache bandwidth only comes with the same values being fetched multiple times.
No, I didn't measure more than ~444 GB/s even with all threads fetching the same value(s) over and over. Running ATi's various synthetic tests (among the samples in the SDK) gives similar results. As texture fetches are the bottleneck, it's actually impressive that the hardware manages to lose only 1% of efficiency with a more complex access pattern.
t: CNDE_INT R6.w, R3.x, R4.z, PV35.z
Table 4.4 in the ISA Guide says that it's a conditional move based on the first operand being equal to 0.0 (for the floating-point version; the description of the integer version is very loose), so it looks like it chooses between operand 2 and operand 3 for the result.
Jawed
prunedtree
25-Aug-2009, 16:18
The 8x8 block kernel in the original post uses only 26 float4 registers. There's clearly plenty of margin, so how much further can we go? Well, it's possible to fit 8x10 blocks using the entirety of the register file. This is 11% faster in theory. It achieves 980 Gflop/s in practice, over 4 multiply-adds per cycle on average.
However, the clarity of the code suffers a little ^^;;
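Before the listing, a quick sanity check of the 11% and "over 4 multiply-adds per cycle" figures. This is only a sketch under stated assumptions: an m x n output block is taken to need m + n fetched input values for m * n multiply-adds per step, the 1200 Gflop/s peak is the figure quoted at the top of the thread, and each ALU is taken to have five slots (x, y, z, w, t). The names `madds_per_fetch`, `PEAK_GFLOPS` and `ALU_SLOTS` are made up for the example.

```python
def madds_per_fetch(rows, cols):
    """Multiply-adds performed per fetched input value for a rows x cols block."""
    return rows * cols / (rows + cols)

# Theoretical gain of 8x10 blocks over 8x8 blocks when fetch-bound.
gain = madds_per_fetch(8, 10) / madds_per_fetch(8, 8)

PEAK_GFLOPS = 1200.0     # HD4870 peak quoted in the thread (assumed here)
ALU_SLOTS = 5            # x, y, z, w and t slots per ALU (assumed here)
ACHIEVED_GFLOPS = 980.0  # measured figure from the post

# Average number of slots doing a multiply-add each cycle.
madds_per_cycle = ACHIEVED_GFLOPS / PEAK_GFLOPS * ALU_SLOTS

print(f"theoretical gain: {(gain - 1.0) * 100.0:.1f}%")
print(f"multiply-adds per cycle: {madds_per_cycle:.2f}")
```

The ratio does not depend on whether the block dimensions are counted in floats or float4s, as long as both block sizes use the same unit.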
00 ALU: ADDR(64) CNT(87)
0 x: MOV R13.x, 0.0f
y: MOV R13.y, 0.0f
z: AND_INT T0.z, R0.x, (0x0000003F, 8.828180325e-44f).x
w: LSHR T0.w, R0.x, (0x00000006, 8.407790786e-45f).y
t: MOV R13.z, 0.0f
1 x: MOV R11.x, 0.0f
y: MOV R11.y, 0.0f
z: MOV R11.z, 0.0f
w: MOV R13.w, 0.0f
t: MOV R11.w, 0.0f
2 x: MOV R10.x, 0.0f
y: MOV R10.y, 0.0f
z: MOV R10.z, 0.0f
w: MOV R10.w, 0.0f
t: MOV R9.x, 0.0f
3 x: MOV R8.x, 0.0f
y: MOV R9.y, 0.0f
z: MOV R9.z, 0.0f
w: MOV R9.w, 0.0f
t: MOV R22.z, 0.0f
4 x: MOV R7.x, 0.0f
y: MOV R8.y, 0.0f
z: MOV R8.z, 0.0f
w: MOV R22.w, 0.0f
t: MOV R8.w, 0.0f
5 x: MOV R29.x, 0.0f
y: MOV R7.y, 0.0f
z: MOV R7.z, 0.0f
w: MOV R7.w, 0.0f
t: MOV R29.y, 0.0f
6 x: MOV R6.x, 0.0f
y: MOV R6.y, 0.0f
z: MOV R29.z, 0.0f
w: MOV R29.w, 0.0f
t: MOV R6.z, 0.0f
7 x: MOV R21.x, 0.0f
y: MOV R21.y, 0.0f
z: MOV R21.z, 0.0f
w: MOV R6.w, 0.0f
t: MOV R21.w, 0.0f
8 x: MOV R28.x, 0.0f
y: MOV R28.y, 0.0f
z: MOV R28.z, 0.0f
w: MOV R28.w, 0.0f
t: MOV R20.x, 0.0f
9 x: MOV R5.x, 0.0f
y: MOV R20.y, 0.0f
z: MOV R20.z, 0.0f
w: MOV R20.w, 0.0f
t: MOV R5.y, 0.0f
10 x: MOV R19.x, 0.0f
y: MOV R19.y, 0.0f
z: MOV R5.z, 0.0f
w: MOV R5.w, 0.0f
t: MOV R19.z, 0.0f
11 x: MOV R4.x, 0.0f
y: MOV R4.y, 0.0f
z: MOV R4.z, 0.0f
w: MOV R19.w, 0.0f
t: MOV R4.w, 0.0f
12 x: MOV R18.x, 0.0f
y: MOV R18.y, 0.0f
z: MOV R18.z, 0.0f
w: MOV R18.w, 0.0f
t: MOV R17.x, 0.0f
13 x: MOV R16.x, 0.0f
y: MOV R17.y, 0.0f
z: MOV R17.z, 0.0f
w: MOV R17.w, 0.0f
t: MOV R16.y, 0.0f
14 x: MOV R15.x, 0.0f
y: MOV R15.y, 0.0f
z: MOV R16.z, 0.0f
w: MOV R16.w, 0.0f
t: MOV R15.z, 0.0f
15 x: MOV R14.x, 0.0f
y: MOV R14.y, 0.0f
z: MOV R14.z, 0.0f
w: MOV R15.w, 0.0f
t: MOV R14.w, 0.0f
16 x: MOV R12.x, 0.0f
y: MOV R12.y, 0.0f
z: MOV R12.z, 0.0f
w: MOV R12.w, 0.0f
t: I_TO_F R0.x, T0.z
17 t: I_TO_F R0.y, T0.w
01 TEX: ADDR(880) CNT(1)
18 SAMPLE R22.xy__, R0.xyxx, t8, s8 UNNORM(XYZW)
02 LOOP_DX10 i0 FAIL_JUMP_ADDR(33)
03 ALU_BREAK: ADDR(151) CNT(1) KCACHE0(CB0:0-15)
19 x: PREDGT ____, KC0[0].x, R22.z UPDATE_EXEC_MASK UPDATE_PRED
04 ALU: ADDR(152) CNT(2)
20 z: ADD R22.z, R22.w, 1.0f
w: ADD R22.w, R22.w, 1.0f
05 TEX: ADDR(882) CNT(8)
21 SAMPLE R1, R22.xwxx, t0, s0 UNNORM(XYZW)
22 SAMPLE R23, R22.xwxx, t1, s1 UNNORM(XYZW)
23 SAMPLE R24, R22.xwxx, t2, s2 UNNORM(XYZW)
24 SAMPLE R25, R22.xwxx, t3, s3 UNNORM(XYZW)
25 SAMPLE R0, R22.yzyy, t4, s4 UNNORM(XYZW)
26 SAMPLE R2, R22.yzyy, t5, s5 UNNORM(XYZW)
27 SAMPLE R26, R22.yzyy, t6, s6 UNNORM(XYZW)
28 SAMPLE R27, R22.yzyy, t7, s7 UNNORM(XYZW)
06 ALU_PUSH_BEFORE: ADDR(154) CNT(81) KCACHE0(CB0:0-15)
29 x: MULADD R29.x, R1.x, R0.x, R29.x
y: MULADD R29.y, R1.x, R0.y, R29.y
z: MULADD R29.z, R1.x, R0.z, R29.z
w: MULADD R29.w, R1.x, R0.w, R29.w
30 x: MULADD R21.x, R1.x, R2.x, R21.x
y: MULADD R21.y, R1.x, R2.y, R21.y
z: MULADD R21.z, R1.x, R2.z, R21.z
w: MULADD R21.w, R1.x, R2.w, R21.w
31 x: MULADD R20.x, R1.y, R0.x, R20.x VEC_210
y: MULADD R20.y, R1.y, R0.y, R20.y VEC_201
z: MULADD R20.z, R1.y, R0.z, R20.z VEC_201
w: MULADD R20.w, R1.y, R0.w, R20.w VEC_201
t: MULADD R18.x, R1.z, R0.x, R18.x VEC_120
32 x: MULADD R19.x, R1.y, R2.x, R19.x VEC_210
y: MULADD R19.y, R1.y, R2.y, R19.y VEC_201
z: MULADD R19.z, R1.y, R2.z, R19.z VEC_201
w: MULADD R19.w, R1.y, R2.w, R19.w VEC_201
t: MULADD R17.x, R1.z, R2.x, R17.x VEC_120
33 x: MULADD R16.x, R1.w, R0.x, R16.x VEC_201
y: MULADD R18.y, R1.z, R0.y, R18.y VEC_210
z: MULADD R18.z, R1.z, R0.z, R18.z VEC_201
w: MULADD R18.w, R1.z, R0.w, R18.w VEC_201
t: MULADD R16.y, R1.w, R0.y, R16.y VEC_120
34 x: MULADD R15.x, R1.w, R2.x, R15.x VEC_201
y: MULADD R17.y, R1.z, R2.y, R17.y VEC_210
z: MULADD R17.z, R1.z, R2.z, R17.z VEC_201
w: MULADD R17.w, R1.z, R2.w, R17.w VEC_201
t: MULADD R15.y, R1.w, R2.y, R15.y VEC_120
35 x: MULADD R14.x, R23.x, R0.x, R14.x VEC_201
y: MULADD R14.y, R23.x, R0.y, R14.y VEC_201
z: MULADD R16.z, R1.w, R0.z, R16.z
w: MULADD R16.w, R1.w, R0.w, R16.w
t: MULADD R14.z, R23.x, R0.z, R14.z
36 x: MULADD R12.x, R23.x, R2.x, R12.x VEC_201
y: MULADD R12.y, R23.x, R2.y, R12.y VEC_201
z: MULADD R15.z, R1.w, R2.z, R15.z
w: MULADD R15.w, R1.w, R2.w, R15.w
t: MULADD R12.z, R23.x, R2.z, R12.z
37 x: MULADD R13.x, R23.y, R0.x, R13.x VEC_201
y: MULADD R13.y, R23.y, R0.y, R13.y VEC_201
z: MULADD R13.z, R23.y, R0.z, R13.z VEC_201
w: MULADD R14.w, R23.x, R0.w, R14.w VEC_210
t: MULADD R13.w, R23.y, R0.w, R13.w VEC_120
38 x: MULADD R11.x, R23.y, R2.x, R11.x VEC_201
y: MULADD R11.y, R23.y, R2.y, R11.y VEC_201
z: MULADD R11.z, R23.y, R2.z, R11.z VEC_201
w: MULADD R12.w, R23.x, R2.w, R12.w VEC_210
t: MULADD R11.w, R23.y, R2.w, R11.w VEC_120
39 x: MULADD R10.x, R23.z, R0.x, R10.x VEC_210
y: MULADD R10.y, R23.z, R0.y, R10.y VEC_201
z: MULADD R10.z, R23.z, R0.z, R10.z VEC_201
w: MULADD R10.w, R23.z, R0.w, R10.w VEC_201
t: MULADD R8.x, R23.w, R0.x, R8.x VEC_120
40 x: MULADD R9.x, R23.z, R2.x, R9.x VEC_210
y: MULADD R9.y, R23.z, R2.y, R9.y VEC_201
z: MULADD R9.z, R23.z, R2.z, R9.z VEC_201
w: MULADD R9.w, R23.z, R2.w, R9.w VEC_201
t: MULADD R7.x, R23.w, R2.x, R7.x VEC_120
41 x: MULADD R6.x, R24.x, R0.x, R6.x VEC_201
y: MULADD R8.y, R23.w, R0.y, R8.y
z: MULADD R8.z, R23.w, R0.z, R8.z
w: MULADD R8.w, R23.w, R0.w, R8.w
t: MULADD R6.y, R24.x, R0.y, R6.y
42 x: MULADD R3.x, R24.x, R2.x, R28.x VEC_201
y: MULADD R7.y, R23.w, R2.y, R7.y
z: MULADD R7.z, R23.w, R2.z, R7.z
w: MULADD R7.w, R23.w, R2.w, R7.w
t: MULADD R3.y, R24.x, R2.y, R28.y
43 x: MULADD R5.x, R24.y, R0.x, R5.x VEC_201
y: MULADD R5.y, R24.y, R0.y, R5.y VEC_201
z: MULADD R6.z, R24.x, R0.z, R6.z VEC_210
w: MULADD R6.w, R24.x, R0.w, R6.w VEC_201
t: MULADD R5.z, R24.y, R0.z, R5.z VEC_120
44 x: MULADD R4.x, R24.y, R2.x, R4.x VEC_201
y: MULADD R4.y, R24.y, R2.y, R4.y VEC_201
z: MULADD R3.z, R24.x, R2.z, R28.z VEC_210
w: MULADD R5.w, R24.y, R0.w, R5.w VEC_201
t: MULADD R4.z, R24.y, R2.z, R4.z VEC_120
45 w: MULADD R3.w, R24.x, R2.w, R28.w
t: MULADD R4.w, R24.y, R2.w, R4.w
46 x: PREDE_INT ____, KC0[1].y, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
07 JUMP POP_CNT(1) ADDR(9)
08 ALU_POP_AFTER: ADDR(235) CNT(48)
47 x: MULADD R29.x, R24.z, R26.x, R29.x VEC_210
y: MULADD R29.y, R24.z, R26.y, R29.y VEC_201
z: MULADD R29.z, R24.z, R26.z, R29.z VEC_201
w: MULADD R29.w, R24.z, R26.w, R29.w VEC_201
t: MULADD R20.x, R24.w, R26.x, R20.x VEC_120
48 x: MULADD R21.x, R24.z, R27.x, R21.x VEC_210
y: MULADD R21.y, R24.z, R27.y, R21.y VEC_201
z: MULADD R21.z, R24.z, R27.z, R21.z VEC_201
w: MULADD R21.w, R24.z, R27.w, R21.w VEC_201
t: MULADD R19.x, R24.w, R27.x, R19.x VEC_120
49 x: MULADD R18.x, R25.x, R26.x, R18.x VEC_201
y: MULADD R20.y, R24.w, R26.y, R20.y
z: MULADD R20.z, R24.w, R26.z, R20.z
w: MULADD R20.w, R24.w, R26.w, R20.w
t: MULADD R18.y, R25.x, R26.y, R18.y
50 x: MULADD R17.x, R25.x, R27.x, R17.x VEC_201
y: MULADD R19.y, R24.w, R27.y, R19.y
z: MULADD R19.z, R24.w, R27.z, R19.z
w: MULADD R19.w, R24.w, R27.w, R19.w
t: MULADD R17.y, R25.x, R27.y, R17.y
51 x: MULADD R16.x, R25.y, R26.x, R16.x VEC_201
y: MULADD R16.y, R25.y, R26.y, R16.y VEC_201
z: MULADD R18.z, R25.x, R26.z, R18.z VEC_210
w: MULADD R18.w, R25.x, R26.w, R18.w VEC_201
t: MULADD R16.z, R25.y, R26.z, R16.z VEC_120
52 x: MULADD R15.x, R25.y, R27.x, R15.x VEC_201
y: MULADD R15.y, R25.y, R27.y, R15.y VEC_201
z: MULADD R17.z, R25.x, R27.z, R17.z VEC_210
w: MULADD R17.w, R25.x, R27.w, R17.w VEC_201
t: MULADD R15.z, R25.y, R27.z, R15.z VEC_120
53 x: MULADD R14.x, R25.z, R26.x, R14.x VEC_201
y: MULADD R14.y, R25.z, R26.y, R14.y VEC_201
z: MULADD R14.z, R25.z, R26.z, R14.z VEC_201
w: MULADD R16.w, R25.y, R26.w, R16.w VEC_210
t: MULADD R14.w, R25.z, R26.w, R14.w VEC_120
54 x: MULADD R12.x, R25.z, R27.x, R12.x VEC_201
y: MULADD R12.y, R25.z, R27.y, R12.y VEC_201
z: MULADD R12.z, R25.z, R27.z, R12.z VEC_201
w: MULADD R15.w, R25.y, R27.w, R15.w VEC_210
t: MULADD R12.w, R25.z, R27.w, R12.w VEC_120
55 x: MULADD R13.x, R25.w, R26.x, R13.x
y: MULADD R13.y, R25.w, R26.y, R13.y
z: MULADD R13.z, R25.w, R26.z, R13.z
w: MULADD R13.w, R25.w, R26.w, R13.w
56 x: MULADD R11.x, R25.w, R27.x, R11.x
y: MULADD R11.y, R25.w, R27.y, R11.y
z: MULADD R11.z, R25.w, R27.z, R11.z
w: MULADD R11.w, R25.w, R27.w, R11.w
09 ALU_PUSH_BEFORE: ADDR(283) CNT(3) KCACHE0(CB0:0-15)
57 z: ADD R22.z, R22.z, 1.0f
w: ADD R22.w, R22.z, 1.0f
58 x: PREDE_INT ____, KC0[1].w, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
10 JUMP POP_CNT(1) ADDR(16)
11 TEX: ADDR(898) CNT(8)
59 SAMPLE R1, R22.xwxx, t0, s0 UNNORM(XYZW)
60 SAMPLE R23, R22.xwxx, t1, s1 UNNORM(XYZW)
61 SAMPLE R24, R22.xwxx, t2, s2 UNNORM(XYZW)
62 SAMPLE R25, R22.xwxx, t3, s3 UNNORM(XYZW)
63 SAMPLE R0, R22.yzyy, t4, s4 UNNORM(XYZW)
64 SAMPLE R2, R22.yzyy, t5, s5 UNNORM(XYZW)
65 SAMPLE R28, R22.yzyy, t6, s6 UNNORM(XYZW)
66 SAMPLE R30, R22.yzyy, t7, s7 UNNORM(XYZW)
12 ALU_PUSH_BEFORE: ADDR(286) CNT(33) KCACHE0(CB0:0-15)
67 x: MULADD R10.x, R1.x, R26.x, R10.x
y: MULADD R10.y, R1.x, R26.y, R10.y
z: MULADD R10.z, R1.x, R26.z, R10.z
w: MULADD R10.w, R1.x, R26.w, R10.w
68 x: MULADD R9.x, R1.x, R27.x, R9.x
y: MULADD R9.y, R1.x, R27.y, R9.y
z: MULADD R9.z, R1.x, R27.z, R9.z
w: MULADD R9.w, R1.x, R27.w, R9.w
69 x: MULADD R8.x, R1.y, R26.x, R8.x VEC_210
y: MULADD R8.y, R1.y, R26.y, R8.y VEC_201
z: MULADD R8.z, R1.y, R26.z, R8.z VEC_201
w: MULADD R8.w, R1.y, R26.w, R8.w VEC_201
t: MULADD R6.x, R1.z, R26.x, R6.x VEC_120
70 x: MULADD R7.x, R1.y, R27.x, R7.x VEC_210
y: MULADD R7.y, R1.y, R27.y, R7.y VEC_201
z: MULADD R7.z, R1.y, R27.z, R7.z VEC_201
w: MULADD R7.w, R1.y, R27.w, R7.w VEC_201
t: MULADD R3.x, R1.z, R27.x, R3.x VEC_120
71 x: MULADD R5.x, R1.w, R26.x, R5.x VEC_201
y: MULADD R6.y, R1.z, R26.y, R6.y VEC_210
z: MULADD R6.z, R1.z, R26.z, R6.z VEC_201
w: MULADD R6.w, R1.z, R26.w, R6.w VEC_201
t: MULADD R5.y, R1.w, R26.y, R5.y VEC_120
72 x: MULADD R4.x, R1.w, R27.x, R4.x VEC_201
y: MULADD R3.y, R1.z, R27.y, R3.y VEC_210
z: MULADD R5.z, R1.w, R26.z, R5.z VEC_201
w: MULADD R5.w, R1.w, R26.w, R5.w VEC_201
t: MULADD R4.y, R1.w, R27.y, R4.y VEC_120
73 z: MULADD R3.z, R1.z, R27.z, R3.z
w: MULADD R3.w, R1.z, R27.w, R3.w
74 z: MULADD R4.z, R1.w, R27.z, R4.z
w: MULADD R4.w, R1.w, R27.w, R4.w
75 x: PREDE_INT ____, KC0[1].y, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
13 JUMP POP_CNT(1) ADDR(15)
14 ALU_POP_AFTER: ADDR(319) CNT(80)
76 x: MULADD R29.x, R23.x, R0.x, R29.x
y: MULADD R29.y, R23.x, R0.y, R29.y
z: MULADD R29.z, R23.x, R0.z, R29.z
w: MULADD R29.w, R23.x, R0.w, R29.w
77 x: MULADD R21.x, R23.x, R2.x, R21.x
y: MULADD R21.y, R23.x, R2.y, R21.y
z: MULADD R21.z, R23.x, R2.z, R21.z
w: MULADD R21.w, R23.x, R2.w, R21.w
78 x: MULADD R20.x, R23.y, R0.x, R20.x VEC_210
y: MULADD R20.y, R23.y, R0.y, R20.y VEC_201
z: MULADD R20.z, R23.y, R0.z, R20.z VEC_201
w: MULADD R20.w, R23.y, R0.w, R20.w VEC_201
t: MULADD R18.x, R23.z, R0.x, R18.x VEC_120
79 x: MULADD R19.x, R23.y, R2.x, R19.x VEC_210
y: MULADD R19.y, R23.y, R2.y, R19.y VEC_201
z: MULADD R19.z, R23.y, R2.z, R19.z VEC_201
w: MULADD R19.w, R23.y, R2.w, R19.w VEC_201
t: MULADD R17.x, R23.z, R2.x, R17.x VEC_120
80 x: MULADD R16.x, R23.w, R0.x, R16.x VEC_201
y: MULADD R18.y, R23.z, R0.y, R18.y VEC_210
z: MULADD R18.z, R23.z, R0.z, R18.z VEC_201
w: MULADD R18.w, R23.z, R0.w, R18.w VEC_201
t: MULADD R16.y, R23.w, R0.y, R16.y VEC_120
81 x: MULADD R15.x, R23.w, R2.x, R15.x VEC_201
y: MULADD R17.y, R23.z, R2.y, R17.y VEC_210
z: MULADD R17.z, R23.z, R2.z, R17.z VEC_201
w: MULADD R17.w, R23.z, R2.w, R17.w VEC_201
t: MULADD R15.y, R23.w, R2.y, R15.y VEC_120
82 x: MULADD R14.x, R24.x, R0.x, R14.x VEC_201
y: MULADD R14.y, R24.x, R0.y, R14.y VEC_201
z: MULADD R16.z, R23.w, R0.z, R16.z
w: MULADD R16.w, R23.w, R0.w, R16.w
t: MULADD R14.z, R24.x, R0.z, R14.z
83 x: MULADD R12.x, R24.x, R2.x, R12.x VEC_201
y: MULADD R12.y, R24.x, R2.y, R12.y VEC_201
z: MULADD R15.z, R23.w, R2.z, R15.z
w: MULADD R15.w, R23.w, R2.w, R15.w
t: MULADD R12.z, R24.x, R2.z, R12.z
84 x: MULADD R13.x, R24.y, R0.x, R13.x VEC_201
y: MULADD R13.y, R24.y, R0.y, R13.y VEC_201
z: MULADD R13.z, R24.y, R0.z, R13.z VEC_201
w: MULADD R14.w, R24.x, R0.w, R14.w VEC_210
t: MULADD R13.w, R24.y, R0.w, R13.w VEC_120
85 x: MULADD R11.x, R24.y, R2.x, R11.x VEC_201
y: MULADD R11.y, R24.y, R2.y, R11.y VEC_201
z: MULADD R11.z, R24.y, R2.z, R11.z VEC_201
w: MULADD R12.w, R24.x, R2.w, R12.w VEC_210
t: MULADD R11.w, R24.y, R2.w, R11.w VEC_120
86 x: MULADD R10.x, R24.z, R0.x, R10.x VEC_210
y: MULADD R10.y, R24.z, R0.y, R10.y VEC_201
z: MULADD R10.z, R24.z, R0.z, R10.z VEC_201
w: MULADD R10.w, R24.z, R0.w, R10.w VEC_201
t: MULADD R8.x, R24.w, R0.x, R8.x VEC_120
87 x: MULADD R9.x, R24.z, R2.x, R9.x VEC_210
y: MULADD R9.y, R24.z, R2.y, R9.y VEC_201
z: MULADD R9.z, R24.z, R2.z, R9.z VEC_201
w: MULADD R9.w, R24.z, R2.w, R9.w VEC_201
t: MULADD R7.x, R24.w, R2.x, R7.x VEC_120
88 x: MULADD R6.x, R25.x, R0.x, R6.x VEC_201
y: MULADD R8.y, R24.w, R0.y, R8.y
z: MULADD R8.z, R24.w, R0.z, R8.z
w: MULADD R8.w, R24.w, R0.w, R8.w
t: MULADD R6.y, R25.x, R0.y, R6.y
89 x: MULADD R3.x, R25.x, R2.x, R3.x VEC_201
y: MULADD R7.y, R24.w, R2.y, R7.y
z: MULADD R7.z, R24.w, R2.z, R7.z
w: MULADD R7.w, R24.w, R2.w, R7.w
t: MULADD R3.y, R25.x, R2.y, R3.y
90 x: MULADD R5.x, R25.y, R0.x, R5.x VEC_201
y: MULADD R5.y, R25.y, R0.y, R5.y VEC_201
z: MULADD R6.z, R25.x, R0.z, R6.z VEC_210
w: MULADD R6.w, R25.x, R0.w, R6.w VEC_201
t: MULADD R5.z, R25.y, R0.z, R5.z VEC_120
91 x: MULADD R4.x, R25.y, R2.x, R4.x VEC_201
y: MULADD R4.y, R25.y, R2.y, R4.y VEC_201
z: MULADD R3.z, R25.x, R2.z, R3.z VEC_210
w: MULADD R5.w, R25.y, R0.w, R5.w VEC_201
t: MULADD R4.z, R25.y, R2.z, R4.z VEC_120
92 w: MULADD R3.w, R25.x, R2.w, R3.w
t: MULADD R4.w, R25.y, R2.w, R4.w
15 ALU_POP_AFTER: ADDR(399) CNT(25)
93 x: MULADD R29.x, R25.z, R28.x, R29.x VEC_210
y: MULADD R29.y, R25.z, R28.y, R29.y VEC_201
z: MULADD R29.z, R25.z, R28.z, R29.z VEC_201
w: MULADD R29.w, R25.z, R28.w, R29.w VEC_201
t: MULADD R20.x, R25.w, R28.x, R20.x VEC_120
94 x: MULADD R21.x, R25.z, R30.x, R21.x VEC_210
y: MULADD R20.y, R25.w, R28.y, R20.y VEC_201
z: MULADD R20.z, R25.w, R28.z, R20.z VEC_201
w: MULADD R20.w, R25.w, R28.w, R20.w VEC_201
t: MULADD R19.x, R25.w, R30.x, R19.x VEC_120
95 y: MULADD R21.y, R25.z, R30.y, R21.y VEC_210
z: MULADD R21.z, R25.z, R30.z, R21.z VEC_201
w: MULADD R21.w, R25.z, R30.w, R21.w VEC_201
t: MULADD R19.y, R25.w, R30.y, R19.y VEC_120
96 z: MULADD R19.z, R25.w, R30.z, R19.z VEC_201
w: MULADD R19.w, R25.w, R30.w, R19.w VEC_201
t: ADD R22.w, R22.z, 1.0f
97 x: MOV R26.x, R28.x
y: MOV R26.y, R28.y
z: MOV R26.z, R28.z
w: MOV R26.w, R28.w
98 x: MOV R27.x, R30.x
y: MOV R27.y, R30.y
z: MOV R27.z, R30.z
w: MOV R27.w, R30.w
16 ALU_PUSH_BEFORE: ADDR(424) CNT(1) KCACHE0(CB0:0-15)
99 x: PREDE_INT ____, KC0[1].w, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
17 JUMP POP_CNT(1) ADDR(20)
18 TEX: ADDR(914) CNT(4)
100 SAMPLE R1, R22.xwxx, t0, s0 UNNORM(XYZW)
101 SAMPLE R23, R22.xwxx, t1, s1 UNNORM(XYZW)
102 SAMPLE R24, R22.xwxx, t2, s2 UNNORM(XYZW)
103 SAMPLE R25, R22.xwxx, t3, s3 UNNORM(XYZW)
19 ALU_POP_AFTER: ADDR(425) CNT(66)
104 x: MULADD R18.x, R1.x, R26.x, R18.x VEC_201
y: MULADD R18.y, R1.x, R26.y, R18.y VEC_201
z: MULADD R18.z, R1.x, R26.z, R18.z VEC_201
w: MULADD R18.w, R1.x, R26.w, R18.w VEC_201
t: ADD R22.z, R22.w, 1.0f
105 x: MULADD R17.x, R1.x, R27.x, R17.x VEC_201
y: MULADD R17.y, R1.x, R27.y, R17.y VEC_201
z: MULADD R17.z, R1.x, R27.z, R17.z VEC_201
w: MULADD R17.w, R1.x, R27.w, R17.w VEC_201
t: ADD R22.w, R22.w, 1.0f
106 x: MULADD R16.x, R1.y, R26.x, R16.x VEC_210
y: MULADD R16.y, R1.y, R26.y, R16.y VEC_201
z: MULADD R16.z, R1.y, R26.z, R16.z VEC_201
w: MULADD R16.w, R1.y, R26.w, R16.w VEC_201
t: MULADD R14.x, R1.z, R26.x, R14.x VEC_120
107 x: MULADD R15.x, R1.y, R27.x, R15.x VEC_210
y: MULADD R15.y, R1.y, R27.y, R15.y VEC_201
z: MULADD R15.z, R1.y, R27.z, R15.z VEC_201
w: MULADD R15.w, R1.y, R27.w, R15.w VEC_201
t: MULADD R12.x, R1.z, R27.x, R12.x VEC_120
108 x: MULADD R13.x, R1.w, R26.x, R13.x VEC_201
y: MULADD R14.y, R1.z, R26.y, R14.y VEC_210
z: MULADD R14.z, R1.z, R26.z, R14.z VEC_201
w: MULADD R14.w, R1.z, R26.w, R14.w VEC_201
t: MULADD R13.y, R1.w, R26.y, R13.y VEC_120
109 x: MULADD R11.x, R1.w, R27.x, R11.x VEC_201
y: MULADD R12.y, R1.z, R27.y, R12.y VEC_210
z: MULADD R12.z, R1.z, R27.z, R12.z VEC_201
w: MULADD R12.w, R1.z, R27.w, R12.w VEC_201
t: MULADD R11.y, R1.w, R27.y, R11.y VEC_120
110 x: MULADD R10.x, R23.x, R26.x, R10.x VEC_201
y: MULADD R10.y, R23.x, R26.y, R10.y VEC_201
z: MULADD R13.z, R1.w, R26.z, R13.z
w: MULADD R13.w, R1.w, R26.w, R13.w
t: MULADD R10.z, R23.x, R26.z, R10.z
111 x: MULADD R9.x, R23.x, R27.x, R9.x VEC_201
y: MULADD R9.y, R23.x, R27.y, R9.y VEC_201
z: MULADD R11.z, R1.w, R27.z, R11.z
w: MULADD R11.w, R1.w, R27.w, R11.w
t: MULADD R9.z, R23.x, R27.z, R9.z
112 x: MULADD R8.x, R23.y, R26.x, R8.x VEC_201
y: MULADD R8.y, R23.y, R26.y, R8.y VEC_201
z: MULADD R8.z, R23.y, R26.z, R8.z VEC_201
w: MULADD R10.w, R23.x, R26.w, R10.w VEC_210
t: MULADD R8.w, R23.y, R26.w, R8.w VEC_120
113 x: MULADD R7.x, R23.y, R27.x, R7.x VEC_201
y: MULADD R7.y, R23.y, R27.y, R7.y VEC_201
z: MULADD R7.z, R23.y, R27.z, R7.z VEC_201
w: MULADD R9.w, R23.x, R27.w, R9.w VEC_210
t: MULADD R7.w, R23.y, R27.w, R7.w VEC_120
114 x: MULADD R6.x, R23.z, R26.x, R6.x VEC_210
y: MULADD R6.y, R23.z, R26.y, R6.y VEC_201
z: MULADD R6.z, R23.z, R26.z, R6.z VEC_201
w: MULADD R6.w, R23.z, R26.w, R6.w VEC_201
t: MULADD R5.x, R23.w, R26.x, R5.x VEC_120
115 x: MULADD R3.x, R23.z, R27.x, R3.x VEC_210
y: MULADD R5.y, R23.w, R26.y, R5.y VEC_201
z: MULADD R5.z, R23.w, R26.z, R5.z VEC_201
w: MULADD R5.w, R23.w, R26.w, R5.w VEC_201
t: MULADD R4.x, R23.w, R27.x, R4.x VEC_120
116 y: MULADD R3.y, R23.z, R27.y, R3.y VEC_210
z: MULADD R3.z, R23.z, R27.z, R3.z VEC_201
w: MULADD R3.w, R23.z, R27.w, R3.w VEC_201
t: MULADD R4.y, R23.w, R27.y, R4.y VEC_120
117 z: MULADD R4.z, R23.w, R27.z, R4.z
w: MULADD R4.w, R23.w, R27.w, R4.w
20 ALU_PUSH_BEFORE: ADDR(491) CNT(1) KCACHE0(CB0:0-15)
118 x: PREDE_INT ____, KC0[1].w, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
21 JUMP POP_CNT(1) ADDR(24)
22 TEX: ADDR(922) CNT(8)
119 SAMPLE R1, R22.xwxx, t0, s0 UNNORM(XYZW)
120 SAMPLE R23, R22.xwxx, t1, s1 UNNORM(XYZW)
121 SAMPLE R28, R22.xwxx, t2, s2 UNNORM(XYZW)
122 SAMPLE R30, R22.xwxx, t3, s3 UNNORM(XYZW)
123 SAMPLE R0, R22.yzyy, t4, s4 UNNORM(XYZW)
124 SAMPLE R2, R22.yzyy, t5, s5 UNNORM(XYZW)
125 SAMPLE R26, R22.yzyy, t6, s6 UNNORM(XYZW)
126 SAMPLE R27, R22.yzyy, t7, s7 UNNORM(XYZW)
23 ALU_POP_AFTER: ADDR(492) CNT(88)
127 x: MULADD R29.x, R24.x, R0.x, R29.x
y: MULADD R29.y, R24.x, R0.y, R29.y
z: MULADD R29.z, R24.x, R0.z, R29.z
w: MULADD R29.w, R24.x, R0.w, R29.w
128 x: MULADD R21.x, R24.x, R2.x, R21.x
y: MULADD R21.y, R24.x, R2.y, R21.y
z: MULADD R21.z, R24.x, R2.z, R21.z
w: MULADD R21.w, R24.x, R2.w, R21.w
129 x: MULADD R20.x, R24.y, R0.x, R20.x VEC_210
y: MULADD R20.y, R24.y, R0.y, R20.y VEC_201
z: MULADD R20.z, R24.y, R0.z, R20.z VEC_201
w: MULADD R20.w, R24.y, R0.w, R20.w VEC_201
t: MULADD R18.x, R24.z, R0.x, R18.x VEC_120
130 x: MULADD R19.x, R24.y, R2.x, R19.x VEC_210
y: MULADD R19.y, R24.y, R2.y, R19.y VEC_201
z: MULADD R19.z, R24.y, R2.z, R19.z VEC_201
w: MULADD R19.w, R24.y, R2.w, R19.w VEC_201
t: MULADD R17.x, R24.z, R2.x, R17.x VEC_120
131 x: MULADD R16.x, R24.w, R0.x, R16.x VEC_201
y: MULADD R18.y, R24.z, R0.y, R18.y VEC_210
z: MULADD R18.z, R24.z, R0.z, R18.z VEC_201
w: MULADD R18.w, R24.z, R0.w, R18.w VEC_201
t: MULADD R16.y, R24.w, R0.y, R16.y VEC_120
132 x: MULADD R15.x, R24.w, R2.x, R15.x VEC_201
y: MULADD R17.y, R24.z, R2.y, R17.y VEC_210
z: MULADD R17.z, R24.z, R2.z, R17.z VEC_201
w: MULADD R17.w, R24.z, R2.w, R17.w VEC_201
t: MULADD R15.y, R24.w, R2.y, R15.y VEC_120
133 x: MULADD R14.x, R25.x, R0.x, R14.x VEC_201
y: MULADD R14.y, R25.x, R0.y, R14.y VEC_201
z: MULADD R16.z, R24.w, R0.z, R16.z
w: MULADD R16.w, R24.w, R0.w, R16.w
t: MULADD R14.z, R25.x, R0.z, R14.z
134 x: MULADD R12.x, R25.x, R2.x, R12.x VEC_201
y: MULADD R12.y, R25.x, R2.y, R12.y VEC_201
z: MULADD R15.z, R24.w, R2.z, R15.z
w: MULADD R15.w, R24.w, R2.w, R15.w
t: MULADD R12.z, R25.x, R2.z, R12.z
135 x: MULADD R13.x, R25.y, R0.x, R13.x VEC_201
y: MULADD R13.y, R25.y, R0.y, R13.y VEC_201
z: MULADD R13.z, R25.y, R0.z, R13.z VEC_201
w: MULADD R14.w, R25.x, R0.w, R14.w VEC_210
t: MULADD R13.w, R25.y, R0.w, R13.w VEC_120
136 x: MULADD R11.x, R25.y, R2.x, R11.x VEC_201
y: MULADD R11.y, R25.y, R2.y, R11.y VEC_201
z: MULADD R11.z, R25.y, R2.z, R11.z VEC_201
w: MULADD R12.w, R25.x, R2.w, R12.w VEC_210
t: MULADD R11.w, R25.y, R2.w, R11.w VEC_120
137 x: MULADD R10.x, R25.z, R0.x, R10.x VEC_210
y: MULADD R10.y, R25.z, R0.y, R10.y VEC_201
z: MULADD R10.z, R25.z, R0.z, R10.z VEC_201
w: MULADD R10.w, R25.z, R0.w, R10.w VEC_201
t: MULADD R8.x, R25.w, R0.x, R8.x VEC_120
138 x: MULADD R9.x, R25.z, R2.x, R9.x VEC_210
y: MULADD R9.y, R25.z, R2.y, R9.y VEC_201
z: MULADD R9.z, R25.z, R2.z, R9.z VEC_201
w: MULADD R9.w, R25.z, R2.w, R9.w VEC_201
t: MULADD R7.x, R25.w, R2.x, R7.x VEC_120
139 x: MULADD R6.x, R1.x, R0.x, R6.x VEC_201
y: MULADD R8.y, R25.w, R0.y, R8.y
z: MULADD R8.z, R25.w, R0.z, R8.z
w: MULADD R8.w, R25.w, R0.w, R8.w
t: MULADD R6.y, R1.x, R0.y, R6.y
140 x: MULADD R3.x, R1.x, R2.x, R3.x VEC_201
y: MULADD R7.y, R25.w, R2.y, R7.y
z: MULADD R7.z, R25.w, R2.z, R7.z
w: MULADD R7.w, R25.w, R2.w, R7.w
t: MULADD R3.y, R1.x, R2.y, R3.y
141 x: MULADD R5.x, R1.y, R0.x, R5.x VEC_201
y: MULADD R5.y, R1.y, R0.y, R5.y VEC_201
z: MULADD R6.z, R1.x, R0.z, R6.z VEC_210
w: MULADD R6.w, R1.x, R0.w, R6.w VEC_201
t: MULADD R5.z, R1.y, R0.z, R5.z VEC_120
142 x: MULADD R4.x, R1.y, R2.x, R4.x VEC_201
y: MULADD R4.y, R1.y, R2.y, R4.y VEC_201
z: MULADD R3.z, R1.x, R2.z, R3.z VEC_210
w: MULADD R5.w, R1.y, R0.w, R5.w VEC_201
t: MULADD R4.z, R1.y, R2.z, R4.z VEC_120
143 w: MULADD R3.w, R1.x, R2.w, R3.w
t: MULADD R4.w, R1.y, R2.w, R4.w
144 x: MOV R24.x, R28.x
y: MOV R24.y, R28.y
z: MOV R24.z, R28.z
w: MOV R24.w, R28.w
145 x: MOV R25.x, R30.x
y: MOV R25.y, R30.y
z: MOV R25.z, R30.z
w: MOV R25.w, R30.w
24 ALU_PUSH_BEFORE: ADDR(580) CNT(1) KCACHE0(CB0:0-15)
146 x: PREDE_INT ____, KC0[1].y, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
25 JUMP POP_CNT(1) ADDR(27)
26 ALU_POP_AFTER: ADDR(581) CNT(66)
147 x: MULADD R29.x, R1.z, R2.x, R29.x VEC_210
y: MULADD R29.y, R1.z, R2.y, R29.y VEC_201
z: MULADD R29.z, R1.z, R2.z, R29.z VEC_201
w: MULADD R29.w, R1.z, R2.w, R29.w VEC_201
t: MULADD R20.x, R1.w, R2.x, R20.x VEC_120
148 x: MULADD R21.x, R1.z, R26.x, R21.x VEC_210
y: MULADD R21.y, R1.z, R26.y, R21.y VEC_201
z: MULADD R21.z, R1.z, R26.z, R21.z VEC_201
w: MULADD R21.w, R1.z, R26.w, R21.w VEC_201
t: MULADD R19.x, R1.w, R26.x, R19.x VEC_120
149 x: MULADD R18.x, R23.x, R2.x, R18.x VEC_201
y: MULADD R20.y, R1.w, R2.y, R20.y
z: MULADD R20.z, R1.w, R2.z, R20.z
w: MULADD R20.w, R1.w, R2.w, R20.w
t: MULADD R18.y, R23.x, R2.y, R18.y
150 x: MULADD R17.x, R23.x, R26.x, R17.x VEC_201
y: MULADD R19.y, R1.w, R26.y, R19.y
z: MULADD R19.z, R1.w, R26.z, R19.z
w: MULADD R19.w, R1.w, R26.w, R19.w
t: MULADD R17.y, R23.x, R26.y, R17.y
151 x: MULADD R16.x, R23.y, R2.x, R16.x VEC_201
y: MULADD R16.y, R23.y, R2.y, R16.y VEC_201
z: MULADD R18.z, R23.x, R2.z, R18.z VEC_210
w: MULADD R18.w, R23.x, R2.w, R18.w VEC_201
t: MULADD R16.z, R23.y, R2.z, R16.z VEC_120
152 x: MULADD R15.x, R23.y, R26.x, R15.x VEC_201
y: MULADD R15.y, R23.y, R26.y, R15.y VEC_201
z: MULADD R17.z, R23.x, R26.z, R17.z VEC_210
w: MULADD R17.w, R23.x, R26.w, R17.w VEC_201
t: MULADD R15.z, R23.y, R26.z, R15.z VEC_120
153 x: MULADD R14.x, R23.z, R2.x, R14.x VEC_201
y: MULADD R14.y, R23.z, R2.y, R14.y VEC_201
z: MULADD R14.z, R23.z, R2.z, R14.z VEC_201
w: MULADD R16.w, R23.y, R2.w, R16.w VEC_210
t: MULADD R14.w, R23.z, R2.w, R14.w VEC_120
154 x: MULADD R12.x, R23.z, R26.x, R12.x VEC_201
y: MULADD R12.y, R23.z, R26.y, R12.y VEC_201
z: MULADD R12.z, R23.z, R26.z, R12.z VEC_201
w: MULADD R15.w, R23.y, R26.w, R15.w VEC_210
t: MULADD R12.w, R23.z, R26.w, R12.w VEC_120
155 x: MULADD R13.x, R23.w, R2.x, R13.x VEC_210
y: MULADD R13.y, R23.w, R2.y, R13.y VEC_201
z: MULADD R13.z, R23.w, R2.z, R13.z VEC_201
w: MULADD R13.w, R23.w, R2.w, R13.w VEC_201
t: MULADD R6.x, R24.z, R2.x, R6.x VEC_120
156 x: MULADD R11.x, R23.w, R26.x, R11.x VEC_210
y: MULADD R11.y, R23.w, R26.y, R11.y VEC_201
z: MULADD R11.z, R23.w, R26.z, R11.z VEC_201
w: MULADD R11.w, R23.w, R26.w, R11.w VEC_201
t: MULADD R3.x, R24.z, R26.x, R3.x VEC_120
157 x: MULADD R5.x, R24.w, R2.x, R5.x VEC_201
y: MULADD R6.y, R24.z, R2.y, R6.y VEC_210
z: MULADD R6.z, R24.z, R2.z, R6.z VEC_201
w: MULADD R6.w, R24.z, R2.w, R6.w VEC_201
t: MULADD R5.y, R24.w, R2.y, R5.y VEC_120
158 x: MULADD R4.x, R24.w, R26.x, R4.x VEC_201
y: MULADD R3.y, R24.z, R26.y, R3.y VEC_210
z: MULADD R5.z, R24.w, R2.z, R5.z VEC_201
w: MULADD R5.w, R24.w, R2.w, R5.w VEC_201
t: MULADD R4.y, R24.w, R26.y, R4.y VEC_120
159 z: MULADD R3.z, R24.z, R26.z, R3.z VEC_201
w: MULADD R3.w, R24.z, R26.w, R3.w VEC_201
t: ADD R22.z, R22.w, 1.0f
160 z: MULADD R4.z, R24.w, R26.z, R4.z
w: MULADD R4.w, R24.w, R26.w, R4.w
161 w: ADD R22.w, R22.w, 1.0f
27 ALU_PUSH_BEFORE: ADDR(647) CNT(1) KCACHE0(CB0:0-15)
162 x: PREDE_INT ____, KC0[1].w, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
28 JUMP POP_CNT(1) ADDR(31)
29 TEX: ADDR(938) CNT(8)
163 SAMPLE R0, R22.xwxx, t0, s0 UNNORM(XYZW)
164 SAMPLE R23, R22.xwxx, t1, s1 UNNORM(XYZW)
165 SAMPLE R24, R22.xwxx, t2, s2 UNNORM(XYZW)
166 SAMPLE R28, R22.xwxx, t3, s3 UNNORM(XYZW)
167 SAMPLE R1, R22.yzyy, t4, s4 UNNORM(XYZW)
168 SAMPLE R2, R22.yzyy, t5, s5 UNNORM(XYZW)
169 SAMPLE R26, R22.yzyy, t6, s6 UNNORM(XYZW)
170 SAMPLE R27, R22.yzyy, t7, s7 UNNORM(XYZW)
30 ALU_POP_AFTER: ADDR(648) CNT(84)
171 x: MULADD R29.x, R25.x, R1.x, R29.x
y: MULADD R29.y, R25.x, R1.y, R29.y
z: MULADD R29.z, R25.x, R1.z, R29.z
w: MULADD R29.w, R25.x, R1.w, R29.w
172 x: MULADD R21.x, R25.x, R2.x, R21.x
y: MULADD R21.y, R25.x, R2.y, R21.y
z: MULADD R21.z, R25.x, R2.z, R21.z
w: MULADD R21.w, R25.x, R2.w, R21.w
173 x: MULADD R20.x, R25.y, R1.x, R20.x VEC_210
y: MULADD R20.y, R25.y, R1.y, R20.y VEC_201
z: MULADD R20.z, R25.y, R1.z, R20.z VEC_201
w: MULADD R20.w, R25.y, R1.w, R20.w VEC_201
t: MULADD R18.x, R25.z, R1.x, R18.x VEC_120
174 x: MULADD R19.x, R25.y, R2.x, R19.x VEC_210
y: MULADD R19.y, R25.y, R2.y, R19.y VEC_201
z: MULADD R19.z, R25.y, R2.z, R19.z VEC_201
w: MULADD R19.w, R25.y, R2.w, R19.w VEC_201
t: MULADD R17.x, R25.z, R2.x, R17.x VEC_120
175 x: MULADD R16.x, R25.w, R1.x, R16.x VEC_201
y: MULADD R18.y, R25.z, R1.y, R18.y VEC_210
z: MULADD R18.z, R25.z, R1.z, R18.z VEC_201
w: MULADD R18.w, R25.z, R1.w, R18.w VEC_201
t: MULADD R16.y, R25.w, R1.y, R16.y VEC_120
176 x: MULADD R15.x, R25.w, R2.x, R15.x VEC_201
y: MULADD R17.y, R25.z, R2.y, R17.y VEC_210
z: MULADD R17.z, R25.z, R2.z, R17.z VEC_201
w: MULADD R17.w, R25.z, R2.w, R17.w VEC_201
t: MULADD R15.y, R25.w, R2.y, R15.y VEC_120
177 x: MULADD R14.x, R0.x, R1.x, R14.x VEC_201
y: MULADD R14.y, R0.x, R1.y, R14.y VEC_201
z: MULADD R16.z, R25.w, R1.z, R16.z
w: MULADD R16.w, R25.w, R1.w, R16.w
t: MULADD R14.z, R0.x, R1.z, R14.z
178 x: MULADD R12.x, R0.x, R2.x, R12.x VEC_201
y: MULADD R12.y, R0.x, R2.y, R12.y VEC_201
z: MULADD R15.z, R25.w, R2.z, R15.z
w: MULADD R15.w, R25.w, R2.w, R15.w
t: MULADD R12.z, R0.x, R2.z, R12.z
179 x: MULADD R13.x, R0.y, R1.x, R13.x VEC_201
y: MULADD R13.y, R0.y, R1.y, R13.y VEC_201
z: MULADD R13.z, R0.y, R1.z, R13.z VEC_201
w: MULADD R14.w, R0.x, R1.w, R14.w VEC_210
t: MULADD R13.w, R0.y, R1.w, R13.w VEC_120
180 x: MULADD R11.x, R0.y, R2.x, R11.x VEC_201
y: MULADD R11.y, R0.y, R2.y, R11.y VEC_201
z: MULADD R11.z, R0.y, R2.z, R11.z VEC_201
w: MULADD R12.w, R0.x, R2.w, R12.w VEC_210
t: MULADD R11.w, R0.y, R2.w, R11.w VEC_120
181 x: MULADD R10.x, R0.z, R1.x, R10.x VEC_210
y: MULADD R10.y, R0.z, R1.y, R10.y VEC_201
z: MULADD R10.z, R0.z, R1.z, R10.z VEC_201
w: MULADD R10.w, R0.z, R1.w, R10.w VEC_201
t: MULADD R8.x, R0.w, R1.x, R8.x VEC_120
182 x: MULADD R9.x, R0.z, R2.x, R9.x VEC_210
y: MULADD R9.y, R0.z, R2.y, R9.y VEC_201
z: MULADD R9.z, R0.z, R2.z, R9.z VEC_201
w: MULADD R9.w, R0.z, R2.w, R9.w VEC_201
t: MULADD R7.x, R0.w, R2.x, R7.x VEC_120
183 x: MULADD R6.x, R23.x, R1.x, R6.x VEC_201
y: MULADD R8.y, R0.w, R1.y, R8.y
z: MULADD R8.z, R0.w, R1.z, R8.z
w: MULADD R8.w, R0.w, R1.w, R8.w
t: MULADD R6.y, R23.x, R1.y, R6.y
184 x: MULADD R3.x, R23.x, R2.x, R3.x VEC_201
y: MULADD R7.y, R0.w, R2.y, R7.y
z: MULADD R7.z, R0.w, R2.z, R7.z
w: MULADD R7.w, R0.w, R2.w, R7.w
t: MULADD R3.y, R23.x, R2.y, R3.y
185 x: MULADD R5.x, R23.y, R1.x, R5.x VEC_201
y: MULADD R5.y, R23.y, R1.y, R5.y VEC_201
z: MULADD R6.z, R23.x, R1.z, R6.z VEC_210
w: MULADD R6.w, R23.x, R1.w, R6.w VEC_201
t: MULADD R5.z, R23.y, R1.z, R5.z VEC_120
186 x: MULADD R4.x, R23.y, R2.x, R4.x VEC_201
y: MULADD R4.y, R23.y, R2.y, R4.y VEC_201
z: MULADD R3.z, R23.x, R2.z, R3.z VEC_210
w: MULADD R5.w, R23.y, R1.w, R5.w VEC_201
t: MULADD R4.z, R23.y, R2.z, R4.z VEC_120
187 w: MULADD R3.w, R23.x, R2.w, R3.w
t: MULADD R4.w, R23.y, R2.w, R4.w
188 x: MOV R25.x, R28.x
y: MOV R25.y, R28.y
z: MOV R25.z, R28.z
w: MOV R25.w, R28.w
31 ALU: ADDR(732) CNT(80)
189 x: MULADD R29.x, R23.z, R26.x, R29.x VEC_210
y: MULADD R29.y, R23.z, R26.y, R29.y VEC_201
z: MULADD R29.z, R23.z, R26.z, R29.z VEC_201
w: MULADD R29.w, R23.z, R26.w, R29.w VEC_201
t: MULADD R20.x, R23.w, R26.x, R20.x VEC_120
190 x: MULADD R21.x, R23.z, R27.x, R21.x VEC_210
y: MULADD R21.y, R23.z, R27.y, R21.y VEC_201
z: MULADD R21.z, R23.z, R27.z, R21.z VEC_201
w: MULADD R21.w, R23.z, R27.w, R21.w VEC_201
t: MULADD R19.x, R23.w, R27.x, R19.x VEC_120
191 x: MULADD R18.x, R24.x, R26.x, R18.x VEC_201
y: MULADD R20.y, R23.w, R26.y, R20.y
z: MULADD R20.z, R23.w, R26.z, R20.z
w: MULADD R20.w, R23.w, R26.w, R20.w
t: MULADD R18.y, R24.x, R26.y, R18.y
192 x: MULADD R17.x, R24.x, R27.x, R17.x VEC_201
y: MULADD R19.y, R23.w, R27.y, R19.y
z: MULADD R19.z, R23.w, R27.z, R19.z
w: MULADD R19.w, R23.w, R27.w, R19.w
t: MULADD R17.y, R24.x, R27.y, R17.y
193 x: MULADD R16.x, R24.y, R26.x, R16.x VEC_201
y: MULADD R16.y, R24.y, R26.y, R16.y VEC_201
z: MULADD R18.z, R24.x, R26.z, R18.z VEC_210
w: MULADD R18.w, R24.x, R26.w, R18.w VEC_201
t: MULADD R16.z, R24.y, R26.z, R16.z VEC_120
194 x: MULADD R15.x, R24.y, R27.x, R15.x VEC_201
y: MULADD R15.y, R24.y, R27.y, R15.y VEC_201
z: MULADD R17.z, R24.x, R27.z, R17.z VEC_210
w: MULADD R17.w, R24.x, R27.w, R17.w VEC_201
t: MULADD R15.z, R24.y, R27.z, R15.z VEC_120
195 x: MULADD R14.x, R24.z, R26.x, R14.x VEC_201
y: MULADD R14.y, R24.z, R26.y, R14.y VEC_201
z: MULADD R14.z, R24.z, R26.z, R14.z VEC_201
w: MULADD R16.w, R24.y, R26.w, R16.w VEC_210
t: MULADD R14.w, R24.z, R26.w, R14.w VEC_120
196 x: MULADD R12.x, R24.z, R27.x, R12.x VEC_201
y: MULADD R12.y, R24.z, R27.y, R12.y VEC_201
z: MULADD R12.z, R24.z, R27.z, R12.z VEC_201
w: MULADD R15.w, R24.y, R27.w, R15.w VEC_210
t: MULADD R12.w, R24.z, R27.w, R12.w VEC_120
197 x: MULADD R13.x, R24.w, R26.x, R13.x VEC_210
y: MULADD R13.y, R24.w, R26.y, R13.y VEC_201
z: MULADD R13.z, R24.w, R26.z, R13.z VEC_201
w: MULADD R13.w, R24.w, R26.w, R13.w VEC_201
t: MULADD R8.x, R25.y, R26.x, R8.x VEC_120
198 x: MULADD R11.x, R24.w, R27.x, R11.x VEC_210
y: MULADD R11.y, R24.w, R27.y, R11.y VEC_201
z: MULADD R11.z, R24.w, R27.z, R11.z VEC_201
w: MULADD R11.w, R24.w, R27.w, R11.w VEC_201
t: MULADD R7.x, R25.y, R27.x, R7.x VEC_120
199 x: MULADD R10.x, R25.x, R26.x, R10.x
y: MULADD R10.y, R25.x, R26.y, R10.y
z: MULADD R10.z, R25.x, R26.z, R10.z
w: MULADD R10.w, R25.x, R26.w, R10.w
200 x: MULADD R9.x, R25.x, R27.x, R9.x
y: MULADD R9.y, R25.x, R27.y, R9.y
z: MULADD R9.z, R25.x, R27.z, R9.z
w: MULADD R9.w, R25.x, R27.w, R9.w
201 x: MULADD R6.x, R25.z, R26.x, R6.x VEC_210
y: MULADD R8.y, R25.y, R26.y, R8.y VEC_201
z: MULADD R8.z, R25.y, R26.z, R8.z VEC_201
w: MULADD R8.w, R25.y, R26.w, R8.w VEC_201
t: MULADD R5.x, R25.w, R26.x, R5.x VEC_120
202 x: MULADD R28.x, R25.z, R27.x, R3.x VEC_210
y: MULADD R7.y, R25.y, R27.y, R7.y VEC_201
z: MULADD R7.z, R25.y, R27.z, R7.z VEC_201
w: MULADD R7.w, R25.y, R27.w, R7.w VEC_201
t: MULADD R4.x, R25.w, R27.x, R4.x VEC_120
203 y: MULADD R6.y, R25.z, R26.y, R6.y VEC_210
z: MULADD R6.z, R25.z, R26.z, R6.z VEC_201
w: MULADD R6.w, R25.z, R26.w, R6.w VEC_201
t: MULADD R5.y, R25.w, R26.y, R5.y VEC_120
204 y: MULADD R28.y, R25.z, R27.y, R3.y VEC_210
z: MULADD R5.z, R25.w, R26.z, R5.z VEC_201
w: MULADD R5.w, R25.w, R26.w, R5.w VEC_201
t: MULADD R4.y, R25.w, R27.y, R4.y VEC_120
205 z: MULADD R28.z, R25.z, R27.z, R3.z
w: MULADD R28.w, R25.z, R27.w, R3.w
206 z: MULADD R4.z, R25.w, R27.z, R4.z
w: MULADD R4.w, R25.w, R27.w, R4.w
32 ENDLOOP i0 PASS_JUMP_ADDR(3)
33 ALU: ADDR(812) CNT(20) KCACHE0(CB0:0-15)
207 t: MULLO_INT T0.z, R22.x, KC0[0].z
208 t: MULLO_INT ____, R22.y, KC0[0].w
209 w: ADD_INT ____, T0.z, PS208
210 x: ADD_INT T0.x, PV209.w, (0x00000003, 4.203895393e-45f).x
y: ADD_INT ____, PV209.w, 1
z: ADD_INT ____, PV209.w, 0.0f
w: ADD_INT T0.w, PV209.w, (0x00000002, 2.802596929e-45f).y
211 x: LSHL R0.x, PV210.z, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R0.y, PV210.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT R0.z, PV210.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT R0.w, PV210.y, (0x00000004, 5.605193857e-45f).y
t: LSHL R1.x, PV210.y, (0x00000002, 2.802596929e-45f).x
212 x: LSHL R2.x, T0.w, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R1.y, T0.x, (0x00000004, 5.605193857e-45f).y
z: ADD_INT R1.z, PV211.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT R1.w, PV211.y, (0x00000004, 5.605193857e-45f).y
t: LSHL R3.x, T0.x, (0x00000002, 2.802596929e-45f).x
34 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R29, ELEM_SIZE(3)
35 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R21, ELEM_SIZE(3)
36 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R20, ELEM_SIZE(3)
37 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R19, ELEM_SIZE(3)
38 ALU: ADDR(832) CNT(12)
213 x: LSHL R3.x, R0.y, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R0.y, R0.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT R2.z, R1.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT R0.w, R1.y, (0x00000004, 5.605193857e-45f).y VEC_120
t: LSHL R2.x, R0.w, (0x00000002, 2.802596929e-45f).x
214 x: LSHL R1.x, R0.z, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R1.y, R1.z, (0x00000004, 5.605193857e-45f).y VEC_120
z: ADD_INT R0.z, PV213.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT R2.w, PV213.y, (0x00000004, 5.605193857e-45f).y
t: LSHL R0.x, R1.y, (0x00000002, 2.802596929e-45f).x
39 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R18, ELEM_SIZE(3)
40 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R17, ELEM_SIZE(3)
41 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R16, ELEM_SIZE(3)
42 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R15, ELEM_SIZE(3)
43 ALU: ADDR(844) CNT(10)
215 x: LSHL R0.x, R1.w, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R2.y, R2.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT R1.z, R2.w, (0x00000004, 5.605193857e-45f).y VEC_120
w: ADD_INT R1.w, R1.y, (0x00000004, 5.605193857e-45f).y
t: LSHL R1.x, R1.z, (0x00000002, 2.802596929e-45f).x
216 x: LSHL R2.x, R0.y, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R0.y, R0.z, (0x00000004, 5.605193857e-45f).y
t: LSHL R3.x, R0.w, (0x00000002, 2.802596929e-45f).x
44 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R14, ELEM_SIZE(3)
45 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R12, ELEM_SIZE(3)
46 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R13, ELEM_SIZE(3)
47 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R11, ELEM_SIZE(3)
48 ALU: ADDR(854) CNT(6)
217 x: LSHL R3.x, R2.z, (0x00000002, 2.802596929e-45f).x
t: LSHL R2.x, R1.y, (0x00000002, 2.802596929e-45f).x
218 x: LSHL R1.x, R2.w, (0x00000002, 2.802596929e-45f).x
t: LSHL R0.x, R0.z, (0x00000002, 2.802596929e-45f).x
49 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R10, ELEM_SIZE(3)
50 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R9, ELEM_SIZE(3)
51 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R8, ELEM_SIZE(3)
52 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R7, ELEM_SIZE(3)
53 ALU: ADDR(860) CNT(6)
219 x: LSHL R0.x, R2.y, (0x00000002, 2.802596929e-45f).x
t: LSHL R1.x, R1.w, (0x00000002, 2.802596929e-45f).x
220 x: LSHL R2.x, R1.z, (0x00000002, 2.802596929e-45f).x
t: LSHL R3.x, R0.y, (0x00000002, 2.802596929e-45f).x
54 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R6, ELEM_SIZE(3)
55 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R28, ELEM_SIZE(3)
56 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R5, ELEM_SIZE(3)
57 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R4, ELEM_SIZE(3)
END_OF_PROGRAM
:razz: OK, now you're just showing off. That's absurd! :shock:
I count 31 registers which means you should have 8 wavefronts per SIMD (8*31*64=15,872 + 4 clause temporaries * 64 strands * 2 wavefronts = 512 makes 16,384 registers total).
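The register budget can be checked with a few lines of Python. This is only a sketch built from the figures quoted in this post (a 16,384-entry register file per SIMD, 64 strands per wavefront, 4 clause temporaries for 2 wavefronts); the function name is made up for illustration.

```python
# Register budget per SIMD, using only the figures quoted in the post.
# Assumptions taken from the post: 16,384 registers per SIMD, 64 strands per
# wavefront, and 4 clause temporaries reserved for 2 wavefronts.
REGISTER_FILE = 16_384
STRANDS_PER_WAVEFRONT = 64
CLAUSE_TEMPS = 4 * STRANDS_PER_WAVEFRONT * 2  # 512 registers


def wavefronts_per_simd(gprs_per_strand: int) -> int:
    """Wavefronts whose registers fit next to the clause temporaries."""
    available = REGISTER_FILE - CLAUSE_TEMPS
    return available // (gprs_per_strand * STRANDS_PER_WAVEFRONT)


if __name__ == "__main__":
    for gprs in (31, 32):
        print(gprs, "registers ->", wavefronts_per_simd(gprs), "wavefronts")
```

Under this simple model, 31 registers is the largest count that still leaves room for 8 wavefronts; one more register would drop it to 7.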
I get ALU:TEX of 152:36, 4.22:1, 87% ALU utilisation. The 20 MOVs (5 clocks) look like wastage, effectively making utilisation 84%.
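One reading that reproduces these percentages (it is not spelled out in the post, so treat it as an assumption) is to count ALU slots: 152 instruction groups of 5 slots each, filled by eight 8x10 outer products of MULADDs plus the 20 MOVs. The sketch ignores the handful of ADD and PREDE_INT instructions in the loop.

```python
# Slot-counting model for the loop body; an assumed reconstruction of the
# 87% / 84% figures, not something stated in the thread.
ALU_GROUPS = 152       # ALU instruction groups per loop iteration
TEX_FETCHES = 36       # SAMPLE instructions per loop iteration
SLOTS_PER_GROUP = 5    # x, y, z, w, t
MULADDS = 8 * 8 * 10   # eight 8x10 outer products
MOVS = 20

slots = ALU_GROUPS * SLOTS_PER_GROUP
alu_tex = ALU_GROUPS / TEX_FETCHES
with_movs = (MULADDS + MOVS) / slots   # MOVs counted as occupied slots
useful = MULADDS / slots               # MULADDs only

print(f"ALU:TEX        {alu_tex:.2f}:1")
print(f"with MOVs      {with_movs:.0%}")
print(f"MULADDs only   {useful:.0%}")
```

Scaling the 1200 Gflop/s peak by the rounded 84% gives the ~1008 Gflop/s figure that comes up later in the thread.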
Overall I guess HD3870 is roughly as fast as GTX285. Ouch.
Jawed
rpg.314
25-Aug-2009, 18:55
:razz: OK, now you're just showing off. That's absurd! :shock:
No really, you are actually showing off!!!:grin:
Maybe you should talk to AMD about licensing your code. You might end up having something tangible to show off then. :lol:
GREAT job though.
rpg.314
25-Aug-2009, 19:01
I assume your calculation of gigaflops here follows the 10^9 = 1 Giga rule. If so, how long is it going to be before you treat us to a 1 Tflop matrix multiplication running on a ~$200 GPU? :) :)
If he had an HD4890 at stock clocks it should do it without any trouble.
Jawed
MicahVillmow
25-Aug-2009, 21:21
prunedtree,
Congratulations on improving on our algorithm for dense-matmat mul. It is very impressive to see that people can take our code developed on older hardware and improve it past its original design. The original code was developed on the R600 hence the 8x4 design and later optimized for R670 but the original design didn't change.
riza.guntur
26-Aug-2009, 04:44
Impressive
Congratulations prunedtree
Somebody please try it on a 4890 at 1GHz :grin:
prunedtree
26-Aug-2009, 05:28
The 20 MOVs (5 clocks) look like wastage, effectively making utilisation 84%.
Actually, these MOVs are critical in order to save registers.
If I understand you right, this kernel could achieve ~1008 Gflop/s if it were not bottlenecked by L1 bandwidth?
The problem with the 8x10 layout is that we must not waste any fetches, as otherwise there would be no improvement. However, each outer product needs 10+8 scalars, which doesn't map well onto float4 fetches. So I spread the loads over (20+16) x 4 = 144 = 8 x (10+8) scalars. The 8x10 block takes 20 registers, and we need one register for texture indexing, which leaves 10 of the 31 registers. It's thus impossible to hold 36 temporary float4 values, which is why the code is scheduled to do outer products as early as possible, and needs two float4 registers to juggle the fetches around.
The pattern looks like this:
10a | 6a 4b | 10b | 2b 8c | [8c] 2d | 10d | [4d] 6e | 10e
8a [8a] 8b [8b] 8d 8d 8e 8e
The five texture clauses are labeled a-e, and brackets show the registers that are kept through a fetch. There are eight 8x10 outer products in total.
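The fetch bookkeeping can be double-checked by tallying the pattern. The sketch below transcribes the two rows as (scalar count, clause) pairs, treating bracketed entries as ordinary consumers of their clause; the variable and function names are illustrative only.

```python
from collections import Counter

# Top row: scalars consumed on the 10-wide side of each outer product.
TOP = [[(10, "a")], [(6, "a"), (4, "b")], [(10, "b")], [(2, "b"), (8, "c")],
       [(8, "c"), (2, "d")], [(10, "d")], [(4, "d"), (6, "e")], [(10, "e")]]
# Bottom row: scalars consumed on the 8-wide side.
BOTTOM = [[(8, "a")], [(8, "a")], [(8, "b")], [(8, "b")],
          [(8, "d")], [(8, "d")], [(8, "e")], [(8, "e")]]


def scalars_per_clause(row):
    """Sum the scalars each texture clause contributes to one row."""
    totals = Counter()
    for product in row:
        for count, clause in product:
            totals[clause] += count
    return totals


top = scalars_per_clause(TOP)
bottom = scalars_per_clause(BOTTOM)
# Each SAMPLE returns one float4, i.e. four scalars.
float4_fetches = {c: (top[c] + bottom[c]) // 4 for c in "abcde"}

print(float4_fetches)
print("total float4 fetches:", sum(float4_fetches.values()))
print("total scalars:", sum(top.values()) + sum(bottom.values()))
```

On this reading, every fetched scalar is used exactly once: the clauses fetch 8, 8, 4, 8 and 8 float4 values, 36 in total, which is consistent with the TEX clause sizes visible in the listing.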
I assume your calculation of gigaflops here follows the 10^9 = 1 Giga rule.
Indeed. I thought for a moment that the ~450 GB/s limit on L1 bandwidth I am measuring was due to a 2^30 = 1 Giga mistake, but that is not the case.
If so, how long is it going to be before you treat us to a 1 Tflop matrix multiplication running on a ~$200 GPU? :) :)
Well, it would be cute if that was possible (1000 Gflop/s is symbolic after all), but given the 450 GB/s limit (112.5 billion floats per second), with 8x10 blocks that give an 80/9 bandwidth reduction, this kernel can only achieve up to 112.5 x 80/9 = 1000 Gflop/s. As it's already using the whole register file, I'm afraid it's difficult to do considerably better using a similar approach.
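The bound can be written out as a small calculation. This is a sketch under the assumptions stated in the thread (single-precision values of 4 bytes, L1 fetch bandwidth as the only limit, each outer-product step fetching rows + cols scalars); the function name is hypothetical.

```python
def l1_bound_gflops(l1_gbytes_per_s: float, rows: int, cols: int) -> float:
    """Upper bound on Gflop/s when only L1 fetch bandwidth limits the kernel.

    Each step of a rows x cols block fetches rows + cols floats and performs
    rows * cols multiply-adds, counted as 2 flops each.
    """
    gfloats_per_s = l1_gbytes_per_s / 4  # 4 bytes per single-precision value
    flops_per_float = 2 * rows * cols / (rows + cols)
    return gfloats_per_s * flops_per_float


if __name__ == "__main__":
    # 8x10 blocks: 2*80/18 = 80/9 flops per fetched float.
    print(round(l1_bound_gflops(450, 8, 10)))  # measured L1 bandwidth
    print(round(l1_bound_gflops(480, 8, 10)))  # expected L1 bandwidth
```

With the measured 450 GB/s this gives 1000 Gflop/s; at the expected 480 GB/s the same block size would allow roughly 1067 Gflop/s.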
If he had an HD4890 at stock clocks it should do it without any trouble.
Jawed
That's tempting... even a very mild overclock (3%) on HD4870 could do the trick...
fuu@hydra:~$ aticonfig --adapter=3 --odgc --odgt
Adapter 3 - ATI Radeon HD 4870 X2
Core (MHz) Memory (MHz)
Current Clocks : 507 500
Current Peak : 770 900
Configurable Peak Range : [507-778] [500-980]
GPU load : 0%
Adapter 3 - ATI Radeon HD 4870 X2
Sensor 0: Temperature - 55.50 C
And there it is: 1 Teraflop/s SGEMM ^^
prunedtree,
Congratulations on improving on our algorithm for dense-matmat mul. It is very impressive to see that people can take our code developed on older hardware and improve it past its original design. The original code was developed on the R600 hence the 8x4 design and later optimized for R670 but the original design didn't change.
Thank you for these clarifications. Can you tell us where this 450 GB/s L1 bandwidth limitation (instead of the expected 480 GB/s) comes from? Is it a scheduler bottleneck?
rpg.314
26-Aug-2009, 05:54
I wonder how you partitioned a 4096x4096 matrix into 8x10 blocks? 8x8 I can understand. Did you use a different size for this?
mhouston
26-Aug-2009, 06:06
Yes, you are pushing the limits of the scheduler and the hardware, excellent work. ;-)
There may be issues hiding all of the latency since you are dropping wavefront count as you increase register usage. This may be one cause of the L1 drop off, or it could be that you aren't quite getting the L2 latency coverage and stalls are happening there. Remember that you will have some cold misses into the cache which will drop utilization and it's possible there are some conflict misses in the chain as well. If you write a simple texture cache throughput benchmark, for example everyone fetching from texel 0, you will minimize access outside of the L1 and should get very close to peaks.
prunedtree
26-Aug-2009, 06:24
I wonder how you partitioned a 4096x4096 matrix into 8x10 blocks? 8x8 I can understand. Did you use a different size for this?
You can always pad with zeros, using 4100x4096 matrices. With large matrices and small blocks it doesn't affect results much, so you can use any block size. The difficult part here is to ensure we don't waste float4 fetches.
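A quick sketch of the padding arithmetic, assuming one dimension is rounded up to a multiple of 10 and the other to a multiple of 8 (the helper name is made up):

```python
def padded(n: int, block: int) -> int:
    """Round n up to the next multiple of block."""
    return -(-n // block) * block


if __name__ == "__main__":
    n = 4096
    by_ten = padded(n, 10)   # dimension tiled in steps of 10
    by_eight = padded(n, 8)  # dimension tiled in steps of 8
    # Extra output elements computed because of the zero padding.
    overhead = (by_ten * by_eight) / (n * n) - 1
    print(by_ten, by_eight, f"{overhead:.3%}")
```

For 4096x4096 inputs only the dimension tiled by 10 grows (to 4100), so the wasted work is about 0.1%.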
Yes, you are pushing the limits of the scheduler and the hardware, excellent work. ;-)
There may be issues hiding all of the latency since you are dropping wavefront count as you increase register usage. This may be one cause of the L1 drop off, or it could be that you aren't quite getting the L2 latency coverage and stalls are happening there. Remember that you will have some cold misses into the cache which will drop utilization and it's possible there are some conflict misses in the chain as well. If you write a simple texture cache throughput benchmark, for example everyone fetching from texel 0, you will minimize access outside of the L1 and should get very close to peaks.
I've never obtained more than 450 GB/s no matter how I try, even with highly synthetic tests all fetching from the same location (and plenty of variations to avoid potential bank conflicts I wouldn't know of...) or with the samples in the SDK, which is why I'm asking.
riza.guntur
26-Aug-2009, 06:38
I wonder how different it is compared to a Core i7 using ATLAS :)
riza.guntur
26-Aug-2009, 07:53
I wonder if you could help me port my Brook+ program to CAL.
If you can use the global buffer, I'm sure it can blast off to a thousand times faster :grin:
I had a rummage and it seems I had a Brook+ 64-scalars-output MM working a few months back (pure gather/scatter, but not in CS mode). It still seems to work, as it verifies OK. I've broken something, as only the debug version compiles, so I can't produce a .exe (and I can only run on CPU anyway). I gave up because the assembly it outputs is fragmented due to a storm of IFs, though it looks like I can rearrange the code to get rid of a lot of them. Seems I got bored/cheesed-off or decided it was a blind alley and just abandoned it :razz:
Jawed
I think this colouring should highlight why the MOVs are superfluous:
188 x: MOV R25.x, R28.x
y: MOV R25.y, R28.y
z: MOV R25.z, R28.z
w: MOV R25.w, R28.w
31 ALU: ADDR(732) CNT(80)
189 x: MULADD R29.x, R23.z, R26.x, R29.x VEC_210
y: MULADD R29.y, R23.z, R26.y, R29.y VEC_201
z: MULADD R29.z, R23.z, R26.z, R29.z VEC_201
w: MULADD R29.w, R23.z, R26.w, R29.w VEC_201
t: MULADD R20.x, R23.w, R26.x, R20.x VEC_120
190 x: MULADD R21.x, R23.z, R27.x, R21.x VEC_210
y: MULADD R21.y, R23.z, R27.y, R21.y VEC_201
z: MULADD R21.z, R23.z, R27.z, R21.z VEC_201
w: MULADD R21.w, R23.z, R27.w, R21.w VEC_201
t: MULADD R19.x, R23.w, R27.x, R19.x VEC_120
191 x: MULADD R18.x, R24.x, R26.x, R18.x VEC_201
y: MULADD R20.y, R23.w, R26.y, R20.y
z: MULADD R20.z, R23.w, R26.z, R20.z
w: MULADD R20.w, R23.w, R26.w, R20.w
t: MULADD R18.y, R24.x, R26.y, R18.y
192 x: MULADD R17.x, R24.x, R27.x, R17.x VEC_201
y: MULADD R19.y, R23.w, R27.y, R19.y
z: MULADD R19.z, R23.w, R27.z, R19.z
w: MULADD R19.w, R23.w, R27.w, R19.w
t: MULADD R17.y, R24.x, R27.y, R17.y
193 x: MULADD R16.x, R24.y, R26.x, R16.x VEC_201
y: MULADD R16.y, R24.y, R26.y, R16.y VEC_201
z: MULADD R18.z, R24.x, R26.z, R18.z VEC_210
w: MULADD R18.w, R24.x, R26.w, R18.w VEC_201
t: MULADD R16.z, R24.y, R26.z, R16.z VEC_120
194 x: MULADD R15.x, R24.y, R27.x, R15.x VEC_201
y: MULADD R15.y, R24.y, R27.y, R15.y VEC_201
z: MULADD R17.z, R24.x, R27.z, R17.z VEC_210
w: MULADD R17.w, R24.x, R27.w, R17.w VEC_201
t: MULADD R15.z, R24.y, R27.z, R15.z VEC_120
195 x: MULADD R14.x, R24.z, R26.x, R14.x VEC_201
y: MULADD R14.y, R24.z, R26.y, R14.y VEC_201
z: MULADD R14.z, R24.z, R26.z, R14.z VEC_201
w: MULADD R16.w, R24.y, R26.w, R16.w VEC_210
t: MULADD R14.w, R24.z, R26.w, R14.w VEC_120
196 x: MULADD R12.x, R24.z, R27.x, R12.x VEC_201
y: MULADD R12.y, R24.z, R27.y, R12.y VEC_201
z: MULADD R12.z, R24.z, R27.z, R12.z VEC_201
w: MULADD R15.w, R24.y, R27.w, R15.w VEC_210
t: MULADD R12.w, R24.z, R27.w, R12.w VEC_120
197 x: MULADD R13.x, R24.w, R26.x, R13.x VEC_210
y: MULADD R13.y, R24.w, R26.y, R13.y VEC_201
z: MULADD R13.z, R24.w, R26.z, R13.z VEC_201
w: MULADD R13.w, R24.w, R26.w, R13.w VEC_201
t: MULADD R8.x, R25.y, R26.x, R8.x VEC_120
198 x: MULADD R11.x, R24.w, R27.x, R11.x VEC_210
y: MULADD R11.y, R24.w, R27.y, R11.y VEC_201
z: MULADD R11.z, R24.w, R27.z, R11.z VEC_201
w: MULADD R11.w, R24.w, R27.w, R11.w VEC_201
t: MULADD R7.x, R25.y, R27.x, R7.x VEC_120
199 x: MULADD R10.x, R25.x, R26.x, R10.x
y: MULADD R10.y, R25.x, R26.y, R10.y
z: MULADD R10.z, R25.x, R26.z, R10.z
w: MULADD R10.w, R25.x, R26.w, R10.w
200 x: MULADD R9.x, R25.x, R27.x, R9.x
y: MULADD R9.y, R25.x, R27.y, R9.y
z: MULADD R9.z, R25.x, R27.z, R9.z
w: MULADD R9.w, R25.x, R27.w, R9.w
201 x: MULADD R6.x, R25.z, R26.x, R6.x VEC_210
y: MULADD R8.y, R25.y, R26.y, R8.y VEC_201
z: MULADD R8.z, R25.y, R26.z, R8.z VEC_201
w: MULADD R8.w, R25.y, R26.w, R8.w VEC_201
t: MULADD R5.x, R25.w, R26.x, R5.x VEC_120
202 x: MULADD R28.x, R25.z, R27.x, R3.x VEC_210
y: MULADD R7.y, R25.y, R27.y, R7.y VEC_201
z: MULADD R7.z, R25.y, R27.z, R7.z VEC_201
w: MULADD R7.w, R25.y, R27.w, R7.w VEC_201
t: MULADD R4.x, R25.w, R27.x, R4.x VEC_120
203 y: MULADD R6.y, R25.z, R26.y, R6.y VEC_210
z: MULADD R6.z, R25.z, R26.z, R6.z VEC_201
w: MULADD R6.w, R25.z, R26.w, R6.w VEC_201
t: MULADD R5.y, R25.w, R26.y, R5.y VEC_120
204 y: MULADD R28.y, R25.z, R27.y, R3.y VEC_210
z: MULADD R5.z, R25.w, R26.z, R5.z VEC_201
w: MULADD R5.w, R25.w, R26.w, R5.w VEC_201
t: MULADD R4.y, R25.w, R27.y, R4.y VEC_120
205 z: MULADD R28.z, R25.z, R27.z, R3.z
w: MULADD R28.w, R25.z, R27.w, R3.w
206 z: MULADD R4.z, R25.w, R27.z, R4.z
w: MULADD R4.w, R25.w, R27.w, R4.w
Obviously, if the MOVs were deleted (and references to R25 changed into R28) then the operations 205:z and 205:w would need to be moved after 206:z and 206:w, and their order reversed.
But this is a minor point really: after you've wrangled the compiler so successfully, these last few percent just don't seem worth the heartache.
Oh, and to answer your earlier point, yeah around 1010GFLOPs at 750MHz is the theoretical limit with 152 cycles for 1280 FLOPs, i.e. 8.42FLOPs per clock.
Jawed
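The theoretical limit quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: the 160 five-wide VLIW units are an assumption derived from the 800 ALUs quoted elsewhere in this thread and the x/y/z/w/t slots visible in the listing, and all variable names are made up for illustration.

```python
# Back-of-envelope check of the ~1010 Gflop/s limit discussed above.
ALUS = 800            # RV770 ALU count quoted in this thread
SLOTS_PER_UNIT = 5    # x, y, z, w, t slots per VLIW unit (assumption from the listing)
CLOCK_GHZ = 0.75      # 750 MHz

CYCLES = 152          # cycles per loop iteration, from the post
FLOPS = 1280          # flops per loop iteration, from the post

units = ALUS // SLOTS_PER_UNIT            # VLIW units on the chip
flops_per_clock = FLOPS / CYCLES          # flops per clock per VLIW unit
limit_gflops = flops_per_clock * units * CLOCK_GHZ

print(f"{flops_per_clock:.2f} flops per clock per unit")
print(f"{limit_gflops:.1f} Gflop/s theoretical limit")
```

Under these assumptions, the per-clock figure is set entirely by how tightly the compiler packs the 1280 flops into instruction groups, so any MOV that costs a cycle lowers the limit directly.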
MicahVillmow
26-Aug-2009, 19:59
PrunedTree,
Would it be possible to send us your code so that we can have some fun with it here internally? :) my email address is <firstname> dot <lastname> at amd dot com.
digitalwanderer
26-Aug-2009, 20:48
Hey, don't you have some DX11 stuff to work on or something? :|
Oh, what the hey...let them have it Prune. They need some fun too. ;)
No, demand a 4890 card from them,
or if you're feeling really mean, ask them to buy you an NV card :D
oscarbg
02-Sep-2009, 02:53
Hi,
I also want the code!!
Could you send it to me at rtfss1 dot gmail.com
Thanks..
PrunedTree,
Would it be possible to send us your code so that we can have some fun with it here internally? :) my email address is <firstname> dot <lastname> at amd dot com.
rpg.314
11-Sep-2009, 04:29
Would you be porting (and optimizing to death) your code to 58xx cards anytime soon? :)
prunedtree
11-Sep-2009, 15:53
Well, unless there's a regression in architecture, this implementation could easily achieve over 2000 Gflop/s on RV870 if we believe the rumored specs (20 execution units). However, I don't plan to buy any new ATI card this year.
If someone gives me an ssh account on a Linux system with such hardware, I won't mind giving it a try though. (All the stuff in this thread was done through ssh.)
Just send me a PM.
trinibwoy
11-Sep-2009, 17:33
Isn't your implementation dependent on the current clause approach AMD has taken? It's unlikely but if their new compiler and/or scheduler were to be more dynamic how would it affect this solution? I'm assuming that you can't neatly arrange things like this on Nvidia hardware because ALU and texture instruction issue is more fine grained?
The fundamental problem with this approach on NVidia is that the per-strand state is too big. The current kernel uses 31 vec4 registers, that's 124 scalars. If you try to allocate that many on NVidia you get only 4 threads. Clearly the state can be reduced (and implemented on NVidia it may be cheaper still for architectural reasons). But as you do so the code's ALU:TEX falls. Because NVidia requires a lower ratio here, the cut-off point is lowered, so there's some slack there.
It'd be interesting to see what's possible on NVidia, using a texturing-centric approach with few threads (e.g. go for 6 or 8 threads instead). Gotta find someone who can be bothered, I guess.
Jawed
rpg.314
12-Sep-2009, 06:14
The fundamental problem with this approach on NVidia is that the per-strand state is too big. The current kernel uses 31 vec4 registers, that's 124 scalars. If you try to allocate that many on NVidia you get only 4 threads. Clearly the state can be reduced (and implemented on NVidia it may be cheaper still for architectural reasons). But as you do so the code's ALU:TEX falls. Because NVidia requires a lower ratio here, the cut-off point is lowered, so there's some slack there.
It'd be interesting to see what's possible on NVidia, using a texturing-centric approach with few threads (e.g. go for 6 or 8 threads instead). Gotta find someone who can be bothered, I guess.
Jawed
When you say threads, do you mean warps? Because 124 scalars * 4 threads = 496 registers. G80 has 8192 registers. :shock: Then why can't you have these register-heavy kernels on NV?
Also, Volkov, iirc, used 2 warps = 64 threads in his sgemm code.
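The register arithmetic in this exchange can be sketched as follows. It assumes the 8192 figure is per multiprocessor and ignores any allocation granularity the hardware may impose; the constant names are hypothetical and only illustrate the division being discussed.

```python
# Rough register budget for a 31-vec4-register kernel on G80 (sketch only).
REGISTERS_PER_MULTIPROCESSOR = 8192  # G80 figure quoted above (assumed per multiprocessor)
SCALARS_PER_THREAD = 31 * 4          # 31 vec4 registers = 124 scalars
WARP_SIZE = 32                       # threads per NVIDIA warp

max_threads = REGISTERS_PER_MULTIPROCESSOR // SCALARS_PER_THREAD
full_warps = max_threads // WARP_SIZE

print(f"{SCALARS_PER_THREAD} scalars per thread")
print(f"{max_threads} threads fit, i.e. {full_warps} full warps")
```

Read this way, the budget allows a couple of warps rather than a handful of individual threads, which would be consistent with the question of whether "threads" was meant as "warps".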
Enforcer
17-Sep-2009, 02:04
The result ? I measure 880 Gflop/s for 4096x4096 dense matrix-matrix products.
Amazing work!
I never fully believed that all that tremendous (1:1 MUL:ADD) computational power of AMD cards could not be utilized in sgemm. Those 300-500 Gflops numbers just didn't fit my own vision of the world.
I guess this particular approach would not work for NVIDIA cards. The TUs aren't fast enough for (just!) 8x bandwidth reduction. I have to check to be 100% sure. (FP16 fetches could be 2x faster!)
I'm using VVolkov's code now for experiments with (fully connected) neural networks (~460 Gflops on a VMOD'ed gt200b). It's quite spectacular that a GPU can run an ANN at close to the peak theoretical performance rate.
compres
16-Nov-2009, 18:40
I think the Chinese may have a job for you :D
http://www.brightsideofnews.com/news/2009/11/3/gpgpu-start-to-take-over-the-hpc-sector-5600-ati-gpus-deployed-in-china.aspx
prunedtree
17-Dec-2009, 08:21
Would you be porting (and optimizing to death) your code to 58xx cards anytime soon? :)
It turns out that Santa was a bit early this year ^^
It seems that RV870 is quite similar to RV770, although I've noticed a few potentially significant differences (lower L1/L2/memory latencies, 16 fetch4 per clause) that might allow one to get away with even fewer running threads, and perhaps get even higher utilization on RV870 than on RV770.
For now, I've just experimented with my RV770 kernels, and I get perfect scaling in the case of SGEMM:
RV770 with 800 ALUs @ 750 MHz: 980 Gflop/s (81% utilization)
RV870 with 1600 ALUs @ 850 MHz: 2220 Gflop/s (81% utilization)
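The utilization figures can be checked with a short calculation. This sketch assumes peak single precision throughput is one multiply-add (2 flops) per ALU per clock; the helper names are hypothetical.

```python
# Single precision utilization from ALU count, clock and measured rate (sketch).
def peak_gflops(alus, clock_ghz, flops_per_alu_per_clock=2):
    """Peak rate assuming one multiply-add (2 flops) per ALU per clock."""
    return alus * flops_per_alu_per_clock * clock_ghz

def utilization(measured_gflops, alus, clock_ghz):
    """Fraction of the assumed peak that the measured rate represents."""
    return measured_gflops / peak_gflops(alus, clock_ghz)

rv770 = utilization(980, alus=800, clock_ghz=0.75)
rv870 = utilization(2220, alus=1600, clock_ghz=0.85)

print(f"RV770: {rv770:.1%} of {peak_gflops(800, 0.75):.0f} Gflop/s")
print(f"RV870: {rv870:.1%} of {peak_gflops(1600, 0.85):.0f} Gflop/s")
```

Both chips land at nearly the same fraction of peak under this model, which is what "perfect scaling" amounts to here.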
Nice, I thought you would start becoming (memory) bandwidth limited again.
rpg.314
17-Dec-2009, 13:12
It turns out that Santa was a bit early this year ^^
It seems that RV870 is quite similar to RV770, although I've noticed a few potentially significant differences (lower L1/L2/memory latencies, 16 fetch4 per clause) that might allow one to get away with even fewer running threads, and perhaps get even higher utilization on RV870 than on RV770.
For now, I've just experimented with my RV770 kernels, and I get perfect scaling in the case of SGEMM:
RV770 with 800 ALUs @ 750 MHz: 980 Gflop/s (81% utilization)
RV870 with 1600 ALUs @ 850 MHz: 2220 Gflop/s (81% utilization)
What is the situation with the shared memory? Just how fast is that on rv870?
prunedtree
17-Dec-2009, 15:17
What is the situation with the shared memory? Just how fast is that on rv870?
It does not really matter for SGEMM: Optimizing a little for RV870, I managed to reach up to 1083 GB/s (L1 fetch bandwidth peaks at 1088 GB/s afaik) with 12x8 blocks (that puts the TMU bottleneck at 2.6 TFlop/s), yet this only achieves 2.17 TFlop/s in practice: the ALU becomes the bottleneck.
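The 2.6 TFlop/s texture bound can be reproduced with a simple model. This is a sketch, not necessarily the exact reasoning behind the post: it assumes each step along the inner dimension fetches m + n single precision values for an m x n block and performs m * n multiply-adds, and the function name is hypothetical.

```python
# Texture-fetch bound on SGEMM throughput for an m x n register block (sketch).
def texture_bound_gflops(block_m, block_n, fetch_gbytes_per_s, bytes_per_value=4):
    """Upper bound on Gflop/s when fetch bandwidth is the only limit."""
    values_per_s = fetch_gbytes_per_s / bytes_per_value   # G values per second
    flops_per_step = 2 * block_m * block_n                # one multiply-add = 2 flops
    values_per_step = block_m + block_n                   # inputs fetched per step
    return values_per_s * flops_per_step / values_per_step

bound = texture_bound_gflops(12, 8, fetch_gbytes_per_s=1088)
print(f"12x8 blocks at 1088 GB/s: {bound / 1000:.2f} TFlop/s")
```

Since the measured 2.17 TFlop/s sits well below this bound, the model agrees with the conclusion that fetch bandwidth is no longer the limiting factor.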
Playing around further, it seems very difficult to significantly improve ALU utilization past 80% when you only have multiply-adds and a limited number of registers. The compiler is part of the limitation here, and it might be possible to extract a little more performance by scheduling the machine code by hand. Part of the difficulty lies in the fact that the architecture can only access 12 scalars in the register file per clock, while five multiply-add instructions would consume 15 scalars, thus you need to exploit common operands between multiply-add instructions.
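To make the operand-sharing point concrete, here is a small sketch that counts the distinct source scalars in one five-wide instruction group. The group text is copied from group 189 of the RV770 MULADD listing quoted earlier in this thread; the parsing helper is hypothetical, and the check is a simplification that ignores the per-channel read port and bank swizzle (VEC_xxx) rules.

```python
import re

# Instruction group 189 from the RV770 listing earlier in the thread.
GROUP_189 = """
x: MULADD R29.x, R23.z, R26.x, R29.x
y: MULADD R29.y, R23.z, R26.y, R29.y
z: MULADD R29.z, R23.z, R26.z, R29.z
w: MULADD R29.w, R23.z, R26.w, R29.w
t: MULADD R20.x, R23.w, R26.x, R20.x
"""

def source_scalars(group_text):
    """Return (total source operands, set of distinct source scalars)."""
    total = 0
    distinct = set()
    for line in group_text.strip().splitlines():
        regs = re.findall(r"R\d+\.[xyzw]", line)
        sources = regs[1:]            # the first register is the destination
        total += len(sources)
        distinct.update(sources)
    return total, distinct

total, distinct = source_scalars(GROUP_189)
print(f"{total} source operands, {len(distinct)} distinct scalars")
print("fits in 12 reads per clock:", len(distinct) <= 12)
```

The group only fits because R23.z is shared by four of the multiply-adds and R26.x by two, which is the kind of common operand reuse described above.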
OpenGL guy
17-Dec-2009, 15:47
It does not really matter for SGEMM: Optimizing a little for RV870, I managed to reach up to 1083 GB/s (theoretical peak at 1088 GB/s) with 12x8 blocks (that puts the TMU bottleneck at 2.6 TFlop/s), yet this only achieves 2.17 TFlop/s in practice: the ALU becomes the bottleneck.
Theoretical LDS bandwidth is much higher than that.
rpg.314
17-Dec-2009, 16:31
Theoretical LDS bandwidth is much higher than that.
What is it, in terms of floats per clock per SIMD?
trinibwoy
17-Dec-2009, 19:29
Theoretical LDS bandwidth is much higher than that.
Well that's one thing. The other is that any widespread implementation of this isn't going to rely on the texture units. So it's probably a good exercise to use the LDS to get a more practical view of how things will work out. It's interesting that nobody has stepped up with an OpenCL BLAS implementation though - Nvidia surely has no motivation to do it. Will AMD step up?
prunedtree
18-Dec-2009, 05:56
Small update, for completeness. I implemented DGEMM (RV870 starts to have decent double precision performance after all) and as expected (2x higher bandwidth/flop in double precision) it's very easy to achieve high performance, although squeezing a little more required some effort.
It's rather similar to SGEMM, I'm using 8x8 blocks, with 6 threads of 41 registers.
Here's the code: (RV870 ISA)
00 ALU: ADDR(64) CNT(14) KCACHE0(CB0:0-15)
0 x: LSHL ____, R0.y, (0x00000006, 8.407790786e-45f).x
y: LSHL ____, R0.z, (0x00000006, 8.407790786e-45f).x
w: MOV R0.w, 0.0f
t: RCP_UINT R1.w, KC0[0].z
1 x: MOV R8.x, 0.0f
y: MOV R8.y, 0.0f
z: MOV R8.z, 0.0f
w: ADD_INT ____, PV0.y, PV0.x
t: MULLO_UINT R2.y, KC0[0].z, PS0
2 x: SUB_INT R0.x, 0.0f, PS1
z: ADD_INT R0.z, R0.x, PV1.w
w: MOV R8.w, 0.0f
t: MULHI_UINT R2.x, KC0[0].z, R1.w
01 TEX: ADDR(592) CNT(1)
3 VFETCH R3.xy__, R0.w, fc147 MEGA(8)
FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(78) CNT(83) KCACHE0(CB0:0-15)
4 x: MOV R9.x, 0.0f
y: MOV R9.y, 0.0f
z: MOV R9.z, 0.0f
w: CNDE_INT T0.w, R2.x, R0.x, R2.y VEC_021
t: MULLO_UINT ____, R1.z, R3.x
5 x: MOV R10.x, 0.0f
y: MOV R10.y, 0.0f
z: MOV R10.z, 0.0f
w: MOV R9.w, 0.0f
t: MULLO_UINT T0.x, PS4, R3.y
6 x: MOV R11.x, 0.0f
y: MOV R11.y, 0.0f
z: MOV R11.z, 0.0f
w: MOV R10.w, 0.0f
t: MULLO_UINT ____, R1.y, R3.x
7 x: MOV R12.x, 0.0f
y: MOV R12.y, 0.0f
z: MOV R12.z, 0.0f
w: ADD_INT ____, T0.x, PS6
t: MULHI_UINT ____, T0.w, R1.w
8 x: SUB_INT ____, R1.w, PS7
y: ADD_INT ____, R1.x, PV7.w
z: ADD_INT ____, R1.w, PS7
w: MOV R11.w, 0.0f
t: MOV R12.w, 0.0f
9 x: LSHL ____, PV8.y, (0x00000006, 8.407790786e-45f).x
y: MOV R13.y, 0.0f
z: MOV R13.z, 0.0f
w: CNDE_INT T0.w, R2.x, PV8.z, PV8.x
t: MOV R13.x, 0.0f
10 x: MOV R14.x, 0.0f
y: MOV R14.y, 0.0f
z: ADD_INT T0.z, R0.z, PV9.x
w: MOV R13.w, 0.0f
t: MOV R14.z, 0.0f
11 x: MOV R15.x, 0.0f
y: MOV R15.y, 0.0f
z: MOV R15.z, 0.0f
w: MOV R14.w, 0.0f
t: MULHI_UINT T0.y, T0.w, PV10.z
12 x: ADD_INT T0.x, -1, PS11
y: ADD_INT T1.y, PS11, 1
z: MOV R16.z, -1.0f
w: MOV R15.w, 0.0f
t: MULLO_UINT ____, PS11, KC0[0].z
13 x: MOV R17.x, 0.0f
y: MOV R17.y, 0.0f
z: SUB_INT T0.z, T0.z, PS12
w: SETGE_UINT T1.w, T0.z, PS12
t: MOV R17.z, 0.0f
14 x: SETGE_UINT ____, PV13.z, KC0[0].z
w: SUB_INT T0.w, PV13.z, KC0[0].z
t: MOV R17.w, 0.0f
15 x: MOV R18.x, 0.0f
y: MOV R18.y, 0.0f
z: AND_INT ____, T1.w, PV14.x
w: MOV R18.w, 0.0f
t: MOV R18.z, 0.0f
16 x: MOV R19.x, 0.0f
y: CNDE_INT T0.y, PV15.z, T0.z, T0.w
z: MOV R19.z, 0.0f
w: CNDE_INT R123.w, PV15.z, T0.y, T1.y
t: MOV R19.y, 0.0f
17 x: CNDE_INT R123.x, T1.w, T0.x, PV16.w
y: ADD_INT ____, KC0[0].z, PV16.y
w: MOV R19.w, 0.0f
t: MOV R20.x, 0.0f
18 x: CNDE_INT R123.x, KC0[0].z, -1, PV17.x
y: MOV R20.y, 0.0f
z: CNDE_INT R123.z, T1.w, PV17.y, T0.y
w: MOV R20.w, 0.0f
t: MOV R20.z, 0.0f
19 x: MOV R21.x, 0.0f
y: MOV R21.y, 0.0f
z: MOV R21.z, 0.0f
w: CNDE_INT R123.w, KC0[0].z, -1, PV18.z
t: I_TO_F R2.y, PV18.x
20 x: MOV R22.x, 0.0f
y: MOV R22.y, 0.0f
z: MOV R22.z, 0.0f
w: MOV R21.w, 0.0f
t: I_TO_F R2.x, PV19.w
03 TEX: ADDR(594) CNT(1)
21 SAMPLE R16.xy__, R2.xy0x, t8, s8 UNNORM(XYZW)
04 ALU: ADDR(161) CNT(73)
22 x: MOV R23.x, 0.0f
y: MOV R23.y, 0.0f
z: MOV R23.z, 0.0f
w: MOV R22.w, 0.0f
t: MOV R23.w, 0.0f
23 x: MOV R24.x, 0.0f
y: MOV R24.y, 0.0f
z: MOV R24.z, 0.0f
w: MOV R24.w, 0.0f
t: MOV R25.x, 0.0f
24 x: MOV R26.x, 0.0f
y: MOV R25.y, 0.0f
z: MOV R25.z, 0.0f
w: MOV R25.w, 0.0f
t: MOV R26.y, 0.0f
25 x: MOV R27.x, 0.0f
y: MOV R27.y, 0.0f
z: MOV R26.z, 0.0f
w: MOV R26.w, 0.0f
t: MOV R27.z, 0.0f
26 x: MOV R28.x, 0.0f
y: MOV R28.y, 0.0f
z: MOV R28.z, 0.0f
w: MOV R27.w, 0.0f
t: MOV R28.w, 0.0f
27 x: MOV R29.x, 0.0f
y: MOV R29.y, 0.0f
z: MOV R29.z, 0.0f
w: MOV R29.w, 0.0f
t: MOV R30.x, 0.0f
28 x: MOV R31.x, 0.0f
y: MOV R30.y, 0.0f
z: MOV R30.z, 0.0f
w: MOV R30.w, 0.0f
t: MOV R31.y, 0.0f
29 x: MOV R32.x, 0.0f
y: MOV R32.y, 0.0f
z: MOV R31.z, 0.0f
w: MOV R31.w, 0.0f
t: MOV R32.z, 0.0f
30 x: MOV R33.x, 0.0f
y: MOV R33.y, 0.0f
z: MOV R33.z, 0.0f
w: MOV R32.w, 0.0f
t: MOV R33.w, 0.0f
31 x: MOV R34.x, 0.0f
y: MOV R34.y, 0.0f
z: MOV R34.z, 0.0f
w: MOV R34.w, 0.0f
t: MOV R35.x, 0.0f
32 x: MOV R36.x, 0.0f
y: MOV R35.y, 0.0f
z: MOV R35.z, 0.0f
w: MOV R35.w, 0.0f
t: MOV R36.y, 0.0f
33 x: MOV R37.x, 0.0f
y: MOV R37.y, 0.0f
z: MOV R36.z, 0.0f
w: MOV R36.w, 0.0f
t: MOV R37.z, 0.0f
34 x: MOV R38.x, 0.0f
y: MOV R38.y, 0.0f
z: MOV R38.z, 0.0f
w: MOV R37.w, 0.0f
t: MOV R38.w, 0.0f
35 x: MOV R39.x, 0.0f
y: MOV R39.y, 0.0f
z: MOV R39.z, 0.0f
w: MOV R39.w, 0.0f
t: MOV R40.x, 0.0f
36 y: MOV R40.y, 0.0f
z: MOV R40.z, 0.0f
w: MOV R40.w, 0.0f
05 LOOP_DX10 i0 FAIL_JUMP_ADDR(12)
06 ALU_BREAK: ADDR(234) CNT(4) KCACHE0(CB0:0-15)
37 x: ADD R16.x, KC0[0].y, R16.x
y: ADD R16.y, KC0[0].y, R16.y
z: ADD R16.z, R16.z, 1.0f
38 x: PREDGT ____, KC0[0].x, R16.z UPDATE_EXEC_MASK UPDATE_PRED
07 TEX: ADDR(596) CNT(8)
39 SAMPLE R0, R16.xz0x, t0, s0 UNNORM(XYZW)
40 SAMPLE R1, R16.xz0x, t1, s1 UNNORM(XYZW)
41 SAMPLE R2, R16.xz0x, t2, s2 UNNORM(XYZW)
42 SAMPLE R6, R16.xz0x, t3, s3 UNNORM(XYZW)
43 SAMPLE R3, R16.yz0y, t4, s4 UNNORM(XYZW)
44 SAMPLE R4, R16.yz0y, t5, s5 UNNORM(XYZW)
45 SAMPLE R5, R16.yz0y, t6, s6 UNNORM(XYZW)
46 SAMPLE R7, R16.yz0y, t7, s7 UNNORM(XYZW)
08 ALU: ADDR(238) CNT(124)
47 x: FMA_64 R18.x, R0.y, R3.y, R18.y
y: FMA_64 R18.y, R0.y, R3.y, R18.y
z: FMA_64 R123.z, R0.y, R3.y, R18.y
w: FMA_64 R123.w, R0.x, R3.x, R18.x
48 x: FMA_64 R123.x, R0.y, R3.w, R18.w
y: FMA_64 R123.y, R0.y, R3.w, R18.w
z: FMA_64 R18.z, R0.y, R3.w, R18.w
w: FMA_64 R18.w, R0.x, R3.z, R18.z
49 x: FMA_64 R20.x, R0.y, R4.y, R20.y
y: FMA_64 R20.y, R0.y, R4.y, R20.y
z: FMA_64 R123.z, R0.y, R4.y, R20.y
w: FMA_64 R123.w, R0.x, R4.x, R20.x
50 x: FMA_64 R123.x, R0.y, R4.w, R20.w
y: FMA_64 R123.y, R0.y, R4.w, R20.w
z: FMA_64 R20.z, R0.y, R4.w, R20.w
w: FMA_64 R20.w, R0.x, R4.z, R20.z
51 x: FMA_64 R22.x, R0.y, R5.y, R22.y
y: FMA_64 R22.y, R0.y, R5.y, R22.y
z: FMA_64 R123.z, R0.y, R5.y, R22.y
w: FMA_64 R123.w, R0.x, R5.x, R22.x
52 x: FMA_64 R123.x, R0.y, R5.w, R22.w
y: FMA_64 R123.y, R0.y, R5.w, R22.w
z: FMA_64 R22.z, R0.y, R5.w, R22.w
w: FMA_64 R22.w, R0.x, R5.z, R22.z
53 x: FMA_64 R24.x, R0.y, R7.y, R24.y
y: FMA_64 R24.y, R0.y, R7.y, R24.y
z: FMA_64 R123.z, R0.y, R7.y, R24.y
w: FMA_64 R123.w, R0.x, R7.x, R24.x
54 x: FMA_64 R123.x, R0.y, R7.w, R24.w
y: FMA_64 R123.y, R0.y, R7.w, R24.w
z: FMA_64 R24.z, R0.y, R7.w, R24.w
w: FMA_64 R24.w, R0.x, R7.z, R24.z
55 x: FMA_64 R38.x, R0.w, R3.y, R38.y
y: FMA_64 R38.y, R0.w, R3.y, R38.y
z: FMA_64 R123.z, R0.w, R3.y, R38.y
w: FMA_64 R123.w, R0.z, R3.x, R38.x
56 x: FMA_64 R123.x, R0.w, R3.w, R38.w
y: FMA_64 R123.y, R0.w, R3.w, R38.w
z: FMA_64 R38.z, R0.w, R3.w, R38.w
w: FMA_64 R38.w, R0.z, R3.z, R38.z
57 x: FMA_64 R40.x, R0.w, R4.y, R40.y
y: FMA_64 R40.y, R0.w, R4.y, R40.y
z: FMA_64 R123.z, R0.w, R4.y, R40.y
w: FMA_64 R123.w, R0.z, R4.x, R40.x
58 x: FMA_64 R123.x, R0.w, R4.w, R40.w
y: FMA_64 R123.y, R0.w, R4.w, R40.w
z: FMA_64 R40.z, R0.w, R4.w, R40.w
w: FMA_64 R40.w, R0.z, R4.z, R40.z
59 x: FMA_64 R9.x, R0.w, R5.y, R9.y
y: FMA_64 R9.y, R0.w, R5.y, R9.y
z: FMA_64 R123.z, R0.w, R5.y, R9.y
w: FMA_64 R123.w, R0.z, R5.x, R9.x
60 x: FMA_64 R123.x, R0.w, R5.w, R9.w
y: FMA_64 R123.y, R0.w, R5.w, R9.w
z: FMA_64 R9.z, R0.w, R5.w, R9.w
w: FMA_64 R9.w, R0.z, R5.z, R9.z
61 x: FMA_64 R11.x, R0.w, R7.y, R11.y
y: FMA_64 R11.y, R0.w, R7.y, R11.y
z: FMA_64 R123.z, R0.w, R7.y, R11.y
w: FMA_64 R123.w, R0.z, R7.x, R11.x
62 x: FMA_64 R123.x, R0.w, R7.w, R11.w
y: FMA_64 R123.y, R0.w, R7.w, R11.w
z: FMA_64 R11.z, R0.w, R7.w, R11.w
w: FMA_64 R11.w, R0.z, R7.z, R11.z
63 x: FMA_64 R26.x, R1.y, R3.y, R26.y
y: FMA_64 R26.y, R1.y, R3.y, R26.y
z: FMA_64 R123.z, R1.y, R3.y, R26.y
w: FMA_64 R123.w, R1.x, R3.x, R26.x
64 x: FMA_64 R123.x, R1.y, R3.w, R26.w
y: FMA_64 R123.y, R1.y, R3.w, R26.w
z: FMA_64 R26.z, R1.y, R3.w, R26.w
w: FMA_64 R26.w, R1.x, R3.z, R26.z
65 x: FMA_64 R28.x, R1.y, R4.y, R28.y
y: FMA_64 R28.y, R1.y, R4.y, R28.y
z: FMA_64 R123.z, R1.y, R4.y, R28.y
w: FMA_64 R123.w, R1.x, R4.x, R28.x
66 x: FMA_64 R123.x, R1.y, R4.w, R28.w
y: FMA_64 R123.y, R1.y, R4.w, R28.w
z: FMA_64 R28.z, R1.y, R4.w, R28.w
w: FMA_64 R28.w, R1.x, R4.z, R28.z
67 x: FMA_64 R30.x, R1.y, R5.y, R30.y
y: FMA_64 R30.y, R1.y, R5.y, R30.y
z: FMA_64 R123.z, R1.y, R5.y, R30.y
w: FMA_64 R123.w, R1.x, R5.x, R30.x
68 x: FMA_64 R123.x, R1.y, R5.w, R30.w
y: FMA_64 R123.y, R1.y, R5.w, R30.w
z: FMA_64 R30.z, R1.y, R5.w, R30.w
w: FMA_64 R30.w, R1.x, R5.z, R30.z
69 x: FMA_64 R32.x, R1.y, R7.y, R32.y
y: FMA_64 R32.y, R1.y, R7.y, R32.y
z: FMA_64 R123.z, R1.y, R7.y, R32.y
w: FMA_64 R123.w, R1.x, R7.x, R32.x
70 x: FMA_64 R123.x, R1.y, R7.w, R32.w
y: FMA_64 R123.y, R1.y, R7.w, R32.w
z: FMA_64 R32.z, R1.y, R7.w, R32.w
w: FMA_64 R32.w, R1.x, R7.z, R32.z
71 x: FMA_64 R13.x, R1.w, R3.y, R13.y
y: FMA_64 R13.y, R1.w, R3.y, R13.y
z: FMA_64 R123.z, R1.w, R3.y, R13.y
w: FMA_64 R123.w, R1.z, R3.x, R13.x
72 x: FMA_64 R123.x, R1.w, R3.w, R13.w
y: FMA_64 R123.y, R1.w, R3.w, R13.w
z: FMA_64 R13.z, R1.w, R3.w, R13.w
w: FMA_64 R13.w, R1.z, R3.z, R13.z
73 x: FMA_64 R15.x, R1.w, R4.y, R15.y
y: FMA_64 R15.y, R1.w, R4.y, R15.y
z: FMA_64 R123.z, R1.w, R4.y, R15.y
w: FMA_64 R123.w, R1.z, R4.x, R15.x
74 x: FMA_64 R123.x, R1.w, R4.w, R15.w
y: FMA_64 R123.y, R1.w, R4.w, R15.w
z: FMA_64 R15.z, R1.w, R4.w, R15.w
w: FMA_64 R15.w, R1.z, R4.z, R15.z
75 x: FMA_64 R17.x, R1.w, R5.y, R17.y
y: FMA_64 R17.y, R1.w, R5.y, R17.y
z: FMA_64 R123.z, R1.w, R5.y, R17.y
w: FMA_64 R123.w, R1.z, R5.x, R17.x
76 x: FMA_64 R123.x, R1.w, R5.w, R17.w
y: FMA_64 R123.y, R1.w, R5.w, R17.w
z: FMA_64 R17.z, R1.w, R5.w, R17.w
w: FMA_64 R17.w, R1.z, R5.z, R17.z
77 x: FMA_64 R19.x, R1.w, R7.y, R19.y
y: FMA_64 R19.y, R1.w, R7.y, R19.y
z: FMA_64 R123.z, R1.w, R7.y, R19.y
w: FMA_64 R123.w, R1.z, R7.x, R19.x
09 ALU: ADDR(362) CNT(124)
78 x: FMA_64 R123.x, R1.w, R7.w, R19.w
y: FMA_64 R123.y, R1.w, R7.w, R19.w
z: FMA_64 R19.z, R1.w, R7.w, R19.w
w: FMA_64 R19.w, R1.z, R7.z, R19.z
79 x: FMA_64 R34.x, R2.y, R3.y, R34.y
y: FMA_64 R34.y, R2.y, R3.y, R34.y
z: FMA_64 R123.z, R2.y, R3.y, R34.y
w: FMA_64 R123.w, R2.x, R3.x, R34.x
80 x: FMA_64 R123.x, R2.y, R3.w, R34.w
y: FMA_64 R123.y, R2.y, R3.w, R34.w
z: FMA_64 R34.z, R2.y, R3.w, R34.w
w: FMA_64 R34.w, R2.x, R3.z, R34.z
81 x: FMA_64 R36.x, R2.y, R4.y, R36.y
y: FMA_64 R36.y, R2.y, R4.y, R36.y
z: FMA_64 R123.z, R2.y, R4.y, R36.y
w: FMA_64 R123.w, R2.x, R4.x, R36.x
82 x: FMA_64 R123.x, R2.y, R4.w, R36.w
y: FMA_64 R123.y, R2.y, R4.w, R36.w
z: FMA_64 R36.z, R2.y, R4.w, R36.w
w: FMA_64 R36.w, R2.x, R4.z, R36.z
83 x: FMA_64 R37.x, R2.y, R5.y, R37.y
y: FMA_64 R37.y, R2.y, R5.y, R37.y
z: FMA_64 R123.z, R2.y, R5.y, R37.y
w: FMA_64 R123.w, R2.x, R5.x, R37.x
84 x: FMA_64 R123.x, R2.y, R5.w, R37.w
y: FMA_64 R123.y, R2.y, R5.w, R37.w
z: FMA_64 R37.z, R2.y, R5.w, R37.w
w: FMA_64 R37.w, R2.x, R5.z, R37.z
85 x: FMA_64 R39.x, R2.y, R7.y, R39.y
y: FMA_64 R39.y, R2.y, R7.y, R39.y
z: FMA_64 R123.z, R2.y, R7.y, R39.y
w: FMA_64 R123.w, R2.x, R7.x, R39.x
86 x: FMA_64 R123.x, R2.y, R7.w, R39.w
y: FMA_64 R123.y, R2.y, R7.w, R39.w
z: FMA_64 R39.z, R2.y, R7.w, R39.w
w: FMA_64 R39.w, R2.x, R7.z, R39.z
87 x: FMA_64 R21.x, R2.w, R3.y, R21.y
y: FMA_64 R21.y, R2.w, R3.y, R21.y
z: FMA_64 R123.z, R2.w, R3.y, R21.y
w: FMA_64 R123.w, R2.z, R3.x, R21.x
88 x: FMA_64 R123.x, R2.w, R3.w, R21.w
y: FMA_64 R123.y, R2.w, R3.w, R21.w
z: FMA_64 R21.z, R2.w, R3.w, R21.w
w: FMA_64 R21.w, R2.z, R3.z, R21.z
89 x: FMA_64 R23.x, R2.w, R4.y, R23.y
y: FMA_64 R23.y, R2.w, R4.y, R23.y
z: FMA_64 R123.z, R2.w, R4.y, R23.y
w: FMA_64 R123.w, R2.z, R4.x, R23.x
90 x: FMA_64 R123.x, R2.w, R4.w, R23.w
y: FMA_64 R123.y, R2.w, R4.w, R23.w
z: FMA_64 R23.z, R2.w, R4.w, R23.w
w: FMA_64 R23.w, R2.z, R4.z, R23.z
91 x: FMA_64 R25.x, R2.w, R5.y, R25.y
y: FMA_64 R25.y, R2.w, R5.y, R25.y
z: FMA_64 R123.z, R2.w, R5.y, R25.y
w: FMA_64 R123.w, R2.z, R5.x, R25.x
92 x: FMA_64 R123.x, R2.w, R5.w, R25.w
y: FMA_64 R123.y, R2.w, R5.w, R25.w
z: FMA_64 R25.z, R2.w, R5.w, R25.w
w: FMA_64 R25.w, R2.z, R5.z, R25.z
93 x: FMA_64 R27.x, R2.w, R7.y, R27.y
y: FMA_64 R27.y, R2.w, R7.y, R27.y
z: FMA_64 R123.z, R2.w, R7.y, R27.y
w: FMA_64 R123.w, R2.z, R7.x, R27.x
94 x: FMA_64 R123.x, R2.w, R7.w, R27.w
y: FMA_64 R123.y, R2.w, R7.w, R27.w
z: FMA_64 R27.z, R2.w, R7.w, R27.w
w: FMA_64 R27.w, R2.z, R7.z, R27.z
95 x: FMA_64 R8.x, R6.y, R3.y, R8.y
y: FMA_64 R8.y, R6.y, R3.y, R8.y
z: FMA_64 R123.z, R6.y, R3.y, R8.y
w: FMA_64 R123.w, R6.x, R3.x, R8.x
96 x: FMA_64 R123.x, R6.y, R3.w, R8.w
y: FMA_64 R123.y, R6.y, R3.w, R8.w
z: FMA_64 R8.z, R6.y, R3.w, R8.w
w: FMA_64 R8.w, R6.x, R3.z, R8.z
97 x: FMA_64 R10.x, R6.y, R4.y, R10.y
y: FMA_64 R10.y, R6.y, R4.y, R10.y
z: FMA_64 R123.z, R6.y, R4.y, R10.y
w: FMA_64 R123.w, R6.x, R4.x, R10.x
98 x: FMA_64 R123.x, R6.y, R4.w, R10.w
y: FMA_64 R123.y, R6.y, R4.w, R10.w
z: FMA_64 R10.z, R6.y, R4.w, R10.w
w: FMA_64 R10.w, R6.x, R4.z, R10.z
99 x: FMA_64 R12.x, R6.y, R5.y, R12.y
y: FMA_64 R12.y, R6.y, R5.y, R12.y
z: FMA_64 R123.z, R6.y, R5.y, R12.y
w: FMA_64 R123.w, R6.x, R5.x, R12.x
100 x: FMA_64 R123.x, R6.y, R5.w, R12.w
y: FMA_64 R123.y, R6.y, R5.w, R12.w
z: FMA_64 R12.z, R6.y, R5.w, R12.w
w: FMA_64 R12.w, R6.x, R5.z, R12.z
101 x: FMA_64 R14.x, R6.y, R7.y, R14.y
y: FMA_64 R14.y, R6.y, R7.y, R14.y
z: FMA_64 R123.z, R6.y, R7.y, R14.y
w: FMA_64 R123.w, R6.x, R7.x, R14.x
102 x: FMA_64 R123.x, R6.y, R7.w, R14.w
y: FMA_64 R123.y, R6.y, R7.w, R14.w
z: FMA_64 R14.z, R6.y, R7.w, R14.w
w: FMA_64 R14.w, R6.x, R7.z, R14.z
103 x: FMA_64 R29.x, R6.w, R3.y, R29.y
y: FMA_64 R29.y, R6.w, R3.y, R29.y
z: FMA_64 R123.z, R6.w, R3.y, R29.y
w: FMA_64 R123.w, R6.z, R3.x, R29.x
104 x: FMA_64 R123.x, R6.w, R3.w, R29.w
y: FMA_64 R123.y, R6.w, R3.w, R29.w
z: FMA_64 R29.z, R6.w, R3.w, R29.w
w: FMA_64 R29.w, R6.z, R3.z, R29.z
105 x: FMA_64 R31.x, R6.w, R4.y, R31.y
y: FMA_64 R31.y, R6.w, R4.y, R31.y
z: FMA_64 R123.z, R6.w, R4.y, R31.y
w: FMA_64 R123.w, R6.z, R4.x, R31.x
106 x: FMA_64 R123.x, R6.w, R4.w, R31.w
y: FMA_64 R123.y, R6.w, R4.w, R31.w
z: FMA_64 R31.z, R6.w, R4.w, R31.w
w: FMA_64 R31.w, R6.z, R4.z, R31.z
107 x: FMA_64 R33.x, R6.w, R5.y, R33.y
y: FMA_64 R33.y, R6.w, R5.y, R33.y
z: FMA_64 R123.z, R6.w, R5.y, R33.y
w: FMA_64 R123.w, R6.z, R5.x, R33.x
108 x: FMA_64 R123.x, R6.w, R5.w, R33.w
y: FMA_64 R123.y, R6.w, R5.w, R33.w
z: FMA_64 R33.z, R6.w, R5.w, R33.w
w: FMA_64 R33.w, R6.z, R5.z, R33.z
10 ALU: ADDR(486) CNT(8)
109 x: FMA_64 R35.x, R6.w, R7.y, R35.y
y: FMA_64 R35.y, R6.w, R7.y, R35.y
z: FMA_64 R123.z, R6.w, R7.y, R35.y
w: FMA_64 R123.w, R6.z, R7.x, R35.x
110 x: FMA_64 R123.x, R6.w, R7.w, R35.w
y: FMA_64 R123.y, R6.w, R7.w, R35.w
z: FMA_64 R35.z, R6.w, R7.w, R35.w
w: FMA_64 R35.w, R6.z, R7.z, R35.z
11 ENDLOOP i0 PASS_JUMP_ADDR(6)
12 ALU: ADDR(494) CNT(20) KCACHE0(CB0:0-15)
111 t: MULLO_INT T0.z, R16.x, KC0[0].z
112 t: MULLO_INT ____, R16.y, KC0[0].w
113 w: ADD_INT ____, T0.z, PS112
114 x: ADD_INT T0.x, PV113.w, (0x00000003, 4.203895393e-45f).x
y: ADD_INT ____, PV113.w, 1
z: ADD_INT ____, PV113.w, 0.0f
w: ADD_INT T0.w, PV113.w, (0x00000002, 2.802596929e-45f).y
115 x: LSHL R0.x, PV114.z, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R0.y, PV114.z, (0x00000003, 4.203895393e-45f).y
z: ADD_INT R0.z, PV114.w, (0x00000003, 4.203895393e-45f).y
w: ADD_INT R0.w, PV114.y, (0x00000003, 4.203895393e-45f).y
t: LSHL R1.x, PV114.y, (0x00000002, 2.802596929e-45f).x
116 x: LSHL R2.x, T0.w, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R1.y, T0.x, (0x00000003, 4.203895393e-45f).y
z: ADD_INT R1.z, PV115.w, (0x00000003, 4.203895393e-45f).y
w: ADD_INT R1.w, PV115.y, (0x00000003, 4.203895393e-45f).y
t: LSHL R3.x, T0.x, (0x00000002, 2.802596929e-45f).x
13 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R18, ELEM_SIZE(3) VPM
14 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R20, ELEM_SIZE(3) VPM
15 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R22, ELEM_SIZE(3) VPM
16 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R24, ELEM_SIZE(3) VPM
17 ALU: ADDR(514) CNT(12)
117 x: LSHL R3.x, R0.y, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R0.y, R0.z, (0x00000003, 4.203895393e-45f).y
z: ADD_INT R2.z, R1.w, (0x00000003, 4.203895393e-45f).y
w: ADD_INT R0.w, R1.y, (0x00000003, 4.203895393e-45f).y VEC_120
t: LSHL R2.x, R0.w, (0x00000002, 2.802596929e-45f).x
118 x: LSHL R1.x, R0.z, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R1.y, R1.z, (0x00000003, 4.203895393e-45f).y VEC_120
z: ADD_INT R0.z, PV117.w, (0x00000003, 4.203895393e-45f).y
w: ADD_INT R2.w, PV117.y, (0x00000003, 4.203895393e-45f).y
t: LSHL R0.x, R1.y, (0x00000002, 2.802596929e-45f).x
18 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R38, ELEM_SIZE(3) VPM
19 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R40, ELEM_SIZE(3) VPM
20 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R9, ELEM_SIZE(3) VPM
21 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R11, ELEM_SIZE(3) VPM
22 ALU: ADDR(526) CNT(12)
119 x: LSHL R0.x, R1.w, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R2.y, R2.z, (0x00000003, 4.203895393e-45f).y
z: ADD_INT R1.z, R2.w, (0x00000003, 4.203895393e-45f).y VEC_120
w: ADD_INT R1.w, R1.y, (0x00000003, 4.203895393e-45f).y
t: LSHL R1.x, R1.z, (0x00000002, 2.802596929e-45f).x
120 x: LSHL R2.x, R0.y, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R0.y, R0.z, (0x00000003, 4.203895393e-45f).y
z: ADD_INT R3.z, PV119.w, (0x00000003, 4.203895393e-45f).y
w: ADD_INT R0.w, PV119.y, (0x00000003, 4.203895393e-45f).y
t: LSHL R3.x, R0.w, (0x00000002, 2.802596929e-45f).x
23 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R26, ELEM_SIZE(3) VPM
24 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R28, ELEM_SIZE(3) VPM
25 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R30, ELEM_SIZE(3) VPM
26 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R32, ELEM_SIZE(3) VPM
27 ALU: ADDR(538) CNT(12)
121 x: LSHL R3.x, R2.z, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R1.y, R1.z, (0x00000003, 4.203895393e-45f).y VEC_120
z: ADD_INT R2.z, R0.w, (0x00000003, 4.203895393e-45f).y
w: ADD_INT R3.w, R0.y, (0x00000003, 4.203895393e-45f).y
t: LSHL R2.x, R1.y, (0x00000002, 2.802596929e-45f).x
122 x: LSHL R1.x, R2.w, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R3.y, R3.z, (0x00000003, 4.203895393e-45f).y
z: ADD_INT R0.z, PV121.w, (0x00000003, 4.203895393e-45f).y
w: ADD_INT R2.w, PV121.y, (0x00000003, 4.203895393e-45f).y
t: LSHL R0.x, R0.z, (0x00000002, 2.802596929e-45f).x
28 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R13, ELEM_SIZE(3) VPM
29 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R15, ELEM_SIZE(3) VPM
30 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R17, ELEM_SIZE(3) VPM
31 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R19, ELEM_SIZE(3) VPM
32 ALU: ADDR(550) CNT(10)
123 x: LSHL R0.x, R2.y, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R2.y, R2.z, (0x00000003, 4.203895393e-45f).y
z: ADD_INT R4.z, R2.w, (0x00000003, 4.203895393e-45f).y
w: ADD_INT R1.w, R3.y, (0x00000003, 4.203895393e-45f).y VEC_120
t: LSHL R1.x, R1.w, (0x00000002, 2.802596929e-45f).x
124 x: LSHL R2.x, R1.z, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R0.y, R0.z, (0x00000003, 4.203895393e-45f).y VEC_120
t: LSHL R3.x, R0.y, (0x00000002, 2.802596929e-45f).x
33 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R34, ELEM_SIZE(3) VPM
34 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R36, ELEM_SIZE(3) VPM
35 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R37, ELEM_SIZE(3) VPM
36 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R39, ELEM_SIZE(3) VPM
37 ALU: ADDR(560) CNT(6)
125 x: LSHL R3.x, R0.w, (0x00000002, 2.802596929e-45f).x
t: LSHL R2.x, R3.z, (0x00000002, 2.802596929e-45f).x
126 x: LSHL R1.x, R1.y, (0x00000002, 2.802596929e-45f).x
t: LSHL R0.x, R3.w, (0x00000002, 2.802596929e-45f).x
38 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R21, ELEM_SIZE(3) VPM
39 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R23, ELEM_SIZE(3) VPM
40 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R25, ELEM_SIZE(3) VPM
41 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R27, ELEM_SIZE(3) VPM
42 ALU: ADDR(566) CNT(6)
127 x: LSHL R0.x, R2.z, (0x00000002, 2.802596929e-45f).x
t: LSHL R1.x, R3.y, (0x00000002, 2.802596929e-45f).x
128 x: LSHL R2.x, R2.w, (0x00000002, 2.802596929e-45f).x
t: LSHL R3.x, R0.z, (0x00000002, 2.802596929e-45f).x
43 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R8, ELEM_SIZE(3) VPM
44 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R10, ELEM_SIZE(3) VPM
45 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R12, ELEM_SIZE(3) VPM
46 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R14, ELEM_SIZE(3) VPM
47 ALU: ADDR(572) CNT(6)
129 x: LSHL R3.x, R2.y, (0x00000002, 2.802596929e-45f).x
t: LSHL R2.x, R1.w, (0x00000002, 2.802596929e-45f).x
130 x: LSHL R1.x, R4.z, (0x00000002, 2.802596929e-45f).x
t: LSHL R0.x, R0.y, (0x00000002, 2.802596929e-45f).x
48 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R29, ELEM_SIZE(3) VPM
49 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R31, ELEM_SIZE(3) VPM
50 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R33, ELEM_SIZE(3) VPM
51 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R35, ELEM_SIZE(3) VPM
END_OF_PROGRAM
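For readers who do not want to trace the ISA, the loop body above (eight SAMPLE fetches of two doubles each, followed by 64 FMA_64 groups) can be read as one rank-1 update of an 8x8 accumulator block per iteration. The following is a plain Python sketch of that idea, not a translation of the listing; which fetched values belong to which input matrix is an assumption, and the function and variable names are hypothetical.

```python
# Register-blocked 8x8 update: one outer product accumulated per loop step (sketch).
def block_update_8x8(a_columns, b_rows):
    """Accumulate sum over k of outer(a_columns[k], b_rows[k]) into an 8x8 block."""
    acc = [[0.0] * 8 for _ in range(8)]        # 64 accumulators, zeroed up front
    for a_col, b_row in zip(a_columns, b_rows):
        # 8 + 8 input values feed 64 multiply-adds per step
        for i in range(8):
            for j in range(8):
                acc[i][j] += a_col[i] * b_row[j]
    return acc

# Tiny usage example with an inner dimension of 4.
K = 4
a_columns = [[float(i + k) for i in range(8)] for k in range(K)]
b_rows = [[float(j - k) for j in range(8)] for k in range(K)]
c_block = block_update_8x8(a_columns, b_rows)
print(c_block[2][3])
```

The point of the blocking is the ratio: 16 fetched values per step feed 64 multiply-adds, so fetch traffic per flop drops as the block grows, at the cost of more accumulator registers per thread.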
For future reference, computing 2048x2048 DGEMM
on RV870: 500 GFlop/s (92% utilization)
on RV770: 210 GFlop/s (88% utilization)
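The DGEMM utilization figures can be cross-checked under one assumption read off the listing above: each FMA_64 occupies the x, y, z and w slots of a five-wide unit, so each unit completes one double precision multiply-add (2 flops) per clock. The helper name and the five-slot unit width are assumptions for illustration.

```python
# Double precision peak and utilization, assuming one FMA_64 per VLIW unit per clock.
def dp_peak_gflops(alus, clock_ghz, slots_per_unit=5):
    """Peak DP rate if each five-wide unit retires one fused multiply-add per clock."""
    units = alus // slots_per_unit
    return units * 2 * clock_ghz

rv870_peak = dp_peak_gflops(1600, 0.85)
rv770_peak = dp_peak_gflops(800, 0.75)

print(f"RV870: 500 / {rv870_peak:.0f} = {500 / rv870_peak:.1%}")
print(f"RV770: 210 / {rv770_peak:.0f} = {210 / rv770_peak:.1%}")
```

Under this assumption the ratios come out close to the utilization percentages quoted above.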
Of course, Fermi is expected to be faster, but to significantly outperform RV870, it seems it will need quite high utilization and high clocks.