#1
Regular
Join Date: Aug 2009
Posts: 21
Dense matrix-matrix multiplication, a basic building block of dense linear algebra, is rather straightforward in terms of multiply-add operations and very friendly to memory hierarchies, which makes it an interesting benchmark for throughput-oriented hardware: the achieved matrix-matrix multiply rate is often much more relevant to real performance than the total number of ALU operations that can be completed in a second.
Contrast, for instance, the peak performance claimed by both major IHVs:
nVidia GT200 (GTX280): 933 Gflop/s
ATi RV770 (HD4870): 1200 Gflop/s
...with the current fastest matrix-matrix product implementations: CUBLAS 2.0 on GTX280[1] achieves 375 Gflop/s, and on HD4870[2] ATi reckons 540 Gflop/s.
However, there is a significant difference: Volkov's implementation achieves the peak multiply-add rate with one operand from shared memory, while ATi's implementation is limited by the speed of the texture units. Like many others, I thought that higher performance could be possible on ATi boards, using some mechanism to avoid the memory bottleneck. According to ATi, this is not possible[3], but they do not want to disclose details on the hardware. Thus, I experimented with all the ideas I could come up with, and I hit the limitations ATi knew about, one after another:
Shared memory (LDS in ATi parlance) is no faster than texture fetches that hit L1 (30 billion float4 per second, 480 GB/s, for both).
Shared memory broadcasting requires impractical amounts of ALU work in order to put addresses into registers (sigh).
Shared registers have half the peak bandwidth of local registers, giving us a limit of 480 Gflop/s...
ATi's claim checks out: the limited features of their hardware do not really offer any help. But is their implementation really optimal?
The bandwidth intensity of a matrix-matrix product implementation is directly related to the size of the blocks in the destination matrix, and `simple_matmult' uses 8x4 blocks (this is also the maximum for their pixel shader approach). RV770's texture units can deliver 120 billion single-precision values per second, and we need two input values for each multiply-add operation. With 8x4 blocks, the bandwidth reduction is ~5, and thus we obtain a peak of 600 Gflop/s. Using 8x8 blocks would bring the bandwidth reduction to 8, for a peak of 960 Gflop/s.
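The block-size arithmetic in this paragraph can be reproduced with a few lines of Python. This is only a sketch of one reading of the argument: it assumes the "bandwidth reduction" of an MxN destination block is 2*M*N/(M+N) (M*N multiply-adds that would naively need two inputs each, served by M+N fetched values per step), and the function names are invented for illustration. Exact division gives about 5.3 for 8x4 blocks, which the post rounds to ~5, hence 600 rather than 640 Gflop/s.

```python
# Texture-bound peak for an M x N destination block (illustrative sketch).
# The 120 G values/s figure is the RV770 texture rate quoted in the post.
TEXTURE_RATE_GVALUES = 120.0  # billions of single-precision values per second


def bandwidth_reduction(m, n):
    """Ratio of naive fetches to blocked fetches for one outer-product step."""
    naive_fetches = 2 * m * n  # two inputs per multiply-add, no reuse
    blocked_fetches = m + n    # one column of A plus one row of B
    return naive_fetches / blocked_fetches


def texture_bound_peak_gflops(m, n, rate=TEXTURE_RATE_GVALUES):
    """Peak Gflop/s if the texture units are the only bottleneck (assumption)."""
    # Without reuse, rate / 2 multiply-adds per second can be fed.
    madds_per_second = rate / 2.0 * bandwidth_reduction(m, n)
    return 2.0 * madds_per_second  # one multiply-add counts as 2 flops


for m, n in [(8, 4), (8, 8)]:
    print(f"{m}x{n}: reduction {bandwidth_reduction(m, n):.2f}, "
          f"peak {texture_bound_peak_gflops(m, n):.0f} Gflop/s")
```

Under this model, square blocks give the best reduction for a given number of accumulators, which is why moving from 8x4 to 8x8 is attractive.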
However, the obvious limitation to larger block sizes is that we need enough space in the register file to store them. The size of the register file (1024 scalars) on RV770 may seem impressive, but with 180 cycles of latency for L1 hits, you need 8 threads (wavefronts in ATi parlance) to hide one texture clause behind 30 cycles of computation (128 multiply-adds). This leaves us less than ~120 scalars per thread in order to compute, for instance, two 8x8 outer products per loop iteration: a single texture clause can load the four float8 inputs, and the two outer products amount to 128 multiply-add instructions.
How much register space would we need? 64 scalars for the output block, the 32 values that are fetched by the texture units, and some registers for the loop index and texture addresses... about a hundred scalars. This looks quite reasonable, so I implemented it. The major difficulty is tricking ATi's horrible compiler (which reflects the current quality of their `GPU computing' software stack well) into producing decent machine code. Here's what it looks like:
Code:
00 ALU: ADDR(32) CNT(71)
0 x: LSHR T0.x, R0.x, (0x00000006, 8.407790786e-45f).x
y: MOV R23.y, 0.0f
z: MOV R23.z, 0.0f
w: AND_INT T0.w, R0.x, (0x0000003F, 8.828180325e-44f).y
t: MOV R23.x, 0.0f
1 x: MOV R22.x, 0.0f
y: MOV R22.y, 0.0f
z: MOV R22.z, 0.0f
w: MOV R23.w, 0.0f
t: MOV R22.w, 0.0f
2 x: MOV R21.x, 0.0f
y: MOV R21.y, 0.0f
z: MOV R21.z, 0.0f
w: MOV R21.w, 0.0f
t: MOV R20.x, 0.0f
3 x: MOV R19.x, 0.0f
y: MOV R20.y, 0.0f
z: MOV R20.z, 0.0f
w: MOV R20.w, 0.0f
t: MOV R4.z, (0xC0000000, -2.0f).x
4 x: MOV R18.x, 0.0f
y: MOV R19.y, 0.0f
z: MOV R19.z, 0.0f
w: MOV R19.w, 0.0f
t: MOV R18.y, 0.0f
5 x: MOV R17.x, 0.0f
y: MOV R17.y, 0.0f
z: MOV R18.z, 0.0f
w: MOV R18.w, 0.0f
t: MOV R17.z, 0.0f
6 x: MOV R16.x, 0.0f
y: MOV R16.y, 0.0f
z: MOV R16.z, 0.0f
w: MOV R17.w, 0.0f
t: MOV R16.w, 0.0f
7 x: MOV R15.x, 0.0f
y: MOV R15.y, 0.0f
z: MOV R15.z, 0.0f
w: MOV R15.w, 0.0f
t: MOV R13.x, 0.0f
8 x: MOV R14.x, 0.0f
y: MOV R13.y, 0.0f
z: MOV R13.z, 0.0f
w: MOV R13.w, 0.0f
t: MOV R14.y, 0.0f
9 x: MOV R12.x, 0.0f
y: MOV R12.y, 0.0f
z: MOV R14.z, 0.0f
w: MOV R14.w, 0.0f
t: MOV R12.z, 0.0f
10 x: MOV R11.x, 0.0f
y: MOV R11.y, 0.0f
z: MOV R11.z, 0.0f
w: MOV R12.w, 0.0f
t: MOV R11.w, 0.0f
11 x: MOV R9.x, 0.0f
y: MOV R9.y, 0.0f
z: MOV R9.z, 0.0f
w: MOV R9.w, 0.0f
t: MOV R10.x, 0.0f
12 x: MOV R8.x, 0.0f
y: MOV R10.y, 0.0f
z: MOV R10.z, 0.0f
w: MOV R10.w, 0.0f
t: MOV R8.y, 0.0f
13 z: MOV R8.z, 0.0f
w: MOV R8.w, 0.0f
t: I_TO_F R0.x, T0.w
14 t: I_TO_F R0.y, T0.x
01 TEX: ADDR(288) CNT(1)
15 SAMPLE R5.xyz_, R0.xyxx, t4, s4 UNNORM(XYZW)
02 ALU: ADDR(103) CNT(2)
16 x: MOV R4.x, R5.x
y: MOV R4.y, R5.y
03 LOOP_DX10 i0 FAIL_JUMP_ADDR(11)
04 ALU_BREAK: ADDR(105) CNT(3) KCACHE0(CB0:0-15)
17 z: ADD R4.z, R4.z, (0x40000000, 2.0f).x
18 x: PREDGT ____, KC0[0].x, R4.z UPDATE_EXEC_MASK UPDATE_PRED
05 ALU: ADDR(108) CNT(3) KCACHE0(CB0:0-15)
19 x: ADD R4.x, R4.x, KC0[0].y
y: ADD R4.y, R4.y, KC0[0].y
w: ADD R4.w, R4.z, 1.0f
06 TEX: ADDR(290) CNT(8)
20 SAMPLE R0, R4.xzxx, t0, s0 UNNORM(XYZW)
21 SAMPLE R2, R4.xzxx, t1, s1 UNNORM(XYZW)
22 SAMPLE R1, R4.yzyy, t2, s2 UNNORM(XYZW)
23 SAMPLE R3, R4.yzyy, t3, s3 UNNORM(XYZW)
24 SAMPLE R6, R4.xwxx, t0, s0 UNNORM(XYZW)
25 SAMPLE R7, R4.xwxx, t1, s1 UNNORM(XYZW)
26 SAMPLE R24, R4.ywyy, t2, s2 UNNORM(XYZW)
27 SAMPLE R25, R4.ywyy, t3, s3 UNNORM(XYZW)
07 ALU_PUSH_BEFORE: ADDR(111) CNT(65) KCACHE0(CB0:0-15)
28 x: MULADD R23.x, R0.x, R1.x, R23.x
y: MULADD R23.y, R0.x, R1.y, R23.y
z: MULADD R23.z, R0.x, R1.z, R23.z
w: MULADD R23.w, R0.x, R1.w, R23.w
29 x: MULADD R22.x, R0.x, R3.x, R22.x
y: MULADD R22.y, R0.x, R3.y, R22.y
z: MULADD R22.z, R0.x, R3.z, R22.z
w: MULADD R22.w, R0.x, R3.w, R22.w
30 x: MULADD R21.x, R0.y, R1.x, R21.x VEC_210
y: MULADD R21.y, R0.y, R1.y, R21.y VEC_201
z: MULADD R21.z, R0.y, R1.z, R21.z VEC_201
w: MULADD R21.w, R0.y, R1.w, R21.w VEC_201
t: MULADD R19.x, R0.z, R1.x, R19.x VEC_120
31 x: MULADD R20.x, R0.y, R3.x, R20.x VEC_210
y: MULADD R20.y, R0.y, R3.y, R20.y VEC_201
z: MULADD R20.z, R0.y, R3.z, R20.z VEC_201
w: MULADD R20.w, R0.y, R3.w, R20.w VEC_201
t: MULADD R18.x, R0.z, R3.x, R18.x VEC_120
32 x: MULADD R17.x, R0.w, R1.x, R17.x VEC_201
y: MULADD R19.y, R0.z, R1.y, R19.y VEC_210
z: MULADD R19.z, R0.z, R1.z, R19.z VEC_201
w: MULADD R19.w, R0.z, R1.w, R19.w VEC_201
t: MULADD R17.y, R0.w, R1.y, R17.y VEC_120
33 x: MULADD R16.x, R0.w, R3.x, R16.x VEC_201
y: MULADD R18.y, R0.z, R3.y, R18.y VEC_210
z: MULADD R18.z, R0.z, R3.z, R18.z VEC_201
w: MULADD R18.w, R0.z, R3.w, R18.w VEC_201
t: MULADD R16.y, R0.w, R3.y, R16.y VEC_120
34 x: MULADD R15.x, R2.x, R1.x, R15.x VEC_201
y: MULADD R15.y, R2.x, R1.y, R15.y VEC_201
z: MULADD R17.z, R0.w, R1.z, R17.z
w: MULADD R17.w, R0.w, R1.w, R17.w
t: MULADD R15.z, R2.x, R1.z, R15.z
35 x: MULADD R13.x, R2.x, R3.x, R13.x VEC_201
y: MULADD R13.y, R2.x, R3.y, R13.y VEC_201
z: MULADD R16.z, R0.w, R3.z, R16.z
w: MULADD R16.w, R0.w, R3.w, R16.w
t: MULADD R13.z, R2.x, R3.z, R13.z
36 x: MULADD R14.x, R2.y, R1.x, R14.x VEC_201
y: MULADD R14.y, R2.y, R1.y, R14.y VEC_201
z: MULADD R14.z, R2.y, R1.z, R14.z VEC_201
w: MULADD R15.w, R2.x, R1.w, R15.w VEC_210
t: MULADD R14.w, R2.y, R1.w, R14.w VEC_120
37 x: MULADD R12.x, R2.y, R3.x, R12.x VEC_201
y: MULADD R12.y, R2.y, R3.y, R12.y VEC_201
z: MULADD R12.z, R2.y, R3.z, R12.z VEC_201
w: MULADD R13.w, R2.x, R3.w, R13.w VEC_210
t: MULADD R12.w, R2.y, R3.w, R12.w VEC_120
38 x: MULADD R11.x, R2.z, R1.x, R11.x VEC_210
y: MULADD R11.y, R2.z, R1.y, R11.y VEC_201
z: MULADD R11.z, R2.z, R1.z, R11.z VEC_201
w: MULADD R11.w, R2.z, R1.w, R11.w VEC_201
t: MULADD R10.x, R2.w, R1.x, R10.x VEC_120
39 x: MULADD R9.x, R2.z, R3.x, R9.x VEC_210
y: MULADD R10.y, R2.w, R1.y, R10.y VEC_201
z: MULADD R10.z, R2.w, R1.z, R10.z VEC_201
w: MULADD R10.w, R2.w, R1.w, R10.w VEC_201
t: MULADD R8.x, R2.w, R3.x, R8.x VEC_120
40 y: MULADD R9.y, R2.z, R3.y, R9.y VEC_210
z: MULADD R9.z, R2.z, R3.z, R9.z VEC_201
w: MULADD R9.w, R2.z, R3.w, R9.w VEC_201
t: MULADD R8.y, R2.w, R3.y, R8.y VEC_120
41 z: MULADD R8.z, R2.w, R3.z, R8.z
w: MULADD R8.w, R2.w, R3.w, R8.w
42 x: PREDE_INT ____, KC0[0].y, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
08 JUMP POP_CNT(1) ADDR(10)
09 ALU_POP_AFTER: ADDR(176) CNT(64)
43 x: MULADD R23.x, R6.x, R24.x, R23.x
y: MULADD R23.y, R6.x, R24.y, R23.y
z: MULADD R23.z, R6.x, R24.z, R23.z
w: MULADD R23.w, R6.x, R24.w, R23.w
44 x: MULADD R22.x, R6.x, R25.x, R22.x
y: MULADD R22.y, R6.x, R25.y, R22.y
z: MULADD R22.z, R6.x, R25.z, R22.z
w: MULADD R22.w, R6.x, R25.w, R22.w
45 x: MULADD R20.x, R6.y, R24.x, R20.x VEC_210
y: MULADD R20.y, R6.y, R24.y, R20.y VEC_201
z: MULADD R20.z, R6.y, R24.z, R20.z VEC_201
w: MULADD R20.w, R6.y, R24.w, R20.w VEC_201
t: MULADD R19.x, R6.z, R24.x, R19.x VEC_120
46 x: MULADD R21.x, R6.y, R25.x, R21.x VEC_210
y: MULADD R21.y, R6.y, R25.y, R21.y VEC_201
z: MULADD R21.z, R6.y, R25.z, R21.z VEC_201
w: MULADD R21.w, R6.y, R25.w, R21.w VEC_201
t: MULADD R18.x, R6.z, R25.x, R18.x VEC_120
47 x: MULADD R17.x, R6.w, R24.x, R17.x VEC_201
y: MULADD R19.y, R6.z, R24.y, R19.y VEC_210
z: MULADD R19.z, R6.z, R24.z, R19.z VEC_201
w: MULADD R19.w, R6.z, R24.w, R19.w VEC_201
t: MULADD R17.y, R6.w, R24.y, R17.y VEC_120
48 x: MULADD R16.x, R6.w, R25.x, R16.x VEC_201
y: MULADD R18.y, R6.z, R25.y, R18.y VEC_210
z: MULADD R18.z, R6.z, R25.z, R18.z VEC_201
w: MULADD R18.w, R6.z, R25.w, R18.w VEC_201
t: MULADD R16.y, R6.w, R25.y, R16.y VEC_120
49 x: MULADD R15.x, R7.x, R24.x, R15.x VEC_201
y: MULADD R15.y, R7.x, R24.y, R15.y VEC_201
z: MULADD R17.z, R6.w, R24.z, R17.z
w: MULADD R17.w, R6.w, R24.w, R17.w
t: MULADD R15.z, R7.x, R24.z, R15.z
50 x: MULADD R13.x, R7.x, R25.x, R13.x VEC_201
y: MULADD R13.y, R7.x, R25.y, R13.y VEC_201
z: MULADD R16.z, R6.w, R25.z, R16.z
w: MULADD R16.w, R6.w, R25.w, R16.w
t: MULADD R13.z, R7.x, R25.z, R13.z
51 x: MULADD R14.x, R7.y, R24.x, R14.x VEC_201
y: MULADD R14.y, R7.y, R24.y, R14.y VEC_201
z: MULADD R14.z, R7.y, R24.z, R14.z VEC_201
w: MULADD R15.w, R7.x, R24.w, R15.w VEC_210
t: MULADD R14.w, R7.y, R24.w, R14.w VEC_120
52 x: MULADD R12.x, R7.y, R25.x, R12.x VEC_201
y: MULADD R12.y, R7.y, R25.y, R12.y VEC_201
z: MULADD R12.z, R7.y, R25.z, R12.z VEC_201
w: MULADD R13.w, R7.x, R25.w, R13.w VEC_210
t: MULADD R12.w, R7.y, R25.w, R12.w VEC_120
53 x: MULADD R11.x, R7.z, R24.x, R11.x VEC_210
y: MULADD R11.y, R7.z, R24.y, R11.y VEC_201
z: MULADD R11.z, R7.z, R24.z, R11.z VEC_201
w: MULADD R11.w, R7.z, R24.w, R11.w VEC_201
t: MULADD R10.x, R7.w, R24.x, R10.x VEC_120
54 x: MULADD R9.x, R7.z, R25.x, R9.x VEC_210
y: MULADD R10.y, R7.w, R24.y, R10.y VEC_201
z: MULADD R10.z, R7.w, R24.z, R10.z VEC_201
w: MULADD R10.w, R7.w, R24.w, R10.w VEC_201
t: MULADD R8.x, R7.w, R25.x, R8.x VEC_120
55 y: MULADD R9.y, R7.z, R25.y, R9.y VEC_210
z: MULADD R9.z, R7.z, R25.z, R9.z VEC_201
w: MULADD R9.w, R7.z, R25.w, R9.w VEC_201
t: MULADD R8.y, R7.w, R25.y, R8.y VEC_120
56 z: MULADD R8.z, R7.w, R25.z, R8.z
w: MULADD R8.w, R7.w, R25.w, R8.w
10 ENDLOOP i0 PASS_JUMP_ADDR(4)
11 ALU: ADDR(240) CNT(29)
57 x: ADD_INT T0.x, R5.z, (0x00000003, 4.203895393e-45f).x
y: ADD_INT ____, R5.z, 0.0f
z: ADD_INT T0.z, R5.z, (0x00000002, 2.802596929e-45f).y
w: ADD_INT ____, R5.z, 1
58 x: LSHL R0.x, PV57.y, (0x00000002, 2.802596929e-45f).x
y: ADD_INT T0.y, PV57.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT T1.z, PV57.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT T0.w, PV57.y, (0x00000004, 5.605193857e-45f).y
t: LSHL R1.x, PV57.w, (0x00000002, 2.802596929e-45f).x
59 x: LSHL R2.x, T0.z, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R0.y, PV58.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT R0.z, PV58.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT T1.w, T0.x, (0x00000004, 5.605193857e-45f).y
t: LSHL R3.x, T0.x, (0x00000002, 2.802596929e-45f).x
60 x: LSHL R4.x, T0.w, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R1.y, PV59.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT R1.z, PV59.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT R0.w, T0.y, (0x00000004, 5.605193857e-45f).y
t: LSHL R5.x, T1.z, (0x00000002, 2.802596929e-45f).x
61 x: LSHL R6.x, T0.y, (0x00000002, 2.802596929e-45f).x
y: ADD_INT R2.y, PV60.z, (0x00000004, 5.605193857e-45f).y
z: ADD_INT R2.z, PV60.w, (0x00000004, 5.605193857e-45f).y
w: ADD_INT R1.w, R0.y, (0x00000004, 5.605193857e-45f).y VEC_120
t: LSHL R7.x, T1.w, (0x00000002, 2.802596929e-45f).x
12 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R23, ELEM_SIZE(3)
13 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R22, ELEM_SIZE(3)
14 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R21, ELEM_SIZE(3)
15 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R20, ELEM_SIZE(3)
16 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R19, ELEM_SIZE(3)
17 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R5.x], R18, ELEM_SIZE(3)
18 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R17, ELEM_SIZE(3)
19 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R7.x], R16, ELEM_SIZE(3)
20 ALU: ADDR(269) CNT(12)
62 x: LSHL R7.x, R0.z, (0x00000002, 2.802596929e-45f).x
t: LSHL R6.x, R0.y, (0x00000002, 2.802596929e-45f).x
63 x: LSHL R5.x, R0.w, (0x00000002, 2.802596929e-45f).x
t: LSHL R4.x, R1.z, (0x00000002, 2.802596929e-45f).x
64 x: LSHL R3.x, R1.y, (0x00000002, 2.802596929e-45f).x
t: LSHL R2.x, R1.w, (0x00000002, 2.802596929e-45f).x
65 x: LSHL R1.x, R2.z, (0x00000002, 2.802596929e-45f).x
t: LSHL R0.x, R2.y, (0x00000002, 2.802596929e-45f).x
21 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R7.x], R15, ELEM_SIZE(3)
22 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R13, ELEM_SIZE(3)
23 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R5.x], R14, ELEM_SIZE(3)
24 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R4.x], R12, ELEM_SIZE(3)
25 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R3.x], R11, ELEM_SIZE(3)
26 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R2.x], R9, ELEM_SIZE(3)
27 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R10, ELEM_SIZE(3)
28 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R8, ELEM_SIZE(3)
END_OF_PROGRAM
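For readers who would rather not trace the disassembly, the structure of the inner loop can be sketched in plain Python. This illustrates the blocking scheme described in the post and is not a translation of the kernel: the function name, the list-of-lists matrix layout, and the requirement that the dimensions be multiples of the block size are all assumptions made here for brevity. Each 8x8 output block is accumulated as a sum of outer products, two per loop iteration, which mirrors the two 64-MULADD clauses fed by one 8-fetch texture clause.

```python
BLOCK = 8  # edge of the square output block held in registers


def blocked_matmul(a, b):
    """Return C = A * B using 8x8 output blocks and two outer products per step.

    a is n x k and b is k x m, both lists of lists (hypothetical layout).
    n and m must be multiples of BLOCK, and k must be even.
    """
    n, k, m = len(a), len(b), len(b[0])
    assert n % BLOCK == 0 and m % BLOCK == 0 and k % 2 == 0
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, BLOCK):
        for j0 in range(0, m, BLOCK):
            # 64 accumulators: the output block kept in registers.
            acc = [[0.0] * BLOCK for _ in range(BLOCK)]
            for p in range(0, k, 2):  # one loop iteration of the kernel
                for q in (p, p + 1):  # two outer products per iteration
                    col = [a[i0 + i][q] for i in range(BLOCK)]  # 8 values of A
                    row = [b[q][j0 + j] for j in range(BLOCK)]  # 8 values of B
                    for i in range(BLOCK):
                        for j in range(BLOCK):
                            acc[i][j] += col[i] * row[j]  # one multiply-add
            # Write the finished block out once, after the loop.
            for i in range(BLOCK):
                for j in range(BLOCK):
                    c[i0 + i][j0 + j] = acc[i][j]
    return c
```

Each iteration reads 32 input values and performs 128 multiply-adds, which are the numbers used in the register and bandwidth estimates above.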
EDIT: 1000 Gflop/s later in this thread
References:
[1] V. Volkov, J. W. Demmel: Benchmarking GPUs to tune dense linear algebra. Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008 http://mc.stanford.edu/cgi-bin/image...Volkov_GPU.pdf
[2] `What we see on our optimized MM kernel is ~540 gflops in IL.' Micah Villmow, AMD, answering vvolkov on the ATi Stream section of the AMD Developer Forums http://forums.amd.com/forum/messagev...hreadid=105221
[3] `The simple_matmult example that we have is pretty much optimal for our hardware' Micah Villmow, AMD, answering sgratton on the ATi Stream section of the AMD Developer Forums http://forums.amd.com/forum/messagev...hreadid=102771
#2
Senior Member
Quote:
Err..., you quote vendor's peak claims and calculate your fraction of flops from your own estimate of peak flops???
Quote:
I assume that you are using a 4870. Right?
#3
Regular
Join Date: Aug 2009
Posts: 21
Quote:
Quote:
My peak estimate is from the L1 bandwidth, which I assume to be the bottleneck. 100% would be unlikely as there's no cache prefetching.
Quote:
Given that it's essentially lots of calls to the SGEMM kernel, it could be fun to try to achieve 3500 Gflop/s in single-precision LU factorization (LINPACK benchmark) using Volkov's approach for multi-GPU computation.
#4
Senior Member
Why would you ignore the t unit? Can you not see from the example given that it's being utilized in the majority of the slots? In fact, all 5 units are used in most of the shader.
__________________
I speak only for myself.
#5
Regular
Ooh, very impressive.
Vasily Volkov and I discussed some of this stuff: http://forum.beyond3d.com/showthread...19#post1290019
I bashed my head against this for a while, mostly non-LDS, but focussed too much on maintaining cache locality for maximum throughput, and got somewhat confused. I like the fact you're ignoring cache locality - that makes me chuckle.
So I'm very impressed, 880 Gflops is quite something.
Did you write this in IL? Looking at the assembly, it appears it's possible to write this in Brook+ (scatter stream, not standard stream output). I'm a bit puzzled why the break from the loop is in the middle - need to think about that some more.
Will you be writing a paper or presenting your results formally some time? You need a shiny graph of performance at various matrix sizes, 128x128, 256x256, etc.
I do disagree on the 5th MAD in ATI. Your code is clearly doing 5 MADs per cycle most of the time! By the way, the loop is 32 ALU cycles, 960 GFLOPs peak.
Jawed
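The 960 figure at the end of this post follows from the slot utilization of the loop, and can be checked with a short sketch. The inputs are read off the disassembly in post #1 (128 MULADDs per iteration) and the vendor peak quoted there (1200 Gflop/s); treating one ALU cycle as one 5-slot instruction group is an assumption of this sketch, and the helper name is invented.

```python
VLIW_SLOTS = 5               # x, y, z, w and t slots per instruction group
VENDOR_PEAK_GFLOPS = 1200.0  # HD4870 peak quoted in post #1


def loop_bound_gflops(madds_per_iteration, alu_cycles_per_iteration,
                      peak=VENDOR_PEAK_GFLOPS, slots=VLIW_SLOTS):
    """Scale the vendor peak by the fraction of ALU slots doing multiply-adds."""
    # Assumes the vendor peak corresponds to a multiply-add in every slot.
    return peak * madds_per_iteration / (alu_cycles_per_iteration * slots)


# 128 multiply-adds issued over 32 instruction groups (counts from the thread)
print(loop_bound_gflops(128, 32))
```

With these counts, 128 of the 160 available slots hold a multiply-add, which is 80% of the vendor peak.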
#6
Dangerously Mirthful
Join Date: Feb 2002
Location: Highland, IN USA
Posts: 14,172
Pfft, some newb posting up a troll thread.
#7
Tiled
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,205
Quote:
__________________
"we have hardware lens flare acceleration"
#8
Regular
Join Date: Aug 2009
Posts: 21
Quote:
Quote:
Quote:
Quote:
By the way, it seems I can't extract more than 444 GB/s out of the texture units (even in synthetic tests with extreme locality), so that gives an upper bound of 888 Gflop/s for this kernel.
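This upper bound is consistent with the block-size argument from post #1, assuming 4-byte single-precision values and the factor of 8 flops per fetched value quoted there for 8x8 blocks:

```latex
\[
\frac{444\ \text{GB/s}}{4\ \text{B/value}} = 111 \times 10^{9}\ \text{values/s},
\qquad
111 \times 10^{9}\ \text{values/s} \times 8\ \frac{\text{flop}}{\text{value}} = 888\ \text{Gflop/s}.
\]
```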
#9
Senior Member
Of course the t unit is being used here, and that is one of the reasons why this code is able to reach such high speeds. And yet, the t unit and the mul are accidents. You would almost never be able to exploit them. This code is perhaps an exception.
However, I find that calculating the fraction of the peak from other bottlenecks (assumed or otherwise) is not the best way. Calculating the peak from actual max throughput is better, as it can expose other deficiencies/bottlenecks.
#10
Senior Member
Quote:
Quote:
__________________
I speak only for myself.
#11
Senior Member
Quote:
Real-world shaders, heavily optimized by the compiler, typically average 3.5-4 ALU slots per instruction.
#12
Senior Member
Quote:
Quote:
__________________
I speak only for myself.
#13
Senior Member
Quote:
It is, 90% of the time, not.
Quote:
#14
Regular
#15
Senior Member
Oh dear, I should have made it clear upfront. I mean it is not used explicitly; implicitly, there is a good chance that it will be used only as an SFU. This GEMM is an exception. In fact, this is the first exception I have seen in this regard.
#16
Regular
Quote:
Jawed
#18
Regular
Quote:
I guess the scanline access pattern ends up with L2 filled with data it junks, which increases the number of fetches into L2 to fulfil the 8 TEX instructions.
Quote:
Quote:
Quote:
Jawed
#19
Regular
#20
Senior Member
Quote:
Code:
; -------- Disassembly --------------------
00 TEX: ADDR(656) CNT(1) VALID_PIX
0 SAMPLE R6, R1.xyxx, t3, s3
01 ALU_PUSH_BEFORE: ADDR(64) CNT(101)
1 x: MULADD T1.x, C34.x, R6.x, -1.0f
y: MULADD T0.y, C34.x, R6.y, -1.0f
z: MULADD T0.z, C34.x, R6.z, -1.0f
w: MULADD T1.w, C34.x, R6.w, -1.0f
t: MULADD T2.y, C34.x, R6.y, -1.0f VEC_021
2 x: MUL ____, PV1.z, PV1.z
y: MUL ____, R5.z, R5.z
z: MULADD T1.z, C34.x, R6.z, -1.0f
w: MUL T0.w, R2.z, R2.z VEC_201
t: ADD R11.z, -C23.x, C24.x
3 x: DOT4 ____, T1.x, T1.x
y: DOT4 ____, T0.y, T0.y
z: DOT4 ____, PV2.x, 1.0f
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: MULADD T0.x, R5.y, R5.y, PV2.y
4 x: DOT4 ____, T1.w, T1.w
y: DOT4 ____, T2.y, T2.y
z: DOT4 ____, T1.z, T1.z
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: RSQ_sat T1.y, PV3.x
5 x: MUL T0.x, T1.x, PS4
y: MULADD ____, R5.x, R5.x, T0.x VEC_102
z: MULADD ____, R2.y, R2.y, T0.w
w: MUL T0.w, T0.z, PS4
t: RSQ_sat T2.x, PV4.x
6 x: MUL T1.x, T1.z, PS5
y: MULADD ____, R2.x, R2.x, PV5.z
z: MUL ____, T1.w, PS5
w: MUL T2.w, T0.y, T1.y
t: RSQ_sat ____, PV5.y
7 x: CNDGE T0.x, -C22.x, T0.x, PV6.z
y: MUL T0.y, R5.z, PS6
z: MUL T1.z, R5.x, PS6
w: MUL T1.w, R5.y, PS6
t: RSQ_sat T3.x, PV6.y
8 x: DOT4 ____, R4.x, R4.x
y: DOT4 T1.y, R4.y, R4.y
z: DOT4 ____, R4.z, R4.z
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: CNDGE T0.w, -C22.x, T0.w, T1.x VEC_021
9 x: MUL T0.x, T0.x, T1.z
y: MUL T0.y, T0.x, T0.y
z: MUL ____, T0.x, T1.w
w: MUL ____, T2.y, T2.x VEC_021
t: MUL ____, R2.y, T3.x
10 x: CNDGE T2.x, -C22.x, T2.w, PV9.w
y: MUL T1.y, R2.z, T3.x
z: MUL ____, R2.x, T3.x
w: MULADD T2.w, PS9, T0.w, PV9.z VEC_021
t: RSQ_sat T3.x, T1.y
11 x: DOT4 ____, R3.x, R3.x VEC_120
y: DOT4 ____, R3.y, R3.y
z: DOT4 ____, R3.z, R3.z
w: DOT4 R11.w, (0x80000000, 0.0f).x, 0.0f
t: MULADD T0.x, PV10.z, T0.w, T0.x
12 x: MUL ____, R4.y, T3.x
y: MUL ____, R4.z, T3.x
z: MUL ____, R4.x, T3.x
w: MULADD ____, T1.y, T0.w, T0.y VEC_102
t: RSQ_e T1.z, |PV11.x|
13 x: MULADD T2.x, T2.x, PV12.z, T0.x
y: MULADD T0.y, T2.x, PV12.x, T2.w
z: MULADD T0.z, T2.x, PV12.y, PV12.w
w: MULADD T2.w, R3.x, PS12, -C29.x VEC_120
t: MULADD T2.y, R3.y, PS12, -C29.y
14 x: MUL ____, PV13.z, PV13.z
z: MULADD T1.z, R3.z, T1.z, -C29.z
t: RCP_e ____, T1.z
15 x: DOT4 T0.x, T2.x, T2.x
y: DOT4 ____, T0.y, T0.y
z: DOT4 ____, PV14.x, 1.0f
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: ADD ____, PS14, -C28.x
16 x: DOT4 ____, T2.w, T2.w
y: DOT4 T1.y, T2.y, T2.y
z: DOT4 ____, T1.z, T1.z
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: MUL R1.w, PS15, C28.y CLAMP
17 x: ADD ____, PS16, -1.0f
t: RSQ_sat ____, T0.x
18 x: MUL R12.x, T2.x, PS17
y: MUL R11.y, T0.y, PS17
z: MUL R12.z, T0.z, PS17
w: CNDGE R2.w, PV17.x, 0.0f, 1.0f
t: RSQ_sat ____, T1.y
19 x: MUL ____, T2.w, PS18
y: MUL ____, T2.y, PS18
z: MUL ____, T1.z, PS18
20 x: DOT4 ____, R12.x, PV19.x
y: DOT4 ____, R11.y, PV19.y
z: DOT4 ____, R12.z, PV19.z
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
21 w: MAX R12.w, PV20.x, C38.z
22 x: PREDNE ____, R2.w, -R2.w UPDATE_EXEC_MASK UPDATE_PRED
02 JUMP POP_CNT(1) ADDR(36) VALID_PIX
03 ALU: ADDR(165) CNT(121)
23 x: ADD R5.x, -R3.x, C21.x
y: ADD R3.y, -R3.y, C21.y
z: ADD R1.z, -R3.z, C21.z
w: MOV R6.w, 1.0f
t: MOV R2.z, C35.z
24 x: DOT4 ____, PV23.x, C18.x
y: DOT4 ____, PV23.y, C18.y
z: DOT4 T0.z, PV23.z, C18.z
w: DOT4 ____, PV23.w, C18.w
t: MOV R3.x, 0.0f
25 x: MUL R2.x, C32.z, -1.0f
y: MUL R2.y, C32.w, -1.0f
z: ADD ____, PV24.x, R2.z
w: ADD ____, PV24.x, -C32.y
t: ADD T0.w, PV24.x, -C32.z
26 x: ADD ____, T0.z, -C32.w
y: ADD ____, T0.z, PV25.y
z: CNDGE T1.z, PV25.z, 0.0f, 1.0f
w: ADD ____, T0.z, PV25.x
t: CNDGE T0.y, PV25.w, 1.0f, 0.0f
27 x: CNDGE ____, PV26.w, 0.0f, 1.0f
y: CNDGE ____, T0.w, 1.0f, 0.0f
z: CNDGE ____, PV26.y, 0.0f, 1.0f
w: CNDGE ____, PV26.x, 1.0f, 0.0f
t: ADD ____, T0.z, C36.x
28 x: MUL ____, PV27.x, T0.y
y: MUL ____, PV27.z, PV27.y
z: MUL ____, T1.z, PV27.w
t: MUL R6.x, PS27, C36.y CLAMP
29 x: DOT4 R8.x, PV28.x, 1.0f
y: DOT4 ____, PV28.y, C33.y
z: DOT4 ____, PV28.z, C33.z
w: DOT4 ____, (0x80000000, 0.0f).x, 0.0f
t: ADD T0.y, PS28, -1.0f
30 x: ADD ____, PV29.x, -1.0f
y: ADD R2.y, PV29.x, -1.0f
z: ADD ____, PV29.x, -C33.w
w: ADD ____, PV29.x, -C33.y
t: ADD R2.z, PV29.x, -C33.y
31 x: CNDGE R4.x, PV30.w, PV30.w, -PS30
y: CNDGE R5.y, PV30.x, PV30.x, -PV30.y
z: CNDGE T3.z, PV30.z, PV30.z, -R8.x
w: ADD R2.w, R8.x, -C33.z
t: ADD ____, R8.x, -C33.z
32 x: CNDGE ____, -PV31.z, C3.z, 0.0f
y: CNDGE ____, -PV31.z, C3.y, 0.0f
z: CNDGE ____, -PV31.z, C3.x, 0.0f
w: CNDGE ____, -PV31.z, C3.w, 0.0f
t: CNDGE R5.w, PS31, PS31, -PV31.w
33 x: CNDGE T0.x, -R5.y, C7.z, PV32.x
y: CNDGE T1.y, -R5.y, C7.y, PV32.y
z: CNDGE T1.z, -R5.y, C7.x, PV32.z
w: CNDGE T0.w, -R5.y, C7.w, PV32.w
t: MUL R2.z, R8.x, (0x3E800000, 0.25f).x
34 x: CNDGE T1.x, -T3.z, C0.z, 0.0f
y: CNDGE T0.y, -T3.z, C0.y, 0.0f
z: CNDGE T0.z, -T3.z, C0.x, 0.0f
w: CNDGE T1.w, -T3.z, C0.w, 0.0f
t: CNDGE R3.z, T0.y, 0.0f, 1.0f
35 x: CNDGE T2.x, -T3.z, C1.x, 0.0f
y: CNDGE T2.y, -T3.z, C1.w, 0.0f
z: CNDGE T2.z, -T3.z, C1.y, 0.0f
w: CNDGE T2.w, -T3.z, C1.z, 0.0f
36 x: CNDGE T0.x, -R4.x, C11.z, T0.x
y: CNDGE T1.y, -R4.x, C11.y, T1.y
z: CNDGE T1.z, -R4.x, C11.x, T1.z
w: CNDGE T0.w, -R4.x, C11.w, T0.w
37 x: CNDGE T1.x, -R5.y, C4.z, T1.x
y: CNDGE T0.y, -R5.y, C4.y, T0.y
z: CNDGE T0.z, -R5.y, C4.x, T0.z
w: CNDGE T1.w, -R5.y, C4.w, T1.w
38 x: CNDGE T2.x, -R5.y, C5.x, T2.x
y: CNDGE T2.y, -R5.y, C5.w, T2.y
z: CNDGE T2.z, -R5.y, C5.y, T2.z
w: CNDGE T2.w, -R5.y, C5.z, T2.w
39 x: CNDGE T0.x, -R5.w, C15.x, T1.z
y: CNDGE T1.y, -R5.w, C15.y, T1.y
z: CNDGE T1.z, -R5.w, C15.z, T0.x
w: CNDGE ____, -R5.w, C15.w, T0.w
40 x: CNDGE T1.x, -R4.x, C8.z, T1.x
y: CNDGE T0.y, -R4.x, C8.y, T0.y
z: CNDGE T0.z, -R4.x, C8.x, T0.z
w: CNDGE T1.w, -R4.x, C8.w, T1.w VEC_021
t: MUL ____, R6.w, PV39.w
41 x: CNDGE T2.x, -R4.x, C9.x, T2.x
y: CNDGE T2.y, -R4.x, C9.w, T2.y
z: CNDGE T1.z, -R4.x, C9.y, T2.z VEC_120
w: CNDGE T2.w, -R4.x, C9.z, T2.w
t: MULADD ____, R1.z, T1.z, PS40
42 x: DOT4 ____, R5.x, T0.x
y: DOT4 ____, R3.y, T1.y
z: DOT4 ____, PS41, 1.0f
w: DOT4 ____, 0.0f, 0.0f
t: CNDGE T0.x, -R5.w, C12.x, T0.z VEC_021
43 x: CNDGE T1.x, -R5.w, C13.x, T2.x
y: CNDGE T0.y, -R5.w, C12.y, T0.y
z: CNDGE T0.z, -R5.w, C12.z, T1.x VEC_021
w: CNDGE ____, -R5.w, C12.w, T1.w
t: RCP_e R13.w, PV42.x
44 x: CNDGE R2.x, -T3.z, C2.x, 0.0f
y: CNDGE T2.y, -R5.w, C13.y, T1.z
z: CNDGE T1.z, -R5.w, C13.z, T2.w VEC_021
w: CNDGE T2.w, -R5.w, C13.w, T2.y
t: MUL ____, R6.w, PV43.w
45 x: DOT4 ____, R5.x, T0.x
y: DOT4 R2.y, R3.y, T0.y
z: DOT4 ____, R1.z, T0.z
w: DOT4 ____, PS44, 1.0f
t: CNDGE R4.z, -T3.z, C2.y, 0.0f
46 x: DOT4 ____, R5.x, T1.x
y: DOT4 ____, R3.y, T2.y
z: DOT4 ____, R1.z, T1.z
w: DOT4 ____, R6.w, T2.w
t: MUL R9.y, R13.w, PV45.x
47 x: MOV R7.x, PS46
y: CNDGE R4.y, -T3.z, C2.z, 0.0f
z: MUL R5.z, R13.w, PV46.x
w: ADD R8.w, PS46, C31.z
t: CNDGE R6.y, -T3.z, C2.w, 0.0f
04 ALU: ADDR(286) CNT(11)
48 x: MULADD R9.x, R2.y, R13.w, C31.z
y: CNDGE ____, -R5.y, C6.x, R2.x VEC_120
z: CNDGE ____, -R5.y, C6.y, R4.z VEC_120
w: MULADD R9.w, R5.z, (0x3E800000, 0.25f).x, R2.z VEC_102
t: CNDGE R2.w, -R5.y, C6.z, R4.y VEC_021
49 x: CNDGE R2.x, -R5.y, C6.w, R6.y
y: ADD R7.y, PV48.w, C31.w
z: CNDGE R2.z, -R4.x, C10.x, PV48.y
w: CNDGE R4.w, -R4.x, C10.y, PV48.z
t: ADD R8.y, PV48.w, C31.w
05 TEX: ADDR(658) CNT(6) VALID_PIX
50 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
51 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
52 SAMPLE_G R4.__x_, R7.xyxx, t0, s0
53 SAMPLE_G R8.___x, R8.wyww, t0, s0
54 SAMPLE_G R8._x__, R9.xwxx, t0, s0
55 SAMPLE_G R7.x___, R9.ywyy, t0, s0
06 ALU_PUSH_BEFORE: ADDR(297) CNT(29)
56 x: CNDGE T1.x, -R5.w, C14.x, R2.z
y: CNDGE ____, -R4.x, C10.w, R2.x
z: CNDGE T3.z, -R5.w, C14.y, R4.w
w: CNDGE ____, -R4.x, C10.z, R2.w VEC_021
57 x: MUL ____, R9.y, C31.x
y: MUL T2.y, R9.w, C31.y
z: CNDGE ____, -R5.w, C14.z, PV56.w VEC_120
w: CNDGE ____, -R5.w, C14.w, PV56.y VEC_120
58 x: DOT4 ____, R5.x, T1.x
y: DOT4 ____, R3.y, T3.z
z: DOT4 R1.z, R1.z, PV57.z
w: DOT4 ____, R6.w, PV57.w
t: FRACT T0.y, PV57.x
59 x: MULADD ____, PV58.x, R13.w, -R7.x
y: MULADD ____, PV58.x, R13.w, -R8.w
z: MULADD ____, PV58.x, R13.w, -R4.z
w: MULADD ____, PV58.x, R13.w, -R8.y VEC_201
t: FRACT T2.w, T2.y
60 x: CNDGE T1.x, PV59.x, 0.0f, 1.0f
y: CNDGE ____, PV59.y, 0.0f, 1.0f
z: CNDGE T3.z, PV59.z, 0.0f, 1.0f
w: CNDGE ____, PV59.w, 0.0f, 1.0f
61 y: ADD ____, -PV60.z, PV60.y
z: ADD ____, -PV60.x, PV60.w
62 x: MULADD T1.x, PV61.z, T0.y, T1.x
w: MULADD ____, PV61.y, T0.y, T3.z
63 x: ADD ____, -PV62.x, PV62.w
64 z: MULADD R8.z, PV63.x, T2.w, T1.x
65 x: PREDNE ____, R3.z, -R3.z UPDATE_EXEC_MASK UPDATE_PRED
07 JUMP POP_CNT(1) ADDR(35) VALID_PIX
08 ALU: ADDR(326) CNT(20)
66 x: MULADD T0.x, -R6.x, C31.z, C31.z
y: MULADD T0.y, -R6.x, C31.w, C31.w
z: ADD ____, R8.x, -1.0f VEC_120
w: ADD ____, R8.x, 0.0f VEC_120
67 y: MOV ____, -|PV66.w|
z: MOV T0.z, -|PV66.z|
68 x: CNDGE ____, PV67.y, C19.x, 0.0f
w: CNDGE ____, PV67.y, C19.y, 0.0f
69 x: CNDGE ____, T0.z, C20.y, PV68.w
y: CNDGE ____, T0.z, C20.x, PV68.x
70 y: MUL R3.y, PV69.y, T0.x
z: MUL R3.z, PV69.x, T0.y
71 y: MULADD R4.y, PV70.y, -C33.z, R9.y
z: MULADD R4.z, PV70.z, -C33.z, R9.w
t: MULADD R6.y, PV70.y, C36.z, R9.y VEC_021
72 x: ADD R2.x, PV71.y, C31.z
y: ADD R2.y, PV71.z, C31.w
z: MUL R5.z, PV71.y, C31.x
w: ADD R4.w, PV71.z, C31.w
t: ADD R4.x, PV71.y, C31.z
09 TEX: ADDR(670) CNT(6) VALID_PIX
73 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
74 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
75 SAMPLE_G R2.___x, R2.xyxx, t0, s0
76 SAMPLE_G R2.x___, R4.yzyy, t0, s0
77 SAMPLE_G R2._x__, R4.xzxx, t0, s0
78 SAMPLE_G R2.__x_, R4.ywyy, t0, s0
10 ALU: ADDR(346) CNT(25)
79 x: MUL ____, R4.z, C31.y VEC_120
y: MULADD ____, R1.z, R13.w, -R2.y VEC_210
z: MULADD ____, R1.z, R13.w, -R2.x VEC_210
w: MULADD ____, R1.z, R13.w, -R2.z VEC_210
t: MULADD T0.x, R1.z, R13.w, -R2.w
80 x: CNDGE ____, PV79.y, 0.0f, 1.0f
y: FRACT R4.y, PV79.x
z: FRACT T0.z, R5.z
w: CNDGE T0.w, PV79.z, 0.0f, 1.0f
t: CNDGE T1.z, PV79.w, 0.0f, 1.0f
81 x: ADD R2.x, R6.y, C31.z
y: CNDGE ____, T0.x, 0.0f, 1.0f
z: MULADD R6.z, R3.z, C36.w, R9.w
w: ADD ____, -PV80.w, PV80.x
t: ADD R6.x, R6.y, C31.z
82 x: ADD ____, -T1.z, PV81.y
y: ADD R2.y, PV81.z, C31.w
z: MULADD R2.z, PV81.w, T0.z, T0.w
w: ADD R6.w, PV81.z, C31.w
t: MUL ____, R6.y, C31.x
83 x: FRACT R4.x, PS82
y: MULADD R7.y, R3.y, C36.w, R9.y
z: MUL R5.z, R6.z, C31.y
w: MULADD R4.w, PV82.x, T0.z, T1.z
t: MULADD R8.y, R3.y, C33.z, R9.y VEC_021
11 TEX: ADDR(682) CNT(6) VALID_PIX
84 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
85 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
86 SAMPLE_G R2.___x, R2.xyxx, t0, s0
87 SAMPLE_G R2.x___, R6.yzyy, t0, s0
88 SAMPLE_G R2._x__, R6.xzxx, t0, s0
89 SAMPLE_G R6.__x_, R6.ywyy, t0, s0
12 ALU: ADDR(371) CNT(35)
90 x: MULADD ____, R1.z, R13.w, -R2.x VEC_021
y: MULADD ____, R1.z, R13.w, -R6.z VEC_021
z: MULADD ____, R1.z, R13.w, -R2.w VEC_021
w: MULADD ____, R1.z, R13.w, -R2.y VEC_021
t: FRACT T0.w, R5.z
91 x: CNDGE T0.x, PV90.y, 0.0f, 1.0f
y: CNDGE T0.y, PV90.x, 0.0f, 1.0f
z: CNDGE ____, PV90.w, 0.0f, 1.0f
w: CNDGE ____, PV90.z, 0.0f, 1.0f
t: ADD ____, -R2.z, R4.w
92 x: ADD ____, -PV91.y, PV91.z
y: MUL T1.y, R7.y, C31.x
z: MULADD ____, PS91, R4.y, R2.z
w: ADD ____, -PV91.x, PV91.w
t: MULADD R7.z, R3.z, C36.z, R9.w VEC_021
93 x: ADD T0.x, R8.z, PV92.z
y: MULADD T0.y, PV92.x, R4.x, T0.y
z: MULADD ____, PV92.w, R4.x, T0.x
w: ADD R4.w, R7.y, C31.z
t: ADD R4.y, PS92, C31.w
94 x: ADD R7.x, R7.y, C31.z
y: MUL ____, R7.z, C31.y
z: FRACT R5.z, T1.y VEC_120
w: ADD ____, -PV93.y, PV93.z
t: ADD R7.w, R7.z, C31.w
95 x: FRACT R6.x, PV94.y
y: MULADD ____, PV94.w, T0.w, T0.y VEC_021
z: MULADD R8.z, R3.z, C33.z, R9.w VEC_021
w: ADD R2.w, R8.y, C31.z
t: ADD R8.x, R8.y, C31.z
96 x: MUL R2.x, R8.y, C31.x
y: ADD R2.y, PV95.z, C31.w
z: ADD R6.z, T0.x, PV95.y
w: ADD R8.w, PV95.z, C31.w
t: MUL R2.z, PV95.z, C31.y
13 TEX: ADDR(694) CNT(7) VALID_PIX
97 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
98 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
99 SAMPLE_G R4.___x, R4.wyww, t0, s0
100 SAMPLE_G R2.___x, R2.wyww, t0, s0
101 SAMPLE_G R4.x___, R7.yzyy, t0, s0
102 SAMPLE_G R2._x__, R7.xzxx, t0, s0
103 SAMPLE_G R7.__x_, R7.ywyy, t0, s0
14 ALU: ADDR(406) CNT(16)
104 x: MULADD ____, R1.z, R13.w, -R4.w VEC_210
y: MULADD ____, R1.z, R13.w, -R2.y VEC_210
z: MULADD ____, R1.z, R13.w, -R4.x VEC_210
w: MULADD ____, R1.z, R13.w, -R7.z VEC_210
t: MULADD T0.z, R1.z, R13.w, -R2.w VEC_120
105 x: CNDGE ____, PV104.y, 0.0f, 1.0f
y: CNDGE ____, PV104.x, 0.0f, 1.0f
z: CNDGE T1.z, PV104.w, 0.0f, 1.0f
w: CNDGE T0.w, PV104.z, 0.0f, 1.0f
t: FRACT R4.x, R2.x
106 x: ADD ____, -PV105.z, PV105.y
y: FRACT R7.y, R2.z
z: CNDGE R2.z, T0.z, 0.0f, 1.0f VEC_120
w: ADD ____, -PV105.w, PV105.x
107 z: MULADD R5.z, PV106.w, R5.z, T0.w
w: MULADD R2.w, PV106.x, R5.z, T1.z
15 TEX: ADDR(708) CNT(5) VALID_PIX
108 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
109 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
110 SAMPLE_G R2.x___, R8.yzyy, t0, s0
111 SAMPLE_G R2._x__, R8.xzxx, t0, s0
112 SAMPLE_G R8.__x_, R8.ywyy, t0, s0
16 ALU_PUSH_BEFORE: ADDR(422) CNT(18)
113 x: MULADD ____, R1.z, R13.w, -R2.x
y: MULADD ____, R1.z, R13.w, -R8.z
z: ADD ____, -R5.z, R2.w VEC_120
w: MULADD ____, R1.z, R13.w, -R2.y
114 x: CNDGE ____, PV113.w, 0.0f, 1.0f
y: CNDGE T0.y, PV113.x, 0.0f, 1.0f
z: MULADD ____, PV113.z, R6.x, R5.z
w: CNDGE T0.w, PV113.y, 0.0f, 1.0f
115 x: ADD T0.x, R6.z, PV114.z
y: ADD ____, -PV114.y, PV114.x
w: ADD ____, -PV114.w, R2.z
116 y: MULADD T0.y, PV115.y, R4.x, T0.y
z: MULADD ____, PV115.w, R4.x, T0.w
117 w: ADD ____, -PV116.y, PV116.z
118 y: MULADD ____, PV117.w, R7.y, T0.y
119 z: ADD R8.z, T0.x, PV118.y
120 x: CNDGE R2.x, -PV119.z, 0.0f, 1.0f
121 x: PREDNE ____, R2.x, -R2.x UPDATE_EXEC_MASK UPDATE_PRED
17 ALU_PUSH_BEFORE: ADDR(440) CNT(3)
122 y: ADD ____, R8.z, C37.x
123 x: CNDGE R2.x, PV122.y, 1.0f, 0.0f
124 x: PREDNE ____, R2.x, -R2.x UPDATE_EXEC_MASK UPDATE_PRED
18 JUMP ADDR(20) VALID_PIX
19 ALU: ADDR(443) CNT(1)
125 z: MOV R8.z, 1.0f
20 ELSE POP_CNT(1) ADDR(34) VALID_PIX
21 ALU: ADDR(444) CNT(8)
126 y: MULADD R4.y, R3.y, C38.x, R9.y
z: MULADD R4.z, R3.z, C38.y, R9.w
t: MULADD R5.y, R3.y, C37.y, R9.y VEC_021
127 x: ADD R2.x, PV126.y, C31.z
y: ADD R2.y, PV126.z, C31.w
z: MUL R6.z, PV126.y, C31.x
w: ADD R4.w, PV126.z, C31.w
t: ADD R4.x, PV126.y, C31.z
22 TEX: ADDR(718) CNT(6) VALID_PIX
128 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
129 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
130 SAMPLE_G R2.___x, R2.xyxx, t0, s0
131 SAMPLE_G R2.x___, R4.yzyy, t0, s0
132 SAMPLE_G R2._x__, R4.xzxx, t0, s0
133 SAMPLE_G R2.__x_, R4.ywyy, t0, s0
23 ALU: ADDR(452) CNT(15)
134 x: MULADD ____, R1.z, R13.w, -R2.x VEC_021
y: MULADD ____, R1.z, R13.w, -R2.w VEC_021
z: MULADD ____, R1.z, R13.w, -R2.z VEC_021
w: MULADD ____, R1.z, R13.w, -R2.y VEC_021
t: MUL R4.w, R4.z, C31.y
135 x: CNDGE R4.x, PV134.x, 0.0f, 1.0f
y: CNDGE R4.y, PV134.y, 0.0f, 1.0f
z: CNDGE R7.z, PV134.z, 0.0f, 1.0f
w: CNDGE R6.w, PV134.w, 0.0f, 1.0f
t: MULADD R5.z, R3.z, C37.z, R9.w VEC_021
136 x: ADD R2.x, R5.y, C31.z
y: ADD R2.y, PS135, C31.w
z: MUL R4.z, R5.y, C31.x
w: ADD R5.w, PS135, C31.w
t: ADD R5.x, R5.y, C31.z
24 TEX: ADDR(730) CNT(6) VALID_PIX
137 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
138 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
139 SAMPLE_G R2.___x, R2.xyxx, t0, s0
140 SAMPLE_G R2.x___, R5.yzyy, t0, s0
141 SAMPLE_G R2._x__, R5.xzxx, t0, s0
142 SAMPLE_G R2.__x_, R5.ywyy, t0, s0
25 ALU: ADDR(467) CNT(39)
143 x: MULADD ____, R1.z, R13.w, -R2.y VEC_210
y: MULADD ____, R1.z, R13.w, -R2.x VEC_210
z: MULADD ____, R1.z, R13.w, -R2.z VEC_210
w: MUL ____, R5.z, C31.y VEC_120
t: MULADD T0.w, R1.z, R13.w, -R2.w
144 x: FRACT T0.x, PV143.w
y: FRACT T0.y, R4.z
z: CNDGE T0.z, PV143.y, 0.0f, 1.0f
w: CNDGE ____, PV143.x, 0.0f, 1.0f
t: CNDGE T1.y, PV143.z, 0.0f, 1.0f
145 x: CNDGE ____, T0.w, 0.0f, 1.0f
y: ADD ____, -PV144.z, PV144.w
z: FRACT T1.z, R6.z
w: FRACT R7.w, R4.w VEC_201
t: ADD ____, -R4.x, R6.w
146 x: MULADD R11.x, PS145, PV145.z, R4.x
y: ADD ____, -T1.y, PV145.x
z: MULADD T0.z, PV145.y, T0.y, T0.z
w: ADD ____, -R7.z, R4.y VEC_021
t: MULADD R6.z, R3.z, C40.y, R9.w VEC_021
147 x: MULADD R8.x, PV146.w, T1.z, R7.z
y: ADD R7.y, PS146, C31.w
z: MUL ____, PS146, C31.y
w: MULADD ____, PV146.y, T0.y, T1.y
t: ADD R6.w, PS146, C31.w
148 x: FRACT R2.x, PV147.z
y: ADD ____, -T0.z, PV147.w
z: MULADD R5.z, R3.z, C40.w, R9.w VEC_120
t: MULADD R6.y, R3.y, C40.x, R9.y VEC_021
149 x: MULADD ____, PV148.y, T0.x, T0.z
y: MULADD R5.y, R3.y, C40.z, R9.y
z: ADD R7.z, PS148, C31.z
w: MUL ____, PS148, C31.x
t: ADD R6.x, PS148, C31.z
150 x: ADD R4.x, PV149.y, C31.z
y: FRACT R4.y, PV149.w
z: ADD R4.z, R8.z, PV149.x
w: ADD R4.w, R5.z, C31.w VEC_120
t: ADD R5.x, PV149.y, C31.z
26 TEX: ADDR(742) CNT(7) VALID_PIX
151 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
152 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
153 SAMPLE_G R2.___x, R7.zyzz, t0, s0
154 SAMPLE_G R4.___x, R4.xwxx, t0, s0
155 SAMPLE_G R4.x___, R6.yzyy, t0, s0
156 SAMPLE_G R7._x__, R6.xzxx, t0, s0
157 SAMPLE_G R6.__x_, R6.ywyy, t0, s0
27 ALU: ADDR(506) CNT(25)
158 x: MULADD ____, R1.z, R13.w, -R7.y VEC_021
y: MULADD ____, R1.z, R13.w, -R4.x VEC_021
z: MULADD ____, R1.z, R13.w, -R2.w VEC_021
w: MULADD ____, R1.z, R13.w, -R6.z VEC_021
t: ADD R5.w, R5.z, C31.w
159 x: CNDGE ____, PV158.z, 0.0f, 1.0f
y: CNDGE T0.y, PV158.w, 0.0f, 1.0f
z: CNDGE ____, PV158.x, 0.0f, 1.0f
w: CNDGE T0.w, PV158.y, 0.0f, 1.0f
t: MUL ____, R5.y, C31.x
160 x: MUL ____, R5.z, C31.y
y: MULADD ____, R1.z, R13.w, -R4.w VEC_120
z: ADD ____, -PV159.y, PV159.x
w: ADD ____, -PV159.w, PV159.z
t: FRACT R4.w, PS159
161 x: FRACT R7.x, PV160.x
y: MULADD R4.y, PV160.w, R4.y, T0.w VEC_021
z: CNDGE R6.z, PV160.y, 0.0f, 1.0f
w: MULADD ____, PV160.z, R4.y, T0.y VEC_021
t: MULADD R10.z, R3.z, C39.y, R9.w VEC_021
162 x: ADD R4.x, -PV161.y, PV161.w
y: MULADD R10.y, R3.y, C39.x, R9.y
z: ADD R9.z, PS161, C31.w
w: ADD R10.w, PS161, C31.w
t: MUL R7.z, PS161, C31.y
28 TEX: ADDR(756) CNT(6) VALID_PIX
163 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
164 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
165 SAMPLE_G R6.x___, R5.yzyy, t0, s0
166 SAMPLE_G R7._x__, R5.xzxx, t0, s0
167 SAMPLE_G R5.__x_, R5.ywyy, t0, s0
168 SAMPLE_G R5.x___, R10.yzyy, t0, s0
29 ALU: ADDR(531) CNT(33)
169 x: MULADD ____, R1.z, R13.w, -R5.z
y: MULADD ____, R1.z, R13.w, -R7.y VEC_021
z: MULADD ____, R4.x, R2.x, R4.y VEC_120
w: MULADD ____, R1.z, R13.w, -R6.x VEC_120
t: ADD R9.x, R10.y, C31.z
170 x: CNDGE T0.x, PV169.w, 0.0f, 1.0f
y: CNDGE T0.y, PV169.x, 0.0f, 1.0f
z: CNDGE ____, PV169.y, 0.0f, 1.0f
w: ADD T0.w, R4.z, PV169.z
t: ADD R10.x, R10.y, C31.z
171 x: MUL ____, R10.y, C31.x
y: ADD ____, -PV170.y, R6.z
z: MULADD ____, R1.z, R13.w, -R5.x
w: ADD ____, -PV170.x, PV170.z
t: FRACT R2.x, R7.z
172 x: MULADD T0.x, PV171.w, R4.w, T0.x
y: FRACT R7.y, PV171.x
z: MULADD ____, PV171.y, R4.w, T0.y VEC_120
w: CNDGE R2.w, PV171.z, 0.0f, 1.0f
t: MULADD R6.y, R3.y, C39.z, R9.y VEC_021
173 x: ADD R5.x, PS172, C31.z
y: ADD ____, -PV172.x, PV172.z
z: MULADD R6.z, R3.z, C39.w, R9.w
w: MUL ____, PS172, C31.x
t: ADD R6.x, PS172, C31.z
174 x: MULADD ____, PV173.y, R7.x, T0.x
y: ADD R5.y, PV173.z, C31.w
z: MUL ____, PV173.z, C31.y
w: ADD R6.w, PV173.z, C31.w
t: FRACT R8.w, PV173.w
175 x: ADD R8.x, -R11.x, R8.x
y: FRACT R2.y, PV174.z
z: ADD R7.z, T0.w, PV174.x
30 TEX: ADDR(768) CNT(6) VALID_PIX
176 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
177 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
178 SAMPLE_G R4.___x, R9.xzxx, t0, s0
179 SAMPLE_G R4._x__, R10.xzxx, t0, s0
180 SAMPLE_G R5.___x, R5.xyxx, t0, s0
181 SAMPLE_G R10.__x_, R10.ywyy, t0, s0
31 ALU: ADDR(564) CNT(13)
182 x: MULADD ____, R1.z, R13.w, -R4.y
y: MULADD ____, R1.z, R13.w, -R10.z
z: MULADD ____, R1.z, R13.w, -R4.w
w: MULADD R4.w, R8.x, R7.w, R11.x VEC_102
183 x: CNDGE ____, PV182.z, 0.0f, 1.0f
y: CNDGE T0.y, PV182.y, 0.0f, 1.0f
z: CNDGE ____, PV182.x, 0.0f, 1.0f
w: MULADD ____, R1.z, R13.w, -R5.w
184 y: CNDGE R10.y, PV183.w, 0.0f, 1.0f
z: ADD ____, -PV183.y, PV183.x
w: ADD ____, -R2.w, PV183.z
185 y: MULADD R4.y, PV184.w, R7.y, R2.w
w: MULADD R2.w, PV184.z, R7.y, T0.y
32 TEX: ADDR(780) CNT(5) VALID_PIX
186 SET_GRADIENTS_H ____, R3.xxxx, t0, s0 WHOLE_QUAD
187 SET_GRADIENTS_V ____, R3.xxxx, t0, s0 WHOLE_QUAD
188 SAMPLE_G R8.x___, R6.yzyy, t0, s0
189 SAMPLE_G R7._x__, R6.xzxx, t0, s0
190 SAMPLE_G R6.__x_, R6.ywyy, t0, s0
33 ALU_POP_AFTER: ADDR(577) CNT(18)
191 x: MULADD ____, R1.z, R13.w, -R6.z
y: MULADD ____, R1.z, R13.w, -R7.y
z: ADD ____, -R4.y, R2.w VEC_021
w: MULADD ____, R1.z, R13.w, -R8.x
192 x: CNDGE T0.x, PV191.w, 0.0f, 1.0f
y: CNDGE ____, PV191.y, 0.0f, 1.0f
z: MULADD ____, PV191.z, R2.x, R4.y
w: CNDGE T0.w, PV191.x, 0.0f, 1.0f
193 x: ADD ____, -PV192.x, PV192.y
y: ADD ____, -PV192.w, R10.y
w: ADD T1.w, R7.z, PV192.z
194 x: MULADD T0.x, PV193.x, R8.w, T0.x
z: MULADD ____, PV193.y, R8.w, T0.w
195 y: ADD ____, -PV194.x, PV194.z
196 x: MULADD ____, PV195.y, R2.y, T0.x
197 y: ADD ____, T1.w, PV196.x
198 x: ADD ____, R4.w, PV197.y
199 z: MUL R8.z, PV198.x, C37.w
34 POP (2) ADDR(35)
35 ALU_POP_AFTER: ADDR(595) CNT(2)
200 x: ADD ____, R3.w, -R8.z
201 w: MULADD R3.w, R1.w, PV200.x, R8.z
36 TEX: ADDR(790) CNT(2) VALID_PIX
202 SAMPLE R2, R1.xyxx, t2, s2
203 SAMPLE R1, R1.xyxx, t1, s1
37 ALU: ADDR(597) CNT(48)
204 x: DOT4 ____, R12.x, -C29.x
y: DOT4 ____, R11.y, -C29.y
z: DOT4 ____, R12.z, -C29.z
w: DOT4 T1.w, (0x80000000, 0.0f).x, 0.0f
t: MULADD T0.w, R2.w, R11.z, C23.x
205 x: MUL ____, C27.x, C27.x
w: MAX ____, PV204.x, 0.0f
t: LOG_sat ____, |R12.w|
206 x: MUL ____, PV205.w, R1.y
y: MUL ____, T0.w, PS205
z: MUL ____, PV205.w, R1.x
w: MUL ____, PV205.w, R1.z
t: RCP_e ____, PV205.x
207 x: MUL ____, PV206.z, C30.x
y: MUL T1.y, R11.w, PS206 CLAMP
z: MUL T0.z, PV206.w, C30.z
w: MUL ____, PV206.x, C30.y
t: EXP_e ____, PV206.y
208 x: MUL ____, R2.z, PS207
y: MUL ____, R2.y, PS207
z: MUL ____, R2.x, PS207
w: MUL ____, R3.w, PV207.x
t: MUL T0.y, R3.w, PV207.w
209 x: MUL ____, R3.w, T0.z
y: MUL ____, PV208.x, C30.z
z: MUL ____, PV208.y, C30.y
w: MUL ____, PV208.z, C30.x
t: MULADD T0.w, R1.x, R0.x, PV208.w
210 x: MUL ____, PV209.w, C25.x
y: MULADD T0.y, R1.z, R0.z, PV209.x
z: MULADD T0.z, R1.y, R0.y, T0.y
w: MUL ____, PV209.z, C25.x
t: MUL ____, PV209.y, C25.x
211 x: MUL T0.x, T1.y, C27.y
y: MULADD ____, PS210, R3.w, PV210.y
z: MULADD ____, PV210.w, R3.w, PV210.z
w: MULADD ____, PV210.x, R3.w, T0.w
212 y: CNDGE T0.y, -T1.w, T0.y, PV211.y
z: CNDGE T0.z, -T1.w, T0.z, PV211.z
w: CNDGE T0.w, -T1.w, T0.w, PV211.w
213 x: ADD ____, -PV212.y, C26.z
y: ADD ____, -PV212.z, C26.y
z: ADD ____, -PV212.w, C26.x
w: MUL R0.w, R0.w, R1.w
214 x: MULADD R0.x, T0.x, PV213.z, T0.w
y: MULADD R0.y, T0.x, PV213.y, T0.z
z: MULADD R0.z, T0.x, PV213.x, T0.y
38 EXP_DONE: PIX0, R0
__________________
I speak only for myself. |
|
|
|
|
|
|
#21 |
|
Administrator
Join Date: Mar 2005
Posts: 1,810
|
This is incorrect. You may be confused by how scheduling is prioritized - namely, "common" instructions will first be assigned to the "vector" ALUs (x,y,z,w) and only if those are occupied will they be assigned to the transcendental unit as well. Of course, transcendental ops (or stuff like INT MUL/DIV, for example) get scheduled to the trans ALU implicitly. There are also some GPR read port restrictions in place, which end up not always allowing an instruction to be scheduled there. But it does MADs just fine, and quite often, really.
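To make that priority rule concrete, here is a toy C model of filling one five-slot bundle. It is only a sketch: `place_op`, `struct bundle` and the slot names are made up for illustration, and it deliberately ignores the GPR read port restrictions mentioned above, so the real compiler will sometimes leave t empty where this model would fill it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one five-wide bundle: four "vector" slots (x, y, z, w)
   plus the transcendental slot (t). Hypothetical names, not a real API. */
enum slot { SLOT_NONE = -1, SLOT_X = 0, SLOT_Y, SLOT_Z, SLOT_W, SLOT_T, SLOT_COUNT };

struct bundle { bool used[SLOT_COUNT]; };

/* Place one op in the bundle. trans_only marks ops that only the t unit
   can execute (transcendentals, INT MUL/DIV). Returns SLOT_NONE when the
   op does not fit. Read port restrictions are NOT modelled. */
enum slot place_op(struct bundle *b, bool trans_only)
{
    assert(b != NULL);
    if (!trans_only) {
        /* Common ops try the vector slots first... */
        for (int s = SLOT_X; s <= SLOT_W; ++s) {
            if (!b->used[s]) {
                b->used[s] = true;
                return (enum slot)s;
            }
        }
    }
    /* ...and reach t only when x, y, z and w are all taken. */
    if (!b->used[SLOT_T]) {
        b->used[SLOT_T] = true;
        return SLOT_T;
    }
    return SLOT_NONE;
}
```

Feeding five MADs into an empty bundle lands them in x, y, z, w and then t, which is the "it does MADs just fine" case; a transcendental op goes straight to t even when the vector slots are free.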
__________________
A wise man commenting about a popular hero of the peoplez: that dude is so fucking ignorant, he wouldn't know if he was getting assraped by a baboon |
|
|
|
|
|
#22 |
|
Senior Member
|
Average utilization doesn't really indicate how often the t unit is being used. If you have a bunch of very scalar code, utilization may go down, but you may find the rest of the code is fully utilizing all slots.
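A small worked example of that point, as a hedged C sketch. The bundle counts used in the note below are invented purely for illustration (they are not measurements from any shader in this thread), and `struct bundle_stat` and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS_PER_BUNDLE 5 /* x, y, z, w, t */

/* Occupancy of one bundle: how many of x/y/z/w are filled, and whether t is. */
struct bundle_stat {
    int vector_ops; /* 0..4 */
    int t_used;     /* 0 or 1 */
};

/* Fraction of all issue slots that carry an op. */
double average_utilization(const struct bundle_stat *s, size_t n)
{
    assert(s != NULL && n > 0);
    long filled = 0;
    for (size_t i = 0; i < n; ++i)
        filled += s[i].vector_ops + s[i].t_used;
    return (double)filled / (double)(n * SLOTS_PER_BUNDLE);
}

/* Fraction of bundles in which the t slot carries an op. */
double t_slot_rate(const struct bundle_stat *s, size_t n)
{
    assert(s != NULL && n > 0);
    long used = 0;
    for (size_t i = 0; i < n; ++i)
        used += s[i].t_used;
    return (double)used / (double)n;
}
```

With four fully packed bundles ({4, 1}) followed by four scalar ones ({1, 0}), the average utilization is 24/40 = 60%, yet t is busy in every bundle of the packed half. The average alone does not tell you which of the two situations you are in.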
|
|
|
|
|
#24 |
|
Senior Member
|
I believe it checks whether a number is equal to 0: if so, it picks one of the operands; if not, it picks the other. I don't have the specs in front of me, but Jawed recently posted a link to the instruction set specs.
Edit: Did you mean CNDGE? I believe that one checks whether a number is greater than or equal to 0, with otherwise the same behavior as described above.
Edit again: Compare with the cmp instruction in the Direct3D instruction specs.
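In C terms, the behavior described above would look roughly like this. Treat it as a sketch under assumptions: I am guessing that the "equal to 0" instruction is CNDE, and the operand order (condition first, then the value picked when the test passes, then the other one) is my reading rather than something confirmed in this thread, so check it against the instruction set specs.

```c
#include <assert.h>

/* CNDE dst, src0, src1, src2 (assumed operand order): src0 == 0 picks src1. */
float cnde(float src0, float src1, float src2)
{
    return (src0 == 0.0f) ? src1 : src2;
}

/* CNDGE: src0 >= 0 picks src1, otherwise src2 (compare Direct3D's cmp). */
float cndge(float src0, float src1, float src2)
{
    return (src0 >= 0.0f) ? src1 : src2;
}

/* The pattern from the listing above, "CNDGE ____, PVn, 0.0f, 1.0f":
   yields 0 for a non-negative input and 1 for a negative one. */
float is_negative(float d)
{
    return cndge(d, 0.0f, 1.0f);
}

/* A few spot checks that double as usage examples. */
void cnd_examples(void)
{
    assert(cnde(0.0f, 2.0f, 3.0f) == 2.0f);
    assert(cnde(0.5f, 2.0f, 3.0f) == 3.0f);
    assert(cndge(0.0f, 2.0f, 3.0f) == 2.0f);
    assert(cndge(-0.5f, 2.0f, 3.0f) == 3.0f);
    assert(is_negative(-1.0f) == 1.0f);
    assert(is_negative(1.0f) == 0.0f);
}
```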
|
|
|
|
|
#25 |
|
Regular
Join Date: Aug 2009
Posts: 21
|
Well, given that double precision multiply-adds are four (five if you count the `t' unit) times slower but only require twice the bandwidth, it's much easier to achieve high ALU utilization. ATi's implementation is almost optimal, over 200 Gflop/s (out of 240 Gflop/s peak).
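For anyone who wants to redo the arithmetic, here is a back-of-the-envelope sketch in C using only figures quoted in this thread (1200 Gflop/s single precision peak, 120 billion single precision values per second from the texture units). Two things are my assumptions and not statements about ATi's kernel: that "twice the bandwidth" means the texture units deliver half as many double precision values per second, and that the double precision kernel uses the same 8x4 blocking as `simple_matmult`. The function names are made up.

```c
#include <assert.h>

/* Figures quoted in this thread for RV770 (HD4870). */
#define SP_PEAK_GFLOPS     1200.0 /* single precision peak, all five slots */
#define TEX_GVALUES_SP      120.0 /* billions of SP values fetched per second */
#define DP_MEASURED_GFLOPS  200.0 /* "over 200 Gflop/s", taken as a lower bound */

/* Peak DP rate when a DP multiply-add is `slowdown` times slower than SP. */
double dp_peak_gflops(double sp_peak, double slowdown)
{
    assert(slowdown > 0.0);
    return sp_peak / slowdown;
}

/* Bandwidth reduction of a rows x cols destination block: rows*cols
   multiply-adds per (rows + cols) fetched values, relative to the
   unblocked case of one multiply-add per two fetched values. */
double block_reduction(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    return 2.0 * rows * cols / (rows + cols);
}

/* Fetch-limited rate in Gflop/s: two fetched values feed one multiply-add
   (2 flops), scaled by the blocking reduction. */
double fetch_bound_gflops(double gvalues_per_s, double reduction)
{
    return gvalues_per_s / 2.0 * 2.0 * reduction;
}
```

Both ways of counting give the same double precision peak: `dp_peak_gflops(1200.0, 5.0)` and `dp_peak_gflops(960.0, 4.0)` are each 240. `block_reduction(8, 8)` is 8, which reproduces the 960 Gflop/s figure from the first post, and `block_reduction(8, 4)` is 5.33, which that post rounds to ~5. Under the two assumptions above, the double precision fetch bound is `fetch_bound_gflops(60.0, block_reduction(8, 4))`, about 320 Gflop/s, which sits above the 240 Gflop/s ALU peak. That is consistent with the kernel being ALU-bound and reaching at least 200/240, roughly 83%, of peak.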
No, I didn't measure more than ~444 GB/s even with all threads fetching the same value(s) over and over. Running ATi's various synthetic tests (among the samples in the SDK) gives similar results. As texture fetches are the bottleneck, it's actually impressive that the hardware manages to lose only 1% of efficiency with a more complex access pattern.