And things like internal L2-L1 bandwidth (if the bus stays the same, that is). Don't forget the front end with the 25% higher clock rate.
Quote: You sneaky chappy! Hmm, very interesting.
Any wishes? You can send me IL code and I will show you what it looks like in ISA.
Quote: Barts chip has indeed 1280 SPs, but the 6870 and 6850 are salvage parts with parts of the chip deactivated. They seem to be "good enough" anyways, while the full Barts chips can be saved for either a 6890 or a dual card (Antilles?). That would explain the architecture quirks too and would explain all this confusion.
The drivers still state clearly that Antilles is Cayman-based, not Barts.
Quote: For someone who doesn't understand crap about shader code etc., what exactly does Gipsel's post mean?
That's what I am wondering too.
On a slight tangent- Was helping out a Fry's Electronics employee at my work, started chatting about his job, he does sales in computer accessories. Just asked him if he had heard about AMD releasing a new product. He said that the last week or so employees kept seeing them in the back and putting them out on the shelf so they finally locked them up. I was like damn, could have gotten one early.
I'll have to try and find one of those clueless employees a week or two before Cayman is released.
Quote: For someone who doesn't understand crap about shader code etc., what exactly does Gipsel's post mean?
It means that there is one member of a future GPU line from AMD which does indeed have 4-slot VLIW units. The t slot got deleted and the other ones are beefed up a bit. Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (as is already the case for double precision). A reciprocal is done by 3 slots.
Quote: It means that there is one member of a future GPU line from AMD which does indeed have 4-slot VLIW units. The t slot got deleted and the other ones are beefed up a bit. Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (as is already the case for double precision). A reciprocal is done by 3 slots.
So if that's VLIW-4, how many SIMDs?
And as expected, double precision is 1/4 of SP, at least for FMA and MUL, for ADD it's 1/2.
Quote: Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (as is already the case for double precision). A reciprocal is done by 3 slots.
How is this different from the situation where the t-unit used other lanes for help? I can only see a benefit if a transcendental takes up only two slots, so that two could be performed in parallel.
Quote: And as expected, double precision is 1/4 of SP, at least for FMA and MUL, for ADD it's 1/2.
Double precision for Barts enabled? That would be big news for me! Or do you mean "one member of a future GPU line from AMD"?
Quote: Any wishes? You can send me IL code and I will show you what it looks like in ISA.
Ooh, thanks. Here's something old that I've tweaked for extra transcendentals:
il_ps_2_0
dcldef_x(*)_y(*)_z(*)_w(*) r0
def c0, 0.6931471825, 1.442695022, 0.0, 0.0
dclpin_usage(color)_usageIndex(10)_x(*)_y(*)_z(*)_w(*)_centroid vPixIn0
dclpin_usage(color)_usageIndex(26)_x(*)_y(*)_z(*)_w(*)_centroid vPixIn1
mov r0, vPixIn0
add r1, r0, vPixIn1
log_zeroop(fltmax) r2.x___, r1.x_abs
log_zeroop(fltmax) r2._y__, r1.y_abs
log_zeroop(fltmax) r2.__z_, r1.z_abs
log_zeroop(fltmax) r2.___w, r1.w_abs
mul r1, r2, c0.x
mul r0, r0, vPixIn1
mul r0, r0, c0.y
exp r0.x___, r0.x
rcp_zeroop(infinity) r2.x___, r0.x
exp r0.x___, r0.y
rcp_zeroop(infinity) r2._y__, r0.x
exp r0.x___, r0.z
exp r0._y__, r0.w
rcp_zeroop(infinity) r2.___w, r0.y
rcp_zeroop(infinity) r2.__z_, r0.x
mul r0, r1, r2
dp4 r1.x___, r0, r0
rsq_zeroop(infinity) r1.x___, r1.x_abs
mul r37, r0, r1.x
colorclamp oC0, r37
end
Quote: Btw., it really looks like you were right.
I'm dead chuffed. Maybe I should retire, got something right for a change.
Maybe I should say at least one. Unfortunately I cannot connect names to it right now.
Quote: What is the throughput of RCP when issued through the current T lane?
1 cycle, for one scalar per cycle.
Quote: I'm trying to think of what sequence of operations can be done across three lanes. A successive approximation of an RCP would take 5 iterations to yield a single-precision result if using plain math ops. Perhaps a more elaborate method is used?
My original proposal was to take the look-up tables and spread them amongst the four lanes, then use the single-cycle dependent math capability of the 4 lanes, working together, to perform all the math (Lagrange polynomials).
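The "5 iterations" figure for a plain successive approximation can be checked with a Newton-Raphson reciprocal, where each step squares the relative error. This is only a sketch of the "plain math ops" route mentioned in the post, not a description of what the hardware does; the function name `newton_rcp`, the 1-bit starting guess and the 2^-24 tolerance are assumptions chosen for illustration.

```python
def newton_rcp(a, x0, tol=2.0 ** -24, max_iter=20):
    """Approximate 1/a with Newton-Raphson: x <- x * (2 - a * x).

    Returns (approximation, number of iterations used).
    The relative error e = 1 - a*x is squared by every step.
    """
    x = x0
    for n in range(max_iter + 1):
        if abs(1.0 - a * x) <= tol:   # good enough for a 24-bit significand
            return x, n
        x = x * (2.0 - a * x)         # one multiply-add plus one multiply
    raise RuntimeError("did not converge")


# Worst case for a in [1, 2) with the crude guess x0 = 0.5:
# the starting error is 1/2, i.e. roughly one correct bit.
approx, steps = newton_rcp(1.0, 0.5)
print(approx, steps)
```

With one correct bit to start, the error goes 2^-1, 2^-2, 2^-4, 2^-8, 2^-16, 2^-32, so five steps are needed to get past single precision. A better starting guess from a look-up table would cut that down, which is presumably why tables come up in the reply.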
Quote: Ooh, thanks. Here's something old that I've tweaked for extra transcendentals: [..] A comparison of VLIW-5 and VLIW-4 would be fairly interesting.
There was a slight problem with it: I didn't get it to compile with the "def c0" in it. The error message wasn't very telling (unsupported opcode, without saying which one), so I started to rip out everything (and it worked after exchanging c0 with a literal). What was left is this (edit: I just see that a bit of other stuff got pasted in, but outside of the loop, so it shouldn't matter):
il_ps_2_0
dcl_literal l1, 0.6931471825, 1.442695022, 0.0, 0.0
dcl_literal l0, 0x10000, 0xffffffff, 2.1, 0.01
mov r13.x, l0.x ; loop counter
mov r1.xyzw,l0.zzzz ; arbitrary values, r1 = 2*2-2 = 2
mov r2.xyzw,l0.wwww ; r2 = 0*0+0 = 0
mov r0, l0
add r1, r0, l0.wzyx
whileloop
break_logicalz r13 ; while(r13 > 0)
log_zeroop(fltmax) r2.x___, r1.x_abs
log_zeroop(fltmax) r2._y__, r1.y_abs
log_zeroop(fltmax) r2.__z_, r1.z_abs
log_zeroop(fltmax) r2.___w, r1.w_abs
mul r1, r2, l1.x
mul r0, r0, l0.wzyx
mul r0, r0, l1.y
exp r0.x___, r0.x
rcp_zeroop(infinity) r2.x___, r0.x
exp r0.x___, r0.y
rcp_zeroop(infinity) r2._y__, r0.x
exp r0.x___, r0.z
exp r0._y__, r0.w
rcp_zeroop(infinity) r2.___w, r0.y
rcp_zeroop(infinity) r2.__z_, r0.x
mul r0, r1, r2
dp4 r1.x___, r0, r0
rsq_zeroop(infinity) r1.x___, r1.x_abs
mul r1, r0, r1.x
iadd r13.x, r13.x, l0.y ; counter--
endloop
mov g[0], r1
end
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(13)
0 x: MOV R0.x, (0x00010000, 9.183549616e-41f).x
y: MOV R0.y, (0xFFFFFFFF, -1.#QNANf).y
z: MOV R0.z, (0x40066666, 2.099999905f).z
w: MOV R0.w, (0x3C23D70A, 0.009999999776f).w
t: MOV R1.x, (0x3C23D70A, 0.009999999776f).w
1 x: MOV R2.x, (0x00010000, 9.183549616e-41f).x
y: MOV R1.y, (0xFFFFFFFF, -1.#QNANf).y
z: MOV R1.z, (0xFFFFFFFF, -1.#QNANf).y
w: MOV R1.w, (0x3C23D70A, 0.009999999776f).z
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
02 ALU_BREAK: ADDR(45) CNT(1)
2 x: PREDNE_INT ____, R2.x, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
03 ALU: ADDR(46) CNT(44)
3 x: MUL ____, R0.z, (0xFFFFFFFF, -1.#QNANf).x
y: MUL ____, R0.y, (0x40066666, 2.099999905f).y
z: MUL ____, R0.x, (0x3C23D70A, 0.009999999776f).z
w: MUL ____, R0.w, (0x00010000, 9.183549616e-41f).w
t: LOG_sat T0.z, |R1.x|
4 x: MUL T0.x, PV3.x, (0x3FB8AA3B, 1.442695022f).x
y: MUL T0.y, PV3.y, (0x3FB8AA3B, 1.442695022f).x
z: MUL T1.z, PV3.z, (0x3FB8AA3B, 1.442695022f).x
w: MUL T0.w, PV3.w, (0x3FB8AA3B, 1.442695022f).x
t: LOG_sat ____, |R1.y|
5 x: ADD_INT R2.x, -1, R2.x
y: MUL T1.y, PS4, (0x3F317218, 0.6931471825f).x
z: MUL T0.z, T0.z, (0x3F317218, 0.6931471825f).x
t: LOG_sat ____, |R1.z|
6 x: MUL T2.x, PS5, (0x3F317218, 0.6931471825f).x
t: LOG_sat ____, |R1.w|
7 w: MUL T1.w, PS6, (0x3F317218, 0.6931471825f).x
t: EXP_e T1.z, T1.z
8 t: EXP_e T1.x, T0.y
9 t: EXP_e T2.z, T0.x
10 t: EXP_e T0.y, T0.w
11 t: RCP_e ____, T1.z
12 x: MUL R0.x, T0.z, PS11
t: RCP_e ____, T1.x
13 y: MUL R0.y, T1.y, PS12
t: RCP_e T1.x, T0.y
14 t: RCP_e ____, T2.z
15 z: MUL R0.z, T2.x, PS14
w: MUL R0.w, T1.w, T1.x
16 x: DOT4 ____, R0.x, R0.x
y: DOT4 ____, R0.y, R0.y
z: DOT4 ____, PV15.z, PV15.z
w: DOT4 ____, PV15.w, PV15.w
17 t: RSQ_e ____, |PV16.x|
18 x: MUL R1.x, R0.x, PS17
y: MUL R1.y, R0.y, PS17
z: MUL R1.z, R0.z, PS17
w: MUL R1.w, R0.w, PS17
04 ENDLOOP i0 PASS_JUMP_ADDR(2)
05 MEM_EXPORT_WRITE: DWORD_PTR[0], R1, ELEM_SIZE(3)
06 ALU: ADDR(90) CNT(4)
19 x: MOV R0.x, 0.0f
y: MOV R0.y, 0.0f
z: MOV R0.z, 0.0f
w: MOV R0.w, 0.0f
07 EXP_DONE: PIX0, R0
END_OF_PROGRAM
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(13)
0 x: MOV R0.x, (0x00010000, 9.183549616e-41f).x
y: MOV R0.y, (0xFFFFFFFF, -1.#QNANf).y
z: MOV R0.z, (0x40066666, 2.099999905f).z
w: MOV R0.w, (0x3C23D70A, 0.009999999776f).w
1 x: MOV R1.x, (0x3C23D70A, 0.009999999776f).x
y: MOV R1.y, (0xFFFFFFFF, -1.#QNANf).y
z: MOV R1.z, (0xFFFFFFFF, -1.#QNANf).y
w: MOV R1.w, (0x3C23D70A, 0.009999999776f).x
2 x: MOV R2.x, (0x00010000, 9.183549616e-41f).x
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
02 ALU: ADDR(45) CNT(1)
3 x: PREDNE_INT ____, R2.x, 0.0f UPDATE_EXEC_MASK BREAK UPDATE_PRED
03 ALU: ADDR(46) CNT(71)
4 x: MUL ____, R0.z, (0xFFFFFFFF, -1.#QNANf).x
y: MUL ____, R0.y, (0x40066666, 2.099999905f).y
z: MUL ____, R0.x, (0x3C23D70A, 0.009999999776f).z
w: MUL ____, R0.w, (0x00010000, 9.183549616e-41f).w
5 x: MUL T0.x, PV4.x, (0x3FB8AA3B, 1.442695022f).x
y: MUL T0.y, PV4.w, (0x3FB8AA3B, 1.442695022f).x
z: MUL T0.z, PV4.z, (0x3FB8AA3B, 1.442695022f).x
w: MUL T0.w, PV4.y, (0x3FB8AA3B, 1.442695022f).x
6 x: LOG_sat ____, |R1.x|
y: LOG_sat ____, |R1.x|
z: LOG_sat ____, |R1.x|
7 x: LOG_sat ____, |R1.y|
y: LOG_sat ____, |R1.y|
z: LOG_sat ____, |R1.y|
w: MUL T1.w, PV6.x, (0x3F317218, 0.6931471825f).x
8 x: LOG_sat ____, |R1.z|
y: LOG_sat ____, |R1.z|
z: LOG_sat ____, |R1.z|
w: MUL T2.w, PV7.z, (0x3F317218, 0.6931471825f).x
9 x: LOG_sat ____, |R1.w|
y: LOG_sat ____, |R1.w|
z: LOG_sat ____, |R1.w|
w: MUL T3.w, PV8.y, (0x3F317218, 0.6931471825f).x
10 x: EXP_e ____, T0.z
y: EXP_e T1.y, T0.z
z: EXP_e ____, T0.z
w: MUL R0.w, PV9.x, (0x3F317218, 0.6931471825f).x
11 x: EXP_e ____, T0.w
y: EXP_e ____, T0.w
z: EXP_e T0.z, T0.w
12 x: EXP_e T0.x, T0.x
y: EXP_e ____, T0.x
z: EXP_e ____, T0.x
13 x: EXP_e ____, T0.y
y: EXP_e ____, T0.y
z: EXP_e T1.z, T0.y
14 x: RCP_e T1.x, T1.y
y: RCP_e ____, T1.y
z: RCP_e ____, T1.y
15 x: RCP_e ____, T0.z
y: RCP_e T1.y, T0.z
z: RCP_e ____, T0.z
16 x: RCP_e ____, T1.z
y: RCP_e T0.y, T1.z
z: RCP_e ____, T1.z
17 x: RCP_e ____, T0.x
y: RCP_e ____, T0.x
z: RCP_e ____, T0.x
18 x: MUL R0.x, T1.w, T1.x
y: MUL R0.y, T2.w, T1.y VEC_120
z: MUL R0.z, T3.w, PV17.x VEC_201
19 x: ADD_INT R2.x, -1, R2.x
w: MUL R0.w, R0.w, T0.y
20 x: DOT4 ____, R0.x, R0.x
y: DOT4 ____, R0.y, R0.y
z: DOT4 ____, R0.z, R0.z
w: DOT4 ____, PV19.w, PV19.w
21 x: RSQ_e ____, |PV20.x|
y: RSQ_e ____, |PV20.x|
z: RSQ_e ____, |PV20.x|
22 x: MUL R1.x, R0.x, PV21.y
y: MUL R1.y, R0.y, PV21.y
z: MUL R1.z, R0.z, PV21.y
w: MUL R1.w, R0.w, PV21.y
04 ENDLOOP i0 PASS_JUMP_ADDR(2)
05 MEM_EXPORT_WRITE: DWORD_PTR[0], R1, ELEM_SIZE(3) VPM
06 ALU: ADDR(117) CNT(4)
23 x: MOV R0.x, 0.0f
y: MOV R0.y, 0.0f
z: MOV R0.z, 0.0f
w: MOV R0.w, 0.0f
07 EXP_DONE: PIX0, R0
08 END
END_OF_PROGRAM
Quote: Speculation: Barts chip has indeed 1280 SPs, but the 6870 and 6850 are salvage parts with parts of the chip deactivated. They seem to be "good enough" anyways, while the full Barts chips can be saved for either a 6890 or a dual card (Antilles?). That would explain the architecture quirks too and would explain all this confusion.
Still, that isn't consistent with wavefront size, which would be 80 too.
Anyone asking what happened to the rumored 4-slot VLIW units? Guess you will see some ISA code for that tonight.
A RCP will use 3 slots, the same as a SQRT or a RSQ:
Code:
x: RCP_sat R2.x, R1.y
y: RCP_sat ____, R1.y
z: RCP_sat ____, R1.y
A sine is done by a multiply by 1/(2PI) followed by another 3-slot instruction (why the hell is the rescaling with 2PI necessary?):
Code:
3 x: MUL ____, R1.x, (0x3E22F983, 0.1591549367f).x
4 x: SIN R2.x, PV3.x
y: SIN ____, PV3.x
z: SIN ____, PV3.x
A 32-bit integer MUL takes the full 4 slots:
Code:
x: MULLO_INT R2.x, R1.x, R1.x
y: MULLO_INT ____, R1.x, R1.x
z: MULLO_INT ____, R1.x, R1.x
w: MULLO_INT ____, R1.x, R1.x
A 24-bit integer MUL takes a single slot, as expected:
Code:
3 x: MUL_INT24__NI R2.x, R1.x, R1.x
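The literal in the MUL in front of SIN can be decoded to see what the rescaling does. A small Python sketch follows; `decode_f32` and `sin_turns` are hypothetical helper names, and the idea that SIN takes its argument in turns (1.0 = one full period) rather than radians is an assumption inferred from the constant, not something the thread confirms.

```python
import math
import struct


def decode_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE-754 single-precision float."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]


def sin_turns(t: float) -> float:
    """Assumed model of the SIN instruction: input in turns, not radians."""
    return math.sin(2.0 * math.pi * t)


inv_two_pi = decode_f32(0x3E22F983)  # literal from the MUL before SIN
ln2 = decode_f32(0x3F317218)         # literal multiplied onto the LOG results
log2e = decode_f32(0x3FB8AA3B)       # literal multiplied in before EXP

x = 1.0  # radians
print(inv_two_pi, 1.0 / (2.0 * math.pi))
print(sin_turns(x * inv_two_pi), math.sin(x))
```

Under that assumption the MUL is just the radians-to-turns conversion. The other two literals in the loop disassembly are ln 2 and log2(e), which suggests LOG and EXP work in base 2 and get rescaled in the same way.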
Quote: a 32bit integer MUL takes the full 4 slots:
Why the hell would that happen? I mean, isn't two slots enough for that? The only way I see that happening is if each slot performs 1/4 of the operation (8 bits), possibly utilizing the mantissa portion?
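For what it's worth, the arithmetic behind splitting a 32-bit multiply into narrower pieces can be sketched in Python. This only shows that the low 32 bits of a product can be assembled from three 16x16 partial products; how the hardware actually distributes the work over the four slots is not stated anywhere in the thread, and `mullo32_from_halves` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def mullo32_from_halves(a: int, b: int) -> int:
    """Low 32 bits of a*b, built from 16-bit halves of the operands."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF
    low = a_lo * b_lo                            # contributes bits 0..31
    cross = (a_lo * b_hi + a_hi * b_lo) & 0xFFFF # only 16 bits survive the shift
    # a_hi * b_hi would land at bit 32 and above, so it drops out entirely.
    return (low + (cross << 16)) & MASK32


a, b = 0x12345678, 0x9ABCDEF0
print(hex(mullo32_from_halves(a, b)), hex((a * b) & MASK32))
```

So a multiplier narrower than 32 bits needs several partial products plus the adds to combine them, which is at least consistent with MULLO_INT occupying more than one slot while the 24-bit variant fits in one.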
Quote: So the throughput for code with a lot of transcendentals suffers a bit, just as I was expecting in the discussion back then.
So all transcendentals are using x, y, z? No special tricks the w lane can do?
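How much transcendental-heavy code suffers can be estimated from the loop posted earlier in the thread. Counting by hand, its body has 13 transcendentals (4 LOG, 4 EXP, 4 RCP, 1 RSQ) and 25 other scalar ops (the MULs, the DOT4 and the ADD_INT). The Python sketch below computes a lower bound on instruction groups per iteration, assuming at most one transcendental per group on both designs (one t slot on the 5-wide unit, 3 of 4 slots taken on the 4-wide one); `min_groups` is a hypothetical helper and the bound ignores data dependencies.

```python
import math


def min_groups(transcendentals, other_ops, width, slots_per_transcendental):
    """Lower bound on VLIW instruction groups for one loop iteration."""
    slots = transcendentals * slots_per_transcendental + other_ops
    # Need enough groups to hold all slots, and one group per transcendental.
    return max(transcendentals, math.ceil(slots / width))


TRANS, OTHER = 13, 25  # counted from the loop body posted above

vliw5 = min_groups(TRANS, OTHER, width=5, slots_per_transcendental=1)
vliw4 = min_groups(TRANS, OTHER, width=4, slots_per_transcendental=3)
print(vliw5, vliw4)  # 13 16
```

The two disassemblies actually use 16 groups (3 to 18) with the t slot and 19 groups (4 to 22) without it, so both land three groups above this simple bound, and the 4-slot version needs roughly 19% more groups for this particular loop.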