Old 18-Oct-2010, 23:01   #3401
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,436

Quote:
Originally Posted by Sontin View Post
Don't forget the front end with the 25% higher clock rate.
And things like internal L2-L1 bandwidth (if the bus stays the same, that is).
Old 18-Oct-2010, 23:29   #3402
DarthShader
Member
 
Join Date: Jul 2010
Location: Land of Mu
Posts: 350

nApoleon says 1120 SPs are official for the XT: http://www.chiphell.com/thread-130363-1-1.html

Meanwhile, pclab.pl has posted more pics of XFX cards, saying the 6870 has 1280 shaders: http://pclab.pl/news43574.html. Funnily, they are unsure about the 6850's final clocks, either 725 or 775 MHz.

Speculation:

The Barts chip does indeed have 1280 SPs, but the 6870 and 6850 are salvage parts with parts of the chip deactivated. They seem to be "good enough" anyway, while the full Barts chips can be saved for either a 6890 or a dual card (Antilles?). That would explain the architecture quirks and all this confusion.
Old 18-Oct-2010, 23:33   #3403
jimbo75
Member
 
Join Date: Jan 2010
Posts: 845

Honestly AMD must be pissing themselves with laughter.
Old 18-Oct-2010, 23:33   #3404
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987

Anyone asking what happened to the rumored 4 slot VLIW units? Guess you will see some ISA code for that tonight

An RCP uses 3 slots, the same as a SQRT or an RSQ:
Code:
x: RCP_sat     R2.x,  R1.y      
y: RCP_sat     ____,  R1.y      
z: RCP_sat     ____,  R1.y
A sine is done by a multiplication by 1/(2π) followed by a 3-slot instruction (why the hell is the rescaling by 2π necessary?):
Code:
          3  x: MUL         ____,  R1.x,  (0x3E22F983, 0.1591549367f).x
          4  x: SIN         R2.x,  PV3.x      
             y: SIN         ____,  PV3.x      
             z: SIN         ____,  PV3.x
A 32-bit integer MUL takes the full 4 slots:
Code:
x: MULLO_INT   R2.x,  R1.x,  R1.x      
y: MULLO_INT   ____,  R1.x,  R1.x      
z: MULLO_INT   ____,  R1.x,  R1.x      
w: MULLO_INT   ____,  R1.x,  R1.x
A 24-bit integer MUL takes a single slot, as expected:
Code:
          3  x: MUL_INT24__NI  R2.x,  R1.x,  R1.x
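
For anyone following along, the literal in the sine example can be decoded by hand. The Python sketch below is not taken from any tool used in this thread and the helper names are made up. It shows that 0x3E22F983 is the float32 closest to 1/(2π), which hints (as an assumption, not something the disassembly confirms) that the SIN instruction expects its operand in full periods rather than radians.

```python
import math
import struct

def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE-754 single (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# The literal that feeds the MUL in front of the SIN bundle.
scale = bits_to_float32(0x3E22F983)
print(scale, 1.0 / (2.0 * math.pi))

# Assumption: the hardware SIN takes its argument in periods (turns),
# so sin(x) with x in radians becomes sin_turns(x * scale).
def sin_turns(t: float) -> float:
    return math.sin(2.0 * math.pi * t)

x = 1.234
print(math.sin(x), sin_turns(x * scale))
```

If that reading is right, the extra MUL is simply the radians-to-periods conversion done in the shader instead of inside the transcendental unit.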

Last edited by Gipsel; 19-Oct-2010 at 00:29.
Old 18-Oct-2010, 23:41   #3405
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

You sneaky chappy! Hmm, very interesting.
__________________
Can it play WoW?
Old 18-Oct-2010, 23:43   #3406
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987

Quote:
Originally Posted by Jawed View Post
You sneaky chappy! Hmm, very interesting.
Any wishes? You can send me IL code and I will show you what it looks like in ISA.

Btw., it really looks like you were right.
Old 18-Oct-2010, 23:48   #3407
Kaotik
yes, i'm drunk
 
Join Date: Apr 2003
Posts: 4,809

Quote:
Originally Posted by DarthShader View Post
The Barts chip does indeed have 1280 SPs, but the 6870 and 6850 are salvage parts with parts of the chip deactivated. They seem to be "good enough" anyway, while the full Barts chips can be saved for either a 6890 or a dual card (Antilles?). That would explain the architecture quirks and all this confusion.
The drivers still state clearly that Antilles = Cayman based, not Barts.

For someone who doesn't understand crap about shader code etc., what exactly does Gipsel's post mean?
__________________
I'm nothing but a shattered soul...
Been ravaged by the chaotic beauty...
Ruined by the unreal temptations...
I was betrayed by my own beliefs...
Old 18-Oct-2010, 23:53   #3408
DarthShader
Member
 
Join Date: Jul 2010
Location: Land of Mu
Posts: 350

Quote:
Originally Posted by Kaotik View Post
For someone who doesn't understand crap about shader code etc., what exactly does Gipsel's post mean?
That's what I am wondering too.

But I think I got it: RCP = reciprocal. And if it takes three slots, starting from x, then it means the shaders have indeed been buffed, but not in the way I'd like them to be. Still, it could be pretty efficient if there's a 4D VLIW structure.
Old 18-Oct-2010, 23:58   #3409
Rebel44
Registered
 
Join Date: Feb 2007
Posts: 64

Quote:
Originally Posted by LordEC911 View Post
On a slight tangent- Was helping out a Fry's Electronics employee at my work, started chatting about his job, he does sales in computer accessories. Just asked him if he had heard about AMD releasing a new product. He said that the last week or so employees kept seeing them in the back and putting them out on the shelf so they finally locked them up. I was like damn, could have gotten one early.

I'll have to try and find one of those clueless employees a week or two before Cayman is released.
As long as the product isn't really locked up and it already has a bar code, it's easier to bribe some random employee with a few $ and he will bring it to you. At least that is my experience from retail.
Old 19-Oct-2010, 00:07   #3410
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987

Quote:
Originally Posted by Kaotik View Post
For someone who doesn't understand crap about shader code etc., what exactly does Gipsel's post mean?
It means that there is one member of a future GPU line from AMD which has indeed 4 slot VLIW units. The t slot got deleted and the other ones are beefed up a bit. Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (like already the case for double precision). A reciprocal is done by 3 slots.

And as expected, double precision is 1/4 of SP, at least for FMA and MUL, for ADD it's 1/2.
Old 19-Oct-2010, 00:21   #3411
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,436

Quote:
Originally Posted by Gipsel View Post
It means that there is one member of a future GPU line from AMD which has indeed 4 slot VLIW units. The t slot got deleted and the other ones are beefed up a bit. Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (like already the case for double precision). A reciprocal is done by 3 slots.
So if that's VLIW-4, how many SIMDs?
If even RCP is taking 3 slots, are the other transcendentals taking all 4 (given that RCP should be the cheapest one)?
Old 19-Oct-2010, 00:26   #3412
Arty
KEPLER
 
Join Date: Jun 2005
Posts: 1,892

Quote:
Originally Posted by Gipsel View Post
It means that there is one member of a future GPU line from AMD which has indeed 4 slot VLIW units. The t slot got deleted and the other ones are beefed up a bit. Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (like already the case for double precision). A reciprocal is done by 3 slots.

And as expected, double precision is 1/4 of SP, at least for FMA and MUL, for ADD it's 1/2.
__________________
People like you - Silent_Buddha laying an epic smackdown on XMAN26's double standards.
So you're mixing apples and oranges to calculate grapes and then compare it to apples. - silent_guy's witty retort on sweeping comparisons.
Old 19-Oct-2010, 00:27   #3413
DarthShader
Member
 
Join Date: Jul 2010
Location: Land of Mu
Posts: 350

Quote:
Originally Posted by Gipsel View Post
Complex operations like transcendentals (hence the name for the old t slot) are now done not by a single slot but instead by several slots working together (like already the case for double precision). A reciprocal is done by 3 slots.
How is this different from the situation where the t-unit used other lanes for help? I can only see a benefit if a transcendental takes up only two slots, so that two could be performed in parallel.

Quote:
And as expected, double precision is 1/4 of SP, at least for FMA and MUL, for ADD it's 1/2.
Double precision enabled for Barts? That would be big news for me! Or do you mean "one member of a future GPU line from AMD"?
Old 19-Oct-2010, 00:32   #3414
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

Quote:
Originally Posted by Gipsel View Post
Any wishes? You can send me IL code and I will show you what it looks like in ISA.
Ooh, thanks.

Here's something old that I've tweaked for extra transcendentals:
Code:
il_ps_2_0
dcldef_x(*)_y(*)_z(*)_w(*) r0
def c0, 0.6931471825, 1.442695022, 0.0, 0.0
dclpin_usage(color)_usageIndex(10)_x(*)_y(*)_z(*)_w(*)_centroid vPixIn0
dclpin_usage(color)_usageIndex(26)_x(*)_y(*)_z(*)_w(*)_centroid vPixIn1
mov r0, vPixIn0
add r1, r0, vPixIn1
log_zeroop(fltmax) r2.x___, r1.x_abs
log_zeroop(fltmax) r2._y__, r1.y_abs
log_zeroop(fltmax) r2.__z_, r1.z_abs
log_zeroop(fltmax) r2.___w, r1.w_abs
mul r1, r2, c0.x
mul r0, r0, vPixIn1
mul r0, r0, c0.y
exp r0.x___, r0.x
rcp_zeroop(infinity) r2.x___, r0.x
exp r0.x___, r0.y
rcp_zeroop(infinity) r2._y__, r0.x
exp r0.x___, r0.z
exp r0._y__, r0.w
rcp_zeroop(infinity) r2.___w, r0.y
rcp_zeroop(infinity) r2.__z_, r0.x
mul r0, r1, r2
dp4 r1.x___, r0, r0
rsq_zeroop(infinity) r1.x___, r1.x_abs
mul r37, r0, r1.x
colorclamp oC0, r37
end
A comparison of VLIW-5 and VLIW-4 would be fairly interesting.

Quote:
Btw., it really looks like you were right.
I'm dead chuffed. Maybe I should retire, got something right for a change
__________________
Can it play WoW?
Old 19-Oct-2010, 00:34   #3415
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987

Quote:
Originally Posted by Arty View Post
Maybe I should say at least one. Unfortunately I cannot connect names to it right now.
Old 19-Oct-2010, 00:36   #3416
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,089

What is the throughput of RCP when issued through the current T lane?

I'm trying to think of what sequence of operations can be done across three lanes.
A successive approximation of an RCP would take 5 iterations to yield a single-precision result if using plain math ops. Perhaps a more elaborate method is used?
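
One way to arrive at the figure of 5 iterations is Newton-Raphson, x ← x·(2 − d·x), whose relative error squares on every step. The sketch below assumes a very crude starting guess (x0 = 1 for a mantissa d in [0.5, 1), i.e. about one correct bit) and a 24-bit target; both are assumptions for illustration, not a description of any real hardware.

```python
def newton_rcp_iterations(d: float, x0: float = 1.0, bits: int = 24) -> int:
    """Count Newton-Raphson steps x <- x * (2 - d * x) until the
    relative error of x as an estimate of 1/d drops below 2**-bits."""
    x = x0
    steps = 0
    while abs(1.0 - d * x) >= 2.0 ** -bits:
        x = x * (2.0 - d * x)
        steps += 1
    return steps

# Worst case on [0.5, 1) with the crude guess x0 = 1 is d = 0.5:
# the error goes 2^-1, 2^-2, 2^-4, 2^-8, 2^-16, 2^-32.
print(newton_rcp_iterations(0.5))  # prints 5
```

A better seed (for example from a small lookup table) cuts the step count quickly, which is why table-based schemes are attractive here.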
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 19-Oct-2010, 00:49   #3417
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,863

Quote:
Originally Posted by 3dilettante View Post
What is the throughput of RCP when issued through the current T lane?
1 cycle, for one scalar per cycle.

Quote:
I'm trying to think of what sequence of operations can be done across three lanes.
A successive approximation of an RCP would take 5 iterations to yield a single-precision result if using plain math ops. Perhaps a more elaborate method is used?
My original proposal was to take the look-up tables and spread them amongst the four lanes. Then use the single-cycle dependent math capability of the 4 lanes, working together, to perform all the math (Lagrange polynomials).

http://forum.beyond3d.com/showthread...26#post1417026

Back then the question became about throughput. Also, it's worth bearing in mind that for graphics the error can be quite large (it's single precision for a start).
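
To make the table-plus-polynomial idea concrete, here is a rough Python sketch of a reciprocal built from a lookup table and quadratic Lagrange interpolation. Everything in it is an assumption for illustration: the segment count (128), the choice of sample points, and the function names are invented, and nothing here says how the tables or the math are actually split across lanes in hardware.

```python
def build_table(segments: int = 128):
    """For each segment of [1, 2), store three sample points of 1/x
    (both ends and the midpoint). A real table would hold constants."""
    width = 1.0 / segments
    table = []
    for i in range(segments):
        x0 = 1.0 + i * width
        xs = (x0, x0 + width / 2.0, x0 + width)
        table.append((xs, tuple(1.0 / x for x in xs)))
    return table

def rcp_lookup(m: float, table) -> float:
    """Approximate 1/m for a mantissa m in [1, 2) with a quadratic
    Lagrange polynomial through the three points of its segment."""
    i = min(int((m - 1.0) * len(table)), len(table) - 1)
    (x0, x1, x2), (y0, y1, y2) = table[i]
    l0 = (m - x1) * (m - x2) / ((x0 - x1) * (x0 - x2))
    l1 = (m - x0) * (m - x2) / ((x1 - x0) * (x1 - x2))
    l2 = (m - x0) * (m - x1) / ((x2 - x0) * (x2 - x1))
    return y0 * l0 + y1 * l1 + y2 * l2

table = build_table()
samples = [1.0 + k / 100003.0 for k in range(100003)]
max_rel_err = max(abs(rcp_lookup(m, table) * m - 1.0) for m in samples)
print(max_rel_err, max_rel_err < 2.0 ** -24)
```

With these made-up parameters the worst relative error stays below 2^-24, so the accuracy side looks feasible; the throughput question raised above is a separate matter.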

I didn't put any trig in the shader earlier, because that gets swamped by operand-normalisation (or at least it does some of the time - not sure, something I haven't studied).
__________________
Can it play WoW?
Old 19-Oct-2010, 01:13   #3418
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987

Quote:
Originally Posted by Jawed View Post
Ooh, thanks.

Here's something old that I've tweaked for extra transcendentals:
[..]
A comparison of VLIW-5 and VLIW-4 would be fairly interesting.
There was a slight problem with it: I didn't get it to compile with the "def c0" in it. The error message wasn't telling too much (unsupported opcode, without saying which one), so I started to rip out everything (and it worked after replacing c0 with a literal). What was left is this (edit: I just saw that a bit of other stuff got pasted in, but outside of the loop, so it shouldn't matter):
Code:
il_ps_2_0
dcl_literal l1, 0.6931471825, 1.442695022, 0.0, 0.0
dcl_literal l0, 0x10000, 0xffffffff, 2.1, 0.01
mov r13.x, l0.x ; loop counter
mov r1.xyzw,l0.zzzz ; arbitrary values, r1 = 2*2-2 = 2
mov r2.xyzw,l0.wwww ; r2 = 0*0+0 = 0
mov r0, l0
add r1, r0, l0.wzyx
whileloop
break_logicalz r13 ; while(r13 > 0)
log_zeroop(fltmax) r2.x___, r1.x_abs
log_zeroop(fltmax) r2._y__, r1.y_abs
log_zeroop(fltmax) r2.__z_, r1.z_abs
log_zeroop(fltmax) r2.___w, r1.w_abs
mul r1, r2, l1.x
mul r0, r0, l0.wzyx
mul r0, r0, l1.y
exp r0.x___, r0.x
rcp_zeroop(infinity) r2.x___, r0.x
exp r0.x___, r0.y
rcp_zeroop(infinity) r2._y__, r0.x
exp r0.x___, r0.z
exp r0._y__, r0.w
rcp_zeroop(infinity) r2.___w, r0.y
rcp_zeroop(infinity) r2.__z_, r0.x
mul r0, r1, r2
dp4 r1.x___, r0, r0
rsq_zeroop(infinity) r1.x___, r1.x_abs
mul r1, r0, r1.x
iadd r13.x, r13.x, l0.y ; counter--
endloop
mov g[0], r1
end
Still okay?
On Cypress the ISA looks like this:
Code:
; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(13) 
      0  x: MOV         R0.x,  (0x00010000, 9.183549616e-41f).x      
         y: MOV         R0.y,  (0xFFFFFFFF, -1.#QNANf).y      
         z: MOV         R0.z,  (0x40066666, 2.099999905f).z      
         w: MOV         R0.w,  (0x3C23D70A, 0.009999999776f).w      
         t: MOV         R1.x,  (0x3C23D70A, 0.009999999776f).w      
      1  x: MOV         R2.x,  (0x00010000, 9.183549616e-41f).x      
         y: MOV         R1.y,  (0xFFFFFFFF, -1.#QNANf).y      
         z: MOV         R1.z,  (0xFFFFFFFF, -1.#QNANf).y      
         w: MOV         R1.w,  (0x3C23D70A, 0.009999999776f).z      
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX 
    02 ALU_BREAK: ADDR(45) CNT(1) 
          2  x: PREDNE_INT  ____,  R2.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
    03 ALU: ADDR(46) CNT(44) 
          3  x: MUL         ____,  R0.z,  (0xFFFFFFFF, -1.#QNANf).x      
             y: MUL         ____,  R0.y,  (0x40066666, 2.099999905f).y      
             z: MUL         ____,  R0.x,  (0x3C23D70A, 0.009999999776f).z      
             w: MUL         ____,  R0.w,  (0x00010000, 9.183549616e-41f).w      
             t: LOG_sat     T0.z,  |R1.x|      
          4  x: MUL         T0.x,  PV3.x,  (0x3FB8AA3B, 1.442695022f).x      
             y: MUL         T0.y,  PV3.y,  (0x3FB8AA3B, 1.442695022f).x      
             z: MUL         T1.z,  PV3.z,  (0x3FB8AA3B, 1.442695022f).x      
             w: MUL         T0.w,  PV3.w,  (0x3FB8AA3B, 1.442695022f).x      
             t: LOG_sat     ____,  |R1.y|      
          5  x: ADD_INT     R2.x,  -1,  R2.x      
             y: MUL         T1.y,  PS4,  (0x3F317218, 0.6931471825f).x      
             z: MUL         T0.z,  T0.z,  (0x3F317218, 0.6931471825f).x      
             t: LOG_sat     ____,  |R1.z|      
          6  x: MUL         T2.x,  PS5,  (0x3F317218, 0.6931471825f).x      
             t: LOG_sat     ____,  |R1.w|      
          7  w: MUL         T1.w,  PS6,  (0x3F317218, 0.6931471825f).x      
             t: EXP_e       T1.z,  T1.z      
          8  t: EXP_e       T1.x,  T0.y      
          9  t: EXP_e       T2.z,  T0.x      
         10  t: EXP_e       T0.y,  T0.w      
         11  t: RCP_e       ____,  T1.z      
         12  x: MUL         R0.x,  T0.z,  PS11      
             t: RCP_e       ____,  T1.x      
         13  y: MUL         R0.y,  T1.y,  PS12      
             t: RCP_e       T1.x,  T0.y      
         14  t: RCP_e       ____,  T2.z      
         15  z: MUL         R0.z,  T2.x,  PS14      
             w: MUL         R0.w,  T1.w,  T1.x      
         16  x: DOT4        ____,  R0.x,  R0.x      
             y: DOT4        ____,  R0.y,  R0.y      
             z: DOT4        ____,  PV15.z,  PV15.z      
             w: DOT4        ____,  PV15.w,  PV15.w      
         17  t: RSQ_e       ____,  |PV16.x|      
         18  x: MUL         R1.x,  R0.x,  PS17      
             y: MUL         R1.y,  R0.y,  PS17      
             z: MUL         R1.z,  R0.z,  PS17      
             w: MUL         R1.w,  R0.w,  PS17      
04 ENDLOOP i0 PASS_JUMP_ADDR(2) 
05 MEM_EXPORT_WRITE: DWORD_PTR[0], R1, ELEM_SIZE(3) 
06 ALU: ADDR(90) CNT(4) 
     19  x: MOV         R0.x,  0.0f      
         y: MOV         R0.y,  0.0f      
         z: MOV         R0.z,  0.0f      
         w: MOV         R0.w,  0.0f      
07 EXP_DONE: PIX0, R0
END_OF_PROGRAM
On that future 4-slot VLIW architecture it looks like this:
Code:
; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(13) 
      0  x: MOV         R0.x,  (0x00010000, 9.183549616e-41f).x      
         y: MOV         R0.y,  (0xFFFFFFFF, -1.#QNANf).y      
         z: MOV         R0.z,  (0x40066666, 2.099999905f).z      
         w: MOV         R0.w,  (0x3C23D70A, 0.009999999776f).w      
      1  x: MOV         R1.x,  (0x3C23D70A, 0.009999999776f).x      
         y: MOV         R1.y,  (0xFFFFFFFF, -1.#QNANf).y      
         z: MOV         R1.z,  (0xFFFFFFFF, -1.#QNANf).y      
         w: MOV         R1.w,  (0x3C23D70A, 0.009999999776f).x      
      2  x: MOV         R2.x,  (0x00010000, 9.183549616e-41f).x      
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX 
    02 ALU: ADDR(45) CNT(1) 
          3  x: PREDNE_INT  ____,  R2.x,  0.0f      UPDATE_EXEC_MASK BREAK UPDATE_PRED 
    03 ALU: ADDR(46) CNT(71) 
          4  x: MUL         ____,  R0.z,  (0xFFFFFFFF, -1.#QNANf).x      
             y: MUL         ____,  R0.y,  (0x40066666, 2.099999905f).y      
             z: MUL         ____,  R0.x,  (0x3C23D70A, 0.009999999776f).z      
             w: MUL         ____,  R0.w,  (0x00010000, 9.183549616e-41f).w      
          5  x: MUL         T0.x,  PV4.x,  (0x3FB8AA3B, 1.442695022f).x      
             y: MUL         T0.y,  PV4.w,  (0x3FB8AA3B, 1.442695022f).x      
             z: MUL         T0.z,  PV4.z,  (0x3FB8AA3B, 1.442695022f).x      
             w: MUL         T0.w,  PV4.y,  (0x3FB8AA3B, 1.442695022f).x      
          6  x: LOG_sat     ____,  |R1.x|      
             y: LOG_sat     ____,  |R1.x|      
             z: LOG_sat     ____,  |R1.x|      
          7  x: LOG_sat     ____,  |R1.y|      
             y: LOG_sat     ____,  |R1.y|      
             z: LOG_sat     ____,  |R1.y|      
             w: MUL         T1.w,  PV6.x,  (0x3F317218, 0.6931471825f).x      
          8  x: LOG_sat     ____,  |R1.z|      
             y: LOG_sat     ____,  |R1.z|      
             z: LOG_sat     ____,  |R1.z|      
             w: MUL         T2.w,  PV7.z,  (0x3F317218, 0.6931471825f).x      
          9  x: LOG_sat     ____,  |R1.w|      
             y: LOG_sat     ____,  |R1.w|      
             z: LOG_sat     ____,  |R1.w|      
             w: MUL         T3.w,  PV8.y,  (0x3F317218, 0.6931471825f).x      
         10  x: EXP_e       ____,  T0.z      
             y: EXP_e       T1.y,  T0.z      
             z: EXP_e       ____,  T0.z      
             w: MUL         R0.w,  PV9.x,  (0x3F317218, 0.6931471825f).x      
         11  x: EXP_e       ____,  T0.w      
             y: EXP_e       ____,  T0.w      
             z: EXP_e       T0.z,  T0.w      
         12  x: EXP_e       T0.x,  T0.x      
             y: EXP_e       ____,  T0.x      
             z: EXP_e       ____,  T0.x      
         13  x: EXP_e       ____,  T0.y      
             y: EXP_e       ____,  T0.y      
             z: EXP_e       T1.z,  T0.y      
         14  x: RCP_e       T1.x,  T1.y      
             y: RCP_e       ____,  T1.y      
             z: RCP_e       ____,  T1.y      
         15  x: RCP_e       ____,  T0.z      
             y: RCP_e       T1.y,  T0.z      
             z: RCP_e       ____,  T0.z      
         16  x: RCP_e       ____,  T1.z      
             y: RCP_e       T0.y,  T1.z      
             z: RCP_e       ____,  T1.z      
         17  x: RCP_e       ____,  T0.x      
             y: RCP_e       ____,  T0.x      
             z: RCP_e       ____,  T0.x      
         18  x: MUL         R0.x,  T1.w,  T1.x      
             y: MUL         R0.y,  T2.w,  T1.y      VEC_120 
             z: MUL         R0.z,  T3.w,  PV17.x      VEC_201 
         19  x: ADD_INT     R2.x,  -1,  R2.x      
             w: MUL         R0.w,  R0.w,  T0.y      
         20  x: DOT4        ____,  R0.x,  R0.x      
             y: DOT4        ____,  R0.y,  R0.y      
             z: DOT4        ____,  R0.z,  R0.z      
             w: DOT4        ____,  PV19.w,  PV19.w      
         21  x: RSQ_e       ____,  |PV20.x|      
             y: RSQ_e       ____,  |PV20.x|      
             z: RSQ_e       ____,  |PV20.x|      
         22  x: MUL         R1.x,  R0.x,  PV21.y      
             y: MUL         R1.y,  R0.y,  PV21.y      
             z: MUL         R1.z,  R0.z,  PV21.y      
             w: MUL         R1.w,  R0.w,  PV21.y      
04 ENDLOOP i0 PASS_JUMP_ADDR(2) 
05 MEM_EXPORT_WRITE: DWORD_PTR[0], R1, ELEM_SIZE(3)  VPM 
06 ALU: ADDR(117) CNT(4) 
     23  x: MOV         R0.x,  0.0f      
         y: MOV         R0.y,  0.0f      
         z: MOV         R0.z,  0.0f      
         w: MOV         R0.w,  0.0f      
07 EXP_DONE: PIX0, R0
08 END 
END_OF_PROGRAM
So the throughput for code with a lot of transcendentals suffers a bit, just as I was expecting in the discussion back then.
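
To put a rough number on "suffers a bit", one can simply count the instruction groups in the loop body of the two listings. The sketch below assumes one bundle issues per cycle and ignores the break bundle and clause overhead, so treat the result as a ballpark only.

```python
# Bundle (VLIW instruction group) numbers of the loop body, read off the
# two disassemblies; the break bundle and clause overhead are ignored.
cypress_body = range(3, 18 + 1)   # VLIW-5 listing, groups 3..18
vliw4_body = range(4, 22 + 1)     # VLIW-4 listing, groups 4..22

cypress_bundles = len(cypress_body)
vliw4_bundles = len(vliw4_body)
slowdown = vliw4_bundles / cypress_bundles

# Issue slots occupied per iteration, assuming one bundle per cycle.
cypress_slot_cycles = cypress_bundles * 5
vliw4_slot_cycles = vliw4_bundles * 4

print(cypress_bundles, vliw4_bundles, round(slowdown, 4))  # 16 19 1.1875
print(cypress_slot_cycles, vliw4_slot_cycles)              # 80 76
```

So this shader needs about 19% more bundles per iteration on the 4-slot design, while the total slot-cycles spent (76 versus 80) are nearly the same.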

Last edited by Gipsel; 19-Oct-2010 at 01:20.
Old 19-Oct-2010, 02:21   #3419
DavidGraham
Member
 
Join Date: Dec 2009
Posts: 582

Quote:
Originally Posted by DarthShader View Post
Specualtion:

The Barts chip does indeed have 1280 SPs, but the 6870 and 6850 are salvage parts with parts of the chip deactivated. They seem to be "good enough" anyway, while the full Barts chips can be saved for either a 6890 or a dual card (Antilles?). That would explain the architecture quirks and all this confusion.
Still, that isn't consistent with the wavefront size, which would be 80 too.

Quote:
Originally Posted by Gipsel View Post
Anyone asking what happened to the rumored 4 slot VLIW units? Guess you will see some ISA code for that tonight

An RCP uses 3 slots, the same as a SQRT or an RSQ:
Code:
x: RCP_sat     R2.x,  R1.y      
y: RCP_sat     ____,  R1.y      
z: RCP_sat     ____,  R1.y
A sine is done by a multiplication by 1/(2π) followed by a 3-slot instruction (why the hell is the rescaling by 2π necessary?):
Code:
          3  x: MUL         ____,  R1.x,   (0x3E22F983, 0.1591549367f).x
          4  x: SIN         R2.x,  PV3.x      
             y: SIN         ____,  PV3.x      
             z: SIN         ____,  PV3.x
A 32-bit integer MUL takes the full 4 slots:
Code:
x: MULLO_INT   R2.x,  R1.x,  R1.x      
y: MULLO_INT   ____,  R1.x,  R1.x      
z: MULLO_INT   ____,  R1.x,  R1.x      
w: MULLO_INT   ____,  R1.x,  R1.x
A 24-bit integer MUL takes a single slot, as expected:
Code:
          3  x: MUL_INT24__NI  R2.x,  R1.x,   R1.x
Excellent work, Mr. Gipsel, thanks for the insight!

Quote:
a 32bit integer MUL takes the full 4 slots:
Why the hell would that happen? I mean, isn't two slots enough for that? The only way I see that happening is if each slot performs 1/4 of the operation (8 bits), possibly utilizing the mantissa portion?
Old 19-Oct-2010, 02:23   #3420
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,436

Quote:
Originally Posted by Gipsel View Post
So the throughput for code with a lot of transcendentals suffers a bit, just as I was expecting in the discussion back then.
So all transcendentals are using x,y,z? No special tricks w lane can do?
Who handles the float to int conversion stuff?
Old 19-Oct-2010, 02:33   #3421
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987

Quote:
Originally Posted by DavidGraham View Post
Why the hell would that happen? I mean, isn't two slots enough for that? The only way I see that happening is if each slot performs 1/4 of the operation (8 bits), possibly utilizing the mantissa portion?
Such a 32-bit integer multiplication can be constructed from four 16-bit integer multiplies and a series of adds. I had hoped 3 would be enough, or that it would at least be possible to get the full 64-bit result with one VLIW instruction group (which should be possible if the adders are fast and wide enough). But it needs two VLIW bundles to get that:
Code:
          4  x: MULLO_INT   R3.x,  R2.x,  R2.x      
             y: MULLO_INT   ____,  R2.x,  R2.x      
             z: MULLO_INT   ____,  R2.x,  R2.x      
             w: MULLO_INT   ____,  R2.x,  R2.x      
          5  x: MULHI_INT   R4.x,  R2.x,  R2.x      
             y: MULHI_INT   ____,  R2.x,  R2.x      
             z: MULHI_INT   ____,  R2.x,  R2.x      
             w: MULHI_INT   ____,  R2.x,  R2.x
It's quite late here. I would call it a day now
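
The decomposition into 16-bit multiplies can be sketched in a few lines of Python. This treats the operands as unsigned and the function name is made up; whether the hardware really splits the work this way across the four slots is a guess. Note that the low 32 bits need only three of the four partial products, since the high×high term is shifted out entirely.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mul32_from_16(a: int, b: int):
    """Return (lo, hi), the 32-bit halves of the unsigned product a * b,
    using only 16 x 16 -> 32 bit multiplies plus adds and shifts."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16

    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi

    # Low half (MULLO-like): hh contributes nothing below bit 32.
    lo = (ll + ((lh + hl) << 16)) & MASK32

    # High half (MULHI-like, unsigned here): all four products are
    # needed, plus the carry out of the low 32 bits.
    carry = ((ll >> 16) + (lh & MASK16) + (hl & MASK16)) >> 16
    hi = (hh + (lh >> 16) + (hl >> 16) + carry) & MASK32
    return lo, hi

a, b = 0x12345678, 0x9ABCDEF0
print([hex(v) for v in mul32_from_16(a, b)], hex(a * b))
```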
Old 19-Oct-2010, 02:36   #3422
Gipsel
Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 987

Quote:
Originally Posted by mczak View Post
So all transcendentals are using x,y,z? No special tricks w lane can do?
Yep. I thought w was the new t, as I had seen that it can't execute all instructions (like t). But after all, it looks like w can't execute the transcendentals.
Quote:
Originally Posted by mczak View Post
Who handles the float to int conversion stuff?
Let me check.
Edit:
Integer-to-float and float-to-integer conversions are handled by all 4 slots. Not together; each slot does one conversion. So the throughput is 4 conversions per cycle max. That's quite a bit faster than Evergreen (conversions only in the t unit).
Code:
          4  x: F_TO_I      R2.x,  R1.x      
             y: F_TO_I      R2.y,  R1.y      
             z: F_TO_I      R2.z,  R1.z      
             w: F_TO_I      R2.w,  R1.w
Works with rounding, too:
Code:
          4  x: RNDNE       R2.x,  R1.x      
             y: RNDNE       R2.y,  R1.y      
             z: RNDNE       R2.z,  R1.z      
             w: RNDNE       R2.w,  R1.w
And four float32 to float16 conversions (f2f16) compile to:
Code:
          4  x: FLT32_TO_FLT16_RTZ__NI  R2.x,  R1.x      
             y: FLT32_TO_FLT16_RTZ__NI  R2.y,  R1.y      
             z: FLT32_TO_FLT16_RTZ__NI  R2.z,  R1.z      
             w: FLT32_TO_FLT16_RTZ__NI  R2.w,  R1.w

Last edited by Gipsel; 19-Oct-2010 at 02:59.
Old 19-Oct-2010, 02:58   #3423
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

Quote:
Originally Posted by Kaotik View Post
For someone who doesn't understand crap about shader code etc., what exactly does Gipsel's post mean?
Charlie was right about the 4 symmetric alu rumour.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 19-Oct-2010, 02:59   #3424
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

Quote:
Originally Posted by DarthShader View Post
nApoleon says 1120sp are official for XT: http://www.chiphell.com/thread-130363-1-1.html
1120 is not divisible by 64, so it goes against whatever Charlie said.
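
The divisibility argument is easy to check. The sketch below assumes 16 VLIW units per SIMD, as on Cypress (20 SIMDs × 16 units × 5 slots = 1600 SPs), so a VLIW-4 SIMD would hold 64 SPs and a VLIW-5 SIMD 80; that per-SIMD width is an assumption for the unreleased chips, and the function name is made up.

```python
# Assumption: 16 VLIW units per SIMD, as on Cypress.
UNITS_PER_SIMD = 16

def simd_count(total_sps: int, slots_per_unit: int):
    """Return the SIMD count if total_sps divides evenly, else None."""
    per_simd = UNITS_PER_SIMD * slots_per_unit
    return total_sps // per_simd if total_sps % per_simd == 0 else None

for sps in (1120, 1280):
    print(sps, "VLIW-5:", simd_count(sps, 5), "VLIW-4:", simd_count(sps, 4))
```

Under that assumption 1120 only works out as 14 VLIW-5 SIMDs, while 1280 fits either 16 VLIW-5 or 20 VLIW-4 SIMDs.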
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 19-Oct-2010, 03:08   #3425
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,436

Quote:
Originally Posted by Gipsel View Post
Yep. I thought w was the new t, as I had seen that it can't execute all instructions (like t). But after all, it looks like w can't execute the transcendentals.

Let me check.
Edit:
Integer-to-float and float-to-integer conversions are handled by all 4 slots. Not together; each slot does one conversion. So the throughput is 4 conversions per cycle max. That's quite a bit faster than Evergreen (conversions only in the t unit).
Ok. I was actually not sure that Evergreen couldn't do that already. The R700 ISA doc has a nice list showing which instructions can be executed on the different units; the Evergreen one does not. Plus, sometimes these restrictions aren't mentioned in the detailed instruction descriptions either.


Tags
Barts, Cayman


All times are GMT +1. The time now is 23:38.

