AMD: R9xx Speculation

Gipsel · Oct 19, 2010

DavidGraham said:
Why the hell would that happen? I mean, isn't two slots enough for that? The only way I see that happening is if each slot performs 1/4 of the operation (8 bits)! (possibly utilizing the mantissa portion)?

Such a 32-bit integer multiplication can be constructed from four 16-bit integer multiplies and a series of adds. I had hoped three would be enough, or that it would at least be possible to get the full 64-bit result with one VLIW instruction group (which should be possible if the adders are fast and wide enough). But it needs two VLIW bundles to get that:

Code:

          4  x: MULLO_INT   R3.x,  R2.x,  R2.x      
             y: MULLO_INT   ____,  R2.x,  R2.x      
             z: MULLO_INT   ____,  R2.x,  R2.x      
             w: MULLO_INT   ____,  R2.x,  R2.x      
          5  x: MULHI_INT   R4.x,  R2.x,  R2.x      
             y: MULHI_INT   ____,  R2.x,  R2.x      
             z: MULHI_INT   ____,  R2.x,  R2.x      
             w: MULHI_INT   ____,  R2.x,  R2.x
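
For illustration only, here is a minimal Python sketch of how a 32-bit multiply can be assembled from four 16-bit multiplies and a series of adds. It uses unsigned arithmetic for simplicity, and the function name is hypothetical; it says nothing about how the hardware actually wires its multipliers, only that the arithmetic works out.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mul32_via_16bit_parts(a, b):
    """Return (lo, hi): the 32-bit halves of the unsigned 64-bit product a * b.

    Assumption: a and b are unsigned 32-bit values. 'lo' corresponds to what a
    MULLO-style op would return, 'hi' to an unsigned MULHI-style op.
    """
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16

    # Four 16x16 -> 32-bit partial products.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi

    # The "series of adds": each partial product is shifted to its weight.
    product = ll + ((lh + hl) << 16) + (hh << 32)
    return product & MASK32, (product >> 32) & MASK32
```

Calling `mul32_via_16bit_parts(a, b)` gives the same low and high words as splitting `a * b` directly, which is why the low and high halves can come out of the same set of partial products.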

It's quite late here. I would call it a day now :smile:

Gipsel · Oct 19, 2010

mczak said:
So all transcendentals are using x,y,z? No special tricks w lane can do?

Yep. I thought w was the new t, as I had seen that it can't execute all instructions (like t). But after all, it looks like w can't execute the transcendentals :idea:

mczak said:
Who handles the float to int conversion stuff?

Let me check.
Edit:
Integer to float and float to integer conversions are handled by all 4 slots. Not together, but 1 slot does one conversion. So the throughput is 4 conversions per cycle max. That's quite a bit faster than Evergreen (conversions only in the t unit).

Code:

          4  x: F_TO_I      R2.x,  R1.x      
             y: F_TO_I      R2.y,  R1.y      
             z: F_TO_I      R2.z,  R1.z      
             w: F_TO_I      R2.w,  R1.w

Works with rounding, too:

Code:

          4  x: RNDNE       R2.x,  R1.x      
             y: RNDNE       R2.y,  R1.y      
             z: RNDNE       R2.z,  R1.z      
             w: RNDNE       R2.w,  R1.w

And four float32 to float16 conversions (f2f16) compile to:

Code:

          4  x: FLT32_TO_FLT16_RTZ__NI  R2.x,  R1.x      
             y: FLT32_TO_FLT16_RTZ__NI  R2.y,  R1.y      
             z: FLT32_TO_FLT16_RTZ__NI  R2.z,  R1.z      
             w: FLT32_TO_FLT16_RTZ__NI  R2.w,  R1.w

rpg.314 · Oct 19, 2010

Kaotik said:
For someone who doesn't understand crap about shadercode etc, what exactly does Gipsels post mean?

Charlie was right about the 4 symmetric alu rumour.

rpg.314 · Oct 19, 2010

DarthShader said:
nApoleon says 1120sp are official for XT: http://www.chiphell.com/thread-130363-1-1.html

1120 is not divisible by 64, so it goes against whatever Charlie said.
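
The arithmetic behind that can be checked with a small sketch. It assumes 16 VLIW lanes per SIMD (not stated in this thread, but it is what makes 64 the relevant number for a 4-slot design); the helper name is made up for illustration.

```python
LANES_PER_SIMD = 16  # assumption: 16 VLIW units per SIMD

def simd_count(stream_processors, slots_per_lane):
    """Return (whole SIMDs, leftover SPs) for a given VLIW width."""
    per_simd = LANES_PER_SIMD * slots_per_lane
    return divmod(stream_processors, per_simd)

print(simd_count(1120, 4))  # 4-slot layout: 64 SPs per SIMD
print(simd_count(1120, 5))  # 5-slot layout: 80 SPs per SIMD
```

With 4 slots the count leaves a remainder of 32, while with 5 slots 1120 splits evenly into 14 SIMDs.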

mczak · Oct 19, 2010

Gipsel said:
Yep. I thought w was the new t, as I had seen that it can't execute all instructions (like t). But after all, it looks like w can't execute the transcendentals

Let me check.
Edit:
Integer to float and float to integer conversions are handled by all 4 slots. Not together, but 1 slot does one conversion. So the throughput is 4 conversions per cycle max. That's quite a bit faster than Evergreen (conversions only in the t unit).

Ok. I was actually not sure that Evergreen couldn't do that already. The r700 ISA doc has a nice list showing which instructions can be executed on the different units; the Evergreen one doesn't, and sometimes these restrictions aren't mentioned in the detailed instruction descriptions either.

Gipsel · Oct 19, 2010

rpg.314 said:
1120 is not divisible by 64, so it goes against whatever Charlie said.

There are quite a few ASICs enumerated in the driver. First (after the known ones) there are three with 5D shaders and only single precision (first generation Fusion, a.k.a. Wrestler, Sumo and Trinity?). After that come two ASICs with the 4-slot design (which are both double precision capable, Cayman? + ???) and after that it goes back to the 5-slot layout (lower end GPUs?). At least the last two generations got enumerated from top to bottom.

FUDie · Oct 19, 2010

rpg.314 said:
1120 is not divisible by 64, so it goes against whatever Charlie said.

Gipsel beat me to it.

-FUDie

rpg.314 · Oct 19, 2010

Gipsel said:
There are quite a few ASICs enumerated in the driver. First (after the known ones) there are three with 5D shaders and only single precision (first generation Fusion, a.k.a. Wrestler, Sumo and Trinity?). After that come two ASICs with the 4-slot design (which are both double precision capable, Cayman? + ???) and after that it goes back to the 5-slot layout (lower end GPUs?). At least the last two generations got enumerated from top to bottom.

Yeah, but Barts is supposed to have the new arch.

thatdude90210 · Oct 19, 2010

Gotta say, the 68xx reference cards have a nice look to them.

http://www.legitreviews.com/article/1444/1/

Edit: Anandtech have a few different pics and some stuff about fusion.
http://www.anandtech.com/show/3980/amds-radeon-hd-6800-series-llano-fusion-apu-a-story-in-pictures

chavvdarrr · Oct 19, 2010

rpg.314 said:
Charlie was right about the 4 symmetric alu rumour.

Barts is not vliw-4, just very minor tweaks to Juniper/cypress base.
Looks like 6xxx generation may become a mess of 3 different "micro-"architectures (vliw4, vliw5 gen2 and vliw5 if 5700 is renamed to 6700)
And Cayman's changes are not very impressive imo.

trinibwoy · Oct 19, 2010

What are Cayman's changes?

hkultala · Oct 19, 2010

DavidGraham said:
Why the hell would that happen? I mean, isn't two slots enough for that? The only way I see that happening is if each slot performs 1/4 of the operation (8 bits)! (possibly utilizing the mantissa portion)?

(A+B) * (C+D) = A*C + A*D + B*C + B*D

When the multiplication is split into N parts, it needs N^2 partial multiplications.
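
A quick way to convince yourself is to count the partial products in a generic schoolbook split. This is just a sketch; the function name and the chunking scheme are my own choices for illustration.

```python
def split_multiply(a, b, parts, part_bits):
    """Multiply a * b by splitting each operand into `parts` chunks.

    Assumes a and b are non-negative and fit in parts * part_bits bits.
    Returns (product, number_of_partial_multiplications).
    """
    mask = (1 << part_bits) - 1
    a_chunks = [(a >> (i * part_bits)) & mask for i in range(parts)]
    b_chunks = [(b >> (j * part_bits)) & mask for j in range(parts)]

    total = 0
    multiplies = 0
    for i, a_chunk in enumerate(a_chunks):
        for j, b_chunk in enumerate(b_chunks):
            # Every chunk of a meets every chunk of b exactly once.
            total += (a_chunk * b_chunk) << ((i + j) * part_bits)
            multiplies += 1
    return total, multiplies
```

For 32-bit operands, `split_multiply(a, b, 2, 16)` uses 4 partial multiplies, while `split_multiply(a, b, 4, 8)` uses 16, so splitting into 8-bit quarters would make things worse, not better.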

neliz · Oct 19, 2010

Barts SEP:

Barts Pro $179
Barts XT $239

Kaotik · Oct 19, 2010

chavvdarrr said:
Barts is not vliw-4, just very minor tweaks to Juniper/cypress base.
Looks like 6xxx generation may become a mess of 3 different "micro-"architectures (vliw4, vliw5 gen2 and vliw5 if 5700 is renamed to 6700)
And Cayman's changes are not very impressive imo.

Then what are you suggesting to be the 2nd VLIW4 ASIC?
Cayman is one, and this should of course include Antilles too as it's utilizing 2 Cayman chips

Jawed · Oct 19, 2010

Gipsel said:
There was a slight problem with it, I didn't get it to compile with the "def c0" in it. The error message wasn't telling too much (unsupported opcode without saying which one), so I started to rip out everything (and it worked after exchanging c0 with a literal).

That's pretty weird, I posted a simple pixel shader:

Code:

struct vertex { 
   float4 colorA : color0; 
   float4 colorB : color1; 
}; 
float4 main(vertex IN) : COLOR { 
 float4 A = IN.colorA.xyzw; 
 float4 B = IN.colorB.xyzw; 
 return normalize((log(A+B))/(exp(A*B)));                 
}

What was left is that (edit: I just see there got a bit other stuff pasted in, but outside of the loop, so it shouldn't matter):
[...]
Still okay?

Yeah, looks OK. The ADDs have disappeared from the reference compilation I have here, but that doesn't matter.

On Cypress the ISA looks like that:
[...]
On that future 4 slot VLIW architecture like that:
[...]
So the throughput for code with a lot of transcendentals suffers a bit, just as I was expecting in the discussion back then.

Yes, 19% slower in this case. It's not as bad as I think we were expecting back then for two reasons:

transcendentals only use 3 lanes, not 4. Definitely wasn't expecting that. Perhaps each lane has a "mini-T" in it, doing some common stuff ahead of the major MULs/ADDs. Or maybe it's not based on the Lagrange approximation any more?
there is no multi-cycle transcendental. Everything is done in one cycle (well, we don't know about SIN and COS yet, and for those operand normalisation already required extra cycles, at least some of the time).

Also, I wonder how the double-precision versions will work? One of my later theories on this subject is that OpenCL requires actual precision (not woolly graphics transcendentals), so there's a strong motivation for reasonably fast transcendentals of both single- and double-precision variety (not crazy-fast). Perhaps this would partly motivate the simplification, since these more-precise transcendentals require some kind of macro (it's unclear to me whether approximate transcendentals are useful as starting-points for the precise versions). Which would lead to a de-emphasis of the relative throughput of graphics transcendentals. Well, that plus the fact that math is so cheap in this architecture, so what's a few extra cycles?

Thanks for doing that, very interesting. I'm curious to see if it's based on the old Lagrange approach...

Alexko · Oct 19, 2010

neliz said:
Barts SEP:

Barts Pro $179
Barts XT $239

Not bad at all! But that still leaves room for the GTX 460 768MB to exist around the $150 mark. The GTX 470, however, is as good as dead.

Oh, and on that topic, I don't usually advertise for anything, but there's a very very good deal on newegg right now:

GTX 460 768MB for $136.78 after promo code + MIR + shipping.

http://www.newegg.com/Product/Product.aspx?Item=N82E16814500173

Jawed · Oct 19, 2010

Aha, hadn't realised you'd edited this post:

Gipsel said:
Anyone asking what happened to the rumored 4 slot VLIW units? Guess you will see some ISA code for that tonight

a RCP will use 3 slots, the same as a SQRT or a RSQ:

Code:

             x: RCP_sat     R2.x,  R1.y
             y: RCP_sat     ____,  R1.y
             z: RCP_sat     ____,  R1.y

a Sine is done by a succession of *(1/2PI) and another 3 slot instruction (why the hell is the rescaling with 2PI necessary?)

Because the fast-approximation technique requires the input to be in the range of -Pi/2 to +PI/2. This is referenced in some of the patents on this subject.

a 32bit integer MUL takes the full 4 slots:

Code:

             x: MULLO_INT   R2.x,  R1.x,  R1.x
             y: MULLO_INT   ____,  R1.x,  R1.x
             z: MULLO_INT   ____,  R1.x,  R1.x
             w: MULLO_INT   ____,  R1.x,  R1.x

Hmm, that's curious. I suppose it's treating each operand as two 16-bit blocks. Alternatively, it might only be using 3 lanes, but the fourth lane is locked out (a bit like the old, pre-Evergreen, DP3 locks out the fourth lane with 0 * 0).

a 24bit integer MUL takes a single slot as expected

Code:

3 x: MUL_INT24__NI R2.x, R1.x, R1.x

Interesting to see the NI tag there, not sure why it would be different from Evergreen's.

Oh, another thing. Can you try to set or reset the ASIC_ALU_REORDER flag on compilation?

chavvdarrr · Oct 19, 2010

Kaotik said:
Cayman is one, and this should of course include Antilles too as it's utilizing 2 Cayman chips

Why?
In terms of features 2x Cayman == Cayman, but that doesn't mean that in all situations Antilles will be treated same way.
I bet generated code is same for both vliw4 chips, ie they just split in order to keep it clean "this is cayman, this is 2x cayman".

Tchock · Oct 19, 2010

Slides are out:
http://www.chiphell.com/thread-130729-1-1.html (Login needed)

Key stuff?
MLAA through DirectCompute (DX9-10-11), 255mm^2, 1.7Bil Transistors, still 1 Tri/Clock, Tessellation improvements coming from thread management and buffering, ~1.5x-2x speed until ~20 Tessellation factor where it tends back to 1x.

oh and 4D!

Per farhan's request (pesky pesky)

Everything else: http://img192.imageshack.us/gal.php?g=170805bw2q2b1fwstjgbgt.jpg

Gipsel · Oct 19, 2010

rpg.314 said:
Yeah, but Barts is supposed to have the new arch.

You obviously know more than me.
Edit: Can someone post the slides Tchock linked to? Do we see wavefronts with 80 threads or does Barts also have a completely new SIMD/TMU layout?

Jawed said:
That's pretty weird, I posted a simple pixel shader

It doesn't compile for Cypress either. It's probably the way I generate the code. I may be missing some kind of preprocessing step; the syntax is stricter than what one is used to (it does compile in the SKA, but with that I don't get the ISA code for the new stuff).

Jawed said:
Also, I wonder how the double-precision versions will work?

Double precision will still work with software libraries for complex operations. At least there are no new instructions for that.

Jawed said:
Because the fast-approximation technique requires the input to be in the range of -Pi/2 to +PI/2. This is referenced in some of the patents on this subject.

But it is not doing a renormalization to be between -pi/2 and +pi/2, it simply divides by 2Pi. That's something one should be able to incorporate into the lookup tables, isn't it?
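
To make clear what I mean, here is a rough Python sketch of the two-step sequence. The `sin_normalized` function is only a hypothetical stand-in for the 3-slot instruction, assuming it takes its input in full revolutions and keeps only the fractional part; that is my assumption, not something the ISA dump confirms.

```python
import math

INV_TWO_PI = 1.0 / (2.0 * math.pi)

def sin_normalized(turns):
    """Hypothetical stand-in for a SIN op whose input is in revolutions."""
    frac = turns - math.floor(turns)       # keep the fractional turn, [0, 1)
    return math.sin(frac * 2.0 * math.pi)  # library sin used as the reference

def shader_sin(x_radians):
    """The sequence seen in the ISA: a MUL by 1/(2*pi), then the SIN op."""
    return sin_normalized(x_radians * INV_TWO_PI)
```

The scaling is a single constant multiply, not a reduction to a quarter period, which is why it looks like something that could be folded into the tables.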

Jawed said:
Interesting to see the NI tag there, not sure why it would be different from Evergreen's.

Evergreen does not have signed 24bit integer multiplies (just unsigned)

Jawed said:
Oh, another thing. Can you try to set or reset the ASIC_ALU_REORDER flag on compilation?

As far as I can see it is not set. I'm at work now. I may have a look into it tonight (but time may be constrained).
