AMD: R9xx Speculation

Why the hell would that happen? I mean, aren't two slots enough for that? The only way I see that happening is if each slot performs 1/4 of the operation (8 bits), possibly utilizing the mantissa portion?
Such a 32-bit integer multiplication can be constructed from four 16-bit integer multiplies and a series of adds. I hoped 3 would be enough, or that it would at least be possible to get the full 64-bit result with one VLIW instruction group (which should be possible if the adders are fast and wide enough). But it needs two VLIW bundles to get that:
Code:
          4  x: MULLO_INT   R3.x,  R2.x,  R2.x      
             y: MULLO_INT   ____,  R2.x,  R2.x      
             z: MULLO_INT   ____,  R2.x,  R2.x      
             w: MULLO_INT   ____,  R2.x,  R2.x      
          5  x: MULHI_INT   R4.x,  R2.x,  R2.x      
             y: MULHI_INT   ____,  R2.x,  R2.x      
             z: MULHI_INT   ____,  R2.x,  R2.x      
             w: MULHI_INT   ____,  R2.x,  R2.x
It's quite late here. I would call it a day now :smile:
 
So all transcendentals are using x,y,z? No special tricks w lane can do?
Yep. I thought w was the new t, as I had seen it can't execute all instructions (just like t). But after all it looks like w simply can't execute the transcendentals :idea:
Who handles the float to int conversion stuff?
Let me check.
Edit:
Integer to float and float to integer conversions are handled by all 4 slots. Not together; each slot does one conversion, so the throughput is 4 conversions per cycle max. That's quite a bit faster than Evergreen (conversions only in the t unit).
Code:
          4  x: F_TO_I      R2.x,  R1.x      
             y: F_TO_I      R2.y,  R1.y      
             z: F_TO_I      R2.z,  R1.z      
             w: F_TO_I      R2.w,  R1.w
Works with rounding, too:
Code:
          4  x: RNDNE       R2.x,  R1.x      
             y: RNDNE       R2.y,  R1.y      
             z: RNDNE       R2.z,  R1.z      
             w: RNDNE       R2.w,  R1.w
And four float32 to float16 conversions (f2f16) compile to:
Code:
          4  x: FLT32_TO_FLT16_RTZ__NI  R2.x,  R1.x      
             y: FLT32_TO_FLT16_RTZ__NI  R2.y,  R1.y      
             z: FLT32_TO_FLT16_RTZ__NI  R2.z,  R1.z      
             w: FLT32_TO_FLT16_RTZ__NI  R2.w,  R1.w
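Judging by the mnemonic, the RTZ suffix should mean round-toward-zero, i.e. the extra mantissa bits are simply truncated. A minimal sketch of what such a conversion does, for normal-range values only (denormals, Inf and NaN are out of scope here, and this is an inference from the instruction name, not the documented hardware behaviour):

```python
import struct

def f32_to_f16_rtz(x):
    """Sketch of a float32 -> float16 conversion with round-toward-zero
    (truncation of the 13 extra mantissa bits). Normals only."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = (bits >> 31) & 1
    exp = (bits >> 23) & 0xFF
    mant = bits & 0x7FFFFF
    e16 = exp - 127 + 15          # rebias the exponent for half precision
    if e16 <= 0 or e16 >= 31:     # under/overflow not handled in this sketch
        raise ValueError("outside the normal half-precision range")
    return (sign << 15) | (e16 << 10) | (mant >> 13)  # truncate mantissa
```

For example, `f32_to_f16_rtz(1.0)` gives `0x3C00`, the half-precision encoding of 1.0.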
 
Yep. I thought w was the new t, as I had seen it can't execute all instructions (just like t). But after all it looks like w simply can't execute the transcendentals :idea:

Let me check.
Edit:
Integer to float and float to integer conversions are handled by all 4 slots. Not together; each slot does one conversion, so the throughput is 4 conversions per cycle max. That's quite a bit faster than Evergreen (conversions only in the t unit).
Ok. I was actually not sure that Evergreen couldn't do that already. The R700 ISA doc has a nice list showing which instructions can be executed on the different units; the Evergreen one doesn't, and sometimes these restrictions aren't mentioned in the detailed instruction descriptions either.
 
1120 is not divisible by 64, so that goes against whatever Charlie said.
There are quite a few ASICs enumerated in the driver. First (after the known ones) there are three with 5D shaders and only single precision (first-generation Fusion, a.k.a. Wrestler, Sumo and Trinity?). After that come two ASICs with the 4-slot design (both double precision capable; Cayman? + ???), and after that it goes back to the 5-slot layout (lower-end GPUs?). At least the last two generations were enumerated from top to bottom.
:sleep:
 
There are quite a few ASICs enumerated in the driver. First (after the known ones) there are three with 5D shaders and only single precision (first-generation Fusion, a.k.a. Wrestler, Sumo and Trinity?). After that come two ASICs with the 4-slot design (both double precision capable; Cayman? + ???), and after that it goes back to the 5-slot layout (lower-end GPUs?). At least the last two generations were enumerated from top to bottom.
:sleep:

Yeah, but Barts is supposed to have the new arch.
 
Charlie was right about the 4 symmetric ALU rumour.
Barts is not VLIW4, just very minor tweaks to the Juniper/Cypress base.
Looks like the 6xxx generation may become a mess of 3 different "micro-"architectures (VLIW4, VLIW5 gen2, and VLIW5 if the 5700 is renamed to 6700).
And Cayman's changes are not very impressive, imo.
 
DavidGraham said:
Why the hell would that happen? I mean, aren't two slots enough for that? The only way I see that happening is if each slot performs 1/4 of the operation (8 bits), possibly utilizing the mantissa portion?

(A+B) * (C+D) = A*C + A*D + B*C + B*D

When the multiplication is split into N parts, it needs N^2 partial multiplications.
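The identity above, with each 32-bit operand split into two 16-bit halves (N = 2, so 4 partial products), can be sketched numerically. This is an illustration of the arithmetic, not the actual hardware datapath; the MULLO/MULHI split at the end mirrors the two instructions seen in the dump:

```python
MASK16 = 0xFFFF

def mul32_from_16(a, b):
    """Build a full 32x32 -> 64-bit product from four 16x16 -> 32-bit
    partial products, one per VLIW slot in the speculation above."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    # (A_hi*2^16 + A_lo) * (B_hi*2^16 + B_lo) expands to 4 partial products:
    p0 = a_lo * b_lo              # weight 2^0
    p1 = a_lo * b_hi              # weight 2^16
    p2 = a_hi * b_lo              # weight 2^16
    p3 = a_hi * b_hi              # weight 2^32
    full = p0 + ((p1 + p2) << 16) + (p3 << 32)
    # low and high 32-bit halves, i.e. what MULLO_INT / MULHI_INT return
    return full & 0xFFFFFFFF, (full >> 32) & 0xFFFFFFFF
```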
 
Barts is not VLIW4, just very minor tweaks to the Juniper/Cypress base.
Looks like the 6xxx generation may become a mess of 3 different "micro-"architectures (VLIW4, VLIW5 gen2, and VLIW5 if the 5700 is renamed to 6700).
And Cayman's changes are not very impressive, imo.

Then what are you suggesting to be the 2nd VLIW4 ASIC?
Cayman is one, and this should of course include Antilles too, as it utilizes 2 Cayman chips.
 
There was a slight problem with it: I didn't get it to compile with the "def c0" in it. The error message wasn't telling me much (unsupported opcode, without saying which one :rolleyes:), so I started to rip out everything (and it worked after exchanging c0 with a literal).
That's pretty weird, I posted a simple pixel shader:

Code:
struct vertex { 
   float4 colorA : color0; 
   float4 colorB : color1; 
}; 
float4 main(vertex IN) : COLOR { 
 float4 A = IN.colorA.xyzw; 
 float4 B = IN.colorB.xyzw; 
 return normalize((log(A+B))/(exp(A*B)));                 
}

What was left is this (edit: I just saw that some other stuff got pasted in, but outside of the loop, so it shouldn't matter):
[...]
Still okay?
Yeah, looks OK. The ADDs have disappeared from the reference compilation I have here, but that doesn't matter.

On Cypress the ISA looks like that:
[...]
On that future 4 slot VLIW architecture like that:
[...]
So the throughput for code with a lot of transcendentals suffers a bit, just as I was expecting in the discussion back then.


Yes, 19% slower in this case. It's not as bad as I think we were expecting back then for two reasons:
  1. transcendentals only use 3 lanes, not 4. Definitely wasn't expecting that. Perhaps each lane has a "mini-T" in it, doing some common stuff ahead of the major MULs/ADDs. Or maybe it's not based on the Lagrange approximation any more?
  2. there is no multi-cycle transcendental. Everything is done in one cycle (well, we don't know about SIN and COS yet, and for those operand normalisation already required extra cycles, at least some of the time).
Also, I wonder how the double-precision versions will work? One of my later theories on this subject is that OpenCL requires actual precision (not woolly graphics transcendentals), so there's a strong motivation for reasonably fast transcendentals of both single- and double-precision variety (not crazy-fast). Perhaps this would partly motivate the simplification, since these more-precise transcendentals require some kind of macro (it's unclear to me whether approximate transcendentals are useful as starting-points for the precise versions). Which would lead to a de-emphasis of the relative throughput of graphics transcendentals. Well, that plus the fact that math is so cheap in this architecture, so what's a few extra cycles?
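For reference, the table-plus-quadratic (Lagrange interpolation) style of approximation being speculated about can be illustrated with a toy model. This is purely an illustration of the mathematical technique: real hardware would store precomputed polynomial coefficients per interval rather than interpolating on the fly, and the table size and sample placement here are arbitrary assumptions:

```python
def rcp_approx(x, index_bits=6):
    """Toy piecewise-quadratic (Lagrange) approximation of 1/x on [1, 2),
    the mantissa range after exponent handling. Three exact samples per
    interval define the quadratic; hardware would bake these into tables."""
    assert 1.0 <= x < 2.0
    n = 1 << index_bits
    i = int((x - 1.0) * n)              # table index from top mantissa bits
    x0 = 1.0 + i / n                    # start of the selected interval
    h = 1.0 / (2 * n)                   # half the interval width
    xs = (x0, x0 + h, x0 + 2 * h)       # three sample points per interval
    ys = tuple(1.0 / s for s in xs)     # exact function values at samples
    # Lagrange form: sum over j of y_j * prod_{k != j} (x - x_k)/(x_j - x_k)
    r = 0.0
    for j in range(3):
        term = ys[j]
        for k in range(3):
            if k != j:
                term *= (x - xs[k]) / (xs[j] - xs[k])
        r += term
    return r
```

With 64 intervals the quadratic already lands within a few units of single-precision ULP, which is roughly the accuracy class graphics transcendentals target.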

Thanks for doing that, very interesting. I'm curious to see if it's based on the old Lagrange approach...
 
Aha, hadn't realised you'd edited this post:
Anyone asking what happened to the rumored 4 slot VLIW units? Guess you will see some ISA code for that tonight :LOL:

An RCP uses 3 slots, the same as a SQRT or an RSQ:
Code:
x: RCP_sat     R2.x,  R1.y      
y: RCP_sat     ____,  R1.y      
z: RCP_sat     ____,  R1.y
A sine is done by a multiply by (1/2PI) followed by another 3-slot instruction (why the hell is the rescaling by 2PI necessary?)
Because the fast-approximation technique requires the input to be in the range -Pi/2 to +Pi/2. This is referenced in some of the patents on this subject.

a 32bit integer MUL takes the full 4 slots:
Code:
x: MULLO_INT   R2.x,  R1.x,  R1.x      
y: MULLO_INT   ____,  R1.x,  R1.x      
z: MULLO_INT   ____,  R1.x,  R1.x      
w: MULLO_INT   ____,  R1.x,  R1.x
Hmm, that's curious. I suppose it's treating each operand as two 16-bit halves. Alternatively, it might only be using 3 lanes, with the fourth lane locked out (a bit like the old, pre-Evergreen DP3, which locks out the fourth lane with 0 * 0).

A 24-bit integer MUL takes a single slot, as expected:
Code:
          3  x: MUL_INT24__NI  R2.x,  R1.x,  R1.x
Interesting to see the NI tag there, not sure why it would be different from Evergreen's.

Oh, another thing. Can you try to set or reset the ASIC_ALU_REORDER flag on compilation?
 
Cayman is one, and this should of course include Antilles too, as it utilizes 2 Cayman chips.
Why?
In terms of features, 2x Cayman == Cayman, but that doesn't mean Antilles will be treated the same way in all situations.
I bet the generated code is the same for both VLIW4 chips, i.e. they just split the entries in order to keep it clean: "this is Cayman, this is 2x Cayman".
 
Slides are out:
http://www.chiphell.com/thread-130729-1-1.html (Login needed)

Key stuff?
MLAA through DirectCompute (DX9-10-11), 255mm^2, 1.7 billion transistors, still 1 tri/clock, tessellation improvements coming from thread management and buffering, ~1.5x-2x speed until roughly tessellation factor 20, where it tends back toward 1x.

oh and 4D! :LOL:

Per farhan's request (pesky pesky)
[slide image: 170436687h6rke7o96z9h6.jpg]

Everything else: http://img192.imageshack.us/gal.php?g=170805bw2q2b1fwstjgbgt.jpg
 
Yeah, but Barts is supposed to have the new arch.
You obviously know more than me.
Edit: Can someone post the slides Tchock linked to? Do we see wavefronts with 80 threads, or does Barts also have a completely new SIMD/TMU layout?

That's pretty weird, I posted a simple pixel shader
It doesn't compile for Cypress either. It's probably the way I generate the code; I may be missing some kind of preprocessing step, as the syntax is stricter than what one is used to (it does compile in the SKA, but with that I don't get the ISA code for the new stuff ;)).
Also, I wonder how the double-precision versions will work?
Double precision will still work with software libraries for the complex operations. At least there are no new instructions for that.
Because the fast-approximation technique requires the input to be in the range of -Pi/2 to +PI/2. This is referenced in some of the patents on this subject.
But it is not doing a renormalization to between -Pi/2 and +Pi/2; it simply divides by 2Pi. That's something one should be able to incorporate into the lookup tables, isn't it?
Interesting to see the NI tag there, not sure why it would be different from Evergreen's.
Evergreen does not have signed 24-bit integer multiplies (just unsigned) ;)
Oh, another thing. Can you try to set or reset the ASIC_ALU_REORDER flag on compilation?
As far as I can see, it is not set. I'm at work now; I may have a look into it tonight (but time may be constrained).
 