25-Mar-2008, 02:23  #101 
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 343

I should also note that doubles are *not* done in the "t" unit; they are instead done in the XYZW units in a "fused" manner. Thus, you can execute a double precision operation in XYZW alongside a 32-bit operation in the t unit. Doubles are handled at 1/4 rate in 4/5ths of the units, so double precision peak is 1/5 of single precision peak. In practice, however, the difference can be better than 1/5th depending on the scheduling of your 32-bit ops, or worse under the bandwidth/latency increases from reading/writing wider data.
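For concreteness, the rate arithmetic above can be written out as a toy calculation (the per-clock model and lane names are just my reading of Mike's description, not vendor documentation):

```python
# Toy model of the issue rates described above, per VLIW unit per clock.
LANES = 5                  # X, Y, Z, W and the transcendental "t" lane

sp_mads_per_clock = LANES  # every lane can retire one single-precision MAD
dp_mads_per_clock = 1      # XYZW fuse to retire one double-precision MAD,
                           # leaving the t lane free for a 32-bit op

print(dp_mads_per_clock / sp_mads_per_clock)  # 0.2  -> the 1/5 peak ratio
print(dp_mads_per_clock / 4)                  # 0.25 -> "1/4 rate in 4/5ths of the units"
```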

25-Mar-2008, 02:30  #102 
Regular

Thanks for the clarification, Mike. Your explanation seems to imply that there's no "intrinsic instruction" support for double precision transcendentals; presumably they're only available via some kind of macro?
Jawed 
25-Mar-2008, 02:37  #103  
Member
Join Date: May 2005
Location: in the shade
Posts: 152

Quote:
__________________
[03:44] <thefarhan> i have exactly 128 friends right now :D [03:45] <Jollemi> you have to teach them to remember 1MB worth of data, and see if you can run Windows 9x or Linux 

25-Mar-2008, 03:44  #104 
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 343

That is the secret sauce. If you play with the IL/ISA you will gain a little information on how things are done, but how the actual arch works I won't be able to tell you.
I just wanted to explain things for when people start playing with CAL and disassembling things, so they aren't confused when the doubles aren't happening in T. As for transcendentals, sin/cos and the like, they are not native in double. I'm not actually sure if we currently ship a full set of double transcendentals; I'll have to check. 
25-Mar-2008, 13:29  #105  
Senior Member
Join Date: Oct 2006
Location: Germany
Posts: 1,020

Quote:
http://www.computerbase.de/news/trei...md_firestream/ But why did Giuseppe Amato, your Technical Director, say during CeBIT that it's up to 350 GFLOP/s?
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke Eta Kooram Nah Smech! Find Chuck Norris. 

25-Mar-2008, 18:35  #106 
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 343

Giuseppe either spoke incorrectly or was misquoted. We do have mixed-mode double+single precision applications that get 350 GFlops, but the upper limit for raw double performance is 1/5th the single rate, as I stated above. You can grab CAL/Brook+ and a 3850/3870 and play with this yourself.
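To put rough numbers on that (the 320-ALU, 775 MHz figures are the commonly quoted RV670/HD 3870 specs — my assumption, not anything stated in this thread):

```python
alus = 320        # RV670 stream processing units (assumed)
clock_mhz = 775   # HD 3870 core clock (assumed)

sp_peak_gflops = alus * 2 * clock_mhz / 1000  # MAD = 2 flops per lane per clock
dp_peak_gflops = sp_peak_gflops / 5           # the 1/5 rate stated above

print(sp_peak_gflops)  # 496.0
print(dp_peak_gflops)  # 99.2
```

On those figures, 350 GFLOP/s of raw double precision is out of reach by more than 3x, which fits the mixed single+double explanation.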

26-Mar-2008, 02:46  #107 
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 343

I should also note that 1/5 is the peak DP rate vs. SP rate when doing MADs. There are cases, like DP add, in which we can do better, 2/5th rate, but raw peak DP flops is 1/5 of SP flops. Another architect reminded me of the DP add performance compared to SP add performance this afternoon.
And, despite what others may claim, we have actual hardware for doing this, it's not emulated... 
26-Mar-2008, 04:59  #108  
Member
Join Date: May 2005
Location: in the shade
Posts: 152

Quote:
That's faster than I expected... Now I'm really curious to know how much extra hardware was used to achieve this (my guess is wider internal adders at least, for a 1/5-speed MUL, but I can't figure out how to fit the ADD in at the same time...). Is DP MUL also 1/5 speed?

26-Mar-2008, 05:42  #109 
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 343

Let the speculation begin. Since we have exposed all the way down to the metal, at least for the ISA, those willing to spend some time can figure out how a lot of stuff is done and what the performance of every instruction is, either ISA or IL.

26-Mar-2008, 14:34  #110 
Regular

I've just downloaded the SDK so will have a play... I like the detail I'm seeing
I want to make a guess first. When doing a double-precision MUL on A and B, the pipeline first splats the operands across the SIMD as follows:

X: A.hi, B.hi
Y: A.hi, B.lo
Z: A.lo, B.hi
W: A.lo, B.lo

Each lane would need to be able to do ~27-bit multiplication. The partial products can then be added at the bottom of the pipe, which needs an adder that can sum all four lanes, having first bit-shifted for correct alignment in the result. This would be done as (X+Y)+(Z+W).

A DP ADD would only need to occupy 2 lanes, as the cumulative ~54 bits is enough for an ADD, so the X+Y lanes and Z+W lanes work as two independent operations:

X: A.hi, B.hi
Y: A.lo, B.lo
Z: C.hi, D.hi
W: C.lo, D.lo

Each pair of lanes then uses a bottom-of-pipe accumulator to sort out the carry from lo to hi.

So, I'm proposing operand-splatting, widened lanes and two add stages.
Jawed 
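Jawed's splat scheme is easy to sanity-check on integer significands. This is a sketch of the arithmetic only — exponents, rounding and the real hardware layout are out of scope, and `splat_mul` is my own toy name:

```python
HALF = 27
MASK = (1 << HALF) - 1

def splat_mul(a, b):
    """Multiply two <=54-bit integers via four 27-bit lane products."""
    a_hi, a_lo = a >> HALF, a & MASK
    b_hi, b_lo = b >> HALF, b & MASK
    x = a_hi * b_hi  # lane X
    y = a_hi * b_lo  # lane Y
    z = a_lo * b_hi  # lane Z
    w = a_lo * b_lo  # lane W
    # Bottom-of-pipe add with alignment shifts, grouped as (X+Y)+(Z+W).
    xy = (x << HALF) + y
    zw = (z << HALF) + w
    return (xy << HALF) + zw

a, b = (1 << 53) - 1, (1 << 53) - 12345
assert splat_mul(a, b) == a * b  # matches the full-width product
```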
26-Mar-2008, 17:15  #111  
Member
Join Date: May 2005
Location: in the shade
Posts: 152

Quote:
However, regarding MULs, I think what you proposed is a little hard for me to buy. The complete result for this accumulate would be 108 bits, of which you want to keep the top 54 or so (I don't know if the rounding is IEEE compliant, probably not). To do this in one cycle, that accumulator will be pretty massive; I'm not sure they would spend that much extra hardware on it... What are the instruction latencies of the ADD and the MUL? Can you issue 2 DP ADDs or one DP MUL per cycle to the XYZW? My idea of it was more like 1 DP ADD over 2 cycles and 1 DP MUL over 4 cycles per X,Y,Z,W. That would require that each pipeline have a 54-bit adder and some extra exponent logic, but none of the adds would be wider than 54 bits.
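Farhan's width figure can be pinned down quickly: the full product of two 53-bit significands is at most 106 bits (the ~108 presumably allows a couple of guard/carry bits), and only the top 53 survive into the rounded double. A quick check, nothing hardware-specific:

```python
max_sig = (1 << 53) - 1                  # largest double significand
full = max_sig * max_sig
print(full.bit_length())                 # 106

kept = full >> (full.bit_length() - 53)  # the top 53 bits; the rest only
print(kept.bit_length())                 # 53   affect the round/sticky decision
```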
Last edited by Farhan; 26-Mar-2008 at 17:40. 

26-Mar-2008, 19:34  #112  
Regular

I'm in way over my head, so I'm really just nudging/guessing...
I'm guessing that by doing partial sums of the partial products you can control the precision, using the correct bit-shifts between stages.
I'm going to try to play with IL and GPUSA soon... I'm a bit bugged by GPUSA's focus on R600, not RV670; fingers crossed. Jawed 

26-Mar-2008, 23:39  #113  
Member
Join Date: May 2005
Location: in the shade
Posts: 152

Quote:
Code:
Whi Wlo
Zhi Zlo
Yhi Ylo
Xhi Xlo

See, my idea to minimize compute hardware was to just use 1 pipeline over 4 cycles (meaning each one would work on an independent MUL), and have a 54-bit adder for the accumulate stage. So basically what you do is (((Alo*Blo + Alo*Bhi) + Ahi*Blo) + Ahi*Bhi); in this case you never have to do any >54-bit adds, and you can start using the lower-order bits to figure out the rounding already (you don't have to keep all the bits). I was guessing that for the DP ADD they would just be register-fetch limited, so it would need 2 cycles. However, so would the MUL in this case, so it would probably not work in 4 cycles unless the inputs were wider. Maybe it's something in between. Maybe 2 pipelines fuse over 2 cycles for the MUL and 1 cycle for the ADD. I think that would make more sense than having 4 pipelines combine for the MUL and 2 for the ADD, and it would have enough input bandwidth to have all the operands in one cycle.
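Farhan's one-pipeline-over-four-cycles accumulation can also be sanity-checked with integers. Note this sketch keeps full width so the comparison is exact, whereas his scheme retires low-order bits into rounding state as it goes (`sequential_mul` is my own toy name):

```python
HALF = 27

def sequential_mul(a, b):
    """Accumulate the four partial products one per 'cycle'."""
    mask = (1 << HALF) - 1
    a_hi, a_lo = a >> HALF, a & mask
    b_hi, b_lo = b >> HALF, b & mask
    acc = a_lo * b_lo                   # cycle 1
    acc += (a_lo * b_hi) << HALF        # cycle 2
    acc += (a_hi * b_lo) << HALF        # cycle 3
    acc += (a_hi * b_hi) << (2 * HALF)  # cycle 4
    return acc

a, b = (1 << 53) - 1, 0x123456789ABCD
assert sequential_mul(a, b) == a * b
```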

27-Mar-2008, 02:06  #114 
Regular

Sorry, I didn't explain it very well: I meant the final add to be a separate, pipelined operation, because this final add has to align its operands based on their exponents.
Maybe this will help: Code:
Bl*Al      wWW
Bh*Al     zZZ
          Z+W
        =====
        zZwWW   partial sum 1
        =====
Bl*Ah      yYY
Bh*Ah     xXX
          X+Y
        =====
        xXyYY   partial sum 2
        =====
p1        zZwWW
p2     +xXyYY
      ========
      xXyZwWW
      ========

When p1 and p2 are added, the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27. So if p2 is 54 significant bits, then only as far as the carry, "w", can be used; WW is dropped. So the final add is always within the precision of a 54-bit adder.

What I'm wondering, now, is whether the multipliers in each of the four lanes have final-stage adders that can be joined in the pairings I've described. In other words, p1 can be generated as lanes Z and W do the final addition for each of their own 27-bit multiplies; p1 doesn't require an additional adder stage after Z and W have generated their multiplier results. Same for p2. If that's the case then I haven't added any stages to the pipeline, because by default the pipeline performs MAD, with the final add being p1 + p2.

I've been playing with GPUSA; this is a*b in Brook+:
Code:
x: MUL_64 R0.x, R0.y, R1.y
y: MUL_64 R0.y, R0.y, R1.y
z: MUL_64 ____, R0.y, R1.y
w: MUL_64 ____, R0.x, R1.x
Code:
35 x: FREXP_64 ____, R0.y
   y: FREXP_64 R123.y, R0.x
   z: FREXP_64 R123.z, R0.x
   w: FREXP_64 R123.w, R0.x
36 x: FLT64_TO_FLT32 R123.x, PV(35).w
   y: FLT64_TO_FLT32 ____, PV(35).z
   z: SUB_INT R126.z, 0.0f, PV(35).y
37 x: MUL_64 R127.x, R2.y, R0.y
   y: MUL_64 R127.y, R2.y, R0.y
   z: MUL_64 ____, R2.y, R0.y
   w: MUL_64 ____, R2.x, R0.x
   t: RECIP_IEEE R122.w, PV(36).x
38 z: FLT32_TO_FLT64 R123.z, PS(37).x
   w: FLT32_TO_FLT64 R123.w, R0.x
39 x: LDEXP_64 R126.x, PV(38).w, R126.z
   y: LDEXP_64 R126.y, PV(38).z, R126.z
40 x: MULADD_64 R123.x, R127.y, PV(39).y, R3.y
   y: MULADD_64 R123.y, R127.y, PV(39).y, R3.y
   z: MULADD_64 R4.z, R127.y, PV(39).y, R3.y
   w: MULADD_64 R4.w, R127.x, PV(39).x, R3.x
41 x: MULADD_64 R126.x, PV(40).y, R126.y, R126.y
   y: MULADD_64 R126.y, PV(40).y, R126.y, R126.y
   z: MULADD_64 R4.z, PV(40).y, R126.y, R126.y
   w: MULADD_64 R4.w, PV(40).x, R126.x, R126.x
42 x: MULADD_64 R123.x, R127.y, PV(41).y, R3.y
   y: MULADD_64 R123.y, R127.y, PV(41).y, R3.y
   z: MULADD_64 R4.z, R127.y, PV(41).y, R3.y
   w: MULADD_64 R4.w, R127.x, PV(41).x, R3.x
43 x: MULADD_64 R126.x, PV(42).y, R126.y, R126.y
   y: MULADD_64 R126.y, PV(42).y, R126.y, R126.y
   z: MULADD_64 R4.z, PV(42).y, R126.y, R126.y
   w: MULADD_64 R4.w, PV(42).x, R126.x, R126.x
44 x: MUL_64 R125.x, R1.y, PV(43).y
   y: MUL_64 R125.y, R1.y, PV(43).y
   z: MUL_64 ____, R1.y, PV(43).y
   w: MUL_64 ____, R1.x, PV(43).x
45 x: MULADD_64 R123.x, R127.y, PV(44).y, R1.y
   y: MULADD_64 R123.y, R127.y, PV(44).y, R1.y
   z: MULADD_64 R4.z, R127.y, PV(44).y, R1.y
   w: MULADD_64 R4.w, R127.x, PV(44).x, R1.x
46 x: MULADD_64 R1.x, PV(45).y, R126.y, R125.y
   y: MULADD_64 R1.y, PV(45).y, R126.y, R125.y
   z: MULADD_64 R4.z, PV(45).y, R126.y, R125.y
   w: MULADD_64 R4.w, PV(45).x, R126.x, R125.x
Jawed 
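For what it's worth, the longer listing reads to me like a Newton-Raphson reciprocal scheme: FREXP/LDEXP scale the operands, RECIP_IEEE in the t unit produces a single-precision seed, and the paired MULADD_64s refine it as x = x + x*(1 - b*x). A Python sketch of that iteration (my interpretation of the disassembly, not a confirmed decode; the functions are toys and omit the exponent scaling and correct rounding):

```python
import struct

def sp_seed(b):
    # Stand-in for RECIP_IEEE: a reciprocal estimate rounded to single precision.
    return struct.unpack('f', struct.pack('f', 1.0 / b))[0]

def nr_div(a, b, iterations=2):
    x = sp_seed(b)                # ~24 good bits to start
    for _ in range(iterations):   # each pass roughly doubles the accuracy
        e = 1.0 - b * x           # residual: one MULADD_64 in the listing
        x = x + x * e             # refinement: another MULADD_64
    return a * x                  # final multiply (no rounding correction here)

print(nr_div(355.0, 113.0))       # ~3.14159292
```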
27-Mar-2008, 02:14  #115  
Regular

Quote:
I should also post a+b: Code:
x: ADD_64 R0.x, R0.y, R1.y
y: ADD_64 R0.y, R0.x, R1.x
Jawed 

27-Mar-2008, 02:42  #116  
Member
Join Date: May 2005
Location: in the shade
Posts: 152

Quote:
Quote:
Edit: yeah, it does look like it's single-cycle for the MUL, using all 4 pipelines fused. Most interesting. Now I'm dying to know the internal structure. I wonder if there are more stages when doing DP...
Last edited by Farhan; 27-Mar-2008 at 02:56. 

27-Mar-2008, 03:53  #117  
Regular

Quote:
Quote:
If anyone works this out, do you think Mike will say? Jawed 

27-Mar-2008, 14:20  #118 
Regular

This page nicely explains the denormalisation of the smaller operand that's required to produce equivalent exponents in the two operands:
http://www.cs.umd.edu/class/sum2003/.../addFloat.html Jawed 
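The alignment step that page describes is small enough to sketch directly on (significand, exponent) pairs, where a value is sig * 2**exp. This is a toy that ignores signs and rounding, and `fp_add` is my own name:

```python
def fp_add(sig_a, exp_a, sig_b, exp_b, width=54):
    """Add two values of the form sig * 2**exp; no signs, no rounding."""
    # Make A the operand with the larger exponent.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    # Denormalise B: shift its significand right by the exponent difference.
    sig = sig_a + (sig_b >> (exp_a - exp_b))
    # Renormalise if the sum carried out of the adder width.
    if sig >> width:
        sig, exp_a = sig >> 1, exp_a + 1
    return sig, exp_a

# (8 * 2^2) + (8 * 2^0) = 40 = 10 * 2^2
assert fp_add(8, 2, 8, 0) == (10, 2)
```

The bits shifted out of the smaller operand are simply lost here; real hardware folds them into guard/round/sticky state instead.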
27-Mar-2008, 17:15  #119  
Member
Join Date: May 2005
Location: in the shade
Posts: 152

Quote:

27-Mar-2008, 19:23  #120  
Regular

Quote:
You queried this addition earlier, saying it needs to be done at 54+27 bits of precision, but I hope I've shown that treating it as a normal floating-point add (denormalising: shifting one operand and modifying the exponent) allows it to be performed with only 54 bits (for a 53-bit final result).

27-Mar-2008, 19:33  #121 
Regular

Something different
Any chance that modifying/widening the DP4 paths will provide the requisite stages?
Jawed 
27-Mar-2008, 23:02  #122  
Member
Join Date: May 2005
Location: in the shade
Posts: 152

Quote:

28-Mar-2008, 03:58  #123  
Regular

Quote:
Code:
Blo 27 * Alo 27    w55
Bhi 53 * Alo 27   z81
            Z+W
          =====
           z82    partial sum 1
          =====
Blo 27 * Ahi 53    y81
Bhi 53 * Ahi 53   107
            X+Y
          =====
           108    partial sum 2
          =====
p1    z82
p2   +108
    ======
     109
    ======

Doing this, I think I've understood my mistake. When I said "the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27", that's wrong; it should be the difference in exponents, as there are always 54 significant bits in p2.

My suggestion is that the addition p1+p2 is done on the final adder in the pipeline (in lanes X and Y). This adder is required to perform a DADD instruction, so in this case it is also used for p1+p2. Since DADD has to support two 53-bit operands by being a 54-bit adder, the addition of p1+p2 (27 bits + 54 bits) requires no extra hardware dedicated to MUL.

So, what I'm thinking is that a conventional single-precision DP4 needs to perform a final ADD on 4 MULs, so the DP4 instruction requires a 4-operand adder. I'm wondering if this same adder can also support:
Does DP4 work like that, though? Jawed 

25-May-2008, 13:35  #124 
Member
Join Date: May 2007
Posts: 138

http://forums.amd.com/forum/messagev...&enterthread=y
AMD Stream SDK v1.1-beta Now Available For Download

The AMD Stream Team is pleased to announce the availability of AMD Stream SDK v1.1-beta! The installation files are available for immediate download from: FTP Download Site For AMD Stream SDK v1.1-beta. The AMD Stream Computing website will be updated in the next few days to reflect this new release.

With v1.1-beta comes:
- AMD FireStream 9170 support
- Linux support (RHEL 5.1 and SLES 10 SP1)
- Brook+ integer support
- Brook+ #line number support for easier .br file debugging
- Various bug fixes and runtime enhancements
- Preliminary Microsoft Visual Studio 2008 support

If you have any questions, please do not hesitate to post your question to the forum.

Sincerely,
AMD Stream Team 
07-Jun-2008, 15:22  #125  
Junior Member
Join Date: Aug 2007
Location: Houston, Texas
Posts: 79

Quote:

