#101 |
|
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 342
|
I should also note that doubles are *not* done in the "t" unit; they are instead done in the XYZW units in a "fused" manner. Thus, you can execute a double-precision operation in XYZW alongside a 32-bit operation in the t unit. Doubles are therefore handled at 1/4 rate in 4/5ths of the units, so double-precision peak is 1/5 of single-precision peak. In practice, however, the difference can be better than 1/5th depending on the scheduling of your 32-bit ops, or worse under the bandwidth/latency increases that come from reading/writing wider data.
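The 1/5 figure is just lane counting. A minimal sketch of that arithmetic, assuming a five-lane unit (X, Y, Z, W, t) where every lane issues one single-precision op per cycle; the function names and the 500 GFLOPS input are made up for illustration, not quoted specs:

```python
from fractions import Fraction


def dp_peak_fraction(dp_rate_in_fused_lanes=Fraction(1, 4),
                     fused_lanes=4, total_lanes=5):
    """Peak DP throughput as a fraction of peak SP throughput.

    dp_rate_in_fused_lanes: DP rate relative to SP rate inside the
    fused lanes (1/4 here: four lanes cooperate on one double).
    fused_lanes / total_lanes: share of the unit that can do doubles.
    """
    return dp_rate_in_fused_lanes * Fraction(fused_lanes, total_lanes)


def dp_peak_gflops(sp_peak_gflops, **kwargs):
    """Scale a single-precision peak figure down to a DP peak figure."""
    return float(sp_peak_gflops * dp_peak_fraction(**kwargs))


if __name__ == "__main__":
    print(dp_peak_fraction())      # 1/4 rate in 4/5 of the units
    print(dp_peak_gflops(500))     # 500 is a placeholder SP peak
```

This only models the peak; it says nothing about the scheduling and bandwidth effects mentioned above.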
#102 |
|
Regular
|
Thanks for the clarification, Mike. Your explanation seems to imply that there's no "intrinsic instruction" support for double precision transcendentals - presumably they're only available with some kind of macro?
Jawed |
#103 | |
|
Member
Join Date: May 2005
Location: in the shade
Posts: 152
|
Quote:
__________________
[03:44] <thefarhan> i have exactly 128 friends right now :D [03:45] <Jollemi> you have to teach them to remember 1MB worth of data, and see if you can run Windows 9x or Linux |
#104 |
|
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 342
|
That is the secret sauce.
I just wanted to explain things for when people start playing with CAL and disassembling things, so they don't get confused that the doubles aren't happening in T. As for transcendentals, sin/cos and the like, they are not native in double. I'm not actually sure if we currently ship a full set of double transcendentals; I'll have to check.
#105 | |
|
Senior Member
Join Date: Oct 2006
Location: Germany
Posts: 1,003
|
Quote:
http://www.computerbase.de/news/trei...md_firestream/ But why did Giuseppe Amato, your Technical Director, say during CeBIT that it's up to 350 GFLOP/s?
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke Eta Kooram Nah Smech! Find Chuck Norris. |
#106 |
|
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 342
|
Giuseppe either spoke incorrectly or was misquoted. We do have mixed mode double+single precision applications that get 350 GFLOPS, but the upper limit for raw double performance is 1/5th single rate as I stated above. You can grab CAL/Brook+ and a 3850/3870 and play with this for yourself.
#107 |
|
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 342
|
I should also note that 1/5 is peak DP rate vs. SP rate doing MADs. There are cases, like DP add, in which we can do better, 2/5th rate, but raw peak DP flops is 1/5 SP flops. Another architect reminded me of the DP add performance as compared to SP add performance this afternoon.
And, despite what others may claim, we have actual hardware for doing this, it's not emulated... |
#108 | |
|
Member
Join Date: May 2005
Location: in the shade
Posts: 152
|
Quote:
That's faster than i expected... Now i'm really curious to know how much extra hardware was used to achieve this (my guess is wider internal adders at least, for a 1/5 speed MUL, but i can't figure out how to get the ADD in all at the same time...). Is DP MUL also 1/5 speed?
#109 |
|
A little of this and that
Join Date: Oct 2005
Location: Cupertino
Posts: 342
|
Let the speculation begin.
#110 |
|
Regular
|
I've just downloaded the SDK so will have a play... I like the detail I'm seeing
I want to make a guess first. When doing a double-precision MUL on A and B, the pipeline first splats the operands across the SIMD as follows:
X - A.hi, B.hi
Y - A.hi, B.lo
Z - A.lo, B.hi
W - A.lo, B.lo
Each lane would need to be able to do ~27 bit multiplication. The partial products can then be added at the bottom of the pipe, which needs an adder that can sum all four lanes having first bit-shifted for correct alignment in the result. This would be done as (X+Y)+(Z+W).
A DP ADD would only need to occupy 2 lanes as the cumulative ~54 bits is enough for an ADD, so X+Y lanes and Z+W lanes as two independent operations:
X - A.hi, B.hi
Y - A.lo, B.lo
Z - C.hi, D.hi
W - C.lo, D.lo
Each pair of lanes then uses a bottom-of-pipe accumulator to sort out the carry from lo to hi.
So, I'm proposing operand-splatting, widened lanes and two add stages.
Jawed
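The splat-and-sum guess above can be checked on integer mantissas. This is only a sketch of the arithmetic, not a claim about the real datapath: the 27-bit split point is an assumption (the post only says "~27 bit"), the helper names are invented, and exact Python integers stand in for hardware that would truncate or round.

```python
import random

LO_BITS = 27  # assumed width of the low half


def split(m):
    """Split an integer mantissa into (hi, lo) halves at LO_BITS."""
    return m >> LO_BITS, m & ((1 << LO_BITS) - 1)


def fused_mul(a, b):
    """Multiply two mantissas from four lane-sized partial products."""
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    x = (a_hi * b_hi) << (2 * LO_BITS)  # lane X: A.hi, B.hi
    y = (a_hi * b_lo) << LO_BITS        # lane Y: A.hi, B.lo
    z = (a_lo * b_hi) << LO_BITS        # lane Z: A.lo, B.hi
    w = a_lo * b_lo                     # lane W: A.lo, B.lo
    return (x + y) + (z + w)            # summed as (X+Y)+(Z+W)


if __name__ == "__main__":
    random.seed(0)
    for _ in range(1000):
        # random 53-bit mantissas with the top (hidden) bit set
        a = random.getrandbits(53) | (1 << 52)
        b = random.getrandbits(53) | (1 << 52)
        assert fused_mul(a, b) == a * b
    print("four partial products reproduce the full product")
```

The check passes for any split point; what changes with the split is how wide each lane's multiplier has to be.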
#111 | |
|
Member
Join Date: May 2005
Location: in the shade
Posts: 152
|
Quote:
However, regarding MULs, i think what you proposed is a little hard for me to buy. The complete result for this accumulate would be 108bits, of which you want to keep the top 54 or so (i don't know if the rounding is IEEE compliant, probably not). To do this in one cycle, that accumulator will be pretty massive. Not sure if they would spend that much extra hardware on it... What are the instruction latencies of the ADD and the MUL? Can you issue 2 DP ADDs or one DP MUL per cycle to the XYZW? My idea of it was more like 1 DP ADD over 2 cycles and 1 DP MUL over 4 cycles per X,Y,Z,W. That would require that each pipeline have a 54bit adder and some extra exponent logic, but none of the adds will be wider than 54bits.
Last edited by Farhan; 26-Mar-2008 at 17:40.
#112 | |||||||
|
Regular
|
I'm in way over my head, so I'm really just nudging/guessing...
Quote:
Quote:
Quote:
I'm guessing that by doing partial sums of the partial products you can control the precision by using the correct bit-shifts between stages.
Quote:
Quote:
Quote:
Quote:
I'm going to try to play with IL and GPUSA soon... I'm a bit bugged by GPUSA's focus on R600, not RV670 - fingers-crossed. Jawed |
#113 | |
|
Member
Join Date: May 2005
Location: in the shade
Posts: 152
|
Quote:
Code:
|Whi|Wlo |Zhi|Zlo| |Yhi|Ylo| Xhi|Xlo
See, my idea to minimize compute hardware was to just use 1 pipeline over 4 cycles (meaning each one would work on an independent MUL), and have a 54bit adder for the accumulate stage. So basically what you do is (((AloBlo+AloBhi)+AhiBlo)+AhiBhi); in this case you never have to do any >54bit adds, and you can start using the lower order bits to start figuring out rounding already (you don't have to keep all the bits).
I was guessing that for the DP ADD they would just be register fetch limited so it would need 2 cycles. However, so would the MUL in this case, so it would probably not work in 4 cycles unless the inputs were wider.
Maybe it's something in between. Maybe 2 pipelines fuse over 2 cycles for the MUL and 1 cycle for the ADD. I think that would make more sense than having 4 pipelines combine for the MUL and 2 for the ADD, and it would have enough input bandwidth to have all the operands in one cycle.
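Here is a rough sketch of that low-to-high accumulation, assuming 54-bit operands split into 27-bit halves. The names are hypothetical and the "sticky" flag is one common way to remember discarded low bits for rounding; nothing here is confirmed as the actual hardware scheme.

```python
import random

HALF = 27                 # assumed half width, so operands are 54 bits
MASK = (1 << HALF) - 1


def mul_top_bits(a, b):
    """Return (a*b >> 54, sticky), accumulating partial products low to high.

    Once a partial product of a given weight has been added, the low
    27 bits of the running sum are final, so they are folded into a
    sticky flag and shifted out instead of being kept.
    """
    a_hi, a_lo = a >> HALF, a & MASK
    b_hi, b_lo = b >> HALF, b & MASK

    acc = a_lo * b_lo                      # weight 2**0
    sticky = (acc & MASK) != 0
    acc >>= HALF

    acc += a_lo * b_hi                     # weight 2**27
    acc += a_hi * b_lo                     # weight 2**27
    sticky = sticky or (acc & MASK) != 0
    acc >>= HALF

    acc += a_hi * b_hi                     # weight 2**54
    return acc, sticky


if __name__ == "__main__":
    random.seed(1)
    for _ in range(1000):
        a, b = random.getrandbits(54), random.getrandbits(54)
        top, sticky = mul_top_bits(a, b)
        full = a * b
        assert top == full >> 54
        assert sticky == ((full & ((1 << 54) - 1)) != 0)
    print("top 54 bits and sticky flag match the exact product")
```

In this model every value being added is at most 54 bits wide, though the running sum can carry a bit or two past that before it is shifted down.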
#114 |
|
Regular
|
Sorry, I didn't explain very well that I meant the final add to be a separate, pipelined operation. This is because this final add has to align its operands based on their exponents.
Maybe this will help: Code:
Bl
Al
---
wWW
Bh
Al
---
zZZ
---
Z+W
=====
zZwWW partial sum 1
=====
Bl
Ah
---
yYY
Bh
Ah
---
xXX
---
X+Y
=====
xXyYY partial sum 2
=====
p1 zZwWW
p2 +xXyYY
=======
xXyZwWW
=======
When p1 and p2 are added, the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27. So if p2 is 54 significant bits, then only as far as the carry, "w", can be used; WW is dropped. So the final add is always within the precision of a 54-bit adder.
What I'm wondering, now, is if the multipliers in each of the four lanes have final-stage adders that can be joined in the pairings I've described. In other words, p1 can be generated as lanes Z and W do the final addition for each of their own 27-bit multiplies - p1 doesn't require an additional adder stage after Z and W have generated their multiplier results. Same for p2.
If that's the case then I haven't added any stages to the pipeline, because by default the pipeline performs MAD, with the final add being p1 + p2.
---
I've been playing with GPUSA, this is a*b in Brook+:
Code:
x: MUL_64 R0.x, R0.y, R1.y
y: MUL_64 R0.y, R0.y, R1.y
z: MUL_64 ____, R0.y, R1.y
w: MUL_64 ____, R0.x, R1.x
Code:
35 x: FREXP_64 ____, R0.y
y: FREXP_64 R123.y, R0.x
z: FREXP_64 R123.z, R0.x
w: FREXP_64 R123.w, R0.x
36 x: FLT64_TO_FLT32 R123.x, PV(35).w
y: FLT64_TO_FLT32 ____, PV(35).z
z: SUB_INT R126.z, 0.0f, PV(35).y
37 x: MUL_64 R127.x, R2.y, R0.y
y: MUL_64 R127.y, R2.y, R0.y
z: MUL_64 ____, R2.y, R0.y
w: MUL_64 ____, R2.x, R0.x
t: RECIP_IEEE R122.w, PV(36).x
38 z: FLT32_TO_FLT64 R123.z, PS(37).x
w: FLT32_TO_FLT64 R123.w, R0.x
39 x: LDEXP_64 R126.x, PV(38).w, R126.z
y: LDEXP_64 R126.y, PV(38).z, R126.z
40 x: MULADD_64 R123.x, R127.y, PV(39).y, R3.y
y: MULADD_64 R123.y, R127.y, PV(39).y, R3.y
z: MULADD_64 R4.z, R127.y, PV(39).y, R3.y
w: MULADD_64 R4.w, R127.x, PV(39).x, R3.x
41 x: MULADD_64 R126.x, PV(40).y, R126.y, R126.y
y: MULADD_64 R126.y, PV(40).y, R126.y, R126.y
z: MULADD_64 R4.z, PV(40).y, R126.y, R126.y
w: MULADD_64 R4.w, PV(40).x, R126.x, R126.x
42 x: MULADD_64 R123.x, R127.y, PV(41).y, R3.y
y: MULADD_64 R123.y, R127.y, PV(41).y, R3.y
z: MULADD_64 R4.z, R127.y, PV(41).y, R3.y
w: MULADD_64 R4.w, R127.x, PV(41).x, R3.x
43 x: MULADD_64 R126.x, PV(42).y, R126.y, R126.y
y: MULADD_64 R126.y, PV(42).y, R126.y, R126.y
z: MULADD_64 R4.z, PV(42).y, R126.y, R126.y
w: MULADD_64 R4.w, PV(42).x, R126.x, R126.x
44 x: MUL_64 R125.x, R1.y, PV(43).y
y: MUL_64 R125.y, R1.y, PV(43).y
z: MUL_64 ____, R1.y, PV(43).y
w: MUL_64 ____, R1.x, PV(43).x
45 x: MULADD_64 R123.x, R127.y, PV(44).y, R1.y
y: MULADD_64 R123.y, R127.y, PV(44).y, R1.y
z: MULADD_64 R4.z, R127.y, PV(44).y, R1.y
w: MULADD_64 R4.w, R127.x, PV(44).x, R1.x
46 x: MULADD_64 R1.x, PV(45).y, R126.y, R125.y
y: MULADD_64 R1.y, PV(45).y, R126.y, R125.y
z: MULADD_64 R4.z, PV(45).y, R126.y, R125.y
w: MULADD_64 R4.w, PV(45).x, R126.x, R125.x
Jawed |
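The longer listing above has no caption in the thread. One plausible reading of the opcode sequence (FREXP_64, a 32-bit RECIP_IEEE on the t unit, LDEXP_64, then a chain of MULADD_64) is a double-precision divide built from a single-precision reciprocal seed refined by Newton-Raphson. That is a guess from the opcodes, not something confirmed in the thread. A minimal Python sketch of the general technique, with hypothetical function names and no claim to match the compiler's exact sequence:

```python
import math
import struct


def to_f32(x):
    """Round a Python float (a double) to single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


def nr_divide(a, b):
    """Approximate a / b starting from a single-precision reciprocal.

    Assumes b is finite and nonzero, and that the exponents involved
    are nowhere near the overflow/underflow limits.
    """
    m, e = math.frexp(b)               # b = m * 2**e with 0.5 <= |m| < 1
    seed = to_f32(1.0 / to_f32(m))     # stand-in for a 32-bit reciprocal
    x = math.ldexp(seed, -e)           # x is roughly 1/b
    for _ in range(2):                 # two Newton-Raphson refinements
        err = 1.0 - b * x              # a fused multiply-add in hardware
        x = x + x * err
    q = a * x                          # first quotient estimate
    r = a - b * q                      # residual of that estimate
    return q + r * x                   # one correction step


if __name__ == "__main__":
    for a, b in [(1.0, 3.0), (355.0, 113.0), (-7.5, 0.3)]:
        print(a, b, nr_divide(a, b), a / b)
```

Each refinement roughly doubles the number of correct bits, which is why a 24-bit seed needs about two steps to get near 53 bits. Python has no fused multiply-add here, so the last bit or two may differ from a correctly rounded divide.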
#115 | |
|
Regular
|
Quote:
I should also post a+b: Code:
x: ADD_64 R0.x, R0.y, R1.y
y: ADD_64 R0.y, R0.x, R1.x
Jawed |
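The two short listings (a*b and a+b) can be read mechanically: count how many of the x/y/z/w slots each double op occupies and turn that into a peak rate relative to five single-precision lanes. A small sketch, assuming a second ADD_64 could be co-issued in z/w (the listing only shows one) and using made-up helper names:

```python
import re
from fractions import Fraction

MUL_LISTING = """\
x: MUL_64 R0.x, R0.y, R1.y
y: MUL_64 R0.y, R0.y, R1.y
z: MUL_64 ____, R0.y, R1.y
w: MUL_64 ____, R0.x, R1.x
"""

ADD_LISTING = """\
x: ADD_64 R0.x, R0.y, R1.y
y: ADD_64 R0.y, R0.x, R1.x
"""

# optional group number, then "lane: OPCODE"
SLOT_RE = re.compile(r"^\s*(?:\d+\s+)?([xyzwt]):\s+(\w+)")


def slots_used(listing):
    """Map each opcode to the set of lanes it occupies in the listing."""
    used = {}
    for line in listing.splitlines():
        match = SLOT_RE.match(line)
        if match:
            lane, opcode = match.groups()
            used.setdefault(opcode, set()).add(lane)
    return used


def peak_vs_sp(lanes_per_op, fused_lanes=4, total_lanes=5):
    """DP ops per cycle over SP ops per cycle for one five-lane unit."""
    return Fraction(fused_lanes // lanes_per_op, total_lanes)


if __name__ == "__main__":
    for listing in (MUL_LISTING, ADD_LISTING):
        for opcode, lanes in slots_used(listing).items():
            print(opcode, sorted(lanes), peak_vs_sp(len(lanes)))
```

Under those assumptions the slot counts line up with the rates quoted earlier in the thread for DP MUL/MAD and DP ADD.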
#116 | ||
|
Member
Join Date: May 2005
Location: in the shade
Posts: 152
|
Quote:
Quote:
edit: yeah, it does look like it's single cycle for the MUL, using all 4 pipelines fused. Most interesting. Now i'm dying to know the internal structure. I wonder if there are more stages when doing DP...
Last edited by Farhan; 27-Mar-2008 at 02:56.
#117 | ||
|
Regular
|
Quote:
Quote:
If anyone works this out, do you think Mike will say? Jawed |
#118 |
|
Regular
|
This page nicely explains the denormalisation of the smaller operand that's required to produce equivalent exponents in the two operands:
http://www.cs.umd.edu/class/sum2003/.../addFloat.html Jawed |
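For anyone who wants to poke at the alignment step without reading the page, here is a small sketch of a floating-point add done on integer mantissas. It assumes positive, normal doubles, drops the bits shifted out of the smaller operand instead of rounding with them, and uses invented helper names, so it illustrates the idea only and is not an IEEE-exact adder.

```python
import math

MANT_BITS = 53  # IEEE double significand width, hidden bit included


def decompose(x):
    """Return (mantissa, exponent) with x == mantissa * 2**exponent."""
    m, e = math.frexp(x)                   # x = m * 2**e, 0.5 <= m < 1
    return int(math.ldexp(m, MANT_BITS)), e - MANT_BITS


def aligned_add(a, b):
    """Add two positive doubles by shifting the smaller one into line."""
    (ma, ea), (mb, eb) = decompose(a), decompose(b)
    if ea < eb:                            # make (ma, ea) the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    shift = ea - eb                        # difference in exponents
    mb_aligned = mb >> shift               # shifted-out bits are dropped
    total = ma + mb_aligned                # needs at most MANT_BITS + 1 bits
    assert total.bit_length() <= MANT_BITS + 1
    return math.ldexp(total, ea)           # renormalise


if __name__ == "__main__":
    for a, b in [(1.0, 2.0), (1e10, 3.5), (0.1, 0.2)]:
        print(a, b, aligned_add(a, b), a + b)
```

The internal assert is the point: after alignment, the sum never needs more than 54 bits, however far apart the two exponents started.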
#119 | |
|
Member
Join Date: May 2005
Location: in the shade
Posts: 152
|
Quote:
#120 | |
|
Regular
|
Quote:
You queried this addition earlier saying it needs to be done at 54+27 bits precision, but I hope I've shown that treating it as a normal floating point add (de-normalising: shifting one operand and modifying the exponent) allows it to be performed with only 54 bits (for a 53 bit final result). Jawed |
#121 |
|
Regular
|
Any chance that modifying/widening the DP4 paths will provide the requisite stages?
Jawed |
#122 | |
|
Member
Join Date: May 2005
Location: in the shade
Posts: 152
|
Quote:
#123 | |
|
Regular
|
Quote:
Code:
Blo 27
Alo 27
---
w55
Bhi 53
Alo 27
---
z81
---
Z+W
=====
z82 partial sum 1
=====
Blo 27
Ahi 53
---
y81
Bhi 53
Ahi 53
---
107
---
X+Y
=====
108 partial sum 2
=====
p1 z82
p2 +108
=======
109
=======
Doing this I think I've understood my mistake. When I said "the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27" that's wrong; it should be the difference in exponents, as there are always 54 significant bits in p2.
---
My suggestion is that the addition, p1+p2, is done on the final adder in the pipeline (in lanes X and Y). This adder is required to perform a DADD instruction, so in this case it is also used for p1+p2. Since DADD has to support two 53-bit operands by being a 54-bit adder, the addition of p1+p2, 27 bits + 54 bits, requires no extra hardware dedicated to MUL.
So, what I'm thinking is that a conventional single precision DP4 needs to perform a final ADD on 4 MULs. So the DP4 instruction requires a 4 operand adder. I'm wondering if this same adder can also support:
Does DP4 work like that, though? Jawed |
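The bit counts in the table can be cross-checked by brute force. The sketch below takes the largest possible 53-bit significand, assumes a 27-bit low half and a 26-bit high half (the split is an assumption, as are the names), and reports the exact bit length of each partial product and partial sum at its shifted position.

```python
LO_BITS, HI_BITS = 27, 26  # assumed split of a 53-bit significand


def widths():
    """Exact bit lengths of the shifted partial products and their sums."""
    a = b = (1 << (LO_BITS + HI_BITS)) - 1     # largest 53-bit significand
    lo_mask = (1 << LO_BITS) - 1
    a_hi, a_lo = a >> LO_BITS, a & lo_mask
    b_hi, b_lo = b >> LO_BITS, b & lo_mask

    w = a_lo * b_lo                            # Blo x Alo
    z = (a_lo * b_hi) << LO_BITS               # Bhi x Alo
    y = (a_hi * b_lo) << LO_BITS               # Blo x Ahi
    x = (a_hi * b_hi) << (2 * LO_BITS)         # Bhi x Ahi
    p1, p2 = z + w, x + y                      # partial sums 1 and 2

    terms = {"w": w, "z": z, "y": y, "x": x,
             "p1": p1, "p2": p2, "p1+p2": p1 + p2}
    return {name: value.bit_length() for name, value in terms.items()}


if __name__ == "__main__":
    for name, bits in widths().items():
        print(f"{name:6s} {bits} bits")
```

Under this split the exact widths come out no wider than the figures in the table, which appear to reserve an extra carry bit at each step; either way the full result is far wider than the 54 bits the final adder has to keep.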
#124 |
|
Member
Join Date: May 2007
Posts: 101
|
http://forums.amd.com/forum/messagev...&enterthread=y
AMD Stream SDK v1.1-beta Now Available For Download
The AMD Stream Team is pleased to announce the availability of AMD Stream SDK v1.1-beta! The installation files are available for immediate download from: FTP Download Site For AMD Stream SDK v1.1-beta
The AMD Stream Computing website will be updated in the next few days to reflect this new release.
With v1.1-beta comes:
- AMD FireStream 9170 support
- Linux support (RHEL 5.1 and SLES 10 SP1)
- Brook+ integer support
- Brook+ #line number support for easier .br file debugging
- Various bug fixes and runtime enhancements
- Preliminary Microsoft Visual Studio 2008 support
If you have any questions, please do not hesitate to post your question to the forum.
Sincerely, AMD Stream Team
#125 | |
|
Junior Member
Join Date: Aug 2007
Location: Houston, Texas
Posts: 79
|
Quote: