25-Mar-2008, 02:23   #101
mhouston
A little of this and that

Join Date: Oct 2005
Location: Cupertino
Posts: 342

I should also note that doubles are *not* done in the "t" unit; they are instead done in the XYZW units in a "fused" manner. Thus, you can execute a double precision operation in XYZW alongside a 32-bit operation in the t unit. Doubles are handled at 1/4 rate in 4/5 of the units, so double precision peak is 1/5 of single precision peak. However, in practice, the difference can be better than 1/5 depending on the scheduling of your 32-bit ops, or worse under the bandwidth/latency increases from reading/writing wider data.
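As a sanity check on the rates Mike describes, a quick sketch. The 64 five-lane VLIW groups and the 775 MHz clock are the usual RV670/HD 3870 figures, used here purely for illustration:

```python
# Peak-rate arithmetic for the scheme described above: doubles fuse the
# four XYZW lanes, leaving the t unit free for 32-bit work.
CLOCK_HZ = 775e6          # illustrative HD 3870 core clock
VLIW_GROUPS = 64          # 64 five-wide VLIW processors = 320 ALUs
FLOPS_PER_MAD = 2         # a multiply-add counts as two flops

# Single precision: all 5 lanes (XYZW + t) issue one MAD per clock.
sp_peak = VLIW_GROUPS * 5 * FLOPS_PER_MAD * CLOCK_HZ

# Double precision: XYZW fuse into 1 DP MAD per clock; t does not help.
dp_peak = VLIW_GROUPS * 1 * FLOPS_PER_MAD * CLOCK_HZ

print(sp_peak / 1e9)      # 496.0 (GFLOPS, single precision)
print(dp_peak / 1e9)      # 99.2  (GFLOPS, double precision)
print(sp_peak / dp_peak)  # 5.0   (DP peak is 1/5 of SP peak)
```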
25-Mar-2008, 02:30   #102
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

Thanks for the clarification Mike. Your explanation seems to imply that there's no "intrinsic instruction" support for double precision transcendentals - presumably they're only available via some kind of macro?

Jawed
25-Mar-2008, 02:37   #103
Farhan
Member

Join Date: May 2005
Posts: 152

Quote:
 Originally Posted by mhouston I should also note that doubles are *not* done in the "t" unit; they are instead done in the XYZW units in a "fused" manner. Thus, you can execute a double precision operation in XYZW alongside a 32-bit operation in the t unit. Doubles are handled at 1/4 rate in 4/5 of the units, so double precision peak is 1/5 of single precision peak. However, in practice, the difference can be better than 1/5 depending on the scheduling of your 32-bit ops, or worse under the bandwidth/latency increases from reading/writing wider data.
What do you mean by "fused"? I'm really interested to know how doubles are actually handled.
__________________
[03:44] <thefarhan> i have exactly 128 friends right now :D
[03:45] <Jollemi> you have to teach them to remember 1MB worth of data, and see if you can run Windows 9x or Linux

25-Mar-2008, 03:44   #104
mhouston
A little of this and that

Join Date: Oct 2005
Location: Cupertino
Posts: 342

That is the secret sauce. If you play with the IL/ISA you will gain a little information on how things are done, but I won't be able to tell you how the actual arch works. I just wanted to explain things for when people start playing with CAL and disassembling things, so they don't get confused that the doubles aren't happening in T. As for transcendentals, sin/cos and the like, they are not native in double. I'm not actually sure if we currently ship a full set of double transcendentals; I'll have to check.
25-Mar-2008, 13:29   #105
Arnold Beckenbauer
Senior Member

Join Date: Oct 2006
Location: Germany
Posts: 1,004

Quote:
 Originally Posted by mhouston I should also note that doubles are *not* done in the "t" unit; they are instead done in the XYZW units in a "fused" manner. Thus, you can execute a double precision operation in XYZW alongside a 32-bit operation in the t unit. Doubles are handled at 1/4 rate in 4/5 of the units, so double precision peak is 1/5 of single precision peak. However, in practice, the difference can be better than 1/5 depending on the scheduling of your 32-bit ops, or worse under the bandwidth/latency increases from reading/writing wider data.

http://www.computerbase.de/news/trei...md_firestream/
But why did Giuseppe Amato, your Technical Director, say during CeBIT that it's up to 350 GFLOP/s?
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke
Eta Kooram Nah Smech!

Find Chuck Norris.

25-Mar-2008, 18:35   #106
mhouston
A little of this and that

Join Date: Oct 2005
Location: Cupertino
Posts: 342

Giuseppe either spoke incorrectly or was misquoted. We do have mixed-mode double+single precision applications that get 350 GFlops, but the upper limit for raw double performance is 1/5 the single rate, as I stated above. You can grab CAL/Brook+ and a 3850/3870 and play with this for yourself.
26-Mar-2008, 02:46   #107
mhouston
A little of this and that

Join Date: Oct 2005
Location: Cupertino
Posts: 342

I should also note that 1/5 is peak DP rate vs. SP rate doing MADs. There are cases, like DP add, in which we can do better, 2/5 rate, but raw peak DP flops is 1/5 SP flops. Another architect reminded me of the DP add performance as compared to SP add performance this afternoon. And, despite what others may claim, we have actual hardware for doing this; it's not emulated...
26-Mar-2008, 04:59   #108
Farhan
Member

Join Date: May 2005
Posts: 152

Quote:
 Originally Posted by mhouston I should also note that 1/5 is peak DP rate vs. SP rate doing MADs. There are cases, like DP add, in which we can do better, 2/5th rate, but raw peak DP flops is 1/5 SP flops. Another architect reminded me of the DP add performance as compared to SP add performance this afternoon. And, despite what others may claim, we have actual hardware for doing this, it's not emulated...
1/5 speed for DP MADs :O
That's faster than I expected... Now I'm really curious to know how much extra hardware was used to achieve this (my guess is wider internal adders at least, for a 1/5-speed MUL, but I can't figure out how to get the ADD in all at the same time...).
Is DP MUL also 1/5 speed?

26-Mar-2008, 05:42   #109
mhouston
A little of this and that

Join Date: Oct 2005
Location: Cupertino
Posts: 342

Let the speculation begin. Since we have exposed all the way down to the metal, at least for the ISA, those willing to spend some time can figure out how a lot of stuff is done and what the performance of every instruction is, either ISA or IL.
26-Mar-2008, 14:34   #110
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

I've just downloaded the SDK so will have a play... I like the detail I'm seeing.

I want to make a guess first. When doing a double-precision MUL on A and B, the pipeline first splats the operands across the SIMD as follows:

X - A.hi, B.hi
Y - A.hi, B.lo
Z - A.lo, B.hi
W - A.lo, B.lo

Each lane would need to be able to do ~27-bit multiplication. The partial products can then be added at the bottom of the pipe, which needs an adder that can sum all four lanes, having first bit-shifted for correct alignment in the result. This would be done as (X+Y)+(Z+W).

A DP ADD would only need to occupy 2 lanes, as the cumulative ~54 bits is enough for an ADD, so the X+Y lanes and Z+W lanes run as two independent operations:

X - A.hi, B.hi
Y - A.lo, B.lo
Z - C.hi, D.hi
W - C.lo, D.lo

Each pair of lanes then uses a bottom-of-pipe accumulator to sort out the carry from lo to hi.

So, I'm proposing operand-splatting, widened lanes and two add stages.

Jawed
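Jawed's splat-and-sum guess can be sanity-checked with a small integer model. This is only a model of the mantissa arithmetic (the helper names are mine; exponent handling and rounding are ignored), not a claim about the actual hardware:

```python
def split27(m):
    """Split an integer mantissa into (hi, lo) parts around bit 27."""
    return m >> 27, m & ((1 << 27) - 1)

def dp_mul_mantissa(a, b):
    """Model of the 4-lane splat: four ~27-bit multiplies, summed as (X+Y)+(Z+W)."""
    a_hi, a_lo = split27(a)
    b_hi, b_lo = split27(b)
    x = a_hi * b_hi                 # lane X
    y = a_hi * b_lo                 # lane Y
    z = a_lo * b_hi                 # lane Z
    w = a_lo * b_lo                 # lane W
    p2 = (x << 54) + (y << 27)      # partial sum X+Y (alignment shifts applied)
    p1 = (z << 27) + w              # partial sum Z+W
    return p2 + p1                  # final add

# The model reproduces the full product of two 53-bit mantissas.
a, b = (1 << 52) | 12345, (1 << 52) | 67890
assert dp_mul_mantissa(a, b) == a * b
```

A real implementation would keep only the top ~53 bits of the 106-bit result and fold the alignment shifts into the adder tree.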
26-Mar-2008, 17:15   #111
Farhan
Member

Join Date: May 2005
Posts: 152

Quote:
 Originally Posted by Jawed I've just downloaded the SDK so will have a play... I like the detail I'm seeing I want to make a guess first. When doing double-precision MUL on A and B the pipeline first splats the operands across the SIMD as follows: X - A.hi, B.hi Y - A.hi, B.lo Z - A.lo, B.hi W - A.lo, B.lo Each lane would need to be able to do ~27 bit multiplication. The partial products can then be added at the bottom of the pipe, which needs an adder that can sum all four lanes having first bit-shifted for correct alignment in the result. This would be done as (X+Y)+(Z+W). A DP ADD would only need to occupy 2 lanes as the cumulative ~54 bits is enough for an ADD, so X+Y lanes and Z+W lanes as two independent operations: X - A.hi, B.hi Y - A.lo, B.lo Z - C.hi, D.hi W - C.lo, D.lo each pair of lanes then uses a bottom of pipe accumulator to sort out the carry from lo to hi. So, I'm proposing operand-splatting, widened lanes and two add stages. Jawed
That would mean that the adders are pretty fancy if they can be combined to do a 54-bit add in a single stage for the ADD. It's cheaper than a larger multiplier I guess, and probably cheaper than having an individual 54-bit add per pipe internally (which was what I was thinking).

However, regarding MULs, I think what you proposed is a little hard for me to buy.
The complete result for this accumulate would be 108 bits, of which you want to keep the top 54 or so (I don't know if the rounding is IEEE compliant, probably not). To do this in one cycle, that accumulator will be pretty massive. Not sure if they would spend that much extra hardware on it... What are the instruction latencies of the ADD and the MUL? Can you issue 2 DP ADDs or one DP MUL per cycle to the XYZW? My idea of it was more like 1 DP ADD over 2 cycles and 1 DP MUL over 4 cycles per X,Y,Z,W. That would require that each pipeline have a 54-bit adder and some extra exponent logic, but none of the adds would be wider than 54 bits.

Last edited by Farhan; 26-Mar-2008 at 17:40.

26-Mar-2008, 19:34   #112
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

I'm in way over my head, so I'm really just nudging/guessing...

Quote:
 Originally Posted by Farhan That would mean that the adders are pretty fancy if they can be combined and do a 54bit add in a single stage for the ADD.
I'm assuming, naively, that the interaction of carry, sign and exponent between the two halves isn't difficult, with carry being no different from the way carry is treated within a single adder.

Quote:
 It's cheaper than a larger multiplier i guess, and probably cheaper than having an individual 54bit add per pipe internally (which was what i was thinking).
I was thinking it is prolly a bit more costly than a DP adder, but is preferable when you want to perform single precision add on each half, since each lane needs independent exponent/sign handling.

Quote:
 However, regarding MULs, i think what you proposed is a little hard for me to buy. The complete result for this accumulate would be 108bits, of which you want to keep the top 54 or so (i don't know if the rounding is IEEE compliant, probably not).
I was thinking the accumulate would be operating on shifted intermediate results. Working from MSB downwards (assuming X is most significant), all 27 bits from X, 27 bits from Y shifted right with respect to X, 27 bits from Z also shifted right and the top bits from scraggy old W shifted to the bottom of the 54 bit result.

I'm guessing that doing partial sums of the partial products you can control the precision by using the correct bit-shifts between stages.

Quote:
 To do this in one cycle, that accumulator will be pretty massive. Not sure if they would spend that much extra hardware on it...
This is why I split it up into two independent add stages.

Quote:
 What are the instruction latencies of the ADD and the MUL?
I presume all the latencies are identical. I have to admit I've added a pile of latency with my suggestion which is problematic.

Quote:
 Can you issue 2 DP ADDs or one DP MUL per cycle to the XYZW?
Yes, Mike described the issue rate for MAD as 1 per clock producing 1 scalar result, with ADD running at 2 scalar results per clock.

Quote:
 My idea of it was more like 1 DP ADD over 2 cycles and 1 DP MUL over 4 cycles per X,Y,Z,W. That would require that each pipeline have a 54bit adder and some extra exponent logic, but none of the adds will be wider than 54bits.
Aha, so what you're thinking of is "looping" with intermediate shifts and adds? So for a MUL each lane would write "least significant 32 bits" (including exponent) in one cycle then 2 cycles later the most significant 32 bits.

I'm going to try to play with IL and GPUSA soon... I'm a bit bugged by GPUSA's focus on R600, not RV670 - fingers-crossed.

Jawed

26-Mar-2008, 23:39   #113
Farhan
Member

Join Date: May 2005
Posts: 152

Quote:
 Originally Posted by Jawed I'm in way over my head, so I'm really just nudging/guessing... I'm assuming, naively, that the interaction of carry, sign and exponent between the two halves isn't difficult, with carry being no different from the way carry is treated within a single adder. I was thinking it is prolly a bit more costly than a DP adder, but is preferable when you want to perform single precision add on each half, since each lane needs independent exponent/sign handling. I was thinking the accumulate would be operating on shifted intermediate results. Working from MSB downwards (assuming X is most significant), all 27 bits from X, 27 bits from Y shifted right with respect to X, 27 bits from Z also shifted right and the top bits from scraggy old W shifted to the bottom of the 54 bit result. I'm guessing that doing partial sums of the partial products you can control the precision by using the correct bit-shifts between stages. This is why I split it up into two independent add stages. I presume all the latencies are identical. I have to admit I've added a pile of latency with my suggestion which is problematic. Yes, Mike described the issue rate for MAD as 1 per clock producing 1 scalar result, with ADD running at 2 scalar results per clock. Aha, so what you're thinking of is "looping" with intermediate shifts and adds? So for a MUL each lane would write "least significant 32 bits" (including exponent) in one cycle then 2 cycles later the most significant 32 bits. I'm going to try to play with IL and GPUSA soon... I'm a bit bugged by GPUSA's focus on R600, not RV670 - fingers-crossed. Jawed
The results from each 27bit mul have to be added like this (each hi and lo part is 27bits)...

Code:
```        |Whi|Wlo|
    |Zhi|Zlo|
    |Yhi|Ylo|
|Xhi|Xlo|```
So if you want to do (X+Y)+(Z+W), each adder has to be 54 bits wide for the (X+Y) and (Z+W) part. To do the final add you need a 54+27 bit adder. I'm not sure they would spend that much extra hardware on it, and it's probably not the optimal way to do it, at least in my mind. If they want to do it all in a single adder tree then they probably have to pipeline it, and it would also cost a lot of hardware.

See, my idea to minimize compute hardware was to just use 1 pipeline over 4 cycles (meaning each one would work on an independent MUL), and have a 54-bit adder for the accumulate stage. So basically what you do is (((AloBlo+AloBhi)+AhiBlo)+AhiBhi); in this case you never have to do any >54-bit adds, and you can start using the lower order bits to figure out rounding early (you don't have to keep all the bits). I was guessing that for the DP ADD they would just be register-fetch limited, so it would need 2 cycles. However, so would the MUL in this case, so it would probably not work in 4 cycles unless the inputs were wider.

Maybe it's something in between. Maybe 2 pipelines fuse over 2 cycles for the MUL and 1 cycle for the ADD. I think that would make more sense than having 4 pipelines combine for the MUL and 2 for the ADD, and it would have enough input bandwidth to have all the operands in one cycle.
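Farhan's 4-cycle sequence from above can be modelled in integers to check that, with a 27-bit shift between stages, no add ever exceeds ~56 bits (the function name is mine; real hardware would keep guard bits for rounding instead of truncating):

```python
def dp_mul_sequential(a, b):
    """One 27x27 partial product per cycle: (((AloBlo+AloBhi)+AhiBlo)+AhiBhi)."""
    a_hi, a_lo = a >> 27, a & ((1 << 27) - 1)
    b_hi, b_lo = b >> 27, b & ((1 << 27) - 1)
    s = a_lo * b_lo                # cycle 1, weight 2^0
    s = (s >> 27) + a_lo * b_hi    # cycle 2, weight 2^27
    s = s + a_hi * b_lo            # cycle 3, weight 2^27
    s = (s >> 27) + a_hi * b_hi    # cycle 4, weight 2^54
    return s                       # approximates the top 53 bits

# Truncating between stages loses at most 1 ulp of the top-half result.
a, b = (1 << 52) | 987654321, (1 << 52) | 123456789
assert 0 <= ((a * b) >> 54) - dp_mul_sequential(a, b) <= 1
```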

27-Mar-2008, 02:06   #114
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

Sorry, I didn't explain very well that I meant the final add to be a separate, pipelined operation. This is because this final add has to align its operands based on their exponents.

Maybe this will help:

Code:
```  Bl Al
  ---
  wWW
Bh Al
---
zZZ
---
Z+W
=====
zZwWW    partial sum 1
=====

  Bl Ah
  ---
  yYY
Bh Ah
---
xXX
---
X+Y
=====
xXyYY    partial sum 2
=====

p1      zZwWW
p2    +xXyYY
=======
xXyZwWW
=======```
So, that's A*B, across lanes X, Y, Z and W. I've kept track of the carry from each lane by indicating the carry with a lower-case lane name, e.g. "y".

When p1 and p2 are added, the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27. So if p2 is 54 significant bits, then only as far as the carry, "w", can be used; WW is dropped. So the final add is always within the precision of a 54-bit adder.

What I'm wondering, now, is if the multipliers in each of the four lanes have final-stage adders that can be joined in the pairings I've described. In other words, p1 can be generated as lanes Z and W do the final addition for each of their own 27-bit multiplies - p1 doesn't require an additional adder stage after Z and W have generated their multiplier results. Same for p2.

If that's the case then I haven't added any stages to the pipeline, because by default the pipeline performs MAD, with the final add being p1 + p2.
---

I've been playing with GPUSA. This is a*b in Brook+:

Code:
```   x: MUL_64      R0.x,  R0.y,  R1.y
   y: MUL_64      R0.y,  R0.y,  R1.y
   z: MUL_64      ____,  R0.y,  R1.y
   w: MUL_64      ____,  R0.x,  R1.x```
This monster is a/b (this issues a DDIV - double division - in IL):

Code:
```35  x: FREXP_64        ____,    R0.y
    y: FREXP_64        R123.y,  R0.x
    z: FREXP_64        R123.z,  R0.x
    w: FREXP_64        R123.w,  R0.x
36  x: FLT64_TO_FLT32  R123.x,  PV(35).w
    y: FLT64_TO_FLT32  ____,    PV(35).z
    z: SUB_INT         R126.z,  0.0f,  PV(35).y
37  x: MUL_64          R127.x,  R2.y,  R0.y
    y: MUL_64          R127.y,  R2.y,  R0.y
    z: MUL_64          ____,    R2.y,  R0.y
    w: MUL_64          ____,    R2.x,  R0.x
    t: RECIP_IEEE      R122.w,  PV(36).x
38  z: FLT32_TO_FLT64  R123.z,  PS(37).x
    w: FLT32_TO_FLT64  R123.w,  R0.x
39  x: LDEXP_64        R126.x,  PV(38).w,  R126.z
    y: LDEXP_64        R126.y,  PV(38).z,  R126.z
40  x: MULADD_64       R123.x,  R127.y,  PV(39).y,  R3.y
    y: MULADD_64       R123.y,  R127.y,  PV(39).y,  R3.y
    z: MULADD_64       R4.z,    R127.y,  PV(39).y,  R3.y
    w: MULADD_64       R4.w,    R127.x,  PV(39).x,  R3.x
41  x: MULADD_64       R126.x,  PV(40).y,  R126.y,  R126.y
    y: MULADD_64       R126.y,  PV(40).y,  R126.y,  R126.y
    z: MULADD_64       R4.z,    PV(40).y,  R126.y,  R126.y
    w: MULADD_64       R4.w,    PV(40).x,  R126.x,  R126.x
42  x: MULADD_64       R123.x,  R127.y,  PV(41).y,  R3.y
    y: MULADD_64       R123.y,  R127.y,  PV(41).y,  R3.y
    z: MULADD_64       R4.z,    R127.y,  PV(41).y,  R3.y
    w: MULADD_64       R4.w,    R127.x,  PV(41).x,  R3.x
43  x: MULADD_64       R126.x,  PV(42).y,  R126.y,  R126.y
    y: MULADD_64       R126.y,  PV(42).y,  R126.y,  R126.y
    z: MULADD_64       R4.z,    PV(42).y,  R126.y,  R126.y
    w: MULADD_64       R4.w,    PV(42).x,  R126.x,  R126.x
44  x: MUL_64          R125.x,  R1.y,  PV(43).y
    y: MUL_64          R125.y,  R1.y,  PV(43).y
    z: MUL_64          ____,    R1.y,  PV(43).y
    w: MUL_64          ____,    R1.x,  PV(43).x
45  x: MULADD_64       R123.x,  R127.y,  PV(44).y,  R1.y
    y: MULADD_64       R123.y,  R127.y,  PV(44).y,  R1.y
    z: MULADD_64       R4.z,    R127.y,  PV(44).y,  R1.y
    w: MULADD_64       R4.w,    R127.x,  PV(44).x,  R1.x
46  x: MULADD_64       R1.x,    PV(45).y,  R126.y,  R125.y
    y: MULADD_64       R1.y,    PV(45).y,  R126.y,  R125.y
    z: MULADD_64       R4.z,    PV(45).y,  R126.y,  R125.y
    w: MULADD_64       R4.w,    PV(45).x,  R126.x,  R125.x```
FREXP means split a double precision value into fraction and exponent. LDEXP means combine separate fraction and exponent into double precision. SUB_INT is integer subtract.

Jawed
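For reference, FREXP and LDEXP behave like the C library functions of the same names, so the range-reduction step of the DDIV sequence can be illustrated with Python's math module:

```python
import math

x = 6.0
frac, exp = math.frexp(x)           # split into fraction and exponent
assert (frac, exp) == (0.75, 3)     # 6.0 == 0.75 * 2**3
assert math.ldexp(frac, exp) == x   # LDEXP recombines them exactly
```

In the dump above, FREXP_64/LDEXP_64 bracket a single-precision RECIP_IEEE so the MULADD_64 steps can refine the reciprocal within a safe exponent range.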
27-Mar-2008, 02:14   #115
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

Quote:
 Originally Posted by Farhan Maybe it's something in between. Maybe 2 pipelines fuse over 2 cycles for the MUL and 1 cycle for the ADD. I think that would make more sense than having 4 pipelines combine for the MUL and 2 for the ADD, and it would have enough input bandwidth to have all the operands in one cycle.
I'm pretty much convinced everything is single cycle, but I can't think how to test this.

I should also post a+b:

Code:
```         x: ADD_64  R0.x,  R0.y,  R1.y```
Note how the lanes have "reversed operands". I'm inferring that this means you can't tell anything about the internal workings from the per-lane specified operands.

Jawed

27-Mar-2008, 02:42   #116
Farhan
Member

Join Date: May 2005
Posts: 152

Quote:
 Originally Posted by Jawed Sorry, I didn't explain very well that I meant the final add to be a separate, pipelined operation. This is because this final add has to align its operands based on their exponents.
I'm a bit confused now. Are we talking about a MUL or a MAD here? If it's a MUL then you don't have to (variable) align anything.

Quote:
 Originally Posted by Jawed Maybe this will help: Code: ``` Bl Al --- wWW Bh Al --- zZZ --- Z+W ===== zZwWW partial sum 1 ===== Bl Ah --- yYY Bh Ah --- xXX --- X+Y ===== xXyYY partial sum 2 ===== p1 zZwWW p2 +xXyYY ======= xXyZwWW =======``` So, that's A*B, across lanes X, Y, Z and W. I've kept track of the carry from each lane by indicating the carry with a lower case lane name, e.g. "y". When p1 and p2 are added, the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27. So if p2 is 54 significant bits, then only as far as the carry, "w" can be used, WW is dropped. So the final add is always within the precision of a 54-bit adder. What I'm wondering, now, is if the multipliers in each of the four lanes have final-stage adders that can be joined in the pairings I've described. In other words, p1 can be generated as lanes Z and W do the final addition for each of their own 27-bit multiplies - p1 doesn't require an additional adder stage after Z and W have generated their multiplier results. Same for p2. If that's the case then I haven't added any stages to the pipeline, because by default the pipeline performs MAD, with the final add being p1 + p2. --- I've been playing with GPUSA, this is a*b in Brook+: Code: ``` x: MUL_64 R0.x, R0.y, R1.y y: MUL_64 R0.y, R0.y, R1.y z: MUL_64 ____, R0.y, R1.y w: MUL_64 ____, R0.x, R1.x```
I don't get what you mean by the "count of significant bits in p2". In the worst case, you're adding one 54-bit number to a 54+27-bit number. So OK, you can get away with just an incrementer for the top 27 bits, but that's still quite a bit more than just a 54-bit add. This is assuming no rounding at all, meaning I always drop the last 27 bits from p1. Also, p2 is always going to have a 1 in the MSB or in the bit after that, simply by definition (no denormals ftw).

edit: yeah, it does look like it's single cycle for the MUL, using all 4 pipelines fused. Most interesting. Now I'm dying to know the internal structure. I wonder if there are more stages when doing DP...

Last edited by Farhan; 27-Mar-2008 at 02:56.

27-Mar-2008, 03:53   #117
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

Quote:
 Originally Posted by Farhan I'm a bit confused now. Are we talking about a MUL or a MAD here? If it's a MUL then you don't have to (variable) align anything.
I've been talking about MUL, but keeping my eye on the fact that the pipeline must also be able to produce a MAD at the same rate. The way I see it, when p1 and p2 are added, each is a 54-bit operand with potentially very different exponents. Imagine p1 and p2 are operands to an ADD instruction (running on X and Y lanes).

Quote:
 I don't get what you mean by the "count of significant bits in p2". In the worst case, you're adding one 54bit number to a 54+27bit number. So ok, you can get away with just an incrementer for the top 27 bits, but that's still quite a bit more than just a 54 bit add. This is assuming no rounding at all, meaning i always drop the last 27 bits from p1. Also, p2 is is always going to have a 1 in the MSB or a 1 in the bit after that, simply by definition (no denormals ftw).
Hmm, I get the feeling I've got something backwards. I think I need to sleep on this.

If anyone works this out, do you think Mike will say?

Jawed

27-Mar-2008, 14:20   #118
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

This page nicely explains the denormalisation of the smaller operand that's required to produce equivalent exponents in the two operands: http://www.cs.umd.edu/class/sum2003/.../addFloat.html

Jawed
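The operation the linked page describes can be sketched on integer mantissas (the helper is hypothetical and ignores rounding, signs and re-normalisation of the result):

```python
def fp_add(m1, e1, m2, e2):
    """Add two (mantissa, exponent) pairs by aligning to the larger exponent."""
    if e1 < e2:                     # make the first operand the larger one
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= e1 - e2                  # denormalise the smaller operand
    return m1 + m2, e1              # result carries the larger exponent

# 12*2^3 + 8*2^1 = 96 + 16 = 112, and (12 + (8 >> 2)) * 2^3 = 14*8 = 112
m, e = fp_add(0b1100, 3, 0b1000, 1)
assert (m << e) == 112
```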
27-Mar-2008, 17:15   #119
Farhan
Member

Join Date: May 2005
Posts: 152

Quote:
 Originally Posted by Jawed This page nicely explains the denormalisation of the smaller operand that's required to produce equivalent exponents in the two operands: http://www.cs.umd.edu/class/sum2003/.../addFloat.html Jawed
Yeah, obviously you have to do that for an ADD. I was just talking about the MUL.

27-Mar-2008, 19:23   #120
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

Quote:
 Originally Posted by Farhan Yeah, obviously you have to do that for an ADD. I was just talking about the MUL.
What I'm proposing is that the final stage for the MUL is a pipelined-add, for p1+p2. Hence the exponent adjustment and trading of significant bits in p2 against bits of p1.

You queried this addition earlier saying it needs to be done at 54+27 bits precision, but I hope I've shown that treating it as a normal floating point add (de-normalising: shifting one operand and modifying the exponent) allows it to be performed with only 54 bits (for a 53 bit final result).

Jawed

27-Mar-2008, 19:33   #121
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

Something different: any chance that modifying/widening the DP4 paths will provide the requisite stages?

Jawed
27-Mar-2008, 23:02   #122
Farhan
Member

Join Date: May 2005
Posts: 152

Quote:
 Originally Posted by Jawed What I'm proposing is that the final stage for the MUL is a pipelined-add, for p1+p2. Hence the exponent adjustment and trading of significant bits in p2 against bits of p1. You queried this addition earlier saying it needs to be done at 54+27 bits precision, but I hope I've shown that treating it as a normal floating point add (de-normalising: shifting one operand and modifying the exponent) allows it to be performed with only 54 bits (for a 53 bit final result). Jawed
Regardless of whether it's a pipelined add, you can't do that shifting thing for the MUL because it would be incorrect. The alignment of p1 and p2 is always fixed (they are not 2 completely independent FP numbers; think of them as having a shared exponent). The addition is always between the top 54 bits of p1 and the bottom 54 bits of p2, with the carry having to propagate all the way to the MSB of p2 (27 bits).

28-Mar-2008, 03:58   #123
Jawed
Regular

Join Date: Oct 2004
Location: London
Posts: 9,948

Quote:
 Originally Posted by Farhan Regardless of whether it's a pipelined add, you can't do that shifting thing for the MUL because that would be incorrect. The alignment for p1 and p2 is always fixed (they are not 2 completely independent FP numbers, think of them as having a shared exponent).
I've diagrammed a possible set of exponents:

Code:
```Blo         27
Alo         27
---
w55
Bhi       53
Alo       27
---
z81
---
Z+W
=====
z82    partial sum 1
=====

Blo       27
Ahi       53
---
y81
Bhi     53
Ahi     53
---
107
---
X+Y
=====
108    partial sum 2
=====

p1       z82
p2    +108
=======
109
=======```
For the sake of clarity, both A and B have exponent 53. When split into hi and lo parts, the hi parts keep their exponent, 53, while the lo parts are normalised to exponent 27 (though it could be lower for either of them). I've then worked through the multiplications and additions, calculating the maximum value of each of the resulting exponents.

Doing this I think I've understood my mistake. When I said "the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27" that's wrong, it should be the difference in exponents as there's always 54 significant bits in p2.

---

My suggestion is the addition, p1+p2, is done on the final adder in the pipeline (in lanes X and Y). This adder is required to perform a DADD instruction, so in this case it is also used for p1+p2. Since DADD has to support two 53-bit operands by being a 54-bit adder, the addition of p1+p2, 27 bits + 54 bits requires no extra hardware dedicated to MUL.

So, what I'm thinking is that a conventional single precision DP4 needs to perform a final ADD on 4 MULs. So the DP4 instruction requires a 4-operand adder. I'm wondering if this same adder can also support:
• DMUL p1, p2
C comes from A*B+C.

Does DP4 work like that, though?

Jawed

25-May-2008, 13:35   #124
itaru
Member

Join Date: May 2007
Posts: 134

http://forums.amd.com/forum/messagev...&enterthread=y

AMD Stream SDK v1.1-beta Now Available For Download

The AMD Stream Team is pleased to announce the availability of AMD Stream SDK v1.1-beta! The installation files are available for immediate download from: FTP Download Site For AMD Stream SDK v1.1-beta. The AMD Stream Computing website will be updated in the next few days to reflect this new release.

With v1.1-beta comes:
- AMD FireStream 9170 support
- Linux support (RHEL 5.1 and SLES 10 SP1)
- Brook+ integer support
- Brook+ #line number support for easier .br file debugging
- Various bug fixes and runtime enhancements
- Preliminary Microsoft Visual Studio 2008 support

If you have any questions, please do not hesitate to post your question to the forum.

Sincerely,
AMD Stream Team
07-Jun-2008, 15:22   #125
wingless
Junior Member

Join Date: Aug 2007
Location: Houston, Texas
Posts: 79

Quote:
 Originally Posted by itaru http://forums.amd.com/forum/messagev...&enterthread=y AMD Stream SDK v1.1-beta Now Available For Download The AMD Stream Team is pleased to announce the availability of AMD Stream SDK v1.1-beta! The installation files are available for immediate download from: FTP Download Site For AMD Stream SDK v1.1-beta The AMD Stream Computing website will be updated in the next few days to reflect this new release. With v1.1-beta comes: - AMD FireStream 9170 support - Linux support (RHEL 5.1 and SLES 10 SP1) - Brook+ integer support - Brook+ #line number support for easier .br file debugging - Various bug fixes and runtime enhancements - Preliminary Microsoft Visual Studio 2008 support If you have any questions, please do not hesitate to post your question to the forum. Sincerely, AMD Stream Team
Awesome. I hope we see more ATI support in GPGPU before CUDA takes over the market.
