Old 25-Mar-2008, 03:23   #101
mhouston
A little of this and that
 
Join Date: Oct 2005
Location: Cupertino
Posts: 343
Default

I should also note that doubles are *not* done in the "t" unit; they are instead done in the XYZW units in a "fused" manner. Thus, you can execute a double precision operation in XYZW alongside a 32-bit operation in the t unit. Doubles are handled at 1/4 rate in 4/5ths of the units, so double precision peak is 1/5 of single precision peak. In practice, however, the difference can be better than 1/5th depending on the scheduling of your 32-bit ops, or worse under bandwidth/latency increases from reading/writing wider data.
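As a quick sanity check of those ratios (a sketch; the five-ALU layout is as described above, the per-clock rates are illustrative):

```python
# Peak-rate arithmetic for the description above: each VLIW slot has
# five ALUs (XYZW + t); doubles run fused across XYZW at 1/4 rate,
# while the t unit does not participate in double precision.
def dp_to_sp_peak_ratio(fused_units=4, total_units=5, dp_rate=0.25):
    sp_peak = total_units            # 1 SP result per unit per clock
    dp_peak = fused_units * dp_rate  # XYZW fused, quarter rate
    return dp_peak / sp_peak

print(dp_to_sp_peak_ratio())  # 0.2, i.e. DP peak is 1/5 of SP peak
```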
Old 25-Mar-2008, 03:30   #102
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,956
Default

Thanks for the clarification Mike. Your explanation seems to imply that there's no "intrinsic instruction" support for double precision transcendentals - presumably they're only available with some kind macro?

Jawed
Old 25-Mar-2008, 03:37   #103
Farhan
Member
 
Join Date: May 2005
Location: in the shade
Posts: 152
Default

Quote:
Originally Posted by mhouston View Post
I should also note that doubles are *not* done in the "t" unit; they are instead done in the XYZW units in a "fused" manner. [...]
What do you mean by "fused"? I'm really interested to know how doubles are actually handled.
__________________
[03:44] <thefarhan> i have exactly 128 friends right now :D
[03:45] <Jollemi> you have to teach them to remember 1MB worth of data, and see if you can run Windows 9x or Linux
Old 25-Mar-2008, 04:44   #104
mhouston

That is the secret sauce. If you play with the IL/ISA you will gain a little information on how things are done, but how the actual arch works I won't be able to tell you.

I just wanted to explain things for when people start playing with CAL and disassembling things, so they don't get confused that the doubles aren't happening in T. As for transcendentals, sin/cos and the like, they are not native in double. I'm not actually sure if we currently ship a full set of double transcendentals; I'll have to check.
Old 25-Mar-2008, 14:29   #105
Arnold Beckenbauer
Senior Member
 
Join Date: Oct 2006
Location: Germany
Posts: 1,021
Default

Quote:
Originally Posted by mhouston View Post
I should also note that doubles are *not* done in the "t" unit; they are instead done in the XYZW units in a "fused" manner. [...]

http://www.computerbase.de/news/trei...md_firestream/
But why did Giuseppe Amato, your Technical Director, say at CeBIT that it's up to 350 GFLOP/s?
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke
Eta Kooram Nah Smech!

Find Chuck Norris.
Old 25-Mar-2008, 19:35   #106
mhouston


Giuseppe either spoke incorrectly or was misquoted. We do have mixed-mode double+single precision applications that get 350 GFlops, but the upper limit for raw double performance is 1/5th the single rate, as I stated above. You can grab CAL/Brook+ and a 3850/3870 and play with this for yourself.
Old 26-Mar-2008, 03:46   #107
mhouston


I should also note that 1/5 is peak DP rate vs. SP rate doing MADs. There are cases, like DP add, in which we can do better, 2/5th rate, but raw peak DP flops is 1/5 SP flops. Another architect reminded me of the DP add performance as compared to SP add performance this afternoon.

And, despite what others may claim, we have actual hardware for doing this, it's not emulated...
Old 26-Mar-2008, 05:59   #108
Farhan


Quote:
Originally Posted by mhouston View Post
I should also note that 1/5 is peak DP rate vs. SP rate doing MADs. [...]
1/5 speed for DP MADs :O
That's faster than I expected... Now I'm really curious to know how much extra hardware was used to achieve this (my guess is wider internal adders at least, for a 1/5 speed MUL, but I can't figure out how to fit the ADD in at the same time...).
Is DP MUL also 1/5 speed?
Old 26-Mar-2008, 06:42   #109
mhouston


Let the speculation begin. Since we have exposed things all the way down to the metal, at least for the ISA, those willing to spend some time can figure out how a lot of stuff is done and what the performance of every instruction is, either ISA or IL.
Old 26-Mar-2008, 15:34   #110
Jawed


I've just downloaded the SDK so will have a play... I like the detail I'm seeing.

I want to make a guess first. When doing double-precision MUL on A and B the pipeline first splats the operands across the SIMD as follows:

X - A.hi, B.hi
Y - A.hi, B.lo
Z - A.lo, B.hi
W - A.lo, B.lo

Each lane would need to be able to do ~27 bit multiplication. The partial products can then be added at the bottom of the pipe, which needs an adder that can sum all four lanes having first bit-shifted for correct alignment in the result. This would be done as (X+Y)+(Z+W).

A DP ADD would only need to occupy 2 lanes as the cumulative ~54 bits is enough for an ADD, so X+Y lanes and Z+W lanes as two independent operations:

X - A.hi, B.hi
Y - A.lo, B.lo

Z - C.hi, D.hi
W - C.lo, D.lo

each pair of lanes then uses a bottom of pipe accumulator to sort out the carry from lo to hi.

So, I'm proposing operand-splatting, widened lanes and two add stages.
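A quick integer-mantissa sketch of this splat scheme (ignoring exponents, signs and rounding; the 27-bit split and the (X+Y)+(Z+W) combination are as proposed above):

```python
# Split each 53-bit significand into ~27-bit high and low halves, form
# the four lane products, and combine them as (X+Y)+(Z+W) with the
# appropriate shifts. Exponent/sign handling and rounding are omitted.
LO27 = (1 << 27) - 1

def split27(m):
    return m >> 27, m & LO27          # (hi, lo) halves of the significand

def mul_splat(a, b):
    ahi, alo = split27(a)
    bhi, blo = split27(b)
    x = ahi * bhi                     # lane X: hi * hi
    y = ahi * blo                     # lane Y: hi * lo
    z = alo * bhi                     # lane Z: lo * hi
    w = alo * blo                     # lane W: lo * lo
    return ((x << 54) + (y << 27)) + ((z << 27) + w)   # (X+Y)+(Z+W)

a = (1 << 52) | 0x3FFEC               # two arbitrary 53-bit significands
b = (1 << 52) | 0x12345
assert mul_splat(a, b) == a * b       # exact full-width product
```

The shifts between stages are what the real hardware would implement as the bit alignment "at the bottom of the pipe".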

Jawed
Old 26-Mar-2008, 18:15   #111
Farhan


Quote:
Originally Posted by Jawed View Post
I've just downloaded the SDK so will have a play... [...] So, I'm proposing operand-splatting, widened lanes and two add stages.
That would mean that the adders are pretty fancy if they can be combined to do a 54-bit add in a single stage for the ADD. It's cheaper than a larger multiplier, I guess, and probably cheaper than having an individual 54-bit add per pipe internally (which is what I was thinking).


However, regarding MULs, I think what you proposed is a little hard for me to buy.
The complete result for this accumulate would be 108 bits, of which you want to keep the top 54 or so (I don't know if the rounding is IEEE compliant; probably not). To do this in one cycle, that accumulator will be pretty massive. Not sure if they would spend that much extra hardware on it... What are the instruction latencies of the ADD and the MUL? Can you issue 2 DP ADDs or one DP MUL per cycle to the XYZW? My idea of it was more like 1 DP ADD over 2 cycles and 1 DP MUL over 4 cycles per X,Y,Z,W. That would require that each pipeline have a 54-bit adder and some extra exponent logic, but none of the adds would be wider than 54 bits.
Last edited by Farhan; 26-Mar-2008 at 18:40.
Old 26-Mar-2008, 20:34   #112
Jawed


I'm in way over my head, so I'm really just nudging/guessing...

Quote:
Originally Posted by Farhan View Post
That would mean that the adders are pretty fancy if they can be combined and do a 54bit add in a single stage for the ADD.
I'm assuming, naively, that the interaction of carry, sign and exponent between the two halves isn't difficult, with carry being no different from the way carry is treated within a single adder.

Quote:
It's cheaper than a larger multiplier i guess, and probably cheaper than having an individual 54bit add per pipe internally (which was what i was thinking).
I was thinking it's probably a bit more costly than a DP adder, but is preferable when you want to perform a single precision add on each half, since each lane needs independent exponent/sign handling.

Quote:
However, regarding MULs, i think what you proposed is a little hard for me to buy.
The complete result for this accumulate would be 108bits, of which you want to keep the top 54 or so (i don't know if the rounding is IEEE compliant, probably not).
I was thinking the accumulate would be operating on shifted intermediate results. Working from MSB downwards (assuming X is most significant), all 27 bits from X, 27 bits from Y shifted right with respect to X, 27 bits from Z also shifted right and the top bits from scraggy old W shifted to the bottom of the 54 bit result.

I'm guessing that doing partial sums of the partial products you can control the precision by using the correct bit-shifts between stages.

Quote:
To do this in one cycle, that accumulator will be pretty massive. Not sure if they would spend that much extra hardware on it...
This is why I split it up into two independent add stages.

Quote:
What are the instruction latencies of the ADD and the MUL?
I presume all the latencies are identical. I have to admit I've added a pile of latency with my suggestion, which is problematic.

Quote:
Can you issue 2 DP ADDs or one DP MUL per cycle to the XYZW?
Yes, Mike described the issue rate for MAD as 1 per clock producing 1 scalar result, with ADD running at 2 scalar results per clock.

Quote:
My idea of it was more like 1 DP ADD over 2 cycles and 1 DP MUL over 4 cycles per X,Y,Z,W. That would require that each pipeline have a 54bit adder and some extra exponent logic, but none of the adds will be wider than 54bits.
Aha, so what you're thinking of is "looping" with intermediate shifts and adds? So for a MUL each lane would write "least significant 32 bits" (including exponent) in one cycle then 2 cycles later the most significant 32 bits.

I'm going to try to play with IL and GPUSA soon... I'm a bit bugged by GPUSA's focus on R600, not RV670 - fingers-crossed.

Jawed
Old 27-Mar-2008, 00:39   #113
Farhan


Quote:
Originally Posted by Jawed View Post
I'm in way over my head, so I'm really just nudging/guessing... [...]
The results from each 27-bit mul have to be added like this (each hi and lo part is 27 bits)...

Code:
       |Whi|Wlo
   |Zhi|Zlo|
   |Yhi|Ylo|
Xhi|Xlo
So if you want to do (X+Y)+(Z+W), each adder has to be 54 bits wide for the (X+Y) and (Z+W) parts. To do the final add you need a 54+27-bit adder. I'm not sure they would spend that much extra hardware on it, and it's probably not the optimal way to do it, at least in my mind. If they want to do it all in a single adder tree then they probably have to pipeline it, and it would also cost a lot of hardware.

See, my idea to minimize compute hardware was to just use 1 pipeline over 4 cycles (meaning each one would work on an independent MUL), and have a 54-bit adder for the accumulate stage. So basically what you do is (((AloBlo+AloBhi)+AhiBlo)+AhiBhi); in this case you never have to do any >54-bit adds, and you can start using the lower-order bits to figure out rounding already (you don't have to keep all the bits). I was guessing that for the DP ADD they would just be register-fetch limited, so it would need 2 cycles. However, so would the MUL in this case, so it would probably not work in 4 cycles unless the inputs were wider.

Maybe it's something in between. Maybe 2 pipelines fuse over 2 cycles for the MUL and 1 cycle for the ADD. I think that would make more sense than having 4 pipelines combine for the MUL and 2 for the ADD, and it would have enough input bandwidth to have all the operands in one cycle.
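The 4-cycle sequence above can be sketched the same way as the splat scheme: with a 27-bit shift out of the accumulator between alignment steps, no add is ever wider than 54 bits, and the result still matches the top half of the full product (an integer-mantissa sketch; rounding and exponents omitted):

```python
# Sequential accumulation (((AloBlo + AloBhi) + AhiBlo) + AhiBhi) with a
# 27-bit shift between alignment steps. The shifted-out bits only ever
# affect rounding, and every intermediate sum fits in 54 bits.
LO27 = (1 << 27) - 1

def mul_sequential(a, b):
    ahi, alo = a >> 27, a & LO27
    bhi, blo = b >> 27, b & LO27
    acc = alo * blo                  # cycle 1: lo * lo
    acc = (acc >> 27) + alo * bhi    # cycle 2: shift, add lo * hi
    acc = acc + ahi * blo            # cycle 3: same alignment, add hi * lo
    acc = (acc >> 27) + ahi * bhi    # cycle 4: shift, add hi * hi
    return acc                       # top ~54 bits of the 106-bit product

a = (1 << 52) | 987654321            # two 53-bit significands
b = (1 << 52) | 123456789
assert mul_sequential(a, b) == (a * b) >> 54   # matches truncated product
```

The nested floors of the two shifts turn out to agree exactly with truncating the full product, which is why the assertion holds for any pair of 53-bit significands.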
Old 27-Mar-2008, 03:06   #114
Jawed


Sorry, I didn't explain very well that I meant the final add to be a separate, pipelined operation. This is because this final add has to align its operands based on their exponents.

Maybe this will help:

Code:
            Bl
            Al
           ---
           wWW
          Bh
          Al
         ---
         zZZ
         ---
Z+W 
         =====
         zZwWW  partial sum 1
         =====
 
          Bl
          Ah
         ---
         yYY
        Bh
        Ah
       ---
       xXX
       ---
X+Y
       =====
       xXyYY  partial sum 2
       =====
 
p1       zZwWW
p2    +xXyYY
       =======
       xXyZwWW
       =======
So, that's A*B, across lanes X, Y, Z and W. I've kept track of the carry from each lane by indicating the carry with a lower case lane name, e.g. "y".

When p1 and p2 are added, the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27. So if p2 is 54 significant bits, then only as far as the carry, "w" can be used, WW is dropped. So the final add is always within the precision of a 54-bit adder.

What I'm wondering, now, is if the multipliers in each of the four lanes have final-stage adders that can be joined in the pairings I've described. In other words, p1 can be generated as lanes Z and W do the final addition for each of their own 27-bit multiplies - p1 doesn't require an additional adder stage after Z and W have generated their multiplier results. Same for p2.

If that's the case then I haven't added any stages to the pipeline, because by default the pipeline performs MAD, with the final add being p1 + p2.

---

I've been playing with GPUSA, this is a*b in Brook+:

Code:
         x: MUL_64  R0.x,  R0.y,  R1.y      
         y: MUL_64  R0.y,  R0.y,  R1.y      
         z: MUL_64  ____,  R0.y,  R1.y      
         w: MUL_64  ____,  R0.x,  R1.x
This monster is a/b (this issues a DDIV - double division - in IL):

Code:
     35  x: FREXP_64  ____,  R0.y      
         y: FREXP_64  R123.y,  R0.x      
         z: FREXP_64  R123.z,  R0.x      
         w: FREXP_64  R123.w,  R0.x      
     36  x: FLT64_TO_FLT32  R123.x,  PV(35).w      
         y: FLT64_TO_FLT32  ____,  PV(35).z      
         z: SUB_INT  R126.z,  0.0f,  PV(35).y      
     37  x: MUL_64  R127.x,  R2.y,  R0.y      
         y: MUL_64  R127.y,  R2.y,  R0.y      
         z: MUL_64  ____,  R2.y,  R0.y      
         w: MUL_64  ____,  R2.x,  R0.x      
         t: RECIP_IEEE  R122.w,  PV(36).x      
     38  z: FLT32_TO_FLT64  R123.z,  PS(37).x      
         w: FLT32_TO_FLT64  R123.w,  R0.x      
     39  x: LDEXP_64  R126.x,  PV(38).w,  R126.z      
         y: LDEXP_64  R126.y,  PV(38).z,  R126.z      
     40  x: MULADD_64  R123.x,  R127.y,  PV(39).y,  R3.y      
         y: MULADD_64  R123.y,  R127.y,  PV(39).y,  R3.y      
         z: MULADD_64  R4.z,  R127.y,  PV(39).y,  R3.y      
         w: MULADD_64  R4.w,  R127.x,  PV(39).x,  R3.x      
     41  x: MULADD_64  R126.x,  PV(40).y,  R126.y,  R126.y      
         y: MULADD_64  R126.y,  PV(40).y,  R126.y,  R126.y      
         z: MULADD_64  R4.z,  PV(40).y,  R126.y,  R126.y      
         w: MULADD_64  R4.w,  PV(40).x,  R126.x,  R126.x      
     42  x: MULADD_64  R123.x,  R127.y,  PV(41).y,  R3.y      
         y: MULADD_64  R123.y,  R127.y,  PV(41).y,  R3.y      
         z: MULADD_64  R4.z,  R127.y,  PV(41).y,  R3.y      
         w: MULADD_64  R4.w,  R127.x,  PV(41).x,  R3.x      
     43  x: MULADD_64  R126.x,  PV(42).y,  R126.y,  R126.y      
         y: MULADD_64  R126.y,  PV(42).y,  R126.y,  R126.y      
         z: MULADD_64  R4.z,  PV(42).y,  R126.y,  R126.y      
         w: MULADD_64  R4.w,  PV(42).x,  R126.x,  R126.x      
     44  x: MUL_64  R125.x,  R1.y,  PV(43).y      
         y: MUL_64  R125.y,  R1.y,  PV(43).y      
         z: MUL_64  ____,  R1.y,  PV(43).y      
         w: MUL_64  ____,  R1.x,  PV(43).x      
     45  x: MULADD_64  R123.x,  R127.y,  PV(44).y,  R1.y      
         y: MULADD_64  R123.y,  R127.y,  PV(44).y,  R1.y      
         z: MULADD_64  R4.z,  R127.y,  PV(44).y,  R1.y      
         w: MULADD_64  R4.w,  R127.x,  PV(44).x,  R1.x      
     46  x: MULADD_64  R1.x,  PV(45).y,  R126.y,  R125.y      
         y: MULADD_64  R1.y,  PV(45).y,  R126.y,  R125.y      
         z: MULADD_64  R4.z,  PV(45).y,  R126.y,  R125.y      
         w: MULADD_64  R4.w,  PV(45).x,  R126.x,  R125.x
FREXP splits a double precision value into fraction and exponent. LDEXP combines a separate fraction and exponent into a double precision value. SUB_INT is integer subtract.
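Numerically, that listing is the classic Newton-Raphson reciprocal refinement. A rough Python rendering of the same flow (my reading of the listing; the scaling, the single-precision seed and the iteration count are assumptions, and the real MULADD_64 schedule is more subtle):

```python
import math
import struct

def ddiv(a, b):
    """Sketch of a/b in the style of the listing: FREXP_64 scales the
    divisor, RECIP_IEEE provides a single-precision seed, MULADD_64
    pairs refine it, LDEXP_64 reapplies the exponent."""
    m, e = math.frexp(b)                       # b = m * 2**e, 0.5 <= |m| < 1
    # FLT64_TO_FLT32 / RECIP_IEEE / FLT32_TO_FLT64: a ~24-bit seed
    x = struct.unpack("f", struct.pack("f", 1.0 / m))[0]
    for _ in range(2):                         # each step ~doubles the accuracy
        err = 1.0 - m * x                      # residual (one MULADD_64)
        x = x + x * err                        # x_{n+1} = x_n * (2 - m*x_n)
    q = a * x                                  # quotient estimate (MUL_64)
    q = q + (a - m * q) * x                    # final correction (MULADD_64s)
    return math.ldexp(q, -e)                   # undo the FREXP scaling
```

Two iterations from a ~24-bit seed give roughly 48 then ~96 good bits, which would explain why the listing needs two MULADD_64 pairs before the final quotient correction.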

Jawed
Old 27-Mar-2008, 03:14   #115
Jawed


Quote:
Originally Posted by Farhan View Post
Maybe it's something in between. Maybe 2 pipelines fuse over 2 cycles for the MUL and 1 cycle for the ADD. I think that would make more sense than having 4 pipelines combine for the MUL and 2 for the ADD, and it would have enough input bandwidth to have all the operands in one cycle.
I'm pretty much convinced everything is single cycle, but I can't think how to test this.

I should also post a+b:

Code:
         x: ADD_64  R0.x,  R0.y,  R1.y      
         y: ADD_64  R0.y,  R0.x,  R1.x
Note how the lanes have "reversed operands". I'm inferring that this means you can't tell anything about the internal workings from the per-lane specified operands.

Jawed
Old 27-Mar-2008, 03:42   #116
Farhan


Quote:
Originally Posted by Jawed View Post
Sorry, I didn't explain very well that I meant the final add to be a separate, pipelined operation. This is because this final add has to align its operands based on their exponents.
I'm a bit confused now. Are we talking about a MUL or a MAD here? If it's a MUL then you don't have to (variable) align anything.


Quote:
Originally Posted by Jawed View Post
Maybe this will help: [...] When p1 and p2 are added, the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27. So if p2 is 54 significant bits, then only as far as the carry, "w" can be used, WW is dropped. So the final add is always within the precision of a 54-bit adder.
I don't get what you mean by the "count of significant bits in p2". In the worst case, you're adding one 54-bit number to a 54+27-bit number. So OK, you can get away with just an incrementer for the top 27 bits, but that's still quite a bit more than just a 54-bit add. This is assuming no rounding at all, meaning I always drop the last 27 bits from p1. Also, p2 is always going to have a 1 in the MSB or a 1 in the bit after that, simply by definition (no denormals ftw).

edit: yeah, it does look like it's single cycle for the MUL, using all 4 pipelines fused. Most interesting. Now I'm dying to know the internal structure. I wonder if there are more stages when doing DP...
Last edited by Farhan; 27-Mar-2008 at 03:56.
Old 27-Mar-2008, 04:53   #117
Jawed


Quote:
Originally Posted by Farhan View Post
I'm a bit confused now. Are we talking about a MUL or a MAD here? If it's a MUL then you don't have to (variable) align anything.
I've been talking about MUL, but keeping my eye on the fact that the pipeline must also be able to produce a MAD at the same rate. The way I see it, when p1 and p2 are added, each is a 54-bit operand with potentially very different exponents. Imagine p1 and p2 are operands to an ADD instruction (running on X and Y lanes).

Quote:
I don't get what you mean by the "count of significant bits in p2". [...]
Hmm, I get the feeling I've got something backwards. I think I need to sleep on this.

If anyone works this out, do you think Mike will say?

Jawed
Old 27-Mar-2008, 15:20   #118
Jawed


This page nicely explains the denormalisation of the smaller operand that's required to produce equivalent exponents in the two operands:

http://www.cs.umd.edu/class/sum2003/.../addFloat.html
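In Python terms, the alignment step that page describes looks roughly like this (a sketch for positive operands only; rounding of the shifted-out bits is ignored):

```python
import math

def aligned_add(a, b, mant_bits=53):
    """Denormalise the smaller operand to the larger exponent, then do a
    single fixed-width integer add on the significands."""
    ma, ea = math.frexp(a)                 # a = ma * 2**ea, 0.5 <= ma < 1
    mb, eb = math.frexp(b)
    ia = int(ma * 2**mant_bits)            # integer significands
    ib = int(mb * 2**mant_bits)
    if ea < eb:                            # make (ia, ea) the larger operand
        (ia, ea), (ib, eb) = (ib, eb), (ia, ea)
    ib >>= ea - eb                         # right-shift: the denormalisation
    return math.ldexp(ia + ib, ea - mant_bits)

assert aligned_add(1.5, 0.25) == 1.75      # exact when no bits shift out
```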

Jawed
Old 27-Mar-2008, 18:15   #119
Farhan


Quote:
Originally Posted by Jawed View Post
This page nicely explains the denormalisation of the smaller operand that's required to produce equivalent exponents in the two operands:

http://www.cs.umd.edu/class/sum2003/.../addFloat.html

Jawed
Yeah, obviously you have to do that for an ADD. I was just talking about the MUL.
Old 27-Mar-2008, 20:23   #120
Jawed


Quote:
Originally Posted by Farhan View Post
Yeah, obviously you have to do that for an ADD. I was just talking about the MUL.
What I'm proposing is that the final stage for the MUL is a pipelined-add, for p1+p2. Hence the exponent adjustment and trading of significant bits in p2 against bits of p1.

You queried this addition earlier saying it needs to be done at 54+27 bits precision, but I hope I've shown that treating it as a normal floating point add (de-normalising: shifting one operand and modifying the exponent) allows it to be performed with only 54 bits (for a 53 bit final result).

Jawed
Old 27-Mar-2008, 20:33   #121
Jawed
Something different


Any chance that modifying/widening the DP4 paths will provide the requisite stages?

Jawed
Old 28-Mar-2008, 00:02   #122
Farhan


Quote:
Originally Posted by Jawed View Post
What I'm proposing is that the final stage for the MUL is a pipelined-add, for p1+p2. Hence the exponent adjustment and trading of significant bits in p2 against bits of p1.

You queried this addition earlier saying it needs to be done at 54+27 bits precision, but I hope I've shown that treating it as a normal floating point add (de-normalising: shifting one operand and modifying the exponent) allows it to be performed with only 54 bits (for a 53 bit final result).

Jawed
Regardless of whether it's a pipelined add, you can't do that shifting thing for the MUL, because that would be incorrect. The alignment of p1 and p2 is always fixed (they are not 2 completely independent FP numbers; think of them as having a shared exponent). The addition is always between the top 54 bits of p1 and the bottom 54 bits of p2, with the carry propagation having to go all the way through to the MSB of p2 (27 bits).
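A tiny illustration of that carry ripple (bit widths as in the diagrams above: an 81-bit p1 overlapping a p2 that extends 27 bits higher):

```python
# Worst case for the fixed alignment: the overlapping 54 bits are all
# ones, so adding even a single bit from p1 carries all the way through
# the top 27 bits. A plain 54-bit adder can't absorb this; you need at
# least an incrementer on the high 27 bits.
top27 = ((1 << 27) - 1) << 54        # mask for the high 27 bits
p2 = (1 << 81) - 1                   # 81-bit all-ones value
total = p2 + 1                       # +1 from p1's overlapping low bit
assert total == 1 << 81              # carry rippled out to a new MSB
assert (total & top27) != (p2 & top27)   # the high 27 bits all changed
```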
Old 28-Mar-2008, 04:58   #123
Jawed


Quote:
Originally Posted by Farhan View Post
Regardless of whether it's a pipelined add, you can't do that shifting thing for the MUL because that would be incorrect. The alignment for p1 and p2 is always fixed (they are not 2 completely independent FP numbers, think of them as having a shared exponent).
I've diagrammed a possible set of exponents:

Code:
Blo         27
Alo         27
           ---
           w55
Bhi       53
Alo       27
         ---
         z81
         ---
Z+W 
         =====
         z82    partial sum 1
         =====
 
Blo       27
Ahi       53
         ---
         y81
Bhi     53
Ahi     53
       ---
       107
       ---
X+Y
       =====
       108    partial sum 2
       =====
 
p1       z82
p2    +108
       =======
       109
       =======
For the sake of clarity, both A and B have exponent 53. When split into hi and lo parts, the hi parts keep their exponent, 53, while the lo parts are normalised to exponent 27 (though it could be lower for either of them). I've then worked through the multiplications and additions, calculating the maximum value of each of the resulting exponents.

Doing this I think I've understood my mistake. When I said "the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27", that's wrong; it should be the difference in exponents, as there are always 54 significant bits in p2.

---

My suggestion is that the addition, p1+p2, is done on the final adder in the pipeline (in lanes X and Y). This adder is required to perform a DADD instruction, so in this case it is also used for p1+p2. Since DADD has to support two 53-bit operands by being a 54-bit adder, the addition of p1+p2 (27 bits + 54 bits) requires no extra hardware dedicated to MUL.

So, what I'm thinking is that a conventional single precision DP4 needs to perform a final ADD on 4 MULs, so the DP4 instruction requires a 4-operand adder. I'm wondering if this same adder can also support:
  • DADD A, B
  • DMUL p1, p2
  • DMAD p1, p2, C
C comes from A*B+C.

Does DP4 work like that, though?

Jawed
Old 25-May-2008, 13:35   #124
itaru
Member
 
Join Date: May 2007
Posts: 138
Default

http://forums.amd.com/forum/messagev...&enterthread=y
AMD Stream SDK v1.1-beta Now Available For Download

The AMD Stream Team is pleased to announce the availability of AMD Stream SDK v1.1-beta!

The installation files are available for immediate download from:
FTP Download Site For AMD Stream SDK v1.1-beta

The AMD Stream Computing website will be updated in the next few days to reflect this new release.

With v1.1-beta comes:

- AMD FireStream 9170 support
- Linux support (RHEL 5.1 and SLES 10 SP1)
- Brook+ integer support
- Brook+ #line number support for easier .br file debugging
- Various bug fixes and runtime enhancements
- Preliminary Microsoft Visual Studio 2008 support


If you have any questions, please do not hesitate to post your question to the forum.

Sincerely,
AMD Stream Team
Old 07-Jun-2008, 15:22   #125
wingless
Junior Member
 
Join Date: Aug 2007
Location: Houston, Texas
Posts: 79
ATI

Quote:
Originally Posted by itaru View Post
AMD Stream SDK v1.1-beta Now Available For Download [...]
Awesome. I hope we see more ATI support in GPGPU before CUDA takes over the market.