Details trickle out on CELL processor...

Regarding the patent nAo posted... look here:

Inventors: Day, Michael Norman; (Round Rock, TX) ; Hofstee, Harm Peter; (Austin, TX) ; Johns, Charles Ray; (Austin, TX) ; Liu, Peichum Peter; (Austin, TX) ; Truong, Thuong Quang; (Austin, TX) ; Yamazaki, Takeshi; (Austin, TX)

Assignee Name and Address: International Business Machines Corporation
Armonk
NY

Sony Computer Entertainment Inc.
Tokyo
 
Looks like it's 4GHz:

[image: 011bl.jpg]
 
one said:
aaronspink said:
one said:
Brimstone said:
but I'm still very skeptical.

ISSCC is the most authoritative conference in the semiconductor academic community, and a paper without an actual running sample is not accepted. Don't confuse it with self-proclaimed press releases.

Just FYI, this isn't true.

Where and why? :rolleyes:

Last year, year before. Plenty of papers with questionable/unverifiable results make it into ISSCC, along with a lot of papers with a fairly large amount of hyperbole. While ISSCC is better than a press release, to assume there isn't a large amount of spin involved would be wrong.

Aaron Spink
speaking for myself inc.
 
Panajev2001a said:
Brimstone said:
That is an incredible clock speed. I don't think there is anything even remotely close to that speed in DSPs, CPUs, or whatever.

90 nm SOI manufacturing process + very optimized and relatively narrow stream processing unit = high clock-speed is possible.

Intel in 90nm will clock the ~470 mm^2 Itanium 2 (the 90 nm Montecito with 24+ MB of cache) at probably 2+ GHz, and the ALUs in the 3.40 GHz Pentium 4 EE are running at 6.8 GHz, so I do not see the madness about running a 4-way SIMD engine (with logic shared between FP and FX resources and other bells and whistles; I do not want to over-simplify it too much), 128x128-bit registers and 128 KB of SRAM (which probably has more than a 1-cycle load-use latency penalty) at over 4.0 GHz in 90 nm SOI technology.

From what I've read, the P4 has 2 ALUs that are double pumped. They only perform integer operations. Everything else runs at the normal clock speed.
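For a rough sense of the numbers Panajev quotes, here is a back-of-the-envelope sketch. It assumes each of the 4 SIMD lanes completes one fused multiply-add (counted as 2 FLOPs) per cycle at 4.0 GHz; that per-lane rate and the helper names are assumptions for illustration, not figures from the patent or the paper.

```python
# Back-of-the-envelope numbers for the SIMD engine described above.
# ASSUMPTION: one fused multiply-add (2 FLOPs) per lane per cycle.

def peak_gflops(lanes: int, flops_per_lane_per_cycle: int, freq_ghz: float) -> float:
    """Peak throughput = lanes x FLOPs per lane per cycle x clock (GHz)."""
    return lanes * flops_per_lane_per_cycle * freq_ghz

def register_file_bytes(num_regs: int, bits_per_reg: int) -> int:
    """Total register file capacity in bytes."""
    return num_regs * bits_per_reg // 8

spe_peak = peak_gflops(lanes=4, flops_per_lane_per_cycle=2, freq_ghz=4.0)
regfile = register_file_bytes(num_regs=128, bits_per_reg=128)
p4_alu_ghz = 3.40 * 2  # double-pumped ALUs run at twice the core clock

print(f"peak: {spe_peak} GFLOPS single precision (assumed FMA rate)")
print(f"register file: {regfile} bytes")
print(f"P4 EE fast ALU clock: {p4_alu_ghz:.1f} GHz")
```

A peak figure like this says nothing about sustained throughput, which is the same caveat raised later in the thread about clock speed not being a performance number.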
 
aaronspink said:
one said:
Where and why? :rolleyes:

Last year, year before. Plenty of papers with questionable/unverifiable results make it into ISSCC, along with a lot of papers with a fairly large amount of hyperbole. While ISSCC is better than a press release, to assume there isn't a large amount of spin involved would be wrong.

OK, I got it. IBM has had the most accepted papers at ISSCC in recent years. All the world is now filled with great malice. Now even a very simple clock frequency number is a target of hyperbole, even for a part that is expected to debut next year as an actual product under public scrutiny. Wear a tinfoil hat when you walk outside. :cry: :LOL:
 
Gubbi said:
This means that the DMA engine can have multiple outstanding transactions. MFA is going to like that; now he can explicitly (vertically) interleave threads at the software level to keep the DMA engine busy. :)

Cheers
Gubbi
Why do you think it will be up to software to interleave threads? Wouldn't that be better left to hardware? Or at least to microcode in the Power core?
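To illustrate why multiple outstanding DMA transactions matter, here is a toy timing model; it is not the CELL DMA interface. It assumes a hypothetical fixed transfer latency, a fixed amount of work per block, and `depth` buffers, where a new transfer can be issued as soon as the buffer used `depth` blocks earlier has been consumed. Whether software or hardware does the interleaving, the effect is the same: more transfers in flight hide more of the latency.

```python
def total_time(n_blocks: int, latency: int, compute: int, depth: int) -> int:
    """Toy model of streaming n_blocks through local memory with `depth` buffers.

    A block's transfer is issued once its buffer is free, i.e. once the block
    `depth` positions earlier has finished computing; the data arrives
    `latency` time units after issue, and computing on it takes `compute` units.
    Returns the time at which the last block finishes.
    """
    finish = []    # finish[i] = time block i finishes computing
    cpu_free = 0   # time the processor becomes idle
    for i in range(n_blocks):
        issue = 0 if i < depth else finish[i - depth]
        arrival = issue + latency
        start = max(arrival, cpu_free)
        cpu_free = start + compute
        finish.append(cpu_free)
    return cpu_free

# HYPOTHETICAL numbers: 100-unit transfer latency, 50 units of work per block
for depth in (1, 2, 4):
    print(depth, total_time(n_blocks=4, latency=100, compute=50, depth=depth))
```

With a single buffer every transfer's latency is exposed, giving n × (latency + compute); with enough transfers in flight the total approaches latency + n × compute.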
 
one said:
aaronspink said:
one said:
Where and why? :rolleyes:

Last year, year before. Plenty of papers with questionable/unverifiable results make it into ISSCC, along with a lot of papers with a fairly large amount of hyperbole. While ISSCC is better than a press release, to assume there isn't a large amount of spin involved would be wrong.

OK, I got it. IBM has had the most accepted papers at ISSCC in recent years. All the world is now filled with great malice. Now even a very simple clock frequency number is a target of hyperbole, even for a part that is expected to debut next year as an actual product under public scrutiny. Wear a tinfoil hat when you walk outside. :cry: :LOL:

I think aaronspink was just saying that being presented at ISSCC doesn't mean everything is verified with actual working samples. Heck, there aren't even any performance numbers for the single chip, just a clock speed, which isn't a performance number. And even a clock frequency doesn't mean everything in the chip runs at that speed. Oh, and FWIW, aaronspink might even be an EE who knows what goes on at ISSCC. Much more credible than a SONY fanboy, kinda like those mocking a PhD. ;)
 
PC-Engine said:
I think aaronspink was just saying that being presented at ISSCC doesn't mean everything is verified with actual working samples. Heck, there aren't even any performance numbers for the single chip, just a clock speed, which isn't a performance number. And even a clock frequency doesn't mean everything in the chip runs at that speed. Oh, and FWIW, aaronspink might even be an EE who knows what goes on at ISSCC. Much more credible than a SONY fanboy, kinda like those mocking a PhD. ;)

You remind me of Hitler in the final days in his bunker.
It's not true!!! :LOL:
 
JF_Aidan_Pryde said:
PC-Engine said:
I think aaronspink was just saying that being presented at ISSCC doesn't mean everything is verified with actual working samples. Heck, there aren't even any performance numbers for the single chip, just a clock speed, which isn't a performance number. And even a clock frequency doesn't mean everything in the chip runs at that speed. Oh, and FWIW, aaronspink might even be an EE who knows what goes on at ISSCC. Much more credible than a SONY fanboy, kinda like those mocking a PhD. ;)

You remind me of Hitler in the final days in his bunker.
It's not true!!! :LOL:

And you remind me of sheep... :LOL:

randycat99 said:
It's official- ISSCC is now a rumor site! (courtesy of PCe) :LOL:

Considering you believe 6.2 GFLOPS + imaginary# > 10.5 GFLOPS, it wouldn't matter if it's presented at ISSCC or not. :LOL:
 
That makes 2 topics now that you have overtly derailed with wildly irrelevant material. You must really be running scared these days to be clinging so immaturely to fractured elements of the past as some sort of defense. This is all I will address to you on this.
 
one said:
OK, I got it. IBM has had the most accepted papers at ISSCC in recent years. All the world is now filled with great malice. Now even a very simple clock frequency number is a target of hyperbole, even for a part that is expected to debut next year as an actual product under public scrutiny. Wear a tinfoil hat when you walk outside. :cry: :LOL:
Maybe he is talking from experience.
 
Cryect said:
So what about the below?

20.1 An 8GHz Floating Point Multiply
8:30 AM
W. Belluomini, D. Jamsek, A. Martin, C. McDowell, R. Montoye,
T. Nguyen, H. Ngo, J. Sawada, I. Vo, R. Datta

IBM, Austin, TX

The implementation of the mantissa portion of a floating-point multiply (54x54b) is described. The 0.124 mm² multiplier is implemented using limited switch dynamic logic and operates at speeds up to 8GHz in a 90nm SOI technology. The multiplier dissipates between 150mW and 1.8W as it scales between 2GHz and 8GHz.

Guess it must be all make believe?
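The abstract's power figures can be turned into energy per clock cycle by dividing power by frequency (mW / GHz = pJ); if the multiplier accepts one operation per cycle (an assumption), that is also the energy per multiply. The sketch then assumes dynamic power P = a·C·V²·f with the same a·C at both operating points and negligible leakage, which is an assumption since the abstract gives no supply voltages, and estimates the voltage ratio implied by going from 2 GHz to 8 GHz.

```python
import math

def energy_per_cycle_pj(power_mw: float, freq_ghz: float) -> float:
    """Energy per clock cycle: mW divided by GHz gives picojoules."""
    return power_mw / freq_ghz

low = energy_per_cycle_pj(150.0, 2.0)    # 2 GHz operating point from the abstract
high = energy_per_cycle_pj(1800.0, 8.0)  # 8 GHz operating point from the abstract

# ASSUMPTION: P = a*C*V^2*f with identical a*C at both points and no leakage,
# so (V_high / V_low)^2 = (P_high / P_low) / (f_high / f_low).
voltage_ratio = math.sqrt((1800.0 / 150.0) / (8.0 / 2.0))

print(f"{low:.0f} pJ/cycle at 2 GHz, {high:.0f} pJ/cycle at 8 GHz")
print(f"implied V_high/V_low ~ {voltage_ratio:.2f} (under the stated assumption)")
```

Power rises 12x for a 4x clock increase, so under this model the extra factor of 3 has to come from V², i.e. a supply roughly 1.7x higher at 8 GHz.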

This might be Booth encoding.

Another ISSCC paper describes a dynamic Booth double-precision multiplier designed in 90-nm SOI technology.

Booth encoding is a method of reducing the number of summands required to produce the multiplication result. This paper compares the performance/area tradeoffs for the different Booth algorithms when trees are used as the summation network. This paper shows that the simple non-Booth algorithm is not a viable design, and that currently Booth 2 is the best design. It also points out that in the future Booth 3 may offer the best performance/area ratio.

Multiplication is one of the basic arithmetic operations that constitute programs. In fact, 8.72% of all instructions in typical scientific programs are multiplies [1]. Hardware designers have recognized this and have devoted considerable silicon area to building high-speed multipliers.
Multiplication is achieved by the addition of a certain number of summands. Each summand is a chosen multiple of one of the operands (the multiplicand), based upon the value of certain bits of the other operand (the multiplier). Adding these summands with a carry propagate addition (CPA) has relatively long latency. To reduce the total time required to produce the result, a redundant form of addition, most commonly carry-save addition, is used. In carry-save addition, the summands are split into columns, and each column's addition progresses independently of the adjacent columns. Each column has a certain number of inputs called partial products. In high-speed multipliers, the partial products are added in parallel using tree structures, as shown in figure 1(a), in contrast to serially as in linear arrays. The number of adders needed to reduce the partial products is the same for both trees and arrays; the only difference is that trees have more complex interconnections.
The number of summands that must be added to give the multiplication's result can be reduced by using Booth encoding [3]. In Booth encoding, the number of summands is reduced by recoding the multiplier bits into groups that select multiples of the multiplicand. Higher-order Booth encoding reduces the number of summands further by encoding larger groups of multiplier bits, which requires a larger set of multiples to select from and consequently a more complex selection table.

http://historical.ncstrl.org/litesite-data/stan/CSL-TR-95-684.pdf
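The carry-save tree described in the excerpt can be sketched in a few lines: a 3:2 step (one full adder per bit column) turns three summands into a sum row and a carry row with the same total, and repeating that in layers leaves two rows for a single final carry propagate addition. This is a generic Wallace-style reduction on Python integers, offered as an illustration; it is not the circuit from either ISSCC paper, and the function names are hypothetical.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save step: returns (sum, carry) with a + b + c == sum + carry.

    Each bit column is added independently (a full adder per column);
    carries are shifted left one place instead of being propagated.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def reduce_tree(summands: list[int]) -> tuple[list[int], int]:
    """Compress summands with layers of 3:2 steps until two rows remain.

    Returns the two remaining rows and the number of layers used.
    """
    rows = list(summands)
    levels = 0
    while len(rows) > 2:
        nxt = []
        full = len(rows) - len(rows) % 3  # rows that form complete groups of 3
        for j in range(0, full, 3):
            nxt.extend(csa(rows[j], rows[j + 1], rows[j + 2]))
        nxt.extend(rows[full:])           # 1 or 2 leftover rows pass through
        rows = nxt
        levels += 1
    return rows, levels

values = list(range(1, 28))               # 27 example summands
rows, levels = reduce_tree(values)
final = sum(rows)                         # the single carry propagate addition
print(len(rows), levels, final == sum(values))
```

The tree depth grows only logarithmically with the number of summands (7 layers for 27, 9 for 54), which is why cutting the summand count with Booth encoding shortens the tree.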

The multiplier gets its speed-up from Booth encoding. The encoding also reduces the power consumption of the circuit.

If I'm wrong please explain.
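Here is a minimal sketch of radix-4 Booth recoding (the "Booth 2" in the excerpt), assuming a two's complement multiplier of even width. Each overlapping 3-bit group of multiplier bits becomes one digit in {-2, -1, 0, 1, 2}, so every summand is 0, ±M or ±2M (only a shift and/or negation of the multiplicand M), and a 54-bit multiplier needs 27 summands instead of 54. Booth 3 would additionally need the "hard" multiple 3M, which costs an extra adder up front in exchange for fewer summands. The function names are illustrative only.

```python
def booth_radix4_digits(y: int, width: int) -> list[int]:
    """Recode a width-bit two's complement multiplier into width // 2 digits.

    y must fit in `width` signed bits. Digit i looks at bits (2i+1, 2i, 2i-1),
    with an implicit 0 below bit 0, and equals -2*b[2i+1] + b[2i] + b[2i-1],
    which is always in {-2, ..., 2}.
    """
    if width % 2:
        raise ValueError("width must be even")
    bits = y & ((1 << width) - 1)  # two's complement bit pattern of y
    digits = []
    prev = 0                       # implicit bit b[-1] = 0
    for i in range(0, width, 2):
        lo = (bits >> i) & 1
        hi = (bits >> (i + 1)) & 1
        digits.append(-2 * hi + lo + prev)
        prev = hi
    return digits

def booth_multiply(x: int, y: int, width: int) -> int:
    """Multiply by summing one selected multiple of x per Booth digit."""
    digits = booth_radix4_digits(y, width)
    # digit d at position i contributes d * x * 4**i: 0, +-x or +-2x, shifted
    summands = [(d * x) << (2 * i) for i, d in enumerate(digits)]
    return sum(summands)

print(booth_radix4_digits(7, 4))        # prints [-1, 2], since 7 = -1 + 2*4
print(booth_multiply(-5, 7, 4), -5 * 7)
print(len(booth_radix4_digits(0, 54)))  # 27 summands for a 54-bit multiplier
```

In hardware the summands produced this way would then be fed to a carry-save tree rather than summed one after another.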
 
PC-Engine said:
I think aaronspink was just saying that being presented at ISSCC doesn't mean everything is verified with actual working samples. Heck, there aren't even any performance numbers for the single chip, just a clock speed, which isn't a performance number. And even a clock frequency doesn't mean everything in the chip runs at that speed.

You had me nodding in 110% agreement for this part.

PC-Engine said:
Oh, and FWIW, aaronspink might even be an EE who knows what goes on at ISSCC. Much more credible than a SONY fanboy, kinda like those mocking a PhD. ;)

Then you lost it here, because we've seen you argue with EEs and programmers who have worked on the consoles themselves, which means they are much more credible than "might be an EE".
 
Brimstone said:
Panajev2001a said:
Brimstone said:
That is an incredible clock speed. I don't think there is anything even remotely close to that speed in DSPs, CPUs, or whatever.

90 nm SOI manufacturing process + very optimized and relatively narrow stream processing unit = high clock-speed is possible.

Intel in 90nm will clock the ~470 mm^2 Itanium 2 (the 90 nm Montecito with 24+ MB of cache) at probably 2+ GHz, and the ALUs in the 3.40 GHz Pentium 4 EE are running at 6.8 GHz, so I do not see the madness about running a 4-way SIMD engine (with logic shared between FP and FX resources and other bells and whistles; I do not want to over-simplify it too much), 128x128-bit registers and 128 KB of SRAM (which probably has more than a 1-cycle load-use latency penalty) at over 4.0 GHz in 90 nm SOI technology.

From what I've read, the P4 has 2 ALUs that are double pumped. They only perform integer operations. Everything else runs at the normal clock speed.

Two ALUs, two AGUs, their Register File, etc... the amount of double pumped logic is not trivial :).
 
Actually only the ALUs themselves are double pumped. The two issue ports that feed these ALUs can each issue two instructions (complete with register values) per cycle.

And of course in Prescott Intel abandoned the double-pumped ALUs altogether.

Cheers
Gubbi
 
Gubbi said:
Actually only the ALUs themselves are double pumped.


I am not sure, Gubbi; I was aware that at least the AGUs were double pumped as well, but I'll check Intel's presentation again.

And of course in Prescott Intel abandoned the double-pumped ALUs altogether.

Uhm... what did they do? How did they obtain the same performance for data-dependent instructions (2 instructions with a data dependency executed in 1 external/slow clock cycle)? What solution did they use?

I was aware that they improved the slow ALU and maybe added a few tweaks to the fast ALUs... did they use that "predictive" ALU that could, 95% of the time, do the same work as the double-pumped ALUs at the same speed (in terms of normal/slow clock cycles... the other 5% of the time they would have to redo the operation)?
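The 95% figure can be turned into an expected cost with a simple model. The sketch assumes, hypothetically, that a correct fast-path result costs half a core cycle (matching the double-pumped case of two dependent instructions per slow cycle) and that a misprediction adds one full core cycle to redo the operation; neither penalty comes from Intel.

```python
def expected_latency(p_hit: float, fast_latency: float, replay_penalty: float) -> float:
    """Average latency when a fraction p_hit of operations finish on the fast
    path and the rest pay an extra replay_penalty to redo the operation."""
    return fast_latency + (1.0 - p_hit) * replay_penalty

# ASSUMED numbers, in core (slow) clock cycles
avg = expected_latency(p_hit=0.95, fast_latency=0.5, replay_penalty=1.0)
print(f"average latency ~ {avg:.2f} core cycles vs 0.5 for double pumping")
```

Under these assumptions the predictive scheme would lose about 10% on dependent chains; a larger replay penalty scales that loss linearly.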
 