Clearspeed announces CELL-like processor

This new processor by Clearspeed seems to show the viability of a CELL-type chip.

http://www.newscientist.com/news/news.jsp?id=ns99994274
New chip gives PCs supercomputing muscle


18:04 14 October 03

NewScientist.com news service

A computer chip that will enable personal computers to perform some calculations as fast as some supercomputers was unveiled on Tuesday.

Developed by ClearSpeed Technologies, based in California, the CS301 chip is capable of 25 gigaflops - 25 billion "floating point" calculations per second. These arithmetical calculations are also a common measure of computing power.

A desktop Pentium processor operates at a few hundred million flops, while some of the most powerful computers in the world operate at a few hundred gigaflops. Putting around 20 ClearSpeed chips into a few personal computers could potentially provide the sort of power normally only found in a supercomputer built from hundreds of parallel processors or specialised hardware.

The CS301 works as a supplementary component to a regular processor. A chipset carrying one or two of the chips can be plugged into a normal PC like a graphics card and perform intensive calculations on behalf of the machine's normal processor. The chip is also very power-efficient, consuming only three watts and ClearSpeed is working on a version for laptop computers.

"The goal here is to enhance supercomputers at one level," says Tom Beese, CEO of ClearSpeed. "But also to deliver a power-efficiency that means you can put a few of chips inside a laptop, running along side a Pentium, and have a gigaflop laptop."

Protein modelling

The CS301 would be especially suited to arithmetically intensive scientific applications such as protein modelling or geological data analysis. Beese says the chip is fast and efficient because it has been designed almost entirely to focus on performing mathematical calculations with around 70 per cent of its surface dedicated to number crunching.

ClearSpeed plans to start selling a PC-compatible version of the microprocessor to research companies and universities within the next few months. A price has yet to be finalised but Beese says a single chip will initially cost around $16,500.

Many supercomputers are built from large arrays of off-the-shelf processors, although there is also a growing return to the use of specialised hardware. The world's fastest supercomputer, NEC's Earth Simulator, is made from specialised components. It is theoretically capable of 35 thousand gigaflops or 35 trillion floating point operations per second.

Details of the CS301 chip will be announced at the Microprocessor Forum 2003, which takes place in California this week.

Also in Wired: http://www.wired.com/news/technology/0,1282,60791,00.html

I just discovered these articles, so I haven't done much background reading yet. But for me, this erases any doubt that 1-Tflop CELL is indeed possible. (EDIT: This means I think 1-Tflop cell is possible.)

As for the price, keep in mind that this is a low-volume specialized chip made by a small company - most of the price is R&D, not production. For CELL, that will obviously not be the case.

As I said in this post, I believe that an architecture that recognizes and exploits the massive data parallelism in computer graphics will deliver much more cost-effective performance.

You can see Clearspeed's emphasis on parallel-data processing in their previous products.

http://www.eetimes.com/semi/news/OEG20010611S0119

ClearSpeed revises graphics engine to process packets

By Chris Edwards

EE Times
June 12, 2001 (6:30 p.m. ET)


LONDON — At the Embedded Processor Forum this week, ClearSpeed Technology Ltd. (Bristol, England) will detail how it has taken an architecture originally designed to process 3D graphics and modified it to handle network packet processing at OC-768 (40-Gbit/second) data rates.
ClearSpeed, the recently renamed PixelFusion Ltd., said its original Fuzion 150 design combined embedded DRAM with a parallel processing single instruction, multiple data (SIMD) array running at 400-MHz to accelerate graphics operations. ClearSpeed's modified design will run at similar speeds, but the company said it has redesigned the array to suit common networking operations such as Layer 3 and Layer 4 packet forwarding and classification, with simultaneous support for multiple protocols.

"There are a number of innovations we have made along the way. The main one is data-dependent processing," said Simon McIntosh-Smith, architecture program manager of ClearSpeed, who is due to make a presentation Thursday (June 14) at the Network Processor Forum portion of this year's Embedded Processor Forum.


In place of the Fuzion 150's unified SIMD array is the revised array comprised of four independent processors, each of which controls 64 SIMD processing elements.

The array was split up to let each processor handle packets independently, which lets one unit handle in-depth processing without holding up simpler operations on the others.

As with the Fuzion 150, each element is made up of an arithmetic logic unit and its own area of memory.

"The four processors are completely independent. Inside each of these, the processing elements run off one instruction stream," McIntosh-Smith said. "The processing elements have a path where they can pass data to each other." This is useful in string searches such as those needed to classify packets based on their contents, he said.

"The data to be searched can be spread across the processing elements and searched in parallel," he said. "The processing elements are in a linear array. Each one talks to its left and right neighbor."
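To make the string-search description above concrete, here is a minimal Python sketch that simulates the idea: one character per processing element, all elements running the same compare in lockstep, and each element reading from its right neighbour between steps. The function name `simd_search` and the one-character-per-element layout are my own assumptions for illustration, not ClearSpeed's actual instruction set or data layout.

```python
def simd_search(data: str, pattern: str) -> list[int]:
    """Simulate a linear SIMD array with one character per processing element.

    Hypothetical illustration only: the real hardware layout is not
    described in the article.
    """
    # Each element keeps a match flag, initially true.
    match = [True] * len(data)
    # 'window' is the value each element currently sees; it starts as the
    # element's own character.
    window = list(data)
    for ch in pattern:
        # Lockstep step: every element executes the same compare on its own data.
        match = [m and w == ch for m, w in zip(match, window)]
        # Neighbour step: every element takes the value held by its right
        # neighbour; the last element has no neighbour and receives None.
        window = window[1:] + [None]
    # Elements whose flag survived mark the start of a match.
    return [i for i, m in enumerate(match) if m]


print(simd_search("abcabcab", "cab"))
```

The loop runs once per pattern character rather than once per data character, which is the point of spreading the data across the elements.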

But the elements do not have to communicate through their neighbors. "Processing elements can access data from where they like independent of other processing elements," McIntosh-Smith said. "They can load data from completely different places."

McIntosh-Smith said the on-chip interconnect and I/O engines help speed up memory-intensive operations. The ClearConnect bus supports an aggregate bandwidth of 400 Gbits/s, providing access to off-chip packet memory and a table-lookup engine.

Instead of using content-addressable memory, the table-lookup engine has 2 Mbyte of compiled SRAM and a hardware accelerator to find forwarding addresses. An I/O engine passes the data to and from the processors, and a second one is used for external direct memory access transfers.

Each processor array has "a strong element of multithreading," McIntosh-Smith said. "We use it to overlap I/O and compute cycles. It can be processing one packet while fetching the next."

The architecture does not limit each processor to 64 SIMD processing elements. "There can be up to 256 processing elements per processor," said McIntosh-Smith.

The company has built a test chip for its architecture using a 0.15-micron process.

Chris Edwards is editor of Electronics Times, EE Times' sister publication in the United Kingdom.

 
Interesting, the 25 GFLOPS must include DP FP as that would make it more impressive: the EE+GS chip consumes 8 Watts, needs only 86 mm^2 of area and the CPU delivers 6.2 GFLOPS.

Not bad considering the surface area for the chip.

The EE portion uses about half the area so if we accepted a chip around 215 mm^2 we could fit 4 EE's and 1 GS using 90 nm technology: this would mean around 24.8 GFLOPS of performance.
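The area and performance estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the post: it assumes the 86 mm^2 figure applies to the same 90 nm process, that the EE takes exactly half of that area, and that peak GFLOPS scales linearly with the number of EE cores at an unchanged clock. The function and constant names are made up for the example.

```python
# Figures quoted in the post (EE+GS single-chip part).
EE_GS_AREA_MM2 = 86.0   # combined EE+GS die area
EE_PEAK_GFLOPS = 6.2    # peak of one EE

# Assumption from the post: the EE occupies about half of the combined die.
EE_AREA_MM2 = EE_GS_AREA_MM2 / 2
GS_AREA_MM2 = EE_GS_AREA_MM2 - EE_AREA_MM2


def scaled_chip(num_ee: int, num_gs: int = 1) -> tuple[float, float]:
    """Return (area in mm^2, peak GFLOPS) for a hypothetical multi-EE chip."""
    area = num_ee * EE_AREA_MM2 + num_gs * GS_AREA_MM2
    # Assumes linear scaling with core count and no clock change.
    gflops = num_ee * EE_PEAK_GFLOPS
    return area, gflops


area, gflops = scaled_chip(4)
print(f"{area:.0f} mm^2, {gflops:.1f} GFLOPS")
```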

Of course these guys had to start from a different position and are probably using some different trick.

With such limited production, $16k for a single chip is not unreasonable: the mighty Intel is selling Pentium 4 EE chips at $800 ( this is not entirely fair, as Intel is probably making mad profits on this ) and those ship in much higher volume than this chip, which is being sold in limited quantities to research labs and universities.

These guys are also trying to get NRE costs back as well while selling these chips.

PlayStation 3 will have CELL chips produced in high volumes and they will be pushing the target 65 nm manufacturing process ( relatively large die like the original GS which was 279 mm^2 of surface area ) quite a bit.

The emphasis on highly parallel processing is a good assumption for multi-media ( 3D graphics included ) and scientific processing, which is one of the goals of CELL: these are the biggest areas in which such massive performance is needed.

If a CELL CPU averages 10-30 GOPS and 10-30 GFLOPS running a port of Microsoft Word or the Mozilla web browser I would not necessarily be very saddened as that is more than what I need to run those applications and in multi-media applications we would see CELL shine ( maturity of compilers permitting ).
 
But I do agree that PCs as we know them will disappear soon. On the consumer end, PCs have diverged into two distinct functions - media/games, and productivity. Media and games are better served by an entirely new architecture that recognizes the massive data parallelism inherent in media/games. Using a general purpose CPU will become increasingly cost-ineffective. We already use graphics ASICs to offset this trend...but the more elegant solution is to discard the PC architecture altogether.

On the other end, productivity PCs have much more computing power than they will ever need, hence the rise of mini-ATX based PCs, which focus on cost, noise, and power consumption.

My guess is PCs will specialize in these two directions - going to Media Center PCs, to set-top boxes, eventually to some ultimate all-in-one solution. Slick interface, remote control is all you need. On the other end, we will see smaller, cheaper PCs meant for office tasks. This is already happening with VIA and Transmeta based PCs and notebooks. It will continue.

Once again, compliments: this is a very interesting vision and one that I can really agree with.
 
nondescript said:
I just discovered these articles, so I haven't done much background reading yet. But for me, this erases any doubt that 1-Tflop CELL is indeed possible.

Poor kid.

Just a word of advice. I first heard of Pixelfusion back in 1999 I think when they were trying to make a 250nm IC for graphics based on the same concept. Their design failed hard (ergo ClearSpeed) and was basically destroyed by the likes of Sony and nVidia with their respective ICs.

If their architecture is anything like before in their use of functional constructs, then I applaud you for actually staking a position on this. :)

It's like looking at the P10 and then blatantly stating that the performance levels seen in the NV3x and R3x00 are impossible. But, hey, to each his own.
 
Panajev2001a said:
To be honest, 250 nm was not the best technology to introduce that kind of vision to the market.

Well, yeah. And, IMHO, you need more granularity in your constructs than what the original PixelFusion had. Stuff like Cell or the Rxxx or nVxx are just going to destroy this processor.


Also, for my buddy Megadrive who likes the big numbers - this is the reason I told you transistor count is irrelevant:

http://forums.gaming-age.com/showthread.php?threadid=60867 said:
As for Toshiba the cell in 2005 process line width 65 nano- meters (nano- the schedule which commercializes 10 hundred million parts 1)

10-hundred-million parts is 1 billion transistors when translated from pseudo-Japanese. The transistor count of CellPS3 is going to be huge by today's standards, but that's to be expected by design and is insignificant in toto due to the lithography and process technology involved. Which people are underestimating IMHO.
 
Vince said:
nondescript said:
I just discovered these articles, so I haven't done much background reading yet. But for me, this erases any doubt that 1-Tflop CELL is indeed possible.

Poor kid.

Just a word of advice. I first heard of Pixelfusion back in 1999 I think when they were trying to make a 250nm IC for graphics based on the same concept. Their design failed hard (ergo ClearSpeed) and was basically destroyed by the likes of Sony and nVidia with their respective ICs.

If their architecture is anything like before in their use of functional constructs, then I applaud you for actually staking a position on this. :)

It's like looking at the P10 and then blatantly stating that the performance levels seen in the NV3x and R3x00 are impossible. But, hey, to each his own.

I don't understand what you're saying - I said that I now think CELL can reach 1-Tflops ... and that this approach (the CELL-like approach) to achieve speed is exactly what I have been thinking about for the last one or two years. (That post I linked to was about 6 months ago, right after I found out about B3D)

Are we on the same page here?

Spelling it out clearly:
I claim that the incredible power consumption/performance of THIS chip, which takes the same approach to computing as CELL, is a good indicator that CELL will have the same kind of power/performance. I'm not saying that this chip is a competitor to CELL, I'm saying that this chip shows that the CELL approach is a viable one.

(Oh, and what do you mean by functional constructs? The processing units?)
 
nondescript said:
Are we on the same page here?

Evidently not. Argh... I've wrongly read what you wrote yet again. :) So, sorry about that - again. I think I've fucked up in conversing with you more in the last week than with everyone else over the last year. Thus, once again, I apologize.

I thought you meant, like, "erases any doubt you (as a skeptic) held that this could, somehow, be possible - so it's thus impossible" Argh, forget this.. I'm confusing myself again.
 
Vince said:
nondescript said:
Are we on the same page here?

Evidently not. Argh... I've wrongly read what you wrote yet again. :) So, sorry about that - again. I think I've fucked up in conversing with you more in the last week than with everyone else over the last year. Thus, once again, I apologize.

I thought you meant, like, "erases any doubt you (as a skeptic) held that this could, somehow, be possible"

;) That's what I thought - that sentence is a little convoluted, I agree. But if you thought THAT was what I meant, how did any of the rest of my post make sense !?...anyways, whatever. Glad things are cleared up.
 
I thought you meant the same "Seeing this I now know Cell will not hit 1tflops" when I saw your post earlier. I didn't comment, the sentence was confusing as hell.
 
Panajev2001a said:
That depends :p

That's not what I heard. Anyways, Did you look at the .pdf I posted a link to? I wonder who'd win in a patent fight ;)

Cell (e.g. Broadband Engine) is truly this thing's bigger, steroid-using brother. Twice as many FPU/FXUs arranged in better constructs and hierarchies with more resources (be it register, SRAM, eDRAM) available to it. And this isn't even looking at the differences in logic design, in which STI ranks among the best. Much better design than this - although, in all fairness, this isn't targeted at high-end 3D anymore.
 
Vince, that processor is only running at 200 MHz and is realized in 130 nm technology.

At 65 nm you could fit twice the logic in the same die area or more and if you think about a 1-2 GHz clock speed...

2x * 5 or 10 = ( rough estimate ) 10-20x the performance.

Thinking about increasing the chip's surface area ( adding more logic ) and bringing the clock up further, we can end with a 40x increase of performance ( still following the same design ).

40 * 25.6 GFLOPS = 1024 GFLOPS, or about 1 TFLOPS ;)

25.6 GFLOPS at 200 MHz means 128 FP ops per cycle.

Each PE has a dual issue FPU ( peak of 2 FP ops/cycle: 1 LOAD and 1 STORE: it does not have a bus supporting true MADD instructions for the FPU as we only have 2x32 bits busses going from the Register file to the Execution units, unless R1 = R2 * R3 + <constant> is good enough for you or maybe the diagram is slightly inaccurate ) and we have 64 PEs in this chip.

64 PEs * 1 FPU/PE * 0.2 GHz * 2 FP ops/cycle = 25.6 GFLOPS

An APU rated at 32 GFLOPS at 4 GHz does 8 operations per clock ( having four mixed FP/FX Units it is understandable as a peak value for FP and Integer Vector MADD instructions ).

8 APUs do 64 ops/cycle and 4 PEs ( with 8 APUs each ) would do 256 FP ops/cycle.
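The peak-rate arithmetic in this post can be reproduced with a short Python sketch. The numbers are the ones quoted above; the helper name `peak_gflops` is made up, and the 4 GHz Cell clock and the 4 x 8 APU layout are the figures assumed in the post, not confirmed specifications. Note that "PE" means a processing element for the CS301 but a Processor Element (a group of 8 APUs) for CELL.

```python
def peak_gflops(units: int, ops_per_cycle_per_unit: int, clock_mhz: int) -> float:
    """Peak rate = units * ops per cycle per unit * clock (MHz -> GFLOPS)."""
    return units * ops_per_cycle_per_unit * clock_mhz / 1000


# CS301 as described in the post: 64 PEs, 2 FP ops/cycle each, 200 MHz.
cs301_ops_per_cycle = 64 * 2
cs301_peak = peak_gflops(64, 2, 200)

# One APU: 32 GFLOPS at 4 GHz implies 8 ops per clock.
apu_ops_per_cycle = 32_000 // 4_000

# Hypothetical CELL layout from the post: 4 PEs with 8 APUs each.
cell_ops_per_cycle = 4 * 8 * apu_ops_per_cycle
cell_peak = peak_gflops(4 * 8, apu_ops_per_cycle, 4_000)

print(cs301_ops_per_cycle, cs301_peak)   # ops/cycle and GFLOPS for the CS301
print(cell_ops_per_cycle, cell_peak)     # ops/cycle and GFLOPS for the CELL layout
```

At these figures the CELL layout issues twice as many FP ops per cycle as the CS301, so the rest of the gap to 1 TFLOPS comes from clock speed.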

The CELL chip would basically only need twice the FP/FX Units, more e-DRAM and SRAM ( accepting also a bigger surface area for the chip ) and a frequency boost to achieve 1 TFLOPS: looking at the companies involved in CELL, the technologies ( manufacturing processes, etc... ) they are working on and the R&D budget they have, I think they can pull off 1 TFLOPS of peak performance for PlayStation 3 ( most of it might come from the CELL-based CPU and part from the GPU, or it might all come from the CPU, we will see ).

I do not like the fact that the control portion of the chip has to manage so many independent sub-processors: I would prefer to have each PE with its own PU to manage the PE's workload.
 
Panajev2001a said:
At 65 nm you could fit twice the logic in the same die area or more and if you think about a 1-2 GHz clock speed...

Only 1GHz? That must be a design fallacy of STI's. Because last I checked, Intel was talking 10GHz in 2005. (don't respond, this was tongue-in-cheek) ;)

Hey, have you ever seen a micrograph of the original EmotionEngine?
 
That must be a design fallacy of STI's.

Must be.

At least from a purely marketing point of view. I can imagine people like Chap running around a year from now going,

"only 1ghz clock? xbox2 is 3ghz... :oops: :oops: :oops:"

Although I truly doubt a 1 GHz clock..
 
Vince said:
Panajev2001a said:
At 65 nm you could fit twice the logic in the same die area or more and if you think about a 1-2 GHz clock speed...

Only 1GHz? That must be a design fallacy of STI's. Because last I checked, Intel was talking 10GHz in 2005. (don't respond, this was tongue-in-cheek) ;)

Hey, have you ever seen a micrograph of the original EmotionEngine?

I have seen it, remember the .PDF( public one, from Sony.net ) I gave you the link for with the slides about CMOS4, CMOS5 and CMOS6 ? ;)
 