Clearspeed announces CELL-like processor

...

I now think CELL can reach 1-Tflops
You shouldn't.

... and that this approach (the CELL-like approach) to achieve speed is exactly what I have been thinking about for the last one or two years.
Let us take a moment to recall why PixelFusion died in the first place....

I claim that the incredible power consumption/performance of THIS chip
Try delivering 25 GFLOPS with LINPACK....

which takes the same approach to computing as CELL
Which makes things look bad for CELL....

I'm saying that this chip shows that the CELL approach is a viable one.
PixelFusion is the reason why CELL approach doesn't work.
 
I think that for the APU's execution units and Registers 4 GHz is possible: not all the sections of the chip need to run at that speed though ;)
 
Re: ...

DeadmeatGA said:
You shouldn't

Holy F-. He descends and tells us how it is. Thanks for the reality check bud, I needed a "grounding" from someone with a history of correct predictions and knowledge.

Let us take a moment to recall why PixelFusion died in the first place....

Um, OK. Lack of funding. Lack of a grasp on the cutting-edge process of that time. Lack of a market. Lack of software behind it. Hell, the architectural constructs themselves sucked.

Which are the reasons I stated earlier that it's different from Cell.

PixelFusion is the reason why CELL approach doesn't work

Ahh yes, the Cell approach. In two years, explain to me the difference between the Cell approach and the ATI and nVidia approach, ok? lol. You know, like the analogue to how the VU is so different from a DX9 VS. :)
 
...

You left out this block of article.

Strauss warned that writing software for the chip's complex architecture might be a stumbling block, but the company has assured him that its compiler makes it easy to program.
PixelFusion fell once before, let me see what kind of new tricks they have come up with.
 
You forgot to quote it properly:

Strauss warned that writing software for the chip's complex architecture might be a stumbling block, but the company has assured him that its compiler makes it easy to program.
 
Can I quote it even more differently?

Strauss warned that writing software for the chip's complex architecture might be a stumbling block, but the company has assured him that its compiler makes it easy to program.

...and I hope you all learned something poignant from that. ;)
 
Re: ...

DeadmeatGA said:
Let us take a moment to recall why PixelFusion died in the first place....

Yes, why did it die? Given your obvious expertise, could you please tell me exactly how it died, as I am one of the uninformed?

I fail to see how a chip that does 25 Gflops using only three watts is a failure, or an indication of one. (However, I do think that flops in general are an abused metric, especially when used for peak performance, and 25 Gflops should be taken w/ a grain of salt - but it's no different from any other performance claim)
 
Panajev2001a said:
Vince, that processor is only running at 200 MHz and is realized in 130 nm technology.

At 65 nm you could fit twice the logic in the same die area or more, and if you think about a 1-2 GHz clock speed...

2x * 5 or 10 = ( rough estimate ) 10-20x the performance.

Thinking about increasing the chip's surface area ( adding more logic ) and bringing the clock up further, we can end with a 40x increase of performance ( still following the same design ).

40 * 25.6 = 1024 GFLOPS, or roughly 1 TFLOPS ;)

25.6 GFLOPS at 200 MHz means 128 FP ops per cycle.

Each PE has a dual issue FPU ( peak of 2 FP ops/cycle: 1 LOAD and 1 STORE: it does not have a bus supporting true MADD instructions for the FPU as we only have 2x32 bits busses going from the Register file to the Execution units, unless R1 = R2 * R3 + <constant> is good enough for you or maybe the diagram is slightly inaccurate ) and we have 64 PEs in this chip.

64 PEs * 1 FPU/PE * 0.2 GHz * 2 FP ops/cycle = 25.6 GFLOPS
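For anyone who wants to play with the numbers, here is a rough Python sketch of the same peak-rate arithmetic. The function name and the 40x scaling factor are only illustrations of the figures in this post, not anything taken from Clearspeed's documentation.

```python
def peak_gflops(pes, fpus_per_pe, clock_ghz, flops_per_cycle):
    """Peak GFLOPS = number of FPUs * clock in GHz * FP ops per FPU per cycle."""
    return pes * fpus_per_pe * clock_ghz * flops_per_cycle

# CS301 figures as quoted in the thread: 64 PEs, 1 dual-issue FPU each, 200 MHz.
cs301 = peak_gflops(pes=64, fpus_per_pe=1, clock_ghz=0.2, flops_per_cycle=2)

# FP ops per cycle for the whole chip (peak rate divided by the clock).
ops_per_cycle = cs301 / 0.2

# Hypothetical 40x scale-up (more logic plus a higher clock), as estimated above.
scaled = cs301 * 40

print(f"CS301 peak: {cs301:.1f} GFLOPS ({ops_per_cycle:.0f} FP ops/cycle)")
print(f"40x scale-up: {scaled:.0f} GFLOPS")
```

This is peak throughput only; it says nothing about what the chip sustains on real code.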

An APU rated at 32 GFLOPS at 4 GHz does 8 operations per clock ( having four mixed FP/FX Units it is understandable as a peak value for FP and Integer Vector MADD instructions ).

8 APUs do 64 ops/cycle and 4 PEs ( with 8 APUs each ) would do 256 FP ops/cycle.
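The same counting for the CELL configuration, as a minimal Python sketch. The constant names are made up for the example, and it assumes every APU on the chip actually reaches the 32 GFLOPS at 4 GHz rating, which is a peak figure.

```python
# Figures quoted in the post; the constant names are made up for this sketch.
APU_RATED_GFLOPS = 32.0
APU_CLOCK_GHZ = 4.0
APUS_PER_PE = 8
PES_PER_CHIP = 4

# 32 GFLOPS at 4 GHz means 8 FP ops per clock per APU.
ops_per_apu = APU_RATED_GFLOPS / APU_CLOCK_GHZ

# 8 APUs per PE, 4 PEs per chip.
ops_per_pe = ops_per_apu * APUS_PER_PE
ops_per_chip = ops_per_pe * PES_PER_CHIP

# Peak rate if all of those units run at the full 4 GHz.
chip_peak_gflops = ops_per_chip * APU_CLOCK_GHZ

print(f"{ops_per_apu:.0f} ops/cycle per APU, {ops_per_pe:.0f} per PE, {ops_per_chip:.0f} per chip")
print(f"Peak at {APU_CLOCK_GHZ:.0f} GHz: {chip_peak_gflops:.0f} GFLOPS")
```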

The CELL chip would basically only need twice the FP/FX Units, more e-DRAM and SRAM ( accepting also a bigger surface area for the chip ) and a frequency boost to achieve 1 TFLOPS: looking at the companies involved in CELL, looking at the technologies ( manufacturing processes, etc... ) they are working on and the R&D budget they have, etc... I think they can pull off 1 TFLOPS of peak performance for PlayStation 3 ( most of it might come from the CELL-based CPU and part from the GPU, or it might all come from the CPU, we will see ).

I do not like the fact that the control portion of the chip has to manage so many independent sub-processors: I would prefer to have each PE with its own PU to manage the PE's workload.

Returning the compliment ... you beat me to the punch Panajev, I was digging up the specs to do a CELL comparison. Nice post, as usual.

Just adding a few points:

The core voltage of the CS301 is 1.2V. P4 at 2.2Ghz is 1.5V (also 130nm). CS301 is a 3W chip ( or 2W according to spec sheet) - either way, it's pretty small, so I expect STI could double the logic w/o difficulty.

Let's take a look at a hypothetical CELL built with today's technology at 1.5V, 2x the logic, and 2 Ghz.

Power = (1.5/1.2)^2 * 2 * 10 * 3W = 93.75 W

For a CELL chip with about 20x better performance, or 500Gflops.

Yeah, that's really high. But that's with current fab technology, at 130nm. Going to 90nm, 65nm, and lower will bring that power down very quickly. Not to mention that the chip will be built on SOI, with Toshiba eDRAM - which has non-destructive read and better retention, which both save power.

So doing some more napkin calculations:

CELL at 1.0V, 4x logic, 2 Ghz, with a blanket power saving of 20%. (No justification right now for the 1.0V or the 20% savings - I'm headed out for dinner, and in a rush, get back to it later)

Power = (1.0/1.2)^2 *4 * 10 * 3W * 0.8 = 66.7 W

For a CELL chip with about 40x better performance, or 1Tflops.
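Both napkin estimates follow the same rule of thumb (power scales with voltage squared, times the amount of logic, times the clock), so here is a small Python sketch of it. The function names are made up for the example, and it ignores leakage and anything process-specific, same as the numbers above.

```python
def scaled_power_w(base_w, v_base, v_new, logic_x, clock_x, savings=0.0):
    """Scale a baseline power figure by (V_new / V_base)^2 * logic * clock.

    `savings` is a blanket fractional reduction (0.2 means 20% off).
    """
    voltage_factor = (v_new / v_base) ** 2
    return base_w * voltage_factor * logic_x * clock_x * (1.0 - savings)

def scaled_gflops(base_gflops, logic_x, clock_x):
    """Peak performance assumed to scale linearly with logic and clock."""
    return base_gflops * logic_x * clock_x

# Baseline: CS301 at 3 W, 1.2 V, roughly 25 GFLOPS.
today = scaled_power_w(3.0, 1.2, 1.5, logic_x=2, clock_x=10)
future = scaled_power_w(3.0, 1.2, 1.0, logic_x=4, clock_x=10, savings=0.2)

print(f"1.5 V, 2x logic, 10x clock: {today:.2f} W, {scaled_gflops(25, 2, 10)} GFLOPS")
print(f"1.0 V, 4x logic, 10x clock, 20% off: {future:.1f} W, {scaled_gflops(25, 4, 10)} GFLOPS")
```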

Let me be the first to say that this is a very, very crude estimation - but it is promising.

And also, I want to say, that 1Tflops is not a make-or-break issue w/ me. A 500Gflops CELL would be perfectly acceptable to me, since I don't think peak performance is as important as sustained performance. I think the architecture is promising, and we'll be seeing some pretty awesome stuff.
 
I see them opting for this kind of configuration: 1 GHz for busses ( or 500 MHz DDR signaling ), 1 GHz for the e-DRAM ( or 500 MHz DDR signaling ), 4 GHz for the Register file, 4 GHz for the FP/FX Units, 2-4 GHz for the SRAM based LS ( 2 GHz would mean that there would be a bit more of latency involved with LS reads and writes from the Register file: 1 LS clock would be 2 FP/FX Units clocks so there would be an extra latency cycle for LOAD/STORE instructions which touch the LS, which is acceptable ) and 1-2 GHz for the PUs.

This way you see that we do not need the whole chip running at 4 GHz and this has a direct impact on the chip's overall power consumption.

Having multiple clock domains on the chip is not impossible ( Intel has been doing it for years and they are not the only ones ) and is one of the tricks I can see the STI guys pulling off ( plus tons of other tricks I am not even thinking about at the moment ) :)
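To make the clock-domain idea concrete, a small Python sketch: for each domain it prints how many FP/FX Unit clocks pass during one clock of that domain. The dictionary and the function name are hypothetical, and the slower end of each range is picked just for illustration.

```python
# Execution units ( Register file and FP/FX Units ) at 4 GHz, as proposed above.
CORE_GHZ = 4.0

# Hypothetical pick from the ranges in this post, taking the slower options.
domains_ghz = {
    "busses": 1.0,
    "e-DRAM": 1.0,
    "Register file": 4.0,
    "FP/FX Units": 4.0,
    "LS (SRAM)": 2.0,
    "PUs": 1.0,
}

def core_clocks_per_domain_clock(domain_ghz, core_ghz=CORE_GHZ):
    """How many execution-unit clocks elapse during one clock of a domain."""
    return core_ghz / domain_ghz

for name, ghz in domains_ghz.items():
    ratio = core_clocks_per_domain_clock(ghz)
    print(f"{name:14s} {ghz:.0f} GHz -> 1 clock = {ratio:.0f} FP/FX clocks")
```

With the LS at 2 GHz the ratio is 2, which is the extra latency cycle for LOAD/STORE instructions mentioned above.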
 
Re: ...

Vince said:
PixelFusion is the reason why CELL approach doesn't work

Ahh yes, the Cell approach. In two years, explain to me the difference between the Cell approach and the ATI and nVidia approach, ok? lol. You know, like the analogue to how the VU is so different from a DX9 VS. :)

I've got to disagree with both of you. PixelFusion failed due to business reasons, not due to their product. The money ran out before their product got to market, and their partners got bought out. The same thing has happened to a number of promising technologies.

And Vince, DX9/DX10 and PS2/PS3 are very different beasts. Cell isn't just a GeforceFX on steroids, it's a fundamentally different approach. Cell has more in common with the old PixelFusion 150 chip, it's a vast blank slate of functional units (you could say Cell is in concept PixelFusion 150 on steroids).
 
So now CELL is capable of 1TFLOPS because some other company has made a 25 GFLOPS chip??? :LOL:

How many transistors are in this 25 GFLOPS chip???
 
PC-Engine said:
So now CELL is capable of 1TFLOPS because some other company has made a 25 GFLOPS chip??? :LOL:

How many transistors are in this 25 GFLOPS chip???

The transistor count can't be too high, since it's only 3W.

The reason I think CELL can do 1Tflops is because just by taking this Clearspeed chip, and clocking it to 2Ghz (10x clock speed), you can get 250 GFLOPS. That's already pretty close. Surely STI can make a chip that does four times more ops per clock cycle than a startup out of the UK...and that brings us to 1Tflop.

Obviously you don't believe it, so I'm not trying to convince you, just showing you where I'm coming from for these claims.
 
Re: ...

Josiah said:
I've got to disagree with both of you. PixelFusion failed due to business reasons, not due to their product. The money ran out before their product got to market, and their partners got bought out. The same thing has happened to a number of promising technologies.

Um, Hi! I actually stated that as a reason (more or less), among others I believe. Yet, if you look at the market they attempted entry into (high end visualization) and the emergence of the Linux Cluster which was getting big around then - you'll see that the architecture was premature for its time (eg. Implementation fallacy) and ultimately not economical (eg. Market positioning fallacy). Not to mention the original design kinda sucked.

And Vince, DX9/DX10 and PS2/PS3 are very different beasts. Cell isn't just a GeforceFX on steroids, it's a fundamentally different approach.

Hmm, tell me that in 2 years and we'll see. And, in fact, the NV3x in particular has a TCL front-end which is composed of several processing constructs that wouldn't be alien to Cell. Not to mention the day we see a Unified Shading Architecture, I think you'll see just how close they are on a fundamental level.

you could say Cell is in concept PixelFusion 150 on steroids.

I actually stated that too... I think I called it its "Bigger, steroid using, brother" which I remember because I thought it sounded cute.

PCEngine said:
How many transistors are in this 25 GFLOPS chip???

Dude, when are you going to realize that transistor count is a horrible metric for this in particular? It doesn't matter and Cell's transistor count will probably be hyper-inflated over its absolute logic count due to the eDRAM. It doesn't matter, worry about area and logic density at 65nm - too bad for you it works.

I'm going to sleep, so no more replies tonight.
 
Panajev2001a said:
I see them opting for this kind of configuration: 1 GHz for busses ( or 500 MHz DDR signaling ), 1 GHz for the e-DRAM ( or 500 MHz DDR signaling ), 4 GHz for the Register file, 4 GHz for the FP/FX Units, 2-4 GHz for the SRAM based LS ( 2 GHz would mean that there would be a bit more of latency involved with LS reads and writes from the Register file: 1 LS clock would be 2 FP/FX Units clocks so there would be an extra latency cycle for LOAD/STORE instructions which touch the LS, which is acceptable ) and 1-2 GHz for the PUs.

This way you see that we do not need the whole chip running at 4 GHz and this has a direct impact on the chip's overall power consumption.

Having multiple clock domains on the chip is not impossible ( Intel has been doing it for years and they are not the only ones ) and is one of the tricks I can see the STI guys pulling off ( plus tons of other tricks I am not even thinking about at the moment ) :)

Yeah, I know ... there was a lot of hand-waving in that post, and I wanted to simplify the estimates.

There's a lot of ways to tier the memory, exactly what permutation they choose, I dunno. You guys who have backgrounds in computer graphics would have a better idea of the memory/bandwidth requirements. Multiple clock domains aren't impossible, I agree.

The interesting thing about this Clearspeed chip is it has the same kind of tiered memory scheme CELL has, 64 byte registers, 4k memory per PE, 128 SRAM scratchpad. And the fact that it is designed for massive parallel number crunching - just exciting stuff. I wouldn't mind having THAT for a coprocessor sitting next to my CPU on my motherboard...

Do you think CELL will hit 4 Ghz? 2 Ghz is all I dare to wish for - the EE was built in an era of 600 Mhz processors, so I don't expect CELL will be 1:1 with Intel in clock speed. I think STI will sacrifice some clock speed for easier design constraints and higher yield.
 
nondescript said:
PC-Engine said:
So now CELL is capable of 1TFLOPS because some other company has made a 25 GFLOPS chip??? :LOL:

How many transistors are in this 25 GFLOPS chip???

The transistor count can't be too high, since it's only 3W.

The reason I think CELL can do 1Tflops is because just by taking this Clearspeed chip, and clocking it to 2Ghz (10x clock speed), you can get 250 GFLOPS. That's already pretty close. Surely STI can make a chip that does four times more ops per clock cycle than a startup out of the UK...and that brings us to 1Tflop.

Obviously you don't believe it, so I'm not trying to convince you, just showing you where I'm coming from for these claims.

Well going by that logic you can just take a EE and scale it to 2GHz then put 8 EE cores on a single die then poof 300 GFLOPS sounds easy huh??? :LOL: ;)


Dude, when are you going to realize that transistor count is a horrible metric for this in particular? It doesn't matter and Cell's transistor count will probably be hyper-inflated over its absolute logic count due to the eDRAM. It doesn't matter, worry about area and logic density at 65nm - too bad for you it works.

True, but the Clearspeed article doesn't mention die area either, so it doesn't say much, if anything.
 
Re: ...

Vince said:
Josiah said:
And Vince, DX9/DX10 and PS2/PS3 are very different beasts. Cell isn't just a GeforceFX on steroids, it's a fundamentally different approach.

Hmm, tell me that in 2 years and we'll see. And, in fact, the NV3x in particular has a TCL front-end which is composed of several processing constructs that wouldn't be alien to Cell. Not to mention the day we see a Unified Shading Architecture, I think you'll see just how close they are on a fundamental level.

The unification of shaders has to do with sharing of hardware resources, something Cell does not do. Cell can be compared to a giant FPGA controlled by software. The chip itself does very little inherently, it's just a fast platform for the applets ("cells") to run on. A DirectX style GPU OTOH has a clear graphics pipeline defined by silicon. With NV3X the chip is the architecture, with Cell the software is the architecture.
 
Do you think CELL will hit 4 Ghz? 2 Ghz is all I dare to wish for - the EE was built in an era of 600 Mhz processors, so I don't expect CELL will be 1:1 with Intel in clock speed. I think STI will sacrifice some clock speed for easier design constraints and higher yield.

What you say is in-line with my post which I will quote here:

I see them opting for this kind of configuration: 1 GHz for busses ( or 500 MHz DDR signaling ), 1 GHz for the e-DRAM ( or 500 MHz DDR signaling ), 4 GHz for the Register file, 4 GHz for the FP/FX Units, 2-4 GHz for the SRAM based LS ( 2 GHz would mean that there would be a bit more of latency involved with LS reads and writes from the Register file: 1 LS clock would be 2 FP/FX Units clocks so there would be an extra latency cycle for LOAD/STORE instructions which touch the LS, which is acceptable ) and 1-2 GHz for the PUs.

This way you see that we do not need the whole chip running at 4 GHz and this has a direct impact on the chip's overall power consumption.

Having multiple clock domains on the chip is not impossible ( Intel has been doing it for years and they are not the only ones ) and is one of the tricks I can see the STI guys pulling off ( plus tons of other tricks I am not even thinking about at the moment )

If you look at the current Pentium 4 clocked at 3.2 GHz, the fast ALUs, the fast AGUs, the Register file, etc... are already running with a physical clock of 6.4 GHz ( double-pumping is real and not a DDR trick ).

Only a certain portion of the CELL chip will run at 4 GHz in such a configuration, as I explained before.
 
Panajev2001a said:
The unification of shaders has to do with sharing of hardware resources, something Cell does not do.

I am not so sure that APUs will look that much different from those unified Shader Units ;)

That's a pretty general statement. I could say "EE has on-chip cache, Pentium 2 has on-chip cache, therefore they are similar architectures". In truth they're about as different as night and day.
 