IBM unveils Cell roadmap

You know the Mercury Cell board is for single-precision work (180GFLOPs) on a workstation, don't you? I think the price of the Cell board is very competitive.
http://www.clearspeed.com/acceleration/technology/
The only reason I even breathed ClearSpeed in this thread is because of double precision.

Also, you're comparing Cell's peak to ClearSpeed's "average". And a ClearSpeed board has 2 of those chips on it, not one. The peak for one board is 96GFLOPs of single or double precision.

TiTech and others will be buying it for its double precision, not its single precision.

There's also GPGPU from AMD or NVidia for single precision. I can't find any prices, but they're unlikely to cost as much as $8000 while offering over 300GFLOPs peak.

Jawed
 
Also, you're comparing Cell's peak to ClearSpeed's "average". And a ClearSpeed board has 2 of those chips on it, not one. The peak for one board is 96GFLOPs of single or double precision.

DGEMM is pretty much as close to peak FLOPS as you'll ever get on a CPU.
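(As a sanity check, the standard way to measure this on any machine is to time a large GEMM and divide 2n³ by the wall time — a throwaway NumPy sketch, which just reports whatever your platform BLAS achieves:)

```python
import time
import numpy as np

def measure_dgemm_gflops(n=1024, trials=3):
    """Time an n x n double-precision matrix multiply and report GFLOP/s.

    A GEMM performs ~2*n^3 floating-point operations (one multiply and
    one add per inner-loop step), so achieved rate = 2*n^3 / elapsed.
    """
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        c = a @ b                  # dispatched to the platform BLAS dgemm
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9

print(f"~{measure_dgemm_gflops():.1f} GFLOP/s achieved in DGEMM")
```

Compare the reported number against your CPU's theoretical peak and you'll typically see GEMM landing within striking distance of it, which is the point being made above.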

Cheers
 
There's one thing that I didn't see mentioned in this thread: efficiency. I don't have first-hand experience in using ClearSpeed cards but it seems to me that they are much more likely to reach high utilization of their execution resources than Cell so the difference in raw FP throughput may not be telling of the real differences between the two approaches.
Cell's theoretical advantage is that its memory system is a fairly radical design intended to tackle the memory wall. When you compare the 100GFLOPs DP-Cell and the 100GFLOPs ClearSpeed board, the former has about 4x the bandwidth per DP FLOP (~22GB/s versus ~6GB/s). So you'd expect Cell's utilisation to degrade more gracefully as memory access patterns become more random, or as arithmetic intensity falls off.

That assumes two things though: that DP-Cell in 2008 won't have a faster XDR interface and that in 2008 ClearSpeed will still be using the same memory.
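The bandwidth-per-FLOP point is basically a roofline argument. A toy sketch using the rough numbers quoted above (my assumptions, not datasheet figures):

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Simple roofline model: performance is capped either by the
    ALUs (peak_gflops) or by memory traffic (bandwidth * intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Rough figures from the discussion above (assumptions, not specs):
cell_dp  = dict(peak_gflops=100.0, bandwidth_gbs=22.0)  # DP-Cell on XDR
cs_board = dict(peak_gflops=100.0, bandwidth_gbs=6.0)   # ClearSpeed board

for intensity in (0.5, 2.0, 8.0, 32.0):   # FLOPs per byte moved
    c = attainable_gflops(**cell_dp, flops_per_byte=intensity)
    s = attainable_gflops(**cs_board, flops_per_byte=intensity)
    print(f"{intensity:5.1f} FLOP/byte: Cell {c:6.1f}  ClearSpeed {s:6.1f} GFLOPs")
```

At low arithmetic intensity both are bandwidth-bound, but Cell's cap sits ~4x higher; only at high intensity do both reach their (equal) compute peaks — which is the "degrades more gracefully" claim in miniature.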

An associated issue is that when these boards are configured as co-processors, they're stuck behind a 4GB/s or 1GB/s interface to their node host processor. Roadrunner configures each Cell as taking a 1/8th share of a 16-lane PCI Express interface. ClearSpeed currently uses PCI-X, for 1GB/s shared by two chips (the next version will have a 16-lane PCI Express interface). So each architecture has quite different sweetspots in terms of dataset size with further constraints imposed by the quantity of memory attached to the processors.

In Roadrunner it seems that each Cell has the same amount of memory, 4GB, as its corresponding node host Opteron - so that should ameliorate the co-processor<->host bandwidth problem.
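For a feel of why the host link matters, a back-of-envelope sketch (the link rates are the rough figures from above, and the 1/8th PCI Express share is my reading of the Roadrunner config):

```python
def transfer_seconds(dataset_gb, link_gbs):
    """Time to move a dataset across the host<->co-processor link."""
    return dataset_gb / link_gbs

# Illustrative link rates from the post above (assumptions, not measurements):
links = {
    "16-lane PCI Express (full)":       4.0,        # GB/s
    "16-lane PCIe, 1/8th share":        4.0 / 8,
    "PCI-X shared by 2 chips":          1.0 / 2,
}
dataset_gb = 4.0   # e.g. filling a Cell's 4GB of attached memory
for name, gbs in links.items():
    t = transfer_seconds(dataset_gb, gbs)
    print(f"{name:28s}: {t:6.1f} s to move {dataset_gb} GB")
```

Seconds-long transfer times dwarf the compute time for many kernels, hence the different sweetspots in dataset size depending on how much memory sits on the co-processor side of the link.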

Also the architectures are very different: SPEs cannot access memory directly, whilst ClearSpeed can. However, I think that under certain assumptions about data coherency it would be easier to program an application with a sparse data set for Cell than for ClearSpeed, which AFAIK cannot do scatter-gather loads/stores like traditional vector processors. It can access arrays with variable strides and access adjacent elements, but it still relies on data being stored in a fairly regular fashion (*).
Yeah, sparse data sets seem to be something Cell is extremely good for.

Scatter/gather DMA operations with ClearSpeed seem to be much more coarse-grained: working at the page level (4KB) and to move data between the board's DDR2 and the host's system RAM.

https://support.clearspeed.com/documents/runtime_user_guide.pdf

Jawed
 
The only reason I even breathed ClearSpeed in this thread is because of double precision.

Also, you're comparing Cell's peak to ClearSpeed's "average". And a ClearSpeed board has 2 of those chips on it, not one. The peak for one board is 96GFLOPs of single or double precision.
I quoted the page for CSX600 as the page for the board doesn't mention SP at all.
http://www.clearspeed.com/products/cs_advance/
As for the GEMM performance of Cell, this paper claims Cell has 14.6 GFLOPS for DP and 204.7 GFLOPS for SP.
http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf
If the peak performance of CSX600 is 48 GFLOPS as you write, 25 GFLOPS for GEMM seems rather low.

TiTech and others will be buying it for its double precision, not its single precision.
It's you who introduced the price of the Cell board in this discussion :p The Mercury board doesn't exist outside of the market where SP performance is important. Its price looks competitive in that context.
 
http://www.clearspeed.com/products/cs_advance/
As for the GEMM performance of Cell, this paper claims Cell has 14.6 GFLOPS for DP and 204.7 GFLOPS for SP.
http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf
If the peak performance of CSX600 is 48 GFLOPS as you write, 25 GFLOPS for GEMM seems rather low.
50 GFLOPs per 25W ClearSpeed board in DGEMM versus 12.8GFLOPs per 60W board (guess: Mercury board at 2.8GHz, including 25W for 5GB of memory) on Cell. 2 GFLOPs per watt versus 0.21 GFLOPs per watt. Hmm...
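For the record, the arithmetic behind those perf-per-watt figures (board powers are guesses, as noted):

```python
def gflops_per_watt(gflops, watts):
    """Performance-per-watt ratio for an add-in board."""
    return gflops / watts

# Numbers as estimated in the post above (not vendor-published figures):
clearspeed = gflops_per_watt(50.0, 25.0)   # DGEMM on a 25W ClearSpeed board
cell_board = gflops_per_watt(12.8, 60.0)   # DP DGEMM on a ~60W Cell board

print(f"ClearSpeed: {clearspeed:.2f} GFLOPs/W")
print(f"Cell board: {cell_board:.2f} GFLOPs/W")
print(f"ratio: ~{clearspeed / cell_board:.1f}x in ClearSpeed's favour")
```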

It's you who introduced the price of the Cell board in this discussion :p The Mercury board doesn't exist outside of the market where SP performance is important. Its price looks competitive in that context.
I never introduced price - my argument is performance per watt for supercomputer add-in boards. For DP work Cell currently isn't very compelling.

In Roadrunner, the fact that eDP Cell drops in as a replacement for Cell is very compelling. They can build a small cluster now, get the software going, and swap-in and expand when eDP Cell is ready without changing the software a huge amount.

I've got no argument with the kind of boxes/applications Mercury is aiming at. Mercury is also competing with companies such as Peakstream and RapidMind who can deploy Cell or GPGPU based single-precision systems. Clearly Mercury's been around longer :smile: but people are doing real work with GPUs as well.

All very intriguing ... and I'm curious to see how long it is before a GPU does DP ... and whether it can be competitive in performance per watt.

Jawed
 
50 GFLOPs per 25W ClearSpeed board in DGEMM versus 12.8GFLOPs per 60W board (guess: Mercury board at 2.8GHz, including 25W for 5GB of memory) on Cell. 2 GFLOPs per watt versus 0.21 GFLOPs per watt. Hmm...

Looking beyond the published numbers...
Jawed, do you know what factor(s) in the ClearSpeed architecture help to achieve the high performance-per-watt number ? Is it something proprietary ?
 
Looking beyond the published numbers...
Jawed, do you know what factor(s) in the ClearSpeed architecture help to achieve the high performance-per-watt number ? Is it something proprietary ?

You can find some information here.

To achieve a low-power design, the CS301 development team combined a carefully envisioned architecture, as seen from the top-down view, with the practical engineering detail built from the bottom up. It was a matter of taking everything back to basic principles and finding a solution that looked good when viewed from the top or the bottom. Optimizing the architecture produced the greatest gains, but it's the detail that ultimately determined which approach was best.

Probably the single most important design target for the CS301 was to minimize the number of times information had to be moved and to move it efficiently. This basic approach was woven into both the architecture and implementation, from the on-chip network-which has a very simple control structure allowing distributed arbitration and clock gating-to the fundamental structure of the multithreaded array processor. Instead of centralizing the control for decision making and processing into a single unit-as with a typical microprocessor-where possible, local units make their own decisions about what processing is required. This minimizes the flow of data, control and clock signals to only the unit that is required to implement the correct functionality.

The replicated processing element played a fundamental part in achieving both the performance and the power efficiency. The microarchitecture was critical, as each decision, from control coding and distribution through to the detail of the compute elements, needed to be evaluated for efficiency. There is no trick to power-efficient design other than making sure the team understands the goal and is inspired to sweat the details and find an optimal solution. The only shortcuts in this design were based on experience and sound engineering principles plus an integrated design environment that provided fast feedback and predictability throughout the flow.

For ClearSpeed, achieving its performance and power goals meant stripping out complexity. Finding low-transistor-count solutions to each aspect of the design allowed the team to reduce the area of each component, reducing capacitance locally and, as a by-product, reducing the capacitance associated with the global control and data flow. An essential requirement of the company's approach was the ability to rapidly take new ideas through to finished layout and to validate expectations. Some ideas looked elegant as RTL but turned out to be inefficient when realized in silicon through a semicustom flow. The ideal flow needed not only to give rapid closure but also to allow the company's engineers to understand the result and modify their design strategies to work with the tools.

Originally, the company's engineers had tried using a conventional point-tool IC design flow for the CS301, but various problems caused the team to abandon that method. Timing, signal and power integrity, and routing issues prevented it from achieving design closure. The designers suspected these problems were a result of poor initial placement. The team believed that its point-tool flow was not addressing all of the issues concurrently as was needed. In addition, the point-tool flow provided no feedback, so identifying the causes of the problems was impossible.

The development team adopted a new design flow from Magma Design Automation Inc. (Santa Clara, Calif.). With Magma's Blast Fusion APX, Blast Noise and Blast Rail, the team had an integrated flow that addressed timing, signal and power integrity, and routing issues concurrently throughout the flow. This correct-by-construction approach delivered better placement and provided insight into the design that allowed the team to reduce power significantly. With Magma's system, the team's engineers could accurately and efficiently perform timing-vs.-power and area-vs.-power trade-offs at different stages of the design flow.
 
And what do they say about the Advanced CELL in 2008? The photo is too blurry to read. More SPEs?


Advanced Cell in 2008 reminds me of the Emotion Engine 2 that was due out in 2002-2003, intended for workstations only. It never materialized.

But I'm not implying that an Advanced Cell won't come out in 2008 for workstations / supercomputers.
 
2 PPEs
32 SPEs
45nm SOI
~ 1TFLOPs
2010

My guess is that in 2012 we will see the PS4 with a CELL on 32nm with ~2x that. I am curious whether the PPEs will be more robust (OOOe? How much more cache?) and how much the LS in the SPEs will have grown. Likewise, whether there will be more synergy in the Synergistic Processing Units.


I don't think PS4's CELL CPU will go beyond 2 PPEs and 32 SPEs. What would be nice is 36 SPEs on-die with 4 deactivated (or defective), leaving 32 SPEs running. The other PPE can do the OS stuff, leaving 32 SPEs for gaming use :) LS for each SPE will probably be 1 to 2 MB.

Clockspeeds won't go up by another factor of ~10x as they did from PS1 to PS2 and PS2 to PS3. I predict only a ~2x increase in clockspeed. More important will be the memory subsystem: 200 GB/sec XDR2 minimum, hopefully 400 GB/sec. I was kinda disappointed that PS3 used the low end of XDR bandwidth, since XDR1 can go to ~102 GB/sec.


Glad to see IBM is fulfilling my "bold" prediction of CELL as a platform. All the growing pains now will be offset in the future by a stable platform. Devs should be able to hit PS4 running.

I fully agree there.
 
Clockspeeds won't go up by another factor of ~10x as they did from PS1 to PS2 and PS2 to PS3. I predict only a ~2x increase in clockspeed.

What's interesting is that the roadmap predicts zero increase in clockspeed, really. Which would be a bit disappointing.

If we're limited only to area improvements and stick to similar die sizes, we indeed won't see the kind of jump we did (in terms of overall power) from last gen to this gen (ps2->ps3). Which is a bit saddening really.
 
What's interesting is that the roadmap predicts zero increase in clockspeed, really. Which would be a bit disappointing.

Where do you infer a lack of clockspeed boost? That wouldn't make much sense, as the Cell architecture from the start was made to ramp in clocks. I think the linear increases in FLOPS are just for the sake of ease of calculation.

This alone indicates that faster clocks are in the works:

The 65nm CELL Broadband Engine™ design features a dual power supply, which
enhances SRAM stability and performance using an elevated array-specific power
supply, while reducing the logic power consumption. Hardware measurements
demonstrate low-voltage operation and reduced scatter of the minimum operating
voltage. The chip operates at 6GHz at 1.3V and is fabricated in a 65nm CMOS SOI
technology.

(plus it's just logical)
 
Well, 32 of today's SPEs and 2 of today's PPEs is roughly 900GFLOPs at 3.2GHz. 3.6GHz would get us the quoted 1TFLOP for that 2010 chip.

And I agree it seems strange, and I did note that quote from the 65nm paper abstract. But the clock they might get at 65nm with 8 SPEs and the clocks they might get at 45nm with 32 SPEs may not play by the same rules.

But either way, 1Tflop is the number IBM is putting out there, which doesn't seem to indicate much of any clockspeed increase.
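For what it's worth, the arithmetic checks out under the usual assumption of 8 FLOPs/cycle per SPE (a 4-wide SIMD multiply-add) and the same for each PPE's VMX unit:

```python
def cell_peak_gflops(n_spe, n_ppe, ghz, flops_per_cycle=8):
    """Peak single-precision throughput, assuming each SPE (and each
    PPE's VMX unit) retires a 4-wide fused multiply-add per cycle."""
    return (n_spe + n_ppe) * flops_per_cycle * ghz

print(f"{cell_peak_gflops(8, 1, 3.2):.1f}")   # today's Cell: ~230 GFLOPs
print(f"{cell_peak_gflops(32, 2, 3.2):.1f}")  # 2010 chip at 3.2GHz: ~870
print(f"{cell_peak_gflops(32, 2, 3.6):.1f}")  # ~1TFLOP needs roughly 3.6GHz
```

So IBM's round 1TFLOP figure only requires a clock bump of about 12%, which is consistent with "not much of any clockspeed increase".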
 
I believe there will be a clockspeed increase for the actual PS4 CPU. They can do it and they will. Just because the roadmap, which has very little info, doesn't mention a clock increase doesn't mean there won't be one. Besides, it's only for the 2010 CPU, not the actual PS4 CPU. I think they'll go to 5 GHz or so. First-gen CELL is capable of 4.6 GHz and the SPEs were tested at over 5 GHz, so I think in 6 years they can get some extra clockspeed. But that of course won't be where the lion's share of the extra performance comes from; it'll be having 4x the SPEs and hopefully much-improved SPEs (and PPEs).
 
ClearSpeed's HPL result is not so good compared to Cell's, which uses mixed precision.

[Image: cell_high_performance_linpack.png — HPL performance chart]
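For anyone unfamiliar with the technique: "mixed precision" here means iterative refinement — do the O(n³) factorisation/solve in single precision (where Cell flies), then polish the answer with cheap double-precision residuals. A rough NumPy sketch of the idea (not the actual HPL code; a real implementation factors once and reuses the LU factors):

```python
import numpy as np

def mixed_precision_solve(a, b, iters=5):
    """Solve a @ x = b via an SP solve plus DP iterative refinement.

    The expensive O(n^3) work runs in float32; only the O(n^2)
    residual computation needs float64.
    """
    a32 = a.astype(np.float32)
    x = np.linalg.solve(a32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - a @ x                        # residual in double precision
        dx = np.linalg.solve(a32, r.astype(np.float32)).astype(np.float64)
        x += dx                              # correct the SP solution
    return x

rng = np.random.default_rng(1)
n = 200
a = rng.random((n, n)) + n * np.eye(n)       # well-conditioned test matrix
b = rng.random(n)
x = mixed_precision_solve(a, b)
print(np.max(np.abs(a @ x - b)))             # residual near DP round-off
```

For reasonably conditioned matrices the refined solution reaches double-precision accuracy while nearly all the FLOPs were single precision — which is why a chip with a huge SP:DP throughput gap benefits so much on HPL.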
 
Going back to the first page slide... it reminds me that Cell was initially shown off in early 2005, nearly 2 years before the PS3 launch, and basically the same configuration ended up in the PS3. At the time a number of people indicated a strong belief there would be 2 or as many as 4 Cells (on 65nm even) in the PS3 and that Sony's fabs should be ready in time. I mention this because while I think the PS4 could/should be more, I think this may be a clue for what they are targeting. It is really too early, and the slide doesn't say much, to get a firm picture. But it wouldn't surprise me if the PS4 Cell looked similar. Of course Sony could toss 2 Cell2s in, and when better processes become available move the two dies together. This would give some flexibility in chip usage as well as margins. I actually think this could be quite likely. When balancing the choice of either a 4PPE/64SPE Cell or 2x 2PPE/32SPE Cells clocked higher or with significantly better yields, I think it makes a lot of sense. This was not really available to the PS3 as it had only 1 PPE. But with the next Cell having quite a few replicated units, this opens up some doors.

Another thing that floated through my head is that while a 4x increase seems pretty meager based on console history, the slide doesn't tell us much about the architectural changes. E.g. would a Cell with SPEs with 512K or even 1MB of Local Store, and PPEs that are wider and with OOOe, be a better tradeoff than the current Cell architecture scaled to 4PPEs/64SPEs? Would the utilization of a more robust Cell architecture be a better investment? Likewise, do more robust SPEs (more LS, more execution units maybe?) open more avenues for new algorithms that the current SPEs are not as well designed for?

And finally, could we see MS and Sony cut back on the CPU side a little and invest more heavily in the GPU, especially as GPUs become more flexible and able to take on tasks like physics? As GPUs become more viable and versatile as general resources, maybe some investment will shift away from the CPU and toward the GPU. Of course the slide doesn't answer any of these questions and this is mindless speculation, but it's interesting to me nonetheless. On the one hand the slide is pointless, but on the other it could be pretty telling down the road. I think STI are pushing Cell as a platform and that the above slide could fit into Sony's PS4 plans in any number of ways.
 
And finally, could we see MS and Sony cut back on the CPU side a little and invest more heavily in the GPU, especially as GPUs become more flexible and able to take on tasks like physics? As GPUs become more viable and versatile as general resources, maybe some investment will shift away from the CPU and toward the GPU. Of course the slide doesn't answer any of these questions and this is mindless speculation, but it's interesting to me nonetheless. On the one hand the slide is pointless, but on the other it could be pretty telling down the road. I think STI are pushing Cell as a platform and that the above slide could fit into Sony's PS4 plans in any number of ways.

I see Sony & NVIDIA integrating future NVIDIA IP into the PS4 version of the Cell processor. The SPEs as we know them may become more NVIDIA-like, or use NVIDIA-based ALUs rather than the IBM design, thus eliminating the GPU altogether.
 
And finally, could we see MS and Sony cut back on the CPU side a little and invest more heavily in the GPU, especially as GPUs become more flexible and able to take on tasks like physics? As GPUs become more viable and versatile as general resources, maybe some investment will shift away from the CPU and toward the GPU. Of course the slide doesn't answer any of these questions and this is mindless speculation, but it's interesting to me nonetheless. On the one hand the slide is pointless, but on the other it could be pretty telling down the road. I think STI are pushing Cell as a platform and that the above slide could fit into Sony's PS4 plans in any number of ways.


It would be really interesting to define what we call physics, how easily the code for these different tasks is vectorisable to fit well on a SIMD core, and how parallelisable the whole workload is.
I remember a Gubbi test that showed very little floating-point calculation during some demos; on the other hand, fluids, or making a sheet of cloth wave in the wind, seem like very float-heavy work.
I'm a scientist (not rocket-science level) but I know that mechanics works easily with matrices.
So, given that game code at this point is still heavily serial, the real question is how much improvement we can expect from parallelising the workload and vectorising the code.

Actually I'm reading a lot of wiki links I put together in a thread (cf. signature) because I want a better understanding of some interesting threads in this forum. At this point it's already clear that it's useless to discuss an architecture without considering the code that is supposed to run on that hardware, rather than how we would like it to be.

So if some serious members could come here and explain their point of view with regard to the code, it could be interesting ;) (I know this has already been discussed, and Cell is not such an obvious solution for making game code run better, depending on people's opinions and experience.)
 
I see Sony & NVIDIA integrating future NVIDIA IP into the PS4 version of the Cell processor. The SPEs as we know them may become more NVIDIA-like, or use NVIDIA-based ALUs rather than the IBM design, thus eliminating the GPU altogether.

I more or less agree with this. That is to say, IMHO there will be no dedicated GPU in the PS4 like the PS3's RSX; I can fully see there being a 2-PPE Broadband Engine 2 CPU as a single IC. The first PPE would of course be the CPU with its 8 (give or take) SPEs. The second PPE would head a set of SPEs optimized for graphics (all that shader power says hi), with a pixel engine and CRTC. This would be similar to the Cell patent's 'Figure 6' if anyone remembers, although realized on a single IC.

We'll see what happens though, one thing that is for sure is that Sony is 99% certain to reuse the Cell architecture with the PS4, not that that is a bad thing.
 
I more or less agree with this. That is to say, IMHO there will be no dedicated GPU in the PS4 like the PS3's RSX; I can fully see there being a 2-PPE Broadband Engine 2 CPU as a single IC. The first PPE would of course be the CPU with its 8 (give or take) SPEs. The second PPE would head a set of SPEs optimized for graphics (all that shader power says hi), with a pixel engine and CRTC. This would be similar to the Cell patent's 'Figure 6' if anyone remembers, although realized on a single IC.

We'll see what happens though, one thing that is for sure is that Sony is 99% certain to reuse the Cell architecture with the PS4, not that that is a bad thing.


I think it's a given that Cell will be used in the PS4, maybe even two Cell2s... who knows. However, I'm not sure this would translate into the absence of a dedicated GPU. GPUs of DX10 capability will be quite mature by the time PS4 comes on the scene, and would be hard to go toe to toe with even with a Cell2 in your corner. Their "usable" capabilities, improved flexibility and raw power may dictate that not having a dedicated GPU of decent caliber is not an option if one wants to compete with MS's and Nintendo's next consoles. That would be a fairly big risk to take when instead Sony could have their cake and eat it too with a Cell2 and NVIDIA's current best, or whatever Sony and NVIDIA can cook up together... there's a lot of time between now and then for Sony and NVIDIA to collaborate.
 