IBM unveils Cell roadmap

On B3D we argue, we discuss, we try to bring facts to the table, not marketing one-liners. Now if you have something substantial to counter what I wrote, that's fine; otherwise stop here.
Well it's deployed in Japan (I'm sure you know this) and a second system is on the way. 10x the performance per watt of Cell.

There's more than one way to skin a cat.

Jawed
 
They will for sure have to beef up the PPE to orchestrate all the SPEs; I mean, if it's difficult to handle 8 SPEs manually now, I wonder how it will be to handle 32 SPEs and keep them all in sync.

The challenges of parallelism are certainly there as always, but I don't know if shying away from larger parallelism is the answer. That doesn't seem to be the way anyone is going... everyone seems to be aiming for more.

It'll be difficult for sure, but I think it's a software problem much more than anything to do with hardware, or a PPE. What can a PPE do to solve this in the general case? I don't think it can.

By the end of this generation I'm sure there'll be some developers who'll be wondering..."hmm..what if we had 12 SPUs? What if we had 18?". Indeed some might already be wondering about that.

Where hardware, or specifically the PPE, may be more relevant here is in software models where a PPE is spoon-feeding SPEs... then you could run into issues with whether a given PPE would be enough to 'support' n SPEs. But we already have those issues in Cell. You can certainly bottleneck the processor at the PPE depending on your approach (see the documented PhysX implementation on Cell). But again, that ties back to your software approach... and such approaches, with a heavy dependence on the PPE, are already not conducive to better performance.

All of that said, it's quite probable there'll be a bigger PPE (or PPEs) in such an implementation, but I would not expect them to take the emphasis off SPE independence. I guess we should also note that the roadmap calls for a dual-PPE setup.
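
To make the "PPE spoon-feeds SPEs" versus "SPEs fetch their own work" distinction concrete, here's a minimal host-side sketch in plain C with pthreads (not actual libspe/SPU code; NUM_WORKERS, NUM_JOBS, the jobs array and do_work() are made-up placeholders for illustration). The point is only that the workers claim jobs themselves, so no single control thread sits in the dispatch path:

Code:
/* Host-side sketch only: plain C + pthreads standing in for SPEs.
 * NUM_WORKERS, NUM_JOBS and do_work() are hypothetical placeholders. */
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 8
#define NUM_JOBS    64

static int jobs[NUM_JOBS];

/* Shared queue index: workers claim jobs themselves ("SPE independence"). */
static int next_job = 0;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;

static void do_work(int job) { (void)job; /* stand-in for real SPE work */ }

/* Self-scheduling worker: pulls its own next job; no central dispatcher. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        int mine = (next_job < NUM_JOBS) ? next_job++ : -1;
        pthread_mutex_unlock(&q_lock);
        if (mine < 0)
            break;                    /* queue drained */
        do_work(jobs[mine]);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_WORKERS];
    for (int i = 0; i < NUM_JOBS; i++)
        jobs[i] = i;

    /* The "spoon-feeding" style would have one control thread hand each job
     * to a specific worker and wait for it -- its throughput then caps the
     * whole machine.  Here the control thread only starts and joins workers. */
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(&t[i], NULL);

    printf("all %d jobs done by %d workers\n", NUM_JOBS, NUM_WORKERS);
    return 0;
}

On real Cell hardware the mutex-protected counter would presumably be replaced by an atomic update of a shared counter in main memory, but the scheduling idea is the same: going from 8 to 32 SPEs doesn't put any more load on the PPE, because the PPE isn't in the per-job loop.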
 
And this really is crucial.
DP weakness as compared to SP was liveable, but the memory support is a dealbreaker. Of the two, I'd point to main memory support as by far the greater weakness for scientific codes. (Bearing in mind that generalizing about scientific code is difficult.)

The Cell already supports a lot more than 512MB, IIRC it's 4GB per chip.
It's listed in one of the IBM docs.
 
Well it's deployed in Japan (I'm sure you know this) and a second system is on the way. 10x the performance per watt of Cell.

There's more than one way to skin a cat.

Jawed
LOL, so are you going to address my points or will you repeat the same marketing line over and over again?
 
The Cell already supports a lot more than 512MB, IIRC it's 4GB per chip.
It's listed in one of the IBM docs.

That's true, though the QS20 ships with 512MB per CPU. I'm sure if you wanted to order thousands of blades IBM could sort something out, but IBM are saying that future blades will support up to 32 GB; Roadrunner is planned to use less, 8 GB IIRC. Whether this will be on the board, XDIMMs or maybe FB-DIMMs isn't clear yet.
 
The Cell already supports a lot more than 512MB, IIRC it's 4GB per chip.
It's listed in one of the IBM docs.
The theoretical capabilities of a piece of silicon are nice, but you apply for funds in order to attack problems with real-life tools. If we can't buy it, it doesn't exist. When it comes to these things, the scientific community is mercilessly pragmatic.

A more mundane example is the Merom chips, which, if I don't misremember, support 38-bit addresses in hardware, but there are no chipsets for portable use that allow more than 4GB, and that doesn't look to change for the next generation. It takes support from chipset, motherboard and memory module makers, and the OS, for a CPU feature to be taken advantage of. And of course, once all of this is in place, it has to be offered as a practical solution at a somewhat competitive price.

For a lot of things, a multiCPU Opteron system running a 64-bit OS is remarkably cheap and practical.

While not all codes require large amounts of memory, chemical computation has lived in 64-bit addressing and hefty physical memories for well over a decade. YMMV obviously, but for me the Cell systems on offer have seemed very niche in terms of memory support.

The good part of this roadmap is that it suggests it isn't necessarily a bad idea to put some time into the platform after all.
 
They will for sure have to beef up the PPE to orchestrate all the SPEs; I mean, if it's difficult to handle 8 SPEs manually now, I wonder how it will be to handle 32 SPEs and keep them all in sync. Even crazier would be if several SPEs have to work on the same task. Maybe it would work if you have large groups of engineers writing specific code for specific applications/problems, but that must cost a lot of money and I don't see it being too desirable in game development. Furthermore, this would also make them more suitable for normal PCs at some point, which raises the question: does IBM have any desire whatsoever to see those things in every man's home sometime in the future, or are they only targeting supercomputers and so on?

If they increase the number of SPEs more than the number of PPEs in the next architecture, I assume it means they will also tackle the PPE bottleneck issue (e.g., give even more autonomy to SPEs at the hardware level, and/or provide better algorithms and parallel application frameworks).

I think SPEs are pretty well separated (since their interfaces must be made explicit via message passing), so one should be able to package/reuse SPE code better over time. More SPEs is therefore a definite plus for keeping up with the performance-per-watt advances. The only other issues are scheduling (a better scheduler is needed) and bandwidth contention. I agree with all the gentlemen talking about ring bus or crossbar enhancements.

Hope they will increase the local store size though. Is 32 × a larger local store (384K or 512K each) viable at all?
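
For context on why the local store size keeps coming up, here's a rough SPE-side sketch of the usual double-buffering pattern against the 256KB local store, assuming the SDK's spu_mfcio.h MFC intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all). CHUNK, n_chunks, process() and the argp layout are hypothetical placeholders, not code from any real application:

Code:
/* SPE-side sketch of double buffering: fetch the next chunk by DMA while
 * computing on the current one.  Placeholders: CHUNK, n_chunks, process(). */
#include <spu_mfcio.h>

#define CHUNK 16384   /* 16KB, the maximum size of a single DMA transfer */

/* Two DMA buffers, 128-byte aligned as the MFC prefers. */
static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

static void process(char *data, unsigned n) { (void)data; (void)n; /* real work here */ }

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    (void)speid; (void)envp;
    unsigned long long ea = argp;   /* effective address of the input stream */
    int n_chunks = 64;              /* hypothetical; would normally be passed in */
    int cur = 0;

    /* Prime the pipeline: start fetching chunk 0 on tag 0. */
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);

    for (int i = 0; i < n_chunks; i++) {
        int nxt = cur ^ 1;

        /* Kick off the next transfer before waiting on the current one,
         * so DMA overlaps with computation. */
        if (i + 1 < n_chunks)
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);

        /* Block only on the tag of the buffer we're about to use. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process((char *)buf[cur], CHUNK);
        cur = nxt;
    }
    return 0;
}

Even this toy example keeps two 16KB buffers resident at once, and real kernels typically want input, output and constant data all buffered this way, sharing 256KB with code and stack. That's the pressure behind the 384K/512K question.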
 
Some timely news from an unexpected venue (Inquirer):

INNER SOURCES at Big Blue reckon there's a run on Cell processors for Playstation 3s due to factors beyond Sony's control.

The major problem is these Cell processors are now being deployed in Big Blue blades and demand for these servers is exceeding supply, the tale goes.

Now, the mandatory INQ PS3 doom-and-gloom aside, if this has any basis in reality I find it interesting, in that it further reinforces the impression of Cell as an architecture gaining traction in the wider HPC space.
 
It's tangential to the discussion, I guess, but only a day or two ago another application was announced (Mentor Graphics - they're going to use it for optical proximity correction). Just first-hand, I know a number of researchers in universities looking to get their hands on hardware (one has ordered a PS3 in the interim :p). That surprised me; I had no idea that work was going on at all, but it was right under my nose.
 
LOL, so are you going to address my points or will you repeat the same marketing line over and over again?
Your point appears to be that Clearspeed is a toy. What's to address?

It's doing real work, 35 TFLOPs in Tokyo, clearly you're wrong.

Cell as it currently stands is useless for this application (their power consumption went down as a result of installing Clearspeed :p ).

In two years' time, I'm sure things will be different. I'm not arguing that a DP version of Cell will be useless, but the current version is not suited to big systems.

Furthermore, if you read IBM's detailed comments on what they're going to do with Cell for Roadrunner, they don't actually know. They haven't decided what configuration of DP Cell they're going to deploy. It's work in progress.

Jawed
 
I wonder how IBM's engineers feel about the next-generation CELL having to rely (in part) on the quaint K8 for at least part of the load.

It would probably be more of an ego thing than any reflection on the design, but I wonder if they won't try to remedy that situation in the future.
 
Your point appears to be that Clearspeed is a toy. What's to address?

It's doing real work, 35 TFLOPs in Tokyo, clearly you're wrong.

Cell as it currently stands is useless for this application (their power consumption went down as a result of installing Clearspeed :p ).

In two years' time, I'm sure things will be different. I'm not arguing that a DP version of Cell will be useless, but the current version is not suited to big systems.

Furthermore, if you read IBM's detailed comments on what they're going to do with Cell for Roadrunner, they don't actually know. They haven't decided what configuration of DP Cell they're going to deploy. It's work in progress.

Jawed

If Clearspeed chips are rather limited and toy-like, then who's to say they couldn't get a lot more performance out of Cells? Is that system in Tokyo hitting peak flop rates? I haven't looked into it much, but I've not seen anything that says Cell would necessarily be worse off for their application (possibly even in power consumption if they can get more out of every Cell).

No idea though. Just a thought.
 
If Clearspeed chips are rather limited and toy-like, then who's to say they couldn't get a lot more performance out of Cells?
The Clearspeed chips were designed for a specific range of tasks, for which they are very suited. The design may be limited, but toy-like would indicate it doesn't do anything useful.

Is that system in Tokyo hitting peak flop rates? I haven't looked into it much, but I've not seen anything that says Cell would necessarily be worse off for their application (possibly even in power consumption if they can get more out of every Cell).

I'm not certain the CELL is superior enough in that workload (especially with DP) to offset the fact that the Clearspeed chips clock much lower.
 
3dilettante said:
The PPE is not magically free to do whatever it wants when the SPEs are being utilized. In high-demand scenarios, a significant portion of its time is still devoted to coordination.
Key word being "technical demos". In early tech demos on PS2, the majority of R5900 time was devoted to "coordinating" the VUs, and that wasn't indicative of any magical link between the two units that required the main CPU to do that work. Later software that achieved actual high utilization did things very differently - but as always, there was a learning curve to climb before that.

Besides, there are easy ways to address the PPE bottleneck even without changing the coding approach. IBM could simply replace it with, you know, something that actually performs half decently in future iterations of Cell. I hear those MIPS CPUs go for cheap nowadays; maybe IBM engineers could go talk to them. :oops:
 
Your point appears to be that Clearspeed is a toy. What's to address?

It's doing real work, 35 TFLOPs in Tokyo, clearly you're wrong.

Cell as it currently stands is useless for this application (their power consumption went down as a result of installing Clearspeed :p ).

In two years' time, I'm sure things will be different. I'm not arguing that a DP version of Cell will be useless, but the current version is not suited to big systems.

Furthermore, if you read IBM's detailed comments on what they're going to do with Cell for Roadrunner, they don't actually know. They haven't decided what configuration of DP Cell they're going to deploy. It's work in progress.

Jawed

The Japanese supercomputers tend to take a hardware-oriented, inflexible approach targeted at very specific applications, while the US approach is software-oriented and more flexible. I suppose this reflects Japanese vs US technology strengths. Before the current IBM world's fastest supercomputer, the previous world's fastest was an exotic Japanese array processor designed to do very specific supercomputing tasks, unlike the other competing supercomputers, which were general-purpose machines.

I am not sure that comparing Cell to Clearspeed on a watt per flop basis is fair, since Cell has the PPE as a control processor, along with an on-chip ring bus, flex-io and associated logic. Clearspeed is just a DSP, and would require an external control processor and communications logic, which would consume more power. Comparing Clearspeed with SPEs with reduced local store would be more appropriate.
 
The enhanced Cell is the DP flavour only, it seems. Also no mention of smaller Cells. Presumably IBM's goals differ from Sony's and Toshiba's, whose roadmaps might include 1:4 configurations and the like? Or are IBM the sole developers of new Cell breeds?
Cell for consumer electronics is probably developed independently by Sony, as it's Sony that collaborated with Transmeta. They are working on the 65nm Cell at the STI Center in Austin, though. From the ISSCC 2007 Advance Program:
http://www.isscc.org/isscc/2007/ap/isscc2007.advanceprogram110306.pdf
18.1 Implementation of the CELL Broadband Engine™ in a 65nm SOI Technology Featuring Dual-Supply SRAM Arrays Supporting 6GHz at 1.3V
1:30 PM
J. Pille¹, C. Adams², T. Christensen², S. Cottier³, S. Ehrenreich¹, F. Kono⁴, D. Nelson², O. Takahashi³, S. Tokito⁵, O. Torreiter¹, O. Wagner¹, D. Wendel¹
¹IBM, Boeblingen, Germany
²IBM, Rochester, MN
³IBM, Austin, TX
⁴Toshiba American Electronic Components, Austin, TX
⁵Sony Computer Entertainment, Austin, TX
The 65nm CELL Broadband Engine™ design features a dual power supply, which enhances SRAM stability and performance using an elevated array-specific power supply, while reducing the logic power consumption. Hardware measurements demonstrate low-voltage operation and reduced scatter of the minimum operating voltage. The chip operates at 6GHz at 1.3V and is fabricated in a 65nm CMOS SOI technology.

As for the consumer electronics front
http://www.forbes.com/home/feeds/afx/2006/12/05/afx3230989.html
Sony to release electronic products using Cell processor next year - report
12.05.06, 6:41 PM ET

TOKYO (XFN-ASIA) - Sony Corp may install its new Cell microprocessor in a range of products, and introduce the first items possibly by the end of 2007, the Nihon Keizai Shimbun reported, quoting Stan Glasgow, president and chief operating officer of Sony Electronics Inc.

Glasgow did not mention any specific products but said that work is progressing to embed the Cell processor in some three or four major consumer electronics products, and that these are expected to go on sale at the end of 2007 or the start of 2008, according to the business daily.

BTW the adoption of Cell in special markets is quietly in progress through Mercury and IBM...
http://www.mentor.com/company/news/calibrenmopc.cfm
 
Key word being "technical demos". In early tech demos on PS2, the majority of R5900 time was devoted to "coordinating" the VUs, and that wasn't indicative of any magical link between the two units that required the main CPU to do that work. Later software that achieved actual high utilization did things very differently - but as always, there was a learning curve to climb before that.
I have already said there are ways of doing that, but the time spent by the PPE on tasks related to the SPEs remains non-zero. Quadrupling the number of SPEs as they are now designed means each of the two PPEs goes from looking after 8 SPEs to looking after 16, so whatever non-zero time each PPE spends managing them roughly doubles.

Unless you think every software problem can be programmed to avoid using the PPE entirely, there are scenarios where the PPE is pretty heavily loaded. If neither the SPEs nor the PPEs are significantly altered in the future processor, each PPE will spend double the amount of time worrying about the SPEs.

Besides, there are easy ways to address the PPE bottleneck even without changing the coding approach. IBM could simply replace it with, you know, something that actually performs half decently in future iterations of Cell. I hear those MIPS CPUs go for cheap nowadays; maybe IBM engineers could go talk to them. :oops:

I've already said that earlier in the thread.
Conceptually, the simplest approach would be to double the number of threads per PPE and add a few extra execution units.
 
I've already said that earlier in the thread.
Conceptually, the simplest approach would be to double the number of threads per PPE and add a few extra execution units.

Or maybe stick a POWER6-derived core in there.

Wide, OOO, high frequency (*and* with a low 1-cycle schedule-execute latency) and multithreaded.

Cheers
 
I have already said there are ways of doing that, but the time spent by the PPE on tasks related to the SPEs remains non-zero.
Very true, but I think the key question is, will the more mature software for the current CELL design more often be bottlenecked by the number of SPEs or by the PPE performance?
IBM will obviously be targeting some common middle ground with the new design and not optimizing for the extreme cases. They probably know what they are doing.

Another interesting question is what memory bandwidth the new 32-SPE design will require: 4× the current ~25.6 GB/s, i.e. more than 100 GB/s?
 