IBM unveils Cell roadmap

The Japanese supercomputers tend to take a hardware-oriented, inflexible approach targeted at very specific applications, while the US approach is software-oriented and more flexible. I suppose this reflects Japanese vs US technology strengths. Before the current IBM holder of the world's-fastest title, the previous fastest supercomputer was an exotic Japanese array processor designed for very specific supercomputing tasks, unlike the other competing supercomputers, which were general-purpose machines.
That's precisely what TiTech is against. They've built a 655 node/16-core array of Opterons (10480 cores) and they've added 360 Clearspeed boards (and plan to add more), basically one board per node.
I am not sure that comparing Cell to Clearspeed on a watt per flop basis is fair, since Cell has the PPE as a control processor, along with an on-chip ring bus, flex-io and associated logic. Clearspeed is just a DSP, and would require an external control processor and communications logic, which would consume more power. Comparing Clearspeed with SPEs with reduced local store would be more appropriate.
25W for two ClearSpeed CSX600s on a board, including 1GB of memory, with each board producing a sustained 50GFLOPs in DGEMM.

The Roadrunner architecture posits Cell as a co-processor, with an Opteron as host per node. As far as I can tell this is because they want to run existing x86 code on it and hand off FP work to Cell. Or use x86 as the glue to distribute data to the Cells.

For double-precision work, Cell as a "co-processor" doesn't currently seem to be very compelling, there's stuff out there that drops in more easily and provides more performance.

My point has always been that this will change, because IBM has plans for a true DP Cell in 2008.

Cell currently seems to be deployed in applications where DP isn't important and additionally the system is designed to run a single application.

Jawed
 
25W for two ClearSpeed CSX600s on a board, including 1GB of memory, with each board producing a sustained 50GFLOPs in DGEMM.

Do they count the RAM towards the power consumption of the board? If so that would be impressive. A measurable amount of that power draw would go towards the memory.
 
I have already said there are ways of doing that, but the time spent by the PPE on tasks related to the SPEs remains non-zero. Quadrupling the number of SPEs as they are currently designed means each of the two PPEs would spend that non-zero management time on twice as many SPEs as a single PPE does today.

Unless you think every software problem can be programmed to avoid using the PPE entirely, there are scenarios where the PPE is pretty heavily loaded. If neither the SPEs nor the PPEs are significantly altered in the future processor, each PPE will spend double the amount of time worrying about the SPEs.

I've already said that earlier in the thread.
Conceptually, the most simple would be to double the number of threads per PPE and add a few extra units.

I don't think this will solve the PPE bottleneck issue, since as the problem size grows, that super-sized PPE will still suffer (where should STI draw the line? How "big" is big enough?).

As you mentioned, what you describe seems like an algorithm problem, e.g. the problem may go away if you partition it better, or even use "manager" SPEs to manage other SPEs.

The bottleneck may also be addressed by making SPEs more powerful/flexible or having a larger local store, instead of creating a design that encourages people to use the PPE more.
 
Do they count the RAM towards the power consumption of the board? If so that would be impressive. A measurable amount of that power draw would go towards the memory.
The two processors are 10W each. I paused for thought on the power consumption of the memory, but I don't know what 1GB of DDR2 consumes :oops:

http://www.sun.com/servers/coolthreads/t2000/calc/index.jsp

Seems to indicate that about 5W for 1GB of DDR2 at max load is realistic.
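So, as a rough sanity check: 2 x 10W for the chips plus ~5W for the 1GB of DDR2 gives ~25W, which lines up with the quoted board figure.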

Jawed
 
The challenges of parallelism are certainly there as always, but I don't know if shying away from larger parallelism is the answer. That doesn't seem to be the way anyone is going..everyone seems to be aiming for more.

It'll be difficult for sure, but I think it's a software problem much more than anything to do with hardware, or a PPE. What can a PPE do to solve this in the general case? I don't think it can.

By the end of this generation I'm sure there'll be some developers who'll be wondering..."hmm..what if we had 12 SPUs? What if we had 18?". Indeed some might already be wondering about that.

Where hardware, or specifically where the PPE may be more relevant here, is in software models where a PPE is spoon-feeding SPEs..then you could run into issues with whether a given PPE would be enough to 'support' n SPEs. But we already have those issues in Cell. You can certainly bottleneck the processor at the PPE depending on your approach (see the documented PhysX implementation on Cell). But again, that ties back to your software approach...and such approaches, with a heavy dependence on the PPE, are already not conducive to better performance.

All of that said, I'm sure it's quite probable there'll be a bigger PPE(s) in such an implementation..but I would not expect them to take the emphasis off SPE independence. I guess we should also note that the roadmap calls for a dual-PPE setup..

Don't get me wrong, I don't see any other way forward than more and more parallelism, to tell you the truth, and the concern I raised against Cell is actually a concern about all future parallel designs. Whether the solution is applied at the hardware level or the software level doesn't matter, but I can't see managing something like 50 independent threads manually being easy or even possible, especially in the environment of game design, where costs are already too high. The optimal thing would be if programmers didn't have to be concerned at all: just write the program and the chip does the rest. How does it work with the current multicore PCs out there right now? Who tells which core what to do?...
 
Don't get me wrong, I don't see any other way forward than more and more parallelism, to tell you the truth, and the concern I raised against Cell is actually a concern about all future parallel designs. Whether the solution is applied at the hardware level or the software level doesn't matter, but I can't see managing something like 50 independent threads manually being easy or even possible, especially in the environment of game design, where costs are already too high. The optimal thing would be if programmers didn't have to be concerned at all: just write the program and the chip does the rest. How does it work with the current multicore PCs out there right now? Who tells which core what to do?...

Yes, moving forward... I am actually more interested in SPE-friendly algorithms, Cell compiler and development framework innovation. I hope they accelerate this aspect of their business.
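On the multicore PC question: for ordinary threaded code, nobody in the application decides which core runs what - the OS scheduler does. The program just spawns threads and the scheduler places (and migrates) them across cores. A minimal sketch of that model, nothing Cell-specific and purely illustrative (compile with -pthread):

[code]
/* Minimal sketch: on a multicore PC the program just creates threads;
 * the OS scheduler decides which core runs each one (and may move them).
 * Illustrative only - not Cell code. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    /* real work would go here; core placement is entirely up to the OS */
    printf("thread %ld running on whichever core the scheduler chose\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
[/code]

The catch is that the scheduler only balances the threads you've already created - it doesn't split the problem into 50 independent pieces for you, which is the hard part being talked about here.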
 
I don't think this will solve the PPE bottleneck issue, since as the problem size grows, that super-sized PPE will still suffer (where should STI draw the line? How "big" is big enough?).

They'd only need to increase the PPE capacity or reduce the need for the PPE until something else becomes the bottleneck first.

As you mentioned, what you describe seems like an algorithm problem, e.g. the problem may go away if you partition it better, or even use "manager" SPEs to manage other SPEs.
Sometimes you need to use an algorithm with some component that requires more coordination work. Most algorithms do have a component of overhead that would exist mostly on the PPE without some creative design.

I've already mentioned using SPEs to do some of that overhead, but doing so means those SPEs will not be doing computation. I've mentioned that it is "cheaper" to do that when you use ~4 SPEs out of 32 instead of ~2-3 out of 8 as happens now in some demos.
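To put rough numbers on "cheaper": ~4 SPEs out of 32 is about 12% of the array given over to coordination, versus roughly 25-37% when it's 2-3 out of 8, so the relative cost of that overhead shrinks as the array grows.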

The bottleneck may also be addressed by making SPEs more powerful/flexible or having a larger local store, instead of creating a design that encourages people to use the PPE more.
I said that, too.

I just said that IBM would need to change something in the PPE supply/SPE demand situation to keep the proportions the same as they are now.
I said they'd need to buff up the PPEs especially if they don't do anything to the SPEs, which seems unlikely to be the case.
 
That's precisely what TiTech is against. They've built a 655 node/16-core array of Opterons (10480 cores) and they've added 360 Clearspeed boards (and plan to add more), basically one board per node.

25W for two ClearSpeed CSX600s on a board, including 1GB of memory, with each board producing a sustained 50GFLOPs in DGEMM.

The Roadrunner architecture posits Cell as a co-processor, with an Opteron as host per node. As far as I can tell this is because they want to run existing x86 code on it and hand off FP work to Cell. Or use x86 as the glue to distribute data to the Cells.

10480 Opterons and only 360 x 2 CSX600s? Doesn't sound right. This would only give you a measly 18 teraflops on the Clearspeed coprocessors. It certainly won't get close to the Roadrunner, which has 16,000 Opteron processors and 16,000 Cell processors to give 1548 theoretical teraflops (1.4 petaflops actual) (the current fastest is IBM's Blue Gene, which does 367 theoretical (280 actual achieved) teraflops with 130,000 Opteron processors). You can see the difference in computing power between the Cells and the Opterons from those numbers.

IBM's Roadrunner uses a lot of Cells and comparatively few Opterons if I remember correctly. I think the Opterons are used to limit the amount of code rewriting necessary to standard libraries, while Cell does the heavy lifting.
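Assuming the planned DP Cell lands somewhere around 100 GFLOPS apiece, 16,000 of them already account for roughly 1.6 petaflops of the quoted ~1548 theoretical teraflops, so essentially all of Roadrunner's raw throughput would come from the Cells - the Opterons are mostly there for housekeeping and existing code.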

For double-precision work, Cell as a "co-processor" doesn't currently seem to be very compelling, there's stuff out there that drops in more easily and provides more performance.

My point has always been that this will change, because IBM has plans for a true DP Cell in 2008.

IBM probably will be using the DP version of Cell for Roadrunner - it would certainly make sense to do that.

Cell currently seems to be deployed in applications where DP isn't important and additionally the system is designed to run a single application.

Jawed

HPC and games are both single application per processor scenarios. For HPC you want to break up a single task to run on many processors, not break up a single processor to run many applications.

Although the PPE and Opteron can handle multi-tasking within a processor, using that as the means of evenly distributing workload in parallel computing tasks does not make for good performance - it is better to use an algorithm that distributes tasks to a queue for each processor for sequential execution. If multiple applications are required, it is better to run the applications sequentially or simply assign a certain number of nodes to each application. It would not make sense to multi-task different applications (i.e. timeshare) within either the SPEs or the Clearspeed co-processors if it can be avoided, although you could assign a number of SPEs or Clearspeeds to each application while the PPE or Opterons multi-task.
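To make the per-processor queue idea concrete, here is a minimal sketch of static task distribution (plain C with pthreads, purely illustrative - not SPE or Clearspeed code):

[code]
/* Sketch: fill one queue per worker up front, then let each worker drain
 * its own queue sequentially - no timesharing of a compute element between
 * applications. Task/queue types here are made up for illustration. */
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4
#define NUM_TASKS   16

typedef struct { int task_ids[NUM_TASKS]; int count; } queue_t;

static queue_t queues[NUM_WORKERS];

static void *drain(void *arg)
{
    queue_t *q = arg;
    for (int i = 0; i < q->count; i++) {
        /* each task runs to completion before the next one starts */
        printf("running task %d\n", q->task_ids[i]);
    }
    return NULL;
}

int main(void)
{
    /* simple round-robin assignment; a real scheduler would balance by cost */
    for (int t = 0; t < NUM_TASKS; t++) {
        queue_t *q = &queues[t % NUM_WORKERS];
        q->task_ids[q->count++] = t;
    }

    pthread_t workers[NUM_WORKERS];
    for (int w = 0; w < NUM_WORKERS; w++)
        pthread_create(&workers[w], NULL, drain, &queues[w]);
    for (int w = 0; w < NUM_WORKERS; w++)
        pthread_join(workers[w], NULL);
    return 0;
}
[/code]

Once the queues are filled, each compute element just runs its tasks back to back.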

The sort of application where you would want lots of multi-tasking on a single processor is file and database serving, and that is one application you would not want to use Cell or Clearspeed for.

As far as power is concerned, you need to include the Opterons and support chips as well, and Clearspeed will need to use far more of those than Cell. 1GB of memory consumes much less power than the CPU/FPU, as can be inferred from the fact that memory doesn't need a heat sink or fan cooling.
 
Here is some info about an application of IBM's Cell-based supercomputers, somewhat related to what is being discussed here.

General-purpose computing clusters, said Skalabrin, are "just becoming unmanageable" and are consuming too much power. "There is a real need for specialized computing," he said.

"Absent innovation, we are facing a crisis in turnaround time with OPC," said Joe Sawicki, vice president and general manager of Mentor's design-to-silicon division. Even at 65 nanometers, according to Sawicki, some customers are using 1,000 processor nodes to run OPC--and taking days to do it. Some are talking about needing 2,000 nodes for 45 nm, "an unacceptable explosion in the cost of ownership," Sawicki said.
......
IBM Corp.'s multicore Cell architecture is "uniquely suited" to tackling OPC, Sawicki of Mentor said. Originally aimed at gaming applications, the Cell contains one PowerPC processor and eight "synergistic processing elements." The Cell's strength is rapid image processing. Compared with an Opteron processor, a Cell processor speeds up the fast Fourier transforms (FFTs) used in OPC simulation, he said.
 
the current fastest is IBM's Blue Gene, which does 367 theoretical (280 actual achieved) teraflops with 130,000 Opteron processors. You can see the difference in computing power between the Cells and the Opterons from those numbers.

BlueGene uses 130,000 custom PowerPC 440s, not Opterons.
 
10480 Opterons and only 360 x 2 CSX600s? Doesn't sound right. This would only give you a measly 18 teraflops on the Clearspeed coprocessors.
~35 TFLOPs peak have been added with 360 boards from a base of ~50 TFLOPs peak for the x86 processors, with ~21TB of RAM.

http://www.top500.org/site/690

When the other 300 Clearspeed boards are installed, it'll move to 5th, I guess.
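Reconciling the two figures: 360 boards x 50 GFLOPS sustained DGEMM per board is where the ~18 TFLOPS sustained number comes from, while ~35 TFLOPS peak spread over 360 boards works out to roughly 97 GFLOPS peak per board, i.e. just under 50 GFLOPS peak per CSX600.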
It certainly won't get close to the Roadrunner, which has 16,000 Opteron processors and 16,000 Cell processors to give 1548 theoretical teraflops (1.4 petaflops actual) (the current fastest is IBM's Blue Gene, which does 367 theoretical (280 actual achieved) teraflops with 130,000 Opteron processors). You can see the difference in computing power between the Cells and the Opterons from those numbers.
In two years' time, yes... This version of Cell doesn't exist right now, which has always been my point. People using Cell now either aren't bothered by the DP performance, don't need DP, are rewriting their code to SP, or are rewriting it to mixed-precision SP/DP to produce DP results where needed.

As far as power is concerned, you need to include the Opterons and support chips as well, and Clearspeed will need to use far more of those than Cell.
There's nothing about ClearSpeed that demands you deploy in the ratio that TiTech has used. e.g. you could put two boards/4 chips into a node with a single-core Opteron.

Like GPUs, ClearSpeed is a rapidly moving target. It's the only way the company is going to survive (if it does). They work on an 18-month generation as opposed to the 3-5 year generations you get with Cell. The 130nm process they're using is so far behind the curve, that even a moderate catch-up (say, to 90nm) by 2008 is still going to make a big difference. It seems entirely reasonable to expect a 2 chip, 200GFLOP peak, 30W board in 2008.

Do you think a DP Cell delivering 100GFLOPs peak (7x Cell) is going to come in at or below 75W on 65nm in 2008?
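For reference, the current Cell's double-precision peak is usually quoted at around 14-15 GFLOPS across the eight SPEs, which is where the roughly 7x figure comes from.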

A more pertinent question has to be: why is Roadrunner using any Opterons? Cell is supposed to be capable of standing alone - yet IBM seems determined to use Opterons and treat Cell as a co-processor - because of the installed base of software, including the gubbins that holds a supercomputer together. Does a DP Cell going into Roadrunner need PPE? What about the cache? Does PPE stick around for implementation in custom systems where Cell stands alone but is unused in co-processor installations?

Jawed
 
[Image: cell_edp_on_roadrunner.png]

[Image: cell_edp_on_roadrunner-2.png]


There isn't any Clearspeed chip inside.
 
A more pertinent question has to be: why is Roadrunner using any Opterons?
Integerish branch-heavy spaghetti code with incoherent memory behaviour? Traversing graphs, gathering data, scattering data, managing memory, managing communication, keeping stuff in sync.
Certainly not for the raw throughput.
Jawed said:
Cell is supposed to be capable of standing alone - yet IBM seems determined to use Opterons and treat Cell as a co-processor - because of the installed base of software, including the gubbins that holds a supercomputer together. Does a DP Cell going into Roadrunner need PPE? What about the cache?
That's a pretty interesting question.
It might sit idle a lot of the time if the Opterons really take over the orchestration, but maybe that's a good tradeoff. The PPE is what, a third of the die, going down in future revs? However IMO you need *something* there just to acknowledge the receipt of data packets and to control the on-board memory. Just picking the regular Cell and under-using the PPE is a pretty straightforward choice, especially if you think about economies of scale.

[going out on limb]
The PPE cache can be locked down, right? It could be used as an extra scratchpad buffer to prefetch data and/or improve memory access patterns.
[/going out on limb]
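Something in that spirit, though using plain software prefetch rather than actual cache locking - just a sketch with the GCC builtin, not a Cell-specific API:

[code]
/* Sketch (assumes GCC's __builtin_prefetch): touch data a few elements
 * ahead of where we're working so it's already on its way to the cache.
 * This illustrates the access-pattern smoothing idea, not real cache locking. */
#include <stdio.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 8   /* how far ahead to touch; tune per workload */

static double sum_with_prefetch(const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 1);
        sum += data[i];
    }
    return sum;
}

int main(void)
{
    static double data[1024];
    for (size_t i = 0; i < 1024; i++)
        data[i] = (double)i;
    printf("sum = %f\n", sum_with_prefetch(data, 1024));
    return 0;
}
[/code]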
 
There's nothing about ClearSpeed that demands you deploy in the ratio that TiTech has used. e.g. you could put two boards/4 chips into a node with a single-core Opteron.

Like GPUs, ClearSpeed is a rapidly moving target. It's the only way the company is going to survive (if it does). They work on an 18-month generation as opposed to the 3-5 year generations you get with Cell. The 130nm process they're using is so far behind the curve, that even a moderate catch-up (say, to 90nm) by 2008 is still going to make a big difference. It seems entirely reasonable to expect a 2 chip, 200GFLOP peak, 30W board in 2008.
That's all cool, until you hear the economic side of things...
http://www.theregister.co.uk/2006/11/18/clearspeed_silicon_sc06/
At $8,000 per board, ClearSpeed will need to keep a close eye on how it stacks up from a price/performance perspective moving forward. It should be noted though that the company claims to offer large discounts on volume purchases.

According to the page for the ClearSpeed accelerator board at the Japanese agency, the standard price of HPB-CSX6 is 1,600,000 yen ($13,926) and the special discount price is 1,428,000 yen ($12,436).
http://www.hpc.co.jp/hit/products/clearspeed.html
 
The power consumption of the Mercury add-in board found here sounds a bit high. 210 W for one Cell at 2.8 GHz, 1 GB XDR RAM and 4 GB DDR2 RAM?

I don't think anyone has measured above 200 W for the PS3 when running games and that includes a power supply with some loss, a CELL at 3.2 GHz and the RSX probably consuming above 50W.

They must have some pretty large safety margins. :p
 
That's a bit rich, since a PCI card with an 8-SPE functioning Cell on it seems to be rumoured to cost $8000.

Feel free to dig around for the price list, I got bored after just finding rumours.

http://www.osnews.com/comment.php?news_id=15354

Jawed
You know the Mercury Cell board is for single-precision work (180GFLOPS) on a workstation, don't you? I think the price of the Cell board is very competitive.
http://www.clearspeed.com/acceleration/technology/
The ClearSpeed CSX600

ClearSpeed's CSX600 is an embedded low power data parallel coprocessor. It provides 25 GFLOPS of sustained single or double precision floating point performance, while dissipating an average of 10 Watts.
 
HPC and games are both single application per processor scenarios. For HPC you want to break up a single task to run on many processors, not break up a single processor to run many applications.
HPC is much more like a single application running across a lot of processors; that's why clusters and workload-specific accelerators like Cell and ClearSpeed are only suitable for a subset of HPC problems. There's one thing that I didn't see mentioned in this thread: efficiency. I don't have first-hand experience with ClearSpeed cards, but it seems to me that they are much more likely to reach high utilization of their execution resources than Cell, so the difference in raw FP throughput may not be telling of the real differences between the two approaches.

Also, the architectures are very different: SPEs cannot access memory directly whilst ClearSpeed can. However, I think that under certain assumptions about data coherency it would be easier to program an application with a sparse data set for Cell than for ClearSpeed, which AFAIK cannot do scatter-gather loads/stores like traditional vector processors. It can access arrays with variable strides and access adjacent elements, but it still relies on data being stored in a fairly regular fashion (*).

(*) Disclaimer: as I said, I have no first-hand experience so I may be wrong here, even though I tried to get as many details as possible from the available documentation.
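To illustrate what I mean on the Cell side, here's a conceptual sketch of a DMA-list style gather - plain C standing in for the real SPE intrinsics, so treat it as an illustration of the idea rather than actual Cell code:

[code]
/* Conceptual sketch only (not real SPE intrinsics): an SPE cannot load from
 * main memory directly, so sparse data is pulled into local store by building
 * a list of (address, size) transfer elements and DMAing them in; the kernel
 * then works on a packed, contiguous copy. */
#include <string.h>
#include <stdio.h>

#define LIST_LEN 4
#define ELEM_SZ  16   /* bytes per gathered element */

typedef struct { const void *src; size_t size; } transfer_elem_t;

/* stand-in for a DMA-list "get": copy each listed region into local store */
static void gather_into_local_store(char *local_store,
                                    const transfer_elem_t *list, int n)
{
    size_t offset = 0;
    for (int i = 0; i < n; i++) {
        memcpy(local_store + offset, list[i].src, list[i].size);
        offset += list[i].size;
    }
}

int main(void)
{
    static char main_memory[1024];                /* pretend system RAM    */
    static char local_store[LIST_LEN * ELEM_SZ];  /* pretend 256KB LS      */

    transfer_elem_t list[LIST_LEN];
    for (int i = 0; i < LIST_LEN; i++) {          /* sparse, irregular EAs */
        list[i].src  = &main_memory[i * 200];
        list[i].size = ELEM_SZ;
    }
    gather_into_local_store(local_store, list, LIST_LEN);
    printf("gathered %d sparse elements into a packed local buffer\n", LIST_LEN);
    return 0;
}
[/code]

ClearSpeed-style strided access skips the explicit list-building step, but AFAICS it still wants the data laid out regularly in the first place.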
 
It might sit idle a lot of the time if the Opterons really take over the orchestration, but maybe that's a good tradeoff. The PPE is what, a third of the die, going down in future revs? However IMO you need *something* there just to acknowledge the receipt of data packets and to control the on-board memory. Just picking the regular Cell and under-using the PPE is a pretty straightforward choice, especially if you think about economies of scale.

Having the Opterons in place may be a nod towards platform flexibility, and I doubt that the PPE will be sitting idle.

Even with the x86 chips taking up a lot of the slack, there will still be a large number of programs or algorithms that will need the PPE in some capacity.

It's probably why the Opterons were brought in: the PPE is still a bottleneck in a number of applications or it becomes a bottleneck when burdened by too much extraneous work. Any work not directly related to running the critical code (system overhead is larger in total for a supercomputer than it is for a console) translates into lost performance overall.
 