IBM officially announces PowerXCell 8i (DP-enhanced 65nm Cell)


one
13-May-2008, 10:33
http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=207602892



IBM shifts Cell to 65nm
Server chip aims at "supercomputing for the masses"




Rick Merritt
EE Times
(05/13/2008 12:01 AM EDT)

SAN JOSE, Calif. — IBM Corp. officially announces today (May 13) a next-generation version of its Cell processor, the first specifically geared for computer servers.

The PowerXCell 8i will drive the Roadrunner system now under test at Los Alamos National Laboratory, which is trying to become the world's first supercomputer to deliver sustained petaflops performance. Beyond cracking the petaflops barrier, IBM hopes hundreds of users will decide to plug a two-socket board housing the new Cell chips into their IBM servers to deliver what IBM calls "supercomputing for the masses."

The new chip uses 65nm process technology to reduce the power consumption of the previous 90nm chip while keeping the same 3.2 GHz frequency. That lets IBM fit two of the chips onto a single board while keeping board-level power consumption under the 250 W required by IBM's BladeCenter servers.

The new design now supports mainstream DDR-2 memory rather than the Rambus XDR memory used in the original Cell. It also expands the chip's total memory capacity from 2 to 32 Gbytes to support the large data sets required by many high-end technical computing applications.

IBM also expanded support for double precision floating point on the eight specialty cores used on Cell. The chip now delivers up to 190 TFlops of double precision floating point performance, five times its previous level, said Jim Comfort, vice president of workload optimized systems in IBM's Systems and Technology Group.

The older PDF about Roadrunner
http://www.lanl.gov/orgs/hpc/roadrunner/rrinfo/RR%20webPDFs/Roadrunner%20System%20Overview%20Oct%2017,%202007%20(LAUR).pdf

V3
13-May-2008, 11:12
The chip now delivers up to 190 TFlops of double precision floating point performance

Is that a typo ?

Jawed
13-May-2008, 11:18
Ooh, that's interesting, DDR2. Anyone know the bandwidth?

Jawed

pjbliverpool
13-May-2008, 11:51
Is that a typo ?

Presumably. It's also pretty close to the single-precision performance, which is interesting.

It's approximately 4x higher than the fastest quad core.

one
13-May-2008, 11:55
Is that a typo ?

Haha, didn't notice it... "Five times" is probably right.

According to the previous discussion thread about this enhanced Cell B.E., it has 102 GFLOPS (DP), while the original Cell has 25.6 GFLOPS (DP).
http://forum.beyond3d.com/showthread.php?t=40661
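The 102 GFLOPS (DP) figure can be reproduced with the usual peak-throughput formula: cores × clock × flops per cycle. The sketch below makes three assumptions. Each of the 8 SPEs in the enhanced chip is taken to retire one 2-wide double-precision fused multiply-add per cycle, which is 4 DP flops/cycle. The PPE is left out. The 25.6 GFLOPS baseline is simply the number quoted above. `peak_gflops` is a hypothetical helper, not part of any IBM SDK.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS: cores x clock (GHz) x flops retired per cycle."""
    return cores * clock_ghz * flops_per_cycle

# Assumption: 2 DP lanes x FMA (2 flops) = 4 DP flops per SPE per cycle.
spe_dp = peak_gflops(cores=8, clock_ghz=3.2, flops_per_cycle=4)
original_cell_dp = 25.6  # DP figure quoted for the original Cell in this thread

print(f"PowerXCell 8i SPE DP peak: {spe_dp:.1f} GFLOPS")      # 102.4
print(f"Ratio vs original Cell: {spe_dp / original_cell_dp:.1f}x")  # 4.0
```

With these inputs the SPE array comes out at 102.4 GFLOPS, exactly 4x the quoted figure for the original chip.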

Also there's a new Roadrunner article.
http://www.lanl.gov/news/index.php/fuseaction/1663.article/d/200805/id/13277
Named after the fleet-of-foot New Mexico state bird, the Roadrunner supercomputer is a hybrid, containing not one type of microprocessor but two.

Its main structure is a standard cluster of microprocessors (in this case AMD Opteron dual-core microprocessors). Nothing new here except that each chip has two compute cores instead of one. The hybrid element enters the picture when each Opteron core is internally attached to another type of chip, the enhanced Cell (the PowerXCell 8i), which has been designed specially for Roadrunner. The enhanced Cell can act like a turbocharger, potentially boosting the performance up to 25 times over that of an Opteron compute core alone.

In the end, each code achieved a substantial speedup when run on a Cell-accelerated Opteron compute node in comparison with execution on a single Opteron compute core, without the Cell. The VPIC code, which simulates plasmas in magnetic fields, is a prime example. It ran 6 times faster on the Opteron-Cell node than on the Opteron alone. That increase will allow researchers to tackle some scientific grand-challenge problems.

Successfully accelerating the Monte Carlo code called Milagro took many months, several false starts, and modification of 10 to 30 percent of the code. Monte Carlo codes, which simulate radiation transport, are very expensive computationally. As the October decision time drew near, Milagro was also executing 6 times faster with the Cell than without, a crucial achievement for the acceptance of Roadrunner.


Los Alamos scientists are now confident that Roadrunner will become the world's fastest supercomputer. It will be a tremendous asset to the computer simulations performed at the Laboratory for the nuclear weapons program as well as for scientific grand challenges. Important codes are expected to run at 200 to 500 teraflop/s. Roadrunner will also be the first computer to run the universally recognized code used to test supercomputer performance—LINPACK—at over 1 petaflop/s.

Carl B
14-May-2008, 04:40
My thoughts on the new QS22 (and I've been mulling this over all day) can be summed up with this: about time!

The next several years will be some of the most important in recent memory for the direction of modern computing. IBM has at its disposal one of the few architectures able to gain an early-mover advantage in the new trends of heterogeneous/massively parallel computing, so its seeming inertia with Cell has been puzzling, to say the least. Now, with the Roadrunner go-ahead from the DoE and the conclusion of code-base trials with a marquee client, it seems IBM is ready to get a little more serious about the PowerXCell 8i.

IMO, the countdown to the dawn of the next era is, in effect, the countdown to Larrabee. In the interim IBM has Cell with which to position itself and gain acceptance/credibility before that chip's arrival. If it can do so to the extent that a competitive Cell 2 is able to hold and/or extend client wins into the future, it will have to be deemed a success. Competing with the PowerXCell 8i in this regard in the immediate term will be the GPGPU offerings from NVIDIA and AMD, and truthfully I think they'll walk away with the lion's share of it. But that's OK, because IBM's wares are aimed essentially at the enterprise market and big spenders alone. So long as QS22 performance is competitive, with IBM's platinum reputation and support as part of the package, they should do well enough. I was heartened to see, for example, that the QS22 is appropriately included in IBM's new enterprise data center initiative.

IBM has a golden opportunity in that the R&D expense and volume production ramp of a next-generation Cell are virtually assured through the relationship with Sony and Toshiba. Any headway they make with the present architecture(s) today gives them that much better a position to start from when the next evolution of the chip arrives tomorrow.

patsu
14-May-2008, 08:13
Software library and volume/installed base are key. Go go STI !

I am waiting for Toshiba's makeup application using SpursEngine. Let's hope Sony licenses it for PS3 too.

one
14-May-2008, 10:00
QS22 PR
http://www-03.ibm.com/press/us/en/pressrelease/24180.wss
It says "the IBM PowerXCell™ 8i, offers five-times the speed of the original Cell/B.E. processor".

In the Roadrunner PDF, the 32-SPE Cell is shown as the successor to the eDP Cell, so we may see a PowerXCell™ 32i in the near future when 45nm is ready.

Carl B
14-May-2008, 14:45
They're definitely rounding up with the 'five times' though, at least in absolute performance terms. The QS22 is rated at 217 DP GFLOPS, which would seem closer to four times than five. Maybe other considerations are going into the calculus.
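The sketch below shows one plausible way to arrive at the QS22's 217 figure. Its premises are assumptions, not statements from IBM's material quoted here. The rating is taken to count both chips on the blade. Each chip contributes 8 SPEs at 4 DP flops/cycle plus the PPE at 2 DP flops/cycle (one scalar fused multiply-add), all at 3.2 GHz.

```python
CLOCK_GHZ = 3.2
SOCKETS = 2  # two PowerXCell 8i chips per QS22 blade

# Assumed per-chip breakdown: SPEs at 4 DP flops/cycle, PPE at 2 DP flops/cycle.
spe_dp_per_chip = 8 * CLOCK_GHZ * 4  # 102.4 GFLOPS
ppe_dp_per_chip = 1 * CLOCK_GHZ * 2  # 6.4 GFLOPS
chip_dp = spe_dp_per_chip + ppe_dp_per_chip
blade_dp = SOCKETS * chip_dp

original_cell_dp = 25.6  # per-chip baseline quoted earlier in the thread

print(f"Per chip: {chip_dp:.1f} GFLOPS, per QS22 blade: {blade_dp:.1f} GFLOPS")
print(f"Per-chip ratio vs original Cell: {chip_dp / original_cell_dp:.2f}x")
```

Under those assumptions the blade comes to 217.6 GFLOPS, and each chip delivers about 4.25x the 25.6 GFLOPS baseline. That supports the "closer to four than five" reading. A lower baseline for the original Cell would push the ratio up.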

Shifty Geezer
14-May-2008, 15:16
Ooh, that's interesting, DDR2.

Indeed. I presume they mean GDDR2, as in the consoles, which are bandwidth comparable with the Rambus in PS3.

AlStrong
14-May-2008, 15:28
GDDR2 in consoles :?:

Carl B
14-May-2008, 15:36
Indeed. I presume they mean GDDR2, as in the consoles, which are bandwidth comparable with the Rambus in PS3.

Oh no, it's DDR2 alright. The bandwidth is achieved through a massive increase in pin-out.

The thread One linked to earlier on the chip has a lot of good information from its technical debut: http://forum.beyond3d.com/showthread.php?t=40661

Jawed even posted in that thread so I don't understand how he was surprised all over again. :) (and you did too Shifty!)

Jawed
14-May-2008, 17:49
Jawed even posted in that thread so I don't understand how he was surprised all over again. :)
I have a dynamic, not a static memory - so I forgot; a refresh time of 1 year is just too long :smile:

Jawed

one
06-Jun-2008, 16:40
Fixstars announced GigaAccel 180, a PCI-e board sporting a PowerXCell™ 8i. So its perf is at 90GFlops (DP) @ 2.8GHz.

EDIT: 180GFlops -> 90GFlops (http://www.fixstars.com/en/pdf/GigaAccel180_EN.pdf)
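The corrected 90 GFLOPS (DP) figure and the 180 GFLOPS headline both follow from the board's lower 2.8 GHz clock. The sketch below assumes 8 SPEs, 8 single-precision flops/cycle per SPE (4-wide fused multiply-add) and 4 double-precision flops/cycle (2-wide), and it ignores the PPE.

```python
CLOCK_GHZ = 2.8  # GigaAccel 180 internal clock, from the spec sheet
SPES = 8

SP_FLOPS_PER_CYCLE = 8  # assumed: 4 SP lanes x FMA (2 flops)
DP_FLOPS_PER_CYCLE = 4  # assumed: 2 DP lanes x FMA (2 flops)

sp_gflops = SPES * CLOCK_GHZ * SP_FLOPS_PER_CYCLE
dp_gflops = SPES * CLOCK_GHZ * DP_FLOPS_PER_CYCLE

print(f"SP peak: {sp_gflops:.1f} GFLOPS")  # 179.2
print(f"DP peak: {dp_gflops:.1f} GFLOPS")  # 89.6
```

Rounded, these are the 180 and 90 GFLOPS figures, so the "180 GFLOPS" in the product blurb is the single-precision number.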

http://www.fixstars.com/en/products/gigaaccel180/
“GigaAccel 180” is PCI Express board using the “PowerXCell™ 8i”.
The “PowerXCell™ 8i” offers five times the double precision performance of the previous Cell/B.E. processor theoretically, and delivers extreme performance.
Take advantage of the astonishing 180 GFLOPS performance of the latest Cell/B.E. by integrating it into PC workstations with PCI Express slots. Large scale clustering systems need not be built to run applications that have heavy arithmetic processing elements, such as financial modeling to calculate values of stocks and derivatives, scientific computing needed to recreate or predict complex systems such as physics and science, as well as for computer diagnostic imaging needed for medical image processing.

Key Applications

* Financial modeling to calculate values of stocks and derivatives
* Scientific computations needed to recreate or predict complex systems such as physics and science
* Medical image processing that are needed for computer diagnostic imaging
* Codec processing for HD-quality video and audio delivery
* Seismic signal processing for resource exploration


http://www.fixstars.com/en/products/gigaaccel180/specs.html

Specifications

Microprocessor
Type: IBM PowerXCell™ 8i
Number of processor chips: 1
Processor internal clock speed: 2.8GHz
Processor memory interface speed: 800MHz
Processor I/O interface speed: 5.0GHz
PPU processors: 1
SPU processors: 8

Main memory
Memory: 4GB DDR2 DRAM
Channels: 2
Data path: 16 bytes per channel
Transfer rate: 800Mbps
Bandwidth (maximum, total): 25.6GB/s (with 5-5-5 timings)
ECC: Yes

Cell companion chip
Type: IBM Southbridge DD3.0
Transfer rate: 10GB/s at 2.5GHz (5GB/s each direction)

I/O
1 x16 PCI Express bus
2 x 1 Gigabit Ethernet

Software
Boot: IBM boot firmware
Network booting (Linux booting with no disk)
Test environment OS: Fedora 7
SDK: IBM Software Development Kit for Multicore Acceleration v3.0 (optional)

* Integrated Development Environment
* Performance tools

Driver: PCIe virtual Ethernet driver compatible (Linux hosts only)

External dimensions
Length: 111mm
Width: 312mm
*Double-wide box with fan heatsink

Operating environment
External temperature: 0 to 40ºC (32 to 104ºF)
Relative humidity: 5 to 95% RH (non-condensing)
Altitude: -400 to 3000m

Power consumption
150W

Compatible workstation
Lenovo ThinkStation S10 (GigaAccel 180 original specifications)
CPU: Intel® Core™2 Duo 3GHz
Memory: 4GB DDR3 1066MHz
HDD: 250GB Serial ATA 7200 rpm
Graphics: NVIDIA Quadro FX1700 (512MB VRAM)
Network: Dual Gigabit Ethernet
OS: Windows Vista® 64
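The spec sheet's 25.6GB/s figure also answers the earlier bandwidth question for this configuration. The sketch reads "Transfer rate 800Mbps" as 800 million transfers per second on each 16-byte channel. That reading is an assumption, since the sheet does not spell it out.

```python
CHANNELS = 2
BYTES_PER_TRANSFER = 16       # "Data path 16bytes per channel"
TRANSFERS_PER_SECOND = 800e6  # "Transfer rate 800Mbps", read as 800 MT/s per pin

def peak_bandwidth_gb_s(channels: int, bytes_per_transfer: int, transfers_per_s: float) -> float:
    """Peak DRAM bandwidth in decimal GB/s."""
    return channels * bytes_per_transfer * transfers_per_s / 1e9

bandwidth = peak_bandwidth_gb_s(CHANNELS, BYTES_PER_TRANSFER, TRANSFERS_PER_SECOND)
print(f"Peak bandwidth: {bandwidth:.1f} GB/s")  # 25.6
```

The 5-5-5 timings affect access latency, not this peak figure. The result matches the 25.6 GB/s usually quoted for the original Cell's XDR interface.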

Carl B
06-Jun-2008, 16:54
Man, Fixstars is going the distance with Cell, it seems: first the application porting for Mizuho, now their own product launch. Good find, One.

Shifty Geezer
06-Jun-2008, 18:53
How is this device going to be used in PC applications? I'm guessing it'll need custom code, which marginalises it considerably. For something like Maya acceleration the board would be great, but who's going to incorporate acceleration? I'm failing to appreciate the market.

one
06-Jun-2008, 19:34
How is this device going to be used in PC applications? I'm guessing it'll need custom code, which marginalises it considerably. For something like Maya acceleration the board would be great, but who's going to incorporate acceleration? I'm failing to appreciate the market.

Well, the page has a hint :wink: They demoed CodecSys HDCP Cell BE H.264 Encoder at NAB 2008. This application is for the professional video/broadcast industry.
http://www.fixstars.com/en/company/event/nab2008.html
http://www.investorvoices.com/bcst/2008-0417-video
Fixstars ported “CodecSys HDCP Cell BE H.264 Encoder” for the Cell/B.E. Blade to GigaAccel 180, and we demonstrated real-time video encoding.
Many people from broadcasting stations, broadcast equipment makers, broadcasting content providers, and other broadcasting related companies visited our booth.

The H.264 real time encoding software technology “CodecSys HDCP Cell BE H.264 Encoder” was very well-received for unprecedented levels of video compression technology by the visitors.
Broadcast International (BI) possesses patents for CodecSys, which is a multi-codec technology designed for real time distribution of HD-quality video. Fixstars ported the solution for the Cell/B.E. Blade to GigaAccel 180.
Using the astonishing performance by GigaAccel 180, the YUV420 image data of 720p and 30fps was encoded to H.264, compressed even into 3Mbps, and delivered in real time streaming.

Carl B
06-Jun-2008, 19:37
How is this device going to be used in PC applications?

It's not. :)

Rather, it's attempting to bring HPC performance/workloads onto a workstation-friendly form factor. I'm sure that once they also list the price in dollars, it will be none too cheap. But it presents a good option for institutions that want to work with the PowerXCell 8i and would rather not spend tens of thousands on a BladeCenter-based arrangement.

ShaidarHaran
06-Jun-2008, 19:54
It's 180GFLOPs in SP mode, half that in DP. Not bad, but an 8-core x86 system should outperform in both modes.
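On theoretical peak the claim holds, though only narrowly. The sketch compares the GigaAccel's Cell with a hypothetical 8-core x86 box. The 3.0 GHz clock is chosen for illustration and does not come from the thread. So is the Core 2-class SSE throughput, assumed to be one 128-bit add plus one 128-bit multiply per cycle, which gives 4 DP or 8 SP flops/cycle per core.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: cores x clock (GHz) x flops retired per cycle."""
    return cores * clock_ghz * flops_per_cycle

# Cell at 2.8 GHz: 8 SPEs, assumed 8 SP / 4 DP flops per cycle each (PPE ignored).
cell = {"SP": peak_gflops(8, 2.8, 8), "DP": peak_gflops(8, 2.8, 4)}
# Hypothetical 8-core x86 at 3.0 GHz, assumed 8 SP / 4 DP flops per cycle per core.
x86 = {"SP": peak_gflops(8, 3.0, 8), "DP": peak_gflops(8, 3.0, 4)}

for mode in ("SP", "DP"):
    print(f"{mode}: Cell {cell[mode]:.1f} vs x86 {x86[mode]:.1f} GFLOPS")
```

These are peak numbers only. Sustained throughput and power draw are a separate question.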

Shifty Geezer
06-Jun-2008, 20:34
It's not. :)

Rather, it's attempting to bring HPC performance/workloads onto a workstation-friendly form factor. I'm sure that once they also list the price in dollars, it will be none too cheap. But it presents a good option for institutions that want to work with the PowerXCell 8i and would rather not spend tens of thousands on a BladeCenter-based arrangement.

Okay, we're still a ways off from Cell-accelerated applications then.

It's 180GFLOPs in SP mode, half that in DP. Not bad, but an 8-core x86 system should outperform in both modes.

Dunno. In terms of attained, sustained performance, especially considering power consumption, Cell may still win out in a lot of cases.

one
07-Jun-2008, 05:46
It's 180GFLOPs in SP mode, half that in DP. Not bad, but an 8-core x86 system should outperform in both modes.

Thanks, I should have browsed the PDF!
http://www.fixstars.com/en/pdf/GigaAccel180_EN.pdf

pjbliverpool
07-Jun-2008, 10:33
The H.264 real time encoding software technology “CodecSys HDCP Cell BE H.264 Encoder” was very well-received for unprecedented levels of video compression technology by the visitors.
Broadcast International (BI) possesses patents for CodecSys, which is a multi-codec technology designed for real time distribution of HD-quality video. Fixstars ported the solution for the Cell/B.E. Blade to GigaAccel 180.
Using the astonishing performance by GigaAccel 180, the YUV420 image data of 720p and 30fps was encoded to H.264, compressed even into 3Mbps, and delivered in real time streaming.

I'm wondering how this compares to current, or near-current, x86. In Anand's test of Nehalem it's pulling 18fps in 1080p encoding.

http://anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3326&p=6

Taking into account the modest clock speed of the Nehalem, the unoptimised/buggy platform it's running on, and the fact that it's encoding 1080p rather than 720p, isn't it likely that a full-blown production model could also achieve the above feat?
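The argument rests on scaling frame rate by pixel count. The sketch assumes encode time grows linearly with pixels per frame. That is a simplification: codec, settings, motion and target bitrate all matter, and the two tests do not use the same encoder.

```python
def pixels(width: int, height: int) -> int:
    return width * height

nehalem_fps_1080p = 18.0  # figure quoted from the Anandtech test
scale = pixels(1920, 1080) / pixels(1280, 720)  # 1080p has 2.25x the pixels of 720p
estimated_fps_720p = nehalem_fps_1080p * scale  # naive: assumes linear scaling

print(f"Pixel ratio: {scale:.2f}")                           # 2.25
print(f"Naive 720p estimate: {estimated_fps_720p:.1f} fps")  # 40.5
```

Under that linear assumption the estimate clears 30 fps, which is the intuition behind the question.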

Carl B
07-Jun-2008, 14:20
It's a very different situation. Here, Cell is taking YUV420 data (rather than MPEG-2 as in the Anand tests) and compressing and streaming it, in real time, to H.264 720p at 3Mbps... which is probably a hell of a lot of compression from the original size, and no doubt done at superb quality levels.
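The sketch puts "a hell of a lot of compression" into numbers. It estimates the raw bitrate of the 720p30 YUV420 source, assuming 8-bit samples. 4:2:0 carries 1.5 samples per pixel: full-resolution luma plus two quarter-resolution chroma planes. The result is then divided by the 3Mbps output.

```python
def raw_yuv420_mbps(width: int, height: int, fps: int, bits_per_sample: int = 8) -> float:
    """Uncompressed YUV 4:2:0 bitrate in Mbit/s (1 Mbit = 1e6 bits)."""
    samples_per_frame = width * height * 3 // 2  # Y plane + U and V at quarter size
    return samples_per_frame * bits_per_sample * fps / 1e6

raw = raw_yuv420_mbps(1280, 720, 30)
encoded = 3.0  # Mbit/s, the figure from the NAB demo
print(f"Raw source: {raw:.1f} Mbit/s")              # 331.8
print(f"Compression ratio: {raw / encoded:.0f}:1")  # 111:1
```

That is roughly a 110:1 reduction from the uncompressed source, under the 8-bit assumption.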

The stuff Anandtech is testing is consumer apps on the desktop; this performance here deals with professional encoding for broadcast purposes. It's worlds apart (though I read Anand's tests yesterday and am looking forward to the chip).

ADEX
07-Jun-2008, 14:34
It's 180GFLOPs in SP mode, half that in DP. Not bad, but an 8-core x86 system should outperform in both modes.

In theoretical figures yes, but most processors can't even get close to their theoretical figures, and in DP work they're miles away from it - Cell is unusual in that it can get very close.

Berkeley researchers posted some figures a couple of years back comparing the original Cell to a range of processors on DP loads. It beat the others by 5-30X, which is astonishing considering how weak the DP capabilities of the original Cell are.

The 8-core systems have to catch the original Cell first; they're nowhere near the new one.

one
08-Jun-2008, 13:12
I'm wondering how this compares to current, or near current x86. In anands test of Nehalem its pulling 18fps in 1080p encoding.

http://anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3326&p=6

18fps is the figure from the MPEG-2 encoding test. The following "x264 Encoding with AutoMKV" test is more relevant, though only marginally so, as the quality of the source data is unknown.

EDIT: It seems CodecSys is a somewhat more complex system than plain H.264, specifically engineered for professional broadcast.
http://www.investorvoices.com/bcst/2008-0412

Rod: It does both, it does more. If you look at the way the cell-blade is actually constructed, it has the SPEs out there as well. What it's able to do is offload so much of the work to the SPEs and the PowerPC, and its internal communications bus is so efficient, that it's able to compress multiple versions of the video, and to do this in real time. To give you an idea of what we're doing right now: two Cell processors sit in one cell-blade chassis, and that one cell-blade can do two HD 1080p 60fps encodings simultaneously.

...

Rod: What we show, and we'll be showing this at NAB, is the MPEG-2 version of a video running at 19 Mbit per second next to the CodecSys video running at 3 Mbit per second, and you can see the quality has been preserved; this is at 720P HD. As for which codecs are being selected on a scene-by-scene basis, sometimes we're selecting codecs 2-3 times a second. We have 1080P running at 5 Mbit per second, and that 1080P is 60 frames per second.

Carl: That's incredible. What is the buffering time?

Rod: We're delayed 3-4 frames is all. So what we end up doing is doing scene changes every 3-4 frames is all.
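A 3-4 frame look-ahead translates directly into buffering delay. The sketch counts only that buffer. It ignores encode time, transport and decoder buffering, so real end-to-end latency would be higher.

```python
def buffer_delay_ms(frames: int, fps: float) -> float:
    """Time spanned by a look-ahead buffer of `frames` frames, in milliseconds."""
    return frames / fps * 1000.0

for fps in (30, 60):
    low, high = buffer_delay_ms(3, fps), buffer_delay_ms(4, fps)
    print(f"{fps} fps: {low:.0f}-{high:.0f} ms")
```

At the 60 fps rate quoted for 1080P, that buffer amounts to roughly 50-67 ms. At 30 fps it is 100-133 ms.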

Carl: So what it's doing is picking a codec within that 3-4 frame buffer, to determine what the best codec is.

Rod: That's correct.

Carl: Can the CodecSys cell-blade servers be scaled to encode in non-proprietary codecs such as H.264 for distribution and streaming to a larger install base?

Rod: Let's take a quick step back and just talk about CodecSys in general. The first version of CodecSys that we're doing today is H.264, we are running multiple versions of H.264, the output of our encoder is H.264, we are compatible with H.264 devices in the field.

Carl: That was one of my other questions, is that the majority of the encoding is H.264?

Rod: Right, so think of my encoder as an encoder, but rather than having a single H.264 codec in it, it has 12 H.264s in it. Each one of those H.264s is optimized for specific events like flashes of bright light, dark content and quick panning. With all of those different types of variables, when you watch a bandwidth meter as you're doing the encoding, you see that meter keep spiking. What we have done is develop codecs that are able to eliminate those spikes.
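Rod describes the idea only at a high level. The following is a hypothetical toy, not CodecSys's actual algorithm. It splits the stream into short windows of 4 frames, matching the 3-4 frame buffer he mentions. For each window it picks whichever tuned profile has the lowest estimated bit cost. The scene labels, profile names and per-frame costs are invented for illustration.

```python
# Hypothetical per-frame bit costs (arbitrary units) for each tuned profile.
PROFILES = {
    "general": {"static": 10, "flash": 40, "dark": 25, "pan": 35},
    "flash":   {"static": 14, "flash": 18, "dark": 28, "pan": 38},
    "dark":    {"static": 14, "flash": 42, "dark": 15, "pan": 38},
    "pan":     {"static": 14, "flash": 42, "dark": 28, "pan": 20},
}

def estimated_bits(profile: str, window: list[str]) -> int:
    """Estimated cost of encoding a window of labelled frames with one profile."""
    return sum(PROFILES[profile][scene] for scene in window)

def choose_profiles(scenes: list[str], window_size: int = 4) -> list[str]:
    """Pick the cheapest profile for each consecutive window of frames."""
    choices = []
    for start in range(0, len(scenes), window_size):
        window = scenes[start:start + window_size]
        choices.append(min(PROFILES, key=lambda p: estimated_bits(p, window)))
    return choices

scenes = ["static"] * 4 + ["flash"] * 4 + ["pan", "pan", "static", "pan"]
print(choose_profiles(scenes))  # ['general', 'flash', 'pan']
```

In this toy model, switching to the cheapest profile per window is what flattens the bitrate spikes. A real encoder would have to estimate both cost and quality far more carefully.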