Michael Gschwind
The CELL project at IBM Research
Project description
The CELL Architecture
The CELL Architecture grew from a challenge posed by Sony and Toshiba to provide power-efficient and cost-effective high-performance processing for a wide range of applications, including the most demanding consumer appliance: game consoles. CELL - also known as the Broadband Processor Architecture (BPA) - is an innovative solution whose design was based on the analysis of a broad range of workloads in areas such as cryptography, graphics transform and lighting, physics, fast-Fourier transforms (FFT), matrix operations, and scientific workloads. As an example of innovation that ensures the clients' success, a team from IBM Research joined people from IBM Systems Technology Group, Sony and Toshiba, to lead the development of a novel architecture that represents a breakthrough in performance for consumer applications. IBM Research participated throughout the entire development of the architecture, its implementation and its software enablement, ensuring the timely and efficient application of novel ideas and technology into a product that solves real challenges.
CELL is a heterogeneous chip multiprocessor consisting of a 64-bit Power core augmented with eight specialized co-processors based on a novel single-instruction multiple-data (SIMD) architecture called SPU (Synergistic Processor Unit). The SPUs target data-intensive processing such as that found in cryptography, media and scientific applications. The system is integrated by a coherent on-chip bus.
Based on the analysis of available die area, cost and power budgets, and achievable performance, the best approach to achieving the performance target was the exploitation of parallelism through a high number of nodes on a chip multiprocessor. To further reduce power, the team opted for a heterogeneous configuration with a novel SIMD-centered architecture. This configuration combines the flexibility of an IBM Power core with functionality- and performance-optimized SPU SIMD cores.
Figure: Cell Block Diagram
In this organization, the SPU accelerators operate from a local storage which contains the instructions and data for a single SPU. This local storage is the only memory directly addressable by the SPU.
The SPU architecture was designed to
• provide a large register file,
• simplify code generation,
• reduce the size and power consumption by unifying resources, and
• simplify decode and dispatch.
These goals were achieved by defining a novel SIMD-based architecture with 32-bit-wide instructions that encode a 3-operand instruction format. Designing a new instruction set architecture (ISA) allowed us to streamline the instruction side and to provide 7-bit register operand specifiers, so that every instruction can directly address 128 registers. A single pervasive SIMD computation approach is used for both scalar and vector data: a unified 128-entry, 128-bit SIMD register file also provides the scalar, condition, and address operands needed for conditional operations, branches, and memory accesses.
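The bit budget behind this format can be checked with a short sketch. Three 7-bit register specifiers occupy 21 bits of a 32-bit instruction word, which leaves 11 bits for the opcode. The field order used here (opcode in the high bits, then two source registers, then the target register) and the names `encode` and `decode` are assumptions made for illustration, not a statement of the actual SPU encoding.

```python
REG_BITS = 7                        # 7-bit specifier -> 2**7 = 128 registers
NUM_REGS = 1 << REG_BITS
OPCODE_BITS = 32 - 3 * REG_BITS     # 11 bits remain for the opcode


def encode(opcode, rb, ra, rt):
    """Pack an opcode and three register numbers into one 32-bit word.

    The field order is hypothetical; only the field widths come from the text.
    """
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in the opcode field")
    for reg in (rb, ra, rt):
        if not 0 <= reg < NUM_REGS:
            raise ValueError("register number out of range")
    return (opcode << 21) | (rb << 14) | (ra << 7) | rt


def decode(word):
    """Split a 32-bit word back into (opcode, rb, ra, rt)."""
    mask = NUM_REGS - 1
    return (word >> 21, (word >> 14) & mask, (word >> 7) & mask, word & mask)
```

Because every operand field is 7 bits wide, any of the 128 registers can appear in any operand position, which is what lets one register file serve scalar, vector, condition, and address operands alike.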
While the SPU ISA is a novel architecture, the operations selected for the SPU are closely aligned with the functionality of the Power VMX unit. This facilitates and simplifies code portability between the Power main processor and the SPU SIMD-based co-processors. However, the range of data types supported in the SPU has been reduced for most computation formats. While VMX supports a number of densely packed saturating integer data types, these data types lead to a loss of dynamic range which typically degrades computation results. The preferred computation approach is to widen integer data types for intermediate operations and perform them without saturation. Unpack and saturating pack operations allow memory bandwidth and memory footprint to be reduced while maintaining high data integrity.
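The effect of saturating early versus widening and packing late can be seen in a small sketch. It models 8-bit pixel data in plain Python integers (which are unbounded, standing in here for a wider intermediate type); the function names and the gain/offset operation are hypothetical examples, not SPU or VMX operations.

```python
def saturate_u8(x):
    """Clamp an integer to the unsigned 8-bit range 0..255."""
    return max(0, min(255, x))


def brighten_saturating(pixels, gain, offset):
    """Apply gain then offset, saturating to 8 bits after every step."""
    return [saturate_u8(saturate_u8(p * gain) + offset) for p in pixels]


def brighten_widened(pixels, gain, offset):
    """Keep intermediates wide and saturate once, when packing to 8 bits."""
    return [saturate_u8(p * gain + offset) for p in pixels]
```

For the input `[100, 200]` with a gain of 2 and an offset of -150, the step-by-step saturating version yields `[50, 105]`, because the intermediate value 400 is clipped to 255 before the offset is applied. The widened version yields `[50, 250]`, preserving the intermediate range and saturating only in the final pack.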
Floating point data types automatically support a wide dynamic data range and saturation, so no additional data conditioning is required. To reduce area and power requirements, floating point arithmetic is restricted to the most common and useful modes. As a result, denormalized numbers are automatically flushed to 0 when presented as input, and when a denormalized result is generated. Also, a single rounding mode is supported.
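As a minimal sketch of what flushing denormalized numbers means, the following code operates on IEEE 754 single-precision bit patterns. A denormalized number has an all-zero exponent field and a nonzero fraction. Returning positive zero regardless of the input sign is an assumption of this sketch, and the helper names are hypothetical.

```python
import struct


def float_to_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def flush_denormal(bits):
    """Replace a single-precision denormal bit pattern with zero.

    Assumption: the result is positive zero, whatever the input sign.
    """
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent field
    fraction = bits & 0x7FFFFF       # 23-bit fraction field
    if exponent == 0 and fraction != 0:
        return 0
    return bits
```

In a model of the behavior described above, `flush_denormal` would be applied to each operand before an operation and to the result afterwards; normalized values, zeros, and other bit patterns pass through unchanged.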
Single-precision floating point computation is geared toward throughput for media and 3D graphics objects. In this vein, the decision to support only a subset of IEEE floating point arithmetic and to sacrifice full IEEE compliance was driven by the target applications. Multiple rounding modes and IEEE-compliant exceptions are typically unimportant for these workloads and are not supported. This design decision is based on the real-time nature of game workloads and other media applications: most often, saturation is mathematically the right solution. Also, occasional small display glitches caused by saturation in a display frame are tolerable. On the other hand, incomplete rendering of a display frame, missing objects, or tearing video due to long exception handling is objectionable.
Memory access is performed via a DMA-based interface using copy-in/copy-out semantics, and data transfers can be initiated by either the Power processor or an SPU. The DMA-based interface uses the Power page protection model, giving all processor structures a consistent interface to the system storage map despite their heterogeneous instruction set architectures. A high-performance on-chip bus connects the SPU and Power computing elements.
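The copy-in/copy-out pattern can be illustrated with a toy model in which a computation only ever touches a private buffer, and explicit transfers move data between that buffer and system memory. The names `LocalStore`, `dma_get`, `dma_put`, and `process_in_chunks` are hypothetical and do not correspond to any real CELL programming interface; the 256 KB size is taken from the statistics listed on this page. Address translation and page protection are not modeled.

```python
LOCAL_STORE_SIZE = 256 * 1024   # bytes of local storage per SPU


class LocalStore:
    """Toy model of one SPU's private local storage."""

    def __init__(self):
        self.mem = bytearray(LOCAL_STORE_SIZE)


def _check(system_memory, ls_offset, ea, size):
    """Reject transfers that fall outside either memory."""
    if min(ls_offset, ea, size) < 0:
        raise ValueError("negative offset or size")
    if ls_offset + size > LOCAL_STORE_SIZE or ea + size > len(system_memory):
        raise ValueError("transfer out of bounds")


def dma_get(system_memory, ls, ls_offset, ea, size):
    """Copy-in: system memory -> local store."""
    _check(system_memory, ls_offset, ea, size)
    ls.mem[ls_offset:ls_offset + size] = system_memory[ea:ea + size]


def dma_put(system_memory, ls, ls_offset, ea, size):
    """Copy-out: local store -> system memory."""
    _check(system_memory, ls_offset, ea, size)
    system_memory[ea:ea + size] = ls.mem[ls_offset:ls_offset + size]


def process_in_chunks(system_memory, ls, chunk_size, kernel):
    """Run kernel over system memory one chunk at a time via the local store."""
    for ea in range(0, len(system_memory), chunk_size):
        size = min(chunk_size, len(system_memory) - ea)
        dma_get(system_memory, ls, 0, ea, size)
        result = kernel(bytes(ls.mem[0:size]))   # compute on local data only
        if len(result) != size:
            raise ValueError("kernel must preserve the chunk length")
        ls.mem[0:size] = result
        dma_put(system_memory, ls, 0, ea, size)
```

For example, `process_in_chunks(buf, LocalStore(), 4, lambda b: bytes(x ^ 0xFF for x in b))` inverts every byte of a `bytearray` named `buf` while the kernel itself never reads or writes system memory directly.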
The SPU is an in-order dual-issue statically scheduled architecture. Two SIMD instructions can be issued per cycle: one compute instruction and one memory operation. The SPU branch architecture does not include dynamic branch prediction, but instead relies on compiler-generated branch prediction using "prepare-to-branch" instructions to redirect instruction prefetch to branch targets.
The SPU was designed with a compiled code focus from the beginning, and early availability of SIMD-optimized compilers allowed development of high-performance graphics and media libraries for the Broadband Processor Architecture entirely in the C programming language.
Because they share compute semantics, data types, and the virtual memory model with the Power architecture, the SPUs combine with it synergistically, amplifying the advantages of both, to form the Broadband Processor Architecture.
The IBM Research division grew its partnership in the development of the Broadband Processor Architecture beyond the initial definition of the architecture. During the course of this partnership with the STI Design Center, members of the original CELL team developed the first SPU compiler, which was a guiding force for the definition of the SPU architecture and the SPU programming environment, as well as sample code to exploit the strengths of the Broadband Processor Architecture. The extended partnership led to further contributions by IBM Research, including the development of an advanced parallelizing compiler with auto-SIMDization features based on IBM XL compiler technology, the design of the high-frequency Power core at the center of the CELL architecture, and a full-system simulation infrastructure.
CELL is not limited to game systems. IBM has announced a CELL-based "blade" leveraging the investment into the high-performance CELL architecture. Other future uses include HDTV sets, home servers, game servers, and supercomputers. Also, CELL is not limited to a single chip, but is a scalable system. The number of attached SPUs can be varied to achieve different power/performance and price/performance points. In addition, the CELL architecture was conceived as a modular, extensible system in which multiple CELL subsystems, each with a Power core and attached SPUs, can form a symmetric multiprocessor system.
Figure: Cell Prototype Die
Some CELL statistics:
• Observed clock speed: > 4 GHz
• Peak performance (single precision): > 256 GFlops
• Peak performance (double precision): > 26 GFlops
• Local storage size per SPU: 256KB
• Area: 221 mm²
• Technology: 90 nm SOI
• Total number of transistors: 234M
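The single-precision peak figure is consistent with a simple back-of-the-envelope calculation, sketched below. It assumes that only the eight SPUs are counted, that each 128-bit register holds four 32-bit single-precision elements, and that each element lane completes one fused multiply-add (counted as two floating point operations) per cycle at 4 GHz. The function name and parameters are hypothetical.

```python
def peak_gflops(num_spus, register_bits, element_bits,
                flops_per_lane_per_cycle, clock_ghz):
    """Estimate peak GFlops as units x SIMD lanes x flops per lane x clock."""
    lanes = register_bits // element_bits
    return num_spus * lanes * flops_per_lane_per_cycle * clock_ghz


# 8 SPUs x 4 lanes x 2 flops per multiply-add x 4.0 GHz
single_precision_peak = peak_gflops(8, 128, 32, 2, 4.0)
```

Under these assumptions the estimate is 256 GFlops at exactly 4 GHz, so a clock speed above 4 GHz gives a peak above 256 GFlops. The double-precision figure does not follow from the same formula and is not modeled here.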
CELL received the 2004 Microprocessor Report Analysts' Choice Award for Best Technology.
Patents
SIMD datapath coupled to scalar/vector/address/conditional data register file with selective subpath scalar processing mode
M. Gschwind, P. Hofstee, M. Hopkins
1/4/2005 Issued as US patent 6839828
Token-based DMA
P. Hofstee, R. Nair, J. Wellman
11/16/2004 Issued as US patent 6820142
Symmetric multi-processing system with attached processing units being able to access a shared memory without being structurally configured with an address translation mechanism
E. Altman, P. Capek, M. Gschwind, P. Hofstee, J. Kahle, R. Nair, S. Sathaye, J. Wellman, M. Suzuoki, T. Yamazaki
8/17/2004 Issued as US patent 6779049
Pipeline control for high-frequency pipelined designs
M.K. Gschwind
2/20/2001 Issued as US patent 6192466
Publications
A 1.0-GHz single-issue 64-bit PowerPC integer processor
IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, Nov 1998. (J. Silberman et al.)
Exploring realtime multimedia content creation in video games
6th Workshop on Media and Streaming Processors in conjunction with MICRO 36, December 2004. (B. Matthews, J.D. Wellman, M. Gschwind)
The Design and Implementation of a First-Generation CELL Processor
ISSCC 2005, February 2005. (D. Pham et al.)
The Microarchitecture of the Streaming Processor for a CELL Processor
ISSCC 2005, February 2005. (B. Flachs et al.)
Power Efficient Architecture and the Cell Processor
HPCA-11, February 2005. (P. Hofstee)