The Cell architecture

Innovation Matters


The Cell architecture grew from a challenge posed by Sony and Toshiba to provide power-efficient and cost-effective high-performance processing for a wide range of applications, including the most demanding consumer appliance: game consoles. Cell - also known as the Cell Broadband Engine Architecture (CBEA) - is an innovative solution whose design was based on the analysis of a broad range of workloads in areas such as cryptography, graphics transform and lighting, physics, fast-Fourier transforms (FFT), matrix operations and scientific workloads. As an example of innovation that ensures the clients' success, a team from IBM Research joined colleagues from IBM Systems Technology Group, Sony and Toshiba, to lead the development of a novel architecture that represents a breakthrough in performance for consumer applications. IBM Research participated throughout the development, implementation and software enablement of the architecture, ensuring the timely and efficient application of novel ideas and technology into a product that solves real challenges.


Cell is a heterogeneous chip multiprocessor that consists of an IBM 64-bit Power Architecture™ core augmented with eight specialized co-processors. The co-processors are based on a novel single-instruction multiple-data (SIMD) architecture called the Synergistic Processor Unit (SPU), which is designed for data-intensive processing such as that found in cryptography, media and scientific applications. A coherent on-chip bus integrates the system.

Based on the analysis of available die area, cost and power budgets, and achievable performance, the best approach to achieving the performance target was to exploit parallelism through a high number of nodes on a chip multiprocessor. To further reduce power, the team opted for a heterogeneous configuration with a novel SIMD-centered architecture. This configuration combines the flexibility of an IBM 64-bit Power Architecture™ core with the functionality- and performance-optimized SPU SIMD cores.


Cell Block Diagram

In this organization, the SPU accelerators operate from a local storage that contains instruction and data for a single SPU. This local storage is the only memory directly addressable by the SPU.
The SPU architecture was built to:
  • provide a large register file,
  • simplify code generation,
  • reduce the size and power consumption by unifying resources, and
  • simplify decode and dispatch.

These goals were achieved by building a novel SIMD-based architecture with 32-bit wide instructions encoding a three-operand instruction format. Designing a new instruction set architecture (ISA) allowed IBM Research to streamline the instruction side and to provide seven-bit register operand specifiers that directly address 128 registers from all instructions, using a single pervasive SIMD computation approach for both scalar and vector data. In this approach, a unified 128-entry, 128-bit SIMD register file provides scalar, condition and address operands, such as for conditional operations, branches and memory accesses.
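The arithmetic behind the instruction format can be sketched in a few lines of C. Three seven-bit specifiers take 21 of the 32 instruction bits and each addresses 2^7 = 128 registers, which leaves 11 bits for the operation code. The field order used below, and the names insn_fields, insn_encode and insn_decode, are assumptions made for illustration only; they are not taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout (an assumption for this sketch):
   bits [31:21] opcode, [20:14] rb, [13:7] ra, [6:0] rt. */
#define REG_BITS    7u
#define NUM_REGS    (1u << REG_BITS)         /* 2^7 = 128 registers */
#define OPCODE_BITS (32u - 3u * REG_BITS)    /* 32 - 21 = 11 bits remain */

typedef struct {
    uint32_t opcode, rb, ra, rt;
} insn_fields;

/* Pack an opcode and three register numbers into one 32-bit word. */
static uint32_t insn_encode(insn_fields f) {
    assert(f.opcode < (1u << OPCODE_BITS));
    assert(f.rb < NUM_REGS && f.ra < NUM_REGS && f.rt < NUM_REGS);
    return (f.opcode << (3u * REG_BITS)) | (f.rb << (2u * REG_BITS)) |
           (f.ra << REG_BITS) | f.rt;
}

/* Recover the four fields from a 32-bit instruction word. */
static insn_fields insn_decode(uint32_t word) {
    insn_fields f;
    f.opcode = word >> (3u * REG_BITS);
    f.rb = (word >> (2u * REG_BITS)) & (NUM_REGS - 1u);
    f.ra = (word >> REG_BITS) & (NUM_REGS - 1u);
    f.rt = word & (NUM_REGS - 1u);
    return f;
}
```

Because every register field is a full seven bits wide, any of the 128 registers can appear in any operand position, which is what lets one register file serve scalar, condition and address operands alike.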

While the SPU ISA is a novel architecture, the operations selected for the SPU are closely aligned with the functionality of the Power™ VMX unit. This facilitates and simplifies code portability between the IBM 64-bit Power Architecture™ processor and the SPU SIMD-based processors. However, the range of data types supported in the SPU has been reduced for most computation formats. While VMX supports a number of densely packed saturating integer data types, these data types lead to a loss of dynamic range which typically degrades computation results. The preferred computation approach is to widen integer data types for intermediate operations and perform them without saturation. Unpack and saturating pack operations allow memory bandwidth and memory footprint to be reduced while maintaining high data integrity.
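A small scalar sketch in C shows why widening is preferred over saturating arithmetic on packed data. It averages two 8-bit values, once with a saturating 8-bit add and once by unpacking to 16 bits, computing without saturation, and packing back with saturation. The function names are hypothetical, and the code runs on single values rather than SIMD vectors, so it illustrates only the numerical effect and not SPU instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating 8-bit add: sums above 255 are clamped, so information is lost. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255u ? 255u : s);
}

/* Saturating pack: narrow a 16-bit value to 8 bits, clamping at 255. */
static uint8_t pack_sat_u8(uint16_t v) {
    return (uint8_t)(v > 255u ? 255u : v);
}

/* Average computed entirely in the packed 8-bit type. */
static uint8_t average_saturating(uint8_t a, uint8_t b) {
    return (uint8_t)(add_sat_u8(a, b) / 2u);
}

/* Average computed by widening first, then packing the final result. */
static uint8_t average_widened(uint8_t a, uint8_t b) {
    uint16_t wa = a, wb = b;                 /* unpack to 16 bits */
    uint16_t sum = (uint16_t)(wa + wb);      /* at most 510, no overflow */
    assert(sum / 2u <= UINT8_MAX);           /* an average always fits in 8 bits */
    return pack_sat_u8((uint16_t)(sum / 2u));
}
```

For the inputs 200 and 100, the saturating version clamps the intermediate sum to 255 and returns 127, while the widened version returns the correct average of 150. The data still occupies only 8 bits per element in memory in both cases.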

Floating point data types automatically support a wide dynamic data range and saturation, so no additional data conditioning is required. To reduce area and power requirements, floating point arithmetic is restricted to the most common and useful modes. As a result, denormalized numbers are automatically flushed to 0 when presented as input, and when a denormalized result is generated. Also, a single rounding mode is supported.
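The flush-to-zero behavior described above can be emulated on an ordinary host in C, as a rough sketch. The helpers flush_denormal and ftz_mul are hypothetical names, the code relies on the standard fpclassify macro, and it always returns positive zero, which is a simplification: the text does not say how the sign of a flushed value is handled.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Replace a denormalized (subnormal) value with zero; pass everything else through.
   Simplification: the sign of the flushed value is not preserved. */
static float flush_denormal(float x) {
    return fpclassify(x) == FP_SUBNORMAL ? 0.0f : x;
}

/* Multiply with denormals flushed on both inputs and on the result. */
static float ftz_mul(float a, float b) {
    float r = flush_denormal(a) * flush_denormal(b);
    r = flush_denormal(r);
    assert(fpclassify(r) != FP_SUBNORMAL);   /* results are never denormal */
    return r;
}
```

Multiplying FLT_MIN, the smallest normalized single-precision value, by 0.5 would give a denormalized result under full IEEE arithmetic. Here it yields zero, while ordinary values are unaffected.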

Single precision floating point computation is geared for throughput of media and three-dimensional graphics objects. In this vein, the decision to support only a subset of IEEE floating point arithmetic and sacrifice full IEEE compliance was driven by the target applications. Multiple rounding modes and IEEE-compliant exceptions are typically unimportant for these workloads, and are not supported. This design decision is based on the real-time nature of game workloads and other media applications: most often, saturation is mathematically the right solution. Also, occasional small display glitches caused by saturation in a display frame are tolerable. On the other hand, incomplete rendering of a display frame, missing objects or tearing video due to long exception handling is objectionable.

Memory access is performed via a DMA-based interface using copy-in/copy-out semantics, and data transfers can be initiated by either the IBM Power™ processor or an SPU. The DMA-based interface uses the Power Architecture™ page protection model, giving a consistent interface to the system storage map for all processor structures despite its heterogeneous instruction set architecture structure. A high-performance on-chip bus connects the SPU and Power Architecture™ computing elements.
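The copy-in/copy-out pattern can be sketched on a host machine in C. A static buffer stands in for an SPU's local storage, and the hypothetical helpers dma_get and dma_put stand in for DMA transfers by using memcpy. The chunk size and all names are assumptions for this sketch, not part of the Cell programming interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_FLOATS 1024u   /* hypothetical chunk size; it must fit in local storage */

/* Stand-in for the SPU's local storage, the only memory the SPU addresses directly. */
static float local_store[CHUNK_FLOATS];

/* Hypothetical stand-ins for DMA transfers between system memory and local storage. */
static void dma_get(float *ls, const float *sys, size_t n) {
    memcpy(ls, sys, n * sizeof *ls);
}
static void dma_put(float *sys, const float *ls, size_t n) {
    memcpy(sys, ls, n * sizeof *ls);
}

/* Scale an array in system memory: copy a chunk in, compute locally, copy it out. */
static void scale_in_chunks(float *sys_data, size_t count, float factor) {
    assert(sys_data != NULL || count == 0);
    for (size_t off = 0; off < count; off += CHUNK_FLOATS) {
        size_t n = count - off < CHUNK_FLOATS ? count - off : CHUNK_FLOATS;
        dma_get(local_store, sys_data + off, n);       /* copy-in */
        for (size_t i = 0; i < n; i++) {
            local_store[i] *= factor;                  /* compute on local data only */
        }
        dma_put(sys_data + off, local_store, n);       /* copy-out */
    }
}
```

The compute loop never touches system memory directly. All traffic to and from the larger address space goes through the two transfer helpers, which mirrors the role the DMA interface plays for the SPU.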

The SPU is an in-order dual-issue statically scheduled architecture. Two SIMD instructions can be issued per cycle: one compute instruction and one memory operation. The SPU branch architecture does not include dynamic branch prediction, but instead relies on compiler-generated branch prediction using "prepare-to-branch" instructions to redirect instruction prefetch to branch targets.

The SPU was designed with a compiled code focus from the beginning, and early availability of SIMD-optimized compilers allowed development of high-performance graphics and media libraries for the broadband architecture entirely in the C programming language.


Because they share compute semantics, data types, and the virtual memory model with the IBM Power Architecture™ core, the SPUs complement and amplify its strengths. Together they form the Cell Broadband Engine Architecture.

IBM Research grew its partnership in the development of the broadband processor architecture beyond its initial definition. During the course of this partnership with the STI Design Center, members of the original Cell team developed the first SPU compiler, which was a guiding force for the definition of the SPU architecture and the SPU programming environment. They also wrote sample code to exploit the strengths of the broadband processor architecture. The extended partnership led to further contributions by IBM Research, including the development of an advanced parallelizing compiler with auto-SIMDization features based on IBM XL compiler technology, the design of the high-frequency Power Architecture™ core at the center of the Cell architecture, and a full-system simulation infrastructure.

Cell is not limited to game systems. IBM has announced a Cell-based blade, which leverages the investment in the high-performance Cell architecture. Other future uses may include HDTV sets, home servers, game servers and supercomputers. Also, Cell is not limited to a single chip, but is a scalable system. The number of attached SPUs can be varied to achieve different power/performance and price/performance points. In addition, the Cell architecture was conceived as a modular, extendible system in which multiple Cell subsystems, each with a Power Architecture™ core and attached SPUs, can form a symmetric multiprocessor system.


Cell Die

Some Cell statistics:
  • Observed clock speed: > 4 GHz
  • Peak performance (single precision): > 256 GFlops
  • Peak performance (double precision): > 26 GFlops
  • Local storage size per SPU: 256 KB
  • Area: 221 mm²
  • Technology: 90 nm SOI
  • Total number of transistors: 234M
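
The single-precision peak figure can be reproduced with a short calculation, sketched here in C. Eight SPUs and 128-bit registers come from the description above, and a 128-bit register holds four 32-bit single-precision values. The remaining inputs are assumptions made for this sketch: each SPU is taken to complete one fused multiply-add per lane per cycle, counted as two floating point operations, and the contribution of the Power Architecture™ core is ignored.

```c
#include <assert.h>

/* Peak throughput in GFlops = SPUs x SIMD lanes x flops per lane per cycle x clock (GHz). */
static double peak_sp_gflops(unsigned spus, unsigned lanes,
                             unsigned flops_per_lane, double clock_ghz) {
    assert(clock_ghz > 0.0);
    return (double)spus * lanes * flops_per_lane * clock_ghz;
}
```

With 8 SPUs, 4 lanes, 2 operations per lane and a 4 GHz clock, the function returns 256, so a clock above 4 GHz is consistent with the listed figure of more than 256 GFlops.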


Related Publications

An Open Source Environment for Cell Broadband Engine System Software (IEEE Computer, June 2007, pdf) (M. Gschwind, D. Erb, S. Manning, M. Nutter)

The Cell Broadband Engine: Exploiting multiple levels of parallelism in a chip multiprocessor (International Journal of Parallel Programming, June 2007, pdf) (M. Gschwind)

Cell Broadband Engine: Enabling density computing for data-rich environments (ISCA 2006 tutorial, Boston, MA, pdf)

Synergistic processing in Cell's multicore architecture (IEEE Micro, March 2006, pdf) (M. Gschwind, P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, T. Yamazaki)

Chip multiprocessing and the Cell Broadband Engine (IBM, February 28, 2006, pdf) (M. Gschwind)

Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture (IBM Systems Journal, Volume 45, Number 1, 2006. HTML and pdf formats)

A novel SIMD architecture for the Cell heterogeneous chip multiprocessor (IBM, 2005, pdf)

Optimizing compiler for the Cell processor (IEEE, September 2005, pdf)

Power efficient architecture and the Cell processor (HPCA-11, February 2005.) (P. Hofstee)


Selected Patents

Method and apparatus for creating and executing integrated executables in a heterogeneous architecture
M. Gschwind, K. O'Brien, K. O'Brien, V. Salapura
07/10/2007. Issued as US patent 7243333

Method and apparatus for overlay management within an integrated executable for a heterogeneous architecture
M. Gschwind, K. O'Brien, K. O'Brien, V. Salapura
05/22/2007. Issued as US patent 7222332

Method and apparatus for mapping debugging information when debugging integrated executables in a heterogeneous architecture
M. Gschwind, K. O'Brien, K. O'Brien, V. Salapura
05/01/2007. Issued as US patent 7213123

Method and apparatus for enabling access to global data by a plurality of codes in an integrated executable for a heterogeneous
M. Gschwind, K. O'Brien, K. O'Brien, V. Salapura
04/03/2007. Issued as US patent 7200840

Method and system for maintaining coherency in a multiprocessor system by broadcasting TLB invalidated entry instructions
E. Altman, P. Capek, M. Gschwind, P. Hofstee, J. Kahle, R. Nair, S. Sathaye, J. Wellman
11/29/2005. Issued as US patent 6970982

Method and apparatus for software-assisted thermal management for electronic systems
M. Gschwind, V. Salapura
09/20/2005. Issued as US patent 6948082

Symmetric multi-processing system utilizing a DMAC to allow address translation for attached processors
E. Altman, P. Capek, M. Gschwind, P. Hofstee, J. Kahle, R. Nair, S. Sathaye, J. Wellman
06/14/2005. Issued as US patent 6907477

Method and apparatus for aligning memory write data in a microprocessor
M. Gschwind, M. Hopkins, P. Hofstee
05/23/2005. Issued as US patent 7051168

Reduction of interrupts in remote procedure calls
P. Hofstee, R. Nair
03/08/2005. Issued as US patent 6865631

SIMD datapath coupled to scalar/vector/address/conditional data register file with selective subpath scalar processing mode
M. Gschwind, P. Hofstee, M. Hopkins
01/04/2005. Issued as US patent 6839828

Token-based DMA
P. Hofstee, R. Nair, J. Wellman
11/16/2004. Issued as US patent 6820142


Symmetric multi-processing system with attached processing units being able to access a shared memory without being structurally configured with an address translation mechanism
E. Altman, P. Capek, M. Gschwind, P. Hofstee, J. Kahle, R. Nair, S. Sathaye, J. Wellman, M. Suzuoki, T. Yamazaki
08/17/2004. Issued as US patent 6779049


Pipeline control for high-frequency pipelined designs
M.K. Gschwind
02/20/2000. Issued as US patent 6192466


News and Information

Cell BE challenge '07: Beyond gaming! (IBM, 2007)

Cell's nine processors make it a supercomputer on a chip
Emerging technology winner. (IEEE Spectrum, January 2006)

Cell moves into the limelight (In-Stat, February 14, 2005)


Innovator's corner  

Michael Gschwind, Researcher

What is the most exciting potential future use for the work you're doing?
I've been involved with the architecture and evaluation of innovative microprocessors since I joined IBM in 1997. With our DAISY and BOA work, we redefined the boundaries of architecture and software to allow software to optimize already-shipped programs so that they adapt to specific customer workloads. This was done by observing the program doing its work, and then re-optimizing the software in the client's system based on the specific customer's program usage. We also helped to evaluate zSeries microarchitecture performance in light of emerging workloads in preparation for the release of Linux for mainframes. But the absolutely most exciting work was the Cell project, because it not only redefined what a microprocessor could do, but what an IBM Research project could do. By working closely with our partners in the product groups and IBM's partners Sony and Toshiba, we defined a new system architecture around our standard Power Architecture which is miles ahead of where competitors are even today. Where other systems have one or two processors on a microprocessor chip, we put nine. Where other processors work on one data item most of the time, our new accelerators always process at least four pieces of data. And the results were immediate -- our partners were thrilled with this project, and we proceeded to jointly develop this into a product in the STI Design Center.

Cell is a broadband architecture, targeted at systems which provide connectivity for future entertainment use: networked games where people can play over the network with others they have never met, videoconferencing that lets people connect with their families via broadband connections using advanced video compression and decompression, and a host of traditional entertainment taking new forms because of the processing power of this architecture. The potential to bring entertainment to millions of families and enrich their lives is just a unique experience for a computer architect who is more used to business applications of computer systems.

What is the most interesting part of your research?
The most exciting part about our work is to see how pieces of technology come together when a team clicks. This was the case in Cell. Everybody brought ideas, technologies, insights and experiences together, and they melded into a single system that was more than the sum of its parts. Cell was a holistic system effort which included new processor architectures and novel packaging and software approaches. And all came together seamlessly. The energy of this team was just amazing. We covered more ground in a few months than our competitors have done in years. They are still talking about building high-throughput chip-multiprocessors in the future, while ours are running in the lab and we are ready to ship them. And all of this for a consumer device like a game console which will ship in volumes outstripping most other computer-based systems. The thought of how a high performance system in every household could revolutionize what is entertainment and how we consume it, is just fascinating. Seeing the result of our work come to fruition in a product so quickly after we conceived it is very gratifying.

What inspired you to go into this field?
I have always had a great interest in how systems work. Whether computer systems, human societies, systems devised by humans, or biological systems, there is always such a fine line between a system that works and a system that fails. And systems always consist of different aspects, and they are all very important, but no part alone, no matter how good, can make the whole work by itself. This attracted me to computer architecture, which is really the field that encompasses how all the different components, such as circuit design, packaging, compiler technology and software, come together.

What is your favorite invention of all time?
I think no other invention has ever done as much to save mankind from illness and misery as vaccinations.

Research team  

Erik Altman

Peter Capek

Michael Gschwind

Peter Hofstee

Ravi Nair

Sumedh Sathaye

JD Wellman
