Cell Architecture
The implementation of a first-generation Cell processor [ISSCC05a] includes a Power-Architecture processor and 8 attached synergistic processor elements connected by an internal, high-bandwidth Element Interconnect Bus (EIB). The figure below shows the organization of the Cell elements and the key bandwidths interconnecting them.

The Power Processor Element (PPE) consists of a 64-bit, multi-threaded Power-Architecture processor with two levels of on-chip cache. The cache preserves global coherence across the system. The processor also supports IBM's Vector Multimedia eXtensions (VMX) to accelerate multimedia applications using its VMX Single Instruction Multiple Data (SIMD) units.
The eight on-chip Synergistic Processor Elements (SPEs) [ISSCC05b] provide a major share of the compute power. An SPE consists of a new processor designed to accelerate media and streaming workloads, its local, non-coherent memory, and its globally coherent DMA engine. The units of an SPE and their key bandwidths are also shown in the figure above.
Most instructions operate in a SIMD fashion on 128 bits of data representing either two 64-bit double-precision floats or long integers, four 32-bit single-precision floats or integers, eight 16-bit shorts, or sixteen 8-bit chars. The 128-bit operands are stored in a 128-entry unified register file. Instructions may source up to three operands and produce one result. The register file has a total of 6 read and 2 write ports.
The memory instructions also access 128 bits of data, with the additional constraint that the accessed data must reside at addresses that are multiples of 16 bytes. Thus, when addressing memory with a vector load or store instruction, the lower 4 bits of the byte address are simply ignored. To facilitate the loading and storing of individual values, such as a char or an integer, there is additional support to extract an individual value from, or merge it into, a 128-bit register.
An SPE can dispatch up to two instructions per cycle to seven execution units, which are organized into even and odd instruction pipes. Instructions are issued in order and routed to their corresponding even or odd pipe. Independent instructions are detected by the issue logic and are dual-issued provided they satisfy the following condition: the first instruction must come from an even word address and use the even pipe, and the second instruction must come from an odd word address and use the odd pipe. When this condition is not satisfied, the two instructions are executed sequentially.
The SPE's 256 KB local memory supports fully pipelined 16-byte accesses (for memory instructions) and 128-byte accesses (for instruction fetch and DMA transfers). Because the memory is single ported, instruction fetches, DMA transfers, and memory instructions compete for the same port. Instruction fetches occur during idle memory cycles, and up to 3.5 fetches may be buffered in the instruction fetch buffer to better tolerate bursts of peak memory usage.
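As a rough illustration of the lane layouts listed above, the following Python sketch treats one 128-bit register as 16 raw bytes and reinterprets it under each element width. The names `LANE_FORMATS`, `lane_count`, `view`, and `simd_add` are hypothetical helpers chosen for this example, not part of any Cell toolchain, and integer wraparound on overflow is not modeled.

```python
import struct

# One 128-bit register modeled as 16 raw bytes (an assumption for illustration).
REGISTER_BYTES = 16

# Element kinds named in the text, mapped to Python struct format codes.
LANE_FORMATS = {
    "double": "d",  # 2 x 64-bit floats
    "int64": "q",   # 2 x 64-bit integers
    "float": "f",   # 4 x 32-bit floats
    "int32": "i",   # 4 x 32-bit integers
    "int16": "h",   # 8 x 16-bit shorts
    "int8": "b",    # 16 x 8-bit chars
}

def lane_count(kind):
    """Number of elements of this kind that fit in one 128-bit register."""
    return REGISTER_BYTES // struct.calcsize(">" + LANE_FORMATS[kind])

def view(register, kind):
    """Reinterpret the 16 register bytes as a tuple of lanes (big-endian)."""
    fmt = ">%d%s" % (lane_count(kind), LANE_FORMATS[kind])
    return struct.unpack(fmt, register)

def simd_add(a, b, kind):
    """Lane-wise add of two registers; one 'instruction', many elements."""
    fmt = ">%d%s" % (lane_count(kind), LANE_FORMATS[kind])
    lanes = [x + y for x, y in zip(struct.unpack(fmt, a), struct.unpack(fmt, b))]
    return struct.pack(fmt, *lanes)

ra = struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)
rb = struct.pack(">4f", 10.0, 20.0, 30.0, 40.0)
rt = simd_add(ra, rb, "float")
print(view(rt, "float"))
print({kind: lane_count(kind) for kind in LANE_FORMATS})
```

The same 16 bytes can be viewed under any of the six layouts; only the interpretation chosen by the instruction changes, not the register itself.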
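The alignment rule and the extract/merge support can be sketched in a few lines of Python. This is a simplified model, assuming a `bytearray` stands in for local memory, big-endian byte order, and naturally aligned scalars; the function names are hypothetical and do not correspond to actual SPE instructions.

```python
QUADWORD = 16  # bytes moved by every load or store

def quadword_address(addr):
    """Address actually accessed: the low 4 bits are ignored."""
    return addr & ~(QUADWORD - 1)

def load_quadword(memory, addr):
    """Return the 16 bytes of the aligned quadword containing addr."""
    base = quadword_address(addr)
    return bytes(memory[base:base + QUADWORD])

def load_scalar(memory, addr, size):
    """Load the enclosing quadword, then extract `size` bytes from it."""
    offset = addr & (QUADWORD - 1)
    if offset % size != 0:
        raise ValueError("this sketch assumes naturally aligned scalars")
    quad = load_quadword(memory, addr)
    return int.from_bytes(quad[offset:offset + size], "big")

def store_scalar(memory, addr, size, value):
    """Read-modify-write: merge the scalar into the quadword, store it whole."""
    base = quadword_address(addr)
    offset = addr - base
    if offset % size != 0:
        raise ValueError("this sketch assumes naturally aligned scalars")
    quad = bytearray(load_quadword(memory, addr))
    quad[offset:offset + size] = value.to_bytes(size, "big")
    memory[base:base + QUADWORD] = quad

memory = bytearray(range(64))
print(hex(quadword_address(0x2B)))
store_scalar(memory, 0x24, 4, 0xDEADBEEF)
print(hex(load_scalar(memory, 0x24, 4)))
```

Note that a scalar store becomes a load, a merge, and a store of the full quadword, which is why scalar-heavy code tends to be more expensive on this kind of design than vector code.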
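The dual-issue condition can be made concrete with a small issue-slot counter. The sketch below is a deliberately simplified model: it assumes every instruction is independent of its neighbor, ignores latencies and stalls, and represents each instruction only by the pipe it uses. The function name `issue_cycles` and its parameters are hypothetical.

```python
def issue_cycles(pipes, start_word=0):
    """Count issue cycles for an in-order instruction stream.

    pipes      -- list of "even" or "odd", the pipe each instruction uses
    start_word -- word address of the first instruction

    Assumes all instructions are independent (dependencies are not modeled).
    """
    cycles = 0
    i = 0
    while i < len(pipes):
        word = start_word + i
        can_pair = (
            word % 2 == 0                 # first instruction at an even word address
            and pipes[i] == "even"        # ... and it uses the even pipe
            and i + 1 < len(pipes)
            and pipes[i + 1] == "odd"     # second (odd address) uses the odd pipe
        )
        i += 2 if can_pair else 1
        cycles += 1
    return cycles

print(issue_cycles(["even", "odd", "even", "odd"]))  # well-aligned pairs
print(issue_cycles(["odd", "even", "odd", "even"]))  # same mix, wrong order
```

Under this model the first stream issues in 2 cycles and the second in 4, even though both contain the same mix of instructions. This suggests why instruction placement, for example padding so that even-pipe instructions land on even word addresses, matters for SPE code scheduling.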
The branch architecture allows for a branch hint instruction to override the default branch prediction policy. In addition, the branch hint instruction prefetches up to 32 instructions starting from the branch target, so that a correctly-hinted taken branch incurs no penalty. One of the instruction fetch buffers is reserved for the branch hint mechanism. In addition, there is extended support for eliminating short branches using bit-wise select instructions.
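To illustrate how a bit-wise select can replace a short branch, the following sketch models a single 32-bit lane in Python. It is an illustration of the general technique, not the SPE instruction itself: `select_bits`, `greater_mask`, and `branchless_max` are hypothetical names, and values are treated as unsigned 32-bit integers.

```python
MASK32 = 0xFFFFFFFF

def select_bits(a, b, mask):
    """Per bit: take the bit from b where mask is 1, from a where mask is 0."""
    return ((a & ~mask) | (b & mask)) & MASK32

def greater_mask(x, y):
    """All ones if x > y, else all zeros, like a SIMD compare result for one lane."""
    return -(x > y) & MASK32

def branchless_max(x, y):
    """Equivalent to `x if x > y else y`, computed as compare followed by select."""
    return select_bits(y, x, greater_mask(x, y))

print(branchless_max(3, 7), branchless_max(7, 3))
print(hex(select_bits(0xAAAA0000, 0x0000BBBB, 0x0000FFFF)))
```

Both outcomes of the condition are computed and the mask picks one, so there is no branch to predict or hint. The trade-off is that work on the untaken side is still performed, which is why the approach suits short branches.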
Data are transferred between the local memory and the DMA engine in chunks of 128 bytes. The DMA engine can support up to 16 concurrent requests of up to 16K bytes originating either locally or remotely. The DMA engine is part of the globally coherent memory address space; addresses of local DMA requests are translated by an MMU before being sent on the bus. Programs interface with the DMA unit via a channel interface and may initiate blocking as well as non-blocking requests.
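The size limits above imply some bookkeeping whenever a program moves more than 16K bytes. The sketch below shows one way that bookkeeping might look, assuming a transfer is simply cut into consecutive requests; the helper names are hypothetical, and grouping requests into fixed waves of 16 is a simplification (a real queue would presumably accept a new request as soon as an earlier one completes).

```python
MAX_REQUEST_BYTES = 16 * 1024  # largest single DMA request, per the text
MAX_IN_FLIGHT = 16             # concurrent requests the engine supports
CHUNK_BYTES = 128              # granule moved between local memory and the DMA engine

def split_transfer(total_bytes):
    """Split a transfer into request sizes of at most 16K bytes each."""
    sizes = []
    remaining = total_bytes
    while remaining > 0:
        size = min(remaining, MAX_REQUEST_BYTES)
        sizes.append(size)
        remaining -= size
    return sizes

def chunks_for(request_bytes):
    """Number of 128-byte local-memory accesses needed (ceiling division)."""
    return -(-request_bytes // CHUNK_BYTES)

def batches(sizes):
    """Group requests into waves of at most 16 that could be queued together."""
    return [sizes[i:i + MAX_IN_FLIGHT] for i in range(0, len(sizes), MAX_IN_FLIGHT)]

requests = split_transfer(40 * 1024)
print(requests)
print([chunks_for(size) for size in requests])
```

For a 40K-byte transfer this yields two full 16K-byte requests and one 8K-byte request, each of which occupies the single local-memory port for one 128-byte access per chunk.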
