Cell (processor)
The Cell microprocessor was jointly developed by Sony, Toshiba, and IBM. Cell is shorthand for Cell Broadband Engine Architecture. The architecture is intended to be scalable through the use of vector processing, and was designed to serve as both a general purpose processor and a multimedia processor. The first major commercial application of Cell is in Sony's upcoming PlayStation 3 game console. The project's budget is estimated at $400 million.
History
In 2000, Sony Computer Entertainment, Toshiba Corp., and IBM formed an alliance ("STI") to design and build the processor. The STI Design Center in Austin, Texas opened in March 2001. [1] The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4 processor. Over 400 engineers from the three companies worked together in Austin, with critical support from eleven of IBM's design centers. [2]
On May 17, 2005, Sony Computer Entertainment confirmed some specifications of the Cell processor that would ship in the forthcoming PlayStation 3 console. This Cell has one Power processing element (PPE) on the core, together with eight SPEs, one of which is reserved for redundancy to help increase manufacturing yield. All of these are clocked at 3.2 GHz. The chips will be fabricated using a 90 nanometre SOI process at IBM's facility in East Fishkill, New York.
On June 28, 2005, IBM and Mercury Computer Systems announced a partnership agreement to build Cell-based computer systems for embedded applications such as medical imaging, industrial inspection, aerospace and defense, seismic processing, and telecommunications.
Overview
The Cell Broadband Engine, more commonly known as the Cell processor, is a microprocessor designed to bridge the gap between conventional desktop processors (such as the Pentium and PowerPC families) and more specialised high performance processors (such as Nvidia and ATI graphics chips). The name reflects its intended use, namely as a component in current and future digital distribution systems: it may be utilised in high definition displays and recording equipment, as well as in computer entertainment systems for the 'Hi Def' era. The processor should also be well suited to digital imaging systems (medical, scientific, etc.) and to physical simulation (e.g. scientific and structural engineering modelling).
In a simple analysis the Cell processor can be split into four components: external input and output structures; the main processor, called the Power Processing Element or PPE (a two-way simultaneous multithreaded core compliant with the 64-bit Power Architecture); eight fully functional co-processors called the Synergistic Processing Elements or SPEs; and a specialised high bandwidth circular data bus connecting the PPE, the input/output elements and the SPEs, called the Element Interconnect Bus or EIB.
To achieve the high performance needed for mathematically intensive tasks, such as decoding and encoding MPEG streams, generating or transforming three dimensional data, or undertaking Fourier analysis, the Cell processor marries the SPEs and the PPE via the EIB, giving both access to main memory or to other external data storage. The PPE, which is capable of running a conventional operating system, has control over the SPEs and can start, stop, interrupt, and schedule processes running on them; to this end the PPE has additional instructions relating to control of the SPEs. Despite having Turing complete architectures, the SPEs are not fully autonomous and require the PPE to initiate them before they can do any useful work. Most of the 'horsepower' of the system comes from the synergistic processing elements.
The PPE and bus architecture includes various modes of operation giving different levels of protection, allowing areas of memory to be protected from access by specific processes running on the SPEs or PPE.
Both the PPE and SPE are RISC architectures with a fixed-width 32-bit instruction format. The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), and a 128-bit VMX register set. The SPE contains 128-bit registers only. These can be used for scalar data types ranging from 8 bits to 128 bits in size, or for SIMD computations on a variety of integer and floating point formats. System memory addresses for both the PPE and SPE are expressed as 64-bit values, for a theoretical address range of 2^64 bytes. In practice not all of these bits are implemented in hardware, but the address space is extremely large nevertheless. Local store addresses internal to the SPU processor are expressed as a 32-bit word. In documentation relating to Cell, a word is always taken to mean 32 bits, a doubleword 64 bits, and a quadword 128 bits.
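The lane interpretations of a 128-bit quadword can be pictured with an ordinary C union. This is only an illustrative sketch of the data layout (the type name is invented, not Cell-specific code):

```c
#include <stdint.h>
#include <stdio.h>

/* One 128-bit quadword viewed as the scalar and SIMD lane formats
   the SPE register file supports (quadword_t is illustrative). */
typedef union {
    uint8_t  b[16];   /* sixteen 8-bit integers       */
    uint16_t h[8];    /* eight 16-bit integers        */
    uint32_t w[4];    /* four 32-bit words            */
    uint64_t d[2];    /* two 64-bit doublewords       */
    float    f[4];    /* four single precision floats */
    double   g[2];    /* two double precision floats  */
} quadword_t;

int main(void)
{
    quadword_t q = { .w = { 1, 2, 3, 4 } };
    printf("size = %zu bytes\n", sizeof q);          /* 16 bytes = 128 bits */
    printf("low byte of first word = %u\n", q.b[0]); /* lane aliasing */
    return 0;
}
```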
Architecture
While the Cell chip can have a number of different configurations, the basic configuration is composed of one "Power Processor Element" ("PPE") (sometimes called "Processing Element", or "PE"), and multiple "Synergistic Processing Elements" ("SPE") [3]. The PPE and SPEs are linked together by an internal high speed bus dubbed "Element Interconnect Bus" ("EIB"). Due to the nature of its applications, Cell is optimized towards single precision floating point computation. The SPEs are capable of performing double precision calculations, albeit with an order of magnitude performance penalty. More general purpose computing tasks can be done on the PPE.
Power Processor Element
The PPE is based on the Power Architecture, which is also the basis of IBM's line of POWER and PowerPC offerings. The PPE is not intended to perform all primary processing for the system, but rather to act as a controller for the eight SPEs, which handle most of the computational workload. Because of its similarity to other 64-bit PowerPC processors, the PPE can run conventional operating systems, while the SPEs are designed for vectorized floating point code execution. The PPE contains a 16 KB instruction and data Level 1 cache and a 512 KB Level 2 cache. Additionally, IBM has included a VMX (AltiVec) unit in the Cell PPE. [4]
Synergistic Processing Elements (SPE)
Each SPE is composed of a "Streaming Processing Unit" ("SPU") and an SMF unit (DMA, MMU, and bus interface). [5] An SPE is a RISC processor with 128-bit SIMD organization [6] for single and double precision instructions. With the current generation of the Cell, each SPE contains a 256 KiB instruction and data local memory area (called the "local store") which is visible to the PPE and can be addressed directly by software. The architecture allows each SPE to support up to 4 GB of local store memory. The local store does not operate like a conventional CPU cache: it is neither transparent to software, nor does it contain hardware structures that predict which data to load. The SPEs contain a register file of 128 registers, each 128 bits wide, and measure 14.5 mm² on a 90 nm process. An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single precision floating-point numbers in a single clock cycle. Note that the SPU processor cannot directly access system memory; the 64-bit memory addresses formed by the SPU must be passed from the SPU processor to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.
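As a concrete illustration, the sketch below shows how SPU-side code typically requests such a transfer using the MFC intrinsics from IBM's spu_mfcio.h. It assumes the Cell SDK SPU toolchain; the buffer name and tag number are arbitrary choices:

```c
#include <stdint.h>
#include <spu_mfcio.h>   /* MFC DMA intrinsics from the Cell SDK */

#define TAG 3            /* arbitrary DMA tag group, 0-31 */

/* A 16 KiB local store buffer, aligned for efficient DMA. */
static volatile uint8_t buf[16384] __attribute__((aligned(128)));

void fetch_from_system_memory(uint64_t ea)
{
    /* Enqueue a DMA "get" on this SPE's memory flow controller:
       copy 16 KiB at effective address ea into the local store. */
    mfc_get(buf, ea, sizeof(buf), TAG, 0, 0);

    /* Block until every transfer tagged TAG has completed. */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}
```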
In one typical usage scenario, the system will load the SPEs with small programs (similar to threads), chaining the SPEs together to handle each step in a complex operation. For instance, a set-top box might load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance. The PPE's VMX (AltiVec) unit is fully pipelined for double precision floating point and can complete two double precision operations per clock cycle, which translates to 6.4 GFLOPS at 3.2 GHz; or eight single precision operations per clock cycle, which translates to 25.6 GFLOPS at 3.2 GHz[7].
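The SPE figure follows directly from the vector width and issue rate given above: each SPE can complete one four-lane single precision fused multiply-add per clock cycle, and counting the multiply and the add as separate operations gives 4 lanes × 2 operations × 3.2 GHz = 25.6 GFLOPS.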
Compared to a modern personal computer, the relatively high overall floating point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in desktop CPUs like the Pentium 4 and the Athlon 64. However, comparing floating point abilities alone is a one-dimensional and application-specific metric: unlike a Cell processor, desktop CPUs are better suited to the general purpose software usually run on personal computers. Moreover, Cell is optimized for single-precision calculations; for the double-precision work common on personal computers, Cell performance drops by an order of magnitude, to levels similar to desktop processors.
Recent tests by IBM [8] show that the SPEs can reach 98% of their theoretical peak performance running optimized parallel matrix multiplication.
Differences between VMX and SPU
The VMX technology is conceptually similar to the vector model provided by the SPU processors, but there are many significant differences.
Feature | VMX | SPU |
---|---|---|
word size | 32 bits | 32 bits |
number of registers | 32 | 128 |
register width | 128 bit quadword | 128 bit quadword |
integer formats | 8, 16, 32 | 8, 16, 32 |
saturation support | yes | no |
byte ordering | big (default), little | big endian |
floating point modes | Java, non-Java | single precision, IEEE double |
memory alignment | quadword only | quadword only |
The VMX Java mode conforms to the Java Language Specification 1 subset of the default IEEE standard, extended to include IEEE and C9X compliance where the Java standard falls silent. Non-Java mode relaxes these requirements (for example in the handling of denormal values), which may allow faster execution at the cost of strict compliance.
Quadword (i.e. four 32-bit words, or 128 bits) alignment is on 16 byte (128 bit) boundaries; that is, the low four address bits are zero.
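The check is a single mask in C; a minimal, portable sketch (not Cell-specific):

```c
#include <stdint.h>
#include <stdbool.h>

/* True when the pointer sits on a 16 byte (quadword) boundary,
   i.e. its low four address bits are zero. */
static bool is_quadword_aligned(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;
}
```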
The IBM PPE Vector/SIMD manual does not define operations for double precision floating point, though IBM has published material implying certain double precision performance numbers associated with the Cell PPE VMX technology.
Porting VMX code to the SPU
There is a great body of code which has been developed for other IBM Power processors that could potentially be adapted and recompiled to run on the SPU. This code base includes VMX code that runs under the PowerPC version of Apple's OS X, where it is better known as Altivec. Depending on how many VMX specific features are involved, the adaptation involved can range anywhere from straightforward, to onerous, to completely impractical. The most important workloads for the SPU generally map quite well.
In some cases it is possible to port existing VMX code directly. If the VMX code is highly generic (makes few assumptions about the execution environment) the translation can be relatively straightforward. The two processors specify a different binary code format, so recompilation is required at a minimum. Even where instructions exist with the same behaviours, they do not have the same instruction names, so this must be mapped as well. IBM provides compiler intrinsics which take care of this mapping transparently as part of the development toolkit.
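For instance, a vector add is spelled differently on each side even though the operation is identical. The hedged sketch below (illustrative file layout, real intrinsic names) shows one common way to bridge the two dialects with the preprocessor:

```c
/* vadd.c - one source, two SIMD dialects. vec_add is the VMX/AltiVec
   intrinsic; spu_add is its SPU counterpart. */
#ifdef __SPU__
#include <spu_intrinsics.h>
vector float add4(vector float a, vector float b)
{
    return spu_add(a, b);    /* SPU intrinsic */
}
#else
#include <altivec.h>
vector float add4(vector float a, vector float b)
{
    return vec_add(a, b);    /* VMX (AltiVec) intrinsic */
}
#endif
```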
In many cases, however, a directly equivalent instruction does not exist. The workaround might be obvious or it might not. For example, if saturation behaviour is required on the SPU, it can be coded by adding additional SPU instructions to accomplish this (with some loss of efficiency). At the other extreme, if Java floating point semantics are required, this is almost impossible to achieve on the SPU processor. To achieve the same computation on the SPU might require an entirely different algorithm which needs to be written from scratch.
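As a sketch of the first case: VMX provides saturation in hardware (for example vec_adds), while an SPU port must spell the clamp out. The scalar C below shows the logic being emulated; an actual SPU version would perform the same clamp across all sixteen byte lanes with a few vector instructions:

```c
#include <stdint.h>

/* Saturating unsigned byte add: results above 255 clamp to 255
   instead of wrapping around. */
static uint8_t add_u8_saturate(uint8_t a, uint8_t b)
{
    uint16_t sum = (uint16_t)a + (uint16_t)b;  /* widen, then clamp */
    return (sum > UINT8_MAX) ? UINT8_MAX : (uint8_t)sum;
}
```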
The most important conceptual similarity between VMX and the SPU architecture is that they support the same vectorization model. For this reason, most algorithms successfully adapted to AltiVec will usually adapt successfully to the SPU architecture as well.
Element Interconnect Bus (EIB)
The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE coprocessors, and two off-chip I/O interfaces, for a total of 12 participants. The EIB also includes an arbitration unit which functions as a set of traffic lights. In some documents IBM refers to EIB bus participants as 'units'.
The EIB is presently implemented as a circular ring composed of four 16B-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate, the effective channel rate is 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96B per clock (12 concurrent transactions × 16 bytes wide / 2 system clocks per transfer). While this figure is often quoted in IBM literature, it is unrealistic to simply scale this number by processor clock speed. The arbitration unit imposes additional constraints, discussed in the Bandwidth Assessment section below.
IBM Senior Engineer David Krolak, EIB lead designer, explains the concurrency model:
- A ring can start a new op every three cycles. Each transfer always takes eight beats. That was one of the simplifications we made, it's optimized for streaming a lot of data. If you do small ops, it doesn't work quite as well. If you think of eight-car trains running around this track, as long as the trains aren't running into each other, they can coexist on the track.
Each participant on the EIB has one 16B read port and one 16B write port. The limit for a single participant is to read and write at a rate of 16B per EIB clock (often regarded as 8B per system clock for simplicity). Each SPU processor also contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely, providing additional flexibility in the control model.
Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the overall performance of the EIB as they reduce available concurrency.
Despite IBM's original desire to implement the EIB as a more powerful cross-bar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns where the EIB is able to function at high concurrency levels.
David Krolak explains:
- Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is architected, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just wasn't enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.
Bandwidth Assessment
For the sake of quoting performance numbers, we will assume a Cell processor running at 3.2 GHz, the clock speed most often cited.
At this clock frequency each channel flows at a rate of 25.6 GB/s. Viewing the EIB in isolation from the system elements it connects, achieving twelve concurrent transactions at this flow rate works out to an abstract EIB bandwidth of 307.2 GB/s. Based on this view many IBM publications depict available EIB bandwidth as "greater than 300 GB/s". This number reflects the peak instantaneous EIB bandwidth blithely scaled by processor frequency.
However, other technical restrictions are involved in the arbitration mechanism for packets accepted onto the bus. The IBM Systems Performance group explains:
- Each unit on the EIB can simultaneously send and receive 16B of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle. Since each snooped address request can potentially transfer up to 128B, the theoretical peak data bandwidth on the EIB at 3.2 GHz is 128B × 1.6 GHz = 204.8 GB/s.
This quote apparently represents the full extent of IBM's public disclosure of this mechanism and its impact. The EIB arbitration unit, the snooping mechanism, and interrupt generation on segment or page translation faults are not well described in the documentation set as yet made public by IBM.
In practice effective EIB bandwidth can also be limited by the ring participants involved. While each of the nine processing cores can sustain 25.6 GB/s of reads and writes concurrently, the memory interface controller (MIC) is tied to a pair of XDR memory channels permitting a maximum flow of 25.6 GB/s for reads and writes combined, and the two I/O controllers are documented as supporting a peak combined input speed of 25.6 GB/s and a peak combined output speed of 35 GB/s.
To add further to the confusion, some older publications cite EIB bandwidth assuming a 4 GHz system clock. This reference frame results in an instantaneous EIB bandwidth figure of 384 GB/s and an arbitration-limited bandwidth figure of 256 GB/s.
All things considered, the theoretical 204.8 GB/s number most often cited is the best one to bear in mind. The IBM Systems Performance group has demonstrated SPU-centric data flows achieving 197 GB/s on a Cell processor running at 3.2 GHz, so this number is a fair reflection of practice as well.
Memory controller and I/O
Cell contains a dual channel next-generation Rambus XIO macro which interfaces to Rambus XDR memory. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32 bit channels can provide a theoretical maximum of 25.6 GB/s.
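The quoted figure follows from the pin count and signalling rate: two 32-bit channels supply 64 data pins, and 64 pins × 3.2 Gbit/s per pin = 204.8 Gbit/s, which divided by 8 bits per byte gives 25.6 GB/s.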
The system interface used in Cell, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 "lanes", each lane being a unidirectional 8-bit wide point-to-point path. Five lanes are inbound to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
Broadband Engine
Much less information is available about the 'broadband engine', most coming from patent applications. It is believed that Cell allows for multiple processing cores to be put onto one die, and the patent shows four cores on one die. Sony, Toshiba, and IBM have claimed that they intend to scale the processor for various uses, both low-end and high-end, by varying the number of cores on the chip, the number of units in a single core, and by linking multiple chips to each other via network or memory bus.
Architecture compared
In some ways the Cell system resembles early Seymour Cray designs in reverse. The famed CDC 6600 used a single very fast processor to handle the mathematical calculations, while a series of ten slower systems ran smaller programs that kept the main memory fed with data. In the Cell the problem has been reversed: given the complex encodings used throughout the industry, reading data is no longer the difficult part; today the problem is efficiently decoding that data into an ever-less-compressed version as quickly as possible.
Modern graphics cards have multiple elements very similar to the SPEs, known as vertex shader units, with attached high speed memory. Programs, known as shaders, are loaded onto the units to process the basic geometry fed from the computer's CPU, apply styles, and display the result.
The main differences are that the Cell's SPEs are much more general purpose than shader units, and the ability to chain the SPEs under program control offers considerably more flexibility, allowing the Cell to handle graphics, sound, or anything else.
Possible applications
Blade server
IBM has already presented a blade server prototype based on two Cell processors, running the 2.6.11 Linux kernel. [10] The processors ran at 2.4–2.8 GHz. IBM expects soon to run them at 3.0 GHz, providing 200 GFLOPS single-precision floating point performance per CPU (or 400 GFLOPS per board). IBM also expects to arrange seven blades in a single rackmount chassis (similar to their BladeCenter product line) for a total performance of 2.8 TFLOPS (or 284 GFLOPS in double precision) per chassis. However, the performance numbers released by IBM are still theoretical, and the real-world performance may fall significantly short of theoretical expectations.
IBM's H-series blade servers will incorporate the Cell processor as of March 2006.
Mercury Computer Systems, Inc. has released preproduction blades with Cell microprocessors, which are currently shipping.
Console videogames
Sony's PlayStation 3 video game console will contain the first production application of the Cell processor, clocked at 3.2 GHz and containing seven usable SPEs. An eighth SPE is manufactured but disabled at the factory, allowing Sony to increase the yield of the processor manufacture.
Home cinema
Reportedly, Toshiba is considering producing HDTVs using Cell. They have already presented a system to decode 48 MPEG-2 streams simultaneously on a 1920×1080 screen. [11][12] This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.
Software engineering
Due to the flexible nature of the Cell, there are several possibilities for the utilization of its resources: [13]
Job queue
The PPE maintains a job queue, schedules jobs in SPEs, and monitors progress. Each SPE runs a "mini kernel" whose role is to fetch a job, execute it, and synchronize with the PPE.
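A conceptual sketch of this model, emulated in portable C with POSIX threads standing in for SPEs (all names are illustrative; real Cell code would use the SPE runtime libraries instead):

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_SPES 8
#define NUM_JOBS 32

typedef struct { int id; } job_t;   /* a unit of work */

static job_t jobs[NUM_JOBS];
static int next_job = 0;            /* index of next unclaimed job */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* The per-SPE "mini kernel": fetch a job, execute it, repeat. */
static void *spe_mini_kernel(void *arg)
{
    long spe = (long)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        int mine = (next_job < NUM_JOBS) ? next_job++ : -1;
        pthread_mutex_unlock(&queue_lock);
        if (mine < 0) break;        /* queue drained */
        printf("SPE %ld runs job %d\n", spe, jobs[mine].id);
    }
    return NULL;
}

int main(void)
{
    pthread_t spes[NUM_SPES];
    for (int i = 0; i < NUM_JOBS; i++) jobs[i].id = i;
    for (long i = 0; i < NUM_SPES; i++)
        pthread_create(&spes[i], NULL, spe_mini_kernel, (void *)i);
    for (int i = 0; i < NUM_SPES; i++)   /* the "PPE" waits for completion */
        pthread_join(spes[i], NULL);
    return 0;
}
```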
Self-multitasking of SPEs
The kernel and scheduling are distributed across the SPEs. Tasks are synchronized using mutexes or semaphores, as in a conventional operating system. Ready-to-run tasks wait in a queue for an SPE to execute them. The SPEs use shared memory for all tasks in this configuration.
Stream processing
Each SPE runs a distinct program. Data comes from an input stream and is sent to the SPEs. When an SPE has finished processing, the output data is sent to an output stream.
This provides a flexible yet powerful architecture for stream processing, allowing each SPE to be scheduled explicitly. Other processors can also perform this kind of processing, but often with limitations on the kernels that can be loaded.
Open source software development
As of 2005-06-23, patches enabling Cell support in the Linux kernel were submitted for inclusion by IBM developers [14]. Arnd Bergmann (one of the developers of the aforementioned patches) also described the Linux-based Cell architecture at LinuxTag 2005. [15]
Both the PPE and the SPEs are programmable in C/C++ using a common API provided by libraries. According to Sony, a compiler, debugger, IDE, performance analyzer, and Cell emulator should be made available soon. [16] IBM has developed a pseudo-filesystem for Linux, coined "Spufs", that simplifies access to and use of the SPE resources.
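On the PPE side, launching code on an SPE under Linux looks roughly like the following sketch, which assumes the libspe2 library from IBM's Cell SDK and an SPE program object embedded at link time (spe_hello is an illustrative name):

```c
#include <libspe2.h>
#include <stdio.h>

extern spe_program_handle_t spe_hello;   /* SPE ELF embedded at link time */

int main(void)
{
    /* Create a context representing one logical SPE. */
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    if (ctx == NULL) { perror("spe_context_create"); return 1; }

    /* Load the SPE program image into the context. */
    if (spe_program_load(ctx, &spe_hello)) {
        perror("spe_program_load");
        return 1;
    }

    /* Run the SPE program to completion on the calling thread. */
    unsigned int entry = SPE_DEFAULT_ENTRY;
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
        perror("spe_context_run");

    spe_context_destroy(ctx);
    return 0;
}
```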
IBM is currently maintaining the Linux kernel and GDB ports, while Sony maintains the GNU toolchain (GCC, binutils). [17].
In November 2005, IBM released the "Cell Broadband Engine (CBE) Software Development Kit" Version 1.0, consisting of a simulator and assorted tools, on its web site. Development versions of the latest kernel and tools for Fedora Core 4 are maintained at the Barcelona Supercomputing Center website. [18]
With the release of kernel version 2.6.16 on 20 March 2006, the Linux kernel officially supports the Cell processor.
Patents
Corporation | Filings |
---|---|
International Business Machines | 3,248 |
Matsushita Electric Industrial | 1,934 |
Canon Kabushiki Kaisha | 1,805 |
Hewlett-Packard Development | 1,775 |
It is well known that IBM holds one of the world's largest patent portfolios. The United States Patent and Trademark Office proclaimed in 2004, "For the twelfth consecutive year, IBM received more patents than any other private sector organization."
A selection of the patents filed by the Cell design team most pertinent to the unique conception and features of Cell are detailed below. This material offers an interesting perspective on how IBM viewed the novelties involved in the Cell processor during the initial design period.
U.S. 6,779,049 — distributed DMA translation
United States patent 6779049 was granted on August 17, 2004 to Erik R. Altman and ten others, including Dr Peter Hofstee, Cell Chief Scientist and Cell Synergistic Processor Chief Architect, bearing the rather ponderous title Symmetric multi-processing system with attached processing units being able to access a shared memory without being structurally configured with an address translation mechanism.
The abstract cites one embodiment of the invention as consisting of an SMP system with shared memory with "a plurality of processing elements coupled to the shared memory". In the Cell design inspired by this patent, the processing elements were fleshed out as the PPE core and the eight SPE cores.
The abstract then further defines the nature of a processing element: "Each of the plurality of processing elements comprises a processing unit, a direct memory access controller and a plurality of attached processing units." All nine Cell cores have a DMA controller as this patent describes. For the PPE core, the processing unit is the Power architecture PPU execution engine; for the SPE cores, it is the SPU execution engine.
Finally, it defines each DMA controller as comprising "an address translation mechanism thereby enabling each associated attached processing unit to access shared memory in a restricted manner without an address translation mechanism". The vague phrase "restricted manner" is key. For the Cell, this statement underscores that the predominant memory access mechanism is DMA requests set up by the DMA controller contained within each core. For the SPE cores, the DMA mechanism is in fact the only mechanism for making direct access to system memory. Whether the PPE funnels all varieties of system memory request exclusively through its internal DMA controller is not stipulated by IBM in their Cell overview materials.
This invention in part stemmed from an observation by IBM that GPU processors typically achieve higher performance per watt because they are not burdened with address translation overheads on every memory access. GPU working memory was adapted to Cell as the SPE "local store". No address translation is performed when the SPU performs memory operations on local store. Memory translation is instead performed by the SPE's internal DMA controller on the granularity of DMA requests.
It is important to note that this significantly reduces the frequency of memory address translations performed. Instead of one translation per load or store instruction, address translation is performed once per DMA request. A single DMA request can be up to 16KB in length, so a transfer that would otherwise incur a translation on each quadword load (roughly a thousand of them for 16KB) incurs just one. However, Cell is tuned for DMA requests of 128B in length, so this request length is especially common.
The patent primarily concerns the management and coherence of the distributed TLB tables. As an indication of the level of the patent, Fig. 6 contains a flow chart of seven boxes which flow in sequence with no decision points.
This is not a reliable legal account; for that you must consult a qualified patent lawyer. The claims largely read on what each of the flow chart's steps might entail, plus ineluctable elaborations such as handling a range of addresses.
See Also
See also United States patent 6907477, granted a year later on June 14, 2005, also to Erik R. Altman and many others, bearing the less ponderous title Symmetric multi-processing system utilizing a DMAC to allow address translations for attached processors. Same general idea, newfangled claims; it reuses Fig. 6 and focuses more on the address translation mechanism itself, such as translating a range of virtual addresses into a corresponding range of physical addresses.
U.S. 6,760,819 — DMA coherence
United States patent 6760819 was granted to Sang Hoo Dhong and four others on July 6, 2004 with IBM as the assignee under the title Symmetric multiprocessor coherence mechanism.
This patent concerns address coherency mechanisms. While less specific to Cell than the distributed DMA address translation (U.S. patent 6779049), it again reveals IBM's preoccupation with efficient multiprocessor coherency mechanisms during the Cell design period.
The abstract describes the patent as involving a mechanism to reduce "the number of coherency busses" associated with "snoop resolution and coherency operations" on a multi-level processor cache. In this approach, the L2 cache contains a copy of the L1 cache tags which permits both sets of tags to be snooped at the L2 cache in one operation, eliminating the need for an L1 cache-coherency bus. The abstract also states that "updates to the coherency states of the copy of the L1 directory are mirrored in the L1 directory and L1 cache" without describing how this information is conveyed between the caches; perhaps on an unnamed bus of less complexity than the coherency bus replaced.
It is not clear whether the Cell implementation exploits this technique or not. IBM was engaged in the design of other cores during the same time period (such as the PowerPC cores in the Xbox 360) which might exploit this technique instead. Most likely IBM adopted it for all their subsequent Power core designs featuring multi-level caches, if it proved workable in practice.
U.S. 6,820,142 — token based DMA
United States patent 6820142 was granted to Peter Hofstee and two others on November 16, 2004 with IBM as the assignee under the title Token based DMA.
This is another patent pertaining to the problem of orchestrating multiple cores each with their own DMA unit. Without governance, the DMA controllers are capable of overwhelming available resources leading to contention or starvation effects.
In this approach, a "master controller" grants tokens to the processing elements to access "the shared memory for a particular duration of time at a unique deterministic point in time". The emphasis in this patent is on determinism, a system characteristic most important from the real-time performance perspective. However, the same technique also serves related concerns such as priority and fairness.
In the actual Cell design, this basic mechanism is greatly refined. The system is partitioned into resource allocation groups (RAGs), down to the granularity of individual memory banks on the XDR memory devices. Note that the use of the reservation policy is optional from the perspective of an individual SPE. The mechanism exists to aid the SPE cores in sharing resources effectively if they choose to do so. Untrusted code running on an SPE core is capable of ignoring the "master controller" token mechanism and swamping other elements on the EIB bus with nuisance transactions.
U.S. 6,785,841 — redundant elements
United States patent 6785841 was granted to Chekib Akrout and two others on August 31, 2004 with IBM as the assignee under the title Processor with redundant logic.
The patent bears directly on Cell from the cost-to-manufacture perspective. The last sentence of the History of Related Art declares "Thus, for conventionally designed processor chips, redundancy has typically not been used with great success. It would be desirable, therefore, to design a processor device with cost effective redundant elements".
The direct bearing on Cell is revealed by the statement "Each of the attached processors may comprise a single instruction multiple data (SIMD) processor such as a vector processor or an array processor" characterizing the SPU design exactly.
In the design of the Cell EIB, each bus participant has an element identification number. The Summary of Invention concludes "Disabling the non-functional processor may include altering the information in the attached processor ID register while enabling the redundant processor may include programming the processor ID of the redundant processor to the value of the non-functional processor. Disabling the non-functional attached processor may further include electrically disconnecting the attached processor such as by destroying one or more fuseable links."
Cell chips sold into the PS3 market will contain seven functional SPE cores. Some sources claim that processors sold into the consumer appliance market, such as HDTV sets made by Toshiba, will have six functional SPE cores. Cell chips with eight functional SPE cores are high-graded into the scientific workstation market.
Note that while the SPE cores are identical to one another, the location of an SPE core relative to other participants on the EIB bus does have an impact on the timing and efficiency of EIB transactions; this procedure for eliminating defective SPU cores is therefore not absolutely transparent to the performance of the device in all cases.
U.S. 6,839,828 — selective scalar subpath
United States patent 6839828 was granted to Michael Gschwind and two others on January 4, 2005 with IBM as the assignee under the ponderous title SIMD datapath coupled to scalar/vector/address/conditional data register file with selective subpath scalar processing mode.
On one level this patent says that if you have four beer taps (vector pub) and a customer orders only one beer (scalar customer), you fill the order by using only one beer tap (selective subpath). The rest of the patent governs how to keep the beer cold in the three inactive kegs without burning unnecessary power on refrigeration, which claim 3 depicts as "The processor of claim 1, further including a power savings unit which disables functional units not used for processing a given instruction."
Amazingly, this is more difficult than it sounds. In one of their early disclosures about Cell, Dr Hofstee admitted that the circuitry involved in managing this decision consumed almost as many resources as it saved, so they ended up discarding the mechanism in the inaugural device. The technique might show up again in a future Cell design if the balance of cost to savings improves.
The question arose because of an unusual feature of the SPU processor: the unified register set in which all registers are 128 bits wide. For many reasons, these registers are often used to hold word scalars, ignoring the other three vector positions. As an example, all address calculations on the local store are performed using word scalars. In theory it ought to save power to disable execution units for the unused vector elements when operating on scalar values; in practice, as noted above, it did not.
See Also
See also United States patent 6,785,841 which explains how to designate a beer tap that doesn't work (defective element) as being "out of service".
U.S. 6,865,631 — reduction of RPC interrupts
United States patent 6865631 was granted to Peter Hofstee and Ravi Nair on March 8, 2005 with IBM as the assignee under the title Reduction of interrupts in remote procedure calls.
Again we have a patent centered around Cell's novel DMA structure. Not every method of coordinating the SPE cores would be characterized as a remote procedure call (RPC). This concept arises most naturally when the SPE cores are orchestrated by the microtasking kernel approach. Each SPE microtask can be regarded as a procedure call which binds the code and data together.
In prior art, if the subordinate processors receiving the remote procedure call are implemented as separate chips—which is common—the normal mechanism to alert the master process of task completion is to signal an external interrupt. Some processors are specially designed to feature lightweight interrupt handlers taking perhaps a few dozen instructions to handle a simple interrupt request. Other processors, especially very fast processors, are much less agile in handling interrupts; the cost can run to hundreds of processor cycles. It is likely that the Cell PPE—a very fast processor—would take a relatively large performance hit from each interrupt handled. For this reason, as IBM concludes the Background Information section, "It would therefore be desirable to develop an SMP system where the APU(s) do not interrupt the processing unit upon completion of its task(s) in one or more remote procedure calls." [emph. added]
This patent describes an alternate method devised in which the DMA controllers are used to monitor task status by monitoring dedicated completion signals. Presumably each SPE has a completion signal line that runs to the PPE DMA controller, though the implementation at the hardware level is not specified by the patent.
Unlike a fast execution core, a DMA controller is extremely agile. Even with many DMA requests enqueued, the DMA controller is usually rate limited by available bandwidth or concurrency. In the Cell design, each DMA controller can have a maximum of two transfers active concurrently; any other enqueued requests are stalled. It is not especially challenging to have the DMA controller monitor signal lines (which the patent strangely depicts as "polling") and awaken dormant transfer requests within the request queue once the signal is received.
This is a highly technical patent which spells out the gory details involved. The most complex scenario presented is where the RPC completion events are distributed into multiple DMA queues. This is an elaborate scenario where a PPE thread delegates a microtask to a SPE core and then delegates the completion logic to the SPE's local DMA controller.
Fundamentally this patent describes an esoteric synchronization mechanism appropriate to a design rich in integrated DMA controllers. The solution does not exist in prior art because the problem did not exist in prior practice. That is not an offhand remark: it is quite possible that the Cell architecture's DMA-centric design was partially motivated by the opportunity to create so many strange new problems with patentable solutions such as this patent describes.
U.S. 6,924,802 — SIMD function interpolation
United States patent 6924802 was granted to Gordon Fossum and three others on August 2, 2005 with IBM as the assignee under the title Efficient function interpolation using SIMD vector permute functionality.
Unlike the majority of patents pertaining to Cell, this patent does not involve the hardware as such. IBM had Cell in mind as a visualization workstation. As a result, they explored ways to exploit Cell to its best advantage. As IBM notes in their Description of Related Art, "Unfortunately, a major bottleneck exists in the calculation and estimation of functions that generate the visual display data. An advance in the calculation and estimation of functions that generate the visual display data would allow for substantial improvement in visual display system performance" citing "representative examples" as including "sin(x), cos(x), log2(x) and exp2(x)" while noting that "many others are involved in the calculation of visual display data."
This patent describes a way to exploit the powerful vector permute instruction in the VMX and SPU instruction sets to manage coefficients so as to interpolate these functions more rapidly. The patent therefore pertains to running a certain kind of software program on the Cell hardware. The vector permute instruction did not originate with Cell: it was part of the original AltiVec instruction set designed by Apple Computer, IBM and Motorola (the AIM alliance). The AltiVec trademark is owned solely by Motorola. When the technology sharing agreement ended, IBM did not license any AltiVec technology from Motorola. Instead, IBM reverse-engineered the AltiVec instructions and included them in a larger instruction set termed VMX, from which the Cell SIMD instruction sets derive.
What IBM recognizes in this patent is not the originality of vector permute but a particular use of this instruction to enhance the speed of visual presentation.
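The underlying idea, stripped of the SIMD machinery, is ordinary table-driven interpolation. The scalar sketch below illustrates it for a sine function; it is purely illustrative and not the patented vector-permute formulation, which selects coefficients from a table held in registers with a single permute:

```c
#include <math.h>
#include <stdio.h>

#define TWO_PI   6.28318530718f
#define SEGMENTS 64              /* table resolution, illustrative */

static float table[SEGMENTS + 1];

static void init_table(void)
{
    for (int i = 0; i <= SEGMENTS; i++)
        table[i] = sinf(TWO_PI * i / SEGMENTS);
}

/* Approximate sin(TWO_PI * x) for x in [0,1) by selecting a table
   segment and interpolating linearly within it. */
static float sin_approx(float x)
{
    float t = x * SEGMENTS;
    int   i = (int)t;            /* segment selector */
    float f = t - (float)i;      /* fraction within the segment */
    return table[i] + f * (table[i + 1] - table[i]);
}

int main(void)                   /* compile with -lm */
{
    init_table();
    printf("approx %f  exact %f\n",
           sin_approx(0.13f), sinf(TWO_PI * 0.13f));
    return 0;
}
```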
The vector permute instruction can be viewed as a kind of mathematical function, as can the display functions which it is used to interpolate. To the extent that this invention is purely mathematical in nature, this type of patent would be regarded as controversial. Nevertheless, the U.S. Patent office grants many patents of this nature irrespective of how well they might or might not hold up in court.
It is, however, somewhat unusual to see the use of a computational primitive patented as such.
Implementation
First edition Cell on 90 nm CMOS
IBM has published information concerning two different versions of Cell in this process: an early engineering sample designated DD1, and an enhanced version designated DD2, intended for production.
Designation | Die Area | First Disclosed | Enhancement |
---|---|---|---|
DD1 | 221 mm² | ISSCC 2005 | |
DD2 | 235 mm² | Cool Chips April 2005 | enhanced PPE core |
The main enhancement in DD2 was a small lengthening of the die to accommodate a larger PPE core, which is reported to "contain more SIMD/vector execution resources"[19].
Some preliminary information released by IBM references the DD1 variant. As a result some early journalistic accounts of the Cell's capabilities now differ from production hardware.
Cell floorplan
PowerPoint material accompanying an STI presentation given by Dr Peter Hofstee includes a photograph of the DD2 Cell die, overdrawn with functional unit boundaries captioned by name, which reveals the breakdown of silicon area by function unit as follows:
Cell function unit | Area (%) | Description |
---|---|---|
XDR interface | 5.7 | interface to Rambus system memory |
memory controller | 4.4 | manages external memory and L2 cache |
512 KiB L2 cache | 10.3 | cache memory for the PPE |
PPE core | 11.1 | PowerPC processor |
test | 2.0 | unspecified "test and decode logic" |
EIB | 3.1 | element interconnect bus linking processors |
SPE (each) x 8 | 6.2 | synergistic coprocessing element |
I/O controller | 6.6 | external I/O logic |
Rambus FlexIO | 5.7 | external signalling for I/O pins |
SPE floorplan
Additional details concerning the internal SPE implementation have been disclosed by IBM engineers, including Peter Hofstee, IBM's chief architect of the synergistic processing element, in a scholarly IEEE publication.[20]
This document includes a photograph of the 2.54 × 5.81 mm SPE, as implemented in 90 nm SOI. In this technology, the SPE contains 21 million transistors, of which 14 million are contained in arrays (a term presumably designating the register files and the local store) and 7 million are logic. The photograph is overdrawn with functional unit boundaries captioned by name, which reveals the breakdown of silicon area by function unit as follows:
SPU function unit | Area (%) | Description | Pipe |
---|---|---|---|
single precision | 10.0 | single precision FP execution unit | even |
double precision | 4.4 | double precision FP execution unit | even |
simple fixed | 3.25 | fixed point execution unit | even |
issue control | 2.5 | feeds execution units | |
forward macro | 3.75 | feeds execution units | |
GPR | 6.25 | general purpose register file | |
permute | 3.25 | permute execution unit | odd |
branch | 2.5 | branch execution unit | odd |
channel | 6.75 | channel interface (three discrete blocks) | odd |
LS0-LS3 | 30.0 | four 64 KiB blocks of local store | odd |
MMU | 4.75 | memory management unit | |
DMA | 7.5 | direct memory access unit | |
BIU | 9.0 | bus interface unit | |
RTB | 2.5 | array built-in test block (ABIST) | |
ATO | 1.6 | atomic unit for atomic DMA updates | |
HB | 0.5 | obscure |
Understanding the dispatch pipes is important for writing efficient code. In the SPU architecture, two instructions can be dispatched (started) in each clock cycle using dispatch pipes designated even and odd. The two pipes provide different execution units, as shown in the table above. As IBM partitioned this, most of the arithmetic instructions execute on the even pipe, while most of the memory instructions execute on the odd pipe. The permute unit is closely associated with memory instructions, as it serves to pack and unpack data structures located in memory into the SIMD multiple-operand format on which the SPU computes most efficiently.
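A sketch of what this means for code, using SPU intrinsics from the Cell SDK (the function and its workload are illustrative): the loads below go to the odd pipe and the arithmetic to the even pipe, so a good compiler can schedule one instruction on each pipe per cycle.

```c
#include <spu_intrinsics.h>

/* Accumulate a * x[i] + y[i] across four quadwords. Loads issue on
   the odd pipe, spu_madd/spu_add on the even pipe, so iterations can
   overlap when the compiler schedules them. */
vector float axpy4(vector float a, const vector float *x,
                   const vector float *y, vector float acc)
{
    for (int i = 0; i < 4; i++) {
        vector float xi = x[i];        /* load: odd pipe */
        vector float yi = y[i];        /* load: odd pipe */
        acc = spu_madd(a, xi, acc);    /* multiply-add: even pipe */
        acc = spu_add(acc, yi);        /* add: even pipe */
    }
    return acc;
}
```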
Unlike other processor designs providing distinct execution pipes, each SPU instruction can dispatch only on its one designated pipe. In competing designs, more than one pipe might be built to handle extremely common instructions such as add, permitting two or more of these instructions to be executed concurrently, which can serve to increase efficiency on unbalanced workflows. In keeping with the SPU's extremely Spartan design philosophy, no execution unit is multiply provisioned.
Understanding the limitations of the restrictive two pipeline design is one of the key concepts a programmer must grasp to write efficient SPU code at the lowest level of abstraction. For programmers working at higher levels of abstraction, a good compiler will automatically balance pipeline concurrency where possible.
SPE power and performance
As tested by IBM under a heavy transformation and lighting workload [average IPC of 1.4], the performance profile of this implementation for a single SPU processor is qualified as follows:
Voltage (V) | Frequency (GHz) | Power (W) | Die Temp (°C) |
---|---|---|---|
0.9 | 2.0 | 1 | 25 |
0.9 | 3.0 | 2 | 27 |
1.0 | 3.8 | 3 | 31 |
1.1 | 4.0 | 4 | 38 |
1.2 | 4.4 | 7 | 47 |
1.3 | 5.0 | 11 | 63 |
The entry for 2.0 GHz operation at 0.9 V represents a low power configuration. Other entries show the peak stable operating frequency achieved with each voltage increment. As a general rule in CMOS circuits, power dissipation rises in a rough relationship to V^2 * F, the square of the voltage times the operating frequency.
Though the wattage measurements provided by the IBM authors lack precision, they convey a good sense of the overall trend. These figures show the part is capable of running above 5 GHz under test lab conditions, though at a die temperature too hot for standard commercial configurations. The first Cell processors made commercially available were rated by IBM to run at 3.2 GHz, an operating speed where this chart suggests an SPU die temperature in a comfortable vicinity of 30 degrees.
Note that a single SPU represents 6% of the Cell processor's die area. The wattage figures given in the table above represent just a small portion of the overall power budget.
Future editions in CMOS
IBM has publicly announced their intention to implement Cell on a future technology below the 90 nm node to improve power consumption. Reduced power consumption could potentially allow the existing design to be boosted to 5 GHz or above without exceeding the thermal constraints of existing products.
Prospects at 65 nm
The most likely design node for a future Cell processor is the upcoming 65 nm node, in which IBM and Toshiba have already invested great sums of money. All things remaining equal, since die area scales roughly with the square of the feature size, a reduction to 65 nm would shrink the existing 230 mm² die on the 90 nm process by a factor of (65/90)² ≈ 0.52, to about 120 mm², greatly reducing IBM's manufacturing cost as well.
Alternately, IBM could elect to partially redesign the chip to take advantage of additional silicon area. The Cell architecture already makes explicit provisions for the size of the local store to vary across implementations. A chip-level interface is available to the programmer to determine local store capacity, which is always an exact binary power.
Based on the reported die area of 30% for the local store in the 90 nm edition, it would be feasible to double the local store to 512 KiB per SPU while leaving the total die area devoted to the SPU processors roughly unchanged. In this scenario, the SPU area devoted to the local store would increase to 60% while the other areas shrink by half. Going this route would reduce heat and increase performance on memory intensive workloads, but without yielding IBM much, if any, reduction in cost of manufacture.
Prospects beyond 65 nm
Process technologies below 65 nm capable of implementing a Cell processor have not been demonstrated. For any number of reasons dictated by technology or market, IBM might elect to discontinue the Cell technology without achieving these nodes. That said, IBM and Sony have made a substantial investment in the Cell technology and such a large investment will normally be realized over several generations of new process technology.
At this stage, the Sony Toshiba IBM alliance (STI) has announced its intention to continue to work together and share innovation beyond the current venture at 65 nm, to the 45 nm and 32 nm process nodes[21]. Cell has not been mentioned by name for implementation in either of these nodes, though if Cell becomes greatly successful it would be surprising if subsequent Cell editions were not someday forthcoming in them.
Acronyms
- EIB
- Element Interconnect Bus [22]
- LS
- Local Storage (SPE's local memory) [23]
- MIC
- Memory Interface Controller [24]
- PPE
- Power Processor Element [25]
- SMF
- Synergistic Memory Flow Controller
- SPE
- Synergistic Processing Element [26]
- SPU
- Streaming Processor Unit [27]
- STI
- Sony Computer Entertainment Inc., Toshiba Corp., IBM
References
- ^ "Introduction to the Cell multiprocessor". IBM Journal of Research and Development. September 7 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "CELL Processor Gets Ready To Entertain The Masses". Electronic Design. February 8 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "Arnd Bergmann on Cell". IBM developerWorks. June 25 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "Spufs: The Cell Synergistic Processing Unit as a virtual file system". IBM developerWorks. June 25 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "Cell-CPU auf dem LinuxTag (at the LinuxTag)". pro-linux. June 25 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "Winner: Multimedia Monster". IEEE Spectrum. 1 January 2006.
{{cite news}}
: Check date values in:|date=
(help) - ^ "Open sourcing of Cell coming to fruition". IT Manager's Journal. June 10 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "Unleashing the power: A programming example of large FFTs on Cell (broadcast replay)". power.org. June 9 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "IBM Discloses Cell Based Blade Server Board Prototype". Tech-On!. May 25 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "IBM will unlock door to Cell". EETimes.com. May 23 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "Toshiba Demonstrates Cell Microprocessor Simultaneously Decoding 48 MPEG-2 Streams". Tech-On!. April 25 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "CELL: A New Platform for Digital Entertainment". Sony Computer Entertainment Inc. March 9 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "CELL Microprocessor Revisited". Real World Technologies. 28 February 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "Power Efficient Processor Design and the Cell Processor" (PDF). IBM. 16 February 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "Prospects For the CELL Microprocessor Beyond Games". Slashdot. 11 February 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "ISSCC 2005: The CELL Microprocessor". Real World Technologies. 10 February 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "A 4.8 GHz Fully Pipelined Embedded SRAM in the Streaming Processor of a CELL Processor" (PDF). Sony Computer Entertainment Inc., Toshiba Corp., IBM. 9 February 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "The Design and Implementation of a First-Generation CELL Processor" (PDF). Sony Computer Entertainment Inc., Toshiba Corp., IBM. 8 February 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "IBM, Sony, Toshiba unveil nine-core Cell processor". Macworld. 7 February 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "Cell Microprocessor Briefing". IBM, Sony Computer Entertainment Inc., Toshiba Corp. 7 February 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "The Cell Processor Programming Model". LinuxTag 2005. Retrieved 11 June.
{{cite web}}
: Check date values in:|accessdate=
(help); Unknown parameter|accessyear=
ignored (|access-date=
suggested) (help) - ^ "IBM Research - Cell". IBM. Retrieved 11 june.
{{cite web}}
: Check date values in:|accessdate=
(help); Unknown parameter|accessyear=
ignored (|access-date=
suggested) (help) - ^ "Understanding the Cell Microprocessor". Anand Lal Shimpi. Retrieved 17 March.
{{cite web}}
: Check date values in:|accessdate=
(help); Unknown parameter|accessyear=
ignored (|access-date=
suggested) (help) - ^ "Cell DMA Engines". IBM developerWorks. Dec 06 2005.
{{cite news}}
: Check date values in:|date=
(help) - ^ "The Microarchitecture of the Synergistic Processor for a Cell Processor" (PDF). IEEE Journal of Solid-State Circuits, Vol.41, No.1. Jan 1 2006. Retrieved 4 April.
{{cite web}}
: Check date values in:|accessdate=
and|date=
(help); Unknown parameter|accessyear=
ignored (|access-date=
suggested) (help) - ^ "IBM Showcases Cell Processors in Action at CeBIT". CDRInf. Mar 13 2006. Retrieved 4 April.
{{cite web}}
: Check date values in:|accessdate=
and|date=
(help); Unknown parameter|accessyear=
ignored (|access-date=
suggested) (help) - ^ "IBM, Sony, Toshiba extend chip development work". IDG News Service. January 12 2006. Retrieved 4 April.
{{cite web}}
: Check date values in:|accessdate=
and|date=
(help); Unknown parameter|accessyear=
ignored (|access-date=
suggested) (help) - ^ "Cell Broadband Engine Architecture and its first implementation". IBM developerWorks. Nov 29 2005. Retrieved 6 April.
{{cite web}}
: Check date values in:|accessdate=
and|date=
(help); Unknown parameter|accessyear=
ignored (|access-date=
suggested) (help) - ^ "PowerPC Microprocessor Family Vector/SIMD Multimedia Extension Technology Programming Environments Manual". IBM. Sep 30 2005. Retrieved 8 April.
{{cite web}}
: Check date values in:|accessdate=
and|date=
(help); Unknown parameter|accessyear=
ignored (|access-date=
suggested) (help)
External links
- The Cell BE Processor Security Architecture
- Upgrade your Cell BE SDK Components
- Sony Computer Entertainment International's CELL resource page
- IBM Research Labs
- Power.org Community
- Site offering news and info on the Cell processor
- Patent #6,526,491 (related to the Cell processor)
- Cell Broadband Engine resource center
- 60 page Cell intro presentation from Sony Computer Entertainment US Research and Development
- Hardware and Software Architectures for the CELL BROADBAND ENGINE processor
- Barcelona Supercomputing Center - Linux/Cell SDK + Source
News
- Sony, IBM, and Toshiba announce Cell development (2001-03-12)
- Sony/Toshiba Press Release on Cell Production (2004-11-29)
- Sony PR on one-rack 16 TFLOP workstation (2004-11-29)
- IBM/Sony/Toshiba PR on key details of the Cell Chip (2005-02-07)
- IBM/Sony/Toshiba PR on the release of the Cell SDK (2005-11-09)
Articles
- Winner: Multimedia Monster
- Holy Chip!
- Military to Begin Using Cell Processor Technology
- "PlayStation 3 chip has split personality" — By David Becker, CNET News.com, 7 February 2005
- "It's the Software, Stupid!" — Robert X. Cringely piece about why software is key to the Cell success.
- "Because It's an Once in a Lifetime Challenge" — Ken Kutaragi
- "Introducing the IBM/Sony/Toshiba Cell Processor" — Jon "Hannibal" Stokes
- EE Times article on ISSCC paper presentation
- Link to image of ISSCC presentation abstract for 90nm process
- Cell Architecture Explained
- "The Soul of Cell" Interview with Dr. H. Peter Hofstee, Cell Chief Scientist and Cell Synergistic Processor Chief Architect, with the IBM Systems and Technology Group