Modern Heterogeneous System Examples
By askumar, jkonstan, tnebel, and vvallabh
Due on 2013-04-23 00:00:00

Overview

Conventional parallelism looks at distributing computation across multiple computational units. These computational units are generally regarded as being equivalent in their computing capabilities. The main idea of Heterogeneous Computing is having a system employ computational units of differing capabilities.

To get an intuition for how this may be beneficial, consider these two processors:

Comparing Two Processors

In theoretical total rate of work, processor B comes out ahead. However, it is easy to imagine a workload that does not allow for that ideal distribution of parallel computation. There could be bottlenecks of sequential work that must be done. At that point processor B will be stuck with one of its weaker cores (relative to processor A) doing the sequential work while all the others wait. Processor A could do that sequential work faster thanks to its more powerful individual cores.
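The trade-off can be sketched with a small timing model. The core counts and speeds below are hypothetical and are not the figures from the comparison above; the function name `run_time` is also made up for illustration. The model assumes the sequential part of the workload runs on a single core and the parallel part splits perfectly across all cores.

```python
def run_time(total_work, sequential_fraction, cores, core_speed):
    """Time to finish a workload on `cores` identical cores.

    The sequential part runs on a single core; the parallel part is
    assumed to split perfectly across all cores.
    """
    sequential_work = total_work * sequential_fraction
    parallel_work = total_work - sequential_work
    return sequential_work / core_speed + parallel_work / (cores * core_speed)


# Hypothetical processors (illustrative numbers only):
#   A: 4 cores, each doing 4 units of work per second (16 in total)
#   B: 16 cores, each doing 2 units of work per second (32 in total)
PROCESSOR_A = {"cores": 4, "core_speed": 4.0}
PROCESSOR_B = {"cores": 16, "core_speed": 2.0}

for fraction in (0.0, 0.1, 0.5):
    time_a = run_time(100.0, fraction, **PROCESSOR_A)
    time_b = run_time(100.0, fraction, **PROCESSOR_B)
    print(f"sequential={fraction:.0%}  A={time_a:.2f}s  B={time_b:.2f}s")
```

Under these assumed numbers, processor B finishes first when the workload is fully parallel, but processor A finishes first once half of the work is sequential.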

The benefits of these types of systems are workload specific. Computational units commonly combined within a single system include general-purpose processors, graphics processing units, and field-programmable gate arrays.

Case Study 1: Xbox 360

The GPU is huge

The logic board is known as the "Falcon" board. It is designed to allow for better cooling of the GPU.

IBM PowerPC CPU

This triple-core CPU is designed for high-performance processing. It forgoes more complex instruction execution in favor of in-order execution, which reduces power consumption. It is a RISC (Reduced Instruction Set Computing) design, executing many simple instructions at high speed. Each core has multiple FPUs (floating point units) and SIMD vector units. The SIMD units have customized "dot product" instructions that have lower latency than standard instructions. Each core is multithreaded and clocked at 3.2 GHz. The system has 512 MB of RAM, and the CPU has a 1 MB shared L2 cache. Its theoretical peak performance is 115.2 gigaFLOPS, and it can perform 9.6 billion dot products per second.
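The two peak figures above are related by simple arithmetic. Here is a minimal sketch, assuming each core can complete one dot product per clock cycle. That assumption is consistent with the quoted 9.6 billion dot products per second, but it is not stated in the text. The variable names are illustrative.

```python
CORES = 3
CLOCK_HZ = 3.2e9             # 3.2 GHz per core
PEAK_FLOPS = 115.2e9         # quoted theoretical peak
DOT_PRODUCTS_PER_CYCLE = 1   # assumption, not stated in the text

# Peak dot-product rate across all cores.
dot_products_per_second = CORES * CLOCK_HZ * DOT_PRODUCTS_PER_CYCLE

# Floating point operations each core must retire per cycle to reach the peak.
flops_per_cycle_per_core = PEAK_FLOPS / (CORES * CLOCK_HZ)

print(f"{dot_products_per_second / 1e9:.1f} billion dot products per second")
print(f"{flops_per_cycle_per_core:.0f} floating point operations per cycle per core")
```

In other words, the quoted peak corresponds to 12 floating point operations per cycle on each of the three cores.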

ATI Xenos GPU

The GPU is the largest element of the logic board. This makes sense since the Xbox is a gaming system, and therefore is constantly performing graphics computations. The GPU has 10 MB of video RAM and a clock speed of 500 MHz. It contains two silicon dies, with one die handling specific computations such as anti-aliasing, z-buffering, and alpha blending.

Data streaming

A custom instruction, extended data cache block touch (xDCBT), prefetches data directly into the L1 cache, bypassing the shared L2 cache in order to avoid thrashing. Conversely, writes skip the L1 cache and go directly to the L2. This allows the GPU to read data produced by the CPU without going to memory. This pattern of data streaming is called Xbox Procedural Synthesis (XPS).

Case Study 2: Tegra 4i

The Tegra 4i is a System on Chip (SoC) targeted towards smartphones, so low power consumption was a very important design factor. The chip must also handle many demanding activities, such as videography, photography, watching videos in 1080p, playing intense games (at a high frame rate), and surfing the internet. In addition, NVIDIA needs to engineer the Tegra 4i for unknown advances in smartphone usage, which may require more parallelism or computational horsepower. The last factor that motivated the engineering of the Tegra 4i was the need to keep heat dissipation low.

Check out the Tegra 4i

The Tegra 4i contains four R4 ARM cores (2.3 GHz) plus one battery saver core, a 60-core GPU, an integrated i500 core (for LTE), and a few specialized cores for computational photography, audio processing, image processing, and video.

Check out the Tegra 3

If you look at the differences between the Tegra 3 and the Tegra 4i, you will notice that NVIDIA has reduced the amount of resources devoted to specialized tasks (such as image processing). This may be because NVIDIA is trying to make its hardware more generic, so that it handles a broader range of applications well.

Case Study 3: iPhone 3GS

iPhone 3GS logic board

The iPhone 3GS is a specialized device designed to function as a mobile phone, a GPS, and a mobile computer all in one; this functionality is implemented using hardware specialization. The logic board includes: a System-on-a-Chip made by Samsung, composed of an ARM processor core, a dedicated GPU, the touch-controller, and many other smaller pieces of essential functionality (see the S5PC100 Block Diagram below); a baseband processor; 16 GB of flash memory; a 256 MB pseudo SRAM chip; a transceiver and multiple power amplifiers; a GPS transceiver; an accelerometer; power-management integrated circuits (PMICs); and a WiFi/Bluetooth transceiver.

Samsung S5PC100 System-on-a-Chip (SoC)

Samsung S5PC100 Block diagram

The S5PC100 is an SoC based on the Cortex A8 applications processor and targeted at low-power portable designs. In addition to the applications processor, the system includes a PowerVR SGX GPU, a dedicated touch-controller, and interfaces to memory located elsewhere on the logic board.

ARM Cortex A8 applications processor

The Cortex A8 runs at 600MHz and includes separate 32KB instruction and 32KB data caches (L1), and a 256KB L2 cache. The A8 is a two-issue in-order core, capable of fetching, decoding and executing two RISC instructions in parallel. The A8 is also capable of SIMD execution powered by the NEON data engine.

PowerVR SGX 535 GPU

The PowerVR SGX is a fully programmable core that supports OpenGL ES 2.0. Running at 200 MHz, the GPU is reportedly capable of rendering 7 million triangles/second and 250 million pixels/second.
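To put the reported GPU rates in per-clock terms, here is a rough back-of-the-envelope calculation from the figures above. It only restates the quoted peak numbers relative to the clock speed and says nothing about sustained performance; the variable names are illustrative.

```python
CLOCK_HZ = 200e6               # 200 MHz GPU clock
TRIANGLES_PER_SECOND = 7e6     # reported triangle rate
PIXELS_PER_SECOND = 250e6      # reported pixel rate

# Average pixels produced per clock cycle at the reported rate.
pixels_per_clock = PIXELS_PER_SECOND / CLOCK_HZ

# Average clock cycles available per triangle at the reported rate.
clocks_per_triangle = CLOCK_HZ / TRIANGLES_PER_SECOND

print(f"{pixels_per_clock:.2f} pixels per clock")
print(f"{clocks_per_triangle:.1f} clocks per triangle")
```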

Baseband Processor

The baseband processor manages all the radio functions except for WiFi and Bluetooth, which are managed by a separate chip. According to Wikipedia, there are three primary reasons for separating the baseband processor from the main applications processor:
- Radio performance: radio functions require a real-time OS.
- Legal: separating the baseband from the applications processor means that the applications processor is spared the need for certifications required for any device which communicates with the cellular network.
- Reliability: the applications processor can be updated separately when software fixes are released.

Sources

Xbox 360 hardware

Xbox 360 teardown

Tegra 3 Overview

Tegra 4i Overview

iPhone 3GS teardown

iPhone 3GS analysis

iPhone 3GS baseband processor