

# In-Data Center Performance Analysis of a Tensor Processing Unit<sup>TM \*</sup>

David Patterson and the Google TPU Team

[davidpatterson@google.com](mailto:davidpatterson@google.com)

April 5, 2017

\*On 4/5/17 Google published a blog post on the TPU. A 17-page technical paper with the same title will be on arXiv.org. (The paper will also appear at the *International Symposium on Computer Architecture* on June 26, 2017.)

# A Golden Age in Microprocessor Design

- Stunning progress in microprocessor design 40 years  $\approx 10^6$ x faster!
- Three architectural innovations ( $\sim 1000$ x)
  - Width: 8->16->32 ->64 bit (~8x)
  - Instruction level parallelism:
    - 4-10 *clock cycles per instruction* to 4+ *instructions per clock cycle* (~10-20x)
  - Multicore: 1 processor to 16 cores (~16x)
- Clock rate: 3 to 4000 MHz (~1000x thru technology & architecture)
- Made possible by IC technology:
  - **Moore's Law:** growth in transistor count (2X every 1.5 years)
  - **Dennard Scaling:** power/transistor shrinks at same rate as transistors are added (constant per mm<sup>2</sup> of silicon)
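As a quick sanity check of the factors listed above, the rough arithmetic can be sketched as follows. The individual factors are the slide's own approximations; taking the low end (10x) of the instruction-level-parallelism range is an assumption made here for illustration.

```python
# Approximate speedup factors quoted on the slide (all are rough estimates).
width = 8            # 8-bit -> 64-bit datapath
ilp = 10             # assumed: low end of the ~10-20x from instruction-level parallelism
multicore = 16       # 1 processor -> 16 cores
clock = 4000 / 3     # 3 MHz -> 4000 MHz

architecture = width * ilp * multicore   # 1280, i.e. on the order of 1000x
total = architecture * clock             # on the order of 10^6

print(f"architecture: ~{architecture}x, clock: ~{clock:.0f}x, total: ~{total:.1e}x")
```

The product lands within a small factor of the quoted $10^6$, which is all that numbers this coarse can be expected to show.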

# Changes Converge

- Technology
  - End of Dennard scaling: power becomes the key constraint
  - Slowdown (retirement) of Moore's Law: transistors cost
- Architectural
  - Limitations and inefficiencies in exploiting instruction level parallelism end the uniprocessor era in 2004
  - Amdahl's Law and its implications end “easy” multicore era
- Products
  - PC/Server ⇒ Client/Cloud

# End of Growth of Performance?

40 years of Processor Performance



# What's Left?

Since

- Transistors not getting much better
- Power budget not getting much higher
- Already switched from 1 inefficient processor/chip to N efficient processors/chip

Only path left is *Domain Specific Architectures*

- Just do a few tasks, but extremely well

# What is Deep Learning?

- Loosely based on (what little) we know about the brain



# The Artificial Neuron

$$y = F \left( \sum_i w_i x_i \right)$$



$F$: a nonlinear differentiable function
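A minimal sketch of the artificial neuron $y = F(\sum_i w_i x_i)$ follows. The choice of a sigmoid for $F$ and the example weights and inputs are assumptions for illustration only; any nonlinear differentiable function could be substituted.

```python
import math

def sigmoid(z: float) -> float:
    """One possible nonlinear, differentiable F."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, inputs, activation=sigmoid):
    """Compute y = F(sum_i w_i * x_i) for a single neuron."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)

# Illustrative values: the weighted sum is 0.5 - 2.0 + 0.5 = -1.0
y = neuron([0.5, -1.0, 2.0], [1.0, 2.0, 0.25])
print(y)
```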

Is this a **CAT or DOG?** (Figure: network with output labels CAT and DOG.)



# Key NN Concepts for Architects

- *Training* or learning (development)  
vs. *Inference* or prediction (production)
- *Batch size*
  - Problem: DNNs have millions of weights that take a long time to load from memory (DRAM)
  - Solution: Large batch  $\Rightarrow$  Amortize weight-fetch time by inferring (or training) many input examples at a time
- Floating-Point vs. Integer (“*Quantization*”)
  - Training in Floating Point on GPUs popularized DNNs
  - Inferring in Integers faster, lower energy, smaller
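To make the quantization idea concrete, here is a small sketch of symmetric linear quantization of floating-point weights to 8-bit integers. This is one common scheme, shown as an assumption; the slides do not say which quantization method the TPU software stack uses, and the function names are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using a single scale factor (symmetric scheme)."""
    qmax = 2 ** (num_bits - 1) - 1                  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]                  # illustrative values
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q, scale)
```

Each restored value differs from the original by at most half a quantization step, which is the precision given up in exchange for smaller, faster, lower-energy integer arithmetic.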

- 2013: Prepare for success-disaster of new DNN apps
  - Scenario with users speaking to phones 3 minutes per day: if only CPUs, would need 2X-3X the whole fleet
  - Unlike some hardware targets, DNNs applicable to a wide range of problems, so can reuse for solutions in speech, vision, language, translation, search ranking, ...
- Custom hardware to reduce the TCO of DNN inference phase by 10X vs. GPUs
  - Must run existing apps developed for CPUs and GPUs
- A very short development cycle
  - Started project 2014, running in datacenter 15 months later:  
Architecture invention, compiler invention, hardware design, build, test, deploy
- Google CEO Sundar Pichai reveals Tensor Processing Unit at Google I/O on May 18, 2016 as “10X performance/Watt”

[cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html](http://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html)

- TPU Card to replace a disk
  - Up to 4 cards / server

# TPU Card & Package



## 3 Types of NNs

### 1. Multilayer Perceptrons

- Each new layer applies nonlinear function  $F$  to weighted sum of all outputs from prior layer (“fully connected”)  $x_n = F(Wx_{n-1})$
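The recurrence $x_n = F(Wx_{n-1})$ can be sketched in a few lines. ReLU is assumed for $F$ here, and the two small weight matrices are made up for illustration.

```python
def relu(vector):
    """Elementwise ReLU: negative values become zero."""
    return [max(0.0, v) for v in vector]

def matvec(W, x):
    """Weighted sum of all inputs for every output (a fully connected layer)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp_forward(layers, x, F=relu):
    """Apply x_n = F(W x_{n-1}) once per layer."""
    for W in layers:
        x = F(matvec(W, x))
    return x

layers = [
    [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]],   # 3x2 weight matrix
    [[1.0, 2.0, 3.0]],                        # 1x3 weight matrix
]
out = mlp_forward(layers, [2.0, 1.0])
print(out)
```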

### 2. Convolutional Neural Network

- Like MLPs, but same weights used on nearby subsets of outputs from prior layer

### 3. Recurrent NN/“Long Short-Term Memory”

- Each new layer a NL function of weighted sums of past *state* and prior outputs; same weights used across time steps

# Inference Datacenter Workload (95%)

| Name  | LOC  | FC layers | Conv layers | Vector layers | Pool layers | Total layers | Nonlinear function | Weights | TPU Ops / Weight Byte | TPU Batch Size | % Deployed |
|-------|------|-----------|-------------|---------------|-------------|--------------|--------------------|---------|-----------------------|----------------|------------|
| MLP0  | 0.1k | 5      |      |        |      | 5     | ReLU               | 20M     | 200                   | 200            | 61%        |
| MLP1  | 1k   | 4      |      |        |      | 4     | ReLU               | 5M      | 168                   | 168            |            |
| LSTM0 | 1k   | 24     |      | 34     |      | 58    | sigmoid,<br>tanh   | 52M     | 64                    | 64             | 29%        |
| LSTM1 | 1.5k | 37     |      | 19     |      | 56    | sigmoid,<br>tanh   | 34M     | 96                    | 96             |            |
| CNN0  | 1k   |        | 16   |        |      | 16    | ReLU               | 8M      | 2888                  | 8              | 5%         |
| CNN1  | 1k   | 4      | 72   |        | 13   | 89    | ReLU               | 100M    | 1750                  | 32             |            |

## TPU Architecture and Implementation

- Add as accelerators to existing servers
  - So connect over I/O bus (“PCIe”)
  - TPU ≈ matrix accelerator on I/O bus
- Host server sends it instructions like a Floating Point Unit
  - Unlike GPU that fetches and executes own instructions

- The Matrix Unit: 65,536 (256x256)  
8-bit multiply-accumulate units
- 700 MHz clock rate
- Peak: 92T operations/second
  - $65{,}536 \times 2 \times 700\text{M}$
- >25X as many MACs vs GPU
- >100X as many MACs vs CPU
- 4 MiB of on-chip Accumulator memory
- 24 MiB of on-chip Unified Buffer (activation memory)
- 3.5X as much on-chip memory vs GPU
- Two 2133MHz DDR3 DRAM channels
- 8 GiB of off-chip weight DRAM memory
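The peak figure in the list above follows directly from the matrix unit size and the clock rate. A short check, counting each multiply-accumulate as two operations as the slide's formula does:

```python
macs = 256 * 256          # 65,536 multiply-accumulate units
ops_per_mac = 2           # one multiply plus one add per cycle
clock_hz = 700e6          # 700 MHz

peak_ops_per_s = macs * ops_per_mac * clock_hz
print(f"{peak_ops_per_s / 1e12:.2f} T operations/second")
```

The result is just under 92T, matching the rounded peak quoted on the slide.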



# TPU: a Neural Network Accelerator Chip



- 5 main (CISC) instructions
  - Read\_Host\_Memory
  - Write\_Host\_Memory
  - Read\_Weights
  - MatrixMultiply/Convolve
  - Activate (ReLU, Sigmoid, Maxpool, LRN, ...)
- Average Clock cycles per instruction: >10
- 4-stage overlapped execution, 1 instruction type / stage
  - Execute other instructions while matrix multiplier busy
- Complexity in SW: No branches, in-order issue,  
SW controlled buffers, SW controlled pipeline synchronization

## TPU Architecture, programmer's view

- Problem: energy/time for repeated SRAM accesses of matrix multiply
- Solution: “Systolic Execution” to compute data on the fly in buffers by pipelining control and data
  - Relies on data from different directions arriving at cells in an array at regular intervals and being combined
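The following is a toy cycle-by-cycle simulation of a systolic array, offered as a sketch rather than a description of the actual TPU hardware. It assumes a weight-stationary arrangement: each cell holds one weight, activations move one cell to the right per cycle, and partial sums move one cell down per cycle. The function name and the skewed input schedule are illustrative choices.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a simulated K x J grid of multiply-accumulate cells."""
    n, K, J = len(X), len(W), len(W[0])
    a_reg = [[0] * J for _ in range(K)]   # activation latched in each cell
    p_reg = [[0] * J for _ in range(K)]   # partial sum latched in each cell
    Y = [[0] * J for _ in range(n)]
    for t in range(n + K + J - 2):        # cycles until the last result drains
        new_a = [[0] * J for _ in range(K)]
        new_p = [[0] * J for _ in range(K)]
        for k in range(K):
            for j in range(J):
                if j == 0:
                    i = t - k             # row k's input is delayed by k cycles (skew)
                    a_in = X[i][k] if 0 <= i < n else 0
                else:
                    a_in = a_reg[k][j - 1]            # arrives from the left neighbor
                p_in = p_reg[k - 1][j] if k > 0 else 0  # arrives from the cell above
                new_a[k][j] = a_in
                new_p[k][j] = p_in + a_in * W[k][j]
        a_reg, p_reg = new_a, new_p
        for j in range(J):                # finished sums leave the bottom row
            i = t - (K - 1) - j
            if 0 <= i < n:
                Y[i][j] = p_reg[K - 1][j]
    return Y

X = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0, 2], [0, 1, 3]]
print(systolic_matmul(X, W))
```

No cell reads a shared memory inside the loop: every operand arrives from a neighbor on a fixed schedule, which is the property that saves repeated SRAM accesses.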

## Systolic Execution in Matrix Array

# Systolic Execution: Control and Data are pipelined



# Can now ignore pipelining in matrix

Pretend each 256B input read at once, & they instantly update 1 location of each of 256 accumulator RAMs.
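This simplified view can be sketched as a single step function: one input vector is applied to every column of the weight matrix, and each column's result is added into one location of that column's accumulator RAM. The sketch uses a 4-wide unit instead of 256 for readability; the accumulator depth of 8 and all names are assumptions.

```python
def matrix_unit_step(weights, accumulators, inputs, location):
    """Apply one input vector to every column and add into one accumulator location."""
    n = len(inputs)
    for col, ram in enumerate(accumulators):
        ram[location] += sum(inputs[k] * weights[k][col] for k in range(n))

N = 4                                              # 256 in the slide's matrix unit
weights = [[(r + c) % 3 for c in range(N)] for r in range(N)]
accumulators = [[0] * 8 for _ in range(N)]         # one RAM per column (illustrative depth)

matrix_unit_step(weights, accumulators, [1, 2, 3, 4], location=0)
matrix_unit_step(weights, accumulators, [1, 0, 0, 0], location=1)
print([ram[0] for ram in accumulators], [ram[1] for ram in accumulators])
```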



# Relative Performance: 3 Contemporary Chips

| Processor                     | Die size (mm²) | Clock (MHz) | TDP (Watts) | Idle (Watts) | Memory (GB/sec) | Peak TOPS/chip (8b int.) | Peak TOPS/chip (32b FP) |
|-------------------------------|----------------|-------------|-------------|--------------|-----------------|--------------------------|-------------------------|
| CPU: Haswell<br>(18 core)     | 662    | 2300         | 145          | 41            | 51               | 2.6            | 1.3    |
| GPU: Nvidia<br>K80 (2 / card) | 561    | 560          | 150          | 25            | 160              | --             | 2.8    |
| TPU                           | <331*  | 700          | 75           | 28            | 34               | 91.8           | --     |

\*TPU is less than half die size of the Intel Haswell processor

K80 and TPU in 28 nm process; Haswell fabbed in Intel 22 nm process

These chips and platforms chosen for comparison because widely deployed in Google data centers

# GPUs and TPUs added to CPU server

## Relative Performance: 3 Platforms

| <i>Processor</i>                                                 | <i>Chips/<br/>Server</i> | <i>DRAM</i>                    | <i>TDP<br/>Watts</i> | <i>Idle<br/>Watts</i> | Observed<br>Busy Watts<br>in datacenter |
|------------------------------------------------------------------|--------------------------|--------------------------------|----------------------|-----------------------|-----------------------------------------|
| CPU: Haswell (18 cores)                                          | 2                        | 256 GB                         | 504                  | 159                   | 455                                     |
| NVIDIA K80 (13 cores)<br>(2 die per card;<br>4 cards per server) | 8                        | 256 GB<br>(host) +<br>12GB x 8 | 1838                 | 357                   | 991                                     |
| TPU (1 core)<br>(1 die per card;<br>4 cards per server)          | 4                        | 256GB<br>(host) +<br>8GB x 4   | 861                  | 290                   | 384                                     |


2 Limits to performance:

1. Peak Computation
2. Peak Memory Bandwidth  
(For apps with large data that don't fit in cache)

Arithmetic Intensity (FLOP/byte or reuse) determines which limit

Weight-reuse = Arithmetic Intensity for DNN roofline

## Roofline Visual Performance Model

$$\text{GFLOP/s} = \text{Min}(\text{Peak GFLOP/s}, \text{Peak GB/s} \times \text{AI})$$
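The roofline formula translates directly into a one-line function. The machine parameters below (100 GFLOP/s peak, 10 GB/s memory bandwidth) are made-up round numbers chosen only to show the two regimes; they do not describe any chip in this talk.

```python
def roofline(peak_gflops, peak_gb_per_s, arithmetic_intensity):
    """Attainable GFLOP/s = min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, peak_gb_per_s * arithmetic_intensity)

PEAK_GFLOPS = 100.0      # hypothetical machine
PEAK_GB_PER_S = 10.0     # hypothetical machine
ridge_point = PEAK_GFLOPS / PEAK_GB_PER_S   # intensity where the two limits meet

for ai in (2, 10, 50):
    print(ai, roofline(PEAK_GFLOPS, PEAK_GB_PER_S, ai))
```

Below the ridge point the application is memory-bound (the slanted part of the roof); above it, compute-bound (the flat part).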



# TPU Die Roofline



# Haswell (CPU) Die Roofline



# K80 (GPU) Die Roofline



# Why so far below Rooflines? (MLP0)

| <i>Type</i> | <i>Batch</i> | <u><i>99th% Response</i></u> | <i>Inf/s (IPS)</i> | <i>% Max IPS</i> |
|-------------|--------------|------------------------------|--------------------|------------------|
| CPU         | 16           | 7.2 ms                       | 5,482              | 42%              |
| CPU         | 64           | 21.3 ms                      | 13,194             | 100%             |
| GPU         | 16           | 6.7 ms                       | 13,461             | 37%              |
| GPU         | 64           | 8.3 ms                       | 36,465             | 100%             |
| TPU         | 200          | 7.0 ms                       | 225,000            | 80%              |
| TPU         | 250          | 10.0 ms                      | 280,000            | 100%             |
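The last column of the table appears to be each row's throughput relative to the highest throughput listed for the same chip. Under that reading (an assumption, since the slide does not define the column), the percentages can be reproduced from the IPS figures:

```python
rows = [
    ("CPU", 16, 5482), ("CPU", 64, 13194),
    ("GPU", 16, 13461), ("GPU", 64, 36465),
    ("TPU", 200, 225000), ("TPU", 250, 280000),
]

# Best throughput seen for each chip type
max_ips = {}
for chip, _, ips in rows:
    max_ips[chip] = max(max_ips.get(chip, 0), ips)

# Percentage of that best throughput achieved at each batch size
pct_of_max = {(chip, batch): round(100 * ips / max_ips[chip]) for chip, batch, ips in rows}
print(pct_of_max)
```

The smaller batch sizes, which keep the 99th-percentile response time low, give up a large share of peak throughput on the CPU and GPU, while the TPU gives up comparatively little.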

# Log Rooflines for CPU, GPU, TPU



# Linear Rooflines for CPU, GPU, TPU



# TPU & GPU Relative Performance to CPU

| Type  | MLP  |      | LSTM |     | CNN  |      | Weighted Mean |
|-------|------|------|------|-----|------|------|---------------|
|       | 0    | 1    | 0    | 1   | 0    | 1    |               |
| GPU   | 2.5  | 0.3  | 0.4  | 1.2 | 1.6  | 2.7  | 1.9           |
| TPU   | 41.0 | 18.5 | 3.5  | 1.2 | 40.3 | 71.0 | 29.2          |
| Ratio | 16.7 | 60.0 | 8.0  | 1.0 | 25.4 | 26.3 | 15.3          |

# Perf/Watt TPU vs CPU & GPU

Performance/Watt vs. CPU or GPU



- Current DRAM
  - 2 DDR3 2133  $\Rightarrow$  34 GB/s
- Replace with GDDR5 like in K80  $\Rightarrow$  180 GB/s
  - Move Ridge Point from 1400 to 256
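The ridge-point shift can be approximated from numbers given elsewhere in the deck. The sketch assumes the ridge point is counted in multiply-accumulates per byte of weights fetched, an interpretation chosen because it roughly reproduces the slide's figures; the slide itself does not state the unit.

```python
macs = 256 * 256                 # matrix unit size
clock_hz = 700e6                 # 700 MHz
peak_macs_per_s = macs * clock_hz

ddr3_bytes_per_s = 34e9          # two DDR3-2133 channels
gddr5_bytes_per_s = 180e9        # GDDR5, as in the K80

ridge_ddr3 = peak_macs_per_s / ddr3_bytes_per_s
ridge_gddr5 = peak_macs_per_s / gddr5_bytes_per_s
print(round(ridge_ddr3), round(ridge_gddr5))
```

The computed values come out close to the rounded 1400 and 256 on the slide: higher memory bandwidth lets applications with less weight reuse reach peak compute.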

Improving TPU: Move “Ridge Point” to the left

# Revised TPU Raises Roofline

Improves performance 4X for  
LSTM1, LSTM0, MLP1, MLP0



# Perf/Watt Original & Revised TPU



## Related Work

Two survey articles document that custom NN ASICs go back at least 25 years [Ien96]/[Asa02]. For example, CNAPS chips contained a 64 SIMD array of 16-bit by 8-bit multipliers, and several CNAPS chips could be connected together with a sequencer [Ham90]. The Synapse-1 system was based on a custom systolic multiply-accumulate chip called the MA-16, which performed sixteen 16-bit multiplications at a time [Ram91]. The system concatenated several MA-16 chips together and had custom hardware to do activation functions.

Twenty-five SPERT-II workstations, accelerated by the T0 custom ASIC, were deployed starting in 1995 to do both NN training and inference for speech recognition [Asa98]. The 40-MHz T0 added vector instructions to the MIPS instruction set architecture. The eight-lane vector unit could produce up to sixteen 32-bit arithmetic results per clock cycle based on 8-bit and 16-bit inputs, making it 25 times faster at inference and 20 times faster at training than a SPARC-20 workstation. They found that 16 bits were insufficient for training, so they used two 16-bit words instead, which doubled training time. To overcome that drawback, they introduced “bunches” (batches) of 32 to 1000 data sets to reduce time spent updating weights, which made it faster than training with one word but no batches.

The more recent DianNao family of NN architectures minimizes memory accesses both on the chip and to external DRAM by having efficient architectural support for the memory access patterns that appear in NN applications [Keu16]/[Che16a]. All use 16-bit integer operations, and all designs went down to layout, but no chips were fabricated. The original DianNao uses an array of 64 16-bit integer multiply-accumulate units with 44 KB of on-chip memory and is estimated to be 3 mm<sup>2</sup> (65 nm), to run at 1 GHz, and to consume 0.5W [Che14a]. Most of this energy went to DRAM accesses for weights, so one successor, DaDianNao (“big computer”), includes eDRAM to keep 36 MB of weights on chip [Che14b]. The goal was to have enough memory in a multichip system to avoid external DRAM accesses. The follow-on PuDianNao (“general computer”) is aimed at more traditional machine learning algorithms beyond DNNs, such as support vector machines [Liu15]. Another offshoot is ShiDianNao (“vision computer”) aimed at CNNs, which avoids DRAM accesses by connecting the accelerator directly to the sensor [Du15].

The Convolution Engine is also focused on CNNs for image processing [Qad13]. This design deploys 64 10-bit multiply-accumulate units and customizes a Tensilica processor estimated to run at 800 MHz in 45 nm. It is projected to be 8X to 15X more energy-efficient than an SIMD processor and within 2X to 3X of custom hardware designed just for a specific kernel.

The Fathom benchmark paper seemingly reports results contradictory to ours, with the GPU running inference much faster than the CPU [Ado16]. However, their CPU and GPU are not server-class, the CPU has only four cores, the applications do not use the CPU’s AVX instructions, and there is no response-time cutoff (see Table 4) [Bro16].

Catapult is the most widely deployed example of using reconfigurability to support DNNs, which many have proposed [Fad09]/[Chai10]/[Far11]/[Cav13]/[Zhu15]. They chose FPGAs over GPUs to reduce power as well as the risk that latency-sensitive applications wouldn’t map well to GPUs. FPGAs can also be reprogrammed, such as for search, compression, and network interface cards [Put15]. The TPU project actually began with FPGAs, but we abandoned them when we saw that the FPGAs of that time were not competitive in performance compared to the GPUs of that time, and the TPU could be much lower power than GPUs while being as fast or faster, giving it potentially significant benefits over both FPGAs and GPUs.

Although first published in 2014 [Put14], Catapult is a TPU contemporary since it deployed 28-nm Stratix V FPGAs into datacenters concurrently with the TPU in 2015. Catapult has a 200 MHz clock, 3,926 18-bit MACs, 5 MB of on-chip memory, 11 GB/s memory bandwidth, and uses 25 Watts. The TPU has a 700 MHz clock, 65,536 8-bit MACs, 34 GB/s memory bandwidth, and requires only 40 Watts. A revised version of Catapult uses newer FPGAs and was deployed at a large scale in 2016 [Cat16].
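For a rough sense of scale, the clock rates and MAC counts quoted in this paragraph can be turned into peak multiply-accumulate rates. This is only a back-of-the-envelope sketch: the MAC widths of the two designs differ, and peak rates say nothing about delivered performance on real workloads.

```python
catapult = {"clock_hz": 200e6, "macs": 3926, "watts": 25}
tpu = {"clock_hz": 700e6, "macs": 65536, "watts": 40}

def peak_macs_per_s(chip):
    """Upper bound: every MAC unit busy on every clock cycle."""
    return chip["clock_hz"] * chip["macs"]

ratio = peak_macs_per_s(tpu) / peak_macs_per_s(catapult)
ratio_per_watt = ratio * catapult["watts"] / tpu["watts"]
print(f"peak MAC rate: ~{ratio:.0f}x, per Watt: ~{ratio_per_watt:.0f}x")
```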

Catapult V1 runs CNNs, using a systolic matrix multiplier, 2.3X as fast as a 2.1 GHz, 16-core, dual-socket server [Ovt15a]. Using the next generation of FPGAs (14-nm Arria 10) of Catapult V2, performance might go up to 7X, and perhaps even 17X with more careful floorplanning [Ovt15b]. Although it’s apples versus oranges, a current TPU die runs its CNNs 40X to 70X as fast as a somewhat faster server (Tables 2 and 6). Perhaps the biggest difference is that, to get the best performance from Catapult, the user must write long programs in the low-level hardware design language Verilog [Met16]/[Put16], versus writing short programs using the high-level TensorFlow framework for the TPU. That is, reprogrammability comes from software for the TPU rather than from firmware for the FPGA.

Recent research, which appeared after the TPU was deployed, accelerates DNNs by optimizing the cases when weights and data are very small or zero. Our tight schedule precluded such optimizations in the TPU, but we saw the same opportunity in our studies. The Efficient Inference Engine is based on a first pass that reduces the number of weights by about a factor of 10 [Han15] as a separate step by filtering out very small values, and then uses Huffman encoding to shrink the data even further to improve inference performance [Han16]. Cnvlutin [Alb16] avoids multiplications when an activation input is zero (which it is 44% of the time, presumably in part due to the ReLU nonlinear function that transforms negative values to zero) to improve performance by an average of 1.4 times.
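The zero-skipping idea can be illustrated with a dot product that counts how many multiplications it actually performs. This is a software sketch of the general principle only, not a model of how any of the cited designs implement it in hardware; the function name and values are illustrative.

```python
def relu(vector):
    """ReLU turns negative activations into zeros."""
    return [max(0.0, v) for v in vector]

def dot_skip_zeros(weights, activations):
    """Dot product that skips the multiply whenever the activation is zero."""
    total, multiplies = 0.0, 0
    for w, a in zip(weights, activations):
        if a == 0.0:
            continue                 # no work needed: the product would be zero
        total += w * a
        multiplies += 1
    return total, multiplies

activations = relu([1.0, -2.0, 0.5, -0.1])   # half of these become zero
total, multiplies = dot_skip_zeros([2.0, 3.0, 4.0, 5.0], activations)
print(total, multiplies)
```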

Eyeriss is a novel, low-power dataflow architecture that takes advantage of zeros by run-length encoding data to reduce the memory footprint and saves power by avoiding computations when an input is zero [Che16b]. Using Eyeriss terminology, a TPU convolutional layer maps C and M to the rows and columns of the matrix unit, taking H/W cycles to perform one pass. With high C/M, it takes RS passes to process the layer; for low C/M, a number of other passes reduce power and improve utilization. (More can be found in the online references [Ros15a]/[Ros15b]/[Ros15c]/[Ros15d]/[Thol15]/[Yoo15].)
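Run-length encoding of zeros can be sketched as follows. The encoding used here, a list of (number of preceding zeros, nonzero value) pairs, is a simple illustrative format and an assumption; it is not the on-chip format of the architecture described above.

```python
def rle_encode(data):
    """Encode as (zero_run_length, value) pairs; a trailing run of zeros uses value None."""
    pairs, run = [], 0
    for v in data:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, None))
    return pairs

def rle_decode(pairs):
    """Invert rle_encode."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        if v is not None:
            out.append(v)
    return out

data = [0, 0, 5, 0, 3, 0, 0, 0]
encoded = rle_encode(data)
print(encoded)
```

The sparser the data, the fewer pairs are needed, which is where the memory-footprint saving comes from.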

Minerva is a co-design system that crosses algorithm, architecture, and circuit disciplines to reduce power by 8X, in part by pruning activation data with small values and in part by quantizing the data [Rea16]. [Gup15] looks at 16-bit fixed-point arithmetic for training instead of for inference. Others leverage the lower precision of DNN calculations by utilizing analog circuits during the computation to improve energy and performance [Li16]/[Sha16]. By tailoring an instruction set to DNNs, Cambricon reduces code size [Liu16]. Recent work looked at processing-in-memory architectures for NNs [Icn16]/[Kim16].

# Related Work

Comparing the TPU to some of these architectures:

- [Che14a] DMAs data from DRAM to input and weight buffers. They are read by the 3-stage pipelined NFU that performs multiplies, adds, and nonlinear functions; the results go to the output buffer, and then to DRAM. The NFU has no storage and isn’t systolic.
- [Gup15] appears to stream both matrix inputs while storing partial sums in the systolic array; the TPU stores the weight matrix tile while streaming the other input and the pre-activation partial sums. The TPU doesn’t support stochastic rounding.
- [Zha15] is built out of computation units equivalent to a 4x2 version of the TPU matrix unit. In an ASIC, the wiring cost of the crossbars that connect input and output buffers to these compute engines would be significant. We are surprised that we didn’t see architectural support for additional reductions to combine results from compute engines in [Zha15].

All three of [Gup15]/[Che14a]/[Zha15] store activations in DRAM during computation; the TPU’s Unified Buffer is sized so that no DRAM spilling or reloading happens during normal operation.

## REFERENCES

- [Aba16] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., and Ghemawat, S., 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. *arXiv preprint arXiv:1603.04467*.
- [Ado16] Adolf, R., Rama, S., Reagen, B., Wei, G.Y., and Brooks, D., 2016, September. Fathom: reference workloads for modern deep learning methods. *IEEE International Symposium on Workload Characterization (IISWC)*.
- [Alb16] Albericio, J., Judd, P., Hetherington, T., Aamodt, T., Jerger, N.E., and Moshovos, A., 2016. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. *Proceedings of the 43rd International Symposium on Computer Architecture*.
- [Asa02] Asanović, K., 2002. Programmable Neurocomputing. In *The Handbook of Brain Theory and Neural Networks*, Second Edition, M. A. Arbib (Ed.), MIT Press, Cambridge, MA, USA.
- [Bra15] Brattin, W., and Bell, D., 2007. The promise of energy-efficient computing. *IEEE Computer*, vol. 40.
- [Bro16] Brooks, D., 2016, September. The 22nd International Parallel and Distributed Processing Symposium (IPDPS). Amazon EU - Up to 16 GPUs. <https://aws.amazon.com/compute/gpu/ipdps2016-p2-p3-type-for-practical-ipdps-to-be-targeted/>
- [Bro16a] Brooks, D., November, 2016. Private communication.
- [Cat16] Catapult, A.M., Chang, E.S., Petersen, A., Haehnlein, H.A.J.F.M., Humphrey, S.H.M., Daniel, P.K.J.Y.K., Ovtcharov, L.T.M.K., Lirika, K., and van der Hoorn, S., 2016. Catapult: A reconfigurable architecture for ML/DNN inference.
- [Cav13] Cavallaro, L., Gachard, D., Mayer, C., Willi, S., Mahsun, B., and Benini, L., 2013, May. Origen: A convolutional network accelerator. *Proceedings of the 25th annual Great Lakes Symposium on VLSI*.
- [Cha10] Chakradhar, S., Sankaradas, M., Jakkula, V., and Cadambi, S., 2010. A dynamically configurable coprocessor for convolutional neural networks. *Proceedings of the 37th International Symposium on Computer Architecture*.
- [Che14a] Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., and Temam, O., 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. *Proceedings of ASPLOS*.
- [Che14b] Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., and Temam, O., 2014, December. DaDianNao: A machine-learning supercomputer. *Proceedings of the 47th International Symposium on Microarchitecture*.
- [Che16a] Chen, Y., Chen, T., Xu, Z., Sun, N., and Temam, O., 2016. DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning, Research Highlight. *Communications of the ACM*, 59(11).
- [Che16b] Chen, Y.H., Emer, J., and Sze, V., 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. *Proceedings of the 43rd International Symposium on Computer Architecture*.
- [Che16d] Chen, Y., Li, S., Qi, Y., Xu, X., Zhang, T., Wang, Y., and Xie, Y., 2016. PRIME: A Novel Processing-In-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. *Proceedings of the 43rd International Symposium on Computer Architecture*.
- [Chi15] Clark, J., October 26, 2015. Intel’s New Cloud Computing Its Latest Web Search On A Chip. *Bloomberg Technology*, www.bloomberg.com. URL <http://www.bloomberg.com/technology/intels-new-cloud-computing-its-latest-web-search-on-a-chip>
- [Dai13] Dai, J., and Jang, L.A., 2013. The tail is all: Communication of an ACM SIGGRAPH presentation.
- [Dai15] Da, Z., Fasthuber, R., Chen, T., Jiang, P., Li, Y., Luo, T., Feng, X., Chen, Y., and Tensom, 2015, June. DianNao/Nan: shifting vision processing closer to the sensor. *Proceedings of the 2015 International Symposium on Computer Architecture*.
- [Dai16] Dai, J., Poole, B., Chen, Y., and Xie, Y., 2016, August. Cnnp: Using both weights and connections for efficient neural networks. *In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain*. Proceedings of the 2016 International Symposium on Computer Architecture
- [Dei16] Deis, P., Lin, J., Tran, J., and Dally, J., 2016. Efficient inference engine on compressed deep neural network. *arXiv preprint arXiv:1610.05027*.
- [Den11] Denyer, J., and Memmi, A., 2011, June. Neuflow: A runtime reconfigurable dataflow processor for neural networks. *Proceedings of the 38th International Conference on Computer Architecture*.
- [Din09] Din, J., and Jang, L.A., 2009. The state of the art. *Communications of the ACM*, 52(11).
- [Dit15] Ditzel, J., and Jang, L.A., 2015. Large-Scale Deep Learning for Building Intelligent Systems. *ACM Webinar*.
- [Dit16] Du, Z., Fasthuber, R., Chen, T., Jiang, P., Li, Y., Luo, T., Feng, X., Chen, Y., and Tensom, 2015, June. DianNao/Nan: shifting vision processing closer to the sensor. *Proceedings of the 2015 International Symposium on Computer Architecture*.
- [Dit16a] Ditzel, J., Poole, B., Chen, Y., and Xie, Y., 2016, August. Cnnp: Using both weights and connections for efficient neural networks. *2009 International Conference on Field-Programmable Logic and Applications*.
- [Fai11] Faitach, C., Martini, B., Corda, B., Alsekfeld, P., Colacicoli, E., and LeCun, Y., 2011, June. Neuflow: A runtime reconfigurable dataflow processor for vision. *In CIPS’2011 Workshops*.
- [Fai15] Faitach, C., Martini, B., Corda, B., Alsekfeld, P., Colacicoli, E., and LeCun, Y., 2015, July. Deep Learning with Limited Numerical Precision. *ICML*.
- [Fai16] Faitach, C., Martini, B., Corda, B., Alsekfeld, P., Colacicoli, E., and LeCun, Y., 2016. A VLSI architecture for high-performance, low-cost, on-chip learning. *1990 IEEE International Joint Conference on Neural Networks*.
- [Han15] Han, S., Pool, J., Tran, J., and Dally, J., 2015, August. Learning both weights and connections for efficient neural networks. *In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montréal, Québec, Canada*. Proceedings of the 2015 International Symposium on Computer Architecture
- [Han16] Han, S., Pool, J., Tran, J., and Dally, J., 2016. EIE: efficient inference engine on compressed deep neural network. *arXiv preprint arXiv:1603.09027*.
- [Hen96] Henley, R., and Asanovic, K., 1996. Computer architecture: a practical approach, 6th edition. Elsevier.
- [Hou16] Hou, Y., and Lee, Y., 2016. Neuflow: An energy-efficient computer. Morgan and Claypool.
- [Icn16] Icnar, P., Cossu, T., and Kahn, G., 2016. Neuromorphic architecture for neural networks. *Journal of VLSI signal processing systems for signal, image and video technology*, 73(1).
- [Int16] Intel, 2016. Intel® Xeon® Processor E4-2695 v3. <http://intel.com/design/xeon-e4-2695-v3.html>
- [Jia16] Jia, Y., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ji] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16kj] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16lm] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16na] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ob] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16pc] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16qd] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16re] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16sf] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16tg] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16uh] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16vi] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16wf] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16xg] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16yf] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16zg] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16aa] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16bb] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16cc] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16dd] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ee] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ff] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16gg] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16hh] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ii] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16jj] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16kk] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ll] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16mm] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16nn] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16oo] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16pp] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16qq] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16rr] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ss] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16tt] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16uu] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16vv] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ww] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16xx] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16yy] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16zz] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16aa] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16bb] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16cc] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16dd] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ee] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ff] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16gg] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16hh] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ii] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16jj] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16kk] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16ll] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16mm] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16nn] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16oo] Jia, Y., and Han, J., 2016. Deep learning: what have we learned? *CoRR*, abs/1608.08807.
- [Jia16pp] Jia, Y., and Han, J., 2

# Conclusions (1/2)

TPU succeeded because of:

- Large matrix multiply unit
- Substantial software-controlled on-chip memory
- Ability to run whole inference models, reducing dependence on host CPU
- Single-threaded, deterministic execution model: a good match to 99th-percentile response time
- Enough flexibility to match NNs of 2017 vs. 2013
- Omission of GP features ⇒ small, low power die
- Use of 8-bit integers in the quantized apps
- Apps in TensorFlow, so easy to port at speed
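
The 8-bit integer point above can be illustrated with a minimal sketch of a quantized multiply-accumulate. This is not the TPU's actual quantization scheme: the function names, the input values, and the scale factors are all hypothetical, chosen so that the arithmetic is exact. It shows only the general idea of mapping floats to signed 8-bit integers, accumulating integer products in a wider integer, and rescaling at the end.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers: round(v / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(xq, wq):
    """Multiply-accumulate over 8-bit operands, summing in a wider integer."""
    acc = 0
    for x, w in zip(xq, wq):
        acc += x * w
    return acc


# Hypothetical activations, weights, and scales (assumptions for illustration).
x = [0.5, -1.0, 0.25]
w = [1.0, 0.5, -2.0]
x_scale = 1.0 / 64
w_scale = 1.0 / 32

xq = quantize(x, x_scale)
wq = quantize(w, w_scale)

# Rescale the integer accumulator back to a real-valued result.
approx = int8_dot(xq, wq) * x_scale * w_scale
exact = sum(a * b for a, b in zip(x, w))
```

With these particular values the quantized result equals the floating-point one; in general the two differ by a rounding error that depends on the chosen scales.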

# Conclusions (2/2)

- Inference prefers latency over throughput
- K80 GPU relatively poor at inference (vs. training)
- Small redesign improves TPU at low cost
- 15-month design & live on I/O bus, yet TPU 15X-30X faster than Haswell CPU, K80 GPU (inference), with $<\frac{1}{2}$ die size, $\frac{1}{2}$ Watts
  - 65,536 (8-bit) TPU MACs are cheaper, lower energy, & faster than 576 (32-bit) CPU MACs, 2496 (32-bit) GPU MACs
- 10X differences in computer products are rare
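
As a rough sanity check on the MAC counts quoted above, the ratios can be worked out directly. The sketch below uses only the three counts from the slide; the variable names are illustrative, and the square-array reading of 65,536 is an assumption (the slide gives only the total). It compares raw unit counts, not delivered performance, since the TPU MACs are 8-bit while the CPU and GPU MACs are 32-bit.

```python
import math

# MAC unit counts as quoted on the slide.
TPU_MACS = 65_536   # 8-bit
CPU_MACS = 576      # 32-bit, Haswell CPU
GPU_MACS = 2_496    # 32-bit, K80 GPU

# Ratio of raw MAC counts (says nothing about clock rate or utilization).
tpu_vs_cpu = TPU_MACS / CPU_MACS
tpu_vs_gpu = TPU_MACS / GPU_MACS

# 65,536 is a perfect square, consistent with a square array of MACs
# (the array shape is an assumption; the slide states only the total).
side = math.isqrt(TPU_MACS)

print(f"TPU/CPU MAC ratio: {tpu_vs_cpu:.1f}")
print(f"TPU/GPU MAC ratio: {tpu_vs_gpu:.1f}")
print(f"Square side: {side}")
```

The count ratios (over 100x vs. the CPU, about 26x vs. the GPU) are larger than or comparable to the 15X-30X speedups quoted, which fits the point that narrow integer MACs are cheap enough to provision in very large numbers.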

# Questions?

\*4/5/17 Google published a blog on the TPU. A 17-page technical paper with same title will be on arXiv.org. (Paper will also appear at the *International Symposium on Computer Architecture* on June 26, 2017.)

<https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html>