



# Master Informatics Eng.

2020/21

*A.J.Proença*

**Beyond Vector Extensions** (*online*)  
*(most slides are borrowed)*

## *Beyond vector extensions*



- Evolution of vector/SIMD-extended architectures
  - **accelerators optimized for number crunching (GPU)**
  - **support for matrix multiply + accumulate operations**
    - most scientific, engineering, AI & finance applications use matrix computations, namely the dot product: multiply and accumulate the elements in a row of a matrix by the elements in a column from another matrix
    - typically these extensions are **Tensor Processing Units (TPU)**
  - **support for half-precision FP & 8-bit integer**
    - machine learning using neural nets is becoming very popular; computing the model parameters during the training phase relies on intensive matrix products, for which very low precision is adequate!
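The multiply-and-accumulate pattern described above can be sketched in a few lines. This is a plain-Python illustration only; the function names `dot` and `matmul` and the sample matrices are assumptions made for this example, not part of the slides.

```python
def dot(row, col):
    """Dot product as a chain of multiply-accumulate (MAC) steps."""
    acc = 0
    for a, b in zip(row, col):
        acc += a * b  # one multiply-accumulate
    return acc


def matmul(A, B):
    """C[i][j] is the dot product of row i of A with column j of B."""
    cols_of_B = list(zip(*B))  # transpose B to iterate over its columns
    return [[dot(row, col) for col in cols_of_B] for row in A]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
print(C)  # [[19, 22], [43, 50]]
```

Every element of `C` costs as many MACs as the inner dimension, which is why the hardware extensions that follow concentrate on this one operation.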

# *Machine learning w/ neural nets & deep learning*



**Artificial Neural Network  
(Single Layer ML)**



**Deep Neural Network  
(Multiple Layer ML)**



Legend: Input Layer, Hidden Layer, Output Layer

# *Deep Learning phases*



# *Deep Learning workflow*



**Key algorithms to train & classify use matrix dot products,  
but do not require high precision numbers!**

# Training a Neural Net Model



During model **training**, labeled data samples flow from *input* to *output* (*I* to *O*) through all layers of parametrized transformations — a *forward pass*. At the output end, the *output*, or *prediction*, is compared to the *correct answer* for that particular *input*. *Prediction error* is computed; *error* being the difference between the *predicted output* and the *correct one*. Then, the *error* begins to work its way backwards in the *O-to-I* direction, via the *backpropagation algorithm*. As the *error* flows through each layer, it interacts with the *I-to-O* data that produced it (that data has been parked there, waiting for the *error* to come back) and, together, they determine how to change the layer's parameters to most effectively reduce the *error*. The parameters are then adjusted, and this process of forward-backpropagation steps continues for numerous passes over the set of *training examples*, until the *error* becomes insignificant or doesn't decrease anymore.
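A minimal sketch of the loop described above, assuming a deliberately tiny model with a single parameter `w` (prediction = `w * x`) rather than a multi-layer network. The training set, learning rate and number of passes are illustrative choices, not values from the slides.

```python
# Labeled training examples (input, correct answer); the hidden rule is y = 3x.
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0               # the single model parameter, arbitrary starting value
learning_rate = 0.05  # assumed step size


def train_epoch(w):
    """One pass over the training set; returns the new w and the summed squared error."""
    total_sq_error = 0.0
    for x, target in samples:
        prediction = w * x             # forward pass (I to O)
        error = prediction - target    # prediction error at the output
        gradient = error * x           # backward pass: d(0.5 * error^2)/dw, reuses the stored input x
        w -= learning_rate * gradient  # adjust the parameter to reduce the error
        total_sq_error += error * error
    return w, total_sq_error


history = []
for epoch in range(50):
    w, loss = train_epoch(w)
    history.append(loss)

print(w, history[0], history[-1])
```

In a real network the same three steps run layer by layer, and both the forward pass and the gradient computation become matrix products.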



# *Required hardware operations & data types to train & classify neural nets*



## *Approaches to operations on tensors*



- **Tensor**: a mathematical object that describes the relationship between other mathematical objects that are all linked together; they are commonly shown as a multidimensional array
- Different approaches followed by chip manufacturers:
  - add new extensions to existing HPC vector devices
    - **NVidia**: tensor core units in HPC GPUs
    - **Intel**: AVX-512VNNI & AMX
  - develop SoC devices for embedded/specific application fields
    - neural net devices: **Google** TPU, **Intel** Habana, ...
    - autonomous driving: **Tesla** FSD, **NVidia** Orin, ...
    - smartphones: **Apple** A14 Bionic, **Huawei** Kirin 9000, Qualcomm Snapdragon, Samsung Exynos, ...
    - gaming: ...




# NVidia Volta Architecture: the new Tensor Cores



$$D = \left( \begin{array}{cccc} A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\ A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\ A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\ A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3} \end{array} \right) \left( \begin{array}{cccc} B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\ B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\ B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\ B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3} \end{array} \right) + \left( \begin{array}{cccc} C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3} \\ C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3} \\ C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3} \\ C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3} \end{array} \right)$$

where $D$ and $C$ are FP16 or FP32, and $A$ and $B$ are FP16.

Figure 8. Tensor Core 4x4 Matrix Multiply and Accumulate



**For each SM:**  
**8x 64 FMA ops/cycle**  
**1 KFLOP/cycle!**



Figure 9. Mixed Precision Multiply and Accumulate in Tensor Core
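The mixed-precision operation in Figures 8 and 9 can be emulated in software. This is only a sketch under stated assumptions: inputs are rounded to FP16 with Python's `struct` format `'e'`, the accumulator is rounded to FP32 after every step, and the helper names (`to_fp16`, `to_fp32`, `mma_4x4`) are made up for this example. The exact rounding behaviour of the real hardware may differ.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


def mma_4x4(A, B, C):
    """D = A*B + C with FP16 inputs A, B and FP32 accumulation; also counts the FMAs."""
    n = 4
    fma_count = 0
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(C[i][j])
            for k in range(n):
                # product of two FP16 values is exact in double precision
                product = to_fp16(A[i][k]) * to_fp16(B[k][j])
                acc = to_fp32(acc + product)  # accumulate in FP32
                fma_count += 1
            D[i][j] = acc
    return D, fma_count


A = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity
B = [[float(4 * i + j + 1) for j in range(4)] for i in range(4)]
C = [[1.0] * 4 for _ in range(4)]
D, fma_count = mma_4x4(A, B, C)

flops_per_sm_per_cycle = 8 * fma_count * 2  # 8 Tensor Cores per SM, 2 FLOPs per FMA
print(fma_count, flops_per_sm_per_cycle)    # 64 1024
print(to_fp16(2049.0), to_fp32(2049.0))     # 2048.0 2049.0
```

The last line hints at why the accumulator is kept in FP32: FP16 cannot even represent 2049, so a long sum accumulated in FP16 would quickly stop growing.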



## NVidia Ampere Architecture: the new 3<sup>rd</sup> generation Tensor Cores



- GEMM (*General Matrix Multiplication*) computes  $D = A * B + C$ 
  - $A$  can be up to an 8x8 matrix (mixed-precision with  $B$ )
  - $B, C, D$  can be up to 8x4 matrices
- Each SM in A100 has 4 Tensor Cores
  - 4x 256 FMA ops/cycle (FP16), **2 KFLOP/cycle**
- A100 with Fine-Grained Structured Sparsity
  - 2:4 structured sparsity on rows (2 non-zero values in every 4-entry vector)
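To make the 2:4 pattern concrete, here is a small sketch that prunes a row to 2 kept values per group of 4 and stores only those values plus their positions inside each group. Keeping the two largest magnitudes is an assumed pruning criterion, and the function names and storage layout are illustrative, not NVidia's actual format.

```python
def prune_and_compress(row):
    """Keep the 2 largest-magnitude values of every 4; return (values, positions)."""
    assert len(row) % 4 == 0
    values, positions = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(by_magnitude[:2])          # positions (0..3) of the kept values
        values.extend(group[i] for i in keep)
        positions.extend(keep)
    return values, positions


def sparse_dot(values, positions, dense):
    """Dot product that touches only the kept values (half the multiplies)."""
    acc = 0.0
    for n, (v, p) in enumerate(zip(values, positions)):
        group_start = (n // 2) * 4               # two kept values per group of four
        acc += v * dense[group_start + p]
    return acc


row = [0.1, -2.0, 0.0, 3.0, 1.5, 0.2, -0.3, 0.25]
values, positions = prune_and_compress(row)
dense = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
result = sparse_dot(values, positions, dense)
print(values, positions, result)
```

The compressed row holds half as many values as the original, and the dot product performs half as many multiplications, which is the source of the speed-up claimed for sparse Tensor Core operations.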





# Fine-Grained Structured Sparsity in A100



<https://www.nextplatform.com/2020/05/28/diving-deep-into-the-nvidia-ampere-gpu-architecture/>



## Data Types in Ampere Tensor Cores

<https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf>



TF32: same range as FP32 and same precision as FP16

The size of an FP multiplier scales with the square of the mantissa width (BF16 vs. FP16, counting the implicit leading bit:  $8^2/11^2 \approx 0.5$ )



## Tensor Cores per SM: Volta vs. Ampere

<https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf>





## NVidia GPUs

|                                  | "Fermi"       | "Fermi"       | "Kepler"              | "Kepler"              | "Maxwell" | "Pascal"  | "Volta"            | "Turing"           | "Ampere"            |
|----------------------------------|---------------|---------------|-----------------------|-----------------------|-----------|-----------|--------------------|--------------------|---------------------|
| **Tesla GPU**                    | **GF100**     | **GF104**     | **GK104**             | **GK110**             | **GM200** | **GP100** | **GV100**          | **TU104**          | **GA100**           |
| Compute Capability               | 2.0           | 2.1           | 3.0                   | 3.5                   | 5.3       | 6.0       | 7.0                | 7.5                | 8.0                 |
| Streaming Multiprocessors (SMs)  | 16            | 16            | 8                     | 15                    | 24        | 56        | 84                 | 72                 | 128                 |
| FP32 CUDA Cores / SM             | 32            | 32            | 192                   | 192                   | 128       | 64        | 64                 | 64                 | 64                  |
| FP32 CUDA Cores                  | 512           | 512           | 1,536                 | 2,880                 | 3,072     | 3,584     | 5,376              | 4,608              | 8,192               |
| FP64 Units                       | -             | -             | 512                   | 960                   | 96        | 1,792     | 2,688              | -                  | 4,096               |
| Tensor Core Units                | -             | -             | -                     | -                     | -         | -         | 672                | 576                | 512                 |
| Threads / Warp (SIMT/SIMD instr) | 32            | 32            | 32                    | 32                    | 32        | 32        | 32                 | 32                 | 32                  |
| Max Warps / SM (SMT)             | 48            | 48            | 64                    | 64                    | 64        | 64        | 64                 | 64                 | 64                  |
| Max Threads / SM                 | 1,536         | 1,536         | 2,048                 | 2,048                 | 2,048     | 2,048     | 2,048              | 2,048              | 2,048               |
| Max Thread Blocks / SM           | 8             | 8             | 16                    | 16                    | 32        | 32        | 32                 | 32                 | 32                  |
| 32-bit Registers / SM            | 32,768        | 32,768        | 65,536                | 65,536                | 65,536    | 65,536    | 65,536             | 65,536             | 65,536              |
| Max Registers / Thread           | 63            | 63            | 63                    | 255                   | 255       | 255       | 255                | 255                | 255                 |
| Max Threads / Thread Block       | 1,024         | 1,024         | 1,024                 | 1,024                 | 1,024     | 1,024     | 1,024              | 1,024              | 1,024               |
| Shared Memory Size Configs       | 16 KB / 48 KB | 16 KB / 48 KB | 16 KB / 32 KB / 48 KB | 16 KB / 32 KB / 48 KB | 96 KB     | 64 KB     | Config up to 96 KB | Config up to 96 KB | Config up to 164 KB |


# Intel AVX-512: Vector Neural Network Instructions (VNNI)



**VNNI:** 2 new instructions + 2 new extensions, which merge previous sequences of 2 and 3 instructions; e.g.:
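As an illustration of what such a merged instruction computes, the sketch below emulates one 32-bit lane of `VPDPBUSD` as I understand it: four unsigned 8-bit values are multiplied by four signed 8-bit values, the four products are summed, and the sum is added to a signed 32-bit accumulator (non-saturating form). The Python function names are hypothetical; a 512-bit register would hold 16 such lanes.

```python
def vpdpbusd_lane(acc, a_u8, b_s8):
    """One 32-bit lane: acc += sum of four (unsigned 8-bit * signed 8-bit) products."""
    assert len(a_u8) == len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    total = acc + sum(a * b for a, b in zip(a_u8, b_s8))
    total &= 0xFFFFFFFF                      # wrap to 32 bits (no saturation)
    return total - (1 << 32) if total & 0x80000000 else total


def dot_u8s8(a, b):
    """INT8 dot product built from successive lane operations."""
    assert len(a) == len(b) and len(a) % 4 == 0
    acc = 0
    for i in range(0, len(a), 4):
        acc = vpdpbusd_lane(acc, a[i:i + 4], b[i:i + 4])
    return acc


activations = [1, 2, 3, 4, 5, 6, 7, 8]       # unsigned 8-bit
weights = [1, -1, 1, -1, 2, 2, 2, 2]         # signed 8-bit
print(dot_u8s8(activations, weights))        # 50
print(vpdpbusd_lane(0, [255] * 4, [127] * 4))  # 129540, too large for 16 bits
```

The second call shows why the 32-bit accumulator matters: the sum of four INT8 products can exceed the 16-bit range used by the intermediate step of the older multi-instruction sequence.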



# The Intel Advanced Matrix Extension (AMX)

(expected in 2021)



# AMX instructions in Accelerator 1



AMX Extensions:

| Feature Set | Description                                | Instructions   |
|-------------|--------------------------------------------|----------------|
| AMX-TILE    | The base matrix tile architecture support. | 7 instructions |
| AMX-INT8    | Dot-product of Int8 tiles.                 | 4 instructions |
| AMX-BF16    | Dot-product of BF16 tiles.                 | 1 instruction  |

**TMUL** operates on three tiles (**Tile A**, **Tile B**, **Tile C**):  $C += A * B$

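**Data types**

| Format                 | Sign  | Range (exponent) | Precision (mantissa) |
|------------------------|-------|------------------|----------------------|
| FP32                   | 1 bit | 8 bits           | 23 bits              |
| Tensor Float 32 (TF32) | 1 bit | 8 bits           | 10 bits              |
| FP16                   | 1 bit | 5 bits           | 10 bits              |
| BFLOAT16               | 1 bit | 8 bits           | 7 bits               |

TF32 takes its range from FP32 and its precision from FP16.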
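The effect of the different range/precision splits can be explored with a short sketch that derives TF32-like and BF16-like values from an FP32 bit pattern by dropping low mantissa bits. Truncation (rather than rounding) is an assumption made to keep the example short, and the helper names are invented for this illustration.

```python
import struct


def fp32_bits(x):
    """Bit pattern of x after rounding to FP32."""
    return struct.unpack('<I', struct.pack('<f', x))[0]


def bits_to_fp32(bits):
    return struct.unpack('<f', struct.pack('<I', bits))[0]


def truncate_mantissa(x, kept_bits):
    """Keep only the top `kept_bits` of the 23-bit FP32 mantissa (no rounding)."""
    drop = 23 - kept_bits
    mask = (0xFFFFFFFF >> drop) << drop
    return bits_to_fp32(fp32_bits(x) & mask)


def as_tf32(x):
    return truncate_mantissa(x, 10)  # 8-bit exponent, 10-bit mantissa


def as_bf16(x):
    return truncate_mantissa(x, 7)   # 8-bit exponent, 7-bit mantissa


def fits_fp16(x):
    """True if x is within the FP16 range (5-bit exponent)."""
    try:
        struct.pack('<e', x)
        return True
    except OverflowError:
        return False


x = 1.0 + 2.0 ** -8                  # needs 8 mantissa bits
print(as_tf32(x), as_bf16(x))        # 1.00390625 1.0
big = 2.0 ** 100                     # far outside the FP16 range
print(as_bf16(big) == big, fits_fp16(big))  # True False
```

BF16 loses the small increment but keeps the huge value, while FP16 does the opposite; TF32 keeps both.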



# Intel Scalable Xeon roadmap



## TPU: High-level Chip Architecture

- The Matrix Unit: 65,536 (256x256) 8-bit (INT8) multiply-accumulate units
- 700 MHz clock rate
- Peak: 92T operations/second
  - $65,536 * 2 * 700M \rightarrow 92 \text{ TOPS}$ (each FMA counts as 2 operations)
- >25X as many MACs vs GPU
- >100X as many MACs vs CPU
- 4 MiB of on-chip Accumulator memory
- 24 MiB of on-chip Unified Buffer (activation memory), in SRAM
- 3.5X as much on-chip memory vs GPU
- Two 2133MHz DDR3 DRAM channels
- 8 GiB of off-chip weight DRAM memory
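The peak figure quoted for the TPU matrix unit follows directly from the number of multiply-accumulate units and the clock rate. A short check, with variable names chosen for this example:

```python
mac_units = 256 * 256   # 8-bit multiply-accumulate units in the matrix unit
ops_per_mac = 2         # one multiply + one add per cycle
clock_hz = 700e6        # 700 MHz

peak_ops_per_second = mac_units * ops_per_mac * clock_hz
peak_tops = peak_ops_per_second / 1e12
print(mac_units, round(peak_tops, 2))
```

The result is slightly below 92, which the slide rounds to 92 TOPS. It is a peak value that assumes every unit does useful work on every cycle.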





## Google TPU

### Chip floor plan



TPU: a Neural Network Accelerator Chip



TPUs are intensively used by Google, namely in Google Photos, RankBrain, StreetView & Google Translate

### TPUv2 Chip



- 16 GB of HBM
- 600 GB/s mem BW
- Scalar unit: 32b float
- MXU: 32b float accumulation but reduced precision (bfloat) for multipliers
- 45 TFLOPS



## TPUv2 architecture



## TPUv2 core: 1 Lane of the Vector Unit

### TPU Core: Vector Unit (Lane)



## TPU Core: Matrix Multiply Unit

- 128 x 128 systolic array
  - Streaming LHS and results
  - Stationary RHS (w/ optional transpose)
- Numerics
  - bfloat16 multiply
    - $\{s, e, m\} = \{1, 8, 7\}$
    - The original!
  - float32 accumulation
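The "stationary RHS, streaming LHS and results" organisation can be sketched with a small cycle-by-cycle simulation. This is an illustrative model only, assuming one register per processing element for the passing operand and one for the partial sum, integer data instead of bfloat16/float32, and a 2x2 array instead of 128x128; it is not a description of the actual MXU pipeline.

```python
def systolic_matmul(A, W):
    """Weight-stationary systolic array: PE (k, j) holds W[k][j].

    Rows of A (the LHS) stream in from the left, skewed by one cycle per
    array row; partial sums flow downward and leave at the bottom.
    Returns the product and the number of cycles simulated.
    """
    M, K, N = len(A), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]   # operand passed to the right neighbour
    p_reg = [[0] * N for _ in range(K)]   # partial sum passed to the PE below
    out = [[0] * N for _ in range(M)]
    cycles = M + K + N - 2
    for t in range(cycles):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for j in range(N):
                if j == 0:
                    i = t - k             # skewed injection of A[i][k] at the left edge
                    a_in = A[i][k] if 0 <= i < M else 0
                else:
                    a_in = a_reg[k][j - 1]
                p_in = p_reg[k - 1][j] if k > 0 else 0
                new_a[k][j] = a_in
                new_p[k][j] = p_in + a_in * W[k][j]   # the multiply-accumulate
        a_reg, p_reg = new_a, new_p
        for j in range(N):                # collect finished sums at the bottom row
            i = t - (K - 1) - j
            if 0 <= i < M:
                out[i][j] = p_reg[K - 1][j]
    return out, cycles


A = [[1, 2], [3, 4]]
W = [[5, 6], [7, 8]]
result, cycles = systolic_matmul(A, W)
print(result, cycles)  # [[19, 22], [43, 50]] 4
```

Each weight is loaded once and reused for every row of `A`, and operands only move between neighbouring elements, so no register file or cache access is needed inside the array.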





## Google TPUv3 (May'18)

TPUv4 released Jun 2020  
but no data available yet...



# *Neural net devices at Intel: Nervana & later Habana chips*



- Intel acquired **Nervana** Engine (Aug 2016)
- Intel launched Nervana NNP (Neural Net Processor) (Oct 2017)
- Key features: matrix multiplication & convolution (*for neural nets*)
- Intel discontinued Nervana NNP (Jan 2020...)

- Intel acquired the Israeli chipmaker **Habana** Labs (Dec 2019)
  - Habana training chip **Gaudi**, with support for FP32, INT32, **BF16**, INT16, INT8, UINT32, UINT16, UINT8
  - Habana inference chip **Goya**, with support for FP32, INT32, INT16, INT8, UINT32, UINT16, UINT8



## Training Processor: GAUDI


<https://en.wikichip.org/wiki/habana/microarchitectures/gaudi>



**TPC:** Tensor Processing Core  
**GEMM:** General Matrix Multiply





## Inference Processor: GOYA


<https://en.wikichip.org/wiki/habana/microarchitectures/goya>



TPC: Tensor Processing Core  
GEMM: General Matrix Multiply






# Tesla Full Self-Driving chip (FSD)



<https://en.wikichip.org/wiki/tesla_(car_company)/fsd_chip>



# The Neural Processing Unit in FSD





# NVidia roadmap for Drive systems





# NVidia Xavier SoC

- **I/O**: 16 CSI, 109 Gbps; 1gE & 10gE
- **DLA**: 5 TFLOPS FP16, 10 TOPS INT8
- **Video Processor**: 1.2 GPIX/s encode, 1.8 GPIX/s decode
- **PVA**: 1.6 TOPS; stereo disparity, optical flow, image processing
- **Volta GPU**: FP32 / FP16 / INT8 multi-precision; 512 CUDA cores; 1.3 CUDA TFLOPS; 20 Tensor Core TOPS
- **ISP**: 1.5 GPIX/s; native full-range HDR; tile-based processing
- **Carmel ARM64 CPU**: 8 cores; 10-wide superscalar; 2700 SpecInt2000; functional safety features; dual execution mode; parity & ECC
- **Memory**: 256-bit LPDDR4, 137 GB/s

Jun'18



# NVidia Xavier SoC



**Deep Learning Accelerator**  
Large array of multiply-accumulate units optimized for CNNs



# NVidia Drive SoCs: Parker, Xavier, Orin



NVIDIA ARM SoC Specification Comparison

|                           | Orin (2021?)                      | Xavier (2018)                      | Parker (2016)                        |
|---------------------------|-----------------------------------|------------------------------------|--------------------------------------|
| **CPU Cores**             | 12x Arm "Hercules" Cortex-A78AE * | 8x NVIDIA Custom ARM "Carmel"      | 2x NVIDIA Denver + 4x Arm Cortex-A57 |
| **GPU Cores**             | Ampere iGPU (?? cores)            | Xavier Volta iGPU (512 CUDA Cores) | Parker Pascal iGPU (256 CUDA Cores)  |
| **INT8 DL TOPS**          | 200 TOPS                          | 30 TOPS                            | N/A                                  |
| **FP32 TFLOPS**           | ?                                 | 1.3 TFLOPS                         | 0.7 TFLOPS                           |
| **Manufacturing Process** | 7nm?                              | TSMC 12nm FFN                      | TSMC 16nm FinFET                     |
| **TDP**                   | ~5-45W                            | 30W                                | 15W                                  |

\* AE, *Automotive Enhanced*: improved functional security and safety
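A quick back-of-the-envelope comparison using only figures from the Xavier slide and the table. Two assumptions are made here: that Xavier's 30 TOPS is the sum of the DLA and Tensor Core INT8 figures (the numbers match, but the slides do not say so explicitly), and that the upper end of Orin's "~5-45W" range applies at 200 TOPS.

```python
# Figures copied from the slides; variable names are illustrative.
xavier_dla_int8_tops = 10
xavier_tensor_core_tops = 20
xavier_total_tops = xavier_dla_int8_tops + xavier_tensor_core_tops

socs = {
    # name: (INT8 DL TOPS, TDP in watts; 45 W for Orin is an assumed upper bound)
    "Xavier": (30, 30),
    "Orin": (200, 45),
}
tops_per_watt = {name: tops / watts for name, (tops, watts) in socs.items()}

print(xavier_total_tops)
for name, value in tops_per_watt.items():
    print(name, round(value, 1))
```

Under these assumptions Orin delivers roughly four times more INT8 operations per watt than Xavier.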



# NVidia Orin: a SoC for Autonomous Driving






# Evolution of Apple A series

**Bionic => with a neural engine**





## Apple A14 Bionic SoC



Key features of the Apple A14 Bionic SoC:

- Machine learning controller
- New 6-core CPU
- Next-generation ML accelerators
- 16-core Neural Engine
- Advanced image signal processor
- New 4-core GPU
- Secure Enclave
- **5 nanometer process**
- **11.8 billion transistors**
- **11 trillion operations per second**



## Apple A14 Bionic SoC



Apple-designed  
64-bit six-core CPU  
implementing ARMv8.4 ISA  
8 MiB L2 cache



16-core Neural Engine +  
2<sup>nd</sup> gen ML accelerators (AMX) +  
high-performance 4-core GPU =>  
powerful image recognition, natural language learning, motion analysis, ...





## Roadmap of Apple Silicon



# Apple M1: an extended version of A14



Nov'20





# Apple M1 SoC



**5-nanometer process**  
**16 billion transistors**





## Apple M1 Performance



## *The new MacBooks (Nov'20)*

# Apple Event

November 10, 2020

Features highlighted for the new MacBooks:

- Unified memory architecture
- Up to **3.5x** faster CPU
- Up to **6x** faster GPU
- Up to **15x** faster machine learning
- Neural Engine
- macOS Big Sur
- Advanced camera ISP
- Industry-leading performance per watt
- Up to **20 hours** battery life
- Wi-Fi 6
- iPhone and iPad apps
- Secure Enclave
- Universal apps
- M1



## Huawei Kirin 9000



The Most Powerful Chip, Ever





# Huawei Kirin 9000



A77: old ARM generation...




## *Approaches to operations on tensors*



- **Tensor:** a mathematical object that describes the relationship between other mathematical objects that are all linked together; they are commonly shown as a multidimensional array
- Different approaches followed by chip manufacturers:
  - add new extensions to existing HPC vector devices
    - **NVidia:** tensor core units in HPC GPUs
    - **Intel:** AVX-512VNNI & AMX
  - develop SoC devices for embedded/specific application fields
    - neural net devices: **Google** TPU, **Intel** Habana, ...
    - autonomous driving: **Tesla** FSD, **NVidia** Orin, ...
    - smartphones: **Apple** A14 Bionic, **Huawei** Kirin 9000, Qualcomm Snapdragon, Samsung Exynos, ...
    - gaming: ...

And after this...

AJProença, Advanced Architectures, MiEI, UMinho, 2020/21

Yet some more odd approaches:

- SoC with reconfigurable components: **Xilinx** ACAP
- a system based on a **very large** “chip”: **Cerebras**

## Xilinx Versal: an Adaptive Compute Acceleration Platform (ACAP)



ACAP die, an adaptable accelerator-fabric ecosystem with:

- multicore ARM SoC's
- an FPGA fabric with distributed memory
- hw-programmable DSP engines
- AI Engines with vector units
- other specialized accelerators
- a flexible NoC interconnection



# The AI Engine Array in Xilinx Versal (1)



## AI Engine: Xilinx Reinvents Multi-Core Compute

### Traditional Multi-core (cache-based architecture)



- Fixed, shared Interconnect
  - Blocking limits compute
  - Timing not deterministic

- Data Replicated
  - Robs bandwidth
  - Reduces capacity

### AI Engine Array (intelligent engine)



- Local, Distributed Memory
  - No cache misses
  - Higher bandwidth
  - Less capacity required

## The AI Engine Array in Xilinx Versal (2)



### AI Engine: Tile-Based Architecture



# The AI Engine Array in Xilinx Versal (3)



## AI Inference Mapping on Versal ACAP

A = Activations  
W = Weights

$$\begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{bmatrix} \times \begin{bmatrix} W_{00} & W_{01} \\ W_{10} & W_{11} \end{bmatrix} = \begin{bmatrix} A_{00} \times W_{00} + A_{01} \times W_{10} & \dots \\ A_{10} \times W_{00} + A_{11} \times W_{10} & \dots \end{bmatrix}$$



- Custom memory hierarchy
  - Buffer on-chip vs off-chip; reduce latency and power
- Broadcast on AI interconnect (weights and activations)
  - Read once: reduce memory bandwidth
- AI-optimized vector instructions (128 INT8 mults/cycle)



**Cerebras** Wafer Scale Engine (WSE):  
the largest chip ever built

**46,225 mm<sup>2</sup> chip**

56x larger than the biggest GPU ever made

**400,000 cores**

78x more cores

**18 GB on-chip SRAM**

3000x more on-chip memory

**100 Pb/s interconnect**

33,000x more bandwidth





## Some additional data:

- 16 nm process
- 215 mm x 215 mm, ~15 kW consumption !
- 84 individual chips (12 wide by 7 tall)
- each chip:
  - 225 MiB SRAM
  - $54 \times 94 = 5,076$  Sparse Linear Algebra cores (SLA)  
(2 cores per row/column unused due to repair scheme leaving 4,888 usable cores)
- each core:
  - 47 kiB SRAM
  - Zeros not loaded from memory and zeros not multiplied
  - FP32 precision and scalar execution (can't filter zeros from memory with SIMD)
  - FMAC datapath with a peak of 8 operations per cycle
  - Tensor control unit feeds the FMAC datapath with strided accesses  
(from memory or inbound data from links)
  - 4x 8 GB/s bidirectional links to its neighbours
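The figures above are mutually consistent, which a few lines of arithmetic confirm. All inputs are taken from the slides; only the variable names are introduced here.

```python
chips = 12 * 7                          # 84 individual chips on the wafer
cores_per_chip_total = 54 * 94          # 5,076 cores laid out per chip
cores_per_chip_usable = 4888            # usable cores quoted after the repair scheme
unused_per_chip = cores_per_chip_total - cores_per_chip_usable  # 188 = 2 * 94

usable_cores = chips * cores_per_chip_usable                 # about 400,000
sram_per_chip_mib = cores_per_chip_usable * 47 / 1024        # 47 kiB per core
sram_total_gib = chips * 225 / 1024                          # 225 MiB per chip
wafer_area_mm2 = 215 * 215                                   # 215 mm x 215 mm

print(usable_cores, round(sram_per_chip_mib, 1),
      round(sram_total_gib, 1), wafer_area_mm2)
```

The totals reproduce the headline numbers: about 400,000 cores, roughly 18 GB of on-chip SRAM and a 46,225 mm² die.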





## Architecture Designed for Deep Learning

**Each component optimized for AI compute**

### Compute

- Fully-programmable core, ML-optimized extensions
- Dataflow architecture for sparse, dynamic workloads

### Memory

- Distributed, high performance, on-chip memory

### Communication

- High bandwidth, low latency fabric
- Cluster-scale networking on chip
- Fully-configurable to user-specified topology

**Together, orders of magnitude performance and efficiency gain**

**Linear cluster-scale performance on a single chip**





# Cerebras: programming WSE

Figure 8: A high-level overview of the compilation process for the WSE

# Packing the Cerebras WSE: clusters of CS-1s (Argonne National Lab)

## The CS-1 internal structure



### 2. Engine Block

Power, cooling and packaging solution for the Wafer-Scale Engine



## Wafer Scale Engine – Generation 2

850,000 **AI-optimized cores**

2.6 Trillion **Transistors**

TSMC 7nm **Process**



Cerebras executive Sean Lie
