

# Tensix AI processor IP

## IP Business Unit



CONFIDENTIAL - CONTAINS TRADE SECRETS

# Silicon Roadmap



# Tensix AI IP Roadmap



# Tensix: IP for your AI Workload



Efficient: industry-leading  
perf/W and perf/\$\$



Training & Inference



Silicon proven



Scalable & Configurable



Easy to Use & Mature  
Software stack(s)

# Tensix-BH Features

## Near-memory compute

- High-bandwidth L1 <=> Register interface
- Vector int/float accumulation in L1 (1.5 MB) memory

## Many data formats supported

- FP8, BFP2/4/8, INT8, INT32, BF16, FP16, TF32, FP32

## General purpose SIMD engine (SFPU)

- Fast transcendental and activation instructions (GELU, ReLU, exp)
- SFPU C++ compiler

## 5 C++-programmable RISC-V cores per Tensix

- RISC-V cores manage overhead of computation and NoC transfers
- Icache, data cache, data local memory, instruction prefetcher, branch predictor, floating-point and atomics support
- Support for interrupts

## Sparsity

- HW support for fine-grained structured sparsity

## Silicon Proven with Grayskull, Wormhole, and Blackhole



# Data Formats

| Format               | Spec.                                                                       |
|----------------------|-----------------------------------------------------------------------------|
| BFP8/BFP4/BFP2 (A/B) | Shared exp, 1-bit sign, 7/3/1-bit mantissa                                  |
| FP8                  | 4-bit exponent, 3-bit mantissa<br>5-bit exponent, 2-bit mantissa            |
| Integer              | Int8: signed/unsigned<br>Int32: signed                                      |
| FP16 (A/B)           | A: 1-bit sign, 5-bit exp, 10-bit man<br>B: 1-bit sign, 8-bit exp, 7-bit man |
| TF32                 | 1-bit sign, 8-bit exp, 10-bit man                                           |
| FP32                 | 1-bit sign, 8-bit exp, 23-bit man                                           |
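The two FP16 layouts in the table differ only in how the 15 non-sign bits are split between exponent and mantissa. As a minimal sketch of that difference, the snippet below decodes the same kind of 16-bit word under either split. It assumes an IEEE-style exponent bias and an implicit leading one, and it ignores subnormals, infinities and NaNs; the helper name `decode16` is hypothetical and not part of any Tenstorrent API.

```python
def decode16(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a 16-bit word laid out as sign | exponent | mantissa.

    Handles normal numbers only. The bias 2**(exp_bits - 1) - 1 is an
    assumption (IEEE-style), not something stated in the table.
    """
    assert 1 + exp_bits + man_bits == 16
    sign = (bits >> 15) & 0x1
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    value = (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
    return -value if sign else value


# FP16 A: 1-bit sign, 5-bit exp, 10-bit man
fp16_a_one = decode16(0x3C00, exp_bits=5, man_bits=10)
# FP16 B: 1-bit sign, 8-bit exp, 7-bit man
fp16_b_one = decode16(0x3F80, exp_bits=8, man_bits=7)
```

Under these assumptions both words decode to 1.0: the B layout trades three mantissa bits for three extra exponent bits, which widens range at the cost of precision.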

The FPU multiplier operates on 5b x 7b inputs and uses multiple phases for higher-fidelity formats:

- BFP4 -> 1 phase (2K MACs/cycle/core)
- BFP8 -> 2 phases
- FP16/TF32 -> 4 phases
- BFLOAT can be done in 2 phases with loss of 1 LSB
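The phase counts above translate directly into throughput if each extra phase costs one more pass through the multiplier. The sketch below assumes exactly that (throughput inversely proportional to the phase count), starting from the 2K MACs/cycle/core figure quoted for BFP4; the names `PHASES` and `macs_per_cycle` are illustrative only.

```python
# Single-phase rate from the slide: BFP4 at 2K MACs/cycle/core.
BASE_MACS_PER_CYCLE = 2048

# Phase counts as listed above; BFLOAT16 uses the 2-phase mode
# that gives up 1 LSB.
PHASES = {
    "BFP4": 1,
    "BFP8": 2,
    "BFLOAT16": 2,
    "FP16": 4,
    "TF32": 4,
}


def macs_per_cycle(fmt: str) -> int:
    """MACs/cycle/core, assuming throughput scales as 1 / phases."""
    return BASE_MACS_PER_CYCLE // PHASES[fmt]


rates = {fmt: macs_per_cycle(fmt) for fmt in PHASES}
```

With this assumption the rates come out as 2048, 1024, 1024 and 512 MACs/cycle for BFP4, BFP8, BFLOAT16 and FP16, which agrees with the BH column of the configuration overview table.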

# Tensix I/O

## 4 ring interconnects

- N/S: 1 ring CW and 1 ring CCW
- E/W: 1 ring CW and 1 ring CCW

Data width: 256b (WH), 512b (BH)
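With one clockwise and one counter-clockwise ring per dimension, a transfer can in principle take whichever direction is shorter. The sketch below counts hops under that assumption, treating the two dimensions independently; the routing policy, the 8 x 10 grid and the function names are all hypothetical. It also converts the quoted data widths to bytes per cycle.

```python
def ring_hops(src: int, dst: int, size: int) -> int:
    """Hops between two nodes on a ring, taking the shorter direction."""
    cw = (dst - src) % size    # hops going clockwise
    ccw = (src - dst) % size   # hops going counter-clockwise
    return min(cw, ccw)


def torus_hops(src, dst, cols: int, rows: int) -> int:
    """Hops on a 2D torus: one ring traversal per dimension."""
    return ring_hops(src[0], dst[0], cols) + ring_hops(src[1], dst[1], rows)


# Data widths from the slide, in bits per cycle.
LINK_BITS = {"WH": 256, "BH": 512}
link_bytes_per_cycle = {chip: bits // 8 for chip, bits in LINK_BITS.items()}

# Hypothetical 8 x 10 grid: opposite corners are only 2 hops apart,
# because both rings wrap around.
corner_to_corner = torus_hops((0, 0), (7, 9), cols=8, rows=10)
```

The 512b width works out to 64 B/cycle, matching the BH link bandwidth in the configuration overview table.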



# Tensix NEO Architecture



# Tensix NEO

## High performance / mm<sup>2</sup>

- Reduced total SRAM due to better utilization
- Fewer L2 banks with simpler crossbar
- Implementation improvements with tiled design
- Many micro-architectural optimizations

## Improved SFPUs and FPU utilization

- Higher L1 bandwidth to more easily saturate FPU
- Optimized ISA for complex data patterns
- Dedicated RISC-V core and registers for SFPUs

## Simpler data movement

- More powerful RISC-V data movement cores
- 64-bit address space
- Significantly smaller DMA engines
- Wide bi-directional mesh NoC

## Even more data formats

- Support for new Microscaling formats
- Added FP4



## Tensix Neo Data Formats

- Formats
  - FP16
  - BF16
  - UINT8/INT8
  - MXFP8
  - MXFP6P
  - MXFP6R
  - *MXINT8*
  - *MXINT4*
  - *MXINT2*
- Vector Unit with FP32 support

# AI IP Configurations

## Data-types

- Floating-point + integer
- Integer only

## Throughput

- Configurable matrix and vector engine throughputs

## RISC-V Cores

- Configurable cache sizes
- Optional vector unit with configurable vector width

## L1 Memory size

- Number of banks: 16 or 32
- Configurable space per bank

# Neo Config-1

| Feature                | Neo Config-1               |
|------------------------|----------------------------|
| Number of Tensix       | $4 \times 4 \times 4 = 64$ |
| SRAM/Tensix            | 0.75MB                     |
| Tensix/Dispatch        | 8                          |
| Dispatch SRAM          | 4MB                        |
| NOC BW (Aggregate)     | 256GB/s                    |
| I/O BW                 | 256GB/s                    |
| DRAM BW (Per Endpoint) | 15GB/s                     |
| DRAM Endpoints         | 8                          |
| Aggregate DRAM BW      | 120GB/s                    |
| Memory Locations       | North Edge                 |
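Several rows of this table follow from the others, which makes it easy to sanity-check a configuration. The sketch below recomputes the derived figures; it assumes "Tensix/Dispatch" means Tensix cores per dispatch unit, and the variable names are illustrative.

```python
# Values copied from the Neo Config-1 table.
num_tensix = 4 * 4 * 4
sram_per_tensix_mb = 0.75
tensix_per_dispatch = 8        # assumed: Tensix cores per dispatch unit
dram_bw_per_endpoint_gbs = 15
dram_endpoints = 8

# Derived quantities.
total_tensix_sram_mb = num_tensix * sram_per_tensix_mb
dispatch_units = num_tensix // tensix_per_dispatch
aggregate_dram_bw_gbs = dram_bw_per_endpoint_gbs * dram_endpoints
dram_bw_per_tensix_gbs = aggregate_dram_bw_gbs / num_tensix
```

The aggregate DRAM bandwidth comes out at 120 GB/s, as listed in the table; spread evenly over 64 Tensix cores that is a little under 2 GB/s per core.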



# Tensix IP Configuration Overview

|                                   | BH        | BH+       | NEO<br>Auto        | NEO                |
|-----------------------------------|-----------|-----------|--------------------|--------------------|
| <b>Matrix Multiply Throughput</b> |           |           |                    |                    |
| FP4 MACs                          | N/A       | N/A       | 4096               | 16384              |
| BFP4/FP8 MACs                     | 2048      | 1024      | 4096               | 8192               |
| BFP8 MACs                         | 1024      | 2048      | 4096               | 8192               |
| BFLOAT16 MACs                     | 1024      | 2048      | 4096               | 8192               |
| FP16 MACs                         | 512       | 512       | 1024               | 2048               |
| INT8 MACs                         | 512       | 2048      | 8192               | 8192               |
| FP32 accumulator                  | Yes       | Yes       | Yes                | Yes                |
| <b>Vector Multiply Throughput</b> |           |           |                    |                    |
| FP32 MACs                         | 32        | 32        | 128                | 128                |
| <b>SRAM</b>                       |           |           |                    |                    |
| L1 size (MB)                      | 1.5       | 1.5       | 3-4                | 4                  |
| <b>NoC</b>                        |           |           |                    |                    |
| Topology                          | 2D Torus  | 2D Torus  | 2D Mesh            | 2D Mesh            |
| Link BW                           | 64B/cycle | 64B/cycle | 256B/cycle         | 256B/cycle         |
| ASIL                              | No        | No        | Yes                | No                 |
| <b>Physical</b>                   |           |           |                    |                    |
| Area (mm<sup>2</sup>)             | 2.5       | TBD       | 6.5                | 6.3                |
| Power                             | 2.11      | TBD       | 2.0                | 3.0                |
| Node                              | N6        | N6        | S5A                | S4                 |
| Clock Speed (GHz)                 | 1.45      | 1.45      | 1                  | 1.15               |
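The MAC counts and clock speeds can be combined into a rough peak-throughput figure per Tensix. This is a sketch under three assumptions that the table does not state explicitly: the MAC rows are per cycle per Tensix, the Clock Speed row is in GHz, and one MAC counts as two operations. The dictionary and function names are illustrative.

```python
# BFP8 MACs and clock speed copied from the table (BH+ omitted for brevity).
CONFIGS = {
    "BH":       {"bfp8_macs": 1024, "clock_ghz": 1.45},
    "NEO Auto": {"bfp8_macs": 4096, "clock_ghz": 1.0},
    "NEO":      {"bfp8_macs": 8192, "clock_ghz": 1.15},
}


def peak_tops(macs_per_cycle: int, clock_ghz: float) -> float:
    """Peak tera-ops/s per Tensix, counting 1 MAC as 2 ops (assumption)."""
    giga_ops = macs_per_cycle * 2 * clock_ghz
    return giga_ops / 1000.0


bfp8_tops = {
    name: peak_tops(cfg["bfp8_macs"], cfg["clock_ghz"])
    for name, cfg in CONFIGS.items()
}
```

Under these assumptions BFP8 peaks at roughly 3.0, 8.2 and 18.8 TOPS per Tensix for BH, NEO Auto and NEO respectively. These are upper bounds from the table alone, not measured numbers.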



# Simulation Modelling



# Introduction



Different Abstraction Levels: Answer Specific Questions/Demands

- Abstract Models: Product/Roadmap Planning, Competitive Analysis, TCO Models
- Hybrid Models: Architecture/Design Space Exploration, HW/SW Codesign
- Concrete Models: Run real kernels on a single core

# Architecture Performance Modelling Goals

- **Concrete Modeling:** Create a configurable, cycle-approximate microarchitectural simulation model of the Tensix core that can run real kernels (binary/source compatible, extensible ISA) on a single core, or a small grid of cores, in an idealized SoC setting.
- **Hybrid Modeling:** Create a configurable, high-level architecture simulation model of the TT SoC to project the performance of real workloads (Buda/Metal) on a single SoC (e.g., Quasar) or at scale (e.g., multiple Quasars + Ethernet interconnect).
- **Abstract Modeling:** Create a configurable, high-level analytical model for TT system configuration TCO analysis, roadmap planning, and competitive projections.
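To illustrate what the abstract (analytical) level can look like, here is a minimal roofline-style estimate for a single matmul: execution time is bounded by whichever of compute or memory traffic takes longer. This is a generic textbook model, not the team's actual tool; the hardware numbers and all names are hypothetical.

```python
def matmul_cost(m: int, k: int, n: int, bytes_per_elem: int):
    """FLOPs and bytes moved for C[m,n] = A[m,k] @ B[k,n], each tensor touched once."""
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops, bytes_moved


def roofline(flops: float, bytes_moved: float, peak_flops: float, mem_bw: float):
    """Return (estimated seconds, limiting resource) under a roofline model."""
    t_compute = flops / peak_flops
    t_memory = bytes_moved / mem_bw
    bound = "compute" if t_compute >= t_memory else "memory"
    return max(t_compute, t_memory), bound


# Hypothetical device: 100 TFLOP/s peak, 100 GB/s memory bandwidth.
PEAK_FLOPS = 100e12
MEM_BW = 100e9

# Decode-like shape (one token) versus prefill-like shape (1024 tokens),
# both with 1-byte elements.
decode_time, decode_bound = roofline(*matmul_cost(1, 4096, 4096, 1), PEAK_FLOPS, MEM_BW)
prefill_time, prefill_bound = roofline(*matmul_cost(1024, 4096, 4096, 1), PEAK_FLOPS, MEM_BW)
```

On this hypothetical device the single-token shape is memory-bound and the 1024-token shape is compute-bound, which is the kind of first-order answer an abstract model is meant to give before moving to hybrid or concrete simulation.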

# Plans/Timeline

Post Si Correlation

| 2024                                            |     |     |                                      |     |                                     |     |                                            | 2025                                                   |                                                           |     |     |     |     |     |  |  |
|-------------------------------------------------|-----|-----|--------------------------------------|-----|-------------------------------------|-----|--------------------------------------------|--------------------------------------------------------|-----------------------------------------------------------|-----|-----|-----|-----|-----|--|--|
| Q3                                              |     |     |                                      | Q4  |                                     |     |                                            | Q1                                                     |                                                           |     |     | Q2  |     |     |  |  |
| JUL                                             | AUG | SEP | OCT                                  | NOV | DEC                                 | JAN | FEB                                        | MAR                                                    | APR                                                       | MAY | JUN | JUL | AUG | SEP |  |  |
| <b>Abstract Modeling<br/>(SoC)</b>              |     |     | LLMs@Roofline<br>(ONNX/Python API)   |     | Si Correlation<br>(ONNX/Python API) |     |                                            | Improved Op/WL Support<br>(ONNX/PyTorch2.x/Python API) |                                                           |     |     |     |     |     |  |  |
| <b>Abstract Modeling<br/>(At Scale Systems)</b> |     |     | LLMs@Roofline<br>(ONNX/Python API)   |     | Collectives Support                 |     | API Integration<br>(ORT-DSpeed/Python API) |                                                        | Improved WL Support<br>(ORT-DSpeed/PyTorch2.x/Python API) |     |     |     |     |     |  |  |
| <b>Hybrid Modeling<br/>(SoC)</b>                |     |     | LLMs@QSR NOC<br>(ttMetal/Python API) |     | RTL/Si Correlation<br>(ttMetal)     |     | Improved Dataflow Support<br>(ttMetal)     |                                                        |                                                           |     |     |     |     |     |  |  |
| <b>Concrete Modeling<br/>(Tensix Neo)</b>       |     |     | Key LLKs (Matmul/Attn)<br>(ttMetal)  |     | RTL/Si Correlation<br>(ttMetal)     |     | LLK Integration<br>(ttMetal)               |                                                        |                                                           |     |     |     |     |     |  |  |

# Software



# Tenstorrent Software – Two Different Approaches

GENERALITY > PERFORMANCE



Quick deployment and improving performance

- Out of the box, run many models, small overhead incurred
- Performance continues to increase on monthly release cadence
- Support for all major frameworks including PyTorch and TensorFlow

PERFORMANCE > GENERALITY



Complete low-level control of the hardware

- Targeted models run at higher performance and higher utilization
- Completely open hardware allows you to program non-ML applications
- Useful for HPC, C++ environments and low-level model development

# Open Tenstorrent Software

- TT-Forge: integrated into various frameworks for native model ingest
- TT-MLIR: new MLIR-based compiler
- TT-NN: a library of optimized operators
  - ATen coverage
  - PyTorch-like API
- TT-Metalium: low-level programming model and entry point





## LLMs

| Model                                      | Batch | Hardware                 | ttft<br>(ms) | t/s/u | Target<br>t/s/u | t/s    | Release                      |
|--------------------------------------------|-------|--------------------------|--------------|-------|-----------------|--------|------------------------------|
| <a href="#">Falcon7B-decode</a>            | 32    | <a href="#">e150</a>     |              | 4.2   | 4.4             | 134.4  |                              |
| <a href="#">Falcon7B</a>                   | 32    | <a href="#">n150</a>     | 75           | 17.1  | 26              | 547.2  | <a href="#">v0.53.0-rc33</a> |
| <a href="#">Mistral-7B</a>                 | 32    | <a href="#">n150</a>     |              | 9.9   | 25              | 316.8  | <a href="#">v0.51.0-rc28</a> |
| <a href="#">Mamba-2.8B</a>                 | 32    | <a href="#">n150</a>     | 48           | 12.3  | 41              | 393.6  | <a href="#">v0.51.0-rc26</a> |
| <a href="#">LLaMA-3.1-8B</a>               | 1     | <a href="#">n150</a>     | 291          | 22.9  | 23              | 22.9   | <a href="#">v0.53.0-rc16</a> |
| <a href="#">Falcon7B (DP=8)</a>            | 256   | <a href="#">QuietBox</a> | 101          | 14.4  | 26              | 3686.4 | <a href="#">v0.53.0-rc33</a> |
| <a href="#">LLaMA-3.1-70B (TP=8)</a>       | 32    | <a href="#">QuietBox</a> | 190          | 15.1  | 20              | 483.2  | <a href="#">v0.53.0-rc33</a> |
| <a href="#">Falcon40B (TP=8)</a>           | 32    | <a href="#">QuietBox</a> |              | 5.3   | 36              | 169.6  | <a href="#">v0.53.0-rc33</a> |
| <a href="#">Mixtral7Bx8 (TP=8)</a>         | 32    | <a href="#">QuietBox</a> | 235          | 14.2  | 33              | 454.4  | <a href="#">v0.53.0-rc33</a> |
| <a href="#">Falcon7B (DP=32)</a>           | 1024  | <a href="#">Galaxy</a>   | 242          | 4.4   | 26              | 4505.6 | <a href="#">v0.53.0-rc33</a> |
| <a href="#">LLaMA-3.1-70B (DP=4, TP=8)</a> | 128   | <a href="#">Galaxy</a>   | 190          | 14.3  | 20              | 1835.5 | <a href="#">v0.52.0-rc31</a> |

Last Update: November 4, 2024
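The columns of the LLM table are related by simple arithmetic. Reading t/s/u as tokens per second per user and t/s as aggregate tokens per second, the aggregate figure is the per-user rate times the batch size. The sketch below checks that relation on a few rows and computes how close each row is to its target; the function names are illustrative.

```python
# (model, hardware, batch, t/s/u, target t/s/u, reported t/s), from the table.
ROWS = [
    ("Falcon7B", "n150", 32, 17.1, 26, 547.2),
    ("LLaMA-3.1-8B", "n150", 1, 22.9, 23, 22.9),
    ("Falcon7B (DP=8)", "QuietBox", 256, 14.4, 26, 3686.4),
    ("LLaMA-3.1-70B (TP=8)", "QuietBox", 32, 15.1, 20, 483.2),
]


def aggregate_tps(batch: int, tokens_per_s_per_user: float) -> float:
    """Aggregate tokens/s = batch size x tokens/s/user."""
    return batch * tokens_per_s_per_user


def fraction_of_target(tokens_per_s_per_user: float, target: float) -> float:
    """Measured t/s/u as a fraction of the target t/s/u."""
    return tokens_per_s_per_user / target


summary = {
    model: (aggregate_tps(batch, tsu), fraction_of_target(tsu, target))
    for model, _hw, batch, tsu, target, _reported in ROWS
}
```

Most rows reproduce the reported t/s exactly. The LLaMA-3.1-70B Galaxy row (14.3 x 128 = 1830.4 versus 1835.5 reported) is slightly off, presumably because the t/s/u column is rounded.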

<https://github.com/tenstorrent/tt-metal>

## CNNs

| Model                                          | Batch | Hardware                     | fps     | Target fps | Release |
|------------------------------------------------|-------|------------------------------|---------|------------|---------|
| <a href="#">ResNet-50 (224x224)</a>            | 20    | <a href="#">e150</a>         | 5,100   | 10,000     |         |
| <a href="#">ResNet-50 (224x224)</a>            | 16    | <a href="#">n150</a>         | 4,100   | 7,000      |         |
| <a href="#">ResNet-50 (224x224) (DP=2)</a>     | 32    | <a href="#">n300</a>         | 8,200   | 14,000     |         |
| <a href="#">ResNet-50 (224x224) (DP=8)</a>     | 128   | <a href="#">QuietBox</a>     | 32,250  | 56,000     |         |
| <a href="#">ResNet-50 (224x224) (DP=32)</a>    | 512   | <a href="#">Galaxy</a>       | 95,900  | 224,000    |         |
| <a href="#">ResNet-50 (224x224) (DP=64)</a>    | 1024  | <a href="#">Two Galaxies</a> | 145,000 | 448,000    |         |
| <a href="#">ViT</a>                            | 9     | <a href="#">e150</a>         | 1,360   | 2,000      |         |
| <a href="#">ViT</a>                            | 8     | <a href="#">n150</a>         | 912     | 1,600      |         |
| <a href="#">Stable Diffusion 1.4 (512x512)</a> | 1     | <a href="#">n150</a>         | 0.167   | 0.3        |         |
| <a href="#">Yolo V4 (320x320)</a>              | 1     | <a href="#">n150</a>         | 95      | 300        |         |
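The ResNet-50 rows also show how well data parallelism scales. A minimal sketch, assuming each DP replica is an n150-class device running batch 16 (consistent with the batch column), compares measured fps with perfect linear scaling from the single-n150 result; the names are illustrative.

```python
N150_FPS = 4100  # ResNet-50 on a single n150, batch 16

# system: (data-parallel degree, measured fps), from the table.
SCALED = {
    "n300 (DP=2)": (2, 8200),
    "QuietBox (DP=8)": (8, 32250),
    "Galaxy (DP=32)": (32, 95900),
    "Two Galaxies (DP=64)": (64, 145000),
}


def scaling_efficiency(fps: float, dp: int, single_fps: float) -> float:
    """Measured fps divided by ideal linear scaling (dp x single-device fps)."""
    return fps / (dp * single_fps)


efficiency = {
    name: scaling_efficiency(fps, dp, N150_FPS)
    for name, (dp, fps) in SCALED.items()
}
```

Under that assumption scaling is essentially linear up to QuietBox (about 98%), then drops to roughly 73% on Galaxy and 55% across two Galaxies.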

## NLPs

| Model                      | Batch | Hardware             | sen/sec | Target sen/sec | Release |
|----------------------------|-------|----------------------|---------|----------------|---------|
| <a href="#">BERT-Large</a> | 12    | <a href="#">e150</a> | 370     | 410            |         |
| <a href="#">BERT-Large</a> | 8     | <a href="#">n150</a> | 270     | 400            |         |
| <a href="#">T5 small</a>   |       | <a href="#">e150</a> | 140     |                |         |
| <a href="#">Bloom</a>      |       | <a href="#">e150</a> | 70      |                |         |






Thank You!

