

This event includes forward-looking statements about future products and other topics, which are based on our current expectations and subject to risks and uncertainties. Please refer to the press release for this event and our SEC filings at [intc.com](http://intc.com) for more information on the risk factors that could cause actual results to differ materially.



# Architecture Day

2021





# 1000x

by 2025



**1000x**  
by 2025

=

**(Moore's Law)<sup>5</sup>**

1x
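
One possible reading of the equation above, assuming "Moore's Law" means a doubling every two years and the window is the four years from 2021 to 2025 (both are assumptions, not stated on the slide):

$$2^{4/2} = 4\times \quad\Rightarrow\quad 4^{5} = 1024 \approx 1000\times$$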

1000x

Architecture

Software

Memory

Interconnect

Process & Packaging

1x



Scalar



Vector



Matrix



Spatial



# Hybrid Computing Architectures

Process

Packaging

Caches

Memory

Interconnect



# Hybrid Compute Cluster

in a Package



Performance  
Core

Efficient  
Core

Intel  
Thread  
Director

X<sup>e</sup> - core

Sapphire  
Rapids

X<sup>e</sup> HPC &  
Ponte  
Vecchio

Alder  
Lake

AMX

X<sup>e</sup> HPG

Mount Evans

Performance  
Core

AMX

Alder  
Lake

**Efficient  
Core**

Intel  
Thread  
Director

X<sup>e</sup> SS

X<sup>e</sup> HPG

X<sup>e</sup> - core

Sapphire  
Rapids

Mount Evans

X<sup>e</sup> HPC &  
Ponte  
Vecchio

# Efficient x86 Core

Stephen Robinson

# Microarchitecture Goals

Highly Scalable Architecture To Address the Throughput Efficiency Needs For the Next Decade of Compute



Intel's Most Efficient Performant CPU



Dense & Highly Scalable



Vector and AI Instruction Support



Wide Dynamic Range



## Intel's **New** Efficient x86 Core Microarchitecture

Designed for throughput, enabling scalable multi-threaded performance for modern multi-tasking

Optimized for power and density efficient throughput with:

**Deep Front-End**  
with on-demand length decode

**Wide Back-End**  
with many execution ports

**Optimized Design**  
for latest transistor technologies

# Instruction Control



# Data Execution

**Five-wide allocation  
with eight-wide retire**

**256 entry  
out of order window**

Discovers data parallelism

**Seventeen  
execution ports**

Executes data parallelism



# Memory Subsystem

Dual Load + Dual Store

Up to 4MB L2

shared among four cores  
with 64 bytes/cycle of bandwidth and 17 cycles of latency

Deep buffering

supporting 64 outstanding misses

Advanced Prefetchers

at all cache levels to detect a wide variety of streams

Intel® Resource Director Technology

enables software to control fairness among the cores  
and between different software threads



# Modern Instruction Set

## Security

Intel® Control-flow Enforcement Technology designed to improve defense in depth

Intel® VT-rp  
(Virtualization Technology redirect protection) Supported

Advanced speculative execution validation methodology

Support for **Advanced Vector Instructions** with AI extensions

**Wide Vector** Instruction Set Architecture

**Floating point multiply-accumulate (FMA) instructions for 2x throughput**

**Key instruction additions to enable integer AI throughput (VNNI)**
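
To make the integer AI throughput point concrete, here is a minimal sketch of what one 32-bit lane of a VNNI-style dot product computes. It is modelled on the non-saturating unsigned-by-signed byte form (VPDPBUSD); the function names are hypothetical, and this is a software model rather than how the core implements it.

```python
from typing import Sequence


def to_signed32(x: int) -> int:
    """Wrap an arbitrary Python integer to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def vnni_dot4(acc: int, a_u8: Sequence[int], b_s8: Sequence[int]) -> int:
    """One lane: acc + sum of four (unsigned 8-bit x signed 8-bit) products."""
    assert len(a_u8) == len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    total = acc + sum(a * b for a, b in zip(a_u8, b_s8))
    return to_signed32(total)  # no saturation in this variant, so wrap


print(vnni_dot4(10, [1, 2, 3, 4], [5, -6, 7, -8]))  # prints -8
```

A single instruction of this kind replaces a multiply, a widening step, and an add, which is where the throughput gain for int8 inference comes from.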



# Efficiency in Both Power and Performance per Transistor

**Intense focus on feature selection and design implementation costs**

to maximize area efficiency, which in turn enables core count scaling

**Low switching energy per instruction**

to maximize power constrained throughput, key for today's throughput-driven workloads

**Reduced operating voltage required for all frequencies**

saving power while extending the performance range

$$P = C \times F \times V^2$$
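
Because voltage enters the formula squared, a small voltage reduction has an outsized effect on power. The sketch below uses normalized, made-up values (all inputs set to 1.0 as a baseline) purely to show the shape of the relationship.

```python
def dynamic_power(c: float, f: float, v: float) -> float:
    """Dynamic power P = C * F * V^2 (capacitance, frequency, voltage)."""
    return c * f * v ** 2


baseline = dynamic_power(1.0, 1.0, 1.0)   # normalized reference point
lower_v = dynamic_power(1.0, 1.0, 0.9)    # same frequency, 10% lower voltage

print(f"{1 - lower_v / baseline:.0%}")    # prints 19%
```

Under these assumptions, a 10% voltage reduction at constant frequency saves about 19% of dynamic power.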

# Latency Performance



SPECrate2017\_int\_base estimates using an open source compiler, iso-binary.  
For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.



# Throughput Performance



SPECrate2017\_int\_base estimates using an open source compiler, iso-binary  
For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.





# Performance Core



# Architecture Day

2021

New Architectural Foundations

Alder  
Lake

X<sup>e</sup> HPG

AMX

X<sup>e</sup> SS

Intel  
Thread  
Director

X<sup>e</sup> - core

Sapphire  
Rapids

Mount Evans

X<sup>e</sup> HPC &  
Ponte  
Vecchio

# Performance x86 Core

Adi Yoaz



# Performance

## x86 Core

### Architecture Goals

A Step Function in CPU Architecture  
Performance For the Next Decade of Compute

All in a tailored scalable architecture to serve the  
full range of Laptops to Desktops to Data Centers

**Deliver a step function**  
in general purpose CPU performance

**Advance the Arch/uArch with new features**  
for evolving trends of workload patterns

Innovate with next disruption in  
**AI performance acceleration**



Intel's New

# Performance x86 Core Architecture

Designed for speed, pushing the limits of low latency and single threaded application performance via:

Wider

Deeper

Smarter

- Acceleration of workloads with large code footprint & large data sets
- **NEW** AI acceleration technology via coprocessor for matrix multiplication
- **NEW** smart PM controller for fine grain power budget management

# Front-End

Fetches instructions and decodes them into μops

## Large Code

- 128→256 4K iTLB, 16→32 2M/4M iTLB
- Enhanced code prefetch
- 5K→12K branch targets

## Wider

- 16B→32B length decode
- 4→6 decoders
- 6→8 μop/cyc from μop\$

## Smarter

Improved branch prediction accuracy  
Smarter code prefetch mechanism

## μop Queue

- 70 → 72 entries per thread
- 70 → 144 single thread

## μop\$

- 2.25K→4K μops:
  - increased hit-rate
  - increased Frontend BW



# Out of Order Engine

Track μop dependencies and dispatch ready μops to execution units

## Wider

5 → 6 wide allocation  
10 → 12 execution ports

## Deeper

512-entry Reorder-Buffer and larger Scheduler sizes

## Smarter

More instructions “executed” at rename / allocation stage



# Integer Execution Units

5th Integer execution port /  
ALU added

1-cycle LEA on all 5 ports

Used also for arithmetic calculations



# Vector Execution Units

## New Fast Adder (FADD):

Power efficient, low latency

## FMA units support FP16 data type

FP16 added to Intel® AVX512 including complex numbers support



# L1 Cache & Memory Subsystem

## Wider

- 2 → 3 load ports:
  - 3×256bit loads
  - 2×512bit loads

## Smarter

- Reduced effective Load Latency
- Faster Memory Disambiguation resolution

## Deeper

- Deeper Load Buffer and Store Buffer expose more memory parallelism

## Large Data

- DTLB 64 → 96
- L1D\$: 12 → 16 fill buffers
- L1D\$ enhanced prefetcher
- 2 → 4 page walkers



# L2 Cache & Memory Subsystem

## Bigger

L2\$: 1.25MB (client) or 2MB (data center)

## Faster

Max demand misses 32 → 48

## Smarter

- L2\$ pattern-based multi-path prefetcher
- Feedback-based prefetch throttling
- Full-line-write predictive bandwidth optimization – reduces DRAM reads



# General-Purpose Performance Vs. 11<sup>th</sup> Gen Intel® Core™



SPEC CPU 2017, SYSmark 25, Crossmark, PCMark 10, WebXPRT3, Geekbench 5.4.1

<sup>1</sup>Geomean of Performance core (ADL) vs. Cypress Cove (RKL) Core @ ISO 3.3GHz Frequency

For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.

# Intel® Advanced Matrix Extensions (Intel® AMX)

Tiled Matrix Multiplication Accelerator - Data Center



AMX architecture has two components:

## Tiles

- A new expandable 2D register file – 8 new registers, 1 KB each: T0-T7
- Register file supports basic data operators – load/store, clear, set to constant, etc.
- TILES declares the state and is OS-managed by XSAVE architecture

## TMUL

- Set of matrix multiplication instructions, the first operators on TILES
- A MAC computation grid calculates 'tiles' of data
- TMUL – performs Matrix Multiply-Add ( $C = C + A \times B$ ) using three Tile registers ( $T_2 = T_2 + T_1 \times T_0$ )
- TMUL requires TILE to be present
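
The tile and TMUL description above can be sketched in plain Python. This is a functional model only: the 16-row by 64-byte tile shape is an assumption about the maximum configuration, the function names are hypothetical, and real hardware uses a packed operand layout that is not modelled here.

```python
TILE_ROWS = 16        # assumed maximum rows per tile
TILE_ROW_BYTES = 64   # assumed maximum bytes per row
NUM_TILES = 8         # T0-T7, as listed on the slide


def tile_bytes() -> int:
    """Size of one tile register under the assumed shape."""
    return TILE_ROWS * TILE_ROW_BYTES


def tmul_accumulate(c, a, b):
    """Return C + A x B for small integer matrices (lists of rows)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a)
    assert len(c) == rows and all(len(row) == cols for row in c)
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]


print(tile_bytes(), NUM_TILES * tile_bytes())
print(tmul_accumulate([[1, 1], [1, 1]], [[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The accumulator tile is both an input and the output, which is why the operation needs three tile registers rather than four.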



Express more work per instruction and per µop –  
save power for fetch/decode/OOO

# Intel® Advanced Matrix Extensions (Intel® AMX)

## Architecture



- New state to be managed by OS
- Commands and status delivered synchronously via TILE/accelerator instructions
- Dataflow – accelerators communicate to host through memory



New

# Performance

## x86 Core

A Step Function in CPU Architecture  
Performance For the Next Decade of Compute

A significant IPC boost at high power efficiency

Wider

Deeper

Smarter

Better supports large data set and large code footprint applications

Enhanced power management improves frequency and power

Machine Learning Technology: Intel® AMX – Tile Multiplication

All in a tailored scalable architecture to serve the full range of Laptops to Desktops to Data Centers

# Scalar Architecture Roadmap

Coves



Monts



2019

Today

2021



Graph is for conceptual illustration purposes only.

# Intel Thread Director

Rajshree Chabukswar

Performance Hybrid

# Scheduling Goals

Software Transparent

Real-Time Adaptive

Scalable from Mobile to Desktop



Introducing

# Intel Thread Director

Intelligence built directly into the core

**Monitors the runtime instruction mix**

of each thread as well as the state of each core – with nanosecond precision

**Provides runtime feedback to the OS**

to make the optimal scheduling decision for any workload or workflow

**Dynamically adapts guidance**

based on the thermal design point, operating conditions, and power settings – without any user input

OS Scheduler



Introducing

# Intel Thread Director

Scheduling Examples

1. Priority tasks scheduled on P-cores
2. Background tasks scheduled on E-cores
3. New AI thread ready; AI thread prioritized on P-core
4. Spin loop wait moved from P to E-core
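
The four scheduling examples can be summarized as a toy placement table. This is an illustration only: the thread categories, the `Thread` class, and `place` are hypothetical names, and the real mechanism provides feedback to the OS scheduler, which makes the final decision.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    kind: str  # hypothetical labels: "priority", "background", "ai", "spin_wait"


# Preferred core type per category, taken from the four examples above.
PREFERRED_CORE = {
    "priority": "P-core",
    "background": "E-core",
    "ai": "P-core",
    "spin_wait": "E-core",
}


def place(thread: Thread) -> str:
    """Return the preferred core type for a thread in this toy model."""
    return PREFERRED_CORE[thread.kind]


for t in (Thread("game", "priority"), Thread("indexer", "background"),
          Thread("upscaler", "ai"), Thread("waiter", "spin_wait")):
    print(f"{t.name}: {place(t)}")
```

A static table like this ignores the runtime adaptation described earlier (thermal design point, operating conditions, power settings); it only captures the end state of each example.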



# Alder Lake

Arik Gihon



Introducing

# Alder Lake

Reinventing Multi Core Architecture

## Single, Scalable SoC Architecture

All Client Segments – 9W to 125W – built on Intel 7 process

## All-New Core Design

Performance Hybrid with Intel Thread Director

## Industry-Leading Memory & I/O

DDR5, PCIe Gen5, Thunderbolt™ 4, Wi-Fi 6E

# Scalable Client Architecture

## Desktop

LGA 1700  
Socket



## Mobile

BGA Type3  
50 x 25 x 1.3 mm



## Ultra Mobile

BGA Type4 HDI  
28.5 x 19 x 1.1 mm



Visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims) for details

# Alder Lake

## Building Blocks

- P-Core
- E-Cores
- Display
- PCIe
- TBT
- GNA 3.0
- IPU
- Media (32 EU)
- Media (96 EU)
- LLC
- Memory
- SOC

Segments: Desktop, Mobile, Ultra Mobile

# Alder Lake

## Core/Cache

Up To

**16** Cores

8 Performance  
8 Efficient

Up To

**24** Threads

2T per P-core  
1T per E-core

Up to

**30** MB

Non-inclusive  
LL Cache
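
The thread count follows directly from the core counts and threads per core on this slide. A minimal check, with a hypothetical helper name:

```python
def total_threads(p_cores: int, e_cores: int,
                  threads_per_p: int = 2, threads_per_e: int = 1) -> int:
    """Hardware threads for a hybrid part: 2T per P-core, 1T per E-core."""
    return p_cores * threads_per_p + e_cores * threads_per_e


print(total_threads(8, 8))  # prints 24
```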

# Alder Lake

## Memory

Leading the industry  
transition to DDR5

Support for all four major memory  
technologies

Dynamic voltage-frequency scaling

Enhanced overclocking support



# Alder Lake

## PCIe

Leading the industry transition to  
PCIe Gen5

Up to 2X bandwidth vs. Gen4  
Up to 64GB/s with x16 lanes
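
The bandwidth figures can be reproduced from the per-lane signaling rates. The sketch assumes 16 GT/s for Gen4, 32 GT/s for Gen5, and 128b/130b encoding, and reports one direction only; the helper name is hypothetical.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int,
                        encoding: float = 128 / 130) -> float:
    """Approximate one-direction bandwidth in GB/s after line encoding."""
    return gt_per_s * encoding * lanes / 8  # 8 bits per byte


gen4 = pcie_bandwidth_gb_s(16, 16)
gen5 = pcie_bandwidth_gb_s(32, 16)
print(f"Gen4 x16: {gen4:.1f} GB/s, Gen5 x16: {gen5:.1f} GB/s")
# prints Gen4 x16: 31.5 GB/s, Gen5 x16: 63.0 GB/s
```

The slide's 64GB/s matches the raw rate before encoding overhead (32 GT/s × 16 lanes ÷ 8), and the Gen5 result is exactly twice Gen4.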



Visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims) for details

# Alder Lake Interconnect

Compute Fabric

Up to  
**1000** GB/s

Dynamic Latency  
Optimization



I/O Fabric

Up to

**64** GB/s

Real-time, demand-  
based BW control

Memory Fabric

Up To

**204** GB/s

Dynamic Bus Width  
& Frequency

Visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims) for details



Beginning Fall 2021

# Alder Lake

Reinventing Multi Core Architecture

## Single, Scalable SoC Architecture

All Client Segments – 9W to 125W – built on Intel 7 process

## All-New Core Design

Performance Hybrid with Intel Thread Director

## Industry-Leading Memory & I/O

DDR5, PCIe Gen5, Thunderbolt™ 4, Wi-Fi 6E

Xe HPG architecture

# Leadership Integrated Graphics



For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.

# Unconstrained Discrete Graphics



For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.

# Vivid PC Graphics Market

**1.5B**  
PC Gamers

Over the last 4 years, the number of concurrent users has **doubled** on Steam.

**8.8B**  
Hours of  
Live Streams  
Watched

Twitch.tv viewership has **doubled** in one year

**13M+**  
Game  
Developers

Over **10,000 games** released on Steam in 2020

1. Source: <https://www.pcgamesn.com/pc-gaming-study>

2. Source: <https://blog.streamlabs.com/streamlabs-stream-hatchet-q1-2021-live-streaming-industry-report-eaba2143f492>

3. Source: Part 1 : Game Developer Population Forecast 2020, April 2020, SlashData

# intel® ARC™

Powered by

Alchemist SoC



# X<sup>e</sup> HPG Sneak Peek

Lisa Pearce

# Software First



# Render Quality Trade-off



# X<sup>e</sup> Super Sampling



# X<sup>e</sup>SS

## Hits the Sweet Spot



Graph is for conceptual illustration purposes only. Subject to revision with further testing.

# X<sup>e</sup>SS SDK

Available this month



# X<sup>e</sup> HPG Sneak Peek

David Blythe





Compute Efficiency



Scalability

Graphics Efficiency



High Performance  
Gaming Optimized



# Xe-core

Compute Building Block of Xe HPG-based GPUs

16  
Vector Engines

256 bit  
per engine

16  
Matrix Engines

1024 bit  
per engine



# Render Slice

4 X<sup>e</sup>-cores with XMX

Per render slice:

- 4 X<sup>e</sup>-cores
- 4 Ray Tracing Units
- 4 Samplers
- Geometry
- Rasterizer
- HIZ
- 2 Pixel Backends

## Fixed Function optimized for DX12 Ultimate Gaming

Geometry Pipeline

Rasterization Pipeline

Samplers

Pixel Backends

# Xe HPG

## Scaling the Graphics Engine





# Leadership IP Performance/Watt

Architecture

Logic Design

Circuit Design

Process Technology

Software



For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.



“In the world of graphics, there is an insatiable demand for better performance and more realism. TSMC is excited that Intel has chosen our N6 technology for their Alchemist family of discrete graphics solutions.”

“There are many ingredients to a successful graphics product including the semiconductor technology. With N6, TSMC provides an optimal balance of performance, density and power efficiency that are ideal for modern GPUs. We are pleased with the collaboration with Intel on the Alchemist family of discrete GPUs.”

**Dr. Kevin Zhang,**

Senior Vice President of Business Development at TSMC



# Multi-Year **Roadmap**

Performance ↑

## Alchemist

X<sup>e</sup> HPG



Q1  
2022

## Battlemage

X<sup>e</sup>2 HPG



## Celestial

X<sup>e</sup>3 HPG



## Druid

X<sup>e</sup> Next Architecture



intel®  
**ARC™**

# Architecture Day

2021

## Part 1 Recap



# Sapphire Rapids

Sailesh Kottapalli

Introducing

# Sapphire Rapids

Next-Gen Intel Xeon Scalable Processor

New Standard for  
Data Center Architecture

Designed for Microservices  
& AI Workloads

Pioneering Advanced Memory  
& IO Transitions



**Node Performance**



**Data Center Performance**

# Node Performance



## Scalar Performance

New Performance Core  
Microarchitecture

## Data Parallel Performance

Multiple Integrated  
Acceleration Engines

## Cache & Memory Subsystem Architecture

Larger Private &  
Shared Caches

DDR 5

Next Gen Optane  
Support

PCIe 5.0

## Intra/Inter Socket Scaling

Modular SoC w/  
Modular Die Fabric

Wider & Faster UPI

Embedded Silicon  
Bridge (EMIB)



## Data Center Performance

# Ice Lake

Single Monolithic Die



# Sapphire Rapids

Multi-Tile Design for Increased Scalability



Delivers a scalable, balanced architecture leveraging existing software paradigms  
for monolithic CPUs via a modular architecture

# Sapphire Rapids

Multiple Tiles, Single CPU

Every thread has full access to all resources on all tiles

Cache, Memory, IO...

Provides consistent low latency & high cross-section BW across the entire SoC



# Sapphire Rapids SoC



# Sapphire Rapids

## Key Building Blocks



# Performance Core

## Built for Data Center

Major microarchitecture and IPC improvement

Improved support for large code/data footprint

Consistent performance for multi-tenant usages

Autonomous/Fast PM for high freq @ low jitter



# Performance Core

Architecture  
Improvements for DC  
Workloads & Usages

|                  |                                                                                                                      |
|------------------|----------------------------------------------------------------------------------------------------------------------|
| AI               | <b>Intel® Advanced Matrix Extensions (AMX)</b><br>Tiled matrix operations for inference & training acceleration      |
| Attached Device  | <b>Accelerator Interfacing Architecture (AiA)</b><br>Efficient dispatch, signaling & synchronization from user level |
| FP16             | <b>Half-Precision</b><br>Support for higher throughput at lower precision                                            |
| Cache Management | <b>CLDEMOTE</b><br>Proactive placement of cache contents                                                             |

# Sapphire Rapids

## Acceleration Engines

**Increasing effectiveness of cores,**  
by enabling offload of common mode tasks via  
seamlessly integrated acceleration engines



Utilization Without Acceleration



Utilization With Acceleration



# Acceleration Engine

Optimizing streaming data movement and transformation operations

up to  
4 Instances per Socket

Low Latency Invocation

No Memory Pinning Overhead



Results have been estimated or simulated and are based on tests with Ice Lake with Intel QAT. For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.

# Acceleration Engine

Accelerating Cryptography and Data De/Compression



**98%**  
additional  
workload capacity  
after QAT offload

Results have been estimated or simulated. Sapphire Rapids estimation based on architecture models and baseline testing with Ice Lake and Intel QAT. For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.

# Sapphire Rapids SoC



# Sapphire Rapids

## I/O Advancements

### Introducing Compute eXpress Link (CXL) 1.1

Accelerator and memory expansion in datacenter

### Expanded device performance via PCIe 5.0 & connectivity

Improved DDIO & QoS capabilities

### Improved Multi-Socket scaling via Intel® Ultra Path Interconnect (UPI) 2.0

Up to 4 x24 UPI links operating @ 16 GT/s

New 8S-4UPI performance optimized topology



# Sapphire Rapids

## Memory and Last Level Cache

### Increased Shared Last Level Cache (LLC)

Up to >100 MB LLC shared across ALL cores

### Increased bandwidth, security & reliability via DDR 5 Memory

4 memory controllers supporting 8 channels

### Intel® Optane™ Persistent Memory 300 Series



# Sapphire Rapids

## High Bandwidth Memory

**Significantly Higher Memory Bandwidth**

vs. baseline Xeon-SP with 8 channels of DDR 5

**Increased capacity and Bandwidth**

some usages can eliminate need for DDR entirely

**2 Modes**

**HBM Flat Mode**  
Flat Mem Regions w/ HBM & DRAM

**HBM Caching Mode**  
DRAM backed cache



# Sapphire Rapids - Architected for AI

AI has become ubiquitous across usages – AI performance required in all tiers of computing

Goal

Enable efficient usage of AI across all services deployed on elastic general-purpose tier by delivering many times more AI performance and lower CPU utilization

|                                                                        |                                                                    |
|------------------------------------------------------------------------|--------------------------------------------------------------------|
| For Deep Learning Datatypes                                            | int8 with int32 accumulation<br>Bfloat16 with IEEE SP accumulation |
| Acceleration at the ISA Level                                          | Full Intel Arch. programmability<br>Low Latency                    |
| Available and integrated with industry-relevant frameworks & libraries |                                                                    |
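
To illustrate "Bfloat16 with IEEE SP accumulation", the sketch below rounds inputs to bfloat16 (the top 16 bits of an IEEE single) and keeps the running sum in single precision. All function names are hypothetical, NaN handling is omitted, and the single-precision step is emulated by rounding a Python double, so results approximate rather than reproduce hardware behavior.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Value of a 32-bit IEEE single-precision bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def to_bfloat16(x: float) -> float:
    """Round to bfloat16 (round-to-nearest-even), returned as a float."""
    bits = f32_bits(x)
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return bits_f32((bits + rounding) & 0xFFFF0000)


def to_f32(x: float) -> float:
    """Round a Python double to single precision."""
    return bits_f32(f32_bits(x))


def dot_bf16(a, b) -> float:
    """Dot product with bfloat16 inputs and a single-precision accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_bfloat16(x) * to_bfloat16(y))
    return acc


print(to_bfloat16(3.14159))                  # prints 3.140625
print(dot_bf16([1.0, 2.0], [3.0, 4.0]))      # prints 11.0
```

The narrow inputs halve memory traffic relative to single precision, while the wider accumulator limits the error that builds up over long sums.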



Results have been simulated. For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.

# Sapphire Rapids - Built for elastic computing models - microservices

>80% of new cloud-native and SaaS applications are expected to be built as microservices

## Goal

Enable higher throughput while meeting latency requirements and reducing infrastructure overhead for execution, monitoring and orchestration of thousands of microservices

|                                             |                                                                                                                       |
|---------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| Improved Performance and Quality of Service | Runtime Languages - lower latency for Runtime Languages<br>AiA ISA's - efficient worker threads, signaling and synch. |
| Reduced Infrastructure Overhead             | Kubernetes – enhanced for scaling, placement and policies<br>Advanced Telemetry - easier analysis & optimization      |
| Better Distributed Communication            | Improved latency of Remote procedure calls and service-mesh<br>QAT, DSA etc.- optimized networking and data movement  |

Results have been simulated. For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.

## Microservices Performance



## New Standard in Data Center Architecture

Multi Tile SoC for Scalability

Physically Tiled,  
Logically Monolithic

General Purpose  
& Dedicated  
Acceleration Engines

## Designed for Microservices and AI Workloads

Performance Core  
Architecture

Workload Specialized  
Acceleration

## Pioneering Advanced Memory & IO Transitions

DDR 5 &  
HBM

PCIe 5.0

Enhanced  
Virtualization  
Capabilities

# Sapphire Rapids

Biggest Leap in Data Center Capabilities  
in over a Decade



# Infrastructure Processing Unit

Guido Appenzeller

# Server Architecture in a classic Data Center

Software and Infrastructure are all controlled by One Entity



# Classic Server Architecture



# Cloud Server Architecture





# Major Advantages of IPUs

1



## Separation of Infrastructure & Tenant

Guest can fully control the CPU with their SW, while CSP maintains control of the infrastructure and Root of Trust

2



## Infrastructure Offload

Accelerators help process these tasks efficiently. Minimize latency and jitter and maximize revenue from CPU

3



## Diskless Server Architecture

Simplifies data center architecture while adding flexibility for the CSP

# Advantage 1 - Separation of Infrastructure and Tenant

Maximum Control and Isolation for the Tenant



# Advantage 2 - Infrastructure Offload

In some cases, the majority of CPU cycles are spent on overhead

**31%**  
to 83%

Microservice  
Overhead at  
Facebook



Source: Accelerometer: Understanding Acceleration Opportunities for Data Center Overheads at Hyperscale. Akshitha Sriraman, Abhishek Dhanotia. Facebook.

# Advantage 2 - Infrastructure Offload

Dedicated Accelerators Free up CPU Capacity
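
As a rough upper bound on what offload could free up, the sketch below assumes the entire overhead fraction moves off the CPU at no residual cost. That assumption is mine and is optimistic; the helper name is hypothetical, and the inputs are the 31% and 83% endpoints quoted earlier.

```python
def capacity_gain(overhead_fraction: float) -> float:
    """Useful-work multiplier if all overhead is offloaded (upper bound)."""
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return 1.0 / (1.0 - overhead_fraction)


for fraction in (0.31, 0.83):
    print(f"{fraction:.0%} overhead -> {capacity_gain(fraction):.2f}x")
# prints 31% overhead -> 1.45x
# prints 83% overhead -> 5.88x
```

In practice some overhead stays on the host, so real gains would sit below these figures.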



# Advantage 3 - Diskless Server Architecture

Scale with Virtual Storage via Network



# Broad Infrastructure Acceleration Portfolio

## Dedicated ASIC IPU

Performance and power optimized

Optimized secure networking and storage pipeline



## FPGA-based Acceleration

### IPU Platforms & Adapters

Faster time to market for evolving standards

Re-programmable Secure Datapath enables flexible/customizable workload offload (future proof)

Onboard Intel® Xeon® processor



### SmartNICs

Programmable accelerated infrastructure workloads with customizable packet processing

Intel Ethernet NIC with DPDK support



Note: Future Intel IPUs may integrate both ASIC and FPGA

Introducing

# Oak Springs Canyon

High perf networking and storage acceleration for  
Cloud Service Providers

OVS, NVMe over Fabric, and RoCE solutions

Programmable through Intel OFS, DPDK, and SPDK

Customizable solutions with FPGA



# Oak Springs Canyon

Built with Intel® Agilex FPGA and Xeon-D SoC

High speed Ethernet support - 2x100G

PCIe Gen 4 x16

Hardware crypto block enables security at line rate



Introducing

# Arrow Creek

Acceleration Development Platform (ADP) for High Performance 100G networking acceleration

Customizable packet processing  
including bridging and networking services

Programmable through Intel OFS and DPDK

Accelerated infrastructure workloads  
Juniper Contrail, OVS, SRv6, vFW

Secure Remote Update  
of FPGA and Firmware over PCIe

On-board root of trust



# Arrow Creek

Built with Intel® Agilex FPGA and Ethernet E810 Controller



# Mount Evans

Naru Sundar

Introducing

# Mount Evans

## Intel's 200G IPU



## Hyperscale Ready

Co-designed with a top cloud provider

Integrated learnings from multiple gen. of FPGA sNICs

High performance under real world load

Security and isolation from the ground up

## Technology Innovation

Best-in-Class Programmable Packet Processing Engine

NVMe storage interface scaled up from Intel Optane Tech

Next Generation Reliable Transport

Advanced crypto and compression accel.

## Software

SW/HW/Accel co-design

P4 Studio based on Barefoot

Leverage and extend DPDK and SPDK

# Mount Evans

## Architectural Breakdown

Support for up to 4 host Xeons with  
200Gb/s full duplex

High-performance RoCEv2

NVMe offload engine

Programmable packet pipeline with QoS  
and telemetry capabilities

Inline IPSec



# Mount Evans

## Compute Complex



Up to 16 Arm Neoverse® N1 Cores

Dedicated compute and cache with up to 3 memory channels

Lookaside crypto and compression

Dedicated management processor

Xe HPC architecture

## HPC FP64



## AI FP16/BF16



## Bandwidth GB/s



For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.



# Architecture

Hong Jiang



# X<sup>e</sup>-core

Compute Building Block of X<sup>e</sup> HPC-based GPUs





| Vector Engine<br>(ops/clk) |
|----------------------------|
| 256 FP32                   |
| 256 FP64                   |
| 512 FP16                   |



| Matrix Engine<br>(ops/clk) |
|----------------------------|
| 2048 TF32                  |
| 4096 FP16                  |
| 4096 BF16                  |
| 8192 INT8                  |





# Xe HPC Slice





16 X<sup>e</sup> – cores

8MB L1 Cache

16 Ray Tracing Units

- Ray Traversal
- Triangle Intersection
- Bounding Box Intersect.

1 Hardware Context



## Up to

4 Slices

64 X<sup>e</sup> - cores

64 Ray Tracing Units

4 Hardware Contexts

L2 Cache

4 HBM2e controllers

1 Media Engine

8 X<sup>e</sup> Links





## 2-Stack



For workloads and configurations visit [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims). Results may vary.

# Xe Link



# Xe Link for Scalability





## 8x System Compute Rates

### Vector

**8x** Up to **32,768**  
FP64 Ops/CLK

### Matrix

**8x** Up to **262,144**  
TF32 Ops/CLK

**8x** Up to **524,288**  
BF16 Ops/CLK

**8x** Up to **1,048,576**  
INT8 Ops/CLK
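
The "Up to" figures appear to be the per-X<sup>e</sup>-core rates from the engine tables multiplied by the X<sup>e</sup>-core count of one 2-stack GPU, with "8x" then scaling to eight GPUs. That reading is an inference from the numbers, and the constant and function names below are hypothetical.

```python
XE_CORES_PER_SLICE = 16
SLICES_PER_STACK = 4
STACKS_PER_GPU = 2

# Ops/clk per Xe-core, from the Vector Engine and Matrix Engine tables.
OPS_PER_CLK_PER_CORE = {
    "FP64 (vector)": 256,
    "TF32 (matrix)": 2048,
    "BF16 (matrix)": 4096,
    "INT8 (matrix)": 8192,
}


def xe_cores_per_gpu() -> int:
    """Xe-cores in one 2-stack GPU."""
    return XE_CORES_PER_SLICE * SLICES_PER_STACK * STACKS_PER_GPU


def gpu_ops_per_clk(fmt: str) -> int:
    """Peak ops/clk for one 2-stack GPU in the given format."""
    return OPS_PER_CLK_PER_CORE[fmt] * xe_cores_per_gpu()


for fmt in OPS_PER_CLK_PER_CORE:
    print(f"{fmt}: {gpu_ops_per_clk(fmt):,} ops/clk per GPU")
```

With 128 X<sup>e</sup>-cores per GPU, the results match the four figures listed on the slide.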

# Ponte Vecchio

Masooma Bhaiwala

# Ponte Vecchio



New **Verification Methodology**

New **Software**

New **Reliability Methodology**

New **Signal Integrity Techniques**

New **Interconnects**

New **Power Delivery Technology**

New **Packaging Technology**

New **I/O Architecture**

New **Memory Architecture**

New **IP Architecture**

New **SOC Architecture**

# Ponte Vecchio SoC

>100 Billion Transistors

47 Active Tiles

5 Process Nodes



# Ponte Vecchio

## Key Challenges

Scale of Integration

Foveros Implementation

Verification Tools & Methods

Signal Integrity, Reliability & Power Delivery



# Ponte Vecchio

## Compute Tiles

Per Tile  
**8**  
**X<sup>e</sup>-cores**

L1 Cache  
**4MB**  
Per Tile

Built on  
**TSMC**  
**N5**
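The compute-tile figures line up with the slice figures quoted earlier: both work out to the same L1 capacity per X<sup>e</sup>-core. The sketch below checks that and estimates how many compute tiles a given X<sup>e</sup>-core count would need. The tile counts are derived arithmetic, not numbers stated on this slide, and `compute_tiles_needed` is a hypothetical helper.

```python
# Figures quoted on the Compute Tiles and Xe HPC Slice slides.
XE_CORES_PER_TILE = 8
L1_MB_PER_TILE = 4
XE_CORES_PER_SLICE = 16
L1_MB_PER_SLICE = 8

# L1 capacity per Xe-core, computed from each slide independently.
l1_mb_per_core_from_tile = L1_MB_PER_TILE / XE_CORES_PER_TILE
l1_mb_per_core_from_slice = L1_MB_PER_SLICE / XE_CORES_PER_SLICE


def compute_tiles_needed(xe_cores: int) -> int:
    """Ceiling division: tiles required to supply `xe_cores` Xe-cores."""
    if xe_cores < 0:
        raise ValueError("xe_cores must be non-negative")
    return -(-xe_cores // XE_CORES_PER_TILE)


if __name__ == "__main__":
    print(l1_mb_per_core_from_tile, l1_mb_per_core_from_slice)
    print(compute_tiles_needed(64), compute_tiles_needed(128))
```

Both slides imply 0.5MB of L1 per X<sup>e</sup>-core, and a 64-core stack would correspond to 8 compute tiles under this reading.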



# Ponte Vecchio

## Base Tile

Built on  
**Intel 7**  
FOVEROS

Area  
**640mm<sup>2</sup>**

L2 Cache  
**144MB**

Host Interface  
**PCIe Gen5**

**HBM2e**

**MDFI**



# Ponte Vecchio

## X<sup>e</sup> Link Tile

Per Tile  
**8 X<sup>e</sup> Links**

8 ports  
**Embedded Switch**

Built on  
**TSMC N7**

Up to  
**90G**  
Serdes




# Ponte Vecchio

## Execution Progress



### A0 Silicon Current Status

**> 45 TFLOPS**

FP32 Throughput

**> 5 TBps**

Memory Fabric Bandwidth

**> 2 TBps**

Connectivity Bandwidth
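For a rough sense of scale, the FP32 throughput figure can be combined with the per-X<sup>e</sup>-core vector rate to back out a clock frequency. This is a back-of-the-envelope sketch under loud assumptions: that one "op" in the ops/clk table equals one FLOP, that all 128 X<sup>e</sup>-cores of a two-stack part are active, and that the quoted figure comes from the vector engines. None of these is confirmed by the slides, and the result is not an Intel-stated clock speed.

```python
FP32_OPS_PER_CLK_PER_CORE = 256  # vector engine FP32 rate from the Xe-core slide
XE_CORES = 128                   # assumption: 2 stacks x 64 Xe-cores
A0_FP32_TFLOPS = 45              # slide states "> 45 TFLOPS", so this is a floor


def implied_clock_ghz(tflops: float, ops_per_clk: int) -> float:
    """Clock (GHz) needed to reach `tflops` at `ops_per_clk` FLOPs per cycle."""
    return tflops * 1e12 / ops_per_clk / 1e9


ops_per_clk = FP32_OPS_PER_CLK_PER_CORE * XE_CORES  # 32,768
clock_ghz = implied_clock_ghz(A0_FP32_TFLOPS, ops_per_clk)

if __name__ == "__main__":
    print(f"Implied clock floor: {clock_ghz:.2f} GHz")
```

Because the slide gives a lower bound on throughput, the value computed here is best read as a lower bound on the implied clock under the stated assumptions.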


# Accelerated Compute Systems

Ponte Vecchio  
OAM



Ponte Vecchio  
x4 Subsystem  
with X<sup>e</sup> Links



Ponte Vecchio x4 Subsystem  
with X<sup>e</sup> Links

+ 2S Sapphire Rapids



# Overcoming Separate CPU and GPU Software Stacks



CPU



GPU



Freedom from proprietary programming models

Full performance from the hardware

Peace of mind for developers

## CPU & XPU-Optimized Stack



# oneAPI Industry Momentum

## Cross-Vendor

3rd-party implementations on

Nvidia GPU

Arm CPU

Huawei ASIC

AMD GPU

## Evolving Spec

Provisional spec v1.1 released May'21  
with deep industry leader involvement

+ Graph interfaces for Deep Learning workloads

+ Advanced raytracing libraries



# oneAPI Industry Momentum

## End Users



## National Labs



## ISVs & OSVs



## OEMs & SIs



## Universities & Research Institutes



## CSPs & Frameworks



# >200K Developers

Unique installs of Intel® oneAPI product since Dec'20 release

# >300 Applications

Deployed in market using Intel® oneAPI language & libraries

# >80 HPC & AI Applications

Functional on Intel's X<sup>e</sup> HPC architecture using Intel® oneAPI



oneAPI Toolkits v2021.3  
Available Now



# Aurora Blade

Building Block for the ExaScale Supercomputer



# Ponte Vecchio

The vision 2 years ago...



**Leadership Performance  
for HPC/AI**

**Connectivity to drive scale-up  
and scale-out**

**Unified Programming Model  
powered with oneAPI**



# Architecture Day

2021

New Architectural Foundations



See you at

# Intel® Innovation



# Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at [www.intel.com/PerformanceIndex](http://www.intel.com/PerformanceIndex). Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See [www.intel.com/ArchDay21claims](http://www.intel.com/ArchDay21claims) for configuration details. No product or component can be absolutely secure.

All product plans and roadmaps are subject to change without notice. Results that are based on pre-production systems and components as well as results that have been estimated or simulated using an Intel Reference Platform (an internal example new system), internal Intel analysis or architecture simulation or modeling are provided to you for informational purposes only. Results may vary based on future changes to any systems, components, specifications, or configurations. Intel technologies may require enabled hardware, software or service activation.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Code names are used by Intel to identify products, technologies, or services that are in development and not publicly available. These are not "commercial" names and not intended to function as trademarks.

Intel contributes to the development of benchmarks by participating in, sponsoring, and/or contributing technical support to various benchmarking groups, including the BenchmarkXPRT Development Community administered by Principled Technologies.

Statements in this presentation that refer to future plans and expectations are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "goals," "plans," "believes," "seeks," "estimates," "continues," "may," "will," "would," "should," "could," and variations of such words and similar expressions are intended to identify such forward-looking statements. Statements that refer to or are based on estimates, forecasts, projections, uncertain events or assumptions, including statements relating to future products and technology and the expected availability and benefits of such products and technology, market opportunity, and anticipated trends in our businesses or the markets relevant to them, also identify forward-looking statements. Such statements are based on management's current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in these forward-looking statements. Important factors that could cause actual results to differ materially from the company's expectations are set forth in Intel's reports filed or furnished with the Securities and Exchange Commission (SEC), including Intel's most recent reports on Form 10-K and Form 10-Q, available at Intel's investor relations website at [www.intc.com](http://www.intc.com) and the SEC's website at [www.sec.gov](http://www.sec.gov). Intel does not undertake, and expressly disclaims any duty, to update any statement made in this presentation, whether as a result of new information, new developments or otherwise, except to the extent that disclosure may be required by law.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.