

# Advanced Topics in Communication Networks

## Programming Network Data Planes



Laurent Vanbever  
[nsg.ee.ethz.ch](http://nsg.ee.ethz.ch)

ETH Zürich  
Oct 29 2019

Last week on  
**Advanced Topics in Communication Networks**

P4 hardware  
target

P4-based  
applications

How do we build a *fast*  
reprogrammable switch?

**“Programmable switches are 10-100x slower than fixed-function switches. They cost more and consume more power.”**

**Conventional wisdom in networking**

How can we allow network programmability in the field,  
at reasonable cost, and without **sacrificing speed**

supporting Tbps of  
backplane throughput

# Let's look at a concrete design: Reconfigurable Match Tables (RMT)

The screenshot shows a PDF document titled "Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN". The document is presented in a window with a toolbar at the top, including icons for zoom, search, and file operations. The title is centered above the author information. Below the title, the authors are listed as Pat Bosshart<sup>†</sup>, Glen Gibb<sup>‡</sup>, Hun-Seok Kim<sup>†</sup>, George Varghese<sup>§</sup>, Nick McKeown<sup>‡</sup>, Martin Izzard<sup>†</sup>, Fernando Mujica<sup>†</sup>, and Mark Horowitz<sup>‡</sup>. The affiliations are indicated by superscripts: <sup>†</sup>Texas Instruments, <sup>‡</sup>Stanford University, and <sup>§</sup>Microsoft Research. The email addresses for the authors are provided at the bottom. The abstract section follows, describing the RMT model and its benefits. The introduction section begins with a Churchill quote about change.

**Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN**

Pat Bosshart<sup>†</sup>, Glen Gibb<sup>‡</sup>, Hun-Seok Kim<sup>†</sup>, George Varghese<sup>§</sup>, Nick McKeown<sup>‡</sup>,  
Martin Izzard<sup>†</sup>, Fernando Mujica<sup>†</sup>, Mark Horowitz<sup>‡</sup>  
<sup>†</sup>Texas Instruments   <sup>‡</sup>Stanford University   <sup>§</sup>Microsoft Research  
pat.bosshart@gmail.com {grg, nickm, horowitz}@stanford.edu  
varghese@microsoft.com {hkim, izzard, fmujica}@ti.com

**ABSTRACT**

In Software Defined Networking (SDN) the control plane is physically separate from the forwarding plane. Control software programs the forwarding plane (e.g., switches and routers) using an open interface, such as OpenFlow. This paper aims to overcomes two limitations in current switching chips and the OpenFlow protocol: i) current hardware switches are quite rigid, allowing “Match-Action” processing on only a fixed set of fields, and ii) the OpenFlow specification only defines a limited repertoire of packet processing actions. We propose the RMT (reconfigurable match tables) model, a new RISC-inspired pipelined architecture for switching chips, and we identify the essential minimal set of action primitives to specify how headers are processed in hardware. RMT allows the forwarding plane to be changed in the field without modifying hardware. As in OpenFlow, the programmer can specify multiple match tables of arbitrary width and depth, subject only to an overall resource limit, with each table configurable for matching on arbitrary fields. However, RMT allows the programmer to modify *all* header fields much more comprehensively than in OpenFlow. Our paper describes the design of a 64 port by 10 Gb/s switch chip implementing the RMT model. Our concrete design demonstrates, contrary to concerns within the community, that flexible OpenFlow hardware switch implementations are feasible at almost no additional cost or power.

**1. INTRODUCTION**

*To improve is to change; to be perfect is to change often.* — Churchill

Good abstractions—such as virtual memory and time-sharing—are paramount in computer systems because they allow systems to deal with change and allow simplicity of programming at the next higher layer. Networking has progressed because of key abstractions: TCP provides the abstraction of connected queues between endpoints, and IP provides a simple datagram abstraction from an endpoint to the network edge. However, routing and forwarding *within* the network remain a confusing conglomerate of routing protocols (e.g., BGP, ICMP, MPLS) and forwarding behaviors (e.g., routers, bridges, firewalls), and the control and forwarding planes remain intertwined inside closed, vertically integrated boxes.

Software-defined networking (SDN) took a key step in abstracting network functions by separating the roles of the control and forwarding planes via an *open* interface between them (e.g., OpenFlow [27]). The control plane is lifted up and out of the switch, placing it in external software. This programmatic control of the forwarding plane allows network owners to add new functionality to their network, while replicating the behavior of existing protocols. OpenFlow has become quite well-known as an interface between the control plane and the forwarding plane based on the approach known as “Match-Action”. Roughly, a subset of packet bytes

[SIGCOMM'13]

The paper argues that flexibility does **not** come at the price of performance or cost

## Outline

- Conventional switch chips are inflexible
- SDN demands flexibility...sounds expensive...
- How do we do it: The RMT switch model
- Flexibility costs less than 15%

Enter...

## Reconfigurable Match Tables (RMT)

### Outline

- Conventional switch chip are inflexible
- SDN demands flexibility...sounds expensive...
- **How do we do it: The RMT switch model**
- Flexibility costs less than 15%

# What kind of switch architecture could support flexibility and yet run at Terabits per second?

|                                |                       |
|--------------------------------|-----------------------|
| Throughput aggregate           | 1 Tbps                |
| Packet size average            | 1000 bits             |
| # operations per packet (avg.) | 10                    |
| Requirements                   | 10 billion op./second |

Pipelined architectures organize processing through a sequence of processing units and local memory



For flexibility,  
each processing unit/memory can be made generic



Each CPU can process distinct packets, with up to 10 packets going through the pipeline simultaneously



The runtime behavior of the parser & the match stages is defined through the RMT abstract model

## The RMT Abstract Model

- Parse graph
- Table graph

How do we implement in hardware  
a programmable parser and a logical pipeline?

# How do we implement in hardware a programmable parser and a logical pipeline?

The screenshot shows a PDF document titled 'Design Principles for Packet Parsers' by Glen Gibb, George Varghese, Mark Horowitz, and Nick McKeown. The document includes abstracts, categories, keywords, and an introduction. A diagram of a TCP packet structure is also shown.

**Design Principles for Packet Parsers**

Glen Gibb<sup>†</sup>, George Varghese<sup>‡</sup>, Mark Horowitz<sup>†</sup>, Nick McKeown<sup>†</sup>  
†Stanford University    ‡Microsoft Research  
 [{grg, horowitz, nickm}@stanford.edu](mailto:{grg, horowitz, nickm}@stanford.edu)    [varghese@microsoft.com](mailto:varghese@microsoft.com)

**ABSTRACT**

All network devices must parse packet headers to decide how packets should be processed. A  $64 \times 10$  Gb/s Ethernet switch must parse one billion packets per second to extract fields used in forwarding decisions. Although a necessary part of all switch hardware, very little has been written on parser design and the trade-offs between different designs. Is it better to design one fast parser, or several slow parsers? What is the cost of making the parser reconfigurable in the field? What design decisions most impact power and area?

In this paper, we describe trade-offs in parser design, identify design principles for switch and router designers, and describe a parser generator that outputs synthesizable Verilog that is available for download. We show that i) packet parsers today occupy about 1-2% of the chip, and ii) while future packet parsers will need to be programmable, this only doubles the (already small) area needed.

**Categories and Subject Descriptors**

C.2.1 [Computer-Communication Networks]: Network Architecture and Design—*Network Communications*

**Keywords**

Parsing; Design principles; Reconfigurable parsers

**1. INTRODUCTION**

Despite their variety, *every* network device examines fields

**Figure 1: A TCP packet.**

The diagram illustrates the structure of a TCP packet. It starts with an 'Ethernet' layer (Len: 14B), followed by an 'IPv4' layer (Len: ?), then a 'TCP' layer (Len: ?), and finally a 'Payload' layer. The 'IPv4' and 'TCP' layers are highlighted in red, while the 'Ethernet' and 'Payload' layers are blue. Arrows indicate the flow from left to right. Below the diagram, dashed arrows point to the start of each header: 'Next: IPv4' under Ethernet, 'Len: 20B' under IPv4, 'Next: TCP' under TCP, and 'Len: 20B' under TCP.

[ANCS'13]

# Parsing is the (complex) process of identifying and extracting the appropriate fields in a packet header

Throughput

Parser must run at line-rate  
parse 1 packet every 70 ns on a 10 Gbps link

Dependency

Parsing involves sequential processing  
as headers typically point to the next one

Incompleteness

Some headers do not even identify  
the subsequent header

Heterogeneity

Many header formats exist that  
can appear in various orders/locations

A parser can be divided into two separate blocks:  
header identification and field extraction

implements the parse graph's  
state machine



extracts the chosen fields  
from identified headers

In a programmable parser, the two modules rely on  
runtime information instead of hard-coded logic

stored in memory,  
e.g. in RAM and/or TCAM



stores the bit sequences  
that identify the headers

stores the next state,  
the fields to extract,  
and any other data (if any)

How do we implement in hardware  
a programmable parser and a logical pipeline?

A compiler translates a given RMT logical pipeline (specified in P4) into a physical one



The compiler maps each individual logical stage to one or more physical stage.



# The RMT pipeline in a few statistics

## Our Switch Design

- 64 x 10Gb ports
  - 960M packets/second
  - 1GHz pipeline
- Programmable parser
- 32 Match/action stages
- Huge TCAM: 10x current chips
  - 64K TCAM words x 640b
- SRAM hash tables for exact matches
  - 128K words x 640b
- 224 action processors per stage
- All OpenFlow statistics counters

Building a RMT pipeline is **only 15% more expensive**  
than building a fixed-function switching pipeline

## Outline

- Conventional switch chip are inflexible
- SDN demands flexibility...sounds expensive...
- How do I do it: The RMT switch model
- **Flexibility costs less than 15%**

The biggest cost is the memory...  
*not* the processing logic

## Cost of Configurability: Comparison with Conventional Switch

- Many functions identical: I/O, data buffer, queueing...
- Make extra functions optional: statistics
- Memory dominates area
  - Compare memory area/bit and bit count
- RMT must use memory bits efficiently to compete on cost
- Techniques for flexibility
  - Match stage unit RAM configurability
  - Ingress/egress resource sharing
  - Table predication allows multiple tables per stage
  - Match memory overhead reduction
  - Match memory multi-word packing

That was just an academic paper  
Let's look at a real flexible pipeline



A small subset of our lab @ITET with two Tofino 3.2 Tbps, 32x 100 GbE QSFP28

That was just an academic paper  
Let's look at a real flexible pipeline



Programmable Data Plane at Terabit Speeds

Vladimir Gurevich  
May 16, 2017

# Barefoot Tofino 6.5 Tbps backplane

## several billion packets per second at line rate

### 6.5Tb/s Tofino™ Summary

- **State of the art design**
  - Single Shared Packet Buffer
  - TSMC 16nm FinFET+
- **Four Match+Action Pipelines**
  - Fully programmable PISA Embodiment
  - All compiled programs run at line-rate.
  - Up to 1.3 million IPv4 routes
- **Port Configurations**
  - 65 x 100GE/40GE
  - 130 x 50GE
  - 260 x 25GE/10GE
- **CPU Interfaces**
  - PCIe: Gen3 x4/x2/x1
  - Dedicated 100GE port



# Barefoot Tofino 6.5 Tbps backplane

## several billion packets per second at line rate



Source: Programmable Data Planes at Terabit Speeds, Vladimir Gurevich, 2017

# Tofino relies on Packet Header Vector (PHV) to pass states between stages

## Packet Header Vector (PHV)

- A set of uniform containers that carry the headers and metadata along the pipeline
- Fields can be packed into any container or their combination
- PHV Allocation step in the compiler decides the actual packing



Tofino uses a folded pipeline in which the *same* stages are used for both the ingress and the egress pipeline

## Unified Pipeline

- There is no difference between ingress and egress processing
  - The same blocks can be efficiently shared



What's next?

Tofino 2: 12.8 Tbps (7 nm switching ASIC)



<https://www.barefootnetworks.com/press-releases/barefoot-networks-unveils-tofino-2-the-next-generation-of-the-worlds-first-fully-p4-programmable-network-switch-asics/>

This week on

# Advanced Topics in Communication Networks

P4 hardware  
target

P4-based  
applications

What cool things  
can we do with it?

Data plane  
programmability

for

**Performance**  
**Monitoring**  
**Applications offloading**

**Platforms**  
**Correctness**  
**Management**

for  
Data plane  
programmability



Figure 1. This figure shows the result for detecting heavy-hitters between two major AP [12] with different monitoring intervals. Even under high sampling rates, recall quickly diminishes and worsens as the monitoring interval decreases.

solutions use approximate data structures, that support memory and processing overhead in exchange for some loss in accuracy, in order to deal with the limited resources available on switches.

While prior work has focused on heavy-hitter detection at a single switch, network operators often need to track the network-wide heavy hitters. For example, port scanners [15] and super-aggregators [27] typically track the top flows in the network only at one location. Detecting the heavy hitters separately at each switch and then combining the results is not sufficient. Large flows can easily fall ‘under the radar’ in a multiple location but still have high priority.

Approximate heavy-hitter detection at each switch reduces the chance of missing large flows, at the expense of higher communication overhead to report counts to a central coordinator.

Additionally, networks that forward high traffic volumes often resort to sampling 1/x packets, where x is operator-defined based on the needs of the specific network. However, sample can also reduce accuracy on small time-scales [24], even when traffic patterns are stable. In Figure 1, we show the impact sampling has on accuracy while performing heavy-hitter detection on a link between two switches. Sampling 1/10 of the total number of flows. Even with high sampling rates, recall is quite low for short time intervals and it quickly diminishes as sampling rates decrease. In modern datacenter networks where switches

## In-band Network Telemetry (INT)

June 2016

Changhoon Kim, Parag Bhide, Ed Doe: *Barefoot Networks*  
 Hugh Holbrook: *Arista*  
 Anoop Ghanwani: *Dell*  
 Dan Daly: *Intel*  
 Mukesh Hira, Bruce Davie: *VMware*

### Introduction

#### Terms

#### What To Monitor

##### Switch-level Information

##### Ingress Information

##### Egress Information

##### Buffer Information

#### Processing INT Headers

##### INT Header Types

##### Handling INT Packets

#### Header Format and Location

##### INT over any encapsulation

##### On-the-fly Header Creation

##### Header Format

#### Header Location and Format -- INT over Geneve



Figure 2. This figure shows the result for detecting heavy-hitters between two major AP [12] with different monitoring intervals. Even under high sampling rates, recall quickly diminishes and worsens as the monitoring interval decreases.

Additional, networks that forward high traffic volumes often resort to sampling 1/x packets, where x is operator-defined based on the needs of the specific network. However, sample can also reduce accuracy on small time-scales [24], even when traffic patterns are stable. In Figure 1, we show the impact sampling has on accuracy while performing heavy-hitter detection on a link between two switches. Sampling 1/10 of the total number of flows. Even with high sampling rates, recall is quite low for short time intervals and it quickly diminishes as sampling rates decrease. In modern datacenter networks where switches

have fundamental blind spots for other metrics that are not currently being tracked.

Indeed, UnivMon is a framework that often both generalizes by delaying the binding to specific applications of interest but at the same time provides the required fidelity for which these specific applications require. This entails significant investment in algorithm design and hardware support for new metrics of interest. While recent tools like OpenSketch [47] and Scream [31] have been developed to reduce the cost of these metrics after efficient resource allocation, they do not address the fundamental need to design and operate new custom sketches for each task. Furthermore, at any given point in time the data plane resources have to be reallocated to support a set of new metrics to monitor and will have fundamental blind spots for other metrics that are not currently being tracked.

In this paper, we present the *UnivMon* (short for Universal Monitoring) framework that can simultaneously achieve both generality and high fidelity across a broad spectrum of monitoring tasks [31, 26, 38, 31]. UnivMon builds on and



## Current monitoring methods are inadequate

- Not fast enough
  - Involve CPU and control planes
  - Network state changes rapidly
- Do not provide end-to-end state
  - Difficult to correlate per-element state with the actual path of a flow

Source: In-band Network Telemetry, Mukesh Hira and Naga Katta, 2015

## INT : In-band Network Telemetry

- Mechanism for collecting network state in the dataplane
  - As close to **realtime** as possible
  - At current and future **line rates**
  - With a framework that can **adapt** over time
- Examples of network state
  - Switch ID, Ingress Port ID, Egress Port ID
  - Egress Link Utilization
  - Hop Latency
  - Egress Queue Occupancy
  - Egress Queue Congestion Status
  - ....

Source: In-band Network Telemetry, Mukesh Hira and Naga Katta, 2015

# INT Header Format



Source: In-band Network Telemetry, Mukesh Hira and Naga Katta, 2015

## INT using P4

- P4 enables flexible packet parsing and modification for INT
- P4 allows INT to adapt to
  - Any Encapsulation format
  - Any State required to be collected
  - Any feature, protocol – current and future

Source: In-band Network Telemetry, Mukesh Hira and Naga Katta, 2015

# INT : P4 Code Snippet

Exact-match  
Table Definition

```
table int_inst {  
    reads {  
        int_header.instruction_mask : exact;  
    }  
    actions {  
        int_set_header_i0;  
        int_set_header_i1;  
        int_set_header_i2;  
        int_set_header_i3;  
        ....  
    }  
}
```

Action  
Definitions

```
action int_set_header_i0() {  
}  
action int_set_header_i1() {  
    int_set_header_3();  
}  
action int_set_header_i2() {  
    int_set_header_2();  
}  
action int_set_header_i3() {  
    int_set_header_3();  
    int_set_header_2();  
}  
....
```

Source: In-band Network Telemetry, Mukesh Hira and Naga Katta, 2015

## HULA: INT + Flowlet routing

1. Periodic INT probes
  - disseminate path utilization to switches
2. Flowlet detection and path selection
  - happens at **all** switches
  - hop-by-hop adaptive routing

## INT probes traverse multiple paths



Source: In-band Network Telemetry, Mukesh Hira and Naga Katta, 2015

## Probes carry path utilization



Source: In-band Network Telemetry, Mukesh Hira and Naga Katta, 2015

## Probes update switch state



Source: In-band Network Telemetry, Mukesh Hira and Naga Katta, 2015

## Summary

- INT provides real-time network state directly in the dataplane
  - Scales to arbitrarily large networks
  - Scales to current and future link speeds
  - Can adapt to any network, any encap, any application
- Knowledge of real-time network state opens up new possibilities
  - Enhanced monitoring and troubleshooting
  - Network-state aware routing
  - ...

Source: In-band Network Telemetry, Mukesh Hira and Naga Katta, 2015



## In-band Network Telemetry (INT)

June 2016

Changhoon Kim, Parag Bhide, Ed Doe: *Barefoot Networks*  
Hugh Holbrook: *Arista*  
Anoop Ghanwani: *Dell*  
Dan Daly: *Intel*  
Mukesh Hira, Bruce Davie: *VMware*





## MARPLE [SIGCOMM'17]

## SONATA [SIGCOMM'18]

Both papers enable operators to express **monitoring queries**

```
result = filter(pktstream, qid == Q and switch == S
                and t_out - t_in > 1ms)
returns a stream of packets experiencing high queuing latencies
```

A compiler then compiles these queries to: switch programs + control code

The two papers differ among others in the types of queries they support



LossRadar [CoNEXT'16]

FlowRadar [NSDI'16]

Develop techniques and tools to monitor *all flows* by

- relying on in-switch data structures (Bloom Filters) and
- decoding them at the controller-level

## DAPPER [SOSR'17]

## Network-Wide HH [SOSR'18]



## Develop P4-based detection mechanisms to

- diagnose TCP performance issue (e.g. small receiver buffers)
- heavy-hitter (e.g. port scanners, superspreaders, DDoS)

Introduce techniques to make sketch-based monitoring more practical (by making sketches adaptive or "universal")

SketchLearn [SIGCOMM'18]

Elastic Sketch [SIGCOMM'18]

UnivMon [SIGCOMM'16]



Data plane  
programmability

for

**Performance  
Monitoring**

**Applications offloading**

**Platforms  
Correctness  
Management**

for

Data plane  
programmability

[SOSR'15]



#### Categories and Subject Descriptors

C.2.4 [Distributed Systems]: Network operating systems; C.4 [Performance of Systems]: Reliability, availability, and serviceability; D.4.5 [Reliability]: Fault-tolerance

#### Keywords

Software-defined networking, Paxos, NetPaxos.

#### 1. INTRODUCTION

Software-defined networking (SDN) is transforming the way networks are configured and run. In contrast to traditional networks, in which forwarding devices have proprietary control interfaces, SDNs generalize network devices using a set of protocols defined by open standards, including most prominently the OpenFlow [24] protocol. This move towards standardization has led to increased "network programmability", allowing ordinary programs to manage the network through direct access to network devices.

This paper explores the possibility of implementing the widely deployed Paxos consensus protocol in network devices. We present two different approaches: (i) a detailed design description for implementing the full Paxos logic in SDN switches, which identifies a sufficient set of required OpenFlow extensions; and (ii) an alternative, optimistic protocol which can be implemented without changes to the underlying switch firmware, but relies on assumptions about how the network orders messages. Although neither of these protocols can be fully implemented without changes to the underlying switch firmware, we argue that such changes are feasible in existing hardware. Moreover, we present an evaluation that suggests that moving Paxos logic into the network would yield significant performance benefits for distributed applications.

Copyright is held by the owner/author(s). Publication rights licensed to ACM.

SOSR'15, June 17–18, 2015, Santa Clara, CA, USA

Copyright is held by the owner/author(s). Publication rights licensed to ACM.

ACM ISBN 978-1-4503-3451-8/15/06 ...\$15.00

<https://doi.org/10.1145/273993.273999>

[HotNets'17]



#### 1 INTRODUCTION

The advent of flexible networking hardware [6] and expressive data plane programming languages [5, 29] have produced networks that are deeply programmable. The functionality of networks can now be enriched without hardware changes while retaining the capability of processing packets at very high rates, even above Terabits per second. Emerging programmable network devices are paving the way for new services to better support data center applications [9, 18] and improve network monitoring [13, 16, 24–26].

\*Amedeo Sapiò is also with Politecnico di Torino.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright © 2017, the author(s). This notice applies to the use of this material or work on this paper. Copying or redistribution of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [permissions@acm.org](mailto:permissions@acm.org).

HotNets'17, November 30–December 1, 2017, Palo Alto, CA, USA

© 2017 Copyright held by the author(s). Publication rights licensed to Association for Computing Machinery.

ACM ISBN 978-1-4503-4653-5/17/08 ...\$15.00

<https://doi.org/10.1145/3152434.3152461>

[SIGCOMM'17]



Consensus at network speed

In-Network Aggregation  
(e.g., for MapReduce, graph analytics, ML)

Stateful layer-4 load balancers

+ NetCache [SOSP'17], NetChain [NSDI'18]

[SOSR'15]



[HotNets'17]



[SIGCOMM'17]



Consensus at network speed

In-Network Aggregation  
(e.g., for MapReduce, graph analytics, ML)

Stateful layer-4 load balancers

+ NetCache [SOSP'17], NetChain [NSDI'18]

# NetCache: Balancing Key-Value Stores with Fast In-Network Caching

Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé  
Jeongkeun Lee, Nate Foster, Changhoon Kim, Ion Stoica



Source: NetCache: Balancing Key-Value Stores with Fast In-Network Caching, Xin Jin, 2017

NetCache solves the problem of load-balancing in key-values stores observing *dynamic, skewed* workload

NetCache is a **rack-scale key-value store** that leverages **in-network data plane caching** to achieve

**billions QPS throughput**

&

**~10 µs latency**

even under

**highly-skewed**

&

**rapidly-changing**

workloads.

**New generation of systems enabled by programmable switches ☺**

NetCache solves the problem of load-balancing in key-values stores observing *dynamic, skewed* workload

**Key challenge: **highly-skewed** and rapidly-changing workloads**

**low throughput & high tail latency**



Source: NetCache: Balancing Key-Value Stores with Fast In-Network Caching, Xin Jin, 2017

It leverages that a small but very fast cache can provide perfect load-balancing... in theory

## Opportunity: fast, small cache can ensure load balancing

[B. Fan et al. **SoCC'11**, X. Li et al. **NSDI'16**]

Cache  $O(N \log N)$  hottest items

E.g., 10,000 hot objects



**N:** # of servers

E.g., 100 backends with 100 billions items



**Requirement:** cache throughput  $\geq$  backend aggregate throughput

# NetCache relies on the O(billion) throughput of programmable network devices to achieve it in practice

## NetCache: towards billions QPS key-value storage rack

Cache needs to provide the **aggregate** throughput of the storage layer



flash/disk  
each:  $O(100)$  KQPS  
**total:  $O(10)$  MQPS**

storage layer



in-memory  
 **$O(10)$  MQPS**

cache layer

in-memory  
each:  $O(10)$  MQPS  
**total:  $O(1)$  BQPS**



**in-network**  
 **$O(1)$  BQPS**

Small on-chip memory?  
Only cache  **$O(N \log N)$  small** items

Source: NetCache: Balancing Key-Value Stores with Fast In-Network Caching, Xin Jin, 2017

It relies on a tailored UDP-based protocol, an de/encoding scheme for storing variable length values, and sketches

## **Key-value caching in network ASIC at line rate ?!**

- How to identify application-level packet fields ?
- How to store and serve variable-length data ?
- How to efficiently keep the cache up-to-date ?

## **Key-value caching in network ASIC at line rate**

- □ How to identify application-level packet fields ?
- How to store and serve variable-length data ?
- How to efficiently keep the cache up-to-date ?

## NetCache Packet Format



- ❑ Application-layer protocol: compatible with existing L2-L4 layers
- ❑ Only the top of rack switch needs to parse NetCache fields

Source: NetCache: Balancing Key-Value Stores with Fast In-Network Caching, Xin Jin, 2017

## **Key-value caching in network ASIC at line rate**

- How to identify application-level packet fields ?
- □ How to store and serve variable-length data ?
- How to efficiently keep the cache up-to-date ?

# Key-value store using register array in network ASIC

|        |                         |                         |
|--------|-------------------------|-------------------------|
| Match  | pkt.key == A            | pkt.key == B            |
| Action | <b>process_array(0)</b> | <b>process_array(1)</b> |

pkt.value: A

B

```
action process_array(idx):
    if pkt.op == read:
        pkt.value ← array[idx]
    elif pkt.op == cache_update:
        array[idx] ← pkt.value
```



## Register Array

## Variable-length key-value store in network ASIC?



### Key Challenges:

- No loop or string due to strict timing requirements
- Need to minimize hardware resources consumption
  - Number of table entries
  - Size of action data from each entry
  - Size of intermediate metadata across tables

Source: NetCache: Balancing Key-Value Stores with Fast In-Network Caching, Xin Jin, 2017

## Combine outputs from multiple arrays

|               |            |                                  |                                                                                                                                                                       |
|---------------|------------|----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Lookup Table  | Match      | pkt.key == A                     | <b>Bitmap</b> indicates arrays that store the key's value<br><b>Index</b> indicates slots in the arrays to get the value<br><b>Minimal hardware resource overhead</b> |
|               | Action     | bitmap = <u>111</u><br>index = 0 |                                                                                                                                                                       |
| Value Table 0 | pkt.value: | A0   A1   A2                     |                                                                                  |
|               | Match      | bitmap[0] == 1                   |                                                                                                                                                                       |
| Value Table 1 | Action     | process_array_0 (index )         | → Register Array 0                                                                                                                                                    |
|               | Match      | bitmap[1] == 1                   |                                                                                  |
| Value Table 2 | Action     | process_array_1 (index )         |                                                                                                                                                                       |
|               | Match      | bitmap[2] == 1                   |                                                                                  |
|               | Action     | process_array_2 (index )         |                                                                                                                                                                       |

Source: NetCache: Balancing Key-Value Stores with Fast In-Network Caching, Xin Jin, 2017

## Combine outputs from multiple arrays

|               |        |                                  |                                           |
|---------------|--------|----------------------------------|-------------------------------------------|
| Lookup Table  | Match  | pkt.key == A                     | pkt.key == B                              |
|               | Action | bitmap = <u>111</u><br>index = 0 | bitmap = <u>110</u><br>index = 1          |
| pkt.value:    |        | A0   A1   A2                     | B0   B1                                   |
| Value Table 0 | Match  | bitmap[0] == 1                   | 0    1    2    3                          |
|               | Action | process_array_0 (index )         | →    A0   B0             Register Array 0 |
| Value Table 1 | Match  | bitmap[1] == 1                   | 0    1    2    3                          |
|               | Action | process_array_1 (index )         | →    A1   B1             Register Array 1 |
| Value Table 2 | Match  | bitmap[2] == 1                   | 0    1    2    3                          |
|               | Action | process_array_2 (index )         | →    A2                  Register Array 2 |

Source: NetCache: Balancing Key-Value Stores with Fast In-Network Caching, Xin Jin, 2017

## Combine outputs from multiple arrays

|                    |        |                                         |                                         |                                         |
|--------------------|--------|-----------------------------------------|-----------------------------------------|-----------------------------------------|
| Lookup Table       | Match  | pkt.key == A                            | pkt.key == B                            | pkt.key == C                            |
|                    | Action | bitmap = <b>111</b><br>index = <b>0</b> | bitmap = <b>110</b><br>index = <b>1</b> | bitmap = <b>010</b><br>index = <b>2</b> |
| pkt.value:         |        | A0   A1   A2                            | B0   B1                                 | C0                                      |
| Value Table 0      | Match  | bitmap[0] == 1                          | 0                                       | 1                                       |
|                    | Action | process_array_0 (index )                | A0                                      | B0                                      |
| Value Table 1      | Match  | bitmap[1] == 1                          | 2                                       | 3                                       |
|                    | Action | process_array_1 (index )                | A0                                      | B0                                      |
| Value Table 2      | Match  | bitmap[2] == 1                          |                                         |                                         |
|                    | Action | process_array_2 (index )                | A1                                      | B1                                      |
| → Register Array 0 |        |                                         |                                         |                                         |
| → Register Array 1 |        |                                         |                                         |                                         |
| → Register Array 2 |        |                                         |                                         |                                         |

Source: NetCache: Balancing Key-Value Stores with Fast In-Network Caching, Xin Jin, 2017

## Combine outputs from multiple arrays

| Lookup Table | Match      | pkt.key == A                     | pkt.key == B                     | pkt.key == C                     | pkt.key == D                     |
|--------------|------------|----------------------------------|----------------------------------|----------------------------------|----------------------------------|
|              | Action     | bitmap = <u>111</u><br>index = 0 | bitmap = <u>110</u><br>index = 1 | bitmap = <u>010</u><br>index = 2 | bitmap = <u>101</u><br>index = 2 |
|              | pkt.value: | A0   A1   A2                     | B0   B1                          | C0                               | D0   D1                          |

  

| Value Table 0 | Match  | bitmap[0] == 1           | 0  | 1  | 2  | 3                |
|---------------|--------|--------------------------|----|----|----|------------------|
|               | Action | process_array_0 (index ) | A0 | B0 | D0 |                  |
|               | →      |                          |    |    |    | Register Array 0 |

  

| Value Table 1 | Match  | bitmap[1] == 1           | 0  | 1  | 2  | 3                |
|---------------|--------|--------------------------|----|----|----|------------------|
|               | Action | process_array_1 (index ) | A1 | B1 | C0 |                  |
|               | →      |                          |    |    |    | Register Array 1 |

  

| Value Table 2 | Match  | bitmap[2] == 1           | 0  | 1 | 2  | 3                |
|---------------|--------|--------------------------|----|---|----|------------------|
|               | Action | process_array_2 (index ) | A2 |   | D1 |                  |
|               | →      |                          |    |   |    | Register Array 2 |

Source: NetCache: Balancing Key-Value Stores with Fast In-Network Caching, Xin Jin, 2017

## **Key-value caching in network ASIC at line rate**

- How to identify application-level packet fields ?
- How to store and serve variable-length data ?
- □ How to efficiently keep the cache up-to-date ?

## Cache insertion and eviction

- Challenge: cache the hottest  $O(N \log N)$  items with **limited insertion rate**
- Goal: react quickly and effectively to workload changes with **minimal updates**



Source: NetCache: Balancing Key-Value Stores with Fast In-Network Caching, Xin Jin, 2017

## Query statistics in the data plane



- Cached key: per-key counter array
- Uncached key
  - Count-Min sketch: report new hot keys
  - Bloom filter: remove duplicated hot key reports

Data plane  
programmability

for

**Performance  
Monitoring**  
**Applications offloading**

**Platforms**  
**Correctness**  
**Management**

for  
Data plane  
programmability

"Data-plane" programmability goes beyond  
switch programmability (or P4 for that matter)

# Offloading... ... to FPGA-based SmartNICs

host networking

congestion control



[NSDI'18]

[HotNets'17]

NetFPGA SUME board

# Host-based programmability + SmartNICs + programmable switches = fully programmable platforms

Big question is

How to combine them best?



Data plane  
programmability

for

Performance  
Monitoring  
Applications offloading

Platforms  
**Correctness**  
Management

for  
Data plane  
programmability

# So you've a programmable networks...

# How do you make sure that it works as it should?!



[SIGCOMM'18]

[CoNEXT'18]

[SIGCOMM'18]

# So you've a programmable networks...

# How do you make sure that it works as it should?!



[SIGCOMM'18]

[SIGCOMM'18]

[CoNEXT'18]

## P4 by example

- P4 is a low-level language → many gotchas
- Let's explore by example!
  - IPv6 router w/ access control list (ACL)

```
control ingress { apply(acl); }

table acl {
    reads { ipv6.dstAddr: lpm; }
    actions { allow; deny; }
}

action allow() {
    modify_field(std_meta.egress_spec, 1);
}

action deny() { drop(); }
```

**What could *possibly* go wrong?**

Source: p4v, Practical Verification for Programmable Data Planes, Liu et al., 2018

## What if we didn't receive an IPv6 packet?

ipv6 header will be **invalid**

## What goes wrong

Table reads arbitrary values

→ Intended ACL policy violated

Can read values from a previous packet

→ Side channel vulnerability!

Real programs are complicated:  
hard to keep validity in your head

```
control ingress { apply(acl); }

table acl {
    reads { ipv6.dstAddr: lpm; }
    actions { allow; deny; }
}

action allow() {
    modify_field(std_meta.egress_spec, 1);
}

action deny() { drop(); }
```

## Property #1: header validity

Source: p4v, Practical Verification for Programmable Data Planes, Liu et al., 2018

## What if acl table misses (no rule matches)?

Forwarding decision is unspecified

## What goes wrong

Forwarding behaviour depends on hardware

- May not do what you expect!
- Code not portable

```
control ingress { apply(acl); }

table acl {
    reads { ipv6.dstAddr: lpm; }
    actions { allow; deny; }
}

action allow() {
    modify_field(std_meta.egress_spec, 1);
}

action deny() { drop(); }
```

## Property #2: unambiguous forwarding

Source: p4v, Practical Verification for Programmable Data Planes, Liu et al., 2018

# Types of properties

## General safety

- **Header validity**
- Arithmetic-overflow checking
- Index bounds checking (header stacks, registers, meters, ...)

## Architectural

- **Unambiguous forwarding**
- **Reparseability**
- **Mutual exclusion of headers**
- Correct metadata usage (e.g., read-only metadata)

## Program-specific

- Custom assertions in P4 program — e.g., IPv4 ttl correctly decremented

Source: p4v, Practical Verification for Programmable Data Planes, Liu et al., 2018

## Challenge #1: imprecise semantics



- P4 language spec doesn't give precise semantics
- Defined semantics by translation to GCL (a simple imperative language)
- Tested semantics
  - Symbolically executed GCL to generate input-output tests for several programs
  - Ran w/ Barefoot P4 compiler & Tofino simulator

Source: p4v, Practical Verification for Programmable Data Planes, Liu et al., 2018

## Challenge #2: modelling the control plane

- A P4 program is just half the program
  - Table rules are not statically known
  - Populated by the control plane at run time
- Control planes are carefully programmed
  - Tables rarely take arbitrary actions
- To rule out false positives, need to model behaviour of control plane



```
table acl {  
    reads {  
        ipv6.dstAddr: lpm;  
    }  
    actions { allow; deny; }  
}
```

```
( @[ Action ] acl <hit> (allow);  
  std_meta.egress_spec := 1)  
[] ( @[ Action ] acl <hit> (deny);  
  std_meta.egress_spec := 511)  
[] @[ Action ] acl <miss>
```

Tables translated into *unconstrained nondeterministic choice*

Source: p4v, Practical Verification for Programmable Data Planes, Liu et al., 2018

# p4v overview

- **Automated** tool for verifying P4 programs
- Considers **all paths**
  - But also practical for **large programs**
- Includes basic safety properties for any program
- **Extensible** framework
  - Verify custom, program-specific properties
  - Assert-style debugging



Source: p4v, Practical Verification for Programmable Data Planes, Liu et al., 2018

Data plane  
programmability

for

**Performance  
Monitoring**  
**Applications offloading**

**Platforms**  
**Correctness**  
**Management**

for  
Data plane  
programmability

So you've a *verified* programmable networks...

How do you manage it?!

How do you perform planned maintenance?  
now that you've state in your switches...

How do you run multiple applications in your switches?  
monitoring, forwarding, load-balancing, etc.

How do you share resources amongst applications?  
especially memory and # packet operations

# We need an **Operating System** for the data plane

Definition  
Wikipedia

An **operating system** is a system software that manages computer hardware and software resources and provides common services for computer programs.

Do we have that? **Nope.** Not yet at least.

# We're working on it...

[SOSR'17]

The screenshot shows a PDF document titled "vanbever\_swing\_state\_sosr\_2017.pdf (page 1 of 7)". The title page contains the following information:

**Swing State: Consistent Updates for Stateful and Programmable Data Planes**

Shouxi Luo\* Hongfang Yu  
University of Electronic Science and Technology of China

Laurent Vanbever  
ETH Zürich

**ABSTRACT**  
With the rise of stateful programmable data planes, a lot of the network functions that used to be implemented in the controller or at the end-hosts are now moving to the data plane to benefit from line-rate processing. Unfortunately, stateful data planes also mean more complex network updates as not only flows, but also the associated states, must now be migrated consistently to guarantee correct network behaviors. The main challenge is that data-plane states are maintained at line rate, according to possibly runtime criteria, rendering controller-driven migration impossible.

We present **Swing State**, a general state-management framework and runtime system supporting consistent state migration in stateful data planes. The key insight behind **Swing State** is to perform state migration entirely within the data plane by piggybacking state updates on live traffic. To minimize the overhead, **Swing State** only migrates the states that cannot be safely reconstructed at the destination switch.

We implemented a prototype of **Swing State** for P4. Given a P4 program, **Swing State** performs static analysis to compute which states require consistent migration and automatically augments the program to enable the transfer of these states at runtime. Our preliminary results indicate that **Swing State** is practical in migrating data-plane states at line rate with small overhead.

**Keywords**  
Network updates; Software-Defined Networking; P4; Stateful programmable data planes.

**1. INTRODUCTION**  
By enabling stateful applications to run directly *in* the data plane, at line rate, programmable data planes [9, 22, 8, 28, 27, 16, 23] have recently emerged as a promising research area.

Yet, despite making SDNs more powerful, maintaining states in the data plane also calls for new consistent update mechanisms as it prevents traditional update techniques from working, and this, for three main reasons. First, the fact that data-plane states can be updated at line rate—at speeds that can reach Tbps [5]—prevents any software-based controller from consistently moving states from one device to another. Inconsistent migration is a problem for any data-plane application that requires strong-consistency network-wide. Examples of such applications include stateful firewalls tracking dynamic flow characteristics (e.g., low-level TCP states [29]) or anomaly detection applications [21]. Second, even ignoring states dynamism, the exact set of states to be migrated may actually be unknown to the controller, preventing it from performing the migration in the first place. Indeed, the states location in memory can differ from device to device according to runtime factors (e.g. a hash computed on packet headers) that are invisible to the controller. Third, data-plane states

**CCS Concepts**

Swing State is a state management framework with 1 primitive: **moveStates**



Source: Swing State: Consistent Updates for Stateful and Programmable Data Planes  
Luo et al., SOSR 2017

# Advanced Topics in Communication Networks



~7 weeks  
how to program in P4

>= 7 weeks  
in teams of 3

# Advanced Topics in Communication Networks



~7 weeks  
how to program in P4

>= 7 weeks  
in teams of 3

The group project starts next week

It accounts for 50% of your final grade

The evaluation of your project will depend on  
your implementation, report, and presentation

# The evaluation of your project will depend on your implementation, report, and presentation

implementation

70%

achieves the basic goals  
is properly documented  
runs + results can be reproduced

The evaluation of your project will depend on  
your implementation, report, and presentation

implementation

70%

achieves the basic goals

is properly documented

runs + results can be reproduced

You'll have to write  
a detailed README (in Markdown)  
We'll provide you with a template

# The evaluation of your project will depend on your implementation, report, and presentation

implementation

70%

achieves the basic goals

is properly documented

runs + results can be reproduced

report

15%, 10 pages max

describes the main building blocks

evaluates the solution

describes what each group member did

# The evaluation of your project will depend on your implementation, report, and presentation

implementation

70%

achieves the basic goals

is properly documented

runs + results can be reproduced

report

15%, 10 pages max

describes the main building blocks

evaluates the solution

describes what each group member did

presentation

15%, 10 min. +questions

summarizes the problem and the solution

contains a *live demo*

involves all group members

The final deadline for the project is

**Wed Dec 16 at 23.59pm**

This week

Select a proposal from the list ([adv-net.ethz.ch](http://adv-net.ethz.ch))  
or send us your own proposal by email

*Every* week

Meet with the responsible assistant  
schedule a recurring slot in [10.15am; noon]

**Mon Dec 16  
11.59pm**

Send us an archive with report, code, slides

**Tue Dec 17  
1.15pm—**

Groups presentation + course/exam debrief  
**attendance is mandatory**

The project has to be done in groups of 3 students  
"Matching" process for incomplete groups via Slack

Project grade is shared by each group member  
provided that each collaborated (roughly equally)

- Let us know in advance if that's *not* the case
- Briefly describe in the report the contribution of each group member
- Each group member should be involved in the presentation and be able to answer questions

If you want to propose your own project,  
send us an email describing it by **Thu Oct 31 11.59am**

Ivanbever@ethz.ch, cedgar@ethz.ch

# Quick overview of the proposals



Albert



Thomas



Roland



Alexander



Maria



Edgar

# Quick overview of the proposals



**Albert**



Thomas



Roland



Alexander



Maria



Edgar

# Proposal #1

## SDNSec: Forwarding Accountability for SDN Data Plane



Current data plane lacks accountability:

- ✗ Enforcing forwarding policies
- ✗ Validating that policies have not been violated
- ✗ Consistency guarantees under reconfiguration

# Proposal #1

## SDNSec: Forwarding Accountability for SDN Data Plane



With SDNSec:

- ✓ Ingress-switch adds path in header
- ✓ Core-switches extract header, decrypt and forward
- ✓ Controller verifies policy

## Proposal #2

# Herd<sup>ing</sup> the Elephants: Detecting Network-Wide Heavy Hitters with Limited Resources



Separating elephant from mice is key in network management:

- ✗ Sampling is not accurate and results are delayed
- ✗ App-specific sketches limit network visibility

## Proposal #2

# Hherding the Elephants: Detecting Network-Wide Heavy Hitters with Limited Resources



- ✓ Herd provides accuracy network wide
- ✓ Switches allocate resources based on flow type
- ✓ Switches notify controller when local heavy hitter
- ✓ Controller finds global heavy hitters

Extension: Network-Wide Heavy Hitter Detection with Commodity Switches

## Proposal #3

### Retroactive Packet Sampling for Traffic Receipts



- ✗ Network nodes could cheat in monitoring
- ✗ Performing better for selected samples
- ✓ Delayed disclosure mechanism prevents it
- ✓ Estimates loss-rate and delay from controller

Extension: SQR: In-Network  
Packet Loss Recovery

# Quick overview of the proposals



Albert



Thomas



Roland



Alexander



Maria



Edgar

# Blink: Fast Connectivity Recovery Entirely in the Data Plane

NSDI'19



# Blink: Fast Connectivity Recovery Entirely in the Data Plane

NSDI'19



**Goal: improving blink**

1. Selecting flows with low RTTs
2. Monitoring backup next-hops continuously to reroute faster
3. Monitoring the throughput to improve accuracy

# NetCache: Balancing Key-Value Stores with Fast In-Network Caching

SOSP'17 (for 2 students only)

Traditional way to implement a key value store:



# NetCache: Balancing Key-Value Stores with Fast In-Network Caching

SOSP'17 (for 2 students only)

Traditional way to implement a key value store:



NetCache:



# NetChain: Scale-Free sub-RTT Coordination

NSDI'18      **(for 2 students only)**

Traditionally, key-value stores are replicated for *fault-tolerance*



# NetChain: Scale-Free sub-RTT Coordination

NSDI'18      (for 2 students only)

Traditionally, key-value stores are replicated for *fault-tolerance*



# Quick overview of the proposals



Albert



Thomas



Roland



Alexander



Maria



Edgar

# NetHide: Secure and Practical Network Topology Obfuscation

*If I receive a packet to X with TTL = i,  
I should send it to Y with TTL = j*



# pForest: In-Network Inference with Random Forests



# iTAP: In-Network Traffic Analysis Prevention Using Software-Defined Networks



# Quick overview of the proposals



Albert



Thomas



Roland



Alexander



Maria



Edgar

## Proposal #4

# Fast String Searching on PISA

P4 is very limited, e.g. it cannot work with strings.  
**Or can it?** It can even handle regular expressions!



| Match |       | Action       |
|-------|-------|--------------|
| state | chars |              |
| 0     | do    | set_state(1) |
| 3     | og    | accept(4)    |
| 1     | g*    | accept(2)    |
| 0     | *d    | set_state(3) |

```
$ grep P4 \  
lecture.txt
```

*In the control-plane:*  
Translate regex  
to automaton.

*In the data-plane:*  
Execute automaton  
using recirculation.

*Evaluation:*  
Compare to grep  
and co.

## Proposal #5

# SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs

*SilkRoad* using a P4 switch to replace software load balancers.  
It can handle **millions of stateful connections using multi-level caching**.



*In the control-plane:*  
Accept incoming  
connections.

*In the data-plane:*  
Keep track of  
existing connections.

*Evaluation:*  
Test performance at  
large scale.

# Proposal #6

## A Distributed Algorithm to Calculate Max-Min Fair Rates Without Per-Flow State

*s-Perc* is a congestion control algorithm that proactively assigns per-flow sending rates without per-flow state on devices.

---

**Algorithm 4 s-PERC: link  $l$  processing control packets for flow  $f$ .**  
Differences from n-PERC are highlighted.

```
1:  $b, x, s$ : vector of bottleneck allocated rates, bottleneck states in packet (initially,  $\infty$ , 0, E, respectively)
2:  $i$ : vector of ignore bits in packet (initially, 1)
3:  $SumE, NumB$ : sum of limit rates of  $E$  flows, and number of  $B$  flows at link
4:  $MaxE, MaxE'$ : max. allocated rate of flows classified into  $E$  since last round (and in this round, respectively) at link

5: if  $s[l] = E$  then
6:    $s[l] \leftarrow B$ 
7:    $SumE \leftarrow SumE - x$ 
8:    $NumB \leftarrow NumB + 1$ 
9:    $b \leftarrow (c - SumE)/NumB$ 
10:  foreach link  $j$ :
11:    if  $i[j] = 0$  then  $p[j] \leftarrow b[j]$  else  $p[j] \leftarrow \infty$ 
12:     $p[l] \leftarrow \infty$ 
13:     $e \leftarrow \min p$ 
14:     $x \leftarrow \min(b, e)$ 
15:     $b[l] \leftarrow b, x[l] \leftarrow x$ 
16:    if  $b < MaxE$  then  $i[l] \leftarrow 1$  else  $i[l] \leftarrow 0$ 
17:    if flow is leaving then  $NumB \leftarrow NumB - 1$ 
18:    else if  $e < b$  then
19:       $s[l] \leftarrow E$ 
20:       $SumE \leftarrow SumE + x$ 
21:       $NumB \leftarrow NumB - 1$ 
22:       $MaxE \leftarrow \max(x, MaxE); MaxE' \leftarrow \max(x, MaxE').$ 
```

---



*In the control-plane:*  
Implement the  
*s-Perc* algorithm.

*In the data-plane:*  
Create and parse  
control messages.

*Evaluation:*  
Compare with TCP  
and other protocols.

## Proposal #7

# Millions of Little Minions: Using Packets for Low Latency Network Programming and Visibility

In active networks, **packets carry programs**, which are run by switches.

| <b>Instruction</b> |
|--------------------|
| LOAD, PUSH         |
| STORE, POP         |
| CSTORE             |
| CEXEC              |



*In the control-plane:*  
Compile and start  
packet programs.

*In the data-plane:*  
Parse packets and  
execute instructions.

*Evaluation:*  
Test the performance  
of packet programs.

# Quick overview of the proposals



Albert



Thomas



Roland



Alexander



Maria



Edgar

# DIBS: Just-in-time congestion mitigation for Data Centers



## currently

- DC patterns can cause **congestion**.
- Switches drop packets they cannot buffer.

## with DIBS

- \* **detours** to neighboring switches.
- \* minimizes drops, which speeds up **job completion time**.

# Marple: Language-Directed Hardware Design for Network Performance Monitoring



- \* The operator writes a query in a domain-specific language called Marple.
- \* The query is compiled into a switch program that runs on the network's programmable switches, augmented with new switch hardware primitives that we design in service of Marple.
- \* The switches stream results out to collection servers, where the operator can retrieve query results.

# 007: Democratically Finding the Cause of Packet Drops



**Need to detect short-lived & concurrent failures despite noise**

- \* 007 scales by uses traceroute to find paths of flows that had packet drops
- \* 007 finds faulty links democratically by letting hosts vote

**Implementation with p4 switches.**

- \* detect retransmissions in switches
- \* issue traceroutes directly from data plane
- \* combine traceroutes in control plane

# Quick overview of the proposals



Albert



Thomas



Roland



Alexander



Maria



**Edgar**

# Hardware-Accelerated Network Control Planes

[HotNets 2018] ETH, (Molero et. al.)

Modern programmable devices can perform small computations on **billions** of small packets per second.

This paper shows how to leverage that to run **control plane** algorithms directly in the **data plane**



# Seek and Push: Detecting Large Traffic Aggregates in the Dataplane

[arXiv 2018] CESNET, Cambridge (Kučera et. al.)

They present a data structure called *Elastic Tire* that is able to detect: **heavy hitters**, **traffic shifts** and **superspreaders**.



# Seek and Push: Detecting Large Traffic Aggregates in the Dataplane

[arXiv 2018] CESNET, Cambridge (Kučera et. al.)

They present a data structure called *Elastic Tire* that is able to detect: **heavy hitters**, **traffic shifts** and **superspreaders**.



High-level architecture:

1. Matching the flow using a dynamic LPM tree
2. Update Statistics
3. Control logic to update or report



# Generic External Memory for Switch Data Planes

[HotNets 2018] CMU, Microsoft, Barefoot Networks (Kim et. al.)

Programmable switches are **flexible** but only have a  
**limited** on-ship SRAM and TCAMS



= Lots of innovative applications!

# Generic External Memory for Switch Data Planes

[HotNets 2018] CMU, Microsoft, Barefoot Networks (Kim et. al.)

Programmable switches are **flexible** but only have a **limited** on-ship SRAM and TCAMS



Leverage **RDMA** to access remote memories at minimal latency and CPU usage



# Generic External Memory for Switch Data Planes

[HotNets 2018] CMU, Microsoft, Barefoot Networks (Kim et. al.)

Packet buffer extension



Extending Lookup Tables



Extending State for network monitoring



# Advanced Topics in Communication Networks

## Programming Network Data Planes



Laurent Vanbever  
[nsg.ee.ethz.ch](http://nsg.ee.ethz.ch)

ETH Zürich  
Oct 29 2019