



MEMSYS18

# *Quantifying the Performance Overheads of PMDK*

William Wang, Stephan Diestelhorst

2 October 2018

# NVDIMMs Shipping in 2H'2018



- 3DXP NVDIMM shipping 2H'2018
  - DDR4 -> \$10/GB, NAND -> \$1/4 per GB
  - 3DXP -> \$2/GB, up to 512GB
- Arm NVDIMM-ready timeframe around 2022

# Multiple System Use-Cases for NVM

## Faster Storage

1000x faster than NAND



- Filesystem bottlenecks

## Denser Mem

10x denser than DRAM



- Transformative Capacity/TCO-advantage
- Endurance
- Bandwidth
- Caching

## Persistent Mem

Non-Volatile



- Persistence
- Failure atomicity
- Ordering
- ISA

# Spectrum of Application Performance Boost with NVM

## Executive summary

|                        | NVMe Flash SSD                | Baseline                       | Faster Storage             | Persistent Memory            |
|------------------------|-------------------------------|--------------------------------|----------------------------|------------------------------|
| Config→                | AOF every <u>second</u> + RDB | NVMe Flash SSD                 | PMEM + Flash               | PMEM-only                    |
| Data loss potential    | Up to 1 sec worth             | AOF every <u>request</u> + RDB | AOF on PMEM + RDB on Flash | SW persistence <u>no</u> RDB |
| Crash restart          | Slowest                       | None                           | None                       | None                         |
| Application rewrite    | No                            | Slowest                        | Fast                       | Instantaneous                |
| “Memory” cost^         | 1x                            | No                             | No                         | Yes                          |
| Iso-SLA performance(†) | 1x                            | 1x                             | 1.25x§                     | 0.27x†                       |
|                        |                               | 0.165x (0.149x)                | 1.46x (1.31x)              | 2.18x (1.96x)                |



^ Flash costs are marginal for all configurations, thus ignored.

§ Assumes 8GB/core DRAM + 1 GB/core SCM capacity – SCM capacity very generous for log use-case, previous studies used as low as 40MB/core.

† Assumes 8GB/core DRAM capacity, SCM = 0.25x DRAM \$/GB, 1:8 capacity ratio for DRAM:SCM cache ratio.

‡ Measured performance de-rated by 0.9x to account for a slower-than-DRAM SCM media fronted by a DRAM cache.

\* Unable to achieve same SLA as baseline with p99 for the every-request AOF logging configuration – hence the unconstrained SLA comparison.

4 Confidential © 2018 Arm Limited



- **13x perf boost, 1/4 mem cost**
- **No data loss**
- **Instantaneous restart**
- **Rewrite application**

arm

Slide Courtesy of Kshitij Sudan

# Why Application Rewrite



**Recovery can inspect the data-structures in PM to restore system to a consistent state**

- Crash consistency (failure atomicity) is needed to ensure recovery can restore system to a consistent state
  - Data move through volatile memories before they get written to PM
  - Using CPU cache flushes and fence instructions

# Example: Add a Node to a Linked List

```
1 // Add a node to a linked list
2 void addnode(struct root *rootp, int data)
3 {
4     struct node *newnodep;
5     if ((newnodep = malloc(1,
6         sizeof(struct node))) == NULL)
7         fatal("out of memory");
8     newnodep->data = data;
9     newnodep->nextp = rootp->headp;
10    rootp->headp = newnodep;
11 }
12 }
```

```
1 // Add failure atomicity
2 void addnode(struct root *rootp, int data)
3 {
4     struct node *newnodep;
5     int flag = 0;
6     if ((newnodep = pm_malloc(1,
7         sizeof(struct node))) == NULL)
8         fatal("out of memory");
9     newnodep->data = data;
10    newnodep->nextp = rootp->headp;
11    pm_flush(newnodep,
12        sizeof(struct node));
13    pm_fence();
14    rootp->headp = newnodep;
15    pm_flush(&(rootp->headp),
16        sizeof(rootp->headp));
17    pm_fence();
18 }
19 }
```



# Example: Add Concurrency to the Mix

```
1 // Add a node to a linked list
2 void
3 addnode(struct root *rootp, int data)
4 {
5     struct node *newnodep;
6     if ((newnodep = malloc(1,
7         sizeof(struct node))) == NULL)
8         fatal("out of memory");
9     newnodep->data = data;
10    newnodep->nextp = rootp->headp;
11    rootp->headp = newnodep;
12 }
```

```
1 // Add multithread atomicity
2 void
3 addnode(struct root *rootp, int data)
4 {
5     struct node *newnodep;
6     if ((newnodep = malloc(1,
7         sizeof(struct node))) == NULL)
8         fatal("out of memory");
9     newnodep->data = data;
10    /* lock the critical section */
11    pthread_mutex_lock(
12        &rootp->listlock);
13    newnodep->nextp = rootp->headp;
14    rootp->headp = newnodep;
15    pthread_mutex_unlock (
16        &rootp->listlock);
17 }
```

```
1 // Add failure atomicity
2 void
3 addnode(struct root *rootp, int data)
4 {
5     struct node *newnodep;
6     if ((newnodep = pm_malloc(1,
7         sizeof(struct node))) == NULL)
8         fatal("out of memory");
9     newnodep->data = data;
10    /* lock the critical section */
11    pthread_mutex_lock(
12        &rootp->listlock);
13    newnodep->nextp = rootp->headp;
14    pm_flush(newnodep,
15        sizeof(struct node));
16    rootp->headp = newnodep;
17    pm_flush(&(rootp->headp),
18        sizeof(rootp->headp));
19    pthread_mutex_unlock (
20        &rootp->listlock);
21 }
```

*What can still go wrong?*

Fences are missing after persists, they can get persisted out of order!!!

*What about making this lock-free?*



**It's Complicated**

- Oracle Labs at PPoPP'18

# Persistent Memory Programming Models

## Native Persistence

```
pt->x = 1;  
pt->y = 1;  
dccvap(&pt->x)  
dccvap(&pt->y)  
dsb
```

```
flag=1;  
dccvap(&flag)  
dsb
```

## Library Persistence – Atomic

```
pt->x = 1;  
pt->y = 1;  
pmem_persist(&pt,  
sizeof(pt))  
  
flag = 1;  
pmem_persist(&flag,  
sizeof(flag))
```

## Library Persistence – Durable TXs

```
TX_BEGIN{  
    pt->x = 1;  
    pt->y = 1;  
} TX_END
```

# Add Concurrency to the Mix

## Lib Persistence – Lock-free

```
1 // ctor, dtor, copy and assign
2 struct Node
3 {
4     int x;
5     int y;
6 };
7
8 //lock-free update
9 int updateNode(Node* pt, int flag)
10 {
11     // allocate a new node (CoW)
12     atomic<Node*> newpt = new Node();
13     persist(newpt);
14
15     // update the node
16     newpt->x = 1;
17     newpt->y = 1;
18     persist(&newpt->x);
19     persist(&newpt->y);
20
21     // swing the pointer to the new Node and update flag
22     while(true) {
23         // copy on write
24         atomic<Node*> cowpt = pt;
25
26         if(CAS(&pt, cowpt, newpt)) {
27             persist(&pt);
28             // no contention for flag
29             flag = 1;
30             persist(&flag);
31             return 1;
32         }
33     }
34 }
```

## Lib Persistence – Lock-based

```
mutex.lock()
pt->x = 1;
pt->y = 1;
pmem_persist(&pt,
sizeof(pt))

flag = 1;
pmem_persist(&flag,
sizeof(flag))
mutex.unlock()
```

## Lib Persistence – Durable TXs

```
TX_BEGIN{
    pt->x = 1;
    pt->y = 1;
} TX_END
```

Programming simpler, overheads higher

# Durable Transactions

- TXs provide clean failure semantics
- Accelerate w. Persistent HTM

```
TX_BEGIN{  
    pt->x = 1;  
    pt->y = 1;  
} TX_END
```

- Durable transactions guarantee Tx failure atomicity
  - Ordered persists + logging



# PMDK (Persistent Memory Development Kit)

Formerly NVML, 'pmem libraries'

- PMDK provides transactional APIs for persistent memory programming
  - libpmemobj transactional APIs
  - Use fine-grained logging and cache flushes
- Works on 64-bit Linux, Windows and 64-bit FreeBSD
- Supports x86 and Armv8 (experimental)



Ref: pmem.io

# Analyze Cost of Durable Transactions

Logging, Persisting and Ordering

# Cost of Durable Transactions

## Logging, Persisting and Ordering

### Persisting

- Latency too high if PoP is off-chip
- Undo log needs to be persisted in TX
- Instruction bloat due to looping through addresses with DCCVAP

### Logging

- Write amplification, i.e., write twice
- Copying (costly, and cache pollution)
- Barriers with undo logging
- Memory fragmentation
- Code bloat due to logging

### Ordering

- Serialize memory accesses
- Excessive # of barriers via API calls
- Persisting barriers same as ordering barriers

# Performance Measurement Setup



## Payload

Redis-Benchmark  
TPCC



## Workload

Redis  
Maps (PMDK/NVML)



## CPU

CLFLUSH vs. CLFLUSHOPT  
Fixed frequency and ASLR off  
Thread and memory affinity in multi-node



## Memory

Local DRAM  
Remote DRAM to emulate NVM

# Flushing, Logging and Fencing Overhead



- Logging overhead 0~35%
- Flushing overhead 14~53%
- Fencing overhead 2~5%

Baseline: PMDK without flushing/fencing and logging on

# Software Opt Can Reduce Logging Overhead



- Software opt can reduce logging overhead
  - NVML v1.3 (left) reduces logging overhead over NVML v1.0 (right)
  - Redis-SET performance not affected by logging overhead

# Summary of Performance Overheads Analysis

- Flushing overhead up to 53%.
- Logging overhead up to 35%.
- Fencing overhead up to 5%
- These overheads can be reduced by
  - Software optimizations
  - Hardware accelerations

# Ease-of-Programming vs Performance



| Programming Model | Ordering | Logging         | Persisting | Translation |
|-------------------|----------|-----------------|------------|-------------|
| Atomics           | ✓        | ✗               | ✓          | ✓           |
| TM                | ✓        | ✓               | ✓          | ✓           |
| Locking           | ✓        | ✓ (log elision) | ✓          | ✓           |

## Ease-of-Programming

Persistent CAS as drop-in replacement for CAS  
Persistent HTM as drop-in replacement for HTM  
Compiler instrumented failure atomicity for locking

## Performance

Relaxed memory persistency model  
Hardware accelerations  
Software optimizations



Thank You!

Danke!

Merci!

谢谢!

ありがとう!

Gracias!

Kiitos!

감사합니다

ধন্যবাদ

arm