

# Conflict Detection-based Run-Length Encoding – AVX-512 CD Instruction Set in Action

Annett Ungethüm, Johannes Pietrzyk, Patrick Damme, Dirk Habich, Wolfgang Lehner

HardBD & Active'18 Workshop in Paris, France on April 16, 2018

# Challenges for Data Processing Nowadays



# Lightweight Compression Techniques

Technique = abstract idea of how compression works

| RLE                                                  | DELTA                                                                  | FOR                                                                       | DICT                                                                 | NS                                                                    |
|------------------------------------------------------|------------------------------------------------------------------------|---------------------------------------------------------------------------|----------------------------------------------------------------------|-----------------------------------------------------------------------|
| Run Length Encoding<br>Replace run by value & length | Differential Coding<br>Replace data elem. by difference to predecessor | Frame-of-Reference<br>Replace data elem. by difference to reference value | Dictionary Coding<br>Replace data elem. by 0-based key in dictionary | Null Suppression<br>Eliminate leading zeroes in binary representation |
| 1200            1200                                 | 1000            1000                                                   | 1200            200                                                       | 1000            0                                                    | 00...00 1011                                                          |
| 1200            4                                    | 1100            100                                                    | 1100            100                                                       | 1200            1                                                    | ↓                                                                     |
| 1200            300                                  | 1150            50                                                     | 1000            0                                                         | 1000            0                                                    | 1011                                                                  |
| 1200            2                                    | 1350            200                                                    | 1050            50                                                        | 1050            2                                                    |                                                                       |
| 300                                                  | 1355            5                                                      |                                                                           |                                                                      |                                                                       |
| 300                                                  |                                                                        |                                                                           |                                                                      |                                                                       |

Vectorization is crucial from performance perspective

# Vectorization using SIMD

## Single Instruction Multiple Data (SIMD)

- same instruction on multiple data points simultaneously

## Development of Intel's SIMD Extension

- Trend to larger vector registers
  - 128-bit (SSE)
  - 256-bit (AVX and AVX2)
  - 512-bit (AVX-512)
- Trend to more instructions



# Vectorization and Lightweight Data Compression

Most algorithms have been proposed for 128-bit SIMD registers

- Processing 4 elements (32 bit integers) at one

## Example Run-Length Encoding

- View subsequent occurrences of the same value as a run
- Each run representable by its value and length → just two integers

## RLE-SIMD

- Uses SIMD instructions to parallelize comparisons

| uncompressed |    |    |    | Rle         |
|--------------|----|----|----|-------------|
| 00           | 98 | 76 | 54 | 00 98 76 54 |
| 00           | 98 | 76 | 54 | 00 00 00 04 |
| 00           | 98 | 76 | 54 | 12 34 56 78 |
| 00           | 98 | 76 | 54 | 00 00 00 01 |
| 12           | 34 | 56 | 78 | 00 00 AB CD |
| 00           | 00 | AB | CD | 00 00 AB CD |
| 00           | 00 | AB | CD | 00 00 00 03 |



# RLE-SIMD: Compression



# Evaluation using Different Vector Sizes

## Compression Speed

- Measured in million integers per second (mis)



## Speedup

- Compared to baseline of 128-bit



non-well performing area

well-performing area

# Non-Well Performing Area

## Reasons

- For large run lengths, the number of loaded integers approaches more or less 100%, i.e. every value is only processed once.
- RLE vectorization uses a significantly higher number of load operations for sequences with short runs.
- The redundant processing dramatically increases with increasing vector widths.



# SIMD – New Instruction Sets



# Conflict Detection using AVX512CD



# Step 1: Run Detection



## Step 2: Run Length Detection



# Step 3: Storing



## Scatter (RLE512CD)

- Classical storage layout: (run value, run length)-pair
- Independent of vector width

## Continuous (RLE512CDAligned)

- Vector wise
- Run length and run value to different memory locations

# Evaluation – Load Instructions



# Evaluation- Vector Instructions



# Evaluation

## Runtime Comparison

- Intel Xeon Phi Knights Landing Processor
- RLE512CD (Aligned) outperforms state-of-the-art for small average run lengths



# Evaluation

## Runtime Comparison

- Intel Xeon 6130 Processor
- Similar results



# Summary

## Development of Intel's SIMD Extension

- Trend to larger vector registers
  - 128-bit (SSE)
  - 256-bit (AVX and AVX2)
  - 512-bit (AVX-512)
- Trend to more instructions

## Run Length Encoding

- Proposed novel implementation using AVX512-CD functionality



# Conflict Detection-based Run-Length Encoding – AVX-512 CD Instruction Set in Action

Annett Ungethüm, Johannes Pietrzyk, Patrick Damme, Dirk Habich, Wolfgang Lehner

HardBD & Active'18 Workshop in Paris, France on April 16, 2018