This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

add avx2 version for slide_hash #27

Open

wants to merge 33 commits into master

Conversation

guowangy

It improves slide_hash performance by about 33% compared to the SSE version.

Perf-test result on CLX (Cascade Lake):

SSE time: 30.59ms
AVX2 time: 20.49ms
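
For context, slide_hash subtracts the window size from every hash-table entry when the window slides, clamping entries older than the window to zero. The SSE version does this with saturating 16-bit subtracts; an AVX2 version can process 16 entries per iteration instead of 8. A minimal sketch of such a routine, assuming names of my own choosing rather than the patch's:

```c
#include <immintrin.h>
#include <stdint.h>

typedef uint16_t Pos;

/* Hypothetical AVX2 slide_hash sketch: subtract wsize from every hash-chain
 * entry using an unsigned saturating subtract, so entries older than the
 * window clamp to 0 (NIL) without branching. Assumes the number of entries
 * is a multiple of 16, as zlib's hash-table sizes are. */
static void slide_hash_avx2(Pos *table, unsigned entries, unsigned wsize)
{
    const __m256i vwsize = _mm256_set1_epi16((short)wsize);
    unsigned i;
    for (i = 0; i < entries; i += 16) {
        __m256i v = _mm256_loadu_si256((__m256i *)(table + i));
        v = _mm256_subs_epu16(v, vwsize);   /* x < wsize  ->  0 */
        _mm256_storeu_si256((__m256i *)(table + i), v);
    }
}
```

The unsigned saturating subtract implements `m >= wsize ? m - wsize : 0` per entry without a branch; relative to the SSE path, the gain comes purely from the doubled vector width.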

jtkukunas and others added 30 commits June 21, 2018 20:13
Adds check for SSE2, SSE4.2, and the PCLMULQDQ instructions.
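
A hedged sketch of what such a runtime check can look like using GCC/Clang's `<cpuid.h>` (the variable names are hypothetical; the actual patch may structure its x86 feature detection differently):

```c
#include <cpuid.h>

/* Query CPUID leaf 1 and test the ECX/EDX feature bits for SSE2, SSE4.2
 * and PCLMULQDQ. Bit positions follow the Intel SDM. */
static int has_sse2, has_sse42, has_pclmulqdq;

static void x86_check_features(void)
{
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        has_sse2      = (edx >> 26) & 1;  /* SSE2 */
        has_sse42     = (ecx >> 20) & 1;  /* SSE4.2 */
        has_pclmulqdq = (ecx >>  1) & 1;  /* PCLMULQDQ */
    }
}
```
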
Excessive loop unrolling is detrimental to performance. This patch
adds a preprocessor define, ADLER32_UNROLL_LESS, to reduce unrolling
factor from 16 to 8.

Updates the configure script to set this as the default on x86.
Adds a preprocessor define, CRC32_UNROLL_LESS, to reduce unrolling
factor from 8 to 4 for the crc32 calculation.
For systems supporting SSE4.2, use the crc32 instruction as a fast hash
function.

We hash 4 bytes, instead of 3, for certain levels. This shortens the
hash chains, and also improves the quality of each hash entry.
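
A hedged sketch combining both points above (hash 4 bytes instead of 3, and compute the hash with the hardware crc32 instruction); the helper name is mine, not the patch's:

```c
#include <nmmintrin.h>   /* SSE4.2: _mm_crc32_u32 */
#include <string.h>

/* Hypothetical sketch: hash 4 input bytes by running them through the
 * crc32 instruction, then mask the result down to the hash table's size. */
static unsigned crc_hash4(const unsigned char *str, unsigned hash_bits)
{
    unsigned val;
    memcpy(&val, str, sizeof(val));               /* read 4 bytes */
    return (unsigned)_mm_crc32_u32(0, val) & ((1u << hash_bits) - 1);
}
```
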
Rather than copy the input data from strm->next_in into the window and
then compute the CRC, this patch combines these two steps into one. It
performs an SSE memory copy, while folding the data down in the SSE
registers. A final step is added, when we write the gzip trailer,
to reduce the 4 SSE registers to a 32-bit CRC value.

Adds some extra padding bytes to the window to allow for SSE partial
writes.
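
As a structural illustration only: each iteration copies 16 bytes with an SSE store while folding them into running CRC accumulators with carry-less multiplies. The fold constant `k` below is a placeholder and the accumulator shuffling is schematic; the real `crc_fold_copy` uses precomputed constants and manages its four accumulators somewhat differently.

```c
#include <emmintrin.h>
#include <wmmintrin.h>   /* PCLMULQDQ: _mm_clmulepi64_si128 */
#include <stddef.h>

/* Structural sketch of copy-and-fold: copy each 16-byte chunk from src to
 * dst and simultaneously fold it into the CRC accumulators. */
static void fold_copy_sketch(__m128i xmm_crc[4], unsigned char *dst,
                             const unsigned char *src, size_t len,
                             __m128i k /* hypothetical fold constant */)
{
    while (len >= 16) {
        __m128i data = _mm_loadu_si128((const __m128i *)src);
        _mm_storeu_si128((__m128i *)dst, data);   /* the memcpy half */
        /* fold: carry-less multiply the oldest accumulator by the fold
         * constant and XOR in the new data (constants omitted here) */
        __m128i lo = _mm_clmulepi64_si128(xmm_crc[0], k, 0x00);
        __m128i hi = _mm_clmulepi64_si128(xmm_crc[0], k, 0x11);
        xmm_crc[0] = xmm_crc[1];
        xmm_crc[1] = xmm_crc[2];
        xmm_crc[2] = xmm_crc[3];
        xmm_crc[3] = _mm_xor_si128(_mm_xor_si128(lo, hi), data);
        src += 16; dst += 16; len -= 16;
    }
}
```
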
In some (very rare) scenarios, the SIMD code in the `crc_folding` module
can perform out-of-bounds reads or writes, which could lead to GPF
crashes.

Here's the deal: when the `crc_fold_copy` function is called with a
non-zero `len` argument of less than 16, `src` is read through
`_mm_loadu_si128` which always reads 16 bytes. If the `src` pointer
points to a location which contains `len` bytes, but any of the `16 -
len` out-of-bounds bytes falls in unmapped memory, this operation will
trigger a GPF.

The same goes for the `dst` pointer when written to through
`_mm_storeu_si128`.

With this patch applied, the crash no longer occurs.

We first discovered this issue through Valgrind reporting an
out-of-bounds access while running a unit-test for some code derived
from `crc_fold_copy`. In general, the out-of-bounds read is not an issue
because reads only occur in sections which are definitely mapped
(assuming page size is a multiple of 16), and garbage bytes are ignored.
While giving this some more thought, we realized that small `len` values
combined with `src` or `dst` pointers at a very specific place in the
address space can lead to GPFs.

- Minor tweaks to merge request by Jim Kukunas <james.t.kukunas@linux.intel.com>
	- removed C11-isms
	- use unaligned load
	- better integrated w/ zlib (use zalign)
	- removed full example code from commit msg
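
One common way to address this, and roughly the shape of the fix, is to stage lengths shorter than 16 through a local 16-byte buffer so the SIMD load or store never touches bytes outside `[ptr, ptr + len)`. The helper names here are illustrative, not the patch's:

```c
#include <emmintrin.h>
#include <string.h>
#include <stddef.h>

/* Load up to 16 bytes without reading past src + len. */
static __m128i partial_load(const unsigned char *src, size_t len)
{
    unsigned char buf[16] = {0};
    memcpy(buf, src, len);                    /* never reads out of bounds */
    return _mm_loadu_si128((const __m128i *)buf);
}

/* Store the low len bytes of v without writing past dst + len. */
static void partial_store(unsigned char *dst, __m128i v, size_t len)
{
    unsigned char buf[16];
    _mm_storeu_si128((__m128i *)buf, v);
    memcpy(dst, buf, len);                    /* never writes out of bounds */
}
```
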
With deflate, the only destination for crc folding was the window, which
was guaranteed to have an extra 15B of padding (because we allocated it).
This padding allowed us to handle the partial store case (len < 16)
with a regular SSE store.

For inflate, this is no longer the case. For crc folding to be
efficient, it needs to operate on large chunks of the data each call.
For inflate, this means copying the decompressed data out to the
user-provided output buffer (more so with our reorganized window). Since
it's user-provided, we don't have the padding guarantee and therefore
need to fall back to a slower method of handling partial length stores.
The deflate_quick strategy is designed to provide maximum
deflate performance.

deflate_quick achieves this through:
- only checking the first hash match
- using a small inline SSE4.2-optimized longest_match
- forcing a window size of 8K, and using a precomputed dist/len
  table
- forcing the static Huffman tree and emitting codes immediately
  instead of tallying

This patch changes the scope of flush_pending, bi_windup, and
static_ltree to ZLIB_INTERNAL and moves END_BLOCK, send_code,
put_short, and send_bits to deflate.h.

Updates the configure script to enable by default for x86. On systems
without SSE4.2, the fallback is the deflate_fast strategy.

Fixes intel#6
Fixes intel#8
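
To illustrate the kind of SSE4.2 comparison an inline longest_match can lean on, here is a hedged sketch of a single 16-byte step: it returns the index of the first byte that differs between the current string and the candidate match (16 if all bytes match). This is not the helper from the patch, just the general technique:

```c
#include <nmmintrin.h>   /* SSE4.2: _mm_cmpestri */

/* Compare 16 bytes of the current string (a) against the candidate match
 * (b); equal-each comparison with negative polarity makes the returned
 * index the first position where the bytes are NOT equal. */
static int first_mismatch16(const unsigned char *a, const unsigned char *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    return _mm_cmpestri(va, 16, vb, 16,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                        _SIDD_NEGATIVE_POLARITY);
}
```
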
From: Arjan van de Ven <arjan@linux.intel.com>

As the name suggests, the deflate_medium deflate strategy is designed
to provide an intermediate strategy between deflate_fast and deflate_slow.
After finding two adjacent matches, deflate_medium scans left from
the second match in order to determine whether a better match can be
formed.

Fixes intel#2
(Note emit_match() doesn't currently use the value at all.)

Fixes intel#4
…pilation.

Minor tweak by Jim Kukunas, dropping changes to configure script
This commit significantly improves inflate performance by reorganizing
the window buffer into a contiguous window and pending output buffer.
The goal of this layout is to reduce branching, improve cache locality,
and enable the use of crc folding with gzip input.

The window buffer is allocated as a multiple of the user-selected window
size. In this commit, a factor of 4 is utilized.

The layout of the window buffer is divided into two sections. The first
section, window offset [0, wsize), is reserved for history that has
already been output. The second section, window offset [wsize, 4 * wsize),
is reserved for buffering pending output that hasn't been flushed to the
user's output buffer yet.

The history section grows downwards, towards the window offset of 0. The
pending output section grows upwards, towards the end of the buffer. As
a result, all of the possible distance/length data that may need to be
copied is contiguous. This removes the need to stitch together output
from 2 separate buffers.

In the case of gzip input, crc folding is used to copy the pending
output to the user's buffers.
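
A small sketch of the layout described above, with hypothetical names (the real code hangs this off inflate's internal state and its allocation hooks):

```c
#include <stdlib.h>

/* One contiguous allocation of 4 * wsize bytes: a history section
 * [0, wsize) that grows downward, and a pending-output section
 * [wsize, 4 * wsize) that grows upward, so any length/distance copy
 * stays within a single contiguous buffer. */
struct window_layout {
    unsigned char *buf;        /* base of the 4 * wsize allocation */
    unsigned char *history;    /* == buf: history already output */
    unsigned char *pending;    /* == buf + wsize: output not yet flushed */
    unsigned wsize;
};

static int window_layout_init(struct window_layout *w, unsigned wbits)
{
    w->wsize = 1u << wbits;                 /* user-selected window size */
    w->buf = malloc(4u * w->wsize);         /* factor of 4, as above */
    if (w->buf == NULL) return -1;
    w->history = w->buf;
    w->pending = w->buf + w->wsize;
    return 0;
}
```
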
Since the inflate optimizations depend on an efficient memcpy routine,
add an optimized version for platforms that don't have one.
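
For illustration, a minimal SSE copy loop of this kind might look as follows; the actual routine in the patch likely handles alignment and small tails more carefully:

```c
#include <emmintrin.h>
#include <stddef.h>

/* Copy 16 bytes per iteration with unaligned SSE loads/stores, then
 * finish the remaining tail bytewise. */
static void sse_memcpy(unsigned char *dst, const unsigned char *src, size_t len)
{
    while (len >= 16) {
        _mm_storeu_si128((__m128i *)dst,
                         _mm_loadu_si128((const __m128i *)src));
        dst += 16; src += 16; len -= 16;
    }
    while (len--) *dst++ = *src++;
}
```
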
This fixes an issue where a repeat sequence longer than 258 would be
encoded using longer distance values after the first match.
When we load new data into the window, we invalidate the next match, in
case the match would improve. At that point the hash has already been
updated with this data, so when we look for a new match it will point
back at itself. As a result, a literal is generated even when a better
match is available.

This patch avoids that by catching this case and ensuring we only look at
data in the past.