
Optimize a scan of non state-changing bytes with SSSE3 instructions #58

Open

wants to merge 2 commits into base: main
Conversation

@ngrodzitski ngrodzitski commented Oct 10, 2023

This commit optimizes the scan of non-state-changing bytes using SSE2 instructions.

A [_mm_cmpestri](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_cmpestri) operation appears to be quite slow
compared to an alternative approach that uses [_mm_shuffle_epi8](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_shuffle_epi8)
on the low and high nibbles of the input and a bitwise AND of the two results to get 16 bytes of LUT values in one go (it also involves a handful of other SSE2 operations,
all of which have nice latency/throughput properties). The resulting 16 LUT bytes can then be analyzed (also vectorized) to get the index of the first byte (if any)
that changes the state, by finding the first byte whose LUT value is zero.

The tricky part here is the following:

```
Find A, B arrays (uint8_t[16]) such that
* `A[i] & B[j] == 0` if `LUT[i | (j << 4)] == 0`
* `A[i] & B[j] != 0` if `LUT[i | (j << 4)] != 0` // Note: we don't need any specific non-zero value
for all i, j = 0..15.
```
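
Given such `A` and `B` (loaded into vector registers), the scan kernel can be sketched roughly as follows. This is an illustrative sketch, not the code generated by this PR; the helper name and the way the tables are passed in are hypothetical:

```
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8; pulls in SSE2 too */

/* Illustrative sketch. `lut_lo` holds A, `lut_hi` holds B; a byte whose
 * reconstructed LUT value is zero is a state-changing byte. Returns the
 * index of the first such byte in the 16-byte chunk, or 16 if none. */
static int first_state_changing_byte(const unsigned char *p,
                                     __m128i lut_lo, __m128i lut_hi) {
    __m128i input = _mm_loadu_si128((const __m128i *)p);
    __m128i lo = _mm_and_si128(input, _mm_set1_epi8(0x0F));
    __m128i hi = _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0F));
    /* A[lo] & B[hi] is non-zero exactly when LUT[lo | (hi << 4)] is. */
    __m128i classified = _mm_and_si128(_mm_shuffle_epi8(lut_lo, lo),
                                       _mm_shuffle_epi8(lut_hi, hi));
    int mask = _mm_movemask_epi8(
        _mm_cmpeq_epi8(classified, _mm_setzero_si128()));
    return mask != 0 ? __builtin_ctz(mask) : 16;
}
```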

To find `A` and `B` satisfying the above conditions, the [Z3](https://github.com/Z3Prover/z3) solver is used.
The npm package that wraps Z3 for use in TypeScript was not particularly friendly to the author of this change, so another package (synckit)
was required to handle the async API of the Z3 wrapper.
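
For reference, the same constraint system can be sketched against Z3's C API (illustrative only, not the TypeScript tooling this PR uses; `lut` is an assumed 256-entry table of zero/non-zero flags, and extracting the model is left out):

```
#include <stdio.h>
#include <z3.h>

/* Sketch: search for A[16], B[16] (8-bit) such that A[i] & B[j] == 0
 * exactly when lut[i | (j << 4)] == 0. Returns 1 if a solution exists. */
int find_nibble_luts(const unsigned char lut[256]) {
    Z3_config cfg = Z3_mk_config();
    Z3_context ctx = Z3_mk_context(cfg);
    Z3_del_config(cfg);
    Z3_solver s = Z3_mk_solver(ctx);
    Z3_solver_inc_ref(ctx, s);

    Z3_sort bv8 = Z3_mk_bv_sort(ctx, 8);
    Z3_ast zero = Z3_mk_unsigned_int(ctx, 0, bv8);
    Z3_ast a[16], b[16];
    char name[8];
    for (int i = 0; i < 16; i++) {
        snprintf(name, sizeof name, "a%d", i);
        a[i] = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, name), bv8);
        snprintf(name, sizeof name, "b%d", i);
        b[i] = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, name), bv8);
    }
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++) {
            Z3_ast is_zero = Z3_mk_eq(ctx, Z3_mk_bvand(ctx, a[i], b[j]), zero);
            Z3_solver_assert(ctx, s,
                             lut[i | (j << 4)] ? Z3_mk_not(ctx, is_zero)
                                               : is_zero);
        }
    int ok = Z3_solver_check(ctx, s) == Z3_L_TRUE;
    /* On success, read A and B back from Z3_solver_get_model(ctx, s). */
    Z3_solver_dec_ref(ctx, s);
    Z3_del_context(ctx);
    return ok;
}
```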

Using llhttp as a benchmark framework, this change yields the following improvements:

```
Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz

http: "seanmonstar/httparse" (C)
BEFORE: 8192.00 mb | 1456.72 mb/s | 2172811.81 ops/sec | 5.62 s
AFTER:  8192.00 mb | 1752.90 mb/s | 2614577.82 ops/sec | 4.67 s

~20% improvement

http: "nodejs/http-parser" (C)
BEFORE: 8192.00 mb | 1050.60 mb/s | 2118535.14 ops/sec | 7.80 s
AFTER:  8192.00 mb | 1167.42 mb/s | 2354101.76 ops/sec | 7.02 s

~11% improvement
```

For messages heavier in header fields, the numbers might be even more convincing.
@ngrodzitski changed the title from "Optimize a scan of non state-changing bytes with SSE2 instructions" to "Optimize a scan of non state-changing bytes with SSE3 instructions" on Oct 10, 2023
@ngrodzitski (Author) commented

Here are the details behind the change: https://gist.github.com/ngrodzitski/f9de8fb60ae44e7c5455f9b78a2e4134

The previous commit actually uses an SSSE3 instruction (`_mm_shuffle_epi8`).
@ngrodzitski changed the title from "Optimize a scan of non state-changing bytes with SSE3 instructions" to "Optimize a scan of non state-changing bytes with SSSE3 instructions" on Oct 11, 2023
@ShogunPanda commented

@anonrig @RafaelGSS I'm not familiar with SSE. Can you please chime in here?

@lemire (Member) commented Nov 7, 2023

@ShogunPanda I'll have a look when I am back in Montreal.

@lemire (Member) commented Nov 8, 2023

This is very reasonable algorithmically. @ngrodzitski is using vectorized classification... See section "3.1.2 Vectorized Classification" in Parsing Gigabytes of JSON per Second. It is used extensively within simdutf, for example. The current implementation is specific to x64: one might want to extend it to other architectures like NEON (though it is not a limitation specific to this PR).

My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).

If I am correct, then this work might be in vain, as far as Node.js is concerned.

So before I review the PR more carefully, I'd be interested in knowing whether we are making sure that this code is actually being put to good use.

If not, then that's what we should fix first. (It is easily fixable, thankfully.)

@ngrodzitski : are your benchmarks relevant to Node.js, or did you manually compile the code with your own flags?
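
(For illustration of the "easily fixable" part: a minimal sketch of runtime dispatch, assuming GCC/Clang's `__builtin_cpu_supports`; all function names are hypothetical.)

```
#include <stddef.h>

/* Hypothetical kernels, defined elsewhere. */
int scan_scalar(const char *p, size_t n);
int scan_ssse3(const char *p, size_t n);

typedef int (*scan_fn)(const char *, size_t);

/* Resolve the best implementation once; the builtin queries CPUID. */
static scan_fn pick_scan(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("ssse3"))
        return scan_ssse3;
#endif
    return scan_scalar;
}

int scan(const char *p, size_t n) {
    static scan_fn impl;
    if (!impl)
        impl = pick_scan(); /* benign race: every writer stores the same value */
    return impl(p, n);
}
```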

@RafaelGSS (Member) commented

> My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).

That's correct. That would need to be done in deps/llhttp first so that we could technically include it in Node.js, but I'm almost sure it would need a new dist creation (or at the very least, an update to the current build flow) to support this.

I assume this is a huge effort, right @richardlau?

@ngrodzitski (Author) commented Nov 8, 2023

@lemire
You are right, the source from which I learned this technique is simdjson. I've only added nice automation with Z3 so that it works for a generic LUT (when a solution exists at all).

I can't speak for Node.js, but I've recently ported restinio (which initially used the old nodejs/http-parser) to llhttp, so I've given llhttp a closer look, which eventually brought me here: my interest is that llhttp is fast (and correct, of course).

So for the performance check I used the llhttp repo, which has a simple benchmark in it.

Here is how I did it:

  1. Cloned llparse and llhttp locally.
  2. Made the changes in llparse.
  3. Used npm link so that my local llparse version is used by llhttp.
  4. Rebuilt llhttp and ran the benchmark.

So my data (20% and 11%) is for llhttp.

@richardlau (Member) commented

> My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).

> That's correct. That would need to be done in deps/llhttp first so that we could technically include it in Node.js, but I'm almost sure it would need a new dist creation (or at the very least, an update to the current build flow) to support this.
>
> I assume this is a huge effort, right @richardlau?

I'm not entirely sure what is being asked. I'm not very familiar with llhttp and how it is generated using llparse -- for Node.js the gyp files used to build llhttp are:

@RafaelGSS (Member) commented

> I'm not entirely sure what is being asked. I'm not very familiar with llhttp and how it is generated using llparse -- for Node.js the gyp files used to build llhttp are:

I meant that in case we want to support sse4.2, we'll need to build llhttp with that support in Node.js, and considering this might be an arch-specific build, it would have to be a separate distributed binary in the release section, right? If so, it would be hard to support, I guess, mostly due to the maintenance burden.

@richardlau (Member) commented

I don't think we'd need an arch-specific build -- I think usually the dependency (in this case llhttp) would have to detect and select the code to use at runtime. I don't know how easy that will be to do in llhttp, so "huge effort" may or may not be correct (albeit I'd expect most of the effort to be on the llhttp side rather than the Node.js side).
Searching for sse4.2 in Node.js shows hits in base64 and zlib (although nodejs/node@ae115d6 commented this out for zlib).

@richardlau (Member) commented

And to follow up, the sse4.2 code would need to be in a separate compilation unit (i.e. .c file).

@lemire (Member) commented Nov 8, 2023

@richardlau Yes. You definitely do not want to pass an architecture-specific flag to the compiler when building Node.js, unless you want to build your own private version.

> the sse4.2 code would need to be in a separate compilation unit (i.e. .c file).

That's one way to do it. If you look at simdutf (which is already in Node.js), you will find that there is just one compilation unit (literally just one .cpp).

In any case, it is solvable here as well if we care.
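
(A minimal sketch of how a single compilation unit can host both paths, assuming GCC/Clang function-level target attributes; names and placeholder bodies are hypothetical, and this is not simdutf's actual machinery.)

```
#include <stddef.h>

/* Both paths in one .c file: the target attribute enables SSSE3 for just
 * that function, so no file-wide -mssse3 flag is needed. */
#if defined(__GNUC__) && defined(__x86_64__)
#include <tmmintrin.h>

__attribute__((target("ssse3")))
int scan_ssse3(const char *p, size_t n) {
    /* ... vectorized scan built on _mm_shuffle_epi8 ... */
    (void)p; (void)n;
    return 0;
}
#endif

int scan_scalar(const char *p, size_t n) {
    /* ... portable byte-at-a-time fallback ... */
    (void)p; (void)n;
    return 0;
}
```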

@lemire (Member) commented Nov 8, 2023

> If so, it would be hard to support, I guess, mostly due to the maintenance burden.

See simdutf for a reference on how you can make it easy to build (just one .cpp or .c file would do).

@lemire (Member) commented Nov 8, 2023

@RafaelGSS Do you have a good http benchmark already that could be used if one wants to demonstrate benefits?

@RafaelGSS (Member) commented

> @RafaelGSS Do you have a good http benchmark already that could be used if one wants to demonstrate benefits?

I'm not sure about the impacts of this optimization, but I believe https://github.com/nodejs/node/blob/main/benchmark/http/simple.js should be enough if that's a general parsing optimization. Using framework benchmarks should be good as well (see: https://github.com/fastify/benchmarks)
