refactor: simd swar #134

Merged
merged 2 commits into from Apr 25, 2023


AaronO
Contributor

@AaronO AaronO commented Apr 20, 2023

This refactor moves the block-wise validators to a "swar" SIMD backend.

The core logic of validate => extract => chain is (IMO) now more evident, as is how we stack validators: use the largest SIMD validator available, then finish with the scalar validator if we are at the end of the buffer (uncommon).

Perf-wise, this is roughly on par with master and slightly faster in some respects, in part because it avoids duplicating validation work between the SIMD and scalar validators. It is not yet 100% fine-tuned or fully benched with neon or avx/sse.
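
For readers unfamiliar with the approach, here is a minimal sketch of the validate => extract => chain idea. It is not the code from this PR: the function names (`is_valid_byte`, `validate_block_swar`, `count_valid_prefix`) are hypothetical, and the accepted byte range (printable ASCII only) is a deliberate simplification of the real header/URI rules. A SWAR validator checks 8 bytes packed into a `u64`, the offset of the first invalid byte is extracted from the resulting mask, and a scalar loop finishes whatever is left at the end of the buffer.

```rust
/// Scalar validator for one byte. Simplified assumption: only printable
/// ASCII (0x20..=0x7E) is accepted; the real parser's rules are richer.
fn is_valid_byte(b: u8) -> bool {
    (0x20..=0x7E).contains(&b)
}

/// SWAR validator (hypothetical name): checks 8 bytes at once using plain
/// integer arithmetic and returns how many leading bytes are valid (0..=8).
fn validate_block_swar(block: [u8; 8]) -> usize {
    const HIGH: u64 = 0x8080_8080_8080_8080;
    const LOW7: u64 = 0x7F7F_7F7F_7F7F_7F7F;

    // Little-endian load so that block[0] lands in the lowest byte lane.
    let x = u64::from_le_bytes(block);
    let low7 = x & LOW7;

    // Validate: with the top bit of every lane cleared, the additions below
    // cannot carry into a neighbouring lane.
    // Bit 7 of a lane is set iff that lane's low 7 bits are >= 0x20.
    let ge_space = (low7 + 0x6060_6060_6060_6060) & HIGH;
    // Bit 7 of a lane is set iff that lane's low 7 bits are 0x7F (DEL).
    let is_del = (low7 + 0x0101_0101_0101_0101) & HIGH;
    // Bit 7 of a lane is set iff the original byte was non-ASCII.
    let non_ascii = x & HIGH;
    let invalid = (ge_space ^ HIGH) | is_del | non_ascii;

    // Extract: the lowest set flag marks the first invalid byte.
    if invalid == 0 {
        8
    } else {
        (invalid.trailing_zeros() / 8) as usize
    }
}

/// Chain: run the widest validator over full blocks, then finish the tail
/// (fewer than 8 bytes) with the scalar validator.
fn count_valid_prefix(buf: &[u8]) -> usize {
    let mut pos = 0;
    let mut chunks = buf.chunks_exact(8);
    for chunk in &mut chunks {
        let mut block = [0u8; 8];
        block.copy_from_slice(chunk);
        let n = validate_block_swar(block);
        pos += n;
        if n < 8 {
            return pos; // stopped inside this block, no scalar re-check needed
        }
    }
    for &b in chunks.remainder() {
        if !is_valid_byte(b) {
            break;
        }
        pos += 1;
    }
    pos
}
```

Because the block validator reports the exact offset where it stopped, the scalar loop only ever sees the bytes that no block covered, which is the "no duplicated validation work" point above. A wider (neon/avx/sse) validator would slot in ahead of the SWAR loop in the same way.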

@AaronO
Contributor Author

AaronO commented Apr 20, 2023

Not completely in love with the name swar, open to suggestions. It's also unclear whether disabling SIMD should also disable swar (which doesn't use any special instructions) and fall back to only the scalar validators (very slow). No technical blockers; it's more a matter of aligning intent.

@AaronO
Contributor Author

AaronO commented Apr 21, 2023

Bench dump

x64

Decent improvements of roughly 15% on full request parsing:

master:
test req/req ... bench:                         257 ns/iter (+/- 6)
test req_short/req_short ... bench:             72 ns/iter (+/- 3)
test resp/resp ... bench:                       273 ns/iter (+/- 61)
test resp_short/resp_short ... bench:           68 ns/iter (+/- 5)

swar:
test req/req ... bench:                         225 ns/iter (+/- 2)
test req_short/req_short ... bench:             60 ns/iter (+/- 3)
test resp/resp ... bench:                       239 ns/iter (+/- 24)
test resp_short/resp_short ... bench:           63 ns/iter (+/- 1)
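
To make the size of the change easy to check, here is a small sketch that computes the relative change for each of the four x64 benchmarks above. The helper name `relative_change_percent` and the `X64` table are mine, introduced only for illustration; the numbers are copied from the two dumps.

```rust
/// (benchmark name, master ns/iter, swar ns/iter), copied from the x64 dump.
const X64: [(&str, f64, f64); 4] = [
    ("req", 257.0, 225.0),
    ("req_short", 72.0, 60.0),
    ("resp", 273.0, 239.0),
    ("resp_short", 68.0, 63.0),
];

/// Relative change in percent; negative means the new timing is faster.
fn relative_change_percent(before_ns: f64, after_ns: f64) -> f64 {
    (after_ns - before_ns) / before_ns * 100.0
}

/// Formats one line per benchmark, e.g. for pasting into a comment.
fn summarize() -> Vec<String> {
    X64.iter()
        .map(|&(name, before, after)| {
            format!("{}: {:+.1}%", name, relative_change_percent(before, after))
        })
        .collect()
}
```

Keep in mind that the master resp run has a wide error bar (+/- 61), so its delta is less trustworthy than the others.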

M1 Air

You'll see that header/count is greatly improved with swar:

master:
test req/req ... bench:         117 ns/iter (+/- 13)
test req_short/req_short ... bench:          27 ns/iter (+/- 14)
test resp/resp ... bench:         120 ns/iter (+/- 2)
test resp_short/resp_short ... bench:          28 ns/iter (+/- 5)
test uri/uri_1b ... bench:           1 ns/iter (+/- 0)
test uri/uri_2b ... bench:           1 ns/iter (+/- 0)
test uri/uri_4b ... bench:           3 ns/iter (+/- 1)
test uri/uri_8b ... bench:           2 ns/iter (+/- 0)
test uri/uri_16b ... bench:           3 ns/iter (+/- 0)
test uri/uri_32b ... bench:           4 ns/iter (+/- 0)
test uri/uri_64b ... bench:           6 ns/iter (+/- 0)
test uri/uri_128b ... bench:          12 ns/iter (+/- 1)
test uri/uri_256b ... bench:          24 ns/iter (+/- 6)
test uri/uri_512b ... bench:          53 ns/iter (+/- 3)
test uri/uri_1024b ... bench:          95 ns/iter (+/- 1)
test uri/uri_2048b ... bench:         185 ns/iter (+/- 8)
test uri/uri_4096b ... bench:         366 ns/iter (+/- 24)
test header/name_1b ... bench:           9 ns/iter (+/- 0)
test header/name_2b ... bench:           9 ns/iter (+/- 0)
test header/name_4b ... bench:           9 ns/iter (+/- 0)
test header/name_8b ... bench:           9 ns/iter (+/- 0)
test header/name_16b ... bench:          10 ns/iter (+/- 0)
test header/name_32b ... bench:          13 ns/iter (+/- 0)
test header/name_64b ... bench:          20 ns/iter (+/- 0)
test header/name_128b ... bench:          32 ns/iter (+/- 0)
test header/name_256b ... bench:          58 ns/iter (+/- 1)
test header/name_512b ... bench:         109 ns/iter (+/- 2)
test header/name_1024b ... bench:         220 ns/iter (+/- 5)
test header/name_2048b ... bench:         424 ns/iter (+/- 8)
test header/name_4096b ... bench:         832 ns/iter (+/- 13)
test header/value_1b ... bench:           9 ns/iter (+/- 0)
test header/value_2b ... bench:          11 ns/iter (+/- 0)
test header/value_4b ... bench:          13 ns/iter (+/- 0)
test header/value_8b ... bench:           9 ns/iter (+/- 0)
test header/value_16b ... bench:          10 ns/iter (+/- 0)
test header/value_32b ... bench:          11 ns/iter (+/- 0)
test header/value_64b ... bench:          13 ns/iter (+/- 0)
test header/value_128b ... bench:          17 ns/iter (+/- 0)
test header/value_256b ... bench:          26 ns/iter (+/- 0)
test header/value_512b ... bench:          44 ns/iter (+/- 1)
test header/value_1024b ... bench:          89 ns/iter (+/- 2)
test header/value_2048b ... bench:         158 ns/iter (+/- 2)
test header/value_4096b ... bench:         301 ns/iter (+/- 8)
test header/count_1 ... bench:           9 ns/iter (+/- 0)
test header/count_2 ... bench:          16 ns/iter (+/- 0)
test header/count_4 ... bench:          29 ns/iter (+/- 0)
test header/count_8 ... bench:          53 ns/iter (+/- 1)
test header/count_16 ... bench:         112 ns/iter (+/- 449)
test header/count_32 ... bench:         203 ns/iter (+/- 4)
test header/count_64 ... bench:         396 ns/iter (+/- 7)
test header/count_128 ... bench:         788 ns/iter (+/- 23)
test version/http10 ... bench:           0 ns/iter (+/- 0)
test version/http11 ... bench:           0 ns/iter (+/- 0)
test version/partial ... bench:           1 ns/iter (+/- 0)
test method/get ... bench:           0 ns/iter (+/- 0)
test method/head ... bench:           2 ns/iter (+/- 0)
test method/post ... bench:           0 ns/iter (+/- 0)
test method/put ... bench:           2 ns/iter (+/- 0)
test method/delete ... bench:           3 ns/iter (+/- 0)
test method/connect ... bench:           3 ns/iter (+/- 0)
test method/options ... bench:           3 ns/iter (+/- 0)
test method/trace ... bench:           2 ns/iter (+/- 0)
test method/patch ... bench:           2 ns/iter (+/- 0)
test method/custom ... bench:           3 ns/iter (+/- 0)

swar:
test req/req ... bench:         114 ns/iter (+/- 0)
test req_short/req_short ... bench:          25 ns/iter (+/- 0)
test resp/resp ... bench:         119 ns/iter (+/- 2)
test resp_short/resp_short ... bench:          27 ns/iter (+/- 0)
test uri/uri_1b ... bench:           2 ns/iter (+/- 0)
test uri/uri_2b ... bench:           2 ns/iter (+/- 0)
test uri/uri_4b ... bench:           4 ns/iter (+/- 0)
test uri/uri_8b ... bench:           1 ns/iter (+/- 0)
test uri/uri_16b ... bench:           2 ns/iter (+/- 0)
test uri/uri_32b ... bench:           5 ns/iter (+/- 0)
test uri/uri_64b ... bench:           7 ns/iter (+/- 0)
test uri/uri_128b ... bench:          13 ns/iter (+/- 0)
test uri/uri_256b ... bench:          23 ns/iter (+/- 0)
test uri/uri_512b ... bench:          47 ns/iter (+/- 0)
test uri/uri_1024b ... bench:          97 ns/iter (+/- 2)
test uri/uri_2048b ... bench:         188 ns/iter (+/- 3)
test uri/uri_4096b ... bench:         369 ns/iter (+/- 2)
test header/name_1b ... bench:           9 ns/iter (+/- 0)
test header/name_2b ... bench:           8 ns/iter (+/- 0)
test header/name_4b ... bench:           9 ns/iter (+/- 0)
test header/name_8b ... bench:           9 ns/iter (+/- 0)
test header/name_16b ... bench:          12 ns/iter (+/- 0)
test header/name_32b ... bench:          14 ns/iter (+/- 0)
test header/name_64b ... bench:          20 ns/iter (+/- 1)
test header/name_128b ... bench:          34 ns/iter (+/- 1)
test header/name_256b ... bench:          58 ns/iter (+/- 3)
test header/name_512b ... bench:         109 ns/iter (+/- 3)
test header/name_1024b ... bench:         220 ns/iter (+/- 1)
test header/name_2048b ... bench:         424 ns/iter (+/- 9)
test header/name_4096b ... bench:         831 ns/iter (+/- 19)
test header/value_1b ... bench:           8 ns/iter (+/- 0)
test header/value_2b ... bench:          10 ns/iter (+/- 0)
test header/value_4b ... bench:          10 ns/iter (+/- 2)
test header/value_8b ... bench:          10 ns/iter (+/- 0)
test header/value_16b ... bench:          10 ns/iter (+/- 0)
test header/value_32b ... bench:          11 ns/iter (+/- 0)
test header/value_64b ... bench:          13 ns/iter (+/- 0)
test header/value_128b ... bench:          18 ns/iter (+/- 0)
test header/value_256b ... bench:          26 ns/iter (+/- 0)
test header/value_512b ... bench:          44 ns/iter (+/- 5)
test header/value_1024b ... bench:          89 ns/iter (+/- 12)
test header/value_2048b ... bench:         170 ns/iter (+/- 6)
test header/value_4096b ... bench:         302 ns/iter (+/- 11)
test header/count_1 ... bench:           9 ns/iter (+/- 0)
test header/count_2 ... bench:          15 ns/iter (+/- 0)
test header/count_4 ... bench:          26 ns/iter (+/- 1)
test header/count_8 ... bench:          51 ns/iter (+/- 1)
test header/count_16 ... bench:          96 ns/iter (+/- 5)
test header/count_32 ... bench:         196 ns/iter (+/- 7)
test header/count_64 ... bench:         336 ns/iter (+/- 12)
test header/count_128 ... bench:         662 ns/iter (+/- 6)
test version/http10 ... bench:           0 ns/iter (+/- 0)
test version/http11 ... bench:           0 ns/iter (+/- 0)
test version/partial ... bench:           1 ns/iter (+/- 0)
test method/get ... bench:           0 ns/iter (+/- 0)
test method/head ... bench:           2 ns/iter (+/- 0)
test method/post ... bench:           0 ns/iter (+/- 0)
test method/put ... bench:           2 ns/iter (+/- 0)
test method/delete ... bench:           3 ns/iter (+/- 0)
test method/connect ... bench:           3 ns/iter (+/- 0)
test method/options ... bench:           3 ns/iter (+/- 0)
test method/trace ... bench:           2 ns/iter (+/- 0)
test method/patch ... bench:           2 ns/iter (+/- 0)
test method/custom ... bench:           3 ns/iter (+/- 0)

swar + neon:
test req/req ... bench:         125 ns/iter (+/- 4)
test req_short/req_short ... bench:          27 ns/iter (+/- 0)
test resp/resp ... bench:         135 ns/iter (+/- 2)
test resp_short/resp_short ... bench:          27 ns/iter (+/- 0)
test uri/uri_1b ... bench:           2 ns/iter (+/- 0)
test uri/uri_2b ... bench:           3 ns/iter (+/- 0)
test uri/uri_4b ... bench:           4 ns/iter (+/- 0)
test uri/uri_8b ... bench:           2 ns/iter (+/- 0)
test uri/uri_16b ... bench:           2 ns/iter (+/- 0)
test uri/uri_32b ... bench:           3 ns/iter (+/- 0)
test uri/uri_64b ... bench:           5 ns/iter (+/- 0)
test uri/uri_128b ... bench:           7 ns/iter (+/- 0)
test uri/uri_256b ... bench:          14 ns/iter (+/- 0)
test uri/uri_512b ... bench:          26 ns/iter (+/- 0)
test uri/uri_1024b ... bench:          52 ns/iter (+/- 0)
test uri/uri_2048b ... bench:         108 ns/iter (+/- 0)
test uri/uri_4096b ... bench:         210 ns/iter (+/- 2)
test header/name_1b ... bench:           9 ns/iter (+/- 0)
test header/name_2b ... bench:           9 ns/iter (+/- 0)
test header/name_4b ... bench:           9 ns/iter (+/- 0)
test header/name_8b ... bench:           9 ns/iter (+/- 0)
test header/name_16b ... bench:          11 ns/iter (+/- 0)
test header/name_32b ... bench:          12 ns/iter (+/- 0)
test header/name_64b ... bench:          14 ns/iter (+/- 0)
test header/name_128b ... bench:          17 ns/iter (+/- 0)
test header/name_256b ... bench:          24 ns/iter (+/- 0)
test header/name_512b ... bench:          37 ns/iter (+/- 0)
test header/name_1024b ... bench:          64 ns/iter (+/- 0)
test header/name_2048b ... bench:         129 ns/iter (+/- 1)
test header/name_4096b ... bench:         237 ns/iter (+/- 3)
test header/value_1b ... bench:           9 ns/iter (+/- 0)
test header/value_2b ... bench:           9 ns/iter (+/- 0)
test header/value_4b ... bench:          10 ns/iter (+/- 0)
test header/value_8b ... bench:          10 ns/iter (+/- 0)
test header/value_16b ... bench:          12 ns/iter (+/- 0)
test header/value_32b ... bench:          13 ns/iter (+/- 0)
test header/value_64b ... bench:          14 ns/iter (+/- 0)
test header/value_128b ... bench:          18 ns/iter (+/- 0)
test header/value_256b ... bench:          24 ns/iter (+/- 0)
test header/value_512b ... bench:          37 ns/iter (+/- 0)
test header/value_1024b ... bench:          63 ns/iter (+/- 2)
test header/value_2048b ... bench:         125 ns/iter (+/- 5)
test header/value_4096b ... bench:         226 ns/iter (+/- 1)
test header/count_1 ... bench:           9 ns/iter (+/- 0)
test header/count_2 ... bench:          14 ns/iter (+/- 0)
test header/count_4 ... bench:          30 ns/iter (+/- 1)
test header/count_8 ... bench:          57 ns/iter (+/- 3)
test header/count_16 ... bench:         115 ns/iter (+/- 2)
test header/count_32 ... bench:         219 ns/iter (+/- 1)
test header/count_64 ... bench:         427 ns/iter (+/- 3)
test header/count_128 ... bench:         888 ns/iter (+/- 27)
test version/http10 ... bench:           0 ns/iter (+/- 0)
test version/http11 ... bench:           0 ns/iter (+/- 0)
test version/partial ... bench:           1 ns/iter (+/- 0)
test method/get ... bench:           0 ns/iter (+/- 0)
test method/head ... bench:           2 ns/iter (+/- 0)
test method/post ... bench:           0 ns/iter (+/- 0)
test method/put ... bench:           2 ns/iter (+/- 0)
test method/delete ... bench:           3 ns/iter (+/- 0)
test method/connect ... bench:           3 ns/iter (+/- 0)
test method/options ... bench:           3 ns/iter (+/- 0)
test method/trace ... bench:           2 ns/iter (+/- 0)
test method/patch ... bench:           2 ns/iter (+/- 0)
test method/custom ... bench:           3 ns/iter (+/- 0)

AaronO added a commit to AaronO/httparse that referenced this pull request Apr 24, 2023
Builds off seanmonstar#134 (swar), seanmonstar#138 (Bytes cursor)

Cleaner, faster, and fewer macros!

## Key changes

- Broke down header-parsing into clean conceptual steps (whilst being faster!)
- Added InnerResult, allowing idiomatic `?` early-exits and removing the need for parsing-helper macros
- Removed macros.rs, leaving only `byte_map!` (response header-parsing macros should become functions)
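
As a rough illustration of the `?` early-exit idea, here is a minimal sketch. The actual definition of InnerResult is not shown in this thread, so everything below (`Stop`, `ParseError`, `Status`, `next_byte`, `expect_crlf`, `parse_token_line`, `parse`) is a hypothetical stand-in: an inner `Result` whose error side carries both "need more data" and "invalid input", so helper functions can bail out with `?` instead of through macros.

```rust
#[derive(Debug, PartialEq)]
enum ParseError {
    Token,
    NewLine,
}

/// Public-facing outcome: a value, or "need more bytes".
#[derive(Debug, PartialEq)]
enum Status<T> {
    Complete(T),
    Partial,
}

/// Hypothetical reasons an inner parsing step stops early.
#[derive(Debug, PartialEq)]
enum Stop {
    Partial,
    Invalid(ParseError),
}

/// Assumed shape of an "InnerResult": both early exits live in `Err`,
/// so `?` propagates either of them.
type InnerResult<T> = Result<T, Stop>;

fn next_byte(buf: &[u8], pos: &mut usize) -> InnerResult<u8> {
    // Running out of input is not an error, just an early exit.
    let b = *buf.get(*pos).ok_or(Stop::Partial)?;
    *pos += 1;
    Ok(b)
}

fn expect_crlf(buf: &[u8], pos: &mut usize) -> InnerResult<()> {
    if next_byte(buf, pos)? != b'\r' {
        return Err(Stop::Invalid(ParseError::NewLine));
    }
    if next_byte(buf, pos)? != b'\n' {
        return Err(Stop::Invalid(ParseError::NewLine));
    }
    Ok(())
}

/// Parses an alphanumeric token followed by CRLF; returns the token length.
fn parse_token_line(buf: &[u8]) -> InnerResult<usize> {
    let mut pos = 0;
    loop {
        let b = next_byte(buf, &mut pos)?;
        if b == b'\r' {
            pos -= 1; // leave the CR for expect_crlf
            break;
        }
        if !b.is_ascii_alphanumeric() {
            return Err(Stop::Invalid(ParseError::Token));
        }
    }
    let len = pos;
    expect_crlf(buf, &mut pos)?;
    Ok(len)
}

/// Outer entry point: converts the inner result to the public shape once.
fn parse(buf: &[u8]) -> Result<Status<usize>, ParseError> {
    match parse_token_line(buf) {
        Ok(n) => Ok(Status::Complete(n)),
        Err(Stop::Partial) => Ok(Status::Partial),
        Err(Stop::Invalid(e)) => Err(e),
    }
}
```

With this shape, `parse(b"GET\r\n")` completes, `parse(b"GET\r")` reports that more input is needed, and `parse(b"G T\r\n")` fails with a token error, all without any helper macro in the parsing functions themselves.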

### TODO

- convert request header-parser, supporting its quirks
@seanmonstar
Owner

> unclear if SIMD disabled should also disable swar (doesn't use any special instructions) and only use scalar validators

After looking at what this is, I don't think it needs to be disabled with SIMD disabled. The point of that config was mostly so that we could test the scalar code even if build detection wanted to enable SIMD. Otherwise it'd be too hard to test it.

Owner

@seanmonstar seanmonstar left a comment


This was easier to review than I expected when I saw the initial email. Looks great to me, with one conflict to fix up. Thanks!

Moves the block-wise validators to a "swar" SIMD backend

The core logic of validate => extract => chain is now more evident
auto-merge was automatically disabled April 25, 2023 18:14

Head branch was pushed to by a user without write access

@seanmonstar seanmonstar merged commit 58a6293 into seanmonstar:master Apr 25, 2023
31 checks passed