Use Bmi2 instruction to optimize compact protocol int64 code and deco… #2780

Open
wants to merge 1 commit into master

Conversation

zzachimed

…de, in our unit test. The benchmark shows a 30% performance boost for encode and a 5% performance boost for decode.
origin bench:
$ ./GbenchMarkForCompactProtocol
2023-04-10T00:56:33+08:00
Running ./GbenchMarkForCompactProtocol
Run on (128 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 1280 KiB (x64)
  L3 Unified 49152 KiB (x2)
Load Average: 0.71, 0.57, 0.24
-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
BM_Int64_Encode_1        22.1 ns         22.1 ns     31677633
BM_Int64_Encode_2        23.7 ns         23.7 ns     29457528
BM_Int64_Encode_3        26.1 ns         26.1 ns     26844781
BM_Int64_Encode_4        27.1 ns         27.1 ns     25705377
BM_Int64_Encode_5        30.0 ns         29.9 ns     23345095
BM_Int64_Encode_6        32.3 ns         32.3 ns     21658363
BM_Int64_Encode_7        34.5 ns         34.5 ns     20256118
BM_Int64_Encode_8        36.3 ns         36.3 ns     19234451
BM_Int64_Encode_9        37.8 ns         37.8 ns     18510508
BM_Int64_Encode_10       39.8 ns         39.8 ns     17531197
BM_Int64_Decode_1         485 ns          486 ns      1439281
BM_Int64_Decode_2         489 ns          491 ns      1424556
BM_Int64_Decode_3         493 ns          495 ns      1413917
BM_Int64_Decode_4         491 ns          493 ns      1421200
BM_Int64_Decode_5         495 ns          497 ns      1408950
BM_Int64_Decode_6         496 ns          498 ns      1406069
BM_Int64_Decode_7         504 ns          506 ns      1383636
BM_Int64_Decode_8         507 ns          509 ns      1374477
BM_Int64_Decode_9         505 ns          507 ns      1380481
BM_Int64_Decode_10        503 ns          505 ns      1386843

optimized bench:
$ ./GbenchMarkForCompactProtocol
2023-04-10T01:09:25+08:00
Running ./GbenchMarkForCompactProtocol
Run on (128 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 1280 KiB (x64)
  L3 Unified 49152 KiB (x2)
Load Average: 0.00, 0.24, 0.31
-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
BM_Int64_Encode_1        21.0 ns         21.0 ns     30414973
BM_Int64_Encode_2        22.3 ns         22.2 ns     28778678
BM_Int64_Encode_3        25.6 ns         25.6 ns     27278662
BM_Int64_Encode_4        29.1 ns         29.1 ns     23984143
BM_Int64_Encode_5        29.3 ns         29.2 ns     23856317
BM_Int64_Encode_6        29.6 ns         29.6 ns     23585080
BM_Int64_Encode_7        29.9 ns         29.8 ns     23423250
BM_Int64_Encode_8        29.7 ns         29.7 ns     23496002
BM_Int64_Encode_9        28.9 ns         28.9 ns     24128142
BM_Int64_Encode_10       28.9 ns         28.8 ns     24132967
BM_Int64_Decode_1         480 ns          480 ns      1430995
BM_Int64_Decode_2         481 ns          481 ns      1390746
BM_Int64_Decode_3         490 ns          492 ns      1395035
BM_Int64_Decode_4         495 ns          496 ns      1409327
BM_Int64_Decode_5         500 ns          501 ns      1398344
BM_Int64_Decode_6         495 ns          497 ns      1408096
BM_Int64_Decode_7         499 ns          501 ns      1398316
BM_Int64_Decode_8         496 ns          497 ns      1409067
BM_Int64_Decode_9         501 ns          503 ns      1392703
BM_Int64_Decode_10        500 ns          502 ns      1395121
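
For context, here is a minimal sketch of the kind of BMI2 varint fast path this change describes. It is not the patch itself: the function names, the 8-byte (56-bit) fast-path cutoff, and the buffer-size requirements are assumptions for illustration. In the compact protocol the int64 is zigzag-encoded before the varint step shown here, and values needing more than 8 varint bytes would fall back to the existing byte-by-byte loop.

#include <immintrin.h>  // _pdep_u64, _pext_u64, _bzhi_u64, _lzcnt_u64, _tzcnt_u64
#include <cstddef>
#include <cstdint>
#include <cstring>
// Build with e.g. -march=haswell (needs BMI1, BMI2 and LZCNT).

// Encode a value that fits in 56 bits (at most 8 varint bytes).
// Assumes `out` has at least 8 writable bytes and a little-endian CPU,
// which is always the case where BMI2 exists.
inline std::size_t encodeVarint56Bmi2(std::uint64_t val, std::uint8_t* out) {
  // Deposit each 7-bit group of `val` into the low 7 bits of one output byte.
  std::uint64_t spread = _pdep_u64(val, 0x7f7f7f7f7f7f7f7fULL);
  // Varint length = ceil(bit_length / 7), at least 1 byte.
  int bits = 64 - static_cast<int>(_lzcnt_u64(val | 1));
  std::size_t nbytes = static_cast<std::size_t>((bits + 6) / 7);
  // Set the continuation bit (0x80) on every byte except the last one.
  std::uint64_t cont =
      _bzhi_u64(0x8080808080808080ULL, static_cast<unsigned>(8 * (nbytes - 1)));
  std::uint64_t encoded = spread | cont;
  std::memcpy(out, &encoded, sizeof(encoded));  // writes 8 bytes; only nbytes are meaningful
  return nbytes;
}

// Decode a varint spanning at most 8 bytes; `in` must have 8 readable bytes.
// Returns false when the varint is longer, so the caller can take the scalar path.
inline bool decodeVarint56Bmi2(const std::uint8_t* in, std::uint64_t* value,
                               std::size_t* len) {
  std::uint64_t chunk;
  std::memcpy(&chunk, in, sizeof(chunk));
  // A clear top bit marks the final byte of the varint.
  std::uint64_t stop = ~chunk & 0x8080808080808080ULL;
  if (stop == 0) return false;  // longer than 8 bytes
  std::size_t nbytes = static_cast<std::size_t>(_tzcnt_u64(stop) >> 3) + 1;
  // Keep only the consumed bytes, then gather their 7-bit payloads.
  std::uint64_t masked = _bzhi_u64(chunk, static_cast<unsigned>(8 * nbytes));
  *value = _pext_u64(masked, 0x7f7f7f7f7f7f7f7fULL);
  *len = nbytes;
  return true;
}

The point of the pdep/pext approach is to replace the data-dependent loop with a few straight-line instructions, which is consistent with the encode gains above growing as the varints get longer.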

  • Did you create an Apache Jira ticket? (Request account here, not required for trivial changes)
  • If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
  • Did you squash your changes to a single commit? (not required, but preferred)
  • Did you do your best to avoid breaking changes? If one was needed, did you label the Jira ticket with "Breaking-Change"?
  • If your change does not involve any code, include [skip ci] anywhere in the commit message to free up build resources.

@Jens-G added the c++ label Apr 9, 2023
@emmenlau
Member

Wow, nice work! But it's not easy to validate. One option would be to keep this optional and disabled by default, so people can experiment with the new code?
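
For illustration, one way such an opt-in could look is a build switch combined with a runtime CPU check, defaulting to the current portable loop. The macro and function names below are placeholders, not taken from this PR:

#include <cstddef>
#include <cstdint>

// Existing byte-by-byte implementation and the proposed BMI2 fast path
// (placeholder declarations for this sketch).
std::size_t writeVarint64Portable(std::uint64_t value, std::uint8_t* out);
std::size_t writeVarint64Bmi2(std::uint64_t value, std::uint8_t* out);

std::size_t writeVarint64(std::uint64_t value, std::uint8_t* out) {
#if defined(THRIFT_ENABLE_BMI2) && defined(__BMI2__)
  // Runtime check (GCC/Clang builtin) keeps the binary safe on CPUs without BMI2.
  if (__builtin_cpu_supports("bmi2")) {
    return writeVarint64Bmi2(value, out);
  }
#endif
  return writeVarint64Portable(value, out);
}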
