Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SVE Support #432

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Add SVE Support #432

wants to merge 2 commits into from

Conversation

wx257osn2
Copy link
Contributor

  • Add ADA_SVE macro in include/ada/common_defs.h enabled if defined(__ARM_FEATURE_SVE)
  • Implement has_tabs_or_newline using Arm SVE

src/unicode.cpp Outdated
@@ -44,7 +46,27 @@ constexpr bool to_lower_ascii(char* input, size_t length) noexcept {
}
return non_ascii == 0;
}
#if ADA_NEON
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is causing build errors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I fixed it.

const svuint8_t mask2 = svdup_n_u8('\n');
const svuint8_t mask3 = svdup_n_u8('\t');
svbool_t running = svdup_n_b8(false);
const size_t lanes = svcntb();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depending on the lane, the remaining items needs to be checked as well. for example, user_input.size() % lanes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, Arm SVE has masking instructions, so we doesn't need to consider about remaining elements.

const svuint8_t mask3 = svdup_n_u8('\t');
svbool_t running = svdup_n_b8(false);
const size_t lanes = svcntb();
for (size_t i = 0; i < user_input.size(); i += lanes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if lanes is bigger than user_input.size() initially? wouldn't there be a overflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If lanes > user_input.size() , e.g. lanes = 16 and user_input.size() = 10 , then the loop runs only once and the mask becomes 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0 . No overflow happens.

src/unicode.cpp Outdated Show resolved Hide resolved
@lemire
Copy link
Member

lemire commented Jun 4, 2023

Optimizations are always eagerly invited. However, we need to be careful with optimized code that does not run in our continuous integration tests nor on our development machines.

As is, the code won't build: see my fix. Even with the fix, by default SVE will not be enabled, you need to pass a compiler flag (e.g., -march=armv8-a+sve). Let us use a graviton 3 system (GCC 11) and run benchdata three times with the current main branch, and 3 times with the PR...

Before...

BasicBench_AdaURL_aggregator_href   27475530 ns     27472820 ns           25 GHz=2.59283 cycle/byte=8.19443 cycles/url=711.761 instructions/byte=27.0696 instructions/cycle=3.30341 instructions/ns=8.56519 instructions/url=2.35124k ns/url=274.511 speed=316.243M/s time/byte=3.16212ns time/url=274.66ns url/s=3.64087M/s
BasicBench_AdaURL_aggregator_href   27473560 ns     27467550 ns           26 GHz=2.5946 cycle/byte=8.19026 cycles/url=711.4 instructions/byte=27.0627 instructions/cycle=3.30425 instructions/ns=8.5732 instructions/url=2.35064k ns/url=274.185 speed=316.304M/s time/byte=3.16152ns time/url=274.607ns url/s=3.64157M/s
BasicBench_AdaURL_aggregator_href   27487336 ns     27484277 ns           25 GHz=2.59334 cycle/byte=8.19364 cycles/url=711.693 instructions/byte=27.067 instructions/cycle=3.30342 instructions/ns=8.56691 instructions/url=2.35102k ns/url=274.431 speed=316.111M/s time/byte=3.16344ns time/url=274.774ns url/s=3.63935M/s

After...

BasicBench_AdaURL_aggregator_href   27470432 ns     27466735 ns           25 GHz=2.5941 cycle/byte=8.20583 cycles/url=712.752 instructions/byte=26.7438 instructions/cycle=3.25912 instructions/ns=8.4545 instructions/url=2.32295k ns/url=274.759 speed=316.313M/s time/byte=3.16142ns time/url=274.599ns url/s=3.64168M/s
BasicBench_AdaURL_aggregator_href   27451202 ns     27447098 ns           26 GHz=2.59269 cycle/byte=8.19058 cycles/url=711.428 instructions/byte=26.7439 instructions/cycle=3.2652 instructions/ns=8.46564 instructions/url=2.32295k ns/url=274.397 speed=316.54M/s time/byte=3.15916ns time/url=274.402ns url/s=3.64428M/s
BasicBench_AdaURL_aggregator_href   27483095 ns     27479923 ns           25 GHz=2.59405 cycle/byte=8.20717 cycles/url=712.869 instructions/byte=26.7438 instructions/cycle=3.25859 instructions/ns=8.45295 instructions/url=2.32295k ns/url=274.809 speed=316.161M/s time/byte=3.16294ns time/url=274.731ns url/s=3.63993M/s

As you can see, the SVE code is not faster than the NEON code.

You can get a benefit with SVE, but you have to rewrite the code with a tail...

BasicBench_AdaURL_aggregator_href   27167271 ns     27161413 ns           26 GHz=2.59365 cycle/byte=8.09789 cycles/url=703.376 instructions/byte=26.7557 instructions/cycle=3.30404 instructions/ns=8.56953 instructions/url=2.32398k ns/url=271.191 speed=319.869M/s time/byte=3.12628ns time/url=271.546ns url/s=3.68261M/s
BasicBench_AdaURL_aggregator_href   27195788 ns     27192649 ns           26 GHz=2.59409 cycle/byte=8.11229 cycles/url=704.627 instructions/byte=26.7557 instructions/cycle=3.29817 instructions/ns=8.55576 instructions/url=2.32398k ns/url=271.628 speed=319.501M/s time/byte=3.12988ns time/url=271.859ns url/s=3.67838M/s
BasicBench_AdaURL_aggregator_href   27245132 ns     27239912 ns           26 GHz=2.59493 cycle/byte=8.11311 cycles/url=704.698 instructions/byte=26.7532 instructions/cycle=3.29753 instructions/ns=8.55687 instructions/url=2.32376k ns/url=271.567 speed=318.947M/s time/byte=3.13532ns time/url=272.331ns url/s=3.672M/s
ada_really_inline bool has_tabs_or_newline(
    std::string_view user_input) noexcept {
  const svuint8_t mask1 = svdup_n_u8('\r');
  const svuint8_t mask2 = svdup_n_u8('\n');
  const svuint8_t mask3 = svdup_n_u8('\t');
  svbool_t running = svdup_n_b8(false);
  const size_t lanes = svcntb();
  size_t i = 0;
  for (; i + lanes <= user_input.size(); i += lanes) {
    const svbool_t mask = svptrue_b8();
    svuint8_t word = svld1_u8(mask, (const uint8_t*)user_input.data() + i);
    running = svorr_b_z(mask,
                        svorr_b_z(mask, running,
                                  svorr_b_z(mask, svcmpeq_u8(mask, word, mask1),
                                            svcmpeq_u8(mask, word, mask2))),
                        svcmpeq_u8(mask, word, mask3));
  }
  if (i < user_input.size()) {
    const svbool_t mask = svwhilelt_b8_u64(i, user_input.size());
    svuint8_t word = svld1_u8(mask, (const uint8_t*)user_input.data() + i);
    running = svorr_b_z(mask,
                        svorr_b_z(mask, running,
                                  svorr_b_z(mask, svcmpeq_u8(mask, word, mask1),
                                            svcmpeq_u8(mask, word, mask2))),
                        svcmpeq_u8(mask, word, mask3));
  }
  return svptest_any(svptrue_b8(), running);
}

See https://lemire.me/blog/2023/03/10/trimming-spaces-from-strings-faster-with-sve-on-an-amazon-graviton-3-processor/ for a related discussion and other ways to skin this cat.

It is a 1% gain which might be significant, but SVE on the graviton 3 has 32-byte SIMD registers whereas it appears that all SVE implementations on commodity processors are moving to 16-byte SIMD registers. Also I did not test on other compilers.

There may well be other optimizations worth considering with SVE.

Co-authored-by: Daniel Lemire <daniel@lemire.me>
@wx257osn2
Copy link
Contributor Author

As is, the code won't build: see my fix.

Oops, I missed. Thanks.

You can get a benefit with SVE, but you have to rewrite the code with a tail...

It seems that the performance of svwhilelt on Neonverse V1 is not good. It seems that at least when the all split codes are small enough to be in the instruction cache, your proposal (splitting the loop for remaining elements) has better performance.

I think that SVE code will give some benefits when the latency of instructions or register width will be improved, but anyway it doesn't so on Graviton 3.

@lemire
Copy link
Member

lemire commented Jun 5, 2023

@wx257osn2 It could be that this PR could become profitable in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants