Implement ARM SVE optimization with assembly code #751

Open
hzhuang1 opened this issue Oct 25, 2022 · 3 comments

Comments

@hzhuang1
Contributor

The whole patch set is in #748.

This patch set includes the following changes:

  1. Move the dispatch point up to XXH3_accumulate() (Full acc loop #744). That pull request prepares for ARM SVE dispatch.
  2. Add SVE intrinsic code for XXH3.
  3. Use dispatch as a common framework for both x86 and aarch64, and import the aarch64 SVE assembly implementation.
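
For readers following along, here is a minimal sketch of what dispatching at the level of the whole accumulate loop might look like. It is not the code from #748. The names `accumulate_fn`, `accumulate_portable`, `accumulate_sve_stub`, `cpu_has_sve` and `select_accumulate` are hypothetical, the kernel arithmetic is a toy stand-in rather than the real XXH3 round, and the SVE kernel is a stub that forwards to the portable one. The only platform calls used are `getauxval(AT_HWCAP)` and the `HWCAP_SVE` bit on aarch64 Linux; elsewhere the sketch always picks the portable kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__linux__)
#  include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#  include <asm/hwcap.h>  /* HWCAP_SVE */
#endif

/* Sizes assumed to mirror xxHash's stripe layout (64-byte stripe, 8 lanes). */
enum { STRIPE_LEN = 64, ACC_NB = 8, SECRET_CONSUME_RATE = 8 };

/* Hypothetical kernel signature: one call covers the whole accumulate loop. */
typedef void (*accumulate_fn)(uint64_t *acc, const uint8_t *input,
                              const uint8_t *secret, size_t nb_stripes);

static uint64_t read64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);  /* alignment-safe load, native byte order */
    return v;
}

/* Portable stand-in kernel. The arithmetic is a toy, NOT the XXH3 round. */
static void accumulate_portable(uint64_t *acc, const uint8_t *input,
                                const uint8_t *secret, size_t nb_stripes)
{
    assert(acc != NULL && input != NULL && secret != NULL);
    for (size_t n = 0; n < nb_stripes; n++) {
        const uint8_t *in  = input  + n * STRIPE_LEN;
        const uint8_t *key = secret + n * SECRET_CONSUME_RATE;
        for (size_t i = 0; i < ACC_NB; i++)
            acc[i] += read64(in + 8 * i) ^ read64(key + 8 * i);
    }
}

/* Stub where an SVE intrinsic or assembly kernel would be plugged in.
 * It forwards to the portable kernel so the sketch runs everywhere. */
static void accumulate_sve_stub(uint64_t *acc, const uint8_t *input,
                                const uint8_t *secret, size_t nb_stripes)
{
    accumulate_portable(acc, input, secret, nb_stripes);
}

/* Runtime feature check; returns 0 on anything but aarch64 Linux. */
static int cpu_has_sve(void)
{
#if defined(__aarch64__) && defined(__linux__) && defined(HWCAP_SVE)
    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
    return 0;
#endif
}

/* Pick the kernel once; callers then invoke it for every block of stripes. */
accumulate_fn select_accumulate(void)
{
    return cpu_has_sve() ? accumulate_sve_stub : accumulate_portable;
}
```

A caller would fetch the pointer once with `select_accumulate()` and reuse it, so the feature check is not repeated per block.
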
@hzhuang1
Contributor Author

Let's start from #744.

@hzhuang1
Contributor Author

Let's start from #744.

I thought about this for a while. The effect of #744 isn't intuitive on its own, so I created #752, which only adds ARM SVE intrinsic support.

With #752, performance on the test platform is actually lower than with the scalar path. However, it is only the intrinsic implementation, meant to be easy to review and to serve as a starting point for optimization.

After #752, we can return to #744, which exposes the XXH3_accumulate() interface to every architecture. With each architecture maintaining its own implementation of this interface, we can avoid frequent memory accesses without hacking xxHash itself, which improves performance significantly.

Once both are handled, we can continue with the assembly implementation.

This new order should be much easier to follow.

@hzhuang1
Contributor Author

hzhuang1 commented Nov 10, 2022

#752 is merged. Thanks a lot.

Now we're moving on to #756, which simplifies #744. With this patch, the full accumulation loop can be customized per architecture. On SVE, this lets us avoid stack accesses and use SVE-specific prefetch instructions, which improves performance considerably.
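
To illustrate why owning the full loop matters, here is a rough sketch in portable C. It is not the code from #756. The function names are hypothetical, the per-lane arithmetic is modelled on my understanding of the XXH3 scalar round, and loads use native byte order for brevity. A per-stripe kernel has to read and write the accumulators through memory on every call, while a full-loop kernel can load them once, keep them in locals (standing in for SVE registers), and store them once at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sizes assumed to mirror xxHash's stripe layout (64-byte stripe, 8 lanes). */
enum { STRIPE_LEN = 64, ACC_NB = 8, SECRET_CONSUME_RATE = 8 };

typedef void (*stripe_fn)(uint64_t *acc, const uint8_t *in,
                          const uint8_t *secret);

static uint64_t read64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);  /* alignment-safe load, native byte order */
    return v;
}

/* Low 32 bits times high 32 bits of (data ^ key). */
static uint64_t lane_product(uint64_t data, uint64_t key)
{
    uint64_t k = data ^ key;
    return (uint64_t)(uint32_t)k * (k >> 32);
}

/* Per-stripe kernel: acc[] is read and written in memory on every call. */
void accumulate_stripe(uint64_t *acc, const uint8_t *in,
                       const uint8_t *secret)
{
    for (size_t i = 0; i < ACC_NB; i++) {
        uint64_t d = read64(in + 8 * i);
        acc[i ^ 1] += d;  /* raw input goes to the neighbouring lane */
        acc[i]     += lane_product(d, read64(secret + 8 * i));
    }
}

/* Old shape: generic loop outside, dispatched kernel called per stripe. */
void accumulate_per_stripe(stripe_fn f, uint64_t *acc, const uint8_t *in,
                           const uint8_t *secret, size_t nb_stripes)
{
    for (size_t n = 0; n < nb_stripes; n++)
        f(acc, in + n * STRIPE_LEN, secret + n * SECRET_CONSUME_RATE);
}

/* New shape: the kernel owns the loop. Accumulators are loaded once, kept
 * in locals across all stripes, and stored once. An architecture-specific
 * version could also issue its own prefetches inside this loop. */
void accumulate_full_loop(uint64_t *acc, const uint8_t *in,
                          const uint8_t *secret, size_t nb_stripes)
{
    uint64_t local[ACC_NB];
    assert(acc != NULL && in != NULL && secret != NULL);
    memcpy(local, acc, sizeof local);
    for (size_t n = 0; n < nb_stripes; n++) {
        const uint8_t *p = in + n * STRIPE_LEN;
        const uint8_t *s = secret + n * SECRET_CONSUME_RATE;
        for (size_t i = 0; i < ACC_NB; i++) {
            uint64_t d = read64(p + 8 * i);
            local[i ^ 1] += d;
            local[i]     += lane_product(d, read64(s + 8 * i));
        }
    }
    memcpy(acc, local, sizeof local);
}
```

Both shapes perform the same operations in the same order, so they produce identical accumulators; only where the accumulators live between stripes differs.
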
