32-bit ARM very slow xxhash #505

Closed
jtoivainen opened this issue Mar 2, 2021 · 4 comments
@jtoivainen

Compiled with gcc 4.8.5 and tested on a dual-core 32-bit ARM Cortex-A9, all the xxhash algorithms are very slow and lose even to a crc32 implementation, regardless of the input size. Tested from both dev and v0.8.0; it makes no difference. The NEON path is activated.

lscpu:
Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32

It seems this code is not well tested on 32-bit ARM?

@Cyan4973
Owner

Cyan4973 commented Mar 2, 2021

A classic issue with gcc on 32-bit ARM is that
it doesn't exploit (by default) the CPU's ability to perform unaligned memory accesses
(which I presume the A9 supports),
resulting in very slow memory accesses.

There are a few things that can be attempted here.

On the library side, there are multiple access methods which are provided:
https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L1117

The most aggressive is XXH_FORCE_MEMORY_ACCESS=2.
It can lead to incorrect binary generation depending on other settings and compiler capabilities.
But should it work, it will at least give a sense of what the speed should be.

A bit safer, XXH_FORCE_MEMORY_ACCESS=1 will generate correct code.
But the compiler needs to be aware that the cpu is able to access unaligned memory addresses.

This may require specifying some additional compilation flag.
For example, -march=armv7-a tells gcc that the ARM chip is capable of accessing unaligned memory addresses,
resulting in much better code generation in combination with XXH_FORCE_MEMORY_ACCESS=1.
(Unfortunately, according to godbolt, this flag combined with the default XXH_FORCE_MEMORY_ACCESS=0 still gives bad performance on older versions of gcc. Newer versions, from gcc-6 on, fix this issue.)
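
For illustration, the access strategies selected by XXH_FORCE_MEMORY_ACCESS can be sketched roughly as follows. This is a simplified sketch, not the library's actual code: the function names (read32_memcpy, read32_packed, read32_byteshift_le, check_readers_agree) are made up for this example, and the packed variant assumes a GCC-compatible compiler. Method 2 (a direct pointer cast) appears only as a comment, since it is undefined behavior on unaligned input.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Method 0 (portable default): memcpy into a local variable.
 * Fast only if the compiler inlines it into a single load. */
static uint32_t read32_memcpy(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Method 1 (GCC/Clang extension): a packed wrapper type tells the
 * compiler that the field may sit at any address. */
#if defined(__GNUC__)
typedef struct { uint32_t v; } __attribute__((packed, may_alias)) packed_u32;
static uint32_t read32_packed(const void *p)
{
    return ((const packed_u32 *)p)->v;
}
#else
/* Assumption for this sketch: fall back to memcpy elsewhere. */
static uint32_t read32_packed(const void *p)
{
    return read32_memcpy(p);
}
#endif

/* Method 2 (unsafe) would be:  return *(const uint32_t *)p;
 * It is undefined behavior when p is not suitably aligned. */

/* Method 3 (portable): assemble the value byte by byte.
 * This variant always yields the little-endian interpretation. */
static uint32_t read32_byteshift_le(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Sanity check: both native-endian readers must agree at any offset. */
static void check_readers_agree(const void *p)
{
    assert(read32_memcpy(p) == read32_packed(p));
}
```

Comparing the assembly generated for each reader (for instance on godbolt, with and without -march=armv7-a) is a quick way to see which one a given gcc version turns into a single load.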

Not sure if it's an option, but clang also tends to generate better binary code for ARM.

Finally, on the user side,
presuming that XXH_FORCE_ALIGN_CHECK=1 (https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L1179),
providing aligned input (on 4-byte boundaries for XXH32)
should use the direct aligned memory access code path.
This path should not depend on the compiler detecting unaligned memory access capabilities,
since the input is detected as aligned to begin with.
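
As a rough sketch of what "providing aligned input" can look like on the caller's side, the helpers below test a pointer for 4-byte alignment and, if needed, copy the data into an aligned staging buffer. The names is_aligned4 and stage_aligned are hypothetical and not part of xxHash. Whether the extra copy pays off is an assumption that has to be measured on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when p sits on a 4-byte boundary, 0 otherwise. */
static int is_aligned4(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Returns a 4-byte-aligned pointer to the same bytes as src.
 * If src is already aligned it is returned unchanged; otherwise the
 * bytes are copied into the caller-provided staging buffer, which is
 * aligned because it is declared as an array of uint32_t. */
static const void *stage_aligned(uint32_t *staging, size_t staging_bytes,
                                 const void *src, size_t len)
{
    if (is_aligned4(src))
        return src;
    assert(len <= staging_bytes); /* caller must size the buffer */
    memcpy(staging, src, len);
    return staging;
}
```

The returned pointer can then be handed to XXH32(ptr, len, seed) in place of the original one.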

XXH64 is likely to have poor performance on 32-bit arm.
But XXH32 should be fine.
The newer XXH3 sits somewhere in between: the baseline should be okay,
and with NEON support, it is expected to beat XXH32 in speed.

edit : fixed memory access value, as underlined by @easyaspi314

@easyaspi314
Contributor

If I am not mistaken, the correct general purpose flags for that CPU would be this:

-O2   -fomit-frame-pointer -march=armv7-a      -mfpu=neon     -mthumb        -munaligned-access
^duh  ^saves a register    ^sets arch version  ^enables neon  ^use thumb-2   ^force enable unaligned access

However, as Yann said, I would also recommend Clang as it tends to generate better code, especially with NEON.

Additionally, did you try a newer GCC version? GCC's ARM and AArch64 backends are rather mediocre, and they were pretty bad until recently.
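
One way to check that such flags actually reach the compiler is to look at its predefined macros. The sketch below is an illustration only: the helper names are hypothetical, and it assumes a GCC-compatible compiler that defines __OPTIMIZE__ and the ACLE macros __ARM_FEATURE_UNALIGNED and __ARM_NEON. On other compilers or targets the values simply come out as 0.

```c
#include <assert.h>
#include <stdio.h>

/* 1 if the translation unit was built with optimization enabled. */
static int build_is_optimized(void)
{
#if defined(__OPTIMIZE__)
    return 1;
#else
    return 0;
#endif
}

/* 1 if the compiler assumes hardware unaligned access is available. */
static int build_has_unaligned_access(void)
{
#if defined(__ARM_FEATURE_UNALIGNED)
    return 1;
#else
    return 0;
#endif
}

/* 1 if NEON code generation is enabled. */
static int build_has_neon(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* 1 if the code is compiled to the Thumb instruction set. */
static int build_is_thumb(void)
{
#if defined(__thumb__)
    return 1;
#else
    return 0;
#endif
}

/* Writes a one-line summary into out; returns snprintf's result. */
static int format_build_report(char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    return snprintf(out, cap, "optimized=%d unaligned=%d neon=%d thumb=%d",
                    build_is_optimized(), build_has_unaligned_access(),
                    build_has_neon(), build_is_thumb());
}
```

Compiling this with exactly the flags the build system uses and printing the report shows at a glance whether optimization or unaligned access was silently left off.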

@easyaspi314
Contributor

@Cyan4973 btw, XXH_FORCE_MEMORY_ACCESS=3 is safe, 2 is the evil one.

@jtoivainen
Author

It seems that compiler optimizations were not turned on correctly by the toolchain. With those flags set explicitly for the build, the xxhash functions run as fast as they should. Thanks @Cyan4973 and @easyaspi314.
