Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A53 systems: enable FP, Neon and crypto in userspace #804

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

kjbracey2
Copy link
Contributor

Apart from a couple of components, user space is not using Neon, and nothing is using hardware floating point.

Add the necessary switches to activate FP, Neon and crypto to EXTRACFLAGS (which most components already use), or more specifically CPUCFLAGS.

Generally this will mean the compiler will use hardware floating point instructions instead of calling library routines anywhere float and double are used. It will also sometimes make use of the Neon to do 64-bit integer arithmetic.

The compiler is unlikely to actually do any significant vectorisation using the Neon, and won't directly use the crypto.

However, by setting these flags, components with custom Neon/crypto support will be able to detect their presence.

As an example, Nettle can detect Neon from the flags. However that detection wasn't working as intended, so it's been corrected. It will now use its arm/v6 and arm/neon routines.

ffmpeg's autodetection gets easily confused - as evidenced by the EXTRACFLAGSx that nobbled an attempt to pass it in. Explicitly specify the cpu parameter to help it out. It can make use of the floating point.

One cheat is to define __ARM_ARCH_7A__ manually. (The compiler will be defining __ARM_ARCH_8A__). This is to support older code that might have a series of if/else tests but is not aware of __ARM_ARCH_8A__. It's possible some component might be ARMv8-A aware and testing from oldest first, in which case that would be tripped up. But that doesn't leave them any worse off than the previous situation, where we had the tools set to ARMv7-A anyway. I haven't confirmed if this cheat is really necessary. Would need an extensive code audit.

Tested on RT-AX88U, although I've not exercised many of the components. It builds and the router works, at least.

The platform.mak stuff is a mess, but I think it's a bit less of a mess now than when I started - I've tidied up some unnecessary conditions. But maybe that causes too much divergence from upstream. Feedback welcome.

For A53-based systems, specify that we're -mcpu=cortex-a53 specifically.
Previously we selected generic -march=armv7-a or -march=armv8-a.

Also enable use of hardware floating point (with softfp ABI to not
effect binary compatibility).

Remove unused HOSTCONFIG_64 macros.
ARMv6 detection relied on seeing armv6 or armv7 in the toolchain name,
which is not the case. And with ARMv6 not detected, the Neon detection
wasn't attempted.

Always attempt Neon detection for arm, and if found, assume ARMv6 too.
This avoids having to think about changing the ARMv6 detection.
@RMerl
Copy link
Owner

RMerl commented Jan 13, 2022

nettle building is broken.

make[7]: Entering directory '/home/merlin/amng.ax86/release/src/router/nettle'
siv-cmac-aes256.o.d:168: *** missing separator.  Stop.

@kjbracey2
Copy link
Contributor Author

kjbracey2 commented Jan 13, 2022

Does the ax88u build work for you, at least? That was the one tested - I admit to not compiling every other model.

(So did I mess up the PR preparation, or is this a model variant problem?)

Actually, error coming from a "o.d" file is weird. Result of an incremental change? Try just a make nettle-clean?

@kjbracey2
Copy link
Contributor Author

kjbracey2 commented Jan 13, 2022

Clean build of rt-ax86u in src-rt-5.02p1axhnd.675x of the tip of the PR branch works for me - at least it's got as far as generating libnettle.a, and is still running.

@RMerl
Copy link
Owner

RMerl commented Jan 14, 2022

Clean build of rt-ax86u in src-rt-5.02p1axhnd.675x of the tip of the PR branch works for me - at least it's got as far as generating libnettle.a, and is still running.

Might have been a fluke caused by the parallel building of that component as it builds fine tonight, again doing with the same clean rsync from repo. If it happens again I will disable PARALLEL_BUILD for nettle.

Have you been able to measure any performance improvement from these changes to confirm these are actually doing anything? Keep in mind that userspace is compiled as 32-bit armv7, only the kernel is compiled as aarch64, and also only on the bcm490x. bcm675x like the RT-AX58U also use a 32-bit armv7 kernel.

A quick test on openssl (which is probably the main place where performance improvements would be really beneficial) showed no difference in performance.

The platform.mak stuff is a mess, but I think it's a bit less of a mess now than when I started - I've tidied up some unnecessary conditions. But maybe that causes too much divergence from upstream. Feedback welcome.

It's best not to do any cleanup to that file, because indeed it makes merging upstream changes quite difficult. I actually reverted a few of my own changes a few months ago as merging 45581 proved to be very difficult with all my previous changes in place.

@kjbracey2
Copy link
Contributor Author

kjbracey2 commented Jan 14, 2022

Have you been able to measure any performance improvement from these changes to confirm these are actually doing anything? Keep in mind that userspace is compiled as 32-bit armv7, only the kernel is compiled as aarch64, and also only on the bcm490x. bcm675x like the RT-AX58U also use a 32-bit armv7 kernel.

User space is AArch32, but compiling it as "ARMv7A-with-no-floating-point-or-neon" is suboptimal when you're really "ARMv8A-with-floating-point-and-neon".

(Be careful with terminology - ARMv7-A is 32-bit only, but ARMv8-A can be 32-bit or 64-bit. Everywhere you said "armv7" there I think you meant "AArch32".)

I've not measured anything, but I've inspected code generation of various components. Anything using floats+doubles is a lot happier, and it's clear the compiler knows how to work on uint64_ts with the Neon rather than 32 bits at a time with the standard instructions.

Unlike x86, there really isn't a night-and-day difference on base speed on general code between 32 and 64 bit. The 32-bit instruction set is already good. And it will give you access to the VFP, Neon and Crypto, as long as you tell people it's there.

Only significant limitation is that full use of Neon and Crypto needs hand-crafted assembler and people are increasingly not bothering to make 32-bit versions. (Part of the issue is that with Neon you've got the instructions, but only half the register file, which might mean a significantly different approach).

A quick test on openssl (which is probably the main place where performance improvements would be really beneficial) showed no difference in performance.

openssl was maybe the only userspace component that was already aware it had Neon - detected at runtime for assembly selection via HWCAPs. So it at least managed to run its AES-accelerated crypto, which is one of its biggest jobs. As most of its speed-critical work is handcrafted assembly selected via HWCAPs, the compiler flags won't affect it much.

Openssl does partially suffer from the "only handcrafted for 64-bit" problem. aesv8-armx.pl can generate Crypto Extension code for either AArch32 or AArch64, but sha512-armv4.pl has only Neon but not Crypto for AArch32.

I'd only expect you to see a crypto boost in something using nettle, which had failed to activate any of its armv6 or neon assembly.

Otherwise, this is a general hygiene patch - there's megabytes of general code being spat out, and it makes my teeth itch seeing it making library calls for floating point and doing 64-bit ops the hard way. (I was started on this by #803 - some curiosity made me look at the output for the prints, and I saw all the library calls.) I was hoping for a bit of a size reduction, but it doesn't seem to have been significant.

@kjbracey2
Copy link
Contributor Author

kjbracey2 commented Jan 14, 2022

Oh, one question - this is what I pieced together for the CPU possibilities there are. Have I got this right?

RT-AC68U: 4709 (6.x.4708)             2x A9
---
RT-AC88U: 4709 (7.14.114.x)           2x A9
---
RT-AC86U: 4906 (5.02hnd)              2x A53
---
RT-AX56U: 6755 (5.02axhnd.675x)       4x A7 BRCM_CHIP=47622
RT-AX58U: 6750 (5.02axhnd.675x)       3x A7 BRCM_CHIP=63178n
      v2: 6756
---
RT-AX88U: 49408 (5.02axhnd)           4x A53 BRCM_CHIP=4908
RT-AX86U: 4908 (5.02p1axhnd.675x)     4x A53 BRCM_CHIP=4908 (62118?) (also 86S - 2x A53 - same firmware)
RT-AX68U: 4906 (5.02p1axhnd.675x)     2x A53 BRCM_CHIP=4908 (62118?)

And I believe the A7 and A9 were base - no FP or Neon, while the A53s have all the trimmings.

And all the A53-based ones run a 64-bit kernel.

@RMerl
Copy link
Owner

RMerl commented Jan 17, 2022

At a quick glance it looks about right. Earlier RT-AC68U used a 800 MHz bcm4708. Later revisions went 4709 (and then 4709c0).

One thing to note about the 490x, this is actually a B53. No idea what Broadcom changed over the original A53 design, but we can assume that the two are very similar.

@RMerl
Copy link
Owner

RMerl commented Jan 17, 2022

Openssl does partially suffer from the "only handcrafted for 64-bit" problem. aesv8-armx.pl can generate Crypto Extension code for either AArch32 or AArch64, but sha512-armv4.pl has only Neon but not Crypto for AArch32.

is there something that could be optimized there? I only had a quick look at it over the weekend, but had trouble figuring out how the build system worked for openssl (last time I spent time on it, it was back in the 1.0.x days when I did a LOT of optimization there, gaining well over 10% in performance when running OpenVPN on an RT-N66U MIPS CPU).

Also did a good bit of work when trying to bring 1.1.1 back to the performance levels of 1.0.2, but with limited success (even started bisecting to figure out where 1.1.1 lost a lot of its performance). One thing I discovered then was for instance OpenSSL ran faster with -O2 than -O3 on my test router.

@kjbracey2
Copy link
Contributor Author

kjbracey2 commented Jan 17, 2022

is there something that could be optimized there?

Surely, yes, but only if you're in the business of hand-coding ARM/Neon assembly. It's the upstream package that lacks aarch32+SHA256 crypto acceleration assembly.

I've done lots of ARM optimisation in my time, but Neon's new to me. Maybe if I get bored... Haven't figured out if the half-size Neon register file in 32-bit mode matters for SHA256. I suspect it may do. AES doesn't really care...

OpenSSL does at least have non-crypto-extension Neon SHA256. But that will probably be slower than the crypto extension AES-CBC, so it will possibly dominate overall crypto time.

If the OpenSSL manages to avoid that 300Mbit/s non-accelerated bottleneck, it's worth worrying about? Otherwise it's all faster than that anyway.

I can't see anything from a Makefile configuration point of view - I think OpenSSL is already basically optimally configured w.r.t the assembly it does have.

Compiler options in this patch shouldn't hurt, and maybe make -O3 worth revisiting? -O3 has a better chance of being beneficial if the compiler has a better idea of what the CPU is - and it can activate some automatic vectorisation with the Neon, in theory.

@kjbracey2
Copy link
Contributor Author

Oops, correction time. There is actually an aarch32 crypto extension sha256 implementation in there, I'd just missed it. It's tucked at the bottom of sha256-armv4.pl...

@RMerl
Copy link
Owner

RMerl commented Jan 17, 2022

I can't see anything from a Makefile configuration point of view - I think OpenSSL is already basically optimally configured w.r.t the assembly it does have.

Ok, that's what I wanted to know. I have to trust in the OpenSSL devs to know what they are doing at the asm level, I was just wondering if there were any performance left on the table due to us not using the correct platform profile.

Compiler options in this patch shouldn't hurt, and maybe make -O3 worth revisiting? -O3 has a better chance of being beneficial if the compiler has a better idea of what the CPU is - and it can activate some automatic vectorisation with the Neon, in theory.

My gut feeling tells me that the -O3 performance loss were due to unwrapped loops no longer fitting within the CPU cache.. Might be worth retesting at some point, however it's currently not a priority for me as I have other things that need my attention development-wise.

@paldier
Copy link
Contributor

paldier commented Jan 18, 2022

bcm6750 dose not support hwcrypto
bcm6755/6756 bcm4906/4908/4912/4915 support hwcrypto
all of them support fpu(neon/vfp3)

@kjbracey2
Copy link
Contributor Author

bcm6750 dose not support hwcrypto
bcm6755/6756 bcm4906/4908/4912/4915 support hwcrypto
all of them support fpu(neon/vfp3)

The 675x are all Cortex-A7 based, as far as I can see. I don't see how any of them can have ARM crypto extensions. (Although everything has Broadcom's own SPU accelerator).

It's possible some 675x have Neon - if so, I've not found it configured anywhere, and it would be worthwhile getting that done like this patch.

I've failed to find any /proc/cpuinfo output via Google showing any FP/Neon on any Asus router with a 32-bit ARM myself. The A9-based devices did not have FP or Neon for sure. Can you show the output from a device that does? (It should show hardware capabilities regardless of what's enabled in the software).

@paldier
Copy link
Contributor

paldier commented Jan 18, 2022

https://github.com/SWRT-dev/softcenter_tools/blob/52ff614504526fb6d828fc3f4ddf6a40145e1faa/sc.sh#L22
Just build anything and try to run it
Asus prebuilt files are built with flags softfp or soft, i don't think it's good idea to modify the build environment.

@kjbracey2
Copy link
Contributor Author

kjbracey2 commented Jan 18, 2022

Thanks - the code there seems to suggest all 675x have Neon, which would make sense. And looking more closely at the Merlin kernel config, I can see it enabled in the 63178 kernel for the AX58U. I'd missed that.

So I think this patch can be extended to cover those systems - the benefit would be basically the same as for the A53 ones.

Changing the flags as per this PR won't affect binary compatibility - we stay using the soft/softfp ABI, so FP values are passed in integer registers to external functions. What wouldn't work in this system would be using -mfloat-abi=hard as in your link.

What do you think is the crypto difference between 6750 and 6755? I'm pretty certain it can't be the ARM crypto extensions, unless we're wrong about those chips being Cortex-A7s. Could maybe be a Cortex-A32?

@JackMerlin
Copy link
Contributor

Great work!

@RMerl RMerl force-pushed the master branch 2 times, most recently from b4d0ac1 to 42dc10f Compare March 23, 2022 19:20
ozonep referenced this pull request in ozonep/Asuswrt-Merlin-AdGuardHome-Installer Aug 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants