-
Notifications
You must be signed in to change notification settings - Fork 617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A53 systems: enable FP, Neon and crypto in userspace #804
base: master
Are you sure you want to change the base?
Conversation
For A53-based systems, specify that we're -mcpu=cortex-a53 specifically. Previously we selected generic -march=armv7-a or -march=armv8-a. Also enable use of hardware floating point (with softfp ABI to not effect binary compatibility). Remove unused HOSTCONFIG_64 macros.
ARMv6 detection relied on seeing armv6 or armv7 in the toolchain name, which is not the case. And with ARMv6 not detected, the Neon detection wasn't attempted. Always attempt Neon detection for arm, and if found, assume ARMv6 too. This avoids having to think about changing the ARMv6 detection.
nettle building is broken.
|
Does the ax88u build work for you, at least? That was the one tested - I admit to not compiling every other model. (So did I mess up the PR preparation, or is this a model variant problem?) Actually, error coming from a "o.d" file is weird. Result of an incremental change? Try just a |
Clean build of rt-ax86u in src-rt-5.02p1axhnd.675x of the tip of the PR branch works for me - at least it's got as far as generating libnettle.a, and is still running. |
Might have been a fluke caused by the parallel building of that component as it builds fine tonight, again doing with the same clean rsync from repo. If it happens again I will disable PARALLEL_BUILD for nettle. Have you been able to measure any performance improvement from these changes to confirm these are actually doing anything? Keep in mind that userspace is compiled as 32-bit armv7, only the kernel is compiled as aarch64, and also only on the bcm490x. bcm675x like the RT-AX58U also use a 32-bit armv7 kernel. A quick test on openssl (which is probably the main place where performance improvements would be really beneficial) showed no difference in performance.
It's best not to do any cleanup to that file, because indeed it makes merging upstream changes quite difficult. I actually reverted a few of my own changes a few months ago as merging 45581 proved to be very difficult with all my previous changes in place. |
User space is AArch32, but compiling it as "ARMv7A-with-no-floating-point-or-neon" is suboptimal when you're really "ARMv8A-with-floating-point-and-neon". (Be careful with terminology - ARMv7-A is 32-bit only, but ARMv8-A can be 32-bit or 64-bit. Everywhere you said "armv7" there I think you meant "AArch32".) I've not measured anything, but I've inspected code generation of various components. Anything using floats+doubles is a lot happier, and it's clear the compiler knows how to work on Unlike x86, there really isn't a night-and-day difference on base speed on general code between 32 and 64 bit. The 32-bit instruction set is already good. And it will give you access to the VFP, Neon and Crypto, as long as you tell people it's there. Only significant limitation is that full use of Neon and Crypto needs hand-crafted assembler and people are increasingly not bothering to make 32-bit versions. (Part of the issue is that with Neon you've got the instructions, but only half the register file, which might mean a significantly different approach).
openssl was maybe the only userspace component that was already aware it had Neon - detected at runtime for assembly selection via HWCAPs. So it at least managed to run its AES-accelerated crypto, which is one of its biggest jobs. As most of its speed-critical work is handcrafted assembly selected via HWCAPs, the compiler flags won't affect it much. Openssl does partially suffer from the "only handcrafted for 64-bit" problem. I'd only expect you to see a crypto boost in something using nettle, which had failed to activate any of its armv6 or neon assembly. Otherwise, this is a general hygiene patch - there's megabytes of general code being spat out, and it makes my teeth itch seeing it making library calls for floating point and doing 64-bit ops the hard way. (I was started on this by #803 - some curiosity made me look at the output for the prints, and I saw all the library calls.) I was hoping for a bit of a size reduction, but it doesn't seem to have been significant. |
Oh, one question - this is what I pieced together for the CPU possibilities there are. Have I got this right?
And I believe the A7 and A9 were base - no FP or Neon, while the A53s have all the trimmings. And all the A53-based ones run a 64-bit kernel. |
At a quick glance it looks about right. Earlier RT-AC68U used a 800 MHz bcm4708. Later revisions went 4709 (and then 4709c0). One thing to note about the 490x, this is actually a B53. No idea what Broadcom changed over the original A53 design, but we can assume that the two are very similar. |
is there something that could be optimized there? I only had a quick look at it over the weekend, but had trouble figuring out how the build system worked for openssl (last time I spent time on it, it was back in the 1.0.x days when I did a LOT of optimization there, gaining well over 10% in performance when running OpenVPN on an RT-N66U MIPS CPU). Also did a good bit of work when trying to bring 1.1.1 back to the performance levels of 1.0.2, but with limited success (even started bisecting to figure out where 1.1.1 lost a lot of its performance). One thing I discovered then was for instance OpenSSL ran faster with -O2 than -O3 on my test router. |
I can't see anything from a Makefile configuration point of view - I think OpenSSL is already basically optimally configured w.r.t the assembly it does have. Compiler options in this patch shouldn't hurt, and maybe make -O3 worth revisiting? -O3 has a better chance of being beneficial if the compiler has a better idea of what the CPU is - and it can activate some automatic vectorisation with the Neon, in theory. |
Oops, correction time. There is actually an aarch32 crypto extension sha256 implementation in there, I'd just missed it. It's tucked at the bottom of |
Ok, that's what I wanted to know. I have to trust in the OpenSSL devs to know what they are doing at the asm level, I was just wondering if there were any performance left on the table due to us not using the correct platform profile.
My gut feeling tells me that the -O3 performance loss were due to unwrapped loops no longer fitting within the CPU cache.. Might be worth retesting at some point, however it's currently not a priority for me as I have other things that need my attention development-wise. |
bcm6750 dose not support hwcrypto |
The 675x are all Cortex-A7 based, as far as I can see. I don't see how any of them can have ARM crypto extensions. (Although everything has Broadcom's own SPU accelerator). It's possible some 675x have Neon - if so, I've not found it configured anywhere, and it would be worthwhile getting that done like this patch. I've failed to find any |
https://github.com/SWRT-dev/softcenter_tools/blob/52ff614504526fb6d828fc3f4ddf6a40145e1faa/sc.sh#L22 |
Thanks - the code there seems to suggest all 675x have Neon, which would make sense. And looking more closely at the Merlin kernel config, I can see it enabled in the 63178 kernel for the AX58U. I'd missed that. So I think this patch can be extended to cover those systems - the benefit would be basically the same as for the A53 ones. Changing the flags as per this PR won't affect binary compatibility - we stay using the soft/softfp ABI, so FP values are passed in integer registers to external functions. What wouldn't work in this system would be using What do you think is the crypto difference between 6750 and 6755? I'm pretty certain it can't be the ARM crypto extensions, unless we're wrong about those chips being Cortex-A7s. Could maybe be a Cortex-A32? |
Great work! |
b4d0ac1
to
42dc10f
Compare
Apart from a couple of components, user space is not using Neon, and nothing is using hardware floating point.
Add the necessary switches to activate FP, Neon and crypto to
EXTRACFLAGS
(which most components already use), or more specificallyCPUCFLAGS
.Generally this will mean the compiler will use hardware floating point instructions instead of calling library routines anywhere
float
anddouble
are used. It will also sometimes make use of the Neon to do 64-bit integer arithmetic.The compiler is unlikely to actually do any significant vectorisation using the Neon, and won't directly use the crypto.
However, by setting these flags, components with custom Neon/crypto support will be able to detect their presence.
As an example, Nettle can detect Neon from the flags. However that detection wasn't working as intended, so it's been corrected. It will now use its arm/v6 and arm/neon routines.
ffmpeg
's autodetection gets easily confused - as evidenced by theEXTRACFLAGSx
that nobbled an attempt to pass it in. Explicitly specify thecpu
parameter to help it out. It can make use of the floating point.One cheat is to define
__ARM_ARCH_7A__
manually. (The compiler will be defining__ARM_ARCH_8A__
). This is to support older code that might have a series of if/else tests but is not aware of__ARM_ARCH_8A__
. It's possible some component might be ARMv8-A aware and testing from oldest first, in which case that would be tripped up. But that doesn't leave them any worse off than the previous situation, where we had the tools set to ARMv7-A anyway. I haven't confirmed if this cheat is really necessary. Would need an extensive code audit.Tested on RT-AX88U, although I've not exercised many of the components. It builds and the router works, at least.
The platform.mak stuff is a mess, but I think it's a bit less of a mess now than when I started - I've tidied up some unnecessary conditions. But maybe that causes too much divergence from upstream. Feedback welcome.