arm64 kernels: Use ARM64 crypto #805

Open
wants to merge 1 commit into
base: master

kjbracey2
Contributor

Enable accelerated implementations of AES, AES ECB/CBC/CTR/XTS, CRC32, GHASH, SHA1, SHA256.

These are not enabled in the kernel, which seems like a mistake. They are faster than bcmspu, and much faster than the generic code currently enabled.

Now, userspace code is best off running its own crypto in userspace, as openvpn does. But the kernel and drivers want crypto sometimes too.

The bcmspu only supports asynchronous operations, so if any crypto is required synchronously, the kernel will currently be falling back to generic.

So on its own, this will speed up synchronous operations. Testing using the tcrypt module, I got speed-ups of ~300% for sha1 and crc, 728% for sha256, 300-400% for AES ciphers (all 256-byte updates).

But the bcmspu driver is registering asynchronously with higher priority, so arm-ce won't displace it. Doing rmmod bcmspu may still be worthwhile (at least on 4-core devices, if not all), because arm-ce seems to be always faster than bcmspu, particularly for small blocks. In fact, the bcmspu is slower than the generic code for 256-byte blocks.

So for asynchronous operations, potential speedup by disabling bcmspu is ~800% for sha1 and sha256, 2500-3000% for AES ciphers (256-byte updates), or 300-600% for AES ciphers with 8192-byte updates.

Now, I don't know how much this will affect the system. But if anybody is using the kernel crypto then this seems like a potential easy win. And if nobody's using it, why are we bothering to load the bcmspu module? (I tried using a debug print to get bcmspu's usage stats, but it just crashed the system.)

Killing bcmspu might be a hit for things that aren't accelerated, like sha512. At least with small blocks, the arm generic was marginally faster than bcmspu, but no doubt bcmspu would catch up on larger blocks. You could conceivably fudge the arm-ce priorities to make sure the system favours them over bcmspu, leaving bcmspu to handle things arm-ce can't do.
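
A toy model of the selection behaviour described above, as a sketch only: the kernel picks the highest-priority implementation of an algorithm, and callers that need a synchronous implementation cannot be given an asynchronous one. The `Impl` class, the driver labels and all priority numbers here are illustrative assumptions, not values read from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impl:
    algo: str        # algorithm name, e.g. "sha256"
    driver: str      # illustrative label, not a real kernel driver name
    priority: int    # higher wins
    is_async: bool   # True if the implementation is asynchronous-only


def select(registry, algo, sync_only):
    """Return the highest-priority implementation of `algo`.

    Callers that need a synchronous implementation (sync_only=True)
    never see asynchronous ones.
    """
    candidates = [
        impl for impl in registry
        if impl.algo == algo and not (sync_only and impl.is_async)
    ]
    return max(candidates, key=lambda impl: impl.priority, default=None)


def with_priority(registry, driver, priority):
    """Return a copy of the registry with one driver's priority changed."""
    return [
        Impl(i.algo, i.driver, priority, i.is_async) if i.driver == driver else i
        for i in registry
    ]


# Assumed, made-up registry: generic C code, arm-ce and bcmspu.
REGISTRY = [
    Impl("sha256", "generic", 100, False),
    Impl("sha256", "arm-ce", 300, False),
    Impl("sha256", "bcmspu", 400, True),
    Impl("sha512", "generic", 100, False),
    Impl("sha512", "bcmspu", 400, True),
]

print(select(REGISTRY, "sha256", sync_only=True).driver)   # arm-ce
print(select(REGISTRY, "sha256", sync_only=False).driver)  # bcmspu

# "Fudging" arm-ce above bcmspu: arm-ce wins where it exists,
# bcmspu still handles sha512.
boosted = with_priority(REGISTRY, "arm-ce", 500)
print(select(boosted, "sha256", sync_only=False).driver)   # arm-ce
print(select(boosted, "sha512", sync_only=False).driver)   # bcmspu
```

In this model, removing the arm-ce rows reproduces the current situation: synchronous sha256 requests fall back to the generic entry.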

In principle you should be able to disable some of the generic implementations to save code size, but it seems they're selected in a bunch of places, so I gave up on that.

Enable accelerated implementations of AES, AES ECB/CBC/CTR/XTS,
CRC32C, GHASH, SHA1, SHA256.
@RMerl
Owner

RMerl commented Jan 13, 2022

Back in the day I ran benchmarks with Strongswan. bcmspu was faster than the ARM64 ciphers, one of the reasons being that bcmspu frees up the CPU to handle routing/NAT.

RT-AC86U (384.4 alpha 2)

[292] local 10.10.10.1 port 2754 connected with 192.168.1.51 port 5001
[ ID] Interval       Transfer     Bandwidth
[292]  0.0-30.0 sec  1.07 GBytes    307 Mbits/sec



BCMSPU + AF_ALG:
[292] local 10.10.10.1 port 12302 connected with 192.168.1.51 port 5001
[ ID] Interval       Transfer     Bandwidth
[292]  0.0-30.0 sec  1.04 GBytes    297 Mbits/sec


AF_ALG + AARCH64 modules
[292] local 10.10.10.1 port 1715 connected with 192.168.1.51 port 5001
[ ID] Interval       Transfer     Bandwidth
[292]  0.0-30.0 sec    835 MBytes    233 Mbits/sec

@kjbracey2
Contributor Author

kjbracey2 commented Jan 13, 2022

That somewhat surprises me. Do you know what sort of ops that would have been performing?

The performance difference mainly depends on the block size - it appears that (from within the kernel) bcmspu has a cap of about 70,000 operations per second on its AES, but a peak throughput of ~200MByte/s, so you need to be feeding it maybe 4K blocks or bigger to saturate it.

I would have expected the 4-core devices to not be managing to fully load the 4 ARMs on network tasks (is it that well parallelised?) so there would be 1 A53 available for crypto work. The arm-ce driver has an AES throughput of ~1GByte/s, and can do 30 million small ops per second, so if one's available, that's at least 5x faster, possibly more.
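
To make the block-size argument concrete, here is a rough two-parameter model, as a sketch only: throughput is taken to be the smaller of (operations per second × block size) and the peak rate. The figures are the approximate ones quoted above; treating 1 MByte as 10^6 bytes and the `min()` model itself are simplifying assumptions.

```python
def throughput(block_bytes, ops_per_sec, peak_bytes_per_sec):
    """Bytes/s: limited by op rate for small blocks, by peak rate for large ones."""
    return min(block_bytes * ops_per_sec, peak_bytes_per_sec)


def saturation_block(ops_per_sec, peak_bytes_per_sec):
    """Smallest block size (bytes) at which the peak rate is reached."""
    return peak_bytes_per_sec / ops_per_sec


# Approximate figures from the comment above.
BCMSPU = dict(ops_per_sec=70_000, peak_bytes_per_sec=200e6)
ARM_CE = dict(ops_per_sec=30e6, peak_bytes_per_sec=1e9)

print(f"bcmspu saturates at about {saturation_block(**BCMSPU):.0f} bytes per op")

for block in (64, 256, 1024, 4096, 8192):
    spu = throughput(block, **BCMSPU)
    ce = throughput(block, **ARM_CE)
    print(f"{block:5d} B: bcmspu {spu / 1e6:7.1f} MB/s, arm-ce {ce / 1e6:7.1f} MB/s")
```

Under these assumptions the break-even block for bcmspu is a little under 3000 bytes, which is in line with the "4K blocks or bigger" estimate, and the large-block ratio is the 5x mentioned above. The small-block ratios from this model are only an upper bound, since real per-op overheads are not modelled.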

Anyway, this patch on its own won't make anything worse, because the bcmspu drivers are all higher-priority (400 or 1500, compared to arm-ce's 300). arm-ce can only take over from the generic CPU drivers (which will currently be handling synchronous ops, crc32, any ghash bcmspu doesn't cover, and any times bcmspu says "I'm busy"). But having it in will permit easy experimentation by killing and inserting bcmspu.
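
For anyone who wants to check which driver would win on their own device, a small sketch that parses `/proc/crypto`-style text and ranks the implementations of each algorithm by priority. The `SAMPLE` text is made up for illustration (driver and module names are hypothetical, and the priorities are just the ones mentioned in this thread); on a router you would pass in the contents of `/proc/crypto` instead.

```python
from collections import defaultdict

# Made-up sample in /proc/crypto layout; names and values are illustrative only.
SAMPLE = """\
name         : cbc(aes)
driver       : example-spu-cbc-aes
module       : example_spu
priority     : 400
type         : ablkcipher
async        : yes

name         : cbc(aes)
driver       : example-ce-cbc-aes
module       : example_ce
priority     : 300
type         : blkcipher

name         : cbc(aes)
driver       : example-generic-cbc-aes
module       : kernel
priority     : 100
type         : blkcipher
"""


def parse_proc_crypto(text):
    """Parse blank-line-separated 'key : value' blocks into a list of dicts."""
    entries = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if "name" in entry:
            entries.append(entry)
    return entries


def by_priority(entries):
    """Map algorithm name -> [(priority, driver, async)], highest priority first."""
    table = defaultdict(list)
    for e in entries:
        row = (int(e.get("priority", 0)), e.get("driver", "?"), e.get("async", "no"))
        table[e["name"]].append(row)
    return {name: sorted(rows, reverse=True) for name, rows in table.items()}


for name, rows in by_priority(parse_proc_crypto(SAMPLE)).items():
    print(name)
    for priority, driver, is_async in rows:
        print(f"  {priority:5d}  {driver}  (async: {is_async})")
```

To run it against a live system, replace `SAMPLE` with `open("/proc/crypto").read()`.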

@RMerl
Owner

RMerl commented Jan 14, 2022

> Do you know what sort of ops that would have been performing?

I was running an iperf test through an IPSEC tunnel between my desktop and an RT-AC86U.

> I would have expected the 4-core devices to not be managing to fully load the 4 ARMs on network tasks (is it that well parallelised?) so there would be 1 A53 available for crypto work.

The RT-AC86U used in that early test is only dual-core.

Unless there's something actually using these ciphers, I see no real reason to add them to the kernel. The Broadcom software stack uses their own implementation (through libbcmcrypto.so), and Strongswan is getting better performance through bcmspu in these tests I did back in the day when investigating ARM64 ciphers vs bcmspu.

@kjbracey2
Contributor Author

I'll try to reproduce those iperf/IPSEC tests. I'm moderately confident that the results might be reversed with 4 cores.

There's another factor I need to investigate, which is that these early 4.1 arm-ce drivers are a bit system-unfriendly - they don't yield as they should and will block pre-emption for the entire operation duration. I don't know if that would impact that particular benchmark though, which I would expect to be using small blocks. But then if it was, the bcmspu wouldn't be so fast...

And if nobody is using these ciphers, then we shouldn't be activating the generic ones or loading bcmspu. I'll try again to get the bcmspu stats out - like I said, last time I tried printing them (via the sysfs file you named in some previous SNB thread), it just crashed.

I'll carry on fiddling away - you can park this for now.

@kjbracey2
Contributor Author

Well, the IPSec speed test looks like a whole project. Spent a couple of hours trying to figure out how to get a tunnel working, and not managed it so far.

Couldn't figure out how to set it up from Ubuntu - Strongswan documentation is all over the place.

Think I've configured Windows, but connection attempts just time out. Router logs show it isn't rejecting it - it just times out the half-open connection with lack of response.

Is it because I'm trying to do it from inside my LAN? Does something firewall-y stop the handshake working from inside?

Kernel hackery is easier than this... Will have to get back to it on Monday.

@RMerl
Owner

RMerl commented Jan 14, 2022

Might be because you are doing it within the LAN (which would need to go through the NAT loopback - not sure IPSEC is able to do that). I do my tests here by connecting to the target router's WAN interface (as it sits within my LAN).

@kjbracey2
Contributor Author

Got it working LAN side with Ubuntu - the trick is to install the network-manager-strongswan package, which adds a simple VPN to the GUI. So I can do iperf3 tests now.

In terms of operations, it appears that the router only supports authenc(hmac(sha256),cbc(aes)) (128-bit AES), which simplifies things.

It looks like we're not currently throughput limited by the crypto operations here - there's some other factor causing a 250-300Mbit/s cap, so it doesn't at the moment matter much what crypto driver is in use. The bcmspu is somewhat above that speed - maybe 600Mbit/s with 1K blocks - I'd need to retest. The arm-ce can do at least 3Gbit/s, when properly selected. But the actual IPsec doesn't want to go above 300Mbit/s, no matter what.

There's an easily fixed priority glitch in 4.1 which means it often fails to select aes-cbc-ce. Fixing that triples crypto test speed, up to a peak of 8Gbit/s on large blocks, but it doesn't noticeably affect the strongswan throughput.

It appears the IPsec is running at 100% of 1 CPU, regardless of crypto driver, but it must be mainly non-crypto faffing. 3 CPUs are doing nothing. And that's the same with bcmspu - still 100% CPU load.
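
A back-of-envelope check of why the crypto driver barely matters here, as a sketch under stated assumptions: one CPU is saturated, the crypto runs inline on that same CPU, and the quoted arm-ce rates hold at the packet sizes IPsec actually uses. None of that has been measured; the function names are just for illustration.

```python
def crypto_fraction(link_bits_per_sec, crypto_bits_per_sec):
    """Fraction of the saturated CPU's time spent in crypto, if run inline."""
    return link_bits_per_sec / crypto_bits_per_sec


def max_speedup_if_crypto_free(fraction):
    """Amdahl's-law limit: overall speedup if the crypto part cost nothing."""
    return 1.0 / (1.0 - fraction)


OBSERVED = 300e6  # approximate IPsec cap seen in the tests above, bits/s

for label, crypto_rate in (("arm-ce at 3 Gbit/s", 3e9), ("arm-ce at 8 Gbit/s", 8e9)):
    f = crypto_fraction(OBSERVED, crypto_rate)
    ceiling = OBSERVED * max_speedup_if_crypto_free(f)
    print(f"{label}: crypto ~{f:.1%} of CPU time, "
          f"ceiling with free crypto ~{ceiling / 1e6:.0f} Mbit/s")
```

With these numbers the crypto accounts for roughly a tenth of the CPU time or less, so even infinitely fast crypto would move the cap by only a few tens of Mbit/s. That fits the observation that the remaining ~90% is non-crypto work.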

Going to keep researching.

@kjbracey2
Contributor Author

Hang on, doesn't the router have a 300Mbit/s or so performance limit when hardware acceleration isn't operational due to QoS? Is that what we're simply bumping into - our network data path is already slower than either ARM or bcmspu crypto's theoretical throughput?

I am able to get 900Mbit/s iperf TCP tests between LAN PC and router with the VPN disconnected. Does that mean the hardware acceleration is working then? I kind of assumed the fact I'd enabled Cake QoS would kill it totally, but does it only kill it on the WAN interface? (Using ifb probably helps there - there's no qdisc attached to br0 at all).

I'm not actually familiar with what the hardware acceleration consists of and what would be relevant on a router<->LAN iperf. If a large chunk of it is the TCP segmentation offload, then IPSec immediately defeats that...

@RMerl
Owner

RMerl commented Jan 17, 2022

The current high-end HND models can reach around 350 Mbps of NAT throughput when HW acceleration is disabled or bypassed. It's possible that IPSEC might bypass it (I don't know), you'd have to monitor CPU usage to see if strongswan is capping the CPU, or something else (like IO) is.

It's quite possible that it is indeed a bottleneck. Wireguard requires HW acceleration to be disabled to work properly, and I believe its throughput caps at around 300 Mbps as well.

When I discussed IPSEC performance with an Asus engineer a few years ago, they indicated that CPU usage was an important factor, which was one of the things we both monitored when comparing bcmspu vs kernel ciphers.

Here are all the results that I kept in my Onenote notebook from the tests at the time:

https://1drv.ms/u/s!AuCcWdNeYuXMgaBEwu3XkBkA7EsAlA?e=kSJwoL

One thing to note: if my memory is correct, I ran the test between a WAN side client (that ran the IPSEC Client) and a LAN side client (that was behind the router's IPSEC server). It's been a few years however. This is to ensure that the iperf remote doesn't become a bottleneck (which it would if you ran the iperf server on the router itself).

@kjbracey2
Contributor Author

> you'd have to monitor CPU usage to see if strongswan is capping the CPU, or something else (like IO) is.

It was clear that all tests, whether using arm-ce or bcmspu, were hitting pretty much exactly 100% of 1 CPU according to htop (or the equivalent spread over a few). I'm still figuring out what my profiling options are to actually isolate that. Suggestions welcome!

> if my memory is correct, I ran the test between a WAN side client (that ran the IPSEC Client) and a LAN side client (that was behind the router's IPSEC server). It's been a few years however. This is to ensure that the iperf remote doesn't become a bottleneck (which it would if you ran the iperf server on the router itself).

Doesn't look like that's the case - as I said, I can do 900Mbit/s (basically line rate) with an iperf server on the router and no IPSec. I haven't actually looked at the cpu usage of that for comparison...

I'll have a dig through your notes. Thanks!

@JackMerlin
Contributor

Gosh, this is a great PR. If it works as expected, I'm sure it will be a huge change. We're always stuck in Broadcom's walled garden. I'm looking forward to testing more; maybe you can compile some betas for distribution on the SNB forums. Looking forward to this change being merged into Merlin soon.

@guidomedina

guidomedina commented Feb 27, 2022

I'm wondering if this affects (in a good way) the performance of things like standard WiFi AES encryption and AiMesh. I have a couple of RT-AC86U units interconnected via AiMesh, so this could be very interesting.

@JackMerlin
Contributor

> I'm wondering if this affects (in a good way) the performance of things like standard WiFi AES encryption and AiMesh. I have a couple of RT-AC86U units interconnected via AiMesh, so this could be very interesting.

No, WiFi comes from coprocessors that already have hardware acceleration.

@RMerl RMerl force-pushed the master branch 2 times, most recently from b4d0ac1 to 42dc10f Compare March 23, 2022 19:20