RFE: seccomp_rule_add became very slow since v2.4.0 #153

Closed

varqox opened this issue May 1, 2019 · 23 comments

@varqox

varqox commented May 1, 2019

Hello,
the issue was introduced by commit ce3dda9, between v2.3.3 and v2.4.0. Consider the following example: foo.c.zip.
It adds a very large number of rules and runs around 100 times slower after the aforementioned commit.

foo.c execution time using v2.4.1: 0.448 s
foo.c execution time using v2.3.3: 0.077 s

I dug in a little and found that db_col_transaction_start() copies the existing filter collection and uses arch_filter_rule_add() to duplicate the filter rules. But arch_filter_rule_add() calls arch_syscall_translate(), which calls arch_syscall_resolve_name(), which runs in O(number of syscalls on the given architecture). So adding a single rule takes at least O(number of already added rules * number of syscalls on the used architectures), which IMO is really bad.
I counted the number of calls to arch_filter_rule_add() in the above example and it equals 201152.

Before that commit, the number of calls to arch_filter_rule_add() was 896, and from what I understand from the code, db_col_transaction_start() copied the existing filter collection without using arch_filter_rule_add(). That gives an estimate of roughly O(number of already added rules + number of syscalls on the given architecture) per added rule, which is much better.

However, IMO the cost should not depend on the number of already added rules at all, because adding n rules then takes O(n^2). That is a topic for a separate discussion, though, and it should not be a problem for small filters or filters generated infrequently.

Why does this problem matter?
Some filters need the executed program's PID (e.g. to allow the thread to send signals only to itself), so the filter has to be rebuilt for every run. If the restricted program is executed a considerable number of times, this becomes a very visible overhead. I have a filter of around 300 rules, and the libseccomp overhead is around 0.16 s per run of the sandboxed process (I run the process dozens of times).
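For illustration, a minimal sketch of such a PID-dependent rule using the public libseccomp API (a toy example, not my actual filter):

#include <errno.h>
#include <unistd.h>
#include <seccomp.h>

/* Allow kill() only when it targets our own PID; everything else falls
 * through to the default action. Because getpid() is baked into the rule
 * at build time, the filter has to be rebuilt for every sandboxed
 * process, which is why seccomp_rule_add() performance matters here.
 * A real filter would of course allow many more syscalls. */
int build_filter(void)
{
    int rc;
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    if (ctx == NULL)
        return -1;

    rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(kill), 1,
                          SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)getpid()));
    if (rc == 0)
        rc = seccomp_load(ctx);

    seccomp_release(ctx);
    return rc;
}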

Thank you in advance for your help!

@pcmoore pcmoore changed the title seccomp_rule_add became very slow since v2.4.0 RFE: seccomp_rule_add became very slow since v2.4.0 May 1, 2019
@pcmoore
Member

pcmoore commented May 1, 2019

Hi @varqox.

Yes, the syscall resolver functions could use some improvement; in fact, if you look at the code you will see several comments like the following:

/* XXX - plenty of room for future improvement here */

If you want to look at improving that code, we could certainly use the help!
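As a purely hypothetical sketch of one direction (this is not how the current resolver is structured), the linear scan could be replaced by a bsearch() over a table pre-sorted by name:

#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch only: the table layout is illustrative and does not
 * match libseccomp's internal syscall tables. */
struct sc_entry {
    const char *name;
    int num;
};

static int sc_cmp(const void *key, const void *elem)
{
    return strcmp(key, ((const struct sc_entry *)elem)->name);
}

/* O(log n) name -> number lookup over a table of cnt entries sorted by name */
static int resolve_name(const struct sc_entry *tbl, size_t cnt, const char *name)
{
    const struct sc_entry *e = bsearch(name, tbl, cnt, sizeof(*tbl), sc_cmp);

    return e != NULL ? e->num : -1;
}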

@drakenclimber
Member

As @pcmoore mentioned, there are ample opportunities for speeding up the creation of a seccomp filter with libseccomp. Your research above outlines one of several areas that could use improvement. This has not been a concern for my users, so I have not focused on it.

Regarding the runtime performance, I am currently working on using a binary tree for large filters like the one you provided in foo.c. Initial results with my in-house customers are looking promising, but I would love to get another set of eyes on the changes. See pull request #152

@varqox
Author

varqox commented May 1, 2019

OK, I see that syscall resolving could be improved, but it is not the root cause of the issue. The root cause, as I see it, is the creation of a snapshot in db_col_transaction_start(): there arch_filter_rule_add() is called, which is slow because it resolves the syscall again, even though it is already resolved in the original rule.

I see it as follows: we want to duplicate the whole set of current filters (i.e. struct db_filter) with all their rules, yet we construct all the filters from scratch instead of taking advantage of what we already have and simply copying them. We don't have to rebuild from scratch; we have a fully built filter that we just want a copy of. Maybe I missed something, but it looks like db_col_transaction_start() leaves a lot of room for improvement.
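To make the suggestion concrete, here is a rough sketch of the kind of copy I mean (the struct below is made up for illustration and is not libseccomp's actual representation):

#include <stdlib.h>

/* Illustrative only: a rule that already carries its resolved syscall
 * number. Duplicating such a list is O(number of rules) and needs no name
 * resolution at all, unlike re-adding every rule via arch_filter_rule_add(). */
struct rule {
    int syscall_num;            /* already resolved */
    /* ... action, argument comparisons, ... */
    struct rule *next;
};

static struct rule *rule_list_copy(const struct rule *src)
{
    struct rule *head = NULL, **tail = &head;

    for (; src != NULL; src = src->next) {
        struct rule *dup = malloc(sizeof(*dup));
        if (dup == NULL)
            goto err;
        *dup = *src;
        dup->next = NULL;
        *tail = dup;
        tail = &dup->next;
    }
    return head;

err:
    while (head != NULL) {
        struct rule *tmp = head->next;
        free(head);
        head = tmp;
    }
    return NULL;
}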

@pcmoore
Member

pcmoore commented May 2, 2019

With all of the state in the internal libseccomp db collection, duplicating it is not a trivial task; regenerating the collection from the original rules is much easier (from a code perspective). Keeping track of the original rules also allows us to offer the ability to "remove" an existing rule (a possible future feature).

This isn't to say the transaction code couldn't be improved - it definitely can - but the current code is the way it is for a reason, primarily simplicity.

@crosbymichael

We are seeing timeouts from users because of this change. It really slowed things down by an order of magnitude.

@pcmoore
Member

pcmoore commented Aug 26, 2019

Another thought: we can probably change this so that we only duplicate the rules on a transaction start, not the entire tree, and only recreate the tree on a failed transaction. It isn't perfect, but that should get back a good chunk of the time.
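In rough, structural pseudo-C, the idea would look something like this (the types and helper names are stand-ins, not the actual internal API):

struct filter_col;

int  rule_list_snapshot(struct filter_col *col);     /* cheap */
void tree_reset(struct filter_col *col);
int  tree_regen_from_rules(struct filter_col *col);  /* expensive */

int transaction_start(struct filter_col *col)
{
    /* snapshot only the flat rule list, not the whole decision tree */
    return rule_list_snapshot(col);
}

int transaction_abort(struct filter_col *col)
{
    /* only taken when a rule addition fails: discard the modified tree
     * and regenerate it from the snapshotted rules */
    tree_reset(col);
    return tree_regen_from_rules(col);
}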

@pcmoore pcmoore added this to the v2.5 milestone Aug 26, 2019
@crosbymichael

We need to do something: start times for containers and exec processes are seeing a huge performance regression, and this is causing people to pin to 2.3.x.

@pcmoore
Member

pcmoore commented Aug 26, 2019

I'm not going to comment further on the "huge" nature of the problem; that point has been made multiple times already, and I consider it both relative and use-case dependent. However, I did want to remind everyone that libseccomp releases prior to v2.4 are affected by a potential vulnerability which has been made public (issue #139).

For those who are concerned about this issue, it is currently marked for a v2.5 release.

@crosbymichael

You made a refactoring that has "huge" performance impacts in a minor release, and you are not being helpful by blowing this off as use-case dependent. Please take this seriously, as people are going to start noticing as distros update to 2.4.

@pcmoore
Member

pcmoore commented Aug 26, 2019

@crosbymichael the change was not simply a refactoring; it was necessary to fix problems and support changes in the kernel (most notably the need to support both multiplexed and direct-call syscalls, e.g. the socket syscalls on 32-bit x86).

I am not blowing this off; I've continued to think about ways to solve this problem (see my comments above), and I've marked it for the next minor release. At this point it is hard for me not to perceive your comments as inflammatory; if that is not your intention, I suggest taking more care when commenting in the future. If you are unhappy with the progress on this issue, you are always welcome to help out by submitting a patch/PR for review.

@pcmoore
Member

pcmoore commented Oct 22, 2019

Note to self and whoever else is considering trying to solve this ...

I was recently reminded why we do what we do with respect to transactions (copy everything up front): we need to be able to roll back a transaction without failing. Why?
A normal seccomp_rule_add() operation needs to keep the filter intact even in the case of failure; if we fail a multi-part transaction (e.g. socket/ipc syscalls on x86/s390/s390x/etc.) as part of a normal rule addition, we MUST be able to revert to the filter at the start of the transaction without fail (regardless of memory pressure, etc.).
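In other words, every rule addition is effectively wrapped like the sketch below; only db_col_transaction_start()/db_col_transaction_abort() match names used in this thread, while the commit step and the stand-in types/prototypes are illustrative and the code is not buildable against libseccomp as-is:

struct filter_col;      /* stand-in for the internal filter collection */
struct rule_req;        /* stand-in for a pending rule request */

int  db_col_transaction_start(struct filter_col *col);
void db_col_transaction_abort(struct filter_col *col);  /* must never fail */
void transaction_commit(struct filter_col *col);        /* illustrative */
int  add_rule_for_all_arches(struct filter_col *col, const struct rule_req *req);

int rule_add_wrapped(struct filter_col *col, const struct rule_req *req)
{
    int rc;

    /* snapshot the filter so a partial, multi-arch addition can be undone */
    rc = db_col_transaction_start(col);
    if (rc < 0)
        return rc;

    rc = add_rule_for_all_arches(col, req);    /* may fail part-way through */
    if (rc < 0) {
        /* restore the pre-add filter; this path must not be able to fail,
         * regardless of memory pressure */
        db_col_transaction_abort(col);
        return rc;
    }

    transaction_commit(col);
    return 0;
}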

Duplicating the tree without the rules is going to continue to be challenging due to the nature of the tree and the linking inside it, but we might be able to selectively choose when we need to create an internal transaction, skipping it for the many cases where it is not needed.

@pcmoore
Member

pcmoore commented Oct 30, 2019

I spent a little more time looking at this, and because of the way we destructively modify the decision tree during a rule addition, I'm not sure we can avoid wrapping rule additions in a transaction. This means that instead of finding ways to limit our use of transactions internally, we need to find a way to speed them up; thankfully, I think I may have found a solution: shadow trees.

Currently we build a new tree each time we create a new transaction and discard it on success, which as we've seen can be prohibitively slow in some use cases. My thought is that instead of discarding the duplicate tree on commit, we attempt to add the rule we just added to the duplicate tree (making it a copy of the current filter) and keep it as a "shadow transaction" to speed up the next transaction snapshot. Some notes:

  • db_col_transaction_start() should attempt to use the shadow transaction if present, but if not it should fall back to the current behavior.
  • db_col_transaction_abort() should behave the same way as it does now; this means a failed transaction will clear the shadow transaction (it needs the tree to restore the filter), but the next successful transaction will restore the shadow. Failed transactions should be infrequent enough that this shouldn't be a major issue.
  • We may need to clear the shadow transaction on other operations (e.g. arch/ABI ops?), but that is something we'll need to check. Regardless, clearing the shadow transaction should be trivial.
  • This has the advantage of not just speeding up rule adds, but speeding up transactions in general. This may not be significant now, but it will be useful when we expose the transaction functionality to users (which would be necessary if we ever want to do a BSD "pledge"-like mechanism).
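A rough structural sketch of the commit path under this scheme (field and helper names are stand-ins, not real code):

struct snapshot;

struct filter_col_sketch {
    struct snapshot *snapshot;   /* taken at transaction start */
    struct snapshot *shadow;     /* kept around for the next transaction */
};

int  snapshot_replay_last_rule(struct snapshot *snap);
void snapshot_free(struct snapshot *snap);

/* On a successful commit, try to turn the start-of-transaction snapshot
 * into an up-to-date "shadow" copy instead of discarding it. */
void transaction_commit_sketch(struct filter_col_sketch *col)
{
    /* replay the rule that was just added so the snapshot matches the
     * live filter again */
    if (snapshot_replay_last_rule(col->snapshot) == 0) {
        /* keep it: the next transaction start can hand it out directly
         * instead of duplicating the whole tree */
        col->shadow = col->snapshot;
    } else {
        /* fall back to current behavior: drop the snapshot and duplicate
         * from scratch on the next transaction */
        snapshot_free(col->snapshot);
        col->shadow = NULL;
    }
    col->snapshot = NULL;
}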

@pcmoore
Member

pcmoore commented Nov 1, 2019

I had some time after dinner tonight so I made a quick pass at implementing the shadow transaction idea above. The code is still crude, and my tests (below) even cruder, but it looks like we are seeing some performance gains with this approach:

  • Baseline test overhead
# time for i in {0..20000}; do /bin/true; done
real    0m10.479s
user    0m7.641s
sys     0m3.924s
  • Current master branch
# time for i in {0..20000}; do ./42-sim-adv_chains > /dev/null; done

real    0m16.303s
user    0m12.584s
sys     0m4.501s
  • Patched
# time for i in {0..20000}; do ./42-sim-adv_chains > /dev/null; done

real    0m15.021s
user    0m11.540s
sys     0m4.387s

If we subtract out the test overhead, we are looking at a roughly 20% increase in performance on this "test", but I expect the benefit for complex filter sets to be better (much better?) than this.
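(For reference, subtracting the baseline real time: 16.303 − 10.479 ≈ 5.8s for master vs. 15.021 − 10.479 ≈ 4.5s for the patched build over the whole loop, i.e. a bit over 20% less time attributable to the test binary.)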

@varqox and/or @crosbymichael once I clean up the patches a bit and create a PR, would you be able to test this in your environment?

@varqox
Author

varqox commented Nov 1, 2019

My example test case is already here:

Hello,
the issue was introduced by commit ce3dda9, between v2.3.3 and v2.4.0. Consider the following example: foo.c.zip.
It adds a very large number of rules and runs around 100 times slower after the aforementioned commit.

foo.c execution time using v2.4.1: 0.448 s
foo.c execution time using v2.3.3: 0.077 s

But as soon as the PR is ready, I can test it in my environment.

@pcmoore
Member

pcmoore commented Nov 2, 2019

Hi @varqox, yes I saw that you included a test case in the original report, but I'm more interested in hearing how it performs in real use. If you could try out PR #180 and report back, I would really appreciate it - thanks!

@xinfengliu

Hi @pcmoore ,

Thanks for making this PR.
I built and tested your PR #180, and the result is promising for my test case. I am watching this issue because customers who use docker health-check have suffered from the performance issue in libseccomp 2.4.x.
In my test case, this PR's performance is comparable to libseccomp 2.3.3. Details are below:

Environment

Ubuntu 19.04 VM (2 CPU, 2G memory) on MacBook Pro (15-inch, mid 2015)
Kernel 5.0.0-32-generic
Docker CE 19.03.2

Test case:

Prepare 20 containers:

for i in $(seq 1 20)
do
  docker run -d --name bb$i busybox sleep 3d
done

Run the test by firing docker exec on all containers at the same time:

for i in $(seq 1 20)
do 
  /usr/bin/time -f "%E real" docker exec bb$i true & 
done

Results

libseccomp 2.3.3

0:01.05 real
0:01.12 real
0:01.16 real
0:01.20 real
0:01.23 real
0:01.27 real
0:01.31 real
0:01.35 real
0:01.37 real
0:01.38 real
0:01.40 real
0:01.41 real
0:01.40 real
0:01.40 real
0:01.45 real
0:01.46 real
0:01.47 real
0:01.48 real
0:01.48 real
0:01.49 real

libseccomp 2.4.1

0:00.98 real
0:01.63 real
0:01.67 real
0:01.95 real
0:02.55 real
0:02.70 real
0:02.70 real
0:02.96 real
0:03.04 real
0:03.16 real
0:03.17 real
0:03.21 real
0:03.23 real
0:03.27 real
0:03.24 real
0:03.29 real
0:03.27 real
0:03.29 real
0:03.28 real
0:03.27 real

Your PR build

0:00.95 real
0:01.12 real
0:01.20 real
0:01.23 real
0:01.28 real
0:01.29 real
0:01.31 real
0:01.37 real
0:01.38 real
0:01.40 real
0:01.43 real
0:01.43 real
0:01.44 real
0:01.45 real
0:01.42 real
0:01.47 real
0:01.48 real
0:01.48 real
0:01.48 real
0:01.50 real

Misc notes

  • I set 2.4.1 in AC_INIT in configure.ac before I built this PR.
  • This custom build is installed in /usr/local/lib; I verified with ldd /usr/bin/runc that the custom build is being used during the test.
  • I ran the test a few times and there is very little variance in the results, so I'm confident they are not accidental.

@pcmoore
Member

pcmoore commented Nov 4, 2019

That's great, thanks for the help @xinfengliu!

@varqox
Author

varqox commented Nov 9, 2019

Hi @pcmoore,
Thanks for this PR.
In my case, it restores libseccomp performance to a level comparable with v2.3.3.

Results

foo.c

g++ foo.c -lseccomp -o foo -O3
for ((i=0; i<10; ++i)); do time ./foo; done 

libseccomp 2.3.3

./foo  0.01s user 0.00s system 98% cpu 0.018 total
./foo  0.02s user 0.00s system 98% cpu 0.020 total
./foo  0.02s user 0.00s system 98% cpu 0.019 total
./foo  0.02s user 0.00s system 98% cpu 0.018 total
./foo  0.02s user 0.00s system 98% cpu 0.019 total
./foo  0.02s user 0.00s system 98% cpu 0.019 total
./foo  0.02s user 0.00s system 98% cpu 0.019 total
./foo  0.02s user 0.00s system 98% cpu 0.019 total
./foo  0.02s user 0.00s system 98% cpu 0.018 total
./foo  0.02s user 0.00s system 98% cpu 0.019 total

Average: 0.0188 s

libseccomp 2.4.2

./foo  0.19s user 0.00s system 99% cpu 0.195 total
./foo  0.19s user 0.00s system 99% cpu 0.194 total
./foo  0.19s user 0.00s system 99% cpu 0.193 total
./foo  0.19s user 0.00s system 99% cpu 0.196 total
./foo  0.19s user 0.00s system 99% cpu 0.195 total
./foo  0.20s user 0.00s system 99% cpu 0.196 total
./foo  0.19s user 0.00s system 99% cpu 0.194 total
./foo  0.20s user 0.00s system 99% cpu 0.197 total
./foo  0.19s user 0.00s system 99% cpu 0.195 total
./foo  0.19s user 0.00s system 99% cpu 0.194 total

Average: 0.1949 s

PR #180

./foo  0.01s user 0.01s system 98% cpu 0.012 total
./foo  0.01s user 0.00s system 97% cpu 0.013 total
./foo  0.01s user 0.00s system 96% cpu 0.013 total
./foo  0.01s user 0.01s system 97% cpu 0.014 total
./foo  0.01s user 0.00s system 97% cpu 0.012 total
./foo  0.01s user 0.00s system 98% cpu 0.013 total
./foo  0.01s user 0.00s system 98% cpu 0.012 total
./foo  0.01s user 0.00s system 98% cpu 0.013 total
./foo  0.01s user 0.00s system 97% cpu 0.013 total
./foo  0.01s user 0.00s system 97% cpu 0.011 total

Average: 0.0126 s

It seems like this PR gives some speedup over v2.3.3 in this synthetic test.

Building and loading (from seccomp_init() to seccomp_load()) filters in my sandbox plus some sandbox initialization overhead

libseccomp 2.3.3

Measured: 0.0052 s
Measured: 0.0040 s
Measured: 0.0046 s
Measured: 0.0042 s
Measured: 0.0038 s
Measured: 0.0038 s
Measured: 0.0039 s
Measured: 0.0036 s
Measured: 0.0042 s
Measured: 0.0044 s
Measured: 0.0036 s
Measured: 0.0037 s
Measured: 0.0044 s
Measured: 0.0035 s
Measured: 0.0035 s
Measured: 0.0035 s
Measured: 0.0040 s
Measured: 0.0037 s
Measured: 0.0043 s
Measured: 0.0042 s
Measured: 0.0035 s
Measured: 0.0034 s
Measured: 0.0038 s
Measured: 0.0035 s
Measured: 0.0035 s
Measured: 0.0037 s
Measured: 0.0038 s

Average: 0.0039 s

libseccomp 2.4.2

Measured: 0.0496 s
Measured: 0.0480 s
Measured: 0.0474 s
Measured: 0.0475 s
Measured: 0.0479 s
Measured: 0.0479 s
Measured: 0.0492 s
Measured: 0.0485 s
Measured: 0.0491 s
Measured: 0.0490 s
Measured: 0.0484 s
Measured: 0.0483 s
Measured: 0.0480 s
Measured: 0.0482 s
Measured: 0.0474 s
Measured: 0.0483 s
Measured: 0.0507 s
Measured: 0.0472 s
Measured: 0.0482 s
Measured: 0.0471 s
Measured: 0.0498 s
Measured: 0.0489 s
Measured: 0.0474 s
Measured: 0.0494 s
Measured: 0.0483 s
Measured: 0.0498 s
Measured: 0.0492 s

Average: 0.0466 s

PR #180

Measured: 0.0058 s
Measured: 0.0059 s
Measured: 0.0054 s
Measured: 0.0046 s
Measured: 0.0059 s
Measured: 0.0048 s
Measured: 0.0045 s
Measured: 0.0051 s
Measured: 0.0052 s
Measured: 0.0053 s
Measured: 0.0048 s
Measured: 0.0048 s
Measured: 0.0045 s
Measured: 0.0044 s
Measured: 0.0044 s
Measured: 0.0059 s
Measured: 0.0044 s
Measured: 0.0046 s
Measured: 0.0046 s
Measured: 0.0044 s
Measured: 0.0044 s
Measured: 0.0062 s
Measured: 0.0047 s
Measured: 0.0044 s
Measured: 0.0044 s
Measured: 0.0044 s
Measured: 0.0044 s

Average: 0.0049 s

Although the PR gives better times than v2.3.3 in the synthetic test, in the real-world case it is slightly slower (possibly because of more complicated rules and because seccomp_merge() is run to merge two big filters). However, it still gives a roughly tenfold speedup over v2.4.2.
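For context, the merge step looks roughly like the sketch below (simplified, not my actual sandbox code; the real contexts carry far more rules, and an x86_64 native host is assumed):

#include <errno.h>
#include <seccomp.h>

/* Sketch of merging two filters with seccomp_merge(); the two contexts
 * must not share any architectures, so the second one drops the native
 * ABI and adds x86 before being merged in. */
int build_merged_filter(void)
{
    int rc = -1;
    scmp_filter_ctx ctx_native, ctx_x86;

    ctx_native = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    ctx_x86 = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    if (ctx_native == NULL || ctx_x86 == NULL)
        goto out;

    rc = seccomp_arch_remove(ctx_x86, SCMP_ARCH_NATIVE);
    if (rc < 0)
        goto out;
    rc = seccomp_arch_add(ctx_x86, SCMP_ARCH_X86);
    if (rc < 0)
        goto out;

    /* ... seccomp_rule_add() the (large) rule sets into both contexts ... */

    /* move everything from ctx_x86 into ctx_native; on success the source
     * context is released by libseccomp itself */
    rc = seccomp_merge(ctx_native, ctx_x86);
    if (rc == 0)
        ctx_x86 = NULL;

out:
    if (ctx_x86 != NULL)
        seccomp_release(ctx_x86);
    if (ctx_native != NULL)
        seccomp_release(ctx_native);
    return rc;
}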

@pcmoore
Member

pcmoore commented Nov 11, 2019

Thanks for verifying the performance @varqox! As soon as @drakenclimber responds to the last round of comments (and I fix any remaining issues he may bring up), we'll get this merged.

@pcmoore
Member

pcmoore commented Nov 11, 2019

Ah, never mind, I just noticed that @drakenclimber did mark the PR as approved. I'm going to go ahead and merge it now.

@pcmoore
Member

pcmoore commented Nov 11, 2019

I just merged PR #180, so I think we can mark this as closed; if anyone notices any remaining performance problems, feel free to comment and/or re-open. Thanks everyone for your patience and help!

@pcmoore pcmoore closed this as completed Nov 11, 2019
@yulianedyalkova

@pcmoore do you plan on releasing soon with those changes?

@pcmoore
Member

pcmoore commented Nov 15, 2019

This is currently part of the libseccomp v2.5 release milestone; you can track our progress towards the v2.5 release via the milestone page on GitHub.
