pmdabpf sets a too-low limit for rlimit #1915

Open
jasonk000 opened this issue Mar 13, 2024 · 6 comments

@jasonk000 (Contributor) commented Mar 13, 2024

Making the same mistake as "640KB ought to be enough for anybody", I picked a 100MB rlimit setting during bpf.c setup. Now, when another (non-PCP) bpf application is also loaded, the two conflict.

Proposal:

  • If the existing memlock rlimit value is greater than the 100MB target, do not try to change it (see the sketch after this list).
  • Alternatively, we could simply ignore rlimit, document the failure mode, and push it to user responsibility (eg: a systemd unit).
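
A minimal sketch of the first option, reusing the WANT_MEM and bpf_setrlimit names from bpf.c (the exact logic is illustrative, not a final patch):

#include <sys/resource.h>

#define WANT_MEM (100LL*1024*1024)

void bpf_setrlimit(void)
{
    struct rlimit rnew;

    if (getrlimit(RLIMIT_MEMLOCK, &rnew) < 0)
        return;                                /* leave the limit alone on error */
    if (rnew.rlim_cur == RLIM_INFINITY || rnew.rlim_cur >= (rlim_t)WANT_MEM)
        return;                                /* already big enough: never shrink it */
    rnew.rlim_cur = WANT_MEM;
    if (rnew.rlim_max != RLIM_INFINITY && rnew.rlim_max < (rlim_t)WANT_MEM)
        rnew.rlim_cur = rnew.rlim_max;         /* soft limit cannot exceed the hard limit */
    setrlimit(RLIMIT_MEMLOCK, &rnew);          /* best effort; failure is non-fatal here */
}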

Reproduce:

  • Start other bpf applications that use > 100MB of space
  • Set ulimit -l unlimited for the root user
  • Start dbpmda and open the pmda
    • The PMDA sets rlimit 100MB
  • dbpmda fails to start

With another program already running and consuming ~350MB of maps and progs:

(root) /home/user # ulimit -l unlimited

(root) /home/user # dbpmda
dbpmda> open dso /usr/lib/pcp/pmdas/bpf/pmda_bpf.so bpf_init 157
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading caches
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading configuration: /var/lib/pcp/pmdas/bpf/bpf.conf
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loaded configuration: /var/lib/pcp/pmdas/bpf/bpf.conf
[Wed Mar 13 18:34:52] dbpmda(4849) Info: setrlimit RMLIMIT_MEMLOCK ok
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading modules
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading modules
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading runqlat.so
[Wed Mar 13 18:34:52] dbpmda(4849) Info: loading: runqlat.so from /var/lib/pcp/pmdas/bpf/modules/runqlat.so
[Wed Mar 13 18:34:52] dbpmda(4849) Info: initialising runqlat.so
[Wed Mar 13 18:34:52] dbpmda(4849) Info: booting: runqlat_bpf
[Wed Mar 13 18:34:52] dbpmda(4849) Warning: libbpf: Error in bpf_object__probe_loading():Operation not permitted(1). Couldn't load trivial BPF program. Make sure your kernel supports BPF (CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough value.
[Wed Mar 13 18:34:52] dbpmda(4849) Warning: libbpf: failed to load object 'runqlat_bpf'
[Wed Mar 13 18:34:52] dbpmda(4849) Warning: libbpf: failed to load BPF skeleton 'runqlat_bpf': -1
[Wed Mar 13 18:34:52] dbpmda(4849) Error: bpf load failed: -1, Operation not permitted
...

In this case, bpftrace also shows the bpf charge failing:

# sudo bpftrace -e 'kretprobe:__bpf_prog_charge /comm == "dbpmda"/ { printf("%d\n", retval); }'
Attaching 1 probe...
-1
-1
-1
-1

because 85322 pages (85322 × 4KB ≈ 350MB) are already counted against the user, and the current limit is 25600 pages (100MB). The rlim_cur >> 12 shift in the probe below converts the byte limit to 4KB pages.

# sudo bpftrace -e 'kprobe:__bpf_prog_charge /comm == "dbpmda"/ { printf("user_struct->locked_vm.counter=%d pages=%d curtask->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 12=%d\n", ((struct user_struct *)arg0)->locked_vm.counter, arg1, (curtask->signal->rlim[8].rlim_cur >> 12)); }'
Attaching 1 probe...
user_struct->locked_vm.counter=85322 pages=1 curtask->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 12=25600
user_struct->locked_vm.counter=85322 pages=1 curtask->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 12=25600
user_struct->locked_vm.counter=85322 pages=1 curtask->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 12=25600
user_struct->locked_vm.counter=85322 pages=1 curtask->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> 12=25600
@jasonk000 (Contributor Author):

For what it's worth, in 5.11+ kernels this accounting moved to the cgroup, so the effect is only apparent on older kernels that also have pmdabpf built (ie: this might be something only we are seeing on legacy OS distributions).

@natoscott (Member):

@jasonk000 what was the original goal of setting the MEMLOCK limit in pmdabpf? (Making sure we can lock down the space we need, but nothing extremely large, I guess?)

Another option might be to look at the physical memory size (sysconf(3) _SC_PHYS_PAGES or _SC_AVPHYS_PAGES) and calculate a lock limit that is a fraction of the machine's physical memory (e.g. no more than a tenth - some heuristic, anyway - perhaps clamped to a minimum of 100MB?).
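
A minimal sketch of that heuristic, assuming a one-tenth fraction and a 100MB floor (both values are placeholders, and memlock_target is a hypothetical helper name):

#include <unistd.h>
#include <sys/resource.h>

static rlim_t memlock_target(void)
{
    long long pages = sysconf(_SC_PHYS_PAGES);      /* total physical pages */
    long long pagesize = sysconf(_SC_PAGESIZE);     /* bytes per page */
    long long floor = 100LL*1024*1024;              /* 100MB minimum */
    long long tenth;

    if (pages < 0 || pagesize < 0)
        return (rlim_t)floor;                       /* sysconf failed: fall back to the floor */
    tenth = (pages * pagesize) / 10;                /* a tenth of physical memory */
    return (rlim_t)(tenth > floor ? tenth : floor);
}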

Another option could be to say we trust ourselves - anyone configuring pmdabpf modules must already have root access - and remove all limits for the pmdabpf process (RLIM_INFINITY?).

@jasonk000 (Contributor Author) commented Mar 14, 2024

Background context: in the original implementation, before these resources were accounted against the cgroup, they were tracked and limited via memlock. This matters for maps, since you can specify arbitrarily sized maps which can be very big (we have some maps defined at over 100MB). Many libbpf users take the setrlimit approach because it gives a clearer error scenario than failing with EPERM (since bpf normally runs as root, a setrlimit to a high enough value will nearly always succeed). With newer kernels (5.11+) this is no longer necessary, since the limiting is applied at the cgroup rather than via rlimit. I don't think there's much value in picking some arbitrary value, because it will always be either too high or too low unless the operator knows the app they are using.

Personally, I can think of a couple of approaches; infinity works. What I am currently doing in a private build is using systemd to set LimitMEMLOCK=infinity on pmcd.service, commenting out the rlimit code, and relying on the operator and systemd to enforce controls (see the drop-in sketch below). I think this is the "right" answer for systemd use cases, but I'm not familiar with other service management platforms.
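
For reference, a sketch of that systemd override, assuming a standard drop-in path (the path and file name are illustrative):

# /etc/systemd/system/pmcd.service.d/10-memlock.conf
[Service]
LimitMEMLOCK=infinity

Applied with systemctl daemon-reload followed by a restart of pmcd.service.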

@natoscott (Member):

OK, should we just change bpf_setrlimit to unilaterally set RLIMIT_MEMLOCK rlim_cur/rlim_max to infinity then? That's simpler than the current code and means no systemd changes are needed for anyone else...

$ git diff .
src/pmdas/bpf/bpf.c
--- /tmp/git-blob-eMifIb/bpf.c  2024-03-14 15:01:22.590938902 +1100
+++ src/pmdas/bpf/bpf.c 2024-03-14 15:01:14.394923471 +1100
@@ -127,33 +127,19 @@ int bpf_printfn(enum libbpf_print_level
     return 0;
 }
 
-#define WANT_MEM       (100LL*1024*1024)
 /**
- * setrlimit required for BPF loading ... try for WANT_MEM, but will
- * accept whatever we can get
+ * setrlimit required for BPF loading ... try for infinite memlocking
+ * as we have no idea how much we will really need (it depends on all
+ * loaded eBPF programs).
  */
 void bpf_setrlimit()
 {
-    struct rlimit      rnew;
-    int                        ret;
-    ret = getrlimit(RLIMIT_MEMLOCK, &rnew);
-    if (ret < 0) {
-        pmNotifyErr(LOG_ERR, "bpf_setrlimit: getrlimit RMLIMIT_MEMLOCK failed: %s", pmErrStr(-errno));
-       return;
-    }
-    if (rnew.rlim_max == RLIM_INFINITY)
-       rnew.rlim_cur = rnew.rlim_max = WANT_MEM;
-    else if (rnew.rlim_max > WANT_MEM)
-       rnew.rlim_cur = rnew.rlim_max = WANT_MEM;
-    else {
-       rnew.rlim_cur = rnew.rlim_max;
-        pmNotifyErr(LOG_INFO, "bpf_setrlimit: setrlimit RMLIMIT_MEMLOCK %lld not %lld", (long long)rnew.rlim_cur, WANT_MEM);
-    }
-    ret = setrlimit(RLIMIT_MEMLOCK, &rnew);
-    if (ret == 0)
+    struct rlimit      rnew = { RLIM_INFINITY, RLIM_INFINITY };
+
+    if (setrlimit(RLIMIT_MEMLOCK, &rnew) == 0)
         pmNotifyErr(LOG_INFO, "setrlimit RMLIMIT_MEMLOCK ok");
     else
-        pmNotifyErr(LOG_ERR, "setrlimit RMLIMIT_MEMLOCK (%lld,%lld) failed: %s", (long long)rnew.rlim_cur, (long long)rnew.rlim_max, pmErrStr(-errno));
+        pmNotifyErr(LOG_ERR, "setrlimit RMLIMIT_MEMLOCK (infinity) failed: %s", pmErrStr(-errno));
 }
 
 /**

@jasonk000 (Contributor Author):

I think that works too, so long as we do not bail on error from setrlimit(). I do wonder how many supported environments are building pmdabpf for kernels older than 5.11. It might be better to treat this as dead code and remove it (which is why I leaned towards removing the code plus a systemd unit: less to maintain).

@natoscott (Member):

OK, removing it entirely works for me too - especially if we can make the systemd unit conditional on the kernel version, or even put the responsibility for managing it onto the user entirely (maybe we could just document it in the man page?).
