Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bpftrace for older kernels: Fallback to no func_infos, no BTF #3149

Merged

Conversation

bernhardkaindl
Copy link
Contributor

@bernhardkaindl bernhardkaindl commented May 4, 2024

Fix #3011:

In #2913 (comment), @viktormalik said:

So the problem is that the older kernel doesn't cope well with BTF and func_infos.

He detailed the problem in #2913 (comment):

So, I was able to find the problem - older kernels do not have torvalds/linux@51c39bb and therefore do not support BTF_KIND_FUNC entries with global linkage. Libbpf sanitizes this for older kernels since torvalds/linux@2d3eb67 (same series) but it only does that when the program is loaded via bpf_object (which we don't at this point).

He implemented an initial interim fix in #2934 (as loaded via bpf_object) might be a bigger change, but #3011 shows that this is not enough. In it, @giorio94 bisected that #2804 causes bpftrace to fail on all kernels that do not support the types that LLVM now generates. And func_infos fail with 4.19 fail too.

I now found that the issue is that current code skips loading the BPF program when loading the BTF into the kernel failed:

// Don't attempt to load the program if the BTF load failed.

        // Don't attempt to load the program if the BTF load failed.
        // This will fall back to the error handling for failed program load,
        // which is more robust.

But there is no such fall back to handle this error yet, so add it. And, I found:

  • As @viktormalik said, I also found that older kernels don't cope well with func_infos, so fall back no func_infos relocation on the 2nd attempt.
  • I was able to fix loading the BTF load on for my setup, but 4.19 still rejects loading the BPF program, so even if loading the BTF succeeds, loading the BPF it does not work, fall back to not loading the BTF (as a last resort) in the 3rd attempt.

These changes fix bpftrace to work on 4.19 again.

I plan also submit my BTF fixup changes to fix BTF load on 4.19, they add converting the kinds FUNC, FUNC_PROTO, VAR, and DATASEC. But as said, these only fix BTF load but passing the btf_fd for the loaded BTF still makes the BPF load fail on 4.19.

For now, this is the smallest change and only adds fallbacks for older kernels, that have issues with BTF and func_infos so it does not affect newer kernels at all.

Of course, the best solution is to load everything by using libbpf as bpf_object, but that is a bigger change.

And, I'd also like showing the error code and error message when BPF load fails (added as well).

@danobi
Copy link
Member

danobi commented May 5, 2024

I think to think more deeply about edge cases here, but on the surface this seems reasonable in isolation.

However, I think @viktormalik is/has been working a lot on the libbpf migration. If it's on-track to land this release cycle, then it'd probably be better to wait for that (given this PR or the libbpf migration will ship at the same time). Viktor, can you comment?

@viktormalik
Copy link
Contributor

I think to think more deeply about edge cases here, but on the surface this seems reasonable in isolation.

However, I think @viktormalik is/has been working a lot on the libbpf migration. If it's on-track to land this release cycle, then it'd probably be better to wait for that (given this PR or the libbpf migration will ship at the same time). Viktor, can you comment?

Yes, I've a working implementation now and will start posting PRs. It's a lot of code so I'll need to split it into multiple parts. See #2334 for a preliminary plan of updates.

I'm not sure if it's a good idea to land this in this release cycle, though. Since it's quite a disruptive change, I'd prefer having it in unreleased state to have as many people use it in their workflows as possible. On top of that, we haven't done a release in a while and already have a nice set of unreleased features, so maybe let's do a release soon and plan this for the next one?

@bernhardkaindl
Copy link
Contributor Author

bernhardkaindl commented May 7, 2024

Thanks @viktormalik for the insight and sharing your plans for having it in unreleased state to have as many people use it in their workflows as possible and rather do a release, already having a nice set of unreleased features without your big disruptive change soon.

That would fall right in line with my use case: I'd like to first integrate a stable release of bpftrace that works across all supported kernels, including 4.19. The reason is that 4.19 is used in production, where bpftrace would be very useful to have. If the bpftrace upstream master branch works with 4.19, I can deploy it directly and can contribute to the bpftrace master branch code base with PRs as I'm able to address any findings that I'd see directly on master. If the bpftrace master branch takes the stance that it does not wish to integrate fixes that allow me and my users to use it on our production environments running 4.19, then I'm faced with having to create a bpftrace-4.19-compat fork for tracing in the actively used production environment. Of course, I'd rather prefer to "live" on master if I can use it directly!

Also, having bpftrace master work on 4.19 would be a decisive signal as it shows that there is a future outlook that working within the upstream community is viable and beneficial (I hope in both directions) to share the work. If the decision is going to delay and instead to only support 4.19 when it comes for free with using libbpf objects, it says that bpftrace upstream only tracks if it works on 4.19 or not, but does not apply fixes for it. That would send a signal to rather stay on the bpftrace-0.19.1 branch that supports 4.19 rather than on master, and any upstream work would unfortunately not be received, used or tested by my users.

Please see this just as a user of bpftrace sharing his use case, not more! I'm merely asking for a decision to be made in order to make my plans for using a branch or fork in 0.19.1 or work directly on upstream master regarding me and my users. (@ajor)

@giorio94 reacted with (heart) to this PR to fix #3011, so I guess it would be very welcome for him too.

Copy link
Contributor

@viktormalik viktormalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it would generally be useful to have a support for the 4.19 stable kernel in master (and the next release). Then, when we switch to libbpf-based loading which should (theoretically) be backwards compatible, we'll have a baseline to compare with.

The PR itself LGTM, just a few comments below.

Also, we historically used the loading "attempts" for older kernels which required the kernel version (and we had 3 different ways of getting it). I'm wondering if this change could potentially conflict with that.

src/attached_probe.cpp Outdated Show resolved Hide resolved
src/attached_probe.cpp Outdated Show resolved Hide resolved
@bernhardkaindl bernhardkaindl force-pushed the fallback-to-no-func_info-no-btf branch 2 times, most recently from 4567cfd to fdb9ff6 Compare May 10, 2024 01:51
@bernhardkaindl
Copy link
Contributor Author

bernhardkaindl commented May 10, 2024

Thanks @viktormalik!

Also, we historically used the loading "attempts" for older kernels which required the kernel version (and we had 3 different ways of getting it). I'm wondering if this change could potentially conflict with that.

Thanks for the hint! I did not know, and I overlooked that! Yes, I fixed that issue with this push!

Could we break the outer (attempt) loop if btf_fd < 0 here?

Yes, I hadn't added the outer loop yet, but I added it now. Done!

I'd rather see a more general comment somewhere (maybe before the "attempt" loop header?) explaining what features are omitted in which attempt.

Yes, that makes total sense! Done now on top of the new outer loop!

src/attached_probe.cpp Outdated Show resolved Hide resolved
src/ast/elf_parser.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@viktormalik viktormalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, just a couple of comments below.

src/attached_probe.cpp Outdated Show resolved Hide resolved
src/attached_probe.cpp Outdated Show resolved Hide resolved
src/utils.h Outdated Show resolved Hide resolved
src/utils.cpp Outdated Show resolved Hide resolved
@bernhardkaindl
Copy link
Contributor Author

@viktormalik:

LGTM overall, just a couple of comments below.

Thanks, applied the changes.

Comment on lines 888 to 889
<< "bpf_prog_load() failed with error " << progfd_ << ": "
<< std::strerror(-progfd_) << std::endl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary now when the same is printed for each failed attempt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, correct, that is no longer necessary as it is already logged for each failed attempt!

Removed them! - I backed out these two lines from the commit!

This push removes the hunk that added these two lines, no other changes in this push.

Copy link
Contributor

@viktormalik viktormalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Let's wait for #3166 to land and rebase on top of that to make sure that the tests pass.

Also handle the case where EM_BPF may not be defined on older systems.

Co-authored-by: Viktor Malik <viktor.malik@gmail.com>
Signed-off-by: Bernhard Kaindl <bernhardkaindl7@gmail.com>
@bernhardkaindl
Copy link
Contributor Author

BTW: GitHub implicitly re-bases the PR git checkout for the CI builds onto the target branch automatically. That is the reason why all PRs broke when master broke CI.

@danobi danobi self-assigned this May 18, 2024
@viktormalik viktormalik added this to the v0.21.0 milestone May 20, 2024
@bernhardkaindl
Copy link
Contributor Author

@viktormalik, as https://github.com/bpftrace/bpftrace/releases/tag/v0.20.4 is released, and you assigned v0.21.0 as milestone, and you reviewed it, what is the next milestone to merge this PR?

@viktormalik
Copy link
Contributor

@bernhardkaindl I want to get this merged asap, to get it into the upcoming 0.21 release. Just waiting for a review from @danobi.

@viktormalik viktormalik merged commit faff4d8 into bpftrace:master May 23, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bpftrace v0.20.1, Linux 4.19 - ERROR: Error loading program
3 participants