Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DECISION/VOTE]: Clarify acceptability of a BPF probe-centric approach for custom driver management #1527

Open
incertum opened this issue Nov 30, 2023 · 15 comments
Labels
kind/feature New feature or request lifecycle/stale
Milestone

Comments

@incertum
Copy link
Contributor

@erthalion comment:
#1376 (comment)

The way I see it, we still need to agree on one important point before moving
forward: I think even a "less workaround" solution will imply having more
responsibility on the bpf probe (e.g. about negotiation which progs to attach
where), because otherwise chances are low that the implementation will be
flexible enough. My takeaway from the discussion is that folks are strongly
against that, and to not waste anyone's time we need to clarify if such
approach (well thought and thoroughly developed) is not going to be frowned
upon.

@Stringy comment:
#1376 (comment)

agree with @erthalion that a proper solution will most likely involve shifting some ownership of a custom driver's internals back to the "user" (provided the schema is followed).

@falcosecurity/libs-maintainers, please share your stance on this proposal. +1 indicates your support for enhancing custom driver management for libs adopters. The possible use case for Falco is not yet relevant.

Remember that libs serves as the foundation not only for Falco but also for other tools built upon it. Libs only adopters are crucial to the success of The Falco Project. Supporting and assisting libs only adopters in overcoming challenges that ultimately align with Falco's best interests benefits the entire community.

@incertum incertum added the kind/feature New feature or request label Nov 30, 2023
@incertum
Copy link
Contributor Author

+1 I'm in favor, as I wasn't particularly convinced by hiding driver management away during the last major SC refactor. Flexibility will allow for faster innovation. While I envision a highly "intelligent" driver in the future, I also believe that end users should have some degree of control. For example, the base_syscalls option is functioning exceptionally well and enables adopters to tailor Falco to their specific use cases and environments.

@incertum
Copy link
Contributor Author

@Molter73 since you are out I assume +1 for you as well since you opened the proposal in the first place.

@gnosek
Copy link
Contributor

gnosek commented Dec 1, 2023

Hmm but what is it that we're actually discussing here?

  • Switching from sys_enter/sys_exit to per-syscall tracepoints?
  • Adding some infrastructure so that the above can be done in a fork without maintaining a patch to the loader?
  • Adding the same infrastructure to piggyback on the existing bpf probe to load unrelated stuff?

For (3), does the unrelated stuff involve scap events in any way (existing schema or extensions) or does it simply use the libs as a convenient loader?

... or something else entirely?

@erthalion
Copy link
Contributor

Hmm but what is it that we're actually discussing here?

  • Switching from sys_enter/sys_exit to per-syscall tracepoints?
  • Adding some infrastructure so that the above can be done in a fork without maintaining a patch to the loader?
  • Adding the same infrastructure to piggyback on the existing bpf probe to load unrelated stuff?

It's close the second point, except the discussion is not about adding anything
particular, but rather defining an overall mindset for the future development.
A BPF probe-centric approach was possible until recently, when it was tightened
up to allow userspace control more precisely what the probe is doing.

In the PR referenced here, we're planning to make probe management more
flexible, so it's important to clarify if the proposed change will not be
rejected in favour of having more tight control. As soon as this "mindset"
question is resolved, everything else is on us.

@gnosek
Copy link
Contributor

gnosek commented Dec 6, 2023

My personal opinion is that if the loader changes are needed only for a forked version of libs, they should live in the fork if possible. Please note that this does not mean I'm against eventually upstreaming your changes, but I'm simply not clear on what we accomplish by merging just the loader bits. You still have a whole fork to maintain, right? Are the loader bits more problematic for whatever reason? Do you want to use upstream falco (libs) with a custom probe?

@erthalion
Copy link
Contributor

erthalion commented Dec 6, 2023

You still have a whole fork to maintain, right?

Nope, we were actually half way through dropping the fork in favour of the
upstream, when we found out it's not possible due to changes under our feet.

Are the loader bits more problematic for whatever reason? Do you want to use upstream falco (libs) with a custom probe?

Yes, the loaded (for both bpf & modern_probe) is very opinionated about how the
probe should work. At the same time indeed it's the only piece of puzzle that
prevents anyone, including us, from using a custom probe.

@incertum
Copy link
Contributor Author

incertum commented Dec 7, 2023

My personal opinion is that if the loader changes are needed only for a forked version of libs, they should live in the fork if possible. Please note that this does not mean I'm against eventually upstreaming your changes, but I'm simply not clear on what we accomplish by merging just the loader bits.

@gnosek I actually understand this completely. Another way to look at it is that The Falco Project officially supports and benefits from adopters who build tools using libs. From my perspective, this makes such requests first-class citizens as well. And there will be a commitment to iterate and polish everything. With this in mind, does the justification seem improved from your perspective?

@gnosek
Copy link
Contributor

gnosek commented Dec 7, 2023

Nope, we were actually half way through dropping the fork in favour of the
upstream, when we found out it's not possible due to changes under our feet.

Hmm not sure I understand how that's possible. You do need a custom probe (that's the entire issue) so you do have to maintain it somewhere, right? I'm genuinely trying to understand.

Yes, the loaded (for both bpf & modern_probe) is very opinionated about how the
probe should work. At the same time indeed it's the only piece of puzzle that
prevents anyone, including us, from using a custom probe.

I feel that it has to be opinionated, or at the very least, the opinion has to live somewhere (we don't want to yolo attach all tracepoints we find). With a custom probe, the default loader has no idea what to do with the extra sections and I have a feeling they'd always be a second class citizen (the default loader wouldn't know when to attach a tracepoint).

The Falco Project officially supports and benefits from adopters who build tools using libs. From my perspective, this makes such requests first-class citizens as well. And there will be a commitment to iterate and polish everything. With this in mind, does the justification seem improved from your perspective?

Yup, this is a very good point, though I still feel that these custom probes should either get upstreamed eventually or just keep on living in their fork if they're not headed upstream. Still, I can imagine a better loader would reduce the initial friction of playing with a new probe so it might be worth a try.

Looking through #1375, it seems like the vast majority of the patch is making the BPF program array dynamically-sized. Given the extent of changes required, I think I'd rather introduce a separate bpf_attached_prog_list for custom programs (btw, I have opinions about the name :P).

For v1, we can unconditionally load all custom programs in load_single_prog and append their info to the new array (like in your patch), for v2 I'd rather see unused programs unloaded (or ideally not even loaded in the first place, avoiding potential verifier issues) but that's a general issue, not with your patch.

Again for v1, enforce_sc_set can unconditionally attach/detach the programs based on handle->capturing, but I'd much rather see a callback there (defaulting to the handle->capturing logic so e.g. upstream falco doesn't need to care). In an ideal world, we'd have a way to control the custom tracepoints with some sort of callback (e.g. via ->configure) but I understand nobody needs this right now so let's ignore it ;)

In scap_bpf_close, we detach and clean up all the custom programs from the extra table.

(I had other, way more complex ideas for this but I believe this is a decent balance between giving custom probes the flexibility and minimizing impact on the built in tracepoints).

@erthalion
Copy link
Contributor

Nope, we were actually half way through dropping the fork in favour of the
upstream, when we found out it's not possible due to changes under our feet.

Hmm not sure I understand how that's possible. You do need a custom probe (that's the entire issue) so you do have to maintain it somewhere, right? I'm genuinely trying to understand.

No worries, it's fine :) In our case the custom bpf probe lives outside the
Falco tree, and we combine everything together in a drop-in manner during the
build time. @Stringy can give more details about the process.

@incertum
Copy link
Contributor Author

incertum commented Dec 7, 2023

Quick note: @gnosek I like the progression you described here #1527 (comment) very much and appreciate the technical details as well 😉 I also think that down the road (assuming the custom probe you maintain @erthalion is successful) it would be nice to consider upstreaming the probe.

So is #1527 (comment) an ok, we have a path forward?

@leogr
Copy link
Member

leogr commented Dec 13, 2023

I really appreciated the ongoing discussion and recognized some potential benefits, though I don't have strong opinions yet.

I think clarity on the proposal's long-term implications, especially regarding maintenance costs and overall project direction, has been overlooked a bit, IMO. After having gone through all the comments, I realized I have more questions than answers 😅 I've tried to collect them and report them below to help reach clarity.

Scope:

  • Is this proposal tailored exclusively for libs adopters, or does it may have broader implications for the Falco project?
    • (I know "The possible use case for Falco is not yet relevant.", but I'm more interested in understanding if we want to explicitly limit this proposal to libs only in order to set level expectations correctly)
  • How will this affect prioritization in the Falco project?
  • Are there alternatives under consideration?

Maintenance Costs:

  • Considering the libs project's high maintenance demands, what is the expected maintenance cost of implementing this proposal?
  • Most importantly, is anyone willing to commit to maintaining this?
  • Also, what is the expected impact of this proposal on existing components (ie the verifier with the legacy probe is already a pain) and ongoing efforts like code cleanups, refactoring, API revisions, etc.?
  • If the proposal's long-term goals include upstreaming alternative probes, how do we plan to maintain them?

Drivers function parity:

  • Will this proposal aim to discontinue the current function parity between our drivers?
  • Eventually, how do we plan to manage potential disparities between different drivers?

Event schema compatibility:

  • Is maintaining compatibility with the current event schema a primary goal of this proposal? (Will it be entirely opaque for Falco?)
  • Or do we expect this to drive significant event schema changes in the future? (if so, how will Falco users be impacted?)

Other ideas/proposals (eg LSM):

  • How does this proposal relate to ongoing discussions and ideas around using LSM hooks / kprobes?
  • Can we envision a common route or strategy to address these types of technical challenges in a cohesive manner?

I really believe focusing on defining the proposal in detail before proceeding to a vote is crucial. In particular, I think goals, non-goals, and full extent must be specified before deciding. 🙏

@erthalion
Copy link
Contributor

Thanks for the discussion, folks. I'll try to follow up step-by-step.


So is #1527 (comment) an ok, we have a path forward?

@gnosek @incertum Just to make sure I got it right:

For v1, we can unconditionally load all custom programs in load_single_prog and append their info to the new array (like in your patch), for v2 I'd rather see unused programs unloaded (or ideally not even loaded in the first place, avoiding potential verifier issues) but that's a general issue, not with your patch.

Again for v1, enforce_sc_set can unconditionally attach/detach the programs based on handle->capturing, but I'd much rather see a callback there (defaulting to the handle->capturing logic so e.g. upstream falco doesn't need to care). In an ideal world, we'd have a way to control the custom tracepoints with some sort of callback (e.g. via ->configure) but I understand nobody needs this right now so let's ignore it ;)

Does it mean the high level plan here is essentially what we are proposing,
namely make the loader flexible enough to load custom programs from the probe
(based on certain conditions)?


I think clarity on the proposal's long-term implications, especially regarding maintenance costs and overall project direction, has been overlooked a bit, IMO. After having gone through all the comments, I realized I have more questions than answers 😅 I've tried to collect them and report them below to help reach clarity.

@leogr I think it's important to distinguish those parts of the proposal that
are about "decision/vote" on the high level approach (whether to give the bpf
probe more responsibility), and implementation consequences. Having this in
mind:

Scope:

Is this proposal tailored exclusively for libs adopters, or does it may have broader implications for the Falco project?
(I know "The possible use case for Falco is not yet relevant.", but I'm more interested in understanding if we want to explicitly limit this proposal to libs only in order to set level expectations correctly)

I don't see any implications for the whole Falco project.

How will this affect prioritization in the Falco project?

Don't think there would be any special impact, aside of regular PR review
process.

Are there alternatives under consideration?

As far as I see, the only alternative so far is do not have this feature at
all.

Maintenance Costs:

Considering the libs project's high maintenance demands, what is the expected maintenance cost of implementing this proposal?

This one depends on the implementation. The way we see it, the only maintenance
overhead would be to test the fact that new way of loading programs from the
probe works.

Most importantly, is anyone willing to commit to maintaining this?

Depending on the implementation the scope could be different, but if everything
goes fine there will be me, @Molter73 and @Stringy.

Also, what is the expected impact of this proposal on existing components (ie the verifier with the legacy probe is already a pain) and ongoing efforts like code cleanups, refactoring, API revisions, etc.?

Again, heavily implementation dependent, but we don't see any significant
impact on other parts of the project.

If the proposal's long-term goals include upstreaming alternative probes, how do we plan to maintain them?

It's a very long-term vision, probably there could be different "flavours" of
the probe. The maintenance in this case would be more runs of the testing
pipeline, one for each flavour.

Drivers function parity:

Will this proposal aim to discontinue the current function parity between our drivers?
Eventually, how do we plan to manage potential disparities between different drivers?

What kind of disparities do you have in mind here?

Event schema compatibility:

Is maintaining compatibility with the current event schema a primary goal of this proposal? (Will it be entirely opaque for Falco?)

Yep.

Or do we expect this to drive significant event schema changes in the future? (if so, how will Falco users be impacted?)

No, unless it fits into the future community development roadmap.

Other ideas/proposals (eg LSM):

How does this proposal relate to ongoing discussions and ideas around using LSM hooks / kprobes?

This proposal is orthogonal to those discussions, but could enhance any
relevant implementation.

Can we envision a common route or strategy to address these types of technical challenges in a cohesive manner?

Well, that's the whole topic of the proposal -- the best strategy we see is to
put more responsibility on the bpf probe itself, introducing a flexible
interface to negotiate expectations what to load and, potentially in the
future, which data is going to be produced.

@leogr
Copy link
Member

leogr commented Dec 21, 2023

Thank you @erthalion ! A few more comments point-by-point 👇

Scope:
Is this proposal tailored exclusively for libs adopters, or does it may have broader implications for the Falco project?
(I know "The possible use case for Falco is not yet relevant.", but I'm more interested in understanding if we want to explicitly limit this proposal to libs only in order to set level expectations correctly)

I don't see any implications for the whole Falco project.

I'm unsure about this, and I would like the proposal to be clear in this regard. My goal just is to ensure everyone is on the same page.

For instance, the implementation of this proposal might require some modification to the Driver API. The driver API is versioned, and the versioning allows to check the compatibility between a given version of Falco and a given version of a driver. Thanks to that, a given release of Falco is compatible with more driver releases.
This is just an example (probably not the better one). Also, impacting the whole project in a minor way wouldn't likely be an issue, just the proposal should clarify that so we are aligned.

How will this affect prioritization in the Falco project?

Don't think there would be any special impact, aside of regular PR review process.

I actually meant if this would be an item for the Falco project roadmap or not. If not, this should be kind of best effort, which is ok, but will have very low priority compared to Falco items (for example, the two months before the release we are usually at full capacity on Falco items).

Maintenance Costs:
Considering the libs project's high maintenance demands, what is the expected maintenance cost of implementing this proposal?

This one depends on the implementation. The way we see it, the only maintenance overhead would be to test the fact that new way of loading programs from the probe works.

It would be desirable to define a precise strategy and understand if this may affect the actual testing process since driver testing has been a huge effort historically (especially because of the verifier and testing across different versions of kernels, clang, etc..). Likely, the proposal need to define the implementation strategy, otherwise it would be hard to understand implications.

Most importantly, is anyone willing to commit to maintaining this?

Depending on the implementation the scope could be different, but if everything goes fine there will be me, @Molter73 and @Stringy.

Good to know! 👍

Drivers function parity:
Will this proposal aim to discontinue the current function parity between our drivers?
Eventually, how do we plan to manage potential disparities between different drivers?

What kind of disparities do you have in mind here?

No one atm. Just wondering. Generally speaking, if a user-facing feature can be only implemented and not in the kmod, this will create a disparity.

Event schema compatibility:
Is maintaining compatibility with the current event schema a primary goal of this proposal? (Will it be entirely opaque for Falco?)

Yep.

Ok, this is crystal clear now 👍

PS
I will need to find a bit of time to share some thoughts about possible LSM ideas. The goal is just to understand if we can join efforts or not.

@incertum
Copy link
Contributor Author

driver testing has been a huge effort historically

This proposal for now is about scap changes only to load a custom probe. Therefore, I don't see this impacting our drivers testing.

parity between our drivers

Since this is for libs only and not Falco for now, I also don't see a problem here. In fact we already have driver disparities in libs today. For instance we support some archs only for some drivers. Also I don't recall any such discussions for expanding archs support, so I am still a bit puzzled why we now shift criteria.

Does it mean the high level plan here is essentially what we are proposing,
namely make the loader flexible enough to load custom programs from the probe
(based on certain conditions)?

re the question for me @erthalion: I don't see why this would pose a problem or big impact for libs. Rather the opposite, it will allow for more custom libs adoption. So yes "make the loader flexible enough to load custom programs from the probe" 👍


@leogr would propose to keep this discussion more focused on supporting "loading a custom probe" and having "an agreement that we will allow the bpf loader to be flexible to do that". I think it will make libs much more flexible and drive libs adoption which is excellent for the overall project health. In addition, three engineers who already commit to maintaining it is a privilege in open source.

Everything else I would defer to a formal proposal "overall driver and event filtering modernization" where we design the future of all things LSM hooks, kernel side filtering, event schemas and event handling etc. If no one else wants to lead starting such a proposal I can do it.

@Andreagit97 Andreagit97 added this to the TBD milestone Jan 31, 2024
@poiana
Copy link
Contributor

poiana commented Apr 30, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature or request lifecycle/stale
Projects
None yet
Development

No branches or pull requests

6 participants