New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
datapath,endpoint: explicitly remove TC filters during endpoint teardown #32167
Conversation
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicitly remove TC filters during endpoint teardown
Makes sense along with the inline comments. However, consider defining a tear down flow in the code. One option could be defining states for steps that need to happen in a certain order + unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Remove the policy tail call entry for the endpoint. This will disable
// policy evaluation for the endpoint and will result in missing tail calls if
// e.g. bpf_host or bpf_overlay call into the endpoint's policy program.
Thinking about DROP_MISSED_TAIL_CALL
errors for in-flight packets, I wonder if there is a way to differentiate the race conditions during tear down from genuine error conditions. Before sending the drop notification with DROP_MISSED_TAIL_CALL
reason, what if we were to do another lookup in the ipcache map to identify the cases we can ignore (and potentially send a different reason DROP_EP_DELETED
)? (This is assuming that ipcache entry is removed prior to maps clean-up for an ep.)
(Meant to request changes while posting previous review comments.)
@aditighag See #32151, that does what you describe. Doing subsequent lookups in the lxcmap/ipcache to decide how to handle missed tail calls is all subject to TOCTOU problems, so the best thing we can do if it happens is flag it and drop the packet. |
26e218a
to
7da33fd
Compare
/test |
Personally I'd split this into three patches (link-down, remove TC filters, consolidate BPF map management), they all feel fairly separate and have their own commit description. But otherwise this seems like a great improvement 👍. And with the hard link-down, I'm also much less concerned about pinning down the "perfect" order in which we need to tear down the BPF map entries. If there are any subtle dependencies between the different maps, we'll find out eventually. |
Technically yes, but it doesn't make sense to backport them partially, for example. And if we run into conflicts cherry-picking one of the commits, it's more difficult to track what the overall intention was in the upstream patch.
Careful, this still matters to some extent. If host or overlay find an entry in the endpoint map, they'll still try to jump into the endpoint's policy program. (ignoring TOCTOU here) With #32151 this is now less of an issue, and at least after issuing an ifdown, we don't need to worry about the tc entrypoints potentially firing anymore. We should still try to tear things down gracefully to reduce the likelihood of undefined behaviour. |
Prior to this commit, we left it up to the kernel to clean up tc attachments when the CNI finally removes the veth when a Pod goes away. This leaves a window of time where an endpoint's tc programs can potentially be invoked after the endpoint's internal tail call maps have already been cleared and the endpoint has been removed from the endpoint map and ipcache, resulting in undefined behaviour. This patch clearly defines the endpoint teardown sequence as follows: - remove (endpoint) routes - set the interface down - detach tc(x) hooks - remove endpoint from endpoint map - remove endpoint policy program(s) - delete conntrack map pins - remove policy prog array map pin - remove internal tail call map pin - remove custom calls map pin This puts the agent more in control of the teardown sequence and will allow us to reason better about failures related to missed tail calls and other flakes. Signed-off-by: Timo Beckers <timo@isovalent.com>
7da33fd
to
6d80a75
Compare
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misread the doc for errors.As
.
Prior to this commit, we left it up to the kernel to clean up tc attachments when the CNI finally removes the veth when a Pod goes away. This leaves a window of time where an endpoint's tc programs can potentially be invoked after the endpoint's internal tail call maps have already been cleared and the endpoint has been removed from the endpoint map and ipcache, resulting in undefined behaviour.
This patch clearly defines the endpoint teardown sequence as follows:
This puts the agent more in control of the teardown sequence and will allow us to reason better about failures related to missed tail calls and other flakes.
cc @julianwiedmann