
v1.11 backports 2023-06-08 (ipsec fixes) #26021

Merged
merged 24 commits into cilium:v1.11 from pr/v1.11-backport-2023-06-08-ipsec on Jun 13, 2023

Conversation

joamaki
Contributor

@joamaki joamaki commented Jun 8, 2023

Once this PR is merged, you can update the PR labels via:

$ for pr in 25724 25744 25784 25735 25936 25953 26072 26093; do contrib/backporting/set-labels.py $pr done 1.11; done

@joamaki joamaki added kind/backports This PR provides functionality previously merged into master. area/encryption Impacts encryption support such as IPSec, WireGuard, or kTLS. release-blocker/1.11 This issue will prevent the release of the next version of Cilium. backport/1.11 This PR represents a backport for Cilium 1.11.x of a PR that was merged to main. labels Jun 8, 2023
@joamaki

This comment was marked as outdated.

@joamaki joamaki requested a review from pchaigno June 8, 2023 10:06
@joamaki
Contributor Author

joamaki commented Jun 8, 2023

This PR replaces #25895. The only change here is the addition of the last two commits.

@joamaki
Contributor Author

joamaki commented Jun 8, 2023

Waiting on #25993 to be merged to fix test runs.

@joamaki joamaki force-pushed the pr/v1.11-backport-2023-06-08-ipsec branch from a0a33d1 to a5ab01c on June 8, 2023 14:50
@michi-covalent michi-covalent mentioned this pull request Jun 8, 2023
44 tasks
@pchaigno
Member

pchaigno commented Jun 9, 2023

I think we should merge this without waiting for #25993.

@joamaki joamaki marked this pull request as ready for review June 9, 2023 11:26
@joamaki joamaki requested a review from a team as a code owner June 9, 2023 11:26
@gandro
Member

gandro commented Jun 12, 2023

I have pushed two more PRs to this branch, since those PRs also need to be backported as part of the IPSec fixes to v1.11 and depend on other commits. I will update the PR description accordingly.

@gandro gandro requested a review from pchaigno June 12, 2023 13:20
@gandro gandro mentioned this pull request Jun 12, 2023
1 task
@gandro

This comment was marked as outdated.

@qmonnet
Member

qmonnet commented Jun 13, 2023

@qmonnet

This comment was marked as outdated.

pchaigno and others added 7 commits June 13, 2023 14:14
[ upstream commit 600c7d4 ]

The XFRM IN policies and states didn't change so we should never need
to remove any stale XFRM IN configs. Let's thus simplify the logic to
find stale policies and states accordingly.

I would expect this incorrect removal to cause a few drops on agent
restart, but after multiple attempts to reproduce on small (3 nodes)
and larger (20) clusters (EKS & GKE) with a drop-sensitive application
(migrate-svc), I'm not able to see such drops. I'm guessing this is
because we reinstall the XFRM IN configs right after we removed them so
there isn't really much time for a packet to be received and dropped.

Fixes: 688dc9a ("ipsec: Remove stale XFRM states and policies")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit bbc50a3 ]

This small bit of refactoring will make later changes a bit easier.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 2ada282 ]

No functional changes. Best viewed with git show -b or the equivalent on
GitHub to not show space-only changes.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 64f4c23 ]

No functional changes. Best viewed with git show -b or the equivalent on
GitHub to not show space-only changes.

Same as the previous commit but on a different set of conditions.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 66c45ac ]

This is a partial revert of commit 3e59b68 ("ipsec: Per-node XFRM
states & policies for EKS & AKS").

One change that commit 3e59b68 ("ipsec: Per-node XFRM states &
policies for EKS & AKS") made on EKS and AKS was to switch from using
NodeInternalIPs to CiliumInternalIPs for the outer IPsec (ESP) IP
addresses. That made the logic more consistent with the logic we use for
other IPAM schemes (e.g., GKE).

It however causes serious connectivity issues on upgrades and
downgrades. This is mostly because typically not all nodes are updated
to the new Cilium version at the same time. If we consider two pods on
nodes A and B trying to communicate, then node A may be using the old
NodeInternalIPs while node B is already on the new CiliumInternalIPs.
When node B sends traffic to node A, node A doesn't have the XFRM state
IN necessary to decrypt it. The same happens in the other direction.

This commit reintroduces the NodeInternalIPs for EKS and AKS. Subsequent
commits will introduce additional changes to enable a proper migration
path from NodeInternalIPs to CiliumInternalIPs.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 6b3b50d ]

On EKS and AKS, we currently use NodeInternalIPs for the IPsec tunnels.
A subsequent commit will allow us to change that to switch to using
CiliumInternalIPs (as done on GKE).

For that to be possible without breaking inter-node connectivity for the
whole duration of the switch, we need an intermediate mode where both
CiliumInternalIPs and NodeInternalIPs are accepted on ingress.

The idea is that we will then have a two-step migration from
NodeInternalIP to CiliumInternalIP:

  1. All nodes are using NodeInternalIP.
  2. Upgrade to the version of Cilium that supports both NodeInternalIP
     and CiliumInternalIP and encapsulates IPsec traffic with
     NodeInternalIP.
  3. Via an agent flag, tell Cilium to switch to encapsulating IPsec
     traffic with CiliumInternalIP.
  4. All nodes are using CiliumInternalIP.

This commit implements the logic for step 2 above. To that end, we will
duplicate the XFRM IN states such that we have both:

    src 0.0.0.0 dst [NodeInternalIP]   # existing
    src 0.0.0.0 dst [CiliumInternalIP] # new

thus matching and being able to receive IPsec packets with an outer
destination IP of either NodeInternalIP or CiliumInternalIP.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
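The duplicated-IN-state idea above can be sketched with a toy model (illustrative Python, not Cilium code; the IP addresses are placeholders taken from the examples in this PR):

```python
# Toy model of the dual XFRM IN states described above: during the
# migration window, ingress must accept ESP packets whose outer
# destination is either the NodeInternalIP or the CiliumInternalIP.
# The addresses below are placeholders, not real cluster IPs.
NODE_INTERNAL_IP = "192.168.134.181"
CILIUM_INTERNAL_IP = "10.24.1.77"

def xfrm_in_states(node_ip, cilium_ip):
    """Return the duplicated IN states (src 0.0.0.0 dst <outer IP>)."""
    return [
        {"src": "0.0.0.0", "dst": node_ip},    # existing state
        {"src": "0.0.0.0", "dst": cilium_ip},  # new, duplicated state
    ]

def can_decrypt(states, outer_dst):
    """A packet is decryptable if some IN state matches its outer dst."""
    return any(s["dst"] == outer_dst for s in states)
```

With both states installed, traffic encapsulated with either outer destination IP is accepted, which is what makes step 2 of the migration safe.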
[ upstream commit 963e45b ]

On EKS and AKS, IPsec used NodeInternalIPs for the encapsulation. This
commit introduces a new flag to allow switching from NodeInternalIPs to
CiliumInternalIPs; it defaults to the former.

This new flag allows for step 3 of the migration plan defined in the
previous commit.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
pchaigno and others added 16 commits June 13, 2023 14:14
[ upstream commit e880002 ]

reinitializeIPSec only runs the interface detection if EncryptInterface
is empty. Since it sets it after detecting interfaces, it will only run
the detection once.

Let's change that to run the detection even if the EncryptInterface list
isn't empty. That will allow us to rerun the detection when new ENI
devices are added on EKS.

One consequence of this change is that we will now attach to all
interfaces even if the user configured --encrypt-interface. That is fine
because --encrypt-interface shouldn't actually be used in ENI mode. In
ENI mode, we want to attach to all interfaces as we don't have a
guarantee on which interface the IPsec traffic will come in.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit bf0940b ]

If IPsec is enabled along with the ENI IPAM mode we
need to load the bpf_network program onto new ENI devices
when they're added at runtime.

To fix this, we subscribe to netlink link updates to detect when
new (non-veth) devices are added and reinitialize IPsec to load the BPF
program onto the devices. The compilation of the bpf_netowrk program
has been moved to Reinitialize() to avoid spurious recompilation on
reinitialize.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 3201a5e ]

This commit simply refactors some existing code into a new
getNodeIDForNode function. This function will be called from elsewhere
in a subsequent commit.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 9cc8a89 ]

Our logic to clean up old XFRM configs on node deletion currently relies
on the destination IP to identify the configs to remove. That doesn't
work with ENI and Azure IPAMs, but until recently, it didn't need to. On
ENI and Azure IPAMs we didn't have per-node XFRM configs.

That changed in commit 3e59b68 ("ipsec: Per-node XFRM states &
policies for EKS & AKS"). We now need to clean up per-node XFRM configs
for ENI and Azure IPAM modes as well, and we can't rely on the
destination IP for that because the XFRM policies don't match on that
destination IP.

Instead, since commit 73c36d4 ("ipsec: Match OUT XFRM states &
policies using node IDs"), we match the per-node XFRM configs using node
IDs encoded in the packet mark. The good news is that this is true for
all IPAM modes (whether Azure, ENI, cluster-pool, or something else).

So our cleanup logic can now rely on the node ID of the deleted node to
clean up its XFRM states and policies. This commit implements that.

Fixes: 3e59b68 ("ipsec: Per-node XFRM states & policies for EKS & AKS")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit 3e898f2 ]

This commit lowers the priority of the catch-all default-drop XFRM OUT
policies, from 1 to 100. For context, 0 is the highest possible
priority.

This change will allow us to introduce several levels of priorities for
XFRM OUT policies in subsequent commits.

Diff before/after this patch:

      src 0.0.0.0/0 dst 0.0.0.0/0
    -     dir out action block priority 1
    +     dir out action block priority 100
          mark 0xe00/0xf00

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit a11d088 ]

This is a revert, or rather a reimplementation, of commit 688dc9a
("ipsec: Remove stale XFRM states and policies").

In that commit, we would remove the old XFRM OUT policies and states
because they conflict with the new ones and prevent the installation to
proceed. This removal however causes a short window of packet drops on
upgrade, between the time the old XFRM configs are removed and the new
ones are added. These drops would show up as XfrmOutPolBlock because
packets then match the catch-all default-drop XFRM policy.

Instead of removing the old XFRM configs, a better, less-disruptive
approach is to deprioritize them and add the new ones in front. To that
end, we "lower" the priority of the old XFRM OUT policy from 0 to 50 (0
is the highest-possible priority). By doing this the XFRM OUT state is
also indirectly deprioritized because it is selected by the XFRM OUT
policy.

As with the code from commit 688dc9a ("ipsec: Remove stale XFRM
states and policies"), this whole logic can be removed in v1.15, once
we are sure that nobody is upgrading with the old XFRM configs in place.
At that point, we will be able to completely clean up those old XFRM
configs.

The priority of 50 was chosen arbitrarily, to be between the priority of
new XFRM OUT configs (0) and the priority of the catch-all default-drop
policy (100), while leaving space if we need to add additional rules of
different priorities.

Diff before/after upgrade (v1.13.0 -> this patch, GKE):

      src 10.24.1.0/24 dst 10.24.2.0/24
    -     dir out priority 0
    +     dir out priority 50
          mark 0x3e00/0xff00
          tmpl src 10.24.1.77 dst 10.24.2.207
              proto esp spi 0x00000003 reqid 1 mode tunnel

Fixes: 688dc9a ("ipsec: Remove stale XFRM states and policies")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
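The resulting priority ladder can be sketched as follows (constant names are illustrative, not Cilium's; the kernel applies the matching policy with the lowest numeric priority value):

```python
# Illustrative model of the XFRM OUT policy priorities described above.
# 0 is the highest possible priority in the kernel's XFRM subsystem.
PRIO_NEW_OUT = 0         # new per-node XFRM OUT policies
PRIO_OLD_OUT = 50        # deprioritized pre-upgrade OUT policies
PRIO_DEFAULT_DROP = 100  # catch-all default-drop policy

def applied_policy(matching_policies):
    """Of all policies matching a packet, the kernel applies the one
    with the lowest priority value (i.e., the highest priority)."""
    return min(matching_policies, key=lambda p: p["priority"])
```

So while both old and new OUT policies are installed, the new one wins; if only the old one matches, traffic is still encrypted rather than hitting the default-drop policy.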
[ upstream commit 0af2303 ]

This commit reverts the bpf_host changes of commit 4c7cce1 ("bpf:
Remove IP_POOLS IPsec code").

The IP_POOLS IPsec code was a hack to avoid having one XFRM OUT policy
and state per remote node. Instead, we had a single XFRM OUT policy and
state that would encrypt traffic as usual, but encapsulate it with
placeholder IP addresses, such as 0.0.0.0 -> 192.168.0.0. Those outer IP
addresses would then be rewritten to the proper IPs in bpf_host. To that
end, bpf_lxc would pass the destination IP address, the tunnel endpoint,
to bpf_host via a skb->cb slot. The source IP address was hardcoded in
the object file.

Commit 4c7cce1 ("bpf: Remove IP_POOLS IPsec code") thus got rid of
that hack to instead have per-node XFRM OUT policies and states. The
kernel therefore directly writes the proper outer IP addresses.

Unfortunately, the transition from one implementation to the other isn't
so simple. If we simply remove the old IP_POOLS code as done in commit
4c7cce1, then we will have drops on upgrade. We have two cases,
depending on which of bpf_lxc or bpf_host is reloaded first:

1. If bpf_host is reloaded before the new bpf_lxc is loaded, then it
   won't rewrite the outer IP addresses anymore. In that case, we end up
   with packets of the form 0.0.0.0 -> 192.168.0.0 leaving on the wire.
   Obviously, they don't go far and end up dropped.
2. If bpf_lxc is reloaded before the new bpf_host, then it will reuse
   skb->cb for something else and the XFRM layer will handle the outer
   IP addresses. But because bpf_host is still on the old
   implementation, it will try to use skb->cb to rewrite the outer IP
   addresses. We thus end up with gibberish outer destination IP
   addresses.

One way to fix this is to have bpf_host support both implementations.
This is what this commit does. The logic to rewrite the outer IP
addresses is reintroduced in bpf_host, but it is only executed if the
outer source IP address is 0.0.0.0. That way, we will only rewrite the
outer IP addresses if bpf_lxc is on the old implementation and the XFRM
layer didn't write the proper outer IPs.

Fixes: 4c7cce1 ("bpf: Remove IP_POOLS IPsec code")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit ca9c056 ]

As explained in the previous commit, we need to switch our IPsec
logic from one implementation to another. This implementation requires
some synchronized work between bpf_lxc and bpf_host. To enable this
switch without causing drops, the previous commit made bpf_host support
both implementations.

This isn't quite enough, though. For this to work, we need to ensure
that bpf_host is always reloaded before any bpf_lxc is loaded. That is,
we need to load the bpf_host program that supports both implementations
before we actually start the switch from one implementation to the
other.

This commit makes that change in the order of BPF program reloads.
Instead of regenerating the bpf_host program (i.e., the host endpoint's
datapath) in a goroutine like other BPF programs, we will regenerate it
first, as a blocking operation.

Regenerating the host endpoint's datapath separately like this will
delay the agent startup. This regeneration was measured to take around 1
second on an EKS cluster (though it can probably grow to a few seconds
depending on the node type and current load). That should stay fairly
small compared to the overall duration of the agent startup (around 30
seconds). Nevertheless, this separate regeneration is only performed
when we actually need it: for IPsec with the EKS or AKS IPAM modes.

Fixes: 4c7cce1 ("bpf: Remove IP_POOLS IPsec code")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
[ upstream commit c0d9b8c ]

Commit 73c36d4 ("ipsec: Match OUT XFRM states & policies using node
IDs") changed our XFRM states to match on packet marks of the form
0xXXXXYe00/0xffffff00 where XXXX is the node ID and Y is the SPI. The
previous format for the packet mark in XFRM states was 0xYe00/0xff00.

According to the Linux kernel these two states conflict (because
0xXXXXYe00/0xffffff00 ∈ 0xYe00/0xff00). That means we can't add the new
state while the old one is around.

Thus, in commit ddd491b ("ipsec: Custom check for XFRM state
existence"), we removed any old conflicting XFRM state before adding
the new ones. That however causes packet drops on upgrades because we
may remove the old XFRM state before bpf_lxc has been updated to use the
new 0xXXXXYe00/0xffffff00 mark. Instead, we would need both XFRM state
formats to coexist for the duration of the upgrade.

Impossible, you say! Don't despair. Things are actually a bit more
complicated (it's IPsec and Linux after all).

While Linux doesn't allow us to add 0xXXXXYe00/0xffffff00 when
0xYe00/0xff00 exists, it does allow adding in the reverse order. That
seems to be because 0xXXXXYe00/0xffffff00 ∈ 0xYe00/0xff00 but
0xYe00/0xff00 ∉ 0xXXXXYe00/0xffffff00 [1].

Therefore, to have both XFRM states coexist, we can remove the old
state, add the new one, then re-add the old state. That is allowed
because we never try to add the new state when the old is present.

During the short period of time when we have removed the old XFRM
state, we can have packet drops due to the missing state. These drops
should be limited to the specific node pair this XFRM state is
handling. This will also only happen on upgrades. Finally, this
shouldn't happen with ENI and Azure IPAM modes because they don't have
such old conflicting states.

I tested this upgrade path on a 20-nodes GKE cluster running our
drop-sensitive application, migrate-svc, scaled up to 50 clients and 30
backends. I didn't get a single packet drop despite the application
consistently sending packets back and forth between nodes. Thus, I think
the window for drops to happen is really small.

Diff before/after the upgrade (v1.13.0 -> this patch, GKE):

      src 10.24.1.77 dst 10.24.2.207
          proto esp spi 0x00000003 reqid 1 mode tunnel
          replay-window 0
          mark 0x3e00/0xff00 output-mark 0xe00/0xf00
          aead rfc4106(gcm(aes)) 0xfc2d0c4e646b87ff2d0801b57997e3598eab0d6b 128
    -     anti-replay context: seq 0x0, oseq 0x2c, bitmap 0x00000000
    +     anti-replay context: seq 0x0, oseq 0x16, bitmap 0x00000000
          sel src 0.0.0.0/0 dst 0.0.0.0/0
    + src 10.24.1.77 dst 10.24.2.207
    +     proto esp spi 0x00000003 reqid 1 mode tunnel
    +     replay-window 0
    +     mark 0x713f3e00/0xffffff00 output-mark 0xe00/0xf00
    +     aead rfc4106(gcm(aes)) 0xfc2d0c4e646b87ff2d0801b57997e3598eab0d6b 128
    +     anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
    +     sel src 0.0.0.0/0 dst 0.0.0.0/0

We can notice that the counters for the existing XFRM state also
changed (decreased). That's expected since the state got recreated.

1 - I think this is because XFRM states don't have priorities. So when
    two XFRM states would match a given packet (in our case a packet
    with mark XXXXYe00), the oldest XFRM state is taken. Thus, by not
    allowing a more specific match to be added after a more generic
    one, the kernel ensures that the more specific match is always
    taken when both match a packet. That likely corresponds to user
    expectations.
    That is, if both 0xXXXXYe00/0xffffff00 and 0xYe00/0xff00 match a
    packet, we would probably expect 0xXXXXYe00/0xffffff00 to be used.
Fixes: ddd491b ("ipsec: Custom check for XFRM state existence")
Fixes: 73c36d4 ("ipsec: Match OUT XFRM states & policies using node IDs")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
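The mark arithmetic in this commit message can be checked with a short sketch (the function names are invented; the constants follow the mark formats quoted above, and the example values come from the diff):

```python
OLD_MASK = 0x0000ff00   # old state format: mark 0xYe00/0xff00
NEW_MASK = 0xffffff00   # new format: 0xXXXXYe00/0xffffff00

def new_state_mark(node_id, spi):
    """Node ID in bits 16-31, SPI nibble in bits 12-15, 0xe00 below."""
    return (node_id << 16) | (spi << 12) | 0xe00

def state_matches(pkt_mark, state_mark, state_mask):
    """A state matches a packet if the masked packet mark equals the
    state's mark, as the kernel's XFRM lookup does."""
    return (pkt_mark & state_mask) == state_mark
```

For node ID 0x713f and SPI 3 this yields mark 0x713f3e00, matching the new state in the diff; the old state (0x3e00/0xff00) still matches that packet mark, while the reverse does not hold, which is the containment relation the commit relies on.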
[ upstream commit 4704655 ]

Now that the code is reloading the bpf_network program at runtime we
should not fatal if we fail to reload the program since this may be caused
by ongoing interface changes (e.g. interface was being removed). Change
the log.Fatal into log.Error and keep loading to other interfaces.

Fixes: bf0940b ("loader: Reinitialize IPsec on device changes on ENI")
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 592777d ]

reloadIPSecOnLinkChanges() did not ignore veth device updates causing
reload to be triggered when new endpoints were created. Ignore any
updates with "veth" as device type.

The draining of updates during settle wait was broken due to unintentional
breaking out of the loop. Removed the break.

Fixes: bf0940b ("loader: Reinitialize IPsec on device changes on ENI")
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit ac54f29 ]

We recently changed our XFRM configuration to have one XFRM OUT policy
per remote node, regardless of the IPAM mode being used. In doing so, we
also moved the XFRM FWD policy to be installed once per remote node.

With ENI and Azure IPAM modes, this wouldn't cause any issue because the
XFRM FWD policy is the same regardless of the remote node. On other IPAM
modes, however, the XFRM FWD policy is for some reason different
depending on the remote node that triggered the installation. As a
result, for those IPAM modes, one FWD policy is installed per remote
node. And the deletion logic triggered on node deletions wasn't updated
to take that into account. We thus have a leak of XFRM FWD policies.

In the end, our FWD policy just needs to allow everything through
without encrypting it. It doesn't need to be specific to any remote
node. We can simply completely wildcard the match, to look like:

    src 0.0.0.0/0 dst 0.0.0.0/0
        dir fwd priority 2975 ptype main
        tmpl src 0.0.0.0 dst 192.168.134.181
            proto esp reqid 1 mode tunnel
            level use

So we match all packets regardless of source and destination IPs. We
don't match on the packet mark.

There's a small implementation hurdle here. Because we used to install
FWD policies of the form "src 0.0.0.0/0 dst 10.0.1.0/24", the kernel was
able to deduce which IP family we are matching against and would adapt
the 0.0.0.0/0 source CIDR to ::/0 as needed. Now that we are matching on
0/0 for both CIDRs, it cannot deduce this anymore. So instead, we must
detect the IP family ourselves and use the proper CIDRs.

In addition to changing the XFRM FWD policy to the above, we can also
stop installing it once per remote node. It's enough to install it when
we receive the event for the local node, once.

Fixes: 3e59b68 ("ipsec: Per-node XFRM states & policies for EKS & AKS")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
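Picking the family-appropriate wildcard selector could look like this sketch (using Python's `ipaddress` for family detection; the helper name is invented and not Cilium's API):

```python
import ipaddress

def fwd_wildcard(local_ip):
    """Return the all-zero src/dst CIDRs for the FWD policy selector,
    in the same IP family as the local tunnel endpoint, since the
    kernel can no longer infer the family from a 0/0-only selector."""
    if ipaddress.ip_address(local_ip).version == 4:
        return ("0.0.0.0/0", "0.0.0.0/0")
    return ("::/0", "::/0")
```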
[ upstream commit d0ab559 ]

We use this wildcard IPv6 CIDR in the catch-all default-drop OUT policy
as well as in the FWD policy. It was incorrectly set to ::/128 instead
of ::/0 and would therefore not match anything instead of matching
everything. This commit fixes it.

Fixes: e802c29 ("ipsec: Refactor wildcard IP variables")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
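The difference between the two prefix lengths is easy to demonstrate (a sketch with Python's `ipaddress`; the sample address is arbitrary):

```python
import ipaddress

# ::/128 matches only the single unspecified address ::, whereas
# ::/0 matches every IPv6 address -- the intended wildcard.
wrong = ipaddress.ip_network("::/128")
right = ipaddress.ip_network("::/0")

sample = ipaddress.ip_address("fd00::1234")
```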
[ upstream commit 25064d1 ]

After applying a backport of 9cc8a89 ("ipsec: Fix leak of XFRM
policies with ENI and Azure IPAMs") to 1.11.16, I noticed that we were
getting occasional spikes of "no inbound state" xfrm errors
(XfrmInNoStates). These lead to packet loss and brief outages for
applications sending traffic to the node on which the spikes occur.

I noticed that the "No node ID found for node." logline would appear at
the time of these spikes and from the code this is logged when the node
ID cannot be resolved. Looking a bit further the call to
`DeleteIPsecEndpoint` will end up deleting the xfrm state for any state
that matches the node id as derived from the mark in the state.

The problem seems to be that the inbound state for 0.0.0.0/0 -> node IP
has a mark of `0xd00` which when shifted >> 16 in
`getNodeIDFromXfrmMark` matches nodeID 0 and so the inbound state gets
deleted and the kernel drops all the inbound traffic as it no longer
matches a state.

This commit updates that logic to skip the XFRM state and policy
deletion when the node ID is zero.

Fixes: 9cc8a89 ("ipsec: Fix leak of XFRM policies with ENI and Azure IPAMs")
Signed-off-by: Steven Johnson <sjdot@protonmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
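The guard described in this commit can be sketched as follows (function names are illustrative; `0xd00` is the IN-state mark quoted above):

```python
def node_id_from_mark(mark):
    """The node ID is encoded in the upper 16 bits of the XFRM mark."""
    return (mark & 0xffff0000) >> 16

def should_delete(state_mark, deleted_node_id):
    """Skip cleanup entirely when the node ID is zero, so global
    states such as the 0.0.0.0/0 -> node IP IN state (mark 0xd00)
    are never torn down by mistake."""
    if deleted_node_id == 0:
        return False
    return node_id_from_mark(state_mark) == deleted_node_id
```

Without the zero check, an unresolvable node ID (0) matches the IN state's mark `0xd00 >> 16 == 0` and deletes it, which is exactly the XfrmInNoStates outage described above.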
[ upstream commit 57eac9d ]

With commit 9cc8a89 ("ipsec: Fix leak of XFRM policies with ENI and
Azure IPAMs") we rely on the node ID to find XFRM states and policies
that belong to remote nodes, to clean them up when remote nodes are
deleted.

This commit makes sure that we only do this for XFRM states and policies
that actually match on these node IDs. That is only the case if the
mark mask matches on the node ID bits. It should thus look like
0xffffff00 (matches on node ID, SPI, and encryption bit) or 0xffff0f00
(matches on node ID and encryption bit).

Fixes: 9cc8a89 ("ipsec: Fix leak of XFRM policies with ENI and Azure IPAMs")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
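A sketch of that mask check (the two masks come straight from the text above; the function name is invented):

```python
# Only masks whose upper 16 bits are fully set actually match on the
# node ID bits; only those states/policies are safe to clean up by
# node ID.
NODE_ID_MASKS = {0xffffff00, 0xffff0f00}

def cleanup_by_node_id(mark_mask):
    """True if a state/policy with this mark mask may be deleted
    based on the node ID encoded in its mark."""
    return mark_mask in NODE_ID_MASKS
```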
[ upstream commit 1e1e2f7 ]

Commit 3e59b68 ("ipsec: Per-node XFRM states & policies for EKS &
AKS") changed the XFRM config to have one state and policy per remote
node in IPAM modes ENI and Azure. The IPsec cleanup logic was therefore
also updated to call deleteIPsec() whenever a remote node is deleted.

However, we missed that the cleanup logic also tries to remove the
per-node IP route. In case of IPAM modes ENI and Azure, the IP route
however stays as before: we have a single route for all remote nodes. We
therefore don't have anything to cleanup.

Because of this unnecessary IP route cleanup attempt, an error message
was printed for every remote node deletion:

    Unable to delete the IPsec route OUT from the host routing table

This commit fixes it to avoid attempting this unnecessary cleanup.

Fixes: 3e59b68 ("ipsec: Per-node XFRM states & policies for EKS & AKS")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
@pchaigno pchaigno force-pushed the pr/v1.11-backport-2023-06-08-ipsec branch from d52c308 to a331e32 on June 13, 2023 12:16
On kernels before 4.19, the XFRM output mark is not fully supported.
Thus, when comparing XFRM states, if we compare the output marks, the
existing states will never match the new state. The new state will have
an output mark, but the states installed in the kernel don't have it
(because the kernel ignored it).

As a result, our current logic will assume that the state we want to
install doesn't already exist, it will try to install it, fail because
it already exists, assume there's a conflicting state, throw an error,
remove the conflicting state, and install the new (but identical) one.

The end result is therefore the same: the new state is in place in the
kernel. But on the way to installing it, we will emit an unnecessary
error and temporarily remove the state (potentially causing packet
drops).

Instead, we can safely ignore the output-marks when comparing states. We
don't expect any states with same IPs, SPI, and marks, but different
output-marks anyway. The only way this could happen is if someone
manually added such a state. Even if they did, the only impact would be
that we wouldn't overwrite the manually-added state with the different
output-mark.

This patch is only necessary on v1.12 and earlier versions of Cilium
because v1.13 dropped support for Linux <4.19.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
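Ignoring the output-mark when comparing states can be sketched as a toy comparison over dicts (field names are illustrative, not the agent's actual types):

```python
def states_equal(a, b):
    """Compare two XFRM states, deliberately ignoring output_mark:
    pre-4.19 kernels drop the output-mark on installed states, so
    including it would make existing states never match new ones."""
    keys = ("src", "dst", "spi", "mark")  # no "output_mark" here
    return all(a.get(k) == b.get(k) for k in keys)
```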
@pchaigno
Member

pchaigno commented Jun 13, 2023

/test-backport-1.11

Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed:


Test Name

K8sDatapathConfig Host firewall With VXLAN and endpoint routes

Failure Output

FAIL: Failed to reach 192.168.56.11:80 from testserver-host-2nx98

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.22-kernel-4.19/65/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

Job 'Cilium-PR-K8s-1.18-kernel-4.9' failed:


Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Unable to download helm chart v1.10 from GitHub

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.18-kernel-4.9/61/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.18-kernel-4.9 so I can create one.

Then please upload the Jenkins artifacts to that issue.

@gandro
Member

gandro commented Jun 13, 2023

  • test-1.18-4.9 failed extracting the Helm chart (FAIL: Unable to download helm chart v1.10 from GitHub), which is an unrelated failure (see above for link); restarting
  • test-1.22-4.19 hit CI: Host firewall: unable to upgrade connection #15455, which has been observed on a previous v1.11 backport PR too
  • ConformanceKind1.19 hit an issue with 1.1.1.1 which is happening on other PRs too and is unrelated.

Restarting two Jenkins pipelines.

@gandro
Member

gandro commented Jun 13, 2023

/test-1.18-4.9

@gandro
Member

gandro commented Jun 13, 2023

/test-1.22-4.19

@pchaigno
Member

I reviewed the last three backported PRs from me and also ran manual tests to cover the last version of this backport PR.

@gandro gandro removed the request for review from pchaigno June 13, 2023 18:34
@gandro
Member

gandro commented Jun 13, 2023

All CI except ConformanceKind1.19 has passed. The failures in ConformanceKind1.19 are due to 1.1.1.1 rejecting our traffic and unrelated to this PR (as it's happening on other PRs and branches too). All other CI and manual testing give a green signal, so let's merge.

@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 13, 2023
@gandro gandro merged commit c3ffd99 into cilium:v1.11 Jun 13, 2023
52 of 53 checks passed