Upgrade 1.14.10 -> 1.15.4: network policies start dropping traffic #32213
Comments
The upgrade command I used was a plain helm upgrade of the existing installation (cilium is installed via helm with the values.yaml shown further down).
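A representative form of such an upgrade, where the release name cilium, the kube-system namespace, and --reuse-values are assumptions rather than the exact invocation:

```bash
# Sketch of a typical helm-based upgrade; release name and namespace are
# assumptions about this particular installation.
helm repo update
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --version 1.15.4 \
  --reuse-values
```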
When I detected the problem, I was still using 'kube-apiserver' in the toEntities (which works with 1.14.10); I replaced it with 'all' for troubleshooting.
I did some additional testing on a single-node k8s cluster running k8s 1.29.2 with cilium 1.15.4. I created a dummy jenkins pod.
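In sketch form it was something like the following; the image, command, and label values are assumptions (any image with a shell and curl would do), and the labels mirror the ones discussed below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins
  labels:
    run: jenkins
    # illustrative values for the StatefulSet-style labels under test
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: jenkins-6f7d9c8b5
    statefulset.kubernetes.io/pod-name: jenkins-0
spec:
  containers:
    - name: probe
      image: curlimages/curl
      command: ["/bin/sh", "-c"]
      args:
        - |
          while true; do
            # prints "Could not resolve host" if DNS is blocked, and
            # times out if the API server itself is unreachable
            curl -sSk --max-time 5 https://kubernetes.default.svc/version
            sleep 5
          done
```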
The idea is that it continuously tries to connect to the API server: the log shows 'Could not resolve host' if the Kubernetes API server DNS lookup fails, and the subsequent call to the API server times out if it is blocked as well. Doing this, I quickly found out that using the short label 'run: jenkins' actually worked, so I used it to allow traffic to the API server using network policies. Then I could quickly test DNS access using various network policy label selectors. I used the following three network policies.
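In sketch form; only jenkins-allow-external-traffic is named in the original, the other names, the port list, and the use of toEntities for policy (3) are assumptions:

```yaml
# (1) default deny: the empty podSelector selects every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-allow-nothing
  namespace: jenkins
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
# (2) allow DNS egress for all pods (the "some other protocols" from the
# description are omitted here)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: jenkins
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# (3) allow the test pod to reach the API server; the matchLabels entry is
# the variable under test (run: jenkins worked, the last three labels did not)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: jenkins-allow-external-traffic
  namespace: jenkins
spec:
  endpointSelector:
    matchLabels:
      run: jenkins
  egress:
    - toEntities:
        - kube-apiserver
```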
So: (1) a default allow-nothing policy, (2) a policy allowing DNS and some other protocols, and (3) a policy allowing access to the API server. I then tried each of the labels on the jenkins pod, one by one, in the jenkins-allow-external-traffic network policy. I found that all labels worked apart from the last three: apps.kubernetes.io/pod-index, controller-revision-hash, and statefulset.kubernetes.io/pod-name. Of course, not all of these labels make sense to use in a network policy in practice, but the last one does, and I think it is not the job of the CNI to silently discard certain labels. Is there a reason for this new behavior? I could not find any documentation for this on the k8s website, and it has definitely changed in 1.15.4 compared to 1.14.10. I also double-checked this test setup with 1.14.10, and that version still works as expected.
I think I have found it, based on the 'identity relevant labels' documented at https://docs.cilium.io/en/stable/operations/performance/scalability/identity-relevant-labels/. The set of filtered labels has been extended in version 1.15.4, and I think a few things should be done about this.
It should also be documented in a more prominent place that cilium, by default and for performance reasons, limits the labels that can be used for pod selection. I had never encountered this before and it cost me hours of troubleshooting to find the cause, so pre-flight checks and a validating webhook would be very welcome.
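For anyone else hitting this: per the docs page linked above, the set of identity-relevant labels can be tuned via the labels option. An untested sketch of re-including statefulset.kubernetes.io/pod-name via helm values, with the syntax as documented there:

```yaml
# Untested sketch based on the identity-relevant-labels docs: setting
# "labels" replaces Cilium's default list, so the namespace label that
# policy enforcement relies on must be listed explicitly as well.
labels: "k8s:io.kubernetes.pod.namespace k8s:app k8s:run k8s:statefulset.kubernetes.io/pod-name"
```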
Is your issue similar to #30073?
I don't think it is, since I am not using audit mode. Also, this issue is about a difference between 1.15.4 and 1.14.10.
Hi @ErikEngerd, it does look likely that this is related to the PR linked above (#28003). As @joestringer noted there, that really should have been called out as a bigger change and mentioned in the upgrade guide, as you say. I've marked this for the attention of SIG Policy so they can prioritize accordingly.
This is precisely the issue, indeed.
Is there an existing issue for this?
What happened?
I upgraded from 1.14.10 to 1.15.4 following the upgrade instructions, including the pre-flight checks, which passed.
I am using a fairly minimal values.yaml for the cilium installation.
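A sketch of what such a minimal values.yaml can look like; these are placeholder settings using standard options of the cilium helm chart, not the actual file:

```yaml
# Illustrative sketch only; the real values.yaml is not reproduced here.
ipam:
  mode: kubernetes
kubeProxyReplacement: false
hubble:
  relay:
    enabled: true
  ui:
    enabled: true
```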
I have an existing setup where I use network policies extensively, and I now see a number of policies failing. One example is my jenkins server, where a lot of traffic is no longer being allowed; see the log output for details. This involves standard network policies for pod-to-pod communication not involving the host network, as well as communication to the API server, for which I have a cilium network policy.
The involved pod is the jenkins pod, managed by a StatefulSet.
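Its metadata is roughly as follows (values made up for illustration); the last three labels are added automatically by the StatefulSet controller and are exactly the kind that 1.15.4 filters out by default:

```yaml
metadata:
  name: jenkins-0
  namespace: jenkins
  labels:
    run: jenkins
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: jenkins-6f7d9c8b5
    statefulset.kubernetes.io/pod-name: jenkins-0
```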
There are no network policies in the kube-system namespace so access to the API server should be determined only based on policies in the jenkins namespace.
I suspect that matching of the pod labels is not working correctly. What does appear to be working is the default-allow-nothing rule (default deny policy). If the matchLabels no longer match with cilium 1.15.4, then the specific network policies for jenkins will not apply to the jenkins pod, leading to exactly the behavior I am seeing, with all ingress and egress traffic being dropped.
For instance, I have a policy to allow traffic to the API server.
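Based on the details elsewhere in this thread (toEntities was 'kube-apiserver', later 'all' for troubleshooting), it is shaped roughly like this sketch; the endpointSelector label and policy name are assumptions:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: jenkins-allow-apiserver
  namespace: jenkins
spec:
  endpointSelector:
    matchLabels:
      # assumed selector; this is one of the labels that 1.15.4
      # filters out by default
      statefulset.kubernetes.io/pod-name: jenkins-0
  egress:
    - toEntities:
        - kube-apiserver   # later replaced with "all" for troubleshooting
```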
This policy is definitely working: when I delete it under cilium 1.14.10, I see traffic from jenkins to the apiserver getting DROPPED, and the messages disappear again when I re-apply the network policy.
As mentioned before, there are many other network policies that relate to the jenkins pod and are failing. They all use the same label selector for the jenkins pod. However, the one above is the easiest to troubleshoot, since the problem can be reproduced with a single pod in a single namespace and a single network policy.
Cilium Version
1.15.4
Kernel Version
Linux baboon 6.1.0-16-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.67-1 (2023-12-12) x86_64 GNU/Linux
Debian 12
Kubernetes Version
1.28.5
Regression
Yes, version 1.14.10 was working fine
Sysdump
cilium-sysdump-20240428-140900.zip
Relevant log output
Anything else?
No response