
Upgrade 1.14.10 -> 1.15.4 network policies start dropping traffic #32213

Open

ErikEngerd opened this issue Apr 28, 2024 · 8 comments
Labels
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
needs/triage: This issue requires triaging to establish severity and next steps.
sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.

Comments

ErikEngerd commented Apr 28, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I upgraded from 1.14.10 to 1.15.4 following the upgrade instructions, including the pre-flight checks, which passed.

I am using a fairly minimal values.yaml for the cilium installation:

ipam:
  operator:
    clusterPoolIPv4PodCIDRList:
    - 10.220.0.0/16

hubble:
  relay:
    enabled: true
  ui:
    enabled: true

serviceAccounts:
  cilium:
    name: cilium
  operator:
    name: cilium-operator
tunnelPort: 8473
tunnelProtocol: vxlan

I have an existing setup where I use network policies extensively, and I now see a number of policies failing. One example is my Jenkins server, where a lot of traffic is no longer being allowed; see the log output for details. This involves standard network policies for pod-to-pod communication not involving the host network, as well as communication to the API server, for which I have a CiliumNetworkPolicy.

The involved pod is:

> k get pods -n jenkins --show-labels
NAME                READY   STATUS    RESTARTS   AGE    LABELS
wamblee-jenkins-0   2/2     Running   0          143m   app.kubernetes.io/component=jenkins-controller,app.kubernetes.io/instance=wamblee,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=jenkins,apps.kubernetes.io/pod-index=0,controller-revision-hash=wamblee-jenkins-7f469f5cd6,statefulset.kubernetes.io/pod-name=wamblee-jenkins-0

There are no network policies in the kube-system namespace, so access to the API server should be determined only by policies in the jenkins namespace.

I have the feeling that matching of the pod labels is not working correctly. What does appear to be working is the default-allow-nothing rule (default deny policy). If the matchLabels no longer match with Cilium 1.15.4, then the specific network policies for Jenkins do not apply to the Jenkins pod, leading to exactly the behavior I am seeing: all ingress and egress traffic being dropped.
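
One way to check this hypothesis is to inspect which labels Cilium actually used to compute the pod's security identity (a sketch; the identity ID 21112 is taken from the hubble output below):

# Show the endpoint and its identity ID
kubectl -n jenkins get ciliumendpoints wamblee-jenkins-0

# Dump the labels behind that identity from any cilium agent pod
kubectl -n kube-system exec ds/cilium -- cilium identity get 21112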

For instance, here is my policy to allow traffic to the API server:

kind: CiliumNetworkPolicy
apiVersion: cilium.io/v2
metadata:
  name: jenkins-api-server-access
  namespace: jenkins
spec:
  endpointSelector:
    matchLabels:
      statefulset.kubernetes.io/pod-name: wamblee-jenkins-0
  egress:
    - toEntities:
        # I used kube-apiserver before but replaced it with all as part of troubleshooting
        - all
      toPorts:
        - ports:
            - port: "6443"
              protocol: TCP

This policy is definitely working: when I delete it under Cilium 1.14.10, I see traffic from Jenkins to the API server getting DROPPED, and the messages disappear again when I re-apply the network policy.
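
The verification cycle was roughly as follows (a sketch; jenkins-api-server-access.yaml is a hypothetical file containing the policy above):

# Delete the policy: drops to the API server start appearing
kubectl -n jenkins delete cnp jenkins-api-server-access

# Re-apply it: on 1.14.10 the drops disappear again
kubectl apply -f jenkins-api-server-access.yaml

# Watch the drops while toggling the policy
stdbuf -oL hubble observe -f | grep 'jenkins.*DROPPED'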

As said before, there are many other network policies relating to the Jenkins pod that are failing. They all use the same label selector for the Jenkins pod. However, the one above is the easiest to troubleshoot, since the problem can be reproduced with a single pod, a single namespace, and a single network policy.

Cilium Version

1.15.4

Kernel Version

Linux baboon 6.1.0-16-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.67-1 (2023-12-12) x86_64 GNU/Linux
Debian 12

Kubernetes Version

1.28.5

Regression

Yes, version 1.14.10 was working fine

Sysdump

cilium-sysdump-20240428-140900.zip

Relevant log output

> stdbuf -oL hubble observe -f | grep 'jenkins.*DROPPED'
Apr 28 11:47:02.544: jenkins/wamblee-jenkins-0:55218 (ID:21112) <> 192.168.178.118:6443 (kube-apiserver) Policy denied DROPPED (TCP Flags: SYN)
Apr 28 11:47:02.946: jenkins/wamblee-jenkins-0:52402 (ID:21112) <> 192.168.178.118:6443 (kube-apiserver) Policy denied DROPPED (TCP Flags: SYN)
Apr 28 11:47:03.952: jenkins/wamblee-jenkins-0:52402 (ID:21112) <> 192.168.178.118:6443 (kube-apiserver) Policy denied DROPPED (TCP Flags: SYN)
Apr 28 11:47:05.968: jenkins/wamblee-jenkins-0:52402 (ID:21112) <> 192.168.178.118:6443 (kube-apiserver) Policy denied DROPPED (TCP Flags: SYN)
Apr 28 11:47:08.683: exposure/httpd-wamblee-org-6d74cbcb5-mhbll:56978 (ID:36342) <> jenkins/wamblee-jenkins-0:8080 (ID:21112) Policy denied DROPPED (TCP Flags: SYN)
Apr 28 11:47:09.707: exposure/httpd-wamblee-org-6d74cbcb5-mhbll:56978 (ID:36342) <> jenkins/wamblee-jenkins-0:8080 (ID:21112) Policy denied DROPPED (TCP Flags: SYN)
Apr 28 11:47:10.224: jenkins/wamblee-jenkins-0:52402 (ID:21112) <> 192.168.178.118:6443 (kube-apiserver) Policy denied DROPPED (TCP Flags: SYN)
Apr 28 11:47:11.723: exposure/httpd-wamblee-org-6d74cbcb5-mhbll:56978 (ID:36342) <> jenkins/wamblee-jenkins-0:8080 (ID:21112) Policy denied DROPPED (TCP Flags: SYN)
Apr 28 11:47:15.803: 192.168.178.118:6443 (kube-apiserver) <> jenkins/wamblee-jenkins-0:57734 (ID:21112) Policy denied DROPPED (TCP Flags: ACK, FIN, PSH)
Apr 28 11:47:15.915: exposure/httpd-wamblee-org-6d74cbcb5-mhbll:56978 (ID:36342) <> jenkins/wamblee-jenkins-0:8080 (ID:21112) Policy denied DROPPED (TCP Flags: SYN)
Apr 28 11:47:18.416: jenkins/wamblee-jenkins-0:52402 (ID:21112) <> 192.168.178.118:6443 (kube-apiserver) Policy denied DROPPED (TCP Flags: SYN)

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@ErikEngerd ErikEngerd added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Apr 28, 2024
@ErikEngerd (Author)

The upgrade command I used was

helm upgrade --install cilium cilium/cilium --version 1.15.4 \
  --set upgradeCompatibility=1.14 \
  --namespace kube-system --values values.yaml
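
The pre-flight check mentioned in the issue description follows the documented upgrade procedure, roughly like this (a sketch of the standard steps, not a verbatim transcript of what I ran):

helm install cilium-preflight cilium/cilium --version 1.15.4 \
  --namespace kube-system \
  --set preflight.enabled=true \
  --set agent=false \
  --set operator.enabled=false

# wait until the check deployment is ready, then clean up
kubectl -n kube-system get deployment cilium-pre-flight-check -w
helm uninstall cilium-preflight --namespace kube-system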

@ErikEngerd (Author)

When I first detected the problem, I was still using 'kube-apiserver' in toEntities (which works with 1.14.10); I replaced it with 'all' as part of troubleshooting.
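
For reference, the egress rule before that change looked like this:

  egress:
    - toEntities:
        - kube-apiserver
      toPorts:
        - ports:
            - port: "6443"
              protocol: TCP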

@ErikEngerd (Author) commented Apr 28, 2024

I did some additional testing on a single-node Kubernetes cluster running k8s 1.29.2 with Cilium 1.15.4.

I created a dummy jenkins pod:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: jenkins
    app.kubernetes.io/component: jenkins-controller
    app.kubernetes.io/instance: wamblee
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: jenkins
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: wamblee-jenkins-7f469f5cd6
    statefulset.kubernetes.io/pod-name: wamblee-jenkins-0
  name: jenkins
  namespace: jenkins
spec:
  containers:
  - image: alpine
    name: jenkins
    command:
      - sh
      - -c 
      - | 
        apk add curl
        while : 
        do
          curl -vk https://kubernetes.default.svc.cluster.local
          sleep 1
        done

The idea is that the pod continuously tries to connect to the API server: curl shows 'Could not resolve host' if the DNS lookup of the API server fails, and times out if the subsequent call to the API server is blocked as well. Doing this, I quickly found out that using the short label 'run: jenkins' actually worked, so I used that label to allow traffic to the API server. That way I could quickly test DNS access using various network policy label selectors.

I used the following three network policies

---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: default-allow-nothing
  namespace: jenkins
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: jenkins-allow-external-traffic
  namespace: jenkins
spec:
  podSelector:
    matchLabels:
      #run: jenkins
      #app.kubernetes.io/component: jenkins-controller
      #app.kubernetes.io/instance: wamblee
      #app.kubernetes.io/managed-by: Helm
      #app.kubernetes.io/name: jenkins

      # pod-index is not used by cilium
      #apps.kubernetes.io/pod-index: "0"
      #controller-revision-hash: wamblee-jenkins-7f469f5cd6
      statefulset.kubernetes.io/pod-name: wamblee-jenkins-0

  egress:
    - ports:
        - port: 53
          protocol: TCP
        - port: 53
          protocol: UDP
        - port: 80
        - port: 443
---
kind: CiliumNetworkPolicy
apiVersion: cilium.io/v2
metadata:
  name: jenkins-api-server-access
  namespace: jenkins
spec:
  endpointSelector:
    matchLabels:
      #statefulset.kubernetes.io/pod-name: wamblee-jenkins-0
      run: jenkins
  egress:
    - toEntities:
        - kube-apiserver
      toPorts:
        - ports:
            - port: "6443"
              protocol: TCP


So: (1) a default-allow-nothing policy, (2) a policy allowing DNS and some other protocols, and (3) a policy allowing access to the API server.

Then I tried using each of the labels of the jenkins pod, one by one, in the jenkins-allow-external-traffic network policy.
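
The cycle per label was roughly (a sketch; policies.yaml is a hypothetical file holding the three policies above):

# Uncomment a different label in jenkins-allow-external-traffic, then:
kubectl apply -f policies.yaml

# 'Could not resolve host' in the loop output means DNS egress is still
# blocked, i.e. the podSelector label did not match
kubectl -n jenkins logs -f jenkins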

I found that all labels worked apart from the last three: apps.kubernetes.io/pod-index, controller-revision-hash, and statefulset.kubernetes.io/pod-name. Of course, not all of these labels would make sense in a network policy in practice, but the last one does, and I think it is not the job of the CNI to silently discard certain labels.

Is there a reason for this new behavior? I could not find any documentation for it on the Kubernetes website, and it has definitely changed in 1.15.4 compared to 1.14.10. I also double-checked this test setup with 1.14.10, and that version still works as expected.

@ErikEngerd (Author) commented Apr 28, 2024

I think I have found it, based on the 'identity-relevant labels' documented at https://docs.cilium.io/en/stable/operations/performance/scalability/identity-relevant-labels/.

The set of labels filtered out this way has been extended in 1.15.4, so labels such as statefulset.kubernetes.io/pod-name are no longer identity-relevant. I think in any case three things should be done:

  • validation should be performed in the pre-flight checks, which should fail when such labels are used in existing policies
  • a validating webhook should reject network policies that use these labels
  • users should be informed in the release notes that the set of identity-relevant labels has changed

Also, it should be documented in a more prominent place that Cilium, by default and for performance reasons, limits the labels that can be used for pod selection. I had never encountered this before, and it cost me hours of troubleshooting to find the cause. So pre-flight checks and a validating webhook would be very welcome.
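
For completeness, the workaround described on that docs page is to put the missing label back into the agent's label filter. A sketch, assuming the Helm chart exposes the agent's --labels option as a top-level labels value (illustrative only: setting this option replaces the default filter, so the full default list from the docs page must be included as well):

# values.yaml (illustrative; take the complete default list from the docs)
labels: "k8s:io.kubernetes.pod.namespace k8s:app k8s:statefulset.kubernetes.io/pod-name"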

@rauanmayemir (Contributor)

Is your issue similar to #30073?

@ErikEngerd (Author)

I don't think it is, since I am not using audit mode. Also, this issue is about a difference between 1.15.4 and 1.14.10.

@youngnick youngnick added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels Apr 30, 2024
@youngnick (Contributor)

Hi @ErikEngerd, it does look likely that this is related to the PR linked above (#28003). As @joestringer noted there, that really should have been called out as a bigger change, and mentioned in the upgrade guide, as you say.

I've marked this for the attention of SIG Policy, so they can prioritize accordingly.

@ErikEngerd (Author)

This is precisely the issue, indeed.
