Upgrade 1.14.10 -> 1.15.4: network policies start dropping traffic #32213
Comments
The upgrade command I used was a plain helm upgrade of the existing installation (cilium is installed via helm with the values.yaml shown further down).
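A representative form of such an upgrade, where the release name cilium, the kube-system namespace, and --reuse-values are assumptions rather than the exact invocation:

```bash
# Sketch of a typical helm-based upgrade; release name and namespace are
# assumptions about this particular installation.
helm repo update
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --version 1.15.4 \
  --reuse-values
```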
When I detected the problem, I was still using 'kube-apiserver' in the toEntities (which works with 1.14.10); I replaced it with 'all' for troubleshooting.
I did some additional testing on a single-node k8s cluster running k8s 1.29.2 with cilium 1.15.4. I created a dummy jenkins pod.
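In sketch form it was something like the following; the image, command, and label values are assumptions (any image with a shell and curl would do), and the labels mirror the ones discussed below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins
  labels:
    run: jenkins
    # illustrative values for the StatefulSet-style labels under test
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: jenkins-6f7d9c8b5
    statefulset.kubernetes.io/pod-name: jenkins-0
spec:
  containers:
    - name: probe
      image: curlimages/curl
      command: ["/bin/sh", "-c"]
      args:
        - |
          while true; do
            # prints "Could not resolve host" if DNS is blocked, and
            # times out if the API server itself is unreachable
            curl -sSk --max-time 5 https://kubernetes.default.svc/version
            sleep 5
          done
```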
The idea is that it continuously tries to connect to the API server: the log shows 'Could not resolve host' if the Kubernetes API server DNS lookup fails, and the subsequent call to the API server times out if it is blocked as well. Doing this, I quickly found out that using the short label 'run: jenkins' actually worked, so I used it to allow traffic to the API server using network policies. Then I could quickly test DNS access using various network policy label selectors. I used the following three network policies.
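In sketch form; only jenkins-allow-external-traffic is named in the original, the other names, the port list, and the use of toEntities for policy (3) are assumptions:

```yaml
# (1) default deny: the empty podSelector selects every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-allow-nothing
  namespace: jenkins
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
# (2) allow DNS egress for all pods (the "some other protocols" from the
# description are omitted here)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: jenkins
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# (3) allow the test pod to reach the API server; the matchLabels entry is
# the variable under test (run: jenkins worked, the last three labels did not)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: jenkins-allow-external-traffic
  namespace: jenkins
spec:
  endpointSelector:
    matchLabels:
      run: jenkins
  egress:
    - toEntities:
        - kube-apiserver
```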
So: (1) a default allow-nothing policy, (2) a policy allowing DNS and some other protocols, and (3) a policy allowing access to the API server. I then tried each of the labels on the jenkins pod, one by one, in the jenkins-allow-external-traffic network policy. I found that all labels worked apart from the last three: apps.kubernetes.io/pod-index, controller-revision-hash, and statefulset.kubernetes.io/pod-name. Of course, not all of these labels make sense to use in a network policy in practice, but the last one does, and I think it is not the job of the CNI to silently discard certain labels. Is there a reason for this new behavior? I could not find any documentation for this on the k8s website, and it has definitely changed in 1.15.4 compared to 1.14.10. I also double-checked this test setup with 1.14.10, and that version still works as expected.
I think I have found it, based on the 'identity relevant labels' documented at https://docs.cilium.io/en/stable/operations/performance/scalability/identity-relevant-labels/. The set of filtered labels has been extended in version 1.15.4, and I think a few things should be done about this.
It should also be documented in a more prominent place that cilium, by default and for performance reasons, limits the labels that can be used for pod selection. I had never encountered this before and it cost me hours of troubleshooting to find the cause, so pre-flight checks and a validating webhook would be very welcome.
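For anyone else hitting this: per the docs page linked above, the set of identity-relevant labels can be tuned via the labels option. An untested sketch of re-including statefulset.kubernetes.io/pod-name via helm values, with the syntax as documented there:

```yaml
# Untested sketch based on the identity-relevant-labels docs: setting
# "labels" replaces Cilium's default list, so the namespace label that
# policy enforcement relies on must be listed explicitly as well.
labels: "k8s:io.kubernetes.pod.namespace k8s:app k8s:run k8s:statefulset.kubernetes.io/pod-name"
```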
Is your issue similar to #30073?
I don't think it is, since I am not using audit mode. Also, this issue is about a difference between 1.15.4 and 1.14.10.
Hi @ErikEngerd, it does look likely that this is related to the PR linked above (#28003). As @joestringer noted there, that really should have been called out as a bigger change and mentioned in the upgrade guide, as you say. I've marked this for the attention of SIG Policy so they can prioritize accordingly.
This is precisely the issue, indeed.
Is there an existing issue for this?
What happened?
I upgraded from 1.14.10 to 1.15.4 following the upgrade instructions, including the pre-flight checks, which passed.
I am using a fairly minimal values.yaml for the cilium installation.
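A sketch of what such a minimal values.yaml can look like; these are placeholder settings using standard options of the cilium helm chart, not the actual file:

```yaml
# Illustrative sketch only; the real values.yaml is not reproduced here.
ipam:
  mode: kubernetes
kubeProxyReplacement: false
hubble:
  relay:
    enabled: true
  ui:
    enabled: true
```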
I have an existing setup where I use network policies extensively, and I now see a number of policies failing. One example is my jenkins server, where a lot of traffic is no longer being allowed; see the log output for details. This involves standard network policies for pod-to-pod communication not involving the host network, as well as communication to the API server, for which I have a cilium network policy.
The involved pod is the jenkins pod, managed by a StatefulSet.
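Its metadata is roughly as follows (values made up for illustration); the last three labels are added automatically by the StatefulSet controller and are exactly the kind that 1.15.4 filters out by default:

```yaml
metadata:
  name: jenkins-0
  namespace: jenkins
  labels:
    run: jenkins
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: jenkins-6f7d9c8b5
    statefulset.kubernetes.io/pod-name: jenkins-0
```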
There are no network policies in the kube-system namespace so access to the API server should be determined only based on policies in the jenkins namespace.
I suspect that matching of the pod labels is not working correctly. What does appear to be working is the default-allow-nothing rule (default deny policy). If the matchLabels no longer match with cilium 1.15.4, then the specific network policies for jenkins will not apply to the jenkins pod, leading to exactly the behavior I am seeing, with all ingress and egress traffic being dropped.
For instance, I have a policy to allow traffic to the API server.
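Based on the details elsewhere in this thread (toEntities was 'kube-apiserver', later 'all' for troubleshooting), it is shaped roughly like this sketch; the endpointSelector label and policy name are assumptions:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: jenkins-allow-apiserver
  namespace: jenkins
spec:
  endpointSelector:
    matchLabels:
      # assumed selector; this is one of the labels that 1.15.4
      # filters out by default
      statefulset.kubernetes.io/pod-name: jenkins-0
  egress:
    - toEntities:
        - kube-apiserver   # later replaced with "all" for troubleshooting
```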
This policy is definitely working: when I delete it under cilium 1.14.10, I see traffic from jenkins to the apiserver getting DROPPED, and the messages disappear again when I re-apply the network policy.
As mentioned before, there are many other network policies that relate to the jenkins pod and are failing. They all use the same label selector for the jenkins pod. However, the one above is the easiest to troubleshoot, since the problem can be reproduced with a single pod in a single namespace and a single network policy.
Cilium Version
1.15.4
Kernel Version
Linux baboon 6.1.0-16-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.67-1 (2023-12-12) x86_64 GNU/Linux
Debian 12
Kubernetes Version
1.28.5
Regression
Yes, version 1.14.10 was working fine
Sysdump
cilium-sysdump-20240428-140900.zip
Relevant log output
Anything else?
No response