
[BUG][v1.6.2-rc1] Workload pod got stuck in ContainerStatusUnknown after node shutdown and reboot #8550

Closed

yangchiu opened this issue May 12, 2024 · 5 comments

Labels: area/resilience (System or volume resilience), area/upstream (Upstream related like tgt upstream library), kind/bug, priority/1 (Highly recommended to fix in this release, managed by PO), reproduce/rare (< 50% reproducible), severity/2 (Function working but has a major issue w/o workaround)
Milestone: v1.6.2

@yangchiu (Member)

Describe the bug

When running the negative test case Power Off Node One By Once For More Than Pod Eviction Timeout While Workload Heavy Writing, we unexpectedly encountered a deployment pod stuck in ContainerStatusUnknown indefinitely.

The test case steps are as follows:

  1. Create deployment 0 with an rwo volume
  2. Create deployment 1 with an rwx volume
  3. Create deployment 2 with an rwo strict-local volume (see the sketch after this list)
  4. Create statefulset 0 with an rwo volume
  5. Create statefulset 1 with an rwx volume
  6. Create statefulset 2 with an rwo strict-local volume
  7. Keep writing data to deployment 0 ~ 2 and statefulset 0 ~ 2
  8. Power off node 0 for 6 minutes, and then power it on
  9. Power off node 1 for 6 minutes, and then power it on
  10. Power off node 2 for 6 minutes, and then power it on
  11. Check that deployment 0 ~ 2 and statefulset 0 ~ 2 pods are running and able to read/write without problems
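
For illustration, here is a minimal, hypothetical sketch of what step 3 roughly amounts to, using the Kubernetes Python client. The StorageClass name and the exact manifests are assumptions for illustration, not the actual longhorn-tests code:

```python
# Hypothetical sketch of step 3: a Deployment backed by a Longhorn RWO
# strict-local volume. Names/StorageClass are assumptions, not the real test code.
from kubernetes import client, config

config.load_kube_config()

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "e2e-test-claim-2"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "longhorn-strict-local",  # assumed StorageClass name
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "e2e-test-deployment-2",
                 "labels": {"app": "e2e-test-deployment-2"}},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "e2e-test-deployment-2"}},
        "template": {
            "metadata": {"labels": {"app": "e2e-test-deployment-2"}},
            "spec": {
                "containers": [{
                    "name": "sleep",
                    "image": "busybox",
                    "args": ["/bin/sh", "-c", "while true; do date; sleep 5; done"],
                    "volumeMounts": [{"name": "pod-data", "mountPath": "/data"}],
                }],
                "volumes": [{
                    "name": "pod-data",
                    "persistentVolumeClaim": {"claimName": "e2e-test-claim-2"},
                }],
            },
        },
    },
}

# Create the PVC first, then the Deployment that mounts it at /data.
client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)
client.AppsV1Api().create_namespaced_deployment("default", deployment)
```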

In https://ci.longhorn.io/job/private/job/longhorn-e2e-test/557/, after the nodes were rebooted in steps 8 ~ 10, the pod of the deployment with the rwo strict-local volume got stuck in ContainerStatusUnknown indefinitely:

# kubectl get pods -owide
NAME                                     READY   STATUS                   RESTARTS        AGE     IP            NODE            NOMINATED NODE   READINESS GATES
longhorn-test-minio-f4bbdc54d-lhddt      1/1     Running                  0               5h27m   10.42.4.130   ip-10-0-2-198   <none>           <none>
e2e-test-deployment-2-7ddccb49f4-5fxcp   0/1     ContainerStatusUnknown   0               5h26m   <none>        ip-10-0-2-198   <none>           <none>
e2e-test-deployment-2-7ddccb49f4-lx7gv   1/1     Running                  0               5h24m   10.42.4.136   ip-10-0-2-198   <none>           <none>
e2e-test-deployment-0-b957b9f54-x7t82    1/1     Running                  0               5h18m   10.42.1.146   ip-10-0-2-53    <none>           <none>
e2e-test-deployment-1-dbc678584-8sq9k    1/1     Running                  0               5h18m   10.42.1.147   ip-10-0-2-53    <none>           <none>
e2e-test-statefulset-1-0                 1/1     Running                  0               5h15m   10.42.3.194   ip-10-0-2-78    <none>           <none>
e2e-test-statefulset-0-0                 1/1     Running                  0               5h15m   10.42.3.196   ip-10-0-2-78    <none>           <none>
e2e-test-statefulset-2-0                 1/1     Running                  0               5h14m   10.42.3.198   ip-10-0-2-78    <none>           <none>
# kubectl describe pod e2e-test-deployment-2-7ddccb49f4-5fxcp 
Name:           e2e-test-deployment-2-7ddccb49f4-5fxcp
Namespace:      default
Priority:       0
Node:           ip-10-0-2-198/10.0.2.198
Start Time:     Sat, 11 May 2024 21:05:04 +0000
Labels:         app=e2e-test-deployment-2
                pod-template-hash=7ddccb49f4
                test.longhorn.io=e2e
Annotations:    <none>
Status:         Failed
Reason:         Evicted
Message:        The node was low on resource: ephemeral-storage. Threshold quantity: 2145752505, available: 1470928Ki. 
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/e2e-test-deployment-2-7ddccb49f4
Containers:
  sleep:
    Container ID:  
    Image:         busybox
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      while true;do date;sleep 5; done
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from pod-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t8j6t (ro)
Conditions:
  Type                        Status
  DisruptionTarget            True 
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  pod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  e2e-test-claim-2
    ReadOnly:   false
  kube-api-access-t8j6t:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:                      <none>
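
For context, the numbers in the eviction message already explain why the pod was terminated: the node's available ephemeral storage had dropped below the kubelet's eviction threshold. A quick unit conversion of the values above (plain arithmetic, not additional data from the support bundle):

```python
# Convert the figures from the kubelet eviction message to GiB.
threshold_bytes = 2145752505              # "Threshold quantity" (bytes)
available_bytes = 1470928 * 1024          # "available: 1470928Ki"

print(threshold_bytes / 1024**3)          # ~2.00 GiB eviction threshold
print(available_bytes / 1024**3)          # ~1.40 GiB actually available
print(available_bytes < threshold_bytes)  # True -> node under ephemeral-storage
                                          # pressure, kubelet starts evicting pods
```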

(Screenshots attached: unknown_pod1, unknown_pod2)

To Reproduce

Running negative test case Power Off Node One By Once For More Than Pod Eviction Timeout While Workload Heavy Writing repeatedly.

Expected behavior

All deployment and statefulset pods return to Running after each node is powered back on, and their volumes remain readable and writable.

Support bundle for troubleshooting

supportbundle_26ac5094-4709-4ab5-bb99-867e4b13cb8f_2024-05-12T02-28-45Z.zip

Environment

  • Longhorn version: v1.6.2-rc1
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.29.3+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: sles 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): aws
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu yangchiu added kind/bug reproduce/rare < 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) area/resilience System or volume resilience labels May 12, 2024
@yangchiu yangchiu added this to the v1.6.2 milestone May 12, 2024
@derekbit (Member)

cc @c3y1huang

@derekbit derekbit modified the milestones: v1.6.2, v1.6.3 May 14, 2024
@derekbit derekbit changed the title [BUG][v1.6.2-rc1] Workload pod got stuck in ContainerStatusUnknown after node shutdown and reboot [BUG][v1.6.3-rc1] Workload pod got stuck in ContainerStatusUnknown after node shutdown and reboot May 14, 2024
@derekbit derekbit changed the title [BUG][v1.6.3-rc1] Workload pod got stuck in ContainerStatusUnknown after node shutdown and reboot [BUG][v1.6.2-rc1] Workload pod got stuck in ContainerStatusUnknown after node shutdown and reboot May 14, 2024
@derekbit derekbit modified the milestones: v1.6.3, v1.6.2 May 14, 2024
@c3y1huang (Contributor) commented May 14, 2024

The node was low on resource: ephemeral-storage. Threshold quantity: 2145752505, available: 1470928Ki.

# Deployment
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
    creationTimestamp: "2024-05-11T09:27:32Z"
    generation: 1
    labels:
      app: e2e-test-deployment-2
      test.longhorn.io: e2e
    managedFields:
    - apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:test.longhorn.io: {}
        f:spec:
          f:progressDeadlineSeconds: {}
          f:replicas: {}
          f:revisionHistoryLimit: {}
          f:selector: {}
          f:strategy:
            f:rollingUpdate:
              .: {}
              f:maxSurge: {}
              f:maxUnavailable: {}
            f:type: {}
          f:template:
            f:metadata:
              f:labels:
                .: {}
                f:app: {}
                f:test.longhorn.io: {}
            f:spec:
              f:containers:
                k:{"name":"sleep"}:
                  .: {}
                  f:args: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:name: {}
                  f:resources: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
                  f:volumeMounts:
                    .: {}
                    k:{"mountPath":"/data"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
              f:dnsPolicy: {}
              f:restartPolicy: {}
              f:schedulerName: {}
              f:securityContext: {}
              f:terminationGracePeriodSeconds: {}
              f:volumes:
                .: {}
                k:{"name":"pod-data"}:
                  .: {}
                  f:name: {}
                  f:persistentVolumeClaim:
                    .: {}
                    f:claimName: {}
      manager: OpenAPI-Generator
      operation: Update
      time: "2024-05-11T09:27:32Z"
    - apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:deployment.kubernetes.io/revision: {}
        f:status:
          f:availableReplicas: {}
          f:conditions:
            .: {}
            k:{"type":"Available"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Progressing"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:observedGeneration: {}
          f:readyReplicas: {}
          f:replicas: {}
          f:updatedReplicas: {}
      manager: k3s
      operation: Update
      subresource: status
      time: "2024-05-11T21:12:47Z"
    name: e2e-test-deployment-2
    namespace: default
    resourceVersion: "176770"
    uid: 1ab4cb23-2115-4cba-a3cb-51abaa549208
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: e2e-test-deployment-2
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        creationTimestamp: "null"
        labels:
          app: e2e-test-deployment-2
          test.longhorn.io: e2e
      spec:
        containers:
        - args:
          - /bin/sh
          - -c
          - while true;do date;sleep 5; done
          image: busybox
          imagePullPolicy: IfNotPresent
          name: sleep
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /data
            name: pod-data
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
        volumes:
        - name: pod-data
          persistentVolumeClaim:
            claimName: e2e-test-claim-2
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2024-05-11T09:27:32Z"
      lastUpdateTime: "2024-05-11T09:27:44Z"
      message: ReplicaSet "e2e-test-deployment-2-7ddccb49f4" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    - lastTransitionTime: "2024-05-11T21:12:47Z"
      lastUpdateTime: "2024-05-11T21:12:47Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    observedGeneration: 1
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
# Pod
- apiVersion: v1
  kind: Pod
  metadata:
    creationTimestamp: "2024-05-11T21:05:04Z"
    generateName: e2e-test-deployment-2-7ddccb49f4-
    labels:
      app: e2e-test-deployment-2
      pod-template-hash: 7ddccb49f4
      test.longhorn.io: e2e
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:generateName: {}
          f:labels:
            .: {}
            f:app: {}
            f:pod-template-hash: {}
            f:test.longhorn.io: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"e46892f6-02e1-4298-be3f-18d6ed4be19d"}: {}
        f:spec:
          f:containers:
            k:{"name":"sleep"}:
              .: {}
              f:args: {}
              f:image: {}
              f:imagePullPolicy: {}
              f:name: {}
              f:resources: {}
              f:terminationMessagePath: {}
              f:terminationMessagePolicy: {}
              f:volumeMounts:
                .: {}
                k:{"mountPath":"/data"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
          f:dnsPolicy: {}
          f:enableServiceLinks: {}
          f:restartPolicy: {}
          f:schedulerName: {}
          f:securityContext: {}
          f:terminationGracePeriodSeconds: {}
          f:volumes:
            .: {}
            k:{"name":"pod-data"}:
              .: {}
              f:name: {}
              f:persistentVolumeClaim:
                .: {}
                f:claimName: {}
      manager: k3s
      operation: Update
      time: "2024-05-11T21:05:04Z"
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:conditions:
            k:{"type":"ContainersReady"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"DisruptionTarget"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Initialized"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"PodReadyToStartContainers"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"Ready"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:containerStatuses: {}
          f:hostIP: {}
          f:hostIPs: {}
          f:message: {}
          f:phase: {}
          f:reason: {}
          f:startTime: {}
      manager: k3s
      operation: Update
      subresource: status
      time: "2024-05-11T21:07:00Z"
    name: e2e-test-deployment-2-7ddccb49f4-5fxcp
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: e2e-test-deployment-2-7ddccb49f4
      uid: e46892f6-02e1-4298-be3f-18d6ed4be19d
    resourceVersion: "174844"
    uid: 3c1630b9-d41d-4a64-a812-ba2d4b4d15e4
  spec:
    containers:
    - args:
      - /bin/sh
      - -c
      - while true;do date;sleep 5; done
      image: busybox
      imagePullPolicy: IfNotPresent
      name: sleep
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /data
        name: pod-data
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-t8j6t
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: ip-10-0-2-198
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: e2e-test-claim-2
    - name: kube-api-access-t8j6t
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:06:59Z"
      message: 'The node was low on resource: ephemeral-storage. Threshold quantity:
        2145752505, available: 1470928Ki. '
      reason: TerminationByKubelet
      status: "True"
      type: DisruptionTarget
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:05:04Z"
      status: "False"
      type: PodReadyToStartContainers
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:05:04Z"
      status: "True"
      type: Initialized
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:05:04Z"
      reason: PodFailed
      status: "False"
      type: Ready
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:05:04Z"
      reason: PodFailed
      status: "False"
      type: ContainersReady
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:05:04Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - image: busybox
      imageID: "null"
      lastState: {}
      name: sleep
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          exitCode: 137
          finishedAt: null
          message: The container could not be located when the pod was terminated
          reason: ContainerStatusUnknown
          startedAt: "null"
    hostIP: 10.0.2.198
    hostIPs:
    - ip: 10.0.2.198
    message: 'The node was low on resource: ephemeral-storage. Threshold quantity:
      2145752505, available: 1470928Ki. '
    phase: Failed
    qosClass: BestEffort
    reason: Evicted
    startTime: "2024-05-11T21:05:04Z"
- apiVersion: v1
  kind: Pod
  metadata:
    creationTimestamp: "2024-05-11T21:06:59Z"
    generateName: e2e-test-deployment-2-7ddccb49f4-
    labels:
      app: e2e-test-deployment-2
      pod-template-hash: 7ddccb49f4
      test.longhorn.io: e2e
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:generateName: {}
          f:labels:
            .: {}
            f:app: {}
            f:pod-template-hash: {}
            f:test.longhorn.io: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"e46892f6-02e1-4298-be3f-18d6ed4be19d"}: {}
        f:spec:
          f:containers:
            k:{"name":"sleep"}:
              .: {}
              f:args: {}
              f:image: {}
              f:imagePullPolicy: {}
              f:name: {}
              f:resources: {}
              f:terminationMessagePath: {}
              f:terminationMessagePolicy: {}
              f:volumeMounts:
                .: {}
                k:{"mountPath":"/data"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
          f:dnsPolicy: {}
          f:enableServiceLinks: {}
          f:restartPolicy: {}
          f:schedulerName: {}
          f:securityContext: {}
          f:terminationGracePeriodSeconds: {}
          f:volumes:
            .: {}
            k:{"name":"pod-data"}:
              .: {}
              f:name: {}
              f:persistentVolumeClaim:
                .: {}
                f:claimName: {}
      manager: k3s
      operation: Update
      time: "2024-05-11T21:06:59Z"
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:conditions:
            .: {}
            k:{"type":"ContainersReady"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"Initialized"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"PodReadyToStartContainers"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"PodScheduled"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Ready"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
          f:containerStatuses: {}
          f:hostIP: {}
          f:hostIPs: {}
          f:phase: {}
          f:podIP: {}
          f:podIPs:
            .: {}
            k:{"ip":"10.42.4.136"}:
              .: {}
              f:ip: {}
          f:startTime: {}
      manager: k3s
      operation: Update
      subresource: status
      time: "2024-05-11T21:12:47Z"
    name: e2e-test-deployment-2-7ddccb49f4-lx7gv
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: e2e-test-deployment-2-7ddccb49f4
      uid: e46892f6-02e1-4298-be3f-18d6ed4be19d
    resourceVersion: "176768"
    uid: a9705a91-e439-4f28-b079-ba23c0372965
  spec:
    containers:
    - args:
      - /bin/sh
      - -c
      - while true;do date;sleep 5; done
      image: busybox
      imagePullPolicy: IfNotPresent
      name: sleep
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /data
        name: pod-data
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-6l5qb
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: ip-10-0-2-198
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: e2e-test-claim-2
    - name: kube-api-access-6l5qb
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:12:47Z"
      status: "True"
      type: PodReadyToStartContainers
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:12:35Z"
      status: "True"
      type: Initialized
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:12:47Z"
      status: "True"
      type: Ready
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:12:47Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:12:35Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://76068cd483578d623459b5bdd3af78c3eb5cabaa34067fc6e54a534737f32ca0
      image: docker.io/library/busybox:latest
      imageID: docker.io/library/busybox@sha256:5eef5ed34e1e1ff0a4ae850395cbf665c4de6b4b83a32a0bc7bcb998e24e7bbb
      lastState: {}
      name: sleep
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2024-05-11T21:12:47Z"
    hostIP: 10.0.2.198
    hostIPs:
    - ip: 10.0.2.198
    phase: Running
    podIP: 10.42.4.136
    podIPs:
    - ip: 10.42.4.136
    qosClass: BestEffort
    startTime: "2024-05-11T21:12:35Z"
# Node
  status:
    addresses:
    - address: 10.0.2.198
      type: InternalIP
    - address: ip-10-0-2-198
      type: Hostname
    conditions:
    - lastHeartbeatTime: "2024-05-12T02:24:37Z"
      lastTransitionTime: "2024-05-11T21:12:35Z"
      message: kubelet has no disk pressure
      reason: KubeletHasNoDiskPressure
      status: "False"
      type: DiskPressure

@c3y1huang (Contributor) commented May 14, 2024

#8550 (comment)

  1. The test was paused while hanging on the error: Waiting for e2e-test-deployment-2 pods ['e2e-test-deployment-2-7ddccb49f4-5fxcp'] stable, retry (58495).
  2. The test created the deployment e2e-test-deployment-2 with a single ReplicaSet (replicas: 1).
  3. During the test, the kubelet detected low ephemeral-storage on node ip-10-0-2-198 and initiated eviction of pod e2e-test-deployment-2-7ddccb49f4-5fxcp:
          reason: TerminationByKubelet
          status: "True"
          type: DisruptionTarget

  4. A new deployment pod e2e-test-deployment-2-7ddccb49f4-lx7gv was then created and reached the Running state:
    e2e-test-deployment-2-7ddccb49f4-5fxcp   0/1     ContainerStatusUnknown   0               5h26m   <none>        ip-10-0-2-198   <none>           <none>
    e2e-test-deployment-2-7ddccb49f4-lx7gv   1/1     Running                  0               5h24m   10.42.4.136   ip-10-0-2-198   <none>           <none>

  5. However, despite the new pod being successfully deployed and running, the kubelet failed to clean up the evicted pod (e2e-test-deployment-2-7ddccb49f4-5fxcp).
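
As a hedged illustration (not taken from the support bundle), leftover evicted pods like this one can be located, and if necessary removed manually, with the Kubernetes Python client:

```python
# Hypothetical helper: list (and optionally delete) Failed pods that the kubelet
# evicted but never garbage-collected, e.g. e2e-test-deployment-2-7ddccb49f4-5fxcp.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

failed = v1.list_namespaced_pod("default", field_selector="status.phase=Failed")
for pod in failed.items:
    if pod.status.reason == "Evicted":
        print(f"leftover evicted pod: {pod.metadata.name} ({pod.status.message})")
        # Manual cleanup, since the kubelet did not remove it:
        # v1.delete_namespaced_pod(pod.metadata.name, "default")
```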

This doesn't look like a Longhorn bug: the deployment pod eviction was initiated by the kubelet, and the kubelet is expected to handle the cleanup as well.

There's an ongoing upstream discussion that seems to be related: kubernetes/kubernetes#122160

We could consider enhancing the test case to filter out pods terminated by the kubelet.
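
For example, a minimal sketch of such a filter is shown below. The function name and the exact conditions are illustrative, not the code that eventually landed in longhorn/longhorn-tests#1902:

```python
# Illustrative filter for the e2e framework: when collecting workload pods,
# skip pods the kubelet has already terminated or evicted so they don't block
# the "pods stable" wait. Not the actual longhorn-tests implementation.
def filter_kubelet_terminated_pods(pods):
    alive = []
    for pod in pods:
        # Skip pods the kubelet evicted (e.g. due to ephemeral-storage pressure).
        if pod.status.phase == "Failed" and pod.status.reason == "Evicted":
            continue
        # Skip pods whose container was lost across the node reboot.
        statuses = pod.status.container_statuses or []
        if any(s.state and s.state.terminated
               and s.state.terminated.reason == "ContainerStatusUnknown"
               for s in statuses):
            continue
        alive.append(pod)
    return alive
```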

cc @derekbit @yangchiu @innobead

@longhorn-io-github-bot commented May 14, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

  • Which areas/issues this PR might have potential impacts on?
    Area test
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at fix(robot): skip kubelet terminated pod in get_workload_pods longhorn-tests#1902
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@derekbit derekbit added the area/upstream Upstream related like tgt upstream library label May 15, 2024
@c3y1huang (Contributor)

Closing as this has been tested. longhorn/longhorn-tests#1902 (review)
