
[BUG][v1.6.2-rc1] Workload pod got stuck in ContainerStatusUnknown after node shutdown and reboot #8550

Closed

yangchiu opened this issue May 12, 2024 · 5 comments

Labels: area/resilience (System or volume resilience), area/upstream (Upstream related like tgt upstream library), kind/bug, priority/1 (Highly recommended to fix in this release, managed by PO), reproduce/rare (< 50% reproducible), severity/2 (Function working but has a major issue w/o workaround)
Milestone: v1.6.2

@yangchiu (Member)

Describe the bug

When running the negative test case Power Off Node One By Once For More Than Pod Eviction Timeout While Workload Heavy Writing, we unexpectedly encountered a deployment pod stuck in ContainerStatusUnknown indefinitely.

The test case steps are as follows:

  1. Create deployment 0 with an rwo volume
  2. Create deployment 1 with an rwx volume
  3. Create deployment 2 with an rwo strict-local volume (see the sketch after this list)
  4. Create statefulset 0 with an rwo volume
  5. Create statefulset 1 with an rwx volume
  6. Create statefulset 2 with an rwo strict-local volume
  7. Keep writing data to deployment 0 ~ 2 and statefulset 0 ~ 2
  8. Power off node 0 for 6 minutes, and then power it on
  9. Power off node 1 for 6 minutes, and then power it on
  10. Power off node 2 for 6 minutes, and then power it on
  11. Check that deployment 0 ~ 2 and statefulset 0 ~ 2 pods are running and able to read/write without problems
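
For illustration, here is a minimal, hypothetical sketch of what step 3 roughly amounts to, using the Kubernetes Python client. The StorageClass name and the exact manifests are assumptions for illustration, not the actual longhorn-tests code:

```python
# Hypothetical sketch of step 3: a Deployment backed by a Longhorn RWO
# strict-local volume. Names/StorageClass are assumptions, not the real test code.
from kubernetes import client, config

config.load_kube_config()

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "e2e-test-claim-2"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "longhorn-strict-local",  # assumed StorageClass name
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "e2e-test-deployment-2",
                 "labels": {"app": "e2e-test-deployment-2"}},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "e2e-test-deployment-2"}},
        "template": {
            "metadata": {"labels": {"app": "e2e-test-deployment-2"}},
            "spec": {
                "containers": [{
                    "name": "sleep",
                    "image": "busybox",
                    "args": ["/bin/sh", "-c", "while true; do date; sleep 5; done"],
                    "volumeMounts": [{"name": "pod-data", "mountPath": "/data"}],
                }],
                "volumes": [{
                    "name": "pod-data",
                    "persistentVolumeClaim": {"claimName": "e2e-test-claim-2"},
                }],
            },
        },
    },
}

# Create the PVC first, then the Deployment that mounts it at /data.
client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)
client.AppsV1Api().create_namespaced_deployment("default", deployment)
```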

In https://ci.longhorn.io/job/private/job/longhorn-e2e-test/557/, after the nodes were rebooted in steps 8 ~ 10, the pod of the deployment with the rwo strict-local volume got stuck in ContainerStatusUnknown indefinitely:

# kubectl get pods -owide
NAME                                     READY   STATUS                   RESTARTS        AGE     IP            NODE            NOMINATED NODE   READINESS GATES
longhorn-test-minio-f4bbdc54d-lhddt      1/1     Running                  0               5h27m   10.42.4.130   ip-10-0-2-198   <none>           <none>
e2e-test-deployment-2-7ddccb49f4-5fxcp   0/1     ContainerStatusUnknown   0               5h26m   <none>        ip-10-0-2-198   <none>           <none>
e2e-test-deployment-2-7ddccb49f4-lx7gv   1/1     Running                  0               5h24m   10.42.4.136   ip-10-0-2-198   <none>           <none>
e2e-test-deployment-0-b957b9f54-x7t82    1/1     Running                  0               5h18m   10.42.1.146   ip-10-0-2-53    <none>           <none>
e2e-test-deployment-1-dbc678584-8sq9k    1/1     Running                  0               5h18m   10.42.1.147   ip-10-0-2-53    <none>           <none>
e2e-test-statefulset-1-0                 1/1     Running                  0               5h15m   10.42.3.194   ip-10-0-2-78    <none>           <none>
e2e-test-statefulset-0-0                 1/1     Running                  0               5h15m   10.42.3.196   ip-10-0-2-78    <none>           <none>
e2e-test-statefulset-2-0                 1/1     Running                  0               5h14m   10.42.3.198   ip-10-0-2-78    <none>           <none>
# kubectl describe pod e2e-test-deployment-2-7ddccb49f4-5fxcp 
Name:           e2e-test-deployment-2-7ddccb49f4-5fxcp
Namespace:      default
Priority:       0
Node:           ip-10-0-2-198/10.0.2.198
Start Time:     Sat, 11 May 2024 21:05:04 +0000
Labels:         app=e2e-test-deployment-2
                pod-template-hash=7ddccb49f4
                test.longhorn.io=e2e
Annotations:    <none>
Status:         Failed
Reason:         Evicted
Message:        The node was low on resource: ephemeral-storage. Threshold quantity: 2145752505, available: 1470928Ki. 
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/e2e-test-deployment-2-7ddccb49f4
Containers:
  sleep:
    Container ID:  
    Image:         busybox
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      while true;do date;sleep 5; done
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from pod-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-t8j6t (ro)
Conditions:
  Type                        Status
  DisruptionTarget            True 
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  pod-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  e2e-test-claim-2
    ReadOnly:   false
  kube-api-access-t8j6t:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:                      <none>
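
For context, the numbers in the eviction message already explain why the pod was terminated: the node's available ephemeral storage had dropped below the kubelet's eviction threshold. A quick unit conversion of the values above (plain arithmetic, not additional data from the support bundle):

```python
# Convert the figures from the kubelet eviction message to GiB.
threshold_bytes = 2145752505              # "Threshold quantity" (bytes)
available_bytes = 1470928 * 1024          # "available: 1470928Ki"

print(threshold_bytes / 1024**3)          # ~2.00 GiB eviction threshold
print(available_bytes / 1024**3)          # ~1.40 GiB actually available
print(available_bytes < threshold_bytes)  # True -> node under ephemeral-storage
                                          # pressure, kubelet starts evicting pods
```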

(Screenshots attached: unknown_pod1, unknown_pod2)

To Reproduce

Running negative test case Power Off Node One By Once For More Than Pod Eviction Timeout While Workload Heavy Writing repeatedly.

Expected behavior

All deployment and statefulset pods return to Running after each node is powered back on, and their volumes remain readable and writable.

Support bundle for troubleshooting

supportbundle_26ac5094-4709-4ab5-bb99-867e4b13cb8f_2024-05-12T02-28-45Z.zip

Environment

  • Longhorn version: v1.6.2-rc1
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.29.3+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: sles 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): aws
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu yangchiu added kind/bug reproduce/rare < 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) area/resilience System or volume resilience labels May 12, 2024
@yangchiu yangchiu added this to the v1.6.2 milestone May 12, 2024
@derekbit (Member)

cc @c3y1huang

@derekbit derekbit modified the milestones: v1.6.2, v1.6.3 May 14, 2024
@derekbit derekbit changed the title [BUG][v1.6.2-rc1] Workload pod got stuck in ContainerStatusUnknown after node shutdown and reboot [BUG][v1.6.3-rc1] Workload pod got stuck in ContainerStatusUnknown after node shutdown and reboot May 14, 2024
@derekbit derekbit changed the title [BUG][v1.6.3-rc1] Workload pod got stuck in ContainerStatusUnknown after node shutdown and reboot [BUG][v1.6.2-rc1] Workload pod got stuck in ContainerStatusUnknown after node shutdown and reboot May 14, 2024
@derekbit derekbit modified the milestones: v1.6.3, v1.6.2 May 14, 2024
@c3y1huang (Contributor) commented May 14, 2024

The node was low on resource: ephemeral-storage. Threshold quantity: 2145752505, available: 1470928Ki.

# Deployment
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
    creationTimestamp: "2024-05-11T09:27:32Z"
    generation: 1
    labels:
      app: e2e-test-deployment-2
      test.longhorn.io: e2e
    managedFields:
    - apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:labels:
            .: {}
            f:app: {}
            f:test.longhorn.io: {}
        f:spec:
          f:progressDeadlineSeconds: {}
          f:replicas: {}
          f:revisionHistoryLimit: {}
          f:selector: {}
          f:strategy:
            f:rollingUpdate:
              .: {}
              f:maxSurge: {}
              f:maxUnavailable: {}
            f:type: {}
          f:template:
            f:metadata:
              f:labels:
                .: {}
                f:app: {}
                f:test.longhorn.io: {}
            f:spec:
              f:containers:
                k:{"name":"sleep"}:
                  .: {}
                  f:args: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:name: {}
                  f:resources: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
                  f:volumeMounts:
                    .: {}
                    k:{"mountPath":"/data"}:
                      .: {}
                      f:mountPath: {}
                      f:name: {}
              f:dnsPolicy: {}
              f:restartPolicy: {}
              f:schedulerName: {}
              f:securityContext: {}
              f:terminationGracePeriodSeconds: {}
              f:volumes:
                .: {}
                k:{"name":"pod-data"}:
                  .: {}
                  f:name: {}
                  f:persistentVolumeClaim:
                    .: {}
                    f:claimName: {}
      manager: OpenAPI-Generator
      operation: Update
      time: "2024-05-11T09:27:32Z"
    - apiVersion: apps/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:deployment.kubernetes.io/revision: {}
        f:status:
          f:availableReplicas: {}
          f:conditions:
            .: {}
            k:{"type":"Available"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Progressing"}:
              .: {}
              f:lastTransitionTime: {}
              f:lastUpdateTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:observedGeneration: {}
          f:readyReplicas: {}
          f:replicas: {}
          f:updatedReplicas: {}
      manager: k3s
      operation: Update
      subresource: status
      time: "2024-05-11T21:12:47Z"
    name: e2e-test-deployment-2
    namespace: default
    resourceVersion: "176770"
    uid: 1ab4cb23-2115-4cba-a3cb-51abaa549208
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: e2e-test-deployment-2
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        creationTimestamp: "null"
        labels:
          app: e2e-test-deployment-2
          test.longhorn.io: e2e
      spec:
        containers:
        - args:
          - /bin/sh
          - -c
          - while true;do date;sleep 5; done
          image: busybox
          imagePullPolicy: IfNotPresent
          name: sleep
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /data
            name: pod-data
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
        volumes:
        - name: pod-data
          persistentVolumeClaim:
            claimName: e2e-test-claim-2
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2024-05-11T09:27:32Z"
      lastUpdateTime: "2024-05-11T09:27:44Z"
      message: ReplicaSet "e2e-test-deployment-2-7ddccb49f4" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    - lastTransitionTime: "2024-05-11T21:12:47Z"
      lastUpdateTime: "2024-05-11T21:12:47Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    observedGeneration: 1
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
# Pod
- apiVersion: v1
  kind: Pod
  metadata:
    creationTimestamp: "2024-05-11T21:05:04Z"
    generateName: e2e-test-deployment-2-7ddccb49f4-
    labels:
      app: e2e-test-deployment-2
      pod-template-hash: 7ddccb49f4
      test.longhorn.io: e2e
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:generateName: {}
          f:labels:
            .: {}
            f:app: {}
            f:pod-template-hash: {}
            f:test.longhorn.io: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"e46892f6-02e1-4298-be3f-18d6ed4be19d"}: {}
        f:spec:
          f:containers:
            k:{"name":"sleep"}:
              .: {}
              f:args: {}
              f:image: {}
              f:imagePullPolicy: {}
              f:name: {}
              f:resources: {}
              f:terminationMessagePath: {}
              f:terminationMessagePolicy: {}
              f:volumeMounts:
                .: {}
                k:{"mountPath":"/data"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
          f:dnsPolicy: {}
          f:enableServiceLinks: {}
          f:restartPolicy: {}
          f:schedulerName: {}
          f:securityContext: {}
          f:terminationGracePeriodSeconds: {}
          f:volumes:
            .: {}
            k:{"name":"pod-data"}:
              .: {}
              f:name: {}
              f:persistentVolumeClaim:
                .: {}
                f:claimName: {}
      manager: k3s
      operation: Update
      time: "2024-05-11T21:05:04Z"
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:conditions:
            k:{"type":"ContainersReady"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"DisruptionTarget"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Initialized"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"PodReadyToStartContainers"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"Ready"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:containerStatuses: {}
          f:hostIP: {}
          f:hostIPs: {}
          f:message: {}
          f:phase: {}
          f:reason: {}
          f:startTime: {}
      manager: k3s
      operation: Update
      subresource: status
      time: "2024-05-11T21:07:00Z"
    name: e2e-test-deployment-2-7ddccb49f4-5fxcp
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: e2e-test-deployment-2-7ddccb49f4
      uid: e46892f6-02e1-4298-be3f-18d6ed4be19d
    resourceVersion: "174844"
    uid: 3c1630b9-d41d-4a64-a812-ba2d4b4d15e4
  spec:
    containers:
    - args:
      - /bin/sh
      - -c
      - while true;do date;sleep 5; done
      image: busybox
      imagePullPolicy: IfNotPresent
      name: sleep
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /data
        name: pod-data
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-t8j6t
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: ip-10-0-2-198
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: e2e-test-claim-2
    - name: kube-api-access-t8j6t
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:06:59Z"
      message: 'The node was low on resource: ephemeral-storage. Threshold quantity:
        2145752505, available: 1470928Ki. '
      reason: TerminationByKubelet
      status: "True"
      type: DisruptionTarget
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:05:04Z"
      status: "False"
      type: PodReadyToStartContainers
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:05:04Z"
      status: "True"
      type: Initialized
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:05:04Z"
      reason: PodFailed
      status: "False"
      type: Ready
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:05:04Z"
      reason: PodFailed
      status: "False"
      type: ContainersReady
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:05:04Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - image: busybox
      imageID: "null"
      lastState: {}
      name: sleep
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          exitCode: 137
          finishedAt: null
          message: The container could not be located when the pod was terminated
          reason: ContainerStatusUnknown
          startedAt: "null"
    hostIP: 10.0.2.198
    hostIPs:
    - ip: 10.0.2.198
    message: 'The node was low on resource: ephemeral-storage. Threshold quantity:
      2145752505, available: 1470928Ki. '
    phase: Failed
    qosClass: BestEffort
    reason: Evicted
    startTime: "2024-05-11T21:05:04Z"
- apiVersion: v1
  kind: Pod
  metadata:
    creationTimestamp: "2024-05-11T21:06:59Z"
    generateName: e2e-test-deployment-2-7ddccb49f4-
    labels:
      app: e2e-test-deployment-2
      pod-template-hash: 7ddccb49f4
      test.longhorn.io: e2e
    managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:generateName: {}
          f:labels:
            .: {}
            f:app: {}
            f:pod-template-hash: {}
            f:test.longhorn.io: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"e46892f6-02e1-4298-be3f-18d6ed4be19d"}: {}
        f:spec:
          f:containers:
            k:{"name":"sleep"}:
              .: {}
              f:args: {}
              f:image: {}
              f:imagePullPolicy: {}
              f:name: {}
              f:resources: {}
              f:terminationMessagePath: {}
              f:terminationMessagePolicy: {}
              f:volumeMounts:
                .: {}
                k:{"mountPath":"/data"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
          f:dnsPolicy: {}
          f:enableServiceLinks: {}
          f:restartPolicy: {}
          f:schedulerName: {}
          f:securityContext: {}
          f:terminationGracePeriodSeconds: {}
          f:volumes:
            .: {}
            k:{"name":"pod-data"}:
              .: {}
              f:name: {}
              f:persistentVolumeClaim:
                .: {}
                f:claimName: {}
      manager: k3s
      operation: Update
      time: "2024-05-11T21:06:59Z"
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:conditions:
            .: {}
            k:{"type":"ContainersReady"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"Initialized"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"PodReadyToStartContainers"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"PodScheduled"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Ready"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
          f:containerStatuses: {}
          f:hostIP: {}
          f:hostIPs: {}
          f:phase: {}
          f:podIP: {}
          f:podIPs:
            .: {}
            k:{"ip":"10.42.4.136"}:
              .: {}
              f:ip: {}
          f:startTime: {}
      manager: k3s
      operation: Update
      subresource: status
      time: "2024-05-11T21:12:47Z"
    name: e2e-test-deployment-2-7ddccb49f4-lx7gv
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: e2e-test-deployment-2-7ddccb49f4
      uid: e46892f6-02e1-4298-be3f-18d6ed4be19d
    resourceVersion: "176768"
    uid: a9705a91-e439-4f28-b079-ba23c0372965
  spec:
    containers:
    - args:
      - /bin/sh
      - -c
      - while true;do date;sleep 5; done
      image: busybox
      imagePullPolicy: IfNotPresent
      name: sleep
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /data
        name: pod-data
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-6l5qb
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: ip-10-0-2-198
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: e2e-test-claim-2
    - name: kube-api-access-6l5qb
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:12:47Z"
      status: "True"
      type: PodReadyToStartContainers
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:12:35Z"
      status: "True"
      type: Initialized
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:12:47Z"
      status: "True"
      type: Ready
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:12:47Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: "null"
      lastTransitionTime: "2024-05-11T21:12:35Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://76068cd483578d623459b5bdd3af78c3eb5cabaa34067fc6e54a534737f32ca0
      image: docker.io/library/busybox:latest
      imageID: docker.io/library/busybox@sha256:5eef5ed34e1e1ff0a4ae850395cbf665c4de6b4b83a32a0bc7bcb998e24e7bbb
      lastState: {}
      name: sleep
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2024-05-11T21:12:47Z"
    hostIP: 10.0.2.198
    hostIPs:
    - ip: 10.0.2.198
    phase: Running
    podIP: 10.42.4.136
    podIPs:
    - ip: 10.42.4.136
    qosClass: BestEffort
    startTime: "2024-05-11T21:12:35Z"
# Node
  status:
    addresses:
    - address: 10.0.2.198
      type: InternalIP
    - address: ip-10-0-2-198
      type: Hostname
    conditions:
    - lastHeartbeatTime: "2024-05-12T02:24:37Z"
      lastTransitionTime: "2024-05-11T21:12:35Z"
      message: kubelet has no disk pressure
      reason: KubeletHasNoDiskPressure
      status: "False"
      type: DiskPressure

@c3y1huang (Contributor) commented May 14, 2024

#8550 (comment)

  1. The test was paused while hanging on the error: Waiting for e2e-test-deployment-2 pods ['e2e-test-deployment-2-7ddccb49f4-5fxcp'] stable, retry (58495).
  2. The test created the deployment e2e-test-deployment-2 with a single ReplicaSet (replicas: 1).
  3. During the test, the kubelet detected low ephemeral-storage on node ip-10-0-2-198 and initiated eviction of pod e2e-test-deployment-2-7ddccb49f4-5fxcp:
          reason: TerminationByKubelet
          status: "True"
          type: DisruptionTarget

  4. A new deployment pod e2e-test-deployment-2-7ddccb49f4-lx7gv was then created and reached the Running state:
    e2e-test-deployment-2-7ddccb49f4-5fxcp   0/1     ContainerStatusUnknown   0               5h26m   <none>        ip-10-0-2-198   <none>           <none>
    e2e-test-deployment-2-7ddccb49f4-lx7gv   1/1     Running                  0               5h24m   10.42.4.136   ip-10-0-2-198   <none>           <none>

  5. However, despite the new pod being successfully deployed and running, the kubelet failed to clean up the evicted pod (e2e-test-deployment-2-7ddccb49f4-5fxcp).
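
As a hedged illustration (not taken from the support bundle), leftover evicted pods like this one can be located, and if necessary removed manually, with the Kubernetes Python client:

```python
# Hypothetical helper: list (and optionally delete) Failed pods that the kubelet
# evicted but never garbage-collected, e.g. e2e-test-deployment-2-7ddccb49f4-5fxcp.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

failed = v1.list_namespaced_pod("default", field_selector="status.phase=Failed")
for pod in failed.items:
    if pod.status.reason == "Evicted":
        print(f"leftover evicted pod: {pod.metadata.name} ({pod.status.message})")
        # Manual cleanup, since the kubelet did not remove it:
        # v1.delete_namespaced_pod(pod.metadata.name, "default")
```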

This doesn't look like a Longhorn bug: the deployment pod eviction was initiated by the kubelet, and the kubelet is expected to handle the cleanup as well.

There's an ongoing upstream discussion that seems to be related: kubernetes/kubernetes#122160

We could consider enhancing the test case to filter out pods terminated by the kubelet.
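
For example, a minimal sketch of such a filter is shown below. The function name and the exact conditions are illustrative, not the code that eventually landed in longhorn/longhorn-tests#1902:

```python
# Illustrative filter for the e2e framework: when collecting workload pods,
# skip pods the kubelet has already terminated or evicted so they don't block
# the "pods stable" wait. Not the actual longhorn-tests implementation.
def filter_kubelet_terminated_pods(pods):
    alive = []
    for pod in pods:
        # Skip pods the kubelet evicted (e.g. due to ephemeral-storage pressure).
        if pod.status.phase == "Failed" and pod.status.reason == "Evicted":
            continue
        # Skip pods whose container was lost across the node reboot.
        statuses = pod.status.container_statuses or []
        if any(s.state and s.state.terminated
               and s.state.terminated.reason == "ContainerStatusUnknown"
               for s in statuses):
            continue
        alive.append(pod)
    return alive
```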

cc @derekbit @yangchiu @innobead

@longhorn-io-github-bot commented May 14, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

  • Which areas/issues this PR might have potential impacts on?
    Area test
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at fix(robot): skip kubelet terminated pod in get_workload_pods longhorn-tests#1902
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@derekbit derekbit added the area/upstream Upstream related like tgt upstream library label May 15, 2024
@c3y1huang (Contributor)

Closing as this has been tested. longhorn/longhorn-tests#1902 (review)
