
Troubleshooting pods
alculquicondor committed Mar 25, 2024
1 parent 5023dd2 commit 1b662e4
Showing 2 changed files with 106 additions and 4 deletions.
6 changes: 2 additions & 4 deletions site/content/en/docs/tasks/_index.md
Expand Up @@ -54,7 +54,5 @@ As a platform developer, you can learn how to:

## Troubleshooting

-Sometimes things go wrong. The following guides can help you understand the state
-of the system.
-
-- [Troubleshooting Jobs](troubleshooting/troubleshooting_jobs)
+Sometimes things go wrong.
+You can follow the [Troubleshooting guides](troubleshooting) to understand the state of the system.
104 changes: 104 additions & 0 deletions site/content/en/docs/tasks/troubleshooting/troubleshooting_pods.md
@@ -0,0 +1,104 @@
---
title: "Troubleshooting Pods"
date: 2024-03-21
weight: 1
description: >
Troubleshooting the status of a Pod or group of Pods
---

This doc is about troubleshooting [plain Pods](/docs/tasks/run/plain_pods/) that are directly managed by Kueue,
in other words, Pods that are not managed by Kubernetes Jobs or other supported CRDs.

{{% alert title="Note" color="primary" %}}
This doc focuses on the ways in which Kueue's behavior when managing Pods differs from other job integrations.
You can read [Troubleshooting Jobs](troubleshooting_jobs) for more general troubleshooting steps.
{{% /alert %}}

## Is my Pod managed directly by Kueue?

Kueue adds the label `kueue.x-k8s.io/managed` with value `true` to Pods that it manages.
If the label is not present on a Pod, Kueue will not admit the Pod or account for its
resource usage directly.
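
To check for the label, you can query it directly, for example with `jsonpath` (the Pod and
namespace names below are placeholders):

```bash
# Prints "true" if Kueue manages this Pod; empty output means the label is absent.
kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{.metadata.labels.kueue\.x-k8s\.io/managed}'
```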

A Pod might not have the `kueue.x-k8s.io/managed` label due to one of the following reasons
(the commands after this list can help you check the last two):

1. The [Pod integration is disabled](/docs/tasks/run/plain_pods/#before-you-begin).
2. The Pod belongs to a namespace or has labels that don't satisfy the requirements of
the [`podOptions`](/docs/reference/kueue-config.v1beta1/#PodIntegrationOptions) configured for the Pod integration.
3. The Pod is owned by a Job or equivalent CRD that is managed by Kueue.
4. The Pod doesn't have a `kueue.x-k8s.io/queue-name` label and [`manageJobsWithoutQueueName`](/docs/reference/kueue-config.v1beta1/#Configuration)
is set to `false`.
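
For example, with placeholder names, you can check reasons 3 and 4 as follows:

```bash
# Reason 3: list the kinds of the Pod's owners; a Job or a supported CRD here
# means that Kueue manages the owner rather than the Pod itself.
kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{.metadata.ownerReferences[*].kind}'

# Reason 4: empty output means the Pod has no queue-name label.
kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{.metadata.labels.kueue\.x-k8s\.io/queue-name}'
```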

## Identifying the Workload for your Pod

When using [Pod groups](/docs/tasks/run/plain_pods/#running-a-group-of-pods-to-be-admitted-together),
the name of the Workload matches the value of the label `kueue.x-k8s.io/pod-group-name`.

When using [single Pods](/docs/tasks/run/plain_pods/#running-a-single-pod-admitted-by-kueue), you can identify the corresponding
Workload by following the guide for [Identifying the Workload of a Job](troubleshooting_jobs/#identifying-the-workload-for-your-job).
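
For a Pod group, a minimal sketch (reusing the placeholder names from above):

```bash
# Read the group name from the Pod's label...
GROUP=$(kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{.metadata.labels.kueue\.x-k8s\.io/pod-group-name}')
# ...and fetch the Workload that carries the same name.
kubectl get workload "$GROUP" -n my-namespace
```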

## Why doesn't a Workload exist for my Pod group?

Before creating a Workload object, Kueue expects all the Pods for the group to be created.
The Pods should all have the same value for the label `kueue.x-k8s.io/pod-group-name` and
the number of Pods should be equal to the value of the annotation `kueue.x-k8s.io/pod-group-total-count`.
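
You can verify both conditions with commands like the following (the group, Pod and
namespace names are placeholders):

```bash
# Number of Pods currently carrying the group label...
kubectl get pods -n my-namespace \
  -l kueue.x-k8s.io/pod-group-name=my-pod-group --no-headers | wc -l
# ...which should match the expected total count annotation.
kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{.metadata.annotations.kueue\.x-k8s\.io/pod-group-total-count}'
```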

You can run the following command to check whether Kueue has created a Workload
for the Pod:

```bash
kubectl describe pod my-pod -n my-namespace
```

If Kueue didn't create the Workload object, you will see an output similar to the following:

```
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ErrWorkloadCompose 14s pod-kueue-controller 'my-pod-group' group has fewer runnable pods than expected
```

{{% alert title="Note" color="primary" %}}
The above event might show up for the first Pod that Kueue observes, and it will remain
even if Kueue successfully creates the Workload for the Pod group later.
{{% /alert %}}

Once Kueue observes all the Pods for the group, you will see an output similar to the following:

```
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CreatedWorkload 14s pod-kueue-controller Created Workload: my-namespace/my-pod-group
```
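
At that point, you can inspect the Workload itself, using the name from the event above:

```bash
# Shows the Workload's spec, admission status and conditions.
kubectl describe workload my-pod-group -n my-namespace
```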

## Why did my Pod disappear?

When you enable [preemption](/docs/concepts/cluster_queue/#preemption), Kueue might preempt Pods
to accommodate higher priority jobs or to reclaim quota. Preemption is implemented via `DELETE` calls,
the standard way of terminating a Pod in Kubernetes.
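
If you suspect preemption, you can look for recent events referencing the Pod (note that
Kubernetes only retains events for a limited time; the names below are placeholders):

```bash
kubectl get events -n my-namespace --field-selector involvedObject.name=my-pod
```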

When using single Pods, Kubernetes deletes the Workload object along with the Pod, as there is
nothing else holding ownership of it.

Upon preemption, Kueue typically doesn't fully delete the Pods in a Pod group. See the next question
to understand the deletion mechanics for Pods in a Pod group.

## Why aren't Pods in a Pod group deleted when Failed or Succeeded?

When using Pod groups, Kueue keeps a [finalizer](https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/)
`kueue.x-k8s.io/managed` to prevent Pods from being deleted and to be able to track the progress of the group.
You should not modify finalizers manually.
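
You can see the finalizer on any Pod of the group, for example:

```bash
# Expect ["kueue.x-k8s.io/managed"] while Kueue is still tracking the group.
kubectl get pod my-pod -n my-namespace -o jsonpath='{.metadata.finalizers}'
```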

Kueue will remove the finalizer from Pods when:
- The group satisfies the [termination](/docs/tasks/run/plain_pods/#termination) criteria, for example,
when all Pods terminate successfully.
- For Failed Pods, when Kueue observes a replacement Pod.
- You delete the Workload object.
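
For the last case, a minimal sketch (Workload name assumed from the previous examples):

```bash
# Deleting the Workload makes Kueue remove its finalizer from the remaining Pods.
kubectl delete workload my-pod-group -n my-namespace
```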

Once a Pod has no finalizers left, Kubernetes deletes it based on:
- Whether a user or a controller has issued a Pod deletion.
- The [Pod garbage collector](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
