From ba4a3cdc9cb95f475160037a065d37a48ef219f6 Mon Sep 17 00:00:00 2001
From: Aldo Culquicondor <acondor@google.com>
Date: Wed, 20 Mar 2024 13:25:31 +0000
Subject: [PATCH] Troubleshooting: checking whether job is admitted or
 preempted

Change-Id: Ia1aefd5bf9b0669d24113688cd0f2f5cde433e9d
---
 site/content/en/docs/tasks/_index.md          |   7 +
 .../en/docs/tasks/troubleshooting/_index.md   |   8 +
 .../troubleshooting/troubleshooting_jobs.md   | 186 ++++++++++++++++++
 3 files changed, 201 insertions(+)
 create mode 100644 site/content/en/docs/tasks/troubleshooting/_index.md
 create mode 100644 site/content/en/docs/tasks/troubleshooting/troubleshooting_jobs.md
diff --git a/site/content/en/docs/tasks/_index.md b/site/content/en/docs/tasks/_index.md
index 4d5267b023..8b753b6856 100755
--- a/site/content/en/docs/tasks/_index.md
+++ b/site/content/en/docs/tasks/_index.md
@@ -50,3 +50,10 @@ A _platform developer_ integrates Kueue with other software and/or contributes t
 As a platform developer, you can learn how to:
 - [Integrate a custom Job with Kueue](/docs/tasks/integrate_a_custom_job).
 - [Enable pprof endpoints](/docs/tasks/enabling_pprof_endpoints).
+
+## Troubleshooting
+
+Sometimes things go wrong. The following guides can help you understand the state
+of the system.
+
+- [Troubleshooting Jobs](troubleshooting/troubleshooting_jobs)
diff --git a/site/content/en/docs/tasks/troubleshooting/_index.md b/site/content/en/docs/tasks/troubleshooting/_index.md
new file mode 100644
index 0000000000..5d14575725
--- /dev/null
+++ b/site/content/en/docs/tasks/troubleshooting/_index.md
@@ -0,0 +1,8 @@
+---
+title: "Troubleshooting"
+weight: 10
+date: 2023-08-23
+description: >
+  Sometimes things go wrong. The following guides can help you understand the state of the system.
+no_list: false
+---
diff --git a/site/content/en/docs/tasks/troubleshooting/troubleshooting_jobs.md b/site/content/en/docs/tasks/troubleshooting/troubleshooting_jobs.md
new file mode 100644
index 0000000000..a462c3e067
--- /dev/null
+++ b/site/content/en/docs/tasks/troubleshooting/troubleshooting_jobs.md
@@ -0,0 +1,186 @@
+---
+title: "Troubleshooting Jobs"
+date: 2024-03-23
+weight: 1
+description: >
+  Troubleshooting the status of a Job
+---
+
+This doc is about troubleshooting pending Kubernetes Jobs. However, most of the ideas can be extrapolated
+to other supported CRDs.
+
+## Identifying the Workload for your Job
+
+For each Job (a Kubernetes Job or a CRD), Kueue creates a [Workload](/docs/concepts/workload) object to hold the
+information about the admission of the Job. The Workload object allows Kueue to make admission decisions without
+knowing the specifics of each CRD.
+
+There are multiple ways to find the Workload for a Job. In the following examples, let's assume your
+Job is called `my-job` in the `my-namespace` namespace.
+
+1. You can obtain the Workload name from the Job events by running the following command:
+
+   ```shell
+   kubectl describe job -n my-namespace my-job
+   ```
+
+   The relevant event will look like the following:
+
+   ```
+     Normal  CreatedWorkload   24s   batch/job-kueue-controller  Created Workload: my-namespace/job-my-job-19797
+   ```
+
+2. Kueue includes the UID of the source Job in the label `kueue.x-k8s.io/job-uid`.
+   You can obtain the Workload name with the following commands:
+
+   ```shell
+   JOB_UID=$(kubectl get job -n my-namespace my-job -o jsonpath='{.metadata.uid}')
+   kubectl get workloads -n my-namespace -l "kueue.x-k8s.io/job-uid=$JOB_UID"
+   ```
+
+   The output looks like the following:
+
+   ```
+   NAME               QUEUE        ADMITTED BY     AGE
+   job-my-job-19797   user-queue   cluster-queue   9m45s
+   ```
+
+3. You can list all of the Workloads in the same namespace as your Job and identify the one
+   that matches the format `<api-name>-<job-name>-<hash>`.
+   The command may look like the following:
+
+   ```shell
+   kubectl get workloads -n my-namespace | grep job-my-job
+   ```
+
+   The output looks like the following:
+
+   ```
+   NAME               QUEUE        ADMITTED BY     AGE
+   job-my-job-19797   user-queue   cluster-queue   9m45s
+   ```
+
+## Is my Job running?
+
+To know whether your Job is running, check the value of the `.spec.suspend` field by
+running the following command:
+
+```shell
+kubectl get job -n my-namespace my-job -o jsonpath='{.spec.suspend}'
+```
+
+If your Job is running, the output will be `false`.
+
+## Is my Job admitted?
+
+If your Job is not running, you should first check whether Kueue has admitted the Workload.
+
+To know whether a Job was admitted, is still pending, or was not yet attempted for admission,
+start by looking at the Workload status.
+
+Run the following command to obtain the full status of a Workload:
+
+```shell
+kubectl get workload -n my-namespace my-workload -o yaml
+```
+
+### Admitted Workload
+
+If your Job is admitted, the Workload should have a status similar to the following:
+
+```yaml
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: Workload
+...
+status:
+  admission:
+    clusterQueue: cluster-queue
+    podSetAssignments:
+    - count: 3
+      flavors:
+        cpu: default-flavor
+        memory: default-flavor
+      name: main
+      resourceUsage:
+        cpu: "3"
+        memory: 600Mi
+  conditions:
+  - lastTransitionTime: "2024-03-19T20:49:17Z"
+    message: Quota reserved in ClusterQueue cluster-queue
+    reason: QuotaReserved
+    status: "True"
+    type: QuotaReserved
+  - lastTransitionTime: "2024-03-19T20:49:17Z"
+    message: The workload is admitted
+    reason: Admitted
+    status: "True"
+    type: Admitted
+```
+
+### Pending Workload
+
+If Kueue has attempted to admit the Workload, but failed to do so due to lack of quota,
+the Workload should have a status similar to the following:
+
+```yaml
+status:
+  conditions:
+  - lastTransitionTime: "2024-03-21T13:43:00Z"
+    message: 'couldn''t assign flavors to pod set main: insufficient quota for cpu
+      in flavor default-flavor in ClusterQueue'
+    reason: Pending
+    status: "False"
+    type: QuotaReserved
+```
+
+### Unattempted Workload
+
+When using a [ClusterQueue](/docs/concepts/cluster_queue) with the `StrictFIFO`
+[`queueingStrategy`](/docs/concepts/cluster_queue/#queueing-strategy), Kueue only attempts
+to admit the head of each ClusterQueue. As a result, if Kueue didn't attempt to admit
+a Workload, the Workload status would not contain any condition.
+
+### Misconfigured LocalQueues or ClusterQueues
+
+If your Job references a LocalQueue that doesn't exist, or if the LocalQueue or ClusterQueue
+that it references is misconfigured, the Workload status would look like the following:
+
+```yaml
+status:
+  conditions:
+  - lastTransitionTime: "2024-03-21T13:55:21Z"
+    message: LocalQueue user-queue doesn't exist
+    reason: Inadmissible
+    status: "False"
+    type: QuotaReserved
+```
+
+## Is my Job preempted?
+
+If your Job is not running, and your ClusterQueues have [preemption](/docs/concepts/cluster_queue/#preemption) enabled,
+you should check whether Kueue preempted the Workload. A preempted Workload has a status
+similar to the following:
+
+```yaml
+status:
+  conditions:
+  - lastTransitionTime: "2024-03-21T15:49:56Z"
+    message: 'couldn''t assign flavors to pod set main: insufficient unused quota
+      for cpu in flavor default-flavor, 9 more needed'
+    reason: Pending
+    status: "False"
+    type: QuotaReserved
+  - lastTransitionTime: "2024-03-21T15:49:55Z"
+    message: Preempted to accommodate a higher priority Workload
+    reason: Preempted
+    status: "True"
+    type: Evicted
+  - lastTransitionTime: "2024-03-21T15:49:56Z"
+    message: The workload has no reservation
+    reason: NoReservation
+    status: "False"
+    type: Admitted
+```
+
+The `Evicted` condition shows that the Workload was preempted, and the `QuotaReserved` condition with `status: "False"`
+shows that Kueue already attempted to admit it again, unsuccessfully in this case.