From ba4a3cdc9cb95f475160037a065d37a48ef219f6 Mon Sep 17 00:00:00 2001 From: Aldo Culquicondor Date: Wed, 20 Mar 2024 13:25:31 +0000 Subject: [PATCH] Troubleshooting: checking whether job is admitted or preempted Change-Id: Ia1aefd5bf9b0669d24113688cd0f2f5cde433e9d --- site/content/en/docs/tasks/_index.md | 7 + .../en/docs/tasks/troubleshooting/_index.md | 8 + .../troubleshooting/troubleshooting_jobs.md | 186 ++++++++++++++++++ 3 files changed, 201 insertions(+) create mode 100644 site/content/en/docs/tasks/troubleshooting/_index.md create mode 100644 site/content/en/docs/tasks/troubleshooting/troubleshooting_jobs.md diff --git a/site/content/en/docs/tasks/_index.md b/site/content/en/docs/tasks/_index.md index 4d5267b023..8b753b6856 100755 --- a/site/content/en/docs/tasks/_index.md +++ b/site/content/en/docs/tasks/_index.md @@ -50,3 +50,10 @@ A _platform developer_ integrates Kueue with other software and/or contributes t As a platform developer, you can learn how to: - [Integrate a custom Job with Kueue](/docs/tasks/integrate_a_custom_job). - [Enable pprof endpoints](/docs/tasks/enabling_pprof_endpoints). + +## Troubleshooting + +Sometimes things go wrong. The following guides can help you understand the state +of the system. + +- [Troubleshooting Jobs](troubleshooting/troubleshooting_jobs) diff --git a/site/content/en/docs/tasks/troubleshooting/_index.md b/site/content/en/docs/tasks/troubleshooting/_index.md new file mode 100644 index 0000000000..5d14575725 --- /dev/null +++ b/site/content/en/docs/tasks/troubleshooting/_index.md @@ -0,0 +1,8 @@ +--- +title: "Troubleshooting" +weight: 10 +date: 2023-08-23 +description: > + Sometimes things go wrong. The following guides can help you understand the state of the system. 
+no_list: false +--- diff --git a/site/content/en/docs/tasks/troubleshooting/troubleshooting_jobs.md b/site/content/en/docs/tasks/troubleshooting/troubleshooting_jobs.md new file mode 100644 index 0000000000..a462c3e067 --- /dev/null +++ b/site/content/en/docs/tasks/troubleshooting/troubleshooting_jobs.md @@ -0,0 +1,186 @@ +--- +title: "Troubleshooting Jobs" +date: 2024-03-23 +weight: 1 +description: > + Troubleshooting the status of a Job +--- + +This doc is about troubleshooting pending Kubernetes Jobs; however, most of the ideas can be extrapolated +to other supported CRDs. + +## Identifying the Workload for your Job + +For each Job (a Kubernetes Job or a CRD), Kueue creates a [Workload](/docs/concepts/workload) object to hold the +information about the admission of the Job. The Workload object allows Kueue to make admission decisions without +knowing the specifics of each CRD. + +There are multiple ways to find the Workload for a Job. In the following examples, let's assume your +Job is called `my-job` in the `my-namespace` namespace. + +1. You can obtain the Workload name from the Job events by running the following command: + + ``` + kubectl describe job -n my-namespace my-job + ``` + + The relevant event will look like the following: + + ``` + Normal CreatedWorkload 24s batch/job-kueue-controller Created Workload: my-namespace/job-my-job-19797 + ``` + +2. Kueue includes the UID of the source Job in the label `kueue.x-k8s.io/job-uid`. + You can obtain the workload name with the following commands: + + ``` + JOB_UID=$(kubectl get job -n my-namespace my-job -o jsonpath='{.metadata.uid}') + kubectl get workloads -n my-namespace -l "kueue.x-k8s.io/job-uid=$JOB_UID" + ``` + + The output looks like the following: + + ``` + NAME QUEUE ADMITTED BY AGE + job-my-job-19797 user-queue cluster-queue 9m45s + ``` + +3. You can list all of the workloads in the same namespace as your Job and identify the one + that matches the format `--`. 
+ The command may look like the following: + + ``` + kubectl get workloads -n my-namespace | grep job-my-job + ``` + + The output looks like the following: + + ``` + NAME QUEUE ADMITTED BY AGE + job-my-job-19797 user-queue cluster-queue 9m45s + ``` + +## Is my Job running? + +To know whether your Job is running, look for the value of the `.spec.suspend` field by +running the following command: + +``` +kubectl get job -n my-namespace my-job -o jsonpath='{.spec.suspend}' +``` + +If your Job is running, the output will be `false`. + +## Is my Job admitted? + +If your Job is not running, you should first check whether Kueue has admitted the Workload. + +To determine whether a Job was admitted, is pending, or has not yet been attempted +for admission, start by looking at the Workload status. + +Run the following command to obtain the full status of a Workload: + +``` +kubectl get workload -n my-namespace my-workload -o yaml +``` + +### Admitted Workload + +If your Job is admitted, the Workload should have a status similar to the following: + +```yaml +apiVersion: kueue.x-k8s.io/v1beta1 +kind: Workload +... 
+status: + admission: + clusterQueue: cluster-queue + podSetAssignments: + - count: 3 + flavors: + cpu: default-flavor + memory: default-flavor + name: main + resourceUsage: + cpu: "3" + memory: 600Mi + conditions: + - lastTransitionTime: "2024-03-19T20:49:17Z" + message: Quota reserved in ClusterQueue cluster-queue + reason: QuotaReserved + status: "True" + type: QuotaReserved + - lastTransitionTime: "2024-03-19T20:49:17Z" + message: The workload is admitted + reason: Admitted + status: "True" + type: Admitted +``` + +### Pending Workload + +If Kueue has attempted to admit the Workload, but failed to do so due to lack of quota, +the Workload should have a status similar to the following: + +```yaml +status: + conditions: + - lastTransitionTime: "2024-03-21T13:43:00Z" + message: 'couldn''t assign flavors to pod set main: insufficient quota for cpu + in flavor default-flavor in ClusterQueue' + reason: Pending + status: "False" + type: QuotaReserved +``` + +### Unattempted Workload + +When using a [ClusterQueue](/docs/concepts/cluster_queue) with the `StrictFIFO` +[`queueingStrategy`](/docs/concepts/cluster_queue/#queueing-strategy), Kueue only attempts +to admit the head of each ClusterQueue. As a result, if Kueue didn't attempt to admit +a Workload, the Workload status would not contain any condition. + +### Misconfigured LocalQueues or ClusterQueues + +If your Job references a LocalQueue that doesn't exist or the LocalQueue or ClusterQueue +that it references is misconfigured, the Workload status would look like the following: + +```yaml +status: + conditions: + - lastTransitionTime: "2024-03-21T13:55:21Z" + message: LocalQueue user-queue doesn't exist + reason: Inadmissible + status: "False" + type: QuotaReserved +``` + +## Is my Job preempted? + +If your Job is not running, and your ClusterQueues have [preemption](/docs/concepts/cluster_queue/#preemption) enabled, +you should check whether Kueue preempted the Workload. 
+ +A preempted Workload should have a status similar to the following: + +```yaml +status: + conditions: + - lastTransitionTime: "2024-03-21T15:49:56Z" + message: 'couldn''t assign flavors to pod set main: insufficient unused quota + for cpu in flavor default-flavor, 9 more needed' + reason: Pending + status: "False" + type: QuotaReserved + - lastTransitionTime: "2024-03-21T15:49:55Z" + message: Preempted to accommodate a higher priority Workload + reason: Preempted + status: "True" + type: Evicted + - lastTransitionTime: "2024-03-21T15:49:56Z" + message: The workload has no reservation + reason: NoReservation + status: "False" + type: Admitted +``` + +The `Evicted` condition shows that the Workload was preempted, and the `QuotaReserved` condition with `status: "False"` +shows that Kueue already attempted to admit it again, unsuccessfully in this case.
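
The conditions described in this guide can be condensed into a small triage helper. The following is a minimal sketch, not part of Kueue or kubectl: `classify_workload` is a hypothetical shell function that reads one `type<TAB>status<TAB>reason` line per condition on standard input and prints a one-word summary. It only recognizes the condition types and reasons shown in the examples above; anything else is reported as `unknown`.

```shell
#!/bin/sh
# classify_workload: hypothetical helper (an assumption for illustration,
# not a Kueue or kubectl command). Reads tab-separated
# "type<TAB>status<TAB>reason" lines, one per Workload condition.
classify_workload() {
  awk -F '\t' '
    $1 == "Admitted"      && $2 == "True"                       { admitted = 1 }
    $1 == "Evicted"       && $2 == "True" && $3 == "Preempted"  { preempted = 1 }
    $1 == "QuotaReserved" && $2 == "False"                      { quota_reason = $3 }
    END {
      if (admitted)                            print "admitted"
      else if (preempted)                      print "preempted"
      else if (quota_reason == "Inadmissible") print "inadmissible"
      else if (quota_reason == "Pending")      print "pending"
      else if (NR == 0)                        print "unattempted"
      else                                     print "unknown"
    }'
}

# Example usage against a cluster (my-namespace and my-workload are the
# placeholder names used throughout this guide):
#
#   kubectl get workload -n my-namespace my-workload \
#     -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}' \
#     | classify_workload
```

The checks are ordered so that an `Admitted` condition with `status: "True"` wins over any other condition, and a Workload with no conditions at all is reported as `unattempted`, matching the `StrictFIFO` case described above.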