Checking batch job status fails #456

Open
sokil opened this issue Jul 6, 2022 · 6 comments

sokil commented Jul 6, 2022

I have a batch job that performs a one-time, short-running task. A successful deployment looks like this:

2022-06-29T16:00:17Z |INFO| levant/deploy: triggering a deployment job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/deploy: evaluation e9d76b4c-8f4b-68e5-05e3-eee20a82d225 finished successfully job_id=some_nomad_job_name
2022-06-29T16:00:18Z |DEBU| levant/job_status_checker: running job status checker for job job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/job_status_checker: job has status running job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/job_status_checker: task command in allocation 124b605d-518e-6292-5cd3-8decc4d033ec now in pending state job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/job_status_checker: task command in allocation 124b605d-518e-6292-5cd3-8decc4d033ec now in running state job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/job_status_checker: all allocations in deployment of job are running job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/deploy: job deployment successful job_id=some_nomad_job_name

Today I got this error:

2022-07-06T14:57:01Z |INFO| levant/deploy: triggering a deployment job_id=some_nomad_job_name
2022-07-06T14:57:03Z |INFO| levant/deploy: evaluation ffa905f9-e937-e178-2e1a-d2b3d18ed8a8 finished successfully job_id=some_nomad_job_name
2022-07-06T14:57:03Z |DEBU| levant/job_status_checker: running job status checker for job job_id=some_nomad_job_name
2022-07-06T14:57:07Z |ERRO| levant/job_status_checker: job has status dead job_id=some_nomad_job_name
2022-07-06T14:57:07Z |ERRO| levant/deploy: job deployment failed job_id=some_nomad_job_name

In the successful deployment, the time between "levant/job_status_checker: running job status checker for job" and the first status message is 0 seconds.
In the failed one it is 4 seconds. During that time my job finished successfully and reached the status 'dead', but Levant assumes the task simply died, so it exits with a non-zero code and the CI pipeline fails.

As I see it, Levant has some problem communicating with Nomad and takes too long to fetch the job status.
Is it possible to disable the job check? Asynchronous checking of short-lived tasks may fail unexpectedly.

@DevKhaverko

I have the same problem: Levant marks the deployment as failed because it checks the job status, which can only be pending, running, or dead.
That status can't tell us whether the container (or anything else) exited successfully or not.
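
For illustration, a minimal sketch of the difference ("ns_name", "job_name", and "some_alloc_id" are placeholders): a finished batch job reports the same job-level status whether its tasks succeeded or failed, while the allocation-level ClientStatus distinguishes the two:

# Job-level status: a finished batch job shows "dead" on success and on failure alike.
nomad job status -namespace "ns_name" "job_name"

# Allocation-level status: prints "complete" on success, "failed" on failure.
nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "some_alloc_id"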

@linuxmail

Hi,

Same issue here. I have a one-shot container which creates some files and then exits 0, but the pipeline is marked as failed:

2023-01-18T14:55:03Z |INFO| levant/job_status_checker: task django-collectstatic in allocation dcfac9d2-9a14-f493-bd02-34af173724e3 now in dead state job_id=backoffice_gunicorn
2023-01-18T14:55:04Z |INFO| levant/job_status_checker: task django in allocation dcfac9d2-9a14-f493-bd02-34af173724e3 now in running state job_id=backoffice_gunicorn
2023-01-18T14:55:04Z |INFO| levant/job_status_checker: task nginx in allocation dcfac9d2-9a14-f493-bd02-34af173724e3 now in running state job_id=backoffice_gunicorn
2023-01-18T14:55:04Z |ERRO| levant/deploy: job deployment failed job_id=backoffice_gunicorn
Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit code 1

cu denny

@DevKhaverko

You can check the status of the allocation via the CLI. That works as a workaround until this is fixed.

@linuxmail

You can check the status of the allocation via the CLI. That works as a workaround until this is fixed.

Via Levant or via the Nomad CLI? Can you give me an example? It sounds to me like I would then need to add an exit 0 and check the state in a separate task.

@DevKhaverko

      # Collect the allocation IDs for the job; this assumes the most
      # recent allocation is listed first.
      IDs=($(nomad job allocs -namespace "ns_name" -t '{{ range . }}{{ .ID }} {{ end }}' "job_name"))
      lastID="${IDs[0]}"

      # The allocation's ClientStatus is "complete" on success and "failed" on failure.
      status=$(nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "$lastID")

      if [[ "$status" != "complete" ]]; then
         echo "Job failed, check the error in the logs: $NOMAD_ADDR/ui/allocations/$lastID/job_name-task/logs"
         exit 1
      else
         echo "Job finished successfully"
      fi

@DevKhaverko

Also, I missed handling the case where the job is still running: just add a while loop before checking for the "complete" status, as sketched below.
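
A minimal sketch of that loop, under the same assumptions as above ("ns_name" is a placeholder namespace and $lastID comes from the earlier snippet); it relies on ClientStatus being "pending" or "running" while the allocation is still in flight:

      # Wait until the allocation reaches a terminal state.
      status=$(nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "$lastID")
      while [[ "$status" == "pending" || "$status" == "running" ]]; do
         sleep 5
         status=$(nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "$lastID")
      done

      # Only now is the "complete" check from the snippet above meaningful.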
