Checking batch job status fails #456

Open
sokil opened this issue Jul 6, 2022 · 6 comments

sokil commented Jul 6, 2022

I have a batch job that performs a one-time, short-running task. A successful deployment looks like this:

2022-06-29T16:00:17Z |INFO| levant/deploy: triggering a deployment job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/deploy: evaluation e9d76b4c-8f4b-68e5-05e3-eee20a82d225 finished successfully job_id=some_nomad_job_name
2022-06-29T16:00:18Z |DEBU| levant/job_status_checker: running job status checker for job job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/job_status_checker: job has status running job_id=some_nomad_job_name
2022-06-29T16:00:18Z |INFO| levant/job_status_checker: task command in allocation 124b605d-518e-6292-5cd3-8decc4d033ec now in pending state job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/job_status_checker: task command in allocation 124b605d-518e-6292-5cd3-8decc4d033ec now in running state job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/job_status_checker: all allocations in deployment of job are running job_id=some_nomad_job_name
2022-06-29T16:00:27Z |INFO| levant/deploy: job deployment successful job_id=some_nomad_job_name

Today I got this error:

2022-07-06T14:57:01Z |INFO| levant/deploy: triggering a deployment job_id=some_nomad_job_name
2022-07-06T14:57:03Z |INFO| levant/deploy: evaluation ffa905f9-e937-e178-2e1a-d2b3d18ed8a8 finished successfully job_id=some_nomad_job_name
2022-07-06T14:57:03Z |DEBU| levant/job_status_checker: running job status checker for job job_id=some_nomad_job_name
2022-07-06T14:57:07Z |ERRO| levant/job_status_checker: job has status dead job_id=some_nomad_job_name
2022-07-06T14:57:07Z |ERRO| levant/deploy: job deployment failed job_id=some_nomad_job_name

In the successful deployment, the time between "levant/job_status_checker: running job status checker for job" and the first status message is 0 seconds.
In the failed one it is 4 seconds. During that time my job finished successfully and reached the status 'dead', but Levant assumes the task simply died, so it exits with a non-zero code and the CI pipeline fails.

As I see it, Levant has some problem communicating with Nomad and takes too long to fetch the job status.
Is it possible to disable the job check? Asynchronous checking of short-lived tasks may fail unexpectedly.

@DevKhaverko

I have the same problem: Levant marks the deployment as failed because it checks the job status, which can only be pending, running, or dead.
That status can't tell us whether the container (or anything else) exited successfully or not.
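
For illustration, a minimal sketch of the difference ("ns_name", "job_name", and "some_alloc_id" are placeholders): a finished batch job reports the same job-level status whether its tasks succeeded or failed, while the allocation-level ClientStatus distinguishes the two:

# Job-level status: a finished batch job shows "dead" on success and on failure alike.
nomad job status -namespace "ns_name" "job_name"

# Allocation-level status: prints "complete" on success, "failed" on failure.
nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "some_alloc_id"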

@linuxmail

Hi,

Same issue here. I have a one-shot container which creates some files and then exits 0, but the pipeline is marked as failed:

2023-01-18T14:55:03Z |INFO| levant/job_status_checker: task django-collectstatic in allocation dcfac9d2-9a14-f493-bd02-34af173724e3 now in dead state job_id=backoffice_gunicorn
2023-01-18T14:55:04Z |INFO| levant/job_status_checker: task django in allocation dcfac9d2-9a14-f493-bd02-34af173724e3 now in running state job_id=backoffice_gunicorn
2023-01-18T14:55:04Z |INFO| levant/job_status_checker: task nginx in allocation dcfac9d2-9a14-f493-bd02-34af173724e3 now in running state job_id=backoffice_gunicorn
2023-01-18T14:55:04Z |ERRO| levant/deploy: job deployment failed job_id=backoffice_gunicorn
Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit code 1

cu denny

@DevKhaverko

You can check the status of the allocation via the CLI. That works as a workaround until this is fixed.

@linuxmail

You can check the status of the allocation via the CLI. That works as a workaround until this is fixed.

Via Levant or via the Nomad CLI? Can you give me an example? It sounds to me like I would then need to add an exit 0 and check the state in a separate task.

@DevKhaverko

      # Collect the allocation IDs for the job; this assumes the most
      # recent allocation is listed first.
      IDs=($(nomad job allocs -namespace "ns_name" -t '{{ range . }}{{ .ID }} {{ end }}' "job_name"))
      lastID="${IDs[0]}"

      # The allocation's ClientStatus is "complete" on success and "failed" on failure.
      status=$(nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "$lastID")

      if [[ "$status" != "complete" ]]; then
         echo "Job failed, check the error in the logs: $NOMAD_ADDR/ui/allocations/$lastID/job_name-task/logs"
         exit 1
      else
         echo "Job finished successfully"
      fi

@DevKhaverko

Also, I missed handling the case where the job is still running: just add a while loop before checking for the "complete" status, as sketched below.
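
A minimal sketch of that loop, under the same assumptions as above ("ns_name" is a placeholder namespace and $lastID comes from the earlier snippet); it relies on ClientStatus being "pending" or "running" while the allocation is still in flight:

      # Wait until the allocation reaches a terminal state.
      status=$(nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "$lastID")
      while [[ "$status" == "pending" || "$status" == "running" ]]; do
         sleep 5
         status=$(nomad alloc status -namespace "ns_name" -t '{{ .ClientStatus }}' "$lastID")
      done

      # Only now is the "complete" check from the snippet above meaningful.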
