copy jobs are terminated when celery container goes down #436

Open
dtenenba opened this issue Jan 1, 2024 · 0 comments

dtenenba commented Jan 1, 2024

Copy jobs are started inside the celery container as processes that run in that container. Therefore they terminate if the celery container is terminated.

So if the celery container instead started jobs as docker containers (mounting /var/run/docker.sock and using the API) then the copy jobs could continue even if the celery container goes down.
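A minimal sketch of that idea, assuming the copy job is an rclone invocation and that the celery container has the docker CLI plus /var/run/docker.sock mounted. The image name, the motuz.job_id label, and both helper names are hypothetical and not taken from the motuz code base; a real job would also need its rclone config and data paths mounted.

```python
import subprocess


def build_copy_job_command(job_id, src, dst, image="rclone/rclone"):
    """Build a `docker run` argv that starts a copy job as a detached container.

    The container is created by the host's docker daemon, so it is a sibling
    of the celery container rather than a child process inside it.
    """
    return [
        "docker", "run",
        "--detach",                           # return immediately, job keeps running
        "--name", f"motuz-copy-{job_id}",     # hypothetical naming scheme
        "--label", f"motuz.job_id={job_id}",  # hypothetical label, lets us find the job later
        image, "copy", src, dst,              # assumes the image's entrypoint is rclone
    ]


def start_copy_job(job_id, src, dst):
    """Start the container and return its ID (docker prints it on stdout)."""
    result = subprocess.run(
        build_copy_job_command(job_id, src, dst),
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

Storing the returned container ID (or the label) in the database is what would let a restarted celery container find the job again.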

However, celery would stop getting updates from the jobs (I think?) once it came back up. Need to test and see exactly what happens (but note that

The idea behind filing this issue is that currently if we want to update all containers of a production instance of motuz, we have to wait until there are no jobs running, or else terminate them.

If the fix/change we want to deploy is only needed in the web app, we can just restart the web app container.
But if we need to update the celery (or rabbitmq?) container we will run into this issue.

Really, this is a larger issue than the title suggests. It could involve a major refactor depending on how we want to handle it.

Things that have been discussed:

  • Copy jobs are submitted to slurm
  • Motuz runs in k8s (k3s?) and starts copy jobs as pods
  • Motuz runs in a/the swarm and starts copy jobs as services(?)
  • Note that motuz does not have to run in a swarm or k8s in order to submit copy jobs to either.
  • Copy jobs start as docker containers, as described above

For all of these, we need to figure out how to track the progress of copy jobs. It should be robust and interruptible (meaning that if motuz is restarted, it can then still know about jobs in progress and be able to detect where they are, or see if a job failed while it was down, and write the relevant info to the database).
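One way to picture the restart behaviour described above is a reconciliation pass that runs when motuz comes back up. This is only a sketch: the copy_job table, its columns, the state names, and the lookup_state callback are all hypothetical, and lookup_state stands in for whichever backend (docker, slurm, k8s) ends up running the jobs.

```python
import sqlite3


def reconcile_jobs(conn, lookup_state):
    """Refresh every job the database still marks as RUNNING.

    `lookup_state(external_id)` asks the backend for the job's current state
    and returns a string such as "RUNNING", "SUCCESS" or "FAILED", or None if
    the backend no longer knows the job. Returns the (id, new_state) pairs
    that were written to the database.
    """
    rows = conn.execute(
        "SELECT id, external_id FROM copy_job WHERE state = 'RUNNING'"
    ).fetchall()
    changed = []
    for job_id, external_id in rows:
        state = lookup_state(external_id)
        if state is None:
            state = "LOST"  # assumption: unknown jobs are flagged, not silently dropped
        if state != "RUNNING":
            conn.execute(
                "UPDATE copy_job SET state = ? WHERE id = ?", (state, job_id)
            )
            changed.append((job_id, state))
    conn.commit()
    return changed
```

Because the database row plus the backend's own record are the only sources of truth, nothing here depends on the process that originally started the job still being alive.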

There may or may not be a continuing role for Celery in this new world. For example, if we submitted copy jobs to slurm, we could simply use slurm to track running jobs, and celery would be redundant.
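As a rough illustration of letting slurm do the tracking, the sketch below polls sacct and parses its pipe-delimited output. It assumes slurm accounting is enabled so that sacct can report finished jobs; the function names are hypothetical.

```python
import subprocess


def parse_sacct(output):
    """Map job ID -> (state, exit_code) from `sacct --parsable2 --noheader` output."""
    jobs = {}
    for line in output.splitlines():
        parts = line.split("|")
        if len(parts) < 3 or not parts[1].strip():
            continue  # blank or malformed line
        job_id, state, exit_code = parts[:3]
        if "." in job_id:
            continue  # skip job steps such as 1234.batch
        # "CANCELLED by 1000" is reduced to "CANCELLED"
        jobs[job_id] = (state.split()[0], exit_code)
    return jobs


def slurm_job_states(job_ids):
    """Ask slurm accounting for the current state of the given job IDs."""
    cmd = [
        "sacct",
        f"--jobs={','.join(job_ids)}",
        "--format=JobID,State,ExitCode",
        "--parsable2",  # pipe-delimited, no trailing delimiter
        "--noheader",
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_sacct(result.stdout)
```

Since slurm keeps this record itself, motuz could be down for the whole lifetime of a job and still learn afterwards whether it completed or failed.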

Also, it seems that if the celery container is terminated, there is no way to bring it back up without taking all services down and back up (docker-compose down && docker-compose up -d). docker-compose start celery and docker-compose restart celery do not work. This may be an argument against the continued use of celery.

@dtenenba dtenenba self-assigned this Jan 1, 2024