copy jobs are terminated when celery container goes down #436

dtenenba · 2024-01-01T21:25:50Z

Copy jobs are started inside the celery container as processes that run in that container. Therefore they terminate if the celery container is terminated.

So if the celery container instead started jobs as docker containers (mounting /var/run/docker.sock and using the API) then the copy jobs could continue even if the celery container goes down.
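A minimal sketch of that idea, assuming the celery image has the `docker` CLI installed and `/var/run/docker.sock` mounted. It shells out to the CLI rather than calling the API directly, purely to keep the example short. The image name, container naming scheme, and label key are hypothetical and do not come from the motuz code base.

```python
import subprocess

# Hypothetical image name; the issue does not say which image would run a copy job.
COPY_JOB_IMAGE = "example/motuz-copy-job:latest"


def build_copy_job_argv(job_id, command, image=COPY_JOB_IMAGE):
    """Build a `docker run` argv that starts a copy job as a detached container."""
    return [
        "docker", "run",
        "--detach",                           # return immediately, print the container ID
        "--name", f"motuz-copy-{job_id}",     # hypothetical naming scheme
        "--label", f"motuz.job_id={job_id}",  # hypothetical label, used to find the job later
        image,
        *command,
    ]


def start_copy_job(job_id, command):
    """Start the container via the mounted Docker socket and return its container ID."""
    result = subprocess.run(
        build_copy_job_argv(job_id, command),
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

Because the container is created by the host's Docker daemon, it should be a sibling of the celery container rather than a child process of it. That is what would let it outlive a celery restart.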

However, celery would stop getting updates from the jobs (I think?) once it came back up. Need to test and see exactly what happens (but note that

The idea behind filing this issue is that currently if we want to update all containers of a production instance of motuz, we have to wait until there are no jobs running, or else terminate them.

If the fix/change we want to deploy is only needed in the web app, we can just restart the web app container.
But if we need to update the celery (or rabbitmq?) container we will run into this issue.

Really, this is a larger issue than the title suggests. It could involve a major refactor depending on how we want to handle it.

Things that have been discussed:

- Copy jobs are submitted to slurm
- Motuz runs in k8s (k3s?) and starts copy jobs as pods
- Motuz runs in a/the swarm and starts copy jobs as services(?)
- Copy jobs start as docker containers as described above

Note that motuz does not have to run in a swarm or k8s in order to submit copy jobs to either.

For all of these, we need to figure out how to track the progress of copy jobs. It should be robust and interruptible (meaning that if motuz is restarted, it can then still know about jobs in progress and be able to detect where they are, or see if a job failed while it was down, and write the relevant info to the database).
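One way to picture the "interruptible" requirement is a reconciliation pass that runs when motuz starts up. The sketch below is backend-agnostic and not taken from motuz. `JobRecord`, the status strings, and the shape of `observed` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: int
    status: str  # assumed values: "running", "succeeded", "failed", "lost"


def reconcile(records, observed):
    """Bring database records in line with what the job backend reports.

    `observed` maps job_id -> exit code, or None if the job is still running.
    A job missing from `observed` is unknown to the backend.
    Returns the records whose status changed, so the caller can persist them.
    """
    changed = []
    for record in records:
        if record.status != "running":
            continue  # already finalized before the restart
        if record.job_id not in observed:
            new_status = "lost"  # the backend has no trace of the job
        else:
            exit_code = observed[record.job_id]
            if exit_code is None:
                continue  # still in progress, nothing to update
            new_status = "succeeded" if exit_code == 0 else "failed"
        record.status = new_status
        changed.append(record)
    return changed
```

The same pass would apply whether `observed` is filled from Docker, slurm, or k8s. Only the code that collects exit codes would differ per backend.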

There may or may not be a continuing role for Celery in this new world. For example, if we submitted copy jobs to slurm, we could simply use slurm to track running jobs, and Celery would be redundant.
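As a rough sketch of what "use slurm to track running jobs" could look like, the helpers below parse the job ID printed by `sbatch --parsable`, build an `sacct` query for that job, and map the reported state to a status. The status strings and the choice of which slurm states count as still active are assumptions, not something motuz defines.

```python
# Slurm states treated as "not finished yet" (an assumption for this sketch).
ACTIVE_STATES = {"PENDING", "RUNNING", "COMPLETING", "SUSPENDED", "REQUEUED", "RESIZING"}


def parse_sbatch_parsable(output):
    """`sbatch --parsable` prints the job ID, optionally followed by `;cluster`."""
    return int(output.strip().split(";")[0])


def sacct_argv(slurm_job_id):
    """Build an `sacct` command that prints only the state of the job allocation."""
    return [
        "sacct", "-j", str(slurm_job_id),
        "--allocations",   # one line for the job, not one per step
        "--noheader",
        "--parsable2",
        "--format=State",
    ]


def motuz_status(slurm_state):
    """Map a slurm state string to a hypothetical motuz status."""
    words = slurm_state.split()
    if not words:
        return "unknown"
    state = words[0]  # "CANCELLED by 1000" -> "CANCELLED"
    if state in ACTIVE_STATES:
        return "running"
    if state == "COMPLETED":
        return "succeeded"
    return "failed"  # FAILED, CANCELLED, TIMEOUT, NODE_FAIL, ...
```

Since slurm keeps the job state itself, motuz could query it after a restart and would not need to have been running while the job finished.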

Also, it seems that if the celery container is terminated, there is no way to bring it back up without taking all services down and back up (`docker-compose down && docker-compose up -d`). Neither `docker-compose start celery` nor `docker-compose restart celery` works. This may be an argument against the continued use of celery.


dtenenba self-assigned this Jan 1, 2024
