
Clean the job directory when new jobs are received #284

Open
natefoo opened this issue Sep 28, 2021 · 2 comments

natefoo (Member) commented Sep 28, 2021

I sometimes have to requeue jobs in Galaxy that finished remotely but were not finished properly in Galaxy. This is a problem if the job directory still exists on the Pulsar side and the job is sent to the same Pulsar server as before. Pulsar attempts to resume staging in the input files but fails:

```
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: 2021-09-28 17:22:41,228 INFO  [pulsar.managers.util.retry][[manager=jetstream_iu]-[action=preprocess]-[job=37992802]] Failed to execute action[Staging input 'dataset_61602712.dat' via FileAction[path=/galaxy-repl/main/files/061/602/dataset_61602712.dat,action_type=remote_transfer,url=https://galaxy-web-04.galaxyproject.org/_job_files?job_id=bbd44e69cb8906b5c6ea3db5fc7ab0c5&job_key=c0ffee&path=/galaxy-repl/main/files/061/602/dataset_61602712.dat&file_type=input] to /jetstream/scratch0/main/jobs/37992802/inputs/dataset_61602712.dat], retrying in 6.0 seconds.
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: Traceback (most recent call last):
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: return fun(*args, **kwargs)
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/managers/staging/pre.py", line 19, in <lambda>
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: action_executor.execute(lambda: action.write_to_path(path), "action[%s]" % description)
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/client/action_mapper.py", line 465, in write_to_path
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: get_file(self.url, path)
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: File "/srv/pulsar/main/venv/lib64/python3.6/site-packages/pulsar/client/transport/curl.py", line 93, in get_file
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: c.perform()
Sep 28 17:22:41 jetstream-iu0.galaxyproject.org pulsar[6627]: pycurl.error: (33, "HTTP server doesn't seem to support byte ranges. Cannot resume.")
```

This may also be a symptom of a more general problem: Pulsar does not know the file length and attempts to fetch past the end of the file. In other words, Pulsar should remove existing job directories when a new setup message is received, and it should also avoid attempting to resume past the file size when staging in (a separate issue).
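For illustration, here is a minimal sketch of the first fix, cleaning the job directory when a setup message arrives; `setup_job_directory` and its arguments are hypothetical names for this sketch, not Pulsar's actual API:

```python
import os
import shutil


def setup_job_directory(staging_root, job_id):
    """Create a fresh directory for job_id, discarding any stale one
    left over from a previous submission of the same job (e.g. a requeue)."""
    job_directory = os.path.join(staging_root, str(job_id))
    if os.path.exists(job_directory):
        # A leftover directory means a prior attempt already staged some of
        # the inputs; removing it avoids the broken resume shown above.
        shutil.rmtree(job_directory)
    os.makedirs(job_directory)
    return job_directory
```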

gmauro (Member) commented Sep 29, 2021

I have a cron job that deletes successful/unsuccessful job directories, but I agree a more structured approach is needed.

natefoo (Member, Author) commented Sep 29, 2021

Yeah, I have a cron job running tmpwatch for this, which is needed regardless.
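For reference, a rough Python equivalent of that kind of periodic cleanup, assuming all job directories live under a single staging root; the path and age threshold below are examples, not Pulsar configuration:

```python
import os
import shutil
import time

STAGING_ROOT = "/jetstream/scratch0/main/jobs"  # example path
MAX_AGE_DAYS = 7  # example threshold


def prune_stale_job_directories(root=STAGING_ROOT, max_age_days=MAX_AGE_DAYS):
    """Delete job directories not modified in more than max_age_days,
    roughly what a periodic tmpwatch invocation does."""
    cutoff = time.time() - max_age_days * 86400
    for entry in os.scandir(root):
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry.path, ignore_errors=True)
```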
