
Delete pre-computed, downloaded input files on workflow completion on AWS Batch #961

Open
tsibley opened this issue Jun 10, 2022 · 0 comments

tsibley commented Jun 10, 2022

During some troubleshooting on Slack re: a fork/derivative of this workflow, I realized that the workflow doesn't do anything to exclude the pre-computed input files it downloads at the start from being re-uploaded to S3 at the end of the AWS Batch jobs we use in production. For example, from the logs of 90f43cc5-da33-44a5-b0e6-9fda03ac0806 printed by nextstrain build (with some non-standard timestamping added):

[batch] [2022-06-09T19:30:16.001000]   adding: nextstrain-data/ (stored 0%)
[batch] [2022-06-09T19:30:16.001000]   adding: nextstrain-data/files/ (stored 0%)
[batch] [2022-06-09T19:30:16.002000]   adding: nextstrain-data/files/ncov/ (stored 0%)
[batch] [2022-06-09T19:30:16.002000]   adding: nextstrain-data/files/ncov/open/ (stored 0%)
[batch] [2022-06-09T19:30:28.670000]   adding: nextstrain-data/files/ncov/open/metadata.tsv.gz (deflated 0%)
[batch] [2022-06-09T19:30:53.306000]   adding: nextstrain-data/files/ncov/open/sequences.fasta.xz (deflated 0%)

While zip seems to skip recompressing those files (so it doesn't waste much CPU time on them; they add only ~40s to the zip step), they still bloat the uploaded archive by quite a bit. While it's theoretically nice to have the exact inputs preserved alongside the exact outputs, so we could track detailed provenance or troubleshoot by exact replication, in practice I'm not sure we need to do either of those things.

Any file deletion solution below will want to condition on running a) with an internal Nextstrain profile and b) on AWS Batch. This is probably best accomplished by introducing a new config var (e.g. delete_inputs_after_use, or some better name) that defaults to disabled but that we enable for our production runs. We could also detect AWS Batch by looking for the presence of an env var (e.g. NEXTSTRAIN_AWS_BATCH_WORKDIR_URL would work currently), but I think a single config var to opt in to the behaviour is better.
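
For illustration, a minimal sketch of the opt-in, assuming the placeholder key name above (the real name and where it lives are open questions):

    # Sketch only: "delete_inputs_after_use" is a placeholder config key, not an
    # existing option. In a Snakefile, `config` is already in scope.
    delete_inputs_after_use = config.get("delete_inputs_after_use", False)

    # Alternative (not my preference): detect AWS Batch by the env var the CLI
    # currently sets, instead of an explicit opt-in.
    import os
    running_on_aws_batch = bool(os.environ.get("NEXTSTRAIN_AWS_BATCH_WORKDIR_URL"))

The internal profiles we use for production runs would then set delete_inputs_after_use: true in their config.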

For actually doing the deletion, I see two good solutions for the short term (rough sketches after the list):

  1. Delete these files (conditionally) from within both an onsuccess and an onerror handler. This is maybe the most obvious approach.

  2. Mark the files (conditionally) as temp(…) so Snakemake automatically cleans them up once no further rules need them. This is maybe the easiest and requires the least extra code, but it's not clear whether temp() has pitfalls in practice or whether it's even supported with Snakemake remote files, which we use to download/materialize/localize the inputs.
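
Rough sketches of both, purely for illustration; the config key, local file paths, and rule below are made up, and the real workflow materializes these inputs via Snakemake remote files rather than an explicit download command:

    # Option 1 sketch: delete the downloaded inputs from onsuccess and onerror
    # handlers in the Snakefile. Paths and the config key are illustrative.
    from pathlib import Path

    def delete_downloaded_inputs():
        if not config.get("delete_inputs_after_use", False):
            return
        for path in [
            "data/downloads/metadata.tsv.gz",     # hypothetical local paths for
            "data/downloads/sequences.fasta.xz",  # the materialized S3 inputs
        ]:
            Path(path).unlink(missing_ok=True)

    onsuccess:
        delete_downloaded_inputs()

    onerror:
        delete_downloaded_inputs()

    # Option 2 sketch: conditionally wrap download outputs in temp() so Snakemake
    # removes them once no remaining rule needs them. Rule/output names are made
    # up; only the wrapping pattern is the point.
    def maybe_temp(path):
        return temp(path) if config.get("delete_inputs_after_use", False) else path

    rule download_metadata:
        output:
            metadata = maybe_temp("data/downloads/metadata.tsv.gz")
        shell:
            "aws s3 cp s3://nextstrain-data/files/ncov/open/metadata.tsv.gz {output.metadata}"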

Longer term, I think it might be reasonable for the AWS Batch machinery in nextstrain/cli and nextstrain/docker-base to support some sort of ignores file, but there's a bit more to consider there in terms of the right interface, so maybe we'll always want to leave it up to the workflow to handle.
