
FR: Use temporary directories for sra and temporary fastq files #961

Open
simonvh opened this issue Mar 17, 2023 · 14 comments
Labels
enhancement New feature or request

Comments

@simonvh
Member

simonvh commented Mar 17, 2023

Useful pipeline you have here ;)

Is your feature request related to a problem? Please describe.

When working on an NFS-mounted system, it seems like downloading sra and converting to fastq files is taking longer than needed or expected. In addition, IO becomes very slow on that NFS system, to the point that it's noticeable in other applications and while working on the command line. My (lightly tested) hunch is that this happens when SRA files are converted to FASTQ, as this means reading and writing to disk with potentially many parallel processes. Most likely, you will not want to keep the SRA files anyway.

Describe the solution you'd like

I think this could be solved by using $TMPDIR with the snakemake tmpdir resource and/or shadow rules. However, you may have already tried this and refrained from implementing it due to other issues.

If this is relevant, it may also make sense to couple the SRA and the FASTQ rules, so that the SRA file is only used internally to the rule, can be kept on the temporary filesystem, and is deleted when the rule successfully finishes. The downside is that you may end up with duplicated code, as the SRA download logic would be copied. On the other hand, this could also be implemented in a bash script, which could then be re-used.

Would this be worth considering (I'd be happy to supply a PR)?
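For concreteness, a minimal sketch of what such a coupled rule might look like (hypothetical rule name, single-end only, assuming sra-tools on PATH and a `fastq_dir` config key; the real rules would also need paired-end handling and retries):

```python
rule sra_to_fastq:
    # Hypothetical coupled rule: the .sra only ever exists in $TMPDIR,
    # and only the final fastq is written to the (NFS) output directory.
    output:
        config["fastq_dir"] + "/{run}.fastq.gz",
    resources:
        parallel_downloads=1,  # throttle concurrent downloads
    shell:
        """
        prefetch {wildcards.run} --output-directory $TMPDIR
        fastq-dump --gzip --outdir $TMPDIR $TMPDIR/{wildcards.run}/{wildcards.run}.sra
        mv $TMPDIR/{wildcards.run}.fastq.gz {output}
        """
```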

@simonvh simonvh added the enhancement New feature or request label Mar 17, 2023
@Maarten-vd-Sande
Member

I guess this could work, and makes sense. I do think there are some issues, specifically when running on e.g. slurm. I'm not sure how to do this nicely.

caveats:

  • fastq_dir shouldn't be in tmpdir when using the download-fastq workflow.
  • when using e.g. slurm or some other scheduler, the compute node(s) may not share tmpdir with the node you run it from.
  • I'm not sure if you can use a resource for both input and output easily.

I don't think shadow rules would help here, but perhaps I'm just not seeing the solution.

An alternative without using tmpdir would be to set sra_dir: /tmp/my_sras in the config, and similarly fastq_dir: /tmp/my_fastqs. But this won't work when running over multiple nodes.

@Maarten-vd-Sande
Member

Also, the sra files and the fastq files are all removed when no longer needed in the newer seq2science versions.

@simonvh
Member Author

simonvh commented Mar 17, 2023

The problem is not disk space, as the files are indeed removed when no longer needed, but IO on the NFS, which leads to slow response of our whole server.

If I understand it correctly, the following could solve this:

rule download_fastq_whatever:

  • SRA downloaded to $TMPDIR
  • SRA converted to FASTQ on $TMPDIR
  • FASTQ file moved/copied to output_dir

The $TMPDIR can be set using the tmpdir resource. This should work in a slurm context as it will delay the evaluation of that variable:

The tmpdir resource automatically leads to setting the $TMPDIR variable for shell commands, scripts, wrappers and notebooks. In cluster or cloud setups, its evaluation is delayed until the actual execution of the job. This way, it can dynamically react on the context of the node of execution.

The effect is that IO on NFS is only for the final fastq output (which is inevitable anyway), and on a local $TMPDIR for the IO-intensive multiprocessing SRA to FASTQ conversion.
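As a minimal (hypothetical) illustration of that resource, with /scratch standing in for a local disk:

```python
rule show_tmpdir:
    # Snakemake exports the tmpdir resource as $TMPDIR for the shell command;
    # in cluster setups it is resolved on the node that executes the job.
    output:
        "tmpdir_used.txt",
    resources:
        tmpdir="/scratch",  # hypothetical local scratch path
    shell:
        "echo $TMPDIR > {output}"
```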

@Maarten-vd-Sande
Member

Thinking a bit about it again, I guess your original proposal actually could work: (dynamically) "grouping" and "shadowing" the rules, and setting the shadow-prefix to tmpdir.

I guess a new config variable should specify whether or not the temp dirs setting is needed, and where that would be: https://github.com/vanheeringen-lab/seq2science/blob/master/seq2science/schemas/config/download.schema.yaml

`run2sra` downloads the sra, and `sra2fastq_SE` and `sra2fastq_PE` use `parallel-fastq-dump` to make it a fastq. All the other downloading rules directly download fastq files, and don't have to be changed. https://github.com/vanheeringen-lab/seq2science/blob/master/seq2science/rules/get_fastq.smk

Those rules would then need dynamic group and shadow keywords, depending on whether the config variable is set. My guess would be that None works as the default.

The only question is whether it works nicely with the tmpdir resource.
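A hypothetical sketch of such a dynamic rule; `use_tmpdir` is an assumed new config key, and whether None is really accepted as the "off" value for group/shadow would need to be verified, as guessed above:

```python
use_tmp = config.get("use_tmpdir", False)

rule run2sra:
    output:
        config["sra_dir"] + "/{run}/{run}.sra",
    params:
        outdir=config["sra_dir"],
    group:
        "sra2fastq" if use_tmp else None  # couple download + dump into one job
    shadow:
        "minimal" if use_tmp else None    # execute under --shadow-prefix (e.g. tmpdir)
    shell:
        "prefetch {wildcards.run} --output-directory {params.outdir}"
```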

@simonvh simonvh mentioned this issue Mar 20, 2023
@simonvh
Member Author

simonvh commented Mar 20, 2023

See #963 for an initial PR. I think shadowing is not even necessary. As far as I understand it, snakemake will set the tmpdir resource to $TMPDIR, and will do so once the rule is started. This works fine on one computer; I haven't tested this in a slurm setup.

The parallel_downloads resource will also ensure that not too many SRA files are present in $TMPDIR at the same time.

Due to merging the SRA download and fastq-dump in one rule, using tmpdir is now easy, as we don't have to keep temporary files between rules. It does have the downside that the SRA file will be downloaded again if dumping doesn't work or is interrupted.

Let me know if there are other potential issues with this solution.

@Maarten-vd-Sande
Member

I don't think I like this approach 😇. For example, /tmp is 15GB on cn106. Moreover, it makes performance worse, as snakemake cannot distinguish anymore between the "downloading" resource and the cores resource.

Doesn't changing sra_dir: /tmp/my_sras and fastq_dir: /tmp/my_fastqs effectively do the same?

@simonvh
Member Author

simonvh commented Mar 20, 2023

For example /tmp is 15GB on cn106

That's why you set/use $TMPDIR and use /scratch, right? The 15GB will be prohibitive for many tools when using it as the default temporary directory.

Doesn't changing sra_dir: /tmp/my_sras and fastq_dir: /tmp/my_fastqs effectively do the same?

Kinda.. but...
If I have a workflow with hundreds of samples, this won't fit on the local temp directory. From the initial test run I did, it looks like it first will download all (or at least many) samples before mapping/quantifying. It also means that you will have to remember to set these directories every time you create a new workflow.

it makes performance worse, as snakemake cannot distinguish anymore between the "downloading" resource and the cores resource.

I see your point. I'm not sure this is always the case practically though, due to slow NFS performance in the current configuration. I will think a bit more on it :).

@simonvh
Member Author

simonvh commented Mar 21, 2023

I'll do some benchmarking of this PR relative to the develop branch. It could be that improved performance of the SRA -> fastq conversion outweighs the downside of not having as many parallel downloads? Other than that I'm not sure there is another viable solution. (This fix does solve my immediate problems, so at the least it was good for that purpose ;))

@Maarten-vd-Sande
Member

Maarten-vd-Sande commented Mar 21, 2023

That's why you set/use $TMPDIR and use /scratch, right? The 15GB will be prohibitive for many tools when using it as the default temporary directory.

I do, yes. But I don't think everyone does.

The problem you're having seems specific to your hardware setup, as I've never run into any of these issues. Way back on sara I think I set parallel-downloads to +/- 20 and all went fine. Similarly, when using /scratch on the servers it doesn't cause latency issues.

Perhaps using disk_gb would help, but that seems pretty specific to this problem (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#standard-resources).
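For reference, a hypothetical sketch of that idea (note that the standard resource in the linked docs is spelled disk_mb, and it only throttles local jobs when a global limit is passed, e.g. `--resources disk_mb=100000`):

```python
rule sra2fastq_sketch:
    # Hypothetical: claim ~20 GB of scratch per conversion, so only a few
    # conversions run at once when a global disk_mb limit is set.
    input:
        "sra/{run}/{run}.sra",
    output:
        "fastq/{run}.fastq.gz",
    resources:
        disk_mb=20000,
    shell:
        "fastq-dump --gzip --outdir fastq {input}"
```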

I still think setting sra_dir and fastq_dir could solve this problem, especially as the trimmed fastqs get placed in the trimmed_dir, which could be on the large file system, where they wait to get mapped. You can even set those directories as defaults in a profile and use that, e.g. seq2science run chip-seq --profile dontusetemp. Some command-line arguments don't work in the profile (they get overwritten), but that way you can't forget your favourite settings.

https://github.com/vanheeringen-lab/seq2science-profiles

@Maarten-vd-Sande
Member

Or just run it with fewer parallel downloads: seq2science run atac-seq --snakemakeOptions resources={parallel_downloads:1}

@simonvh
Member Author

simonvh commented Mar 21, 2023

I don't think the parallel downloads are the problem, but the SRA -> fastq conversion (limiting parallel downloads is just an unfortunate side effect of grouping the download and conversion as I have done here). I could be wrong, and it could indeed be an issue with our NFS config. In my case sra_dir and fastq_dir wouldn't work, as snakemake downloaded many SRA files before converting them to FASTQ, resulting in a directory size that would be too big for our /scratch. We just don't have a local disk that is large enough to hold all the raw data. Snakemake doesn't prioritize finishing a sample (and thereby freeing the space of SRA and untrimmed FASTQ files) over just downloading all samples first.

Feel free to close the PR. I'm happy with it on my side (everything is running smoothly now), but I can see why you would not be happy with it being a default.

@siebrenf
Member

siebrenf commented Mar 21, 2023

I'm all for fixing needless IO, but if prefetch cannot be piped into fastq-dump, we're still stuck with writing SRAs.

That said, we can limit how many of those exist at any one moment. #968 prioritizes finishing a fastq over downloading sras, and can easily be expanded to include trimming and beyond: #969.

And this PR is not mutually exclusive.
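For context, a hypothetical sketch of that kind of prioritization in Snakemake (among jobs that are ready to run, higher priority is scheduled first):

```python
rule run2sra:
    output:
        "sra/{run}/{run}.sra",
    priority: 0   # downloads get the lowest priority
    shell:
        "prefetch {wildcards.run} --output-file {output}"

rule sra2fastq:
    input:
        "sra/{run}/{run}.sra",
    output:
        "fastq/{run}.fastq.gz",
    priority: 10  # dump (and free the .sra) before starting new downloads
    shell:
        "fastq-dump --gzip --outdir fastq {input}"
```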

@siebrenf
Member

Our version of sra-tools was old. #970 has the latest version, which received a few bugfixes. That should mean fewer retries, and less file moving.

To optimize for your setup, I think we should go with PRs #969 and #970, and you should try sra_dir: /tmp/sra, fastq_dir: /tmp/fastq and trimmed_dir: /tmp/trimmed in the config/profile.

While looking into prefetch I noticed something else: you could consider downloading SRA Lite files (with simplified quality scores, thus smaller) instead of the full SRAs. Their docs.

@simonvh
Member Author

simonvh commented Mar 22, 2023

This will all help, but don't worry too much about my use case. I'm happy with the code of my own PR on my side ;)

I just wonder if this is a broader issue, but that would need to be tested. If you don't experience the slowdown (as in the server really becoming unresponsive, waiting seconds for just an SSH login), then this may be due to our configuration. Don't spend too much (or any) time on it for my sake.

The thing is, I'm not 100% sure this is the underlying issue. What I do know is that with the code of this PR, I don't experience it anymore.
