
Save disk space by compressing fastq files (after trimming and filtering) #136

Open
samnooij opened this issue Apr 14, 2020 · 5 comments

@samnooij
Collaborator

With large datasets and limited disk capacity, saving intermediate fastq files as raw fastq may take up hundreds of GBs of disk space. Disk usage can be decreased by using gzipped fastq files, and the tools that we currently have in the pipeline can all work with gzipped fastq files.

To implement this, the following rules have to be adjusted (a rough sketch follows the list):

  • Clean_the_data: trimmomatic's output
  • QC_clean_data: FastQC's input
  • HuGo_removal_1: bowtie2's input
  • HuGo_removal_2/3: replace bedtools with bbtools' reformat.sh (also requires a new conda env)
  • De_novo_assembly: SPAdes's input
  • first line of rule all: {sample}_{read}.fq.gz?
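
For illustration, such a change could look roughly like the sketch below; the rule body, paths and Trimmomatic settings are placeholders, not the pipeline's actual code. Trimmomatic, FastQC, bowtie2 and SPAdes all recognise gzip from the .gz extension, so in most rules only the filenames need to change:

rule Clean_the_data:
    input:
        "data/raw/{sample}_{read}.fastq"
    output:
        # ending the filename in .gz makes Trimmomatic write gzipped output
        "data/cleaned_fastq/{sample}_{read}.fq.gz"
    shell:
        "trimmomatic SE {input} {output} SLIDINGWINDOW:4:20 MINLEN:50"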

I am currently testing these changes and want to create a new branch (from dev) once I get them working. I will also try to do a little benchmark to compare the performance of the 'gzipped pipeline' against the current version with raw fastq files.

Please let me know if you have any other ideas!

@samnooij
Collaborator Author

I have altered the snakemake rules in DennisSchmitz/jovian@30a5ec0 and DennisSchmitz/jovian@28c1cf1. These should change all intermediate fastq files to gzipped variants. I am still running tests to see how well this works compared to the non-gzipped pipeline.

@DennisSchmitz added this to the 0.9.7.1 milestone May 12, 2020
@samnooij
Collaborator Author

I am done with the benchmark of 9 bacterial metagenomic datasets. In short, the conclusions are:

  1. Total processing time per sample increases by about 50 minutes (from 150 to 200 minutes, counting only the affected pipeline steps).
  2. Total disk usage per sample decreases by about 2 GB (from 3 GB to 1 GB in intermediate fastq files).

Additional remarks:

  1. The steps HuGo_removal_2/3 were actually faster with compression. That is, reformat.sh with sambamba is faster than samtools and bedtools. Whether or not we implement compression by default, this seems like a beneficial change.

  2. Since I see a trade-off between runtime and disk usage, we may choose to make the 'compressing pipeline' optional with a switch in the config file (a rough sketch follows below).
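
If we go with the config switch, a minimal sketch could look like this; the key name compress_intermediates and the abbreviated rule are made up for illustration:

# made-up config key; the default keeps the current uncompressed behaviour
extension = ".fq.gz" if config.get("compress_intermediates", False) else ".fq"

rule Clean_the_data:
    output:
        "data/cleaned_fastq/{sample}_{read}" + extension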

@samnooij
Collaborator Author

I just now noticed two other rules that depend on these intermediate fastq files: Fragment_length_analysis and quantify_output. The former uses BWA, which should be able to handle gzipped fastq files. The latter is a custom Python script that probably won't like binary input. I am going to try and test the whole pipeline now with compressed intermediate files. If anything breaks I will look for solutions again.
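
If quantify_output does turn out to choke on gzipped input, a small change along these lines might already be enough (a hypothetical helper, not the actual script):

import gzip

def open_fastq(path):
    """Open a fastq file in text mode, whether it is gzipped or plain."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" decodes to text instead of bytes
    return open(path, "r")

# e.g. count the reads (4 lines per fastq record); the path is made up
with open_fastq("data/cleaned_fastq/sample1_R1.fq.gz") as handle:
    print(sum(1 for _ in handle) // 4)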

@DennisSchmitz
Owner

Thanks for your thorough analysis and the report you emailed, really nice! So I'm a bit at a loss about how to proceed. Being able to reduce the footprint of certain intermediate files by >50% is really nice. Especially since users on our internal servers are now being capped by a ROM quota (which I'm way above 👼). But I really don't like how it adds 50 minutes of additional processing time per sample. Maybe instead of making it a flag in the config file, it can be added as a flag to the wrapper? Then end-users can choose for themselves if they want to compress after an analysis has finished?
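
To make that concrete (the flag name and paths below are made up): the compression itself boils down to a single post-run command that the wrapper could run after Snakemake exits when given a hypothetical --compress flag, or that could live in an onsuccess handler, e.g. something like

onsuccess:
    # purely illustrative: gzip leftover intermediate fastq once the workflow has finished;
    # the wrapper could run the same find/gzip one-liner instead
    shell("find data/cleaned_fastq -name '*.fastq' -exec gzip {{}} +")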

@samnooij
Collaborator Author

Yes, 50 minutes extra per sample is not a very desirable change. Also, it might be a bit of a hassle to adapt those other two rules to work with compressed files (which again may slow down the whole process).

As an alternative, we have suggested not waiting for the 'onsuccess' part at the end of the pipeline to remove unnecessary files, but instead marking a rule's output with temp() when it is no longer needed once the next rule has processed it. E.g. after trimming, the trimmed reads only need to be mapped by bowtie2 (background removal part 1); after that they can be removed. If the output of the trimming rule is written as something like

output:
    temp("data/cleaned_fastq/{sample}.fastq")

then snakemake can automatically remove this file when it is no longer needed, which of course also keeps disk usage lower during processing.
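
For completeness, a sketch of the consuming side (the bowtie2 call, index and paths are illustrative): once every rule that uses the trimmed reads (here only HuGo_removal_1) has finished, snakemake deletes the temp() file.

rule HuGo_removal_1:
    input:
        "data/cleaned_fastq/{sample}.fastq"  # the temp() output of the trimming rule
    output:
        "data/HuGo_removal/{sample}.sam"
    shell:
        "bowtie2 -x reference/HuGo_index -U {input} -S {output}"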

Apparently using temp() before gave some trouble, especially when the pipeline crashed halfway. I am trying it right now to see if I run into problems.
