
Save disk space by compressing fastq files (after trimming and filtering) #136

Open
samnooij opened this issue Apr 14, 2020 · 5 comments

@samnooij
Collaborator

With large datasets and limited disk capacity, saving intermediate fastq files as raw fastq may take up hundreds of GBs of disk space. Disk usage can be decreased by using gzipped fastq files, and the tools that we currently have in the pipeline can all work with gzipped fastq files.

To implement this, the following rules have to be adjusted (a rough sketch follows the list):

  • Clean_the_data: trimmomatic's output
  • QC_clean_data: FastQC's input
  • HuGo_removal_1: bowtie2's input
  • HuGo_removal_2/3: replace bedtools with bbtools' reformat.sh (also requires a new conda env)
  • De_novo_assembly: SPAdes's input
  • first line of rule all: {sample}_{read}.fq.gz?
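
For illustration, such a change could look roughly like the sketch below; the rule body, paths and Trimmomatic settings are placeholders, not the pipeline's actual code. Trimmomatic, FastQC, bowtie2 and SPAdes all recognise gzip from the .gz extension, so in most rules only the filenames need to change:

rule Clean_the_data:
    input:
        "data/raw/{sample}_{read}.fastq"
    output:
        # ending the filename in .gz makes Trimmomatic write gzipped output
        "data/cleaned_fastq/{sample}_{read}.fq.gz"
    shell:
        "trimmomatic SE {input} {output} SLIDINGWINDOW:4:20 MINLEN:50"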

I am currently testing these changes and want to create a new branch (from dev) once I get them working. I will also try to do a little benchmark to compare the performance of the 'gzipped pipeline' against the current version with raw fastq files.

Please let me know if you have any other ideas!

@samnooij
Collaborator Author

I have altered the snakemake rules in DennisSchmitz/jovian@30a5ec0 and DennisSchmitz/jovian@28c1cf1. These should change all intermediate fastq files to gzipped variants. I am still running tests to see how well this works compared to the non-gzipped pipeline.

@DennisSchmitz added this to the 0.9.7.1 milestone May 12, 2020
@samnooij
Collaborator Author

I am done with the benchmark of 9 bacterial metagenomic datasets. In short, the conclusions are:

  1. Total processing time per sample increases by about 50 minutes (from 150 to 200 minutes, counting only the affected pipeline steps).
  2. Total disk usage per sample decreases by about 2 GB (from 3 GB to 1 GB in intermediate fastq files).

Additional remarks:

  1. The steps HuGo_removal_2/3 were actually faster with compression. That is, reformat.sh with sambamba is faster than samtools and bedtools. Whether or not we implement compression by default, this seems like a beneficial change.

  2. Since I see a trade-off between runtime and disk usage, we may choose to make the 'compressing pipeline' optional with a switch in the config file (a rough sketch follows below).
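
If we go with the config switch, a minimal sketch could look like this; the key name compress_intermediates and the abbreviated rule are made up for illustration:

# made-up config key; the default keeps the current uncompressed behaviour
extension = ".fq.gz" if config.get("compress_intermediates", False) else ".fq"

rule Clean_the_data:
    output:
        "data/cleaned_fastq/{sample}_{read}" + extension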

@samnooij
Collaborator Author

I just now noticed two other rules that depend on these intermediate fastq files: Fragment_length_analysis and quantify_output. The former uses BWA, which should be able to handle gzipped fastq files. The latter is a custom Python script that probably won't like binary input. I am going to try and test the whole pipeline now with compressed intermediate files. If anything breaks I will look for solutions again.
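
If quantify_output does turn out to choke on gzipped input, a small change along these lines might already be enough (a hypothetical helper, not the actual script):

import gzip

def open_fastq(path):
    """Open a fastq file in text mode, whether it is gzipped or plain."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" decodes to text instead of bytes
    return open(path, "r")

# e.g. count the reads (4 lines per fastq record); the path is made up
with open_fastq("data/cleaned_fastq/sample1_R1.fq.gz") as handle:
    print(sum(1 for _ in handle) // 4)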

@DennisSchmitz
Owner

Thanks for your thorough analysis and the report you emailed, really nice! So I'm a bit at a loss about how to proceed. Being able to reduce the footprint of certain intermediate files by >50% is really nice. Especially since users on our internal servers are now being capped by a ROM quota (which I'm way above 👼). But I really don't like how it adds 50 minutes of additional processing time per sample. Maybe instead of making it a flag in the config file, it can be added as a flag to the wrapper? Then end-users can choose for themselves if they want to compress after an analysis has finished?
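
To make that concrete (the flag name and paths below are made up): the compression itself boils down to a single post-run command that the wrapper could run after Snakemake exits when given a hypothetical --compress flag, or that could live in an onsuccess handler, e.g. something like

onsuccess:
    # purely illustrative: gzip leftover intermediate fastq once the workflow has finished;
    # the wrapper could run the same find/gzip one-liner instead
    shell("find data/cleaned_fastq -name '*.fastq' -exec gzip {{}} +")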

@samnooij
Collaborator Author

Yes, 50 minutes extra per sample is not a very desirable change. Also, it might be a bit of a hassle to adapt those other two rules to work with compressed files (which again may slow down the whole process).

As an alternative, we have suggested not waiting for the 'onsuccess' part at the end of the pipeline to remove unnecessary files, but instead marking a rule's output with temp() when it is no longer needed once the next rule has processed it. E.g. after trimming, the trimmed reads only need to be mapped by bowtie2 (background removal part 1); after that they can be removed. If the output of the trimming rule is written as something like

output:
    temp("data/cleaned_fastq/{sample}.fastq")

then snakemake can automatically remove this file when it is no longer needed, which of course also keeps disk usage lower during processing.
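
For completeness, a sketch of the consuming side (the bowtie2 call, index and paths are illustrative): once every rule that uses the trimmed reads (here only HuGo_removal_1) has finished, snakemake deletes the temp() file.

rule HuGo_removal_1:
    input:
        "data/cleaned_fastq/{sample}.fastq"  # the temp() output of the trimming rule
    output:
        "data/HuGo_removal/{sample}.sam"
    shell:
        "bowtie2 -x reference/HuGo_index -U {input} -S {output}"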

Apparently using temp() before gave some trouble, especially when the pipeline crashed halfway. I am trying it right now to see if I run into problems.
