fastq_merge reads in files and writes them out even if they do not need to be merged. This results in 12-hour runtimes of fastq_merge in which the file is not changed at all except that its name is different.
Example:
SRR000000_1.fastq is the only run in an experiment and is 50 GB.
This file is passed to fastq_merge.
There are no other files besides SRR000000_1.fastq.
fastq_merge will merge all files in the directory without checking how many files there are. It does this by reading in all files, merging them, and writing out a single new file with the new name (i.e. SRX00000_1.fastq).
In the above case of a single file, the process takes 12 hours because the Python code was inefficient and loaded everything into memory.
The only difference is that the file is now named SRX00000_1.fastq.
This impacts most experiments, as it is increasingly rare with modern sequencers to have multiple runs per experiment.
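To illustrate why this is pure waste: for a single-run experiment the merge step is equivalent to a rename. A minimal shell demonstration (file names hypothetical, tiny stand-in content instead of 50 GB) shows that the read-everything-and-write-it-back path and a plain copy produce byte-identical output:

```shell
# Stand-in for a single-run fastq file (names are hypothetical)
printf '@read1\nACGT\n+\nIIII\n' > SRR000000_1.fastq

# What fastq_merge effectively does when there is only one file:
# read all the data in and write it back out under the new name
cat SRR000000_1.fastq > SRX00000_1.fastq

# A plain copy yields a byte-identical file with no per-record processing
cp SRR000000_1.fastq SRX00000_copy_1.fastq

cmp -s SRX00000_1.fastq SRX00000_copy_1.fastq && echo "identical"
```

At 50 GB, the difference between streaming the whole file through Python and a copy (or, better, a rename) is hours of runtime.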
Command used and terminal output
No response
Relevant files
No response
System information
No response
Here is a fix that is not ideal, but it is faster than the current behavior:
New fastq_merge.nf file:
/**
 * This process merges the fastq files based on their sample_id number.
 */
process fastq_merge {
    tag { sample_id }

    container "systemsgenetics/gemmaker:2.1.0"

    input:
    tuple val(sample_id), path(fastq_files)

    output:
    tuple val(sample_id), path("${sample_id}_?.fastq"), emit: FASTQ_FILES
    tuple val(sample_id), val(params.DONE_SENTINEL), emit: DONE_SIGNAL

    script:
    """
    echo "#TRACE sample_id=${sample_id}"
    echo "#TRACE fastq_lines=`cat *.fastq | wc -l`"

    # Use find to locate files matching the pattern in the current directory
    # and count the forward reads of each potential pair.
    file_count_1=`find . -maxdepth 1 -type f -name "*_1.fastq" | wc -l`
    #file_count_2=`find . -maxdepth 1 -type f -name "*_2.fastq" | wc -l`

    # Check the number of files. If there is only one, there is no need to merge.
    if [ "\${file_count_1}" -gt 1 ] ; then
        echo "There are two or more fastq files. Proceeding to merge."
        merge_fastq.py --fastq_files ${fastq_files.join(" ")} --out_prefix ${sample_id}
    else
        echo "There is only one file per read direction. No need to merge; renaming instead."

        # Copy the single _1 fastq file to the new name.
        cp *_1.fastq ${sample_id}_1.fastq

        # Only copy the _2 fastq file if it exists.
        if [ -f *_2.fastq ]; then
            cp *_2.fastq ${sample_id}_2.fastq
            echo "File _2 has been copied."
        else
            echo "File _2 does not exist. This means this sample is not paired."
        fi
    fi
    """
}
This checks whether a sample has multiple files. If a sample has only one file, it copies that file to the new name in the current directory.
This cannot handle an edge case where there is only one *_1.fastq file but two *_2.fastq files (though I have never seen this).
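If that edge case ever needs to be guarded against, the commented-out second count could be restored and both read directions checked before deciding to rename. A standalone shell sketch of the decision logic (hypothetical file names, run in a scratch directory, not wired into the process):

```shell
# Set up a hypothetical mismatched sample: one _1 file, two _2 files.
mkdir -p edge_demo
cd edge_demo
printf '' > a_1.fastq
printf '' > b_2.fastq
printf '' > c_2.fastq

# Count the files for each read direction.
file_count_1=$(find . -maxdepth 1 -type f -name "*_1.fastq" | wc -l)
file_count_2=$(find . -maxdepth 1 -type f -name "*_2.fastq" | wc -l)

# Merge if either direction has more than one file; this catches the
# one *_1.fastq / two *_2.fastq case that counting only *_1.fastq misses.
if [ "$file_count_1" -gt 1 ] || [ "$file_count_2" -gt 1 ]; then
    decision="merge"
else
    decision="rename"
fi
echo "$decision"
```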
Issues with this new code and why I am still not happy:
The cp command is more efficient, but still not ideal. I cannot use the mv command because it breaks our cleanup step, although mv would be much more efficient.
A much better alternative would be to split the channel coming out of fastq_dump into files that need to be merged and those that do not. I do not have the time to do this, though, because it interferes with the cleanup steps again and takes a while to rework. @spficklin, if you have time, this would be a much-needed improvement. Message me if you need more details.
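That channel split could look roughly like the sketch below. This is untested and the channel/process names are hypothetical; it only assumes the upstream channel emits `tuple val(sample_id), path(fastq_files)` as fastq_merge expects:

```groovy
// Sketch only: split the fastq_dump output into samples that need merging
// and samples that can simply be renamed, so fastq_merge never touches
// single-run samples at all.
FASTQ_DUMP_OUT
    .branch { sample_id, fastq_files ->
        // More than one *_1.fastq file means multiple runs were downloaded
        multi:  fastq_files.findAll { it.name.endsWith('_1.fastq') }.size() > 1
        single: true
    }
    .set { fastq_split }

fastq_merge(fastq_split.multi)      // existing merge process
fastq_rename(fastq_split.single)    // hypothetical lightweight rename process
```

The cleanup steps would then need to accept the outputs of both branches, which is the part that takes time to rework.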