
fastq_merge is very inefficient #288

Open
JohnHadish opened this issue Mar 21, 2024 · 1 comment

@JohnHadish
Collaborator

Description of the bug

fastq_merge reads in files and writes them out even when no merge is needed. This results in 12-hour runtimes of fastq_merge where the file content is unchanged and only its name differs.

Example:

  • SRR000000_1.fastq is the only run in an experiment and is 50 GB.
  • This file is passed to fastq_merge
  • There are no other files besides SRR000000_1.fastq
  • fastq_merge merges all files in the directory without checking how many there are. It does this by reading in all files, merging them, and writing out a single new file with the new name (e.g. SRX00000_1.fastq).
  • In the above case of a single file, the process takes 12 hours because the Python code is inefficient and holds everything in memory.
  • The only difference is that the file is now named SRX00000_1.fastq

This impacts most experiments, as it is increasingly rare with modern sequencers to have multiple runs per experiment.
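For reference, concatenation never needs to hold whole files in memory. A minimal sketch of a streaming merge (a hypothetical `merge_fastq` helper, not the pipeline's actual merge_fastq.py) using `shutil.copyfileobj`, which copies in fixed-size chunks:

```python
import shutil

def merge_fastq(fastq_files, out_path):
    """Concatenate FASTQ files into out_path using a fixed-size
    buffer, so memory use stays constant regardless of file size."""
    with open(out_path, "wb") as out:
        for path in fastq_files:
            with open(path, "rb") as src:
                # copyfileobj streams in chunks (64 KiB by default)
                shutil.copyfileobj(src, out)
```

Even the multi-file case then runs at disk speed rather than being bound by Python object overhead.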

Command used and terminal output

No response

Relevant files

No response

System information

No response

@JohnHadish JohnHadish added the bug label Mar 21, 2024
@JohnHadish JohnHadish self-assigned this Mar 21, 2024
@JohnHadish
Collaborator Author

Here is a quick fix; it is not ideal, but it is faster than the original:

New fastq_merge.nf file:

/**
 * This process merges the fastq files based on their sample_id number.
 */
process fastq_merge {
    tag { sample_id }
    container "systemsgenetics/gemmaker:2.1.0"

    input:
    tuple val(sample_id), path(fastq_files)

    output:
    tuple val(sample_id), path("${sample_id}_?.fastq"), emit: FASTQ_FILES
    tuple val(sample_id), val(params.DONE_SENTINEL), emit: DONE_SIGNAL

    script:
    """
    echo "#TRACE sample_id=${sample_id}"
    echo "#TRACE fastq_lines=`cat *.fastq | wc -l`"

    # Use find to locate files matching the pattern in the current directory
    # and count the forward-read (_1) files
    file_count_1=`find . -maxdepth 1 -type f -name "*_1.fastq" | wc -l`
    #file_count_2=`find . -maxdepth 1 -type f -name "*_2.fastq" | wc -l`

    # Check the number of files. If there is only 1 there is no need to do the merge
    if [ "\${file_count_1}" -gt 1 ] ; then
        echo "There are two or more fastq files. Proceeding to merge"
        merge_fastq.py --fastq_files ${fastq_files.join(" ")} --out_prefix ${sample_id}
    else
        echo "There is only one file per mate. No need to merge, copying instead"

        # Copy the forward-read (_1) file
        cp *_1.fastq ${sample_id}_1.fastq

        # Only copy the reverse-read (_2) file if it exists
        # (ls handles the unmatched-glob case more robustly than [ -f *_2.fastq ])
        if ls *_2.fastq >/dev/null 2>&1; then
                cp *_2.fastq ${sample_id}_2.fastq
                echo "File _2 has been copied."
        else
                echo "File _2 does not exist. This means this sample is not paired"
        fi
    fi
    """
}

This checks whether a sample has multiple files. If a sample has only one file per mate, it copies that file to the new name instead of merging.
This does not handle the edge case where there is only one *_1.fastq but two *_2.fastq files (but I have never seen this).
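That edge case could be detected up front by counting both mates before branching. A hypothetical helper (not part of the pipeline) sketching the check:

```python
import glob

def mate_counts(directory="."):
    """Count forward (_1) and reverse (_2) FASTQ files in a directory,
    so a caller can spot mismatched pair counts before deciding
    whether a merge is actually needed."""
    n1 = len(glob.glob(f"{directory}/*_1.fastq"))
    n2 = len(glob.glob(f"{directory}/*_2.fastq"))
    return n1, n2
```

If `n1 != n2` and both are nonzero, the sample layout is inconsistent and the process could fail early instead of silently renaming.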

Issues with this new code and why I am still not happy:
The cp command is more efficient than merging, but still not ideal. I cannot use the mv command, even though it would be a lot more efficient, because it breaks the cleanup step that we have.

A much better alternative would be to split the channel coming out of fastq_dump into files that need to be merged and those that do not. I do not have the time to do this though, because it interferes with the cleanup steps again and takes a while to rework. @spficklin if you have time this would be a much needed improvement. Message me if you need more details.
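The splitting logic could look roughly like this. A plain-Python sketch of the partitioning only (in Nextflow this would likely be a `branch` operator on the channel out of fastq_dump; the function name and tuple shape here are assumptions for illustration):

```python
def split_samples(samples):
    """Partition (sample_id, files) tuples into samples that need a
    real merge (more than one forward-read file) and samples that
    only need a rename."""
    to_merge, to_rename = [], []
    for sample_id, files in samples:
        forward = [f for f in files if f.endswith("_1.fastq")]
        if len(forward) > 1:
            to_merge.append((sample_id, files))
        else:
            to_rename.append((sample_id, files))
    return to_merge, to_rename
```

Only the `to_merge` branch would then flow into the expensive merge process; the rename-only branch could be handled with a lightweight staging step.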
