Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance Resilience to Empty FASTQ Files with Logging Functionality #4842

Merged
merged 22 commits into from
May 8, 2024

Conversation

glichtenstein
Copy link
Contributor

@glichtenstein glichtenstein commented Feb 2, 2024

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/demultiplex branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Title of PR:

"Enhanced Resilience to Empty or Invalid FASTQ Files with Logging Functionality"

Description:

Changes Made:

  1. Introduction of a Log File:

    • Implemented a global log file (${params.outdir}/invalid_fastqs.log) for tracking empty or invalid FASTQ files, enhancing post-run analysis.
  2. Modified generate_fastq_meta Function:

    • Updated generate_fastq_meta to gracefully handle empty or invalid FASTQ files, preventing pipeline interruption.
    • Added logFile as a new parameter for logging these occurrences.
  3. New Function - appendToLogFile:

    • Introduced appendToLogFile, a utility function for appending messages to the log file. This function improves maintainability and includes a check to handle Groovy GStrings.

Reason for Changes:

  • Improving Pipeline Robustness: Aims to bolster the pipeline's resilience against edge cases like empty or invalid FASTQ files.
  • Enhanced Debugging and Transparency: Offers a clear record of skipped files, aiding in debugging and ensuring data quality.
  • Maintainability and Code Organization: The addition of appendToLogFile and the parameterization of logFile align with best coding practices, enhancing code reusability and organization.

Impact:

  • These enhancements bolster the pipeline's robustness without altering its core functionality, ensuring a smooth experience when encountering empty or invalid FASTQ files.

Use Cases Addressed:

  1. Incorrect Barcode in Sample Sheet:

    • In cases where an incorrect set of barcodes is provided for a sample, resulting in an empty FASTQ file, the pipeline previously halted. This was problematic when processing multiple samples (e.g., 99 valid samples and 1 with an incorrect barcode), as valid FASTQ files would remain in the workdir and not be moved to the outdir. With the proposed changes, the pipeline will continue to the end, allowing for the release of data for valid samples and logging the problematic sample for further investigation.
  2. Library Preparation Errors:

    • Instances where library preparation for a specific sample is faulty can also lead to empty FASTQ files. The updated pipeline will now allow for continuous processing to the end, enabling users to utilize MultiQC outputs to identify which samples were problematic. The invalid_fastqs.log file will assist in pinpointing these samples for reprocessing or further laboratory investigation.

@glichtenstein glichtenstein force-pushed the fix/empty-fastqs-handling branch 2 times, most recently from cdcc03b to df65c65 Compare February 2, 2024 20:14
@glichtenstein
Copy link
Contributor Author

Hi @matthdsm, one important note, to pass the tests, I have commented out the pytest_modules.yml for subworkflows/bcl_demultiplex, it seems the md5sum's dont match with the ones expected. This PR is related to this one in nf-core/demultiplex:dev how can we proceed? your advice will be very much appreciated. Thanks a lot.

@Aratz Aratz self-requested a review February 5, 2024 07:43
Copy link
Contributor

@Aratz Aratz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the tests failing, I tried it from master and they fail too, so it must be unrelated to your PR 🤷 That being said, we are migrating everything from pytest to nf-test, so it would be much appreciated if you could update the tests in this PR. You can find the instructions here 👉 https://nf-co.re/docs/contributing/subworkflows#migrating-from-pytest-to-nf-test. This will also update the md5sums in the process.

Also, if you want the logfile to be available to the outside and other modules downstream I think it should be part of an output channel (although I'm not entirely sure about this, you might want to get a second opinion).

@matthdsm
Copy link
Contributor

@Aratz, this good to go you think?

@Aratz
Copy link
Contributor

Aratz commented Feb 28, 2024

I'm still wondering if invalid_fastqs.log shouldn't be part of a channel and added to emit: instead. My thinking is that this may make it easier if we want to detect downstream if invalid fastqs have been generated. But I'm no expert, I'd be happy to hear some second opinions on this.

Also, maybe I was a bit too quick to fix the tests, I just noticed the snap file was basically empty, so this should be addressed too.

@SPPearce
Copy link
Contributor

SPPearce commented May 2, 2024

Do you need any help to get these changes merged in?

@glichtenstein
Copy link
Contributor Author

Do you need any help to get these changes merged in?

Yes, I could definelty use a hand, I am struggling with the nf-tests.

@SPPearce
Copy link
Contributor

SPPearce commented May 5, 2024

Ok, can help you this week.

@SPPearce
Copy link
Contributor

SPPearce commented May 7, 2024

@glichtenstein , should be good to merge now.
I didn't do much, just removed the files from the old pytest folder (tests/...) and then changed the assertions to that used in the individual modules.

@SPPearce
Copy link
Contributor

SPPearce commented May 7, 2024

@glichtenstein , I don't seem to be able to edit .github/workflows/test.yml on your branch. It needs:

          - profile: conda
            path: subworkflows/nf-core/bcl_demultiplex

Adding before the current line 644, to be able to exclude the conda test.

…est. - profile: conda; path: subworkflows/nf-core/bcl_demultiplex
@glichtenstein
Copy link
Contributor Author

@glichtenstein , I don't seem to be able to edit .github/workflows/test.yml on your branch. It needs:

          - profile: conda
            path: subworkflows/nf-core/bcl_demultiplex

Adding before the current line 644, to be able to exclude the conda test.

Oh, dont know why you cant edit, but I added those lines now.

@SPPearce
Copy link
Contributor

SPPearce commented May 8, 2024

I couldn't edit it because it changes the github actions that are run, so github sensibly didn't want some random person to make changes to that on your repository.
Great, let's merge it in.

@SPPearce SPPearce added this pull request to the merge queue May 8, 2024
Merged via the queue into nf-core:master with commit 98dec39 May 8, 2024
11 checks passed
@glichtenstein
Copy link
Contributor Author

I couldn't edit it because it changes the github actions that are run, so github sensibly didn't want some random person to make changes to that on your repository. Great, let's merge it in.

Awesome, Thanks for everything!!!!

@glichtenstein glichtenstein deleted the fix/empty-fastqs-handling branch May 8, 2024 16:34
@k1sauce
Copy link
Contributor

k1sauce commented May 15, 2024

@glichtenstein Thanks for all the work you have done on this! I filed an issue related to the use of new File here:
#5612

new File will cause issue with several things, for example running on aws with an outdir in s3 will not work. replacing this with file() should be sufficient.

However, I think that this sub workflow should not make decisions on which files are valid and invalid. In fact, an empty FASTQ is a valid output of Illumina's bclconvert and bcl2fastq. Instead, this sub workflow should emit all the files that are created (even if they are empty). This is useful in situations where a downstream user wants to count the # of reads in the FASTQ files to know that a certain barcode/index received zero counts, instead of inferring that is the case. If it is necessary to filter the output, the sub workflow should have emits for empty FASTQ files as @Aratz mentioned. Something like fastqs and empty_fastqs would work.

@glichtenstein
Copy link
Contributor Author

@glichtenstein Thanks for all the work you have done on this! I filed an issue related to the use of new File here: #5612

new File will cause issue with several things, for example running on aws with an outdir in s3 will not work. replacing this with file() should be sufficient.

However, I think that this sub workflow should not make decisions on which files are valid and invalid. In fact, an empty FASTQ is a valid output of Illumina's bclconvert and bcl2fastq. Instead, this sub workflow should emit all the files that are created (even if they are empty). This is useful in situations where a downstream user wants to count the # of reads in the FASTQ files to know that a certain barcode/index received zero counts, instead of inferring that is the case. If it is necessary to filter the output, the sub workflow should have emits for empty FASTQ files as @Aratz mentioned. Something like fastqs and empty_fastqs would work.

@k1sauce Thank you! Those are indeed very good and fair points. Basically, the issue arised because falco will return an error and exit when encountered with an empty fastq stating it was invalid and the entire workflow then exits. So even though BCLConvert has finished the workflow wont continue with the subsequent tasks, like Falco. So the idea is mainly to skip over those empty fastqs so that the workflow can succeed and reach the end so that one can evaluate the cause of empty fastqs, l.e., missed/wrong barcodes in SampleSheet.csv. The output file is intended as a log file to keep track of which files where empty or dimmed invalid by falco during the processing. I would very much like to enhance this and make it optional, so it may not disrupt your work, maybe with a flag like --skip-over-empty-fastqs or something of the sorts. What do you think?

@k1sauce
Copy link
Contributor

k1sauce commented May 16, 2024

@glichtenstein Thanks for the background info. After thinking it over, I think the best thing to do is to include a boolean flag in the meta map of the FASTQ output channel. That way the user can decide how to handle those empty FASTQ files.

@glichtenstein
Copy link
Contributor Author

@glichtenstein Thanks for the background info. After thinking it over, I think the best thing to do is to include a boolean flag in the meta map of the FASTQ output channel. That way the user can decide how to handle those empty FASTQ files.

Okie, I will try to do both, use nextflow's file() and add a boolean flag, will create a new branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants