Enhance Resilience to Empty FASTQ Files with Logging Functionality #4842

glichtenstein · 2024-02-02T19:14:27Z

PR checklist

Title of PR:

"Enhanced Resilience to Empty or Invalid FASTQ Files with Logging Functionality"

Description:

Changes Made:

Introduction of a Log File:
- Implemented a global log file (${params.outdir}/invalid_fastqs.log) for tracking empty or invalid FASTQ files, enhancing post-run analysis.
Modified generate_fastq_meta Function:
- Updated generate_fastq_meta to gracefully handle empty or invalid FASTQ files, preventing pipeline interruption.
- Added logFile as a new parameter for logging these occurrences.
New Function - appendToLogFile:
- Introduced appendToLogFile, a utility function for appending messages to the log file. This function improves maintainability and includes a check to handle Groovy GStrings.
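
The changes listed above could be sketched roughly as follows. This is a language-neutral illustration in Python, not the Groovy code from this PR: the function names fastq_is_empty, append_to_log, and keep_valid_fastqs are hypothetical, and treating "empty" as "decompresses to zero bytes" is an assumption about what counts as an empty FASTQ.

```python
import gzip
from pathlib import Path


def fastq_is_empty(path):
    """Return True when a FASTQ file (plain or gzipped) contains no data.

    Assumption: a gzipped file is "empty" when it decompresses to zero bytes;
    the compressed file itself is usually a few dozen bytes, so checking the
    on-disk size alone would not be enough.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        return handle.read(1) == b""


def append_to_log(log_path, message):
    """Append one line to the log file, creating parent directories if needed."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"{message}\n")


def keep_valid_fastqs(fastqs, log_path):
    """Return the non-empty FASTQ paths and log every skipped one."""
    kept = []
    for fastq in fastqs:
        if fastq_is_empty(fastq):
            append_to_log(log_path, f"{fastq}: empty FASTQ, skipped")
        else:
            kept.append(fastq)
    return kept
```

With 99 valid samples and one empty one, keep_valid_fastqs would return the 99 paths and leave a single line in the log naming the skipped file, which mirrors the behaviour described in the use cases below.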

Reason for Changes:

Improving Pipeline Robustness: Aims to bolster the pipeline's resilience against edge cases like empty or invalid FASTQ files.
Enhanced Debugging and Transparency: Offers a clear record of skipped files, aiding in debugging and ensuring data quality.
Maintainability and Code Organization: The addition of appendToLogFile and the parameterization of logFile align with best coding practices, enhancing code reusability and organization.

Impact:

These enhancements bolster the pipeline's robustness without altering its core functionality, ensuring a smooth experience when encountering empty or invalid FASTQ files.

Use Cases Addressed:

Incorrect Barcode in Sample Sheet:
- In cases where an incorrect set of barcodes is provided for a sample, resulting in an empty FASTQ file, the pipeline previously halted. This was problematic when processing multiple samples (e.g., 99 valid samples and 1 with an incorrect barcode), as valid FASTQ files would remain in the workdir and not be moved to the outdir. With the proposed changes, the pipeline will continue to the end, allowing for the release of data for valid samples and logging the problematic sample for further investigation.
Library Preparation Errors:
- Instances where library preparation for a specific sample is faulty can also lead to empty FASTQ files. The updated pipeline will now allow for continuous processing to the end, enabling users to utilize MultiQC outputs to identify which samples were problematic. The invalid_fastqs.log file will assist in pinpointing these samples for reprocessing or further laboratory investigation.

glichtenstein · 2024-02-02T20:59:33Z

Hi @matthdsm, one important note: to pass the tests, I have commented out the subworkflows/bcl_demultiplex tests in pytest_modules.yml, because the md5sums don't match the expected ones. This PR is related to this one in nf-core/demultiplex:dev. How can we proceed? Your advice will be very much appreciated. Thanks a lot.

Aratz

About the tests failing, I tried it from master and they fail too, so it must be unrelated to your PR 🤷 That being said, we are migrating everything from pytest to nf-test, so it would be much appreciated if you could update the tests in this PR. You can find the instructions here 👉 https://nf-co.re/docs/contributing/subworkflows#migrating-from-pytest-to-nf-test. This will also update the md5sums in the process.

Also, if you want the logfile to be available to the outside and other modules downstream I think it should be part of an output channel (although I'm not entirely sure about this, you might want to get a second opinion).

matthdsm · 2024-02-28T10:58:44Z

@Aratz, is this good to go, do you think?

Aratz · 2024-02-28T11:36:21Z

I'm still wondering if invalid_fastqs.log shouldn't be part of a channel and added to emit: instead. My thinking is that this may make it easier if we want to detect downstream if invalid fastqs have been generated. But I'm no expert, I'd be happy to hear some second opinions on this.

Also, maybe I was a bit too quick to fix the tests, I just noticed the snap file was basically empty, so this should be addressed too.

SPPearce · 2024-05-02T12:23:43Z

Do you need any help to get these changes merged in?

glichtenstein · 2024-05-03T18:26:56Z

Yes, I could definitely use a hand; I am struggling with the nf-tests.

SPPearce · 2024-05-05T07:23:54Z

Ok, can help you this week.

SPPearce · 2024-05-07T10:15:02Z

@glichtenstein , should be good to merge now.
I didn't do much, just removed the files from the old pytest folder (tests/...) and then changed the assertions to those used in the individual modules.

SPPearce · 2024-05-07T10:32:56Z

@glichtenstein , I don't seem to be able to edit .github/workflows/test.yml on your branch. It needs:

          - profile: conda
            path: subworkflows/nf-core/bcl_demultiplex

These lines need to be added before the current line 644 so that the conda test can be excluded.

glichtenstein · 2024-05-08T16:10:02Z

Oh, I don't know why you can't edit it, but I have added those lines now.

SPPearce · 2024-05-08T16:23:11Z

I couldn't edit it because it changes the GitHub Actions that are run, so GitHub sensibly didn't want some random person to make changes to that on your repository.
Great, let's merge it in.

glichtenstein · 2024-05-08T16:33:17Z

Awesome, Thanks for everything!!!!

k1sauce · 2024-05-15T22:40:25Z

@glichtenstein Thanks for all the work you have done on this! I filed an issue related to the use of new File here:
#5612

new File will cause issues in several situations; for example, running on AWS with an outdir in S3 will not work. Replacing it with file() should be sufficient.

However, I think that this subworkflow should not decide which files are valid and which are invalid. In fact, an empty FASTQ is a valid output of Illumina's bclconvert and bcl2fastq. Instead, this subworkflow should emit all the files that are created (even if they are empty). This is useful when a downstream user wants to count the number of reads in the FASTQ files to confirm that a certain barcode/index received zero counts, rather than inferring it. If it is necessary to filter the output, the subworkflow should have separate emits for empty FASTQ files, as @Aratz mentioned. Something like fastqs and empty_fastqs would work.

glichtenstein · 2024-05-16T14:08:22Z

@k1sauce Thank you! Those are indeed very good and fair points. Basically, the issue arose because falco returns an error and exits when it encounters an empty FASTQ, stating that it is invalid, and the entire workflow then exits. So even though BCLConvert has finished, the workflow won't continue with the subsequent tasks, like Falco. The idea is mainly to skip over those empty FASTQs so that the workflow can succeed and reach the end, and one can then evaluate the cause of the empty FASTQs, i.e., missed or wrong barcodes in SampleSheet.csv. The output file is intended as a log file to keep track of which files were empty or deemed invalid by falco during processing. I would very much like to enhance this and make it optional so that it does not disrupt your work, maybe with a flag like --skip-over-empty-fastqs or something of the sort. What do you think?

k1sauce · 2024-05-16T21:47:15Z

@glichtenstein Thanks for the background info. After thinking it over, I think the best thing to do is to include a boolean flag in the meta map of the FASTQ output channel. That way the user can decide how to handle those empty FASTQ files.
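
A rough sketch of the flag idea, written in Python purely for illustration (the subworkflow itself is Nextflow/Groovy). The key name empty and the helper names has_no_reads and flag_empty are hypothetical; nothing here is taken from the merged code, and marking an entry as empty when any of its files has no data is an assumption.

```python
import gzip
from pathlib import Path


def has_no_reads(path):
    """Return True when a FASTQ file (plain or gzipped) contains no data."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        return handle.read(1) == b""


def flag_empty(meta, fastqs):
    """Return (meta, fastqs) with a boolean 'empty' key added to a copy of meta.

    Every file is still emitted; the flag only records whether any of the
    files for this sample has no data (an assumption made for this sketch).
    """
    flagged = dict(meta)  # copy, so the caller's meta map is left untouched
    flagged["empty"] = any(has_no_reads(fastq) for fastq in fastqs)
    return flagged, list(fastqs)
```

A downstream consumer could then decide for itself, for example keeping only the entries whose meta["empty"] is False before a tool such as falco, while still being able to report the empty ones.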

glichtenstein · 2024-05-16T22:21:50Z

Okay, I will try to do both: use Nextflow's file() and add a boolean flag. I will create a new branch.

Enhance Resilience to Empty FASTQ Files with Logging Functionality

df65c65

glichtenstein force-pushed the fix/empty-fastqs-handling branch 2 times, most recently from cdcc03b to df65c65 Compare February 2, 2024 20:14

tests commented out from pytest_modules.yml, md5sums dont match curre…

19f90d2

…nt demultiplex.

glichtenstein force-pushed the fix/empty-fastqs-handling branch from 7649d1b to 19f90d2 Compare February 2, 2024 20:52

glichtenstein marked this pull request as ready for review February 2, 2024 20:59

glichtenstein requested review from matthdsm and a team as code owners February 2, 2024 20:59

glichtenstein requested a review from koenbossers February 2, 2024 20:59

glichtenstein mentioned this pull request Feb 2, 2024

skip over empty fastqs after demultiplex and continue gracefully nf-core/demultiplex#166

Closed

10 tasks

matthdsm approved these changes Feb 4, 2024

Aratz self-requested a review February 5, 2024 07:43

Aratz reviewed Feb 5, 2024

glichtenstein mentioned this pull request Feb 18, 2024

Demux dies for empty files nf-core/demultiplex#142

Closed

glichtenstein and others added 7 commits February 18, 2024 19:09

Merge remote-tracking branch 'upstream/master' into fix/empty-fastqs-…

4391313

…handling

pytest modules returned to default before migrating to nf-tests

724f108

Migrating from pytest to nf-test

da78f94

bcl2fastq

4864727

rollback using filter instead of branch.

f152603

Merge branch 'master' into fix/empty-fastqs-handling

640d348

Fix bcl_demultiplex tests

e9e43b1

matthdsm reviewed Feb 28, 2024

subworkflows/nf-core/bcl_demultiplex/tests/main.nf.test

Aratz reviewed Feb 28, 2024

tests/subworkflows/nf-core/bcl_demultiplex/main.nf.test

Aratz and others added 3 commits February 28, 2024 12:46

Fix snapshot

d5fece5

Update subworkflows/nf-core/bcl_demultiplex/tests/main.nf.test

0773f9b

Co-authored-by: Matthias De Smet <11850640+matthdsm@users.noreply.github.com>

Merge branch 'nf-core:master' into fix/empty-fastqs-handling

3a6504e

glichtenstein force-pushed the fix/empty-fastqs-handling branch from 3221e14 to 3a6504e Compare March 29, 2024 21:39

glichtenstein added 3 commits March 29, 2024 17:59

docker set in nextflow.config (?)

fa507f1

setting docker enabled in bcl2fastq and bclconvert test nextflow.config

9e288ef

PROFILE = "" // Set to 'singularity', 'conda', 'mamba', or 'podman' t…

64727c0

…o test different container engines or defaults to Docker

Merge remote-tracking branch 'origin/master' into fix/empty-fastqs-ha…

d7f1f7b

…ndling

SPPearce added 3 commits May 7, 2024 05:44

Remove pytest files

ab2dfc4

Merge branch 'master' into fix/empty-fastqs-handling

65dc4ab

Update test

66bcbb3

maxulysse approved these changes May 7, 2024

maxulysse reviewed May 7, 2024

tests/config/nextflow.config

Update tests and remove nextflow.config

b27fd5a

maxulysse approved these changes May 7, 2024

SPPearce mentioned this pull request May 7, 2024

Add nf-test for bcl_demultiplex, remove old pytest #4501

Closed

14 tasks

Adding before the current line 644, to be able to exclude the conda t…

bb75b7c

…est. - profile: conda; path: subworkflows/nf-core/bcl_demultiplex

Merge branch 'master' into fix/empty-fastqs-handling

9ad1b80

SPPearce added this pull request to the merge queue May 8, 2024

Merged via the queue into nf-core:master with commit 98dec39 May 8, 2024
11 checks passed

glichtenstein deleted the fix/empty-fastqs-handling branch May 8, 2024 16:34
