Potential problem with demultiplexing and proposed solution #613

a4000 · 2023-08-02T10:00:16Z

Description of feature

The early steps in the pipeline run multiple samples in parallel (e.g., if there are 10 samples, fastqc gets run 10 times). This is the normal way of doing things in nf-core, and I think it's great in lots of situations; however, I don't think this is an efficient way of handling demultiplexing (at least with Cutadapt). If you have 100 samples and two large fastq files, running Cutadapt 100 times might not be the most efficient use of memory. You would also create lots of redundant data: for each sample, the reads belonging to all the other samples get placed in unknown files, and you can quickly fill up storage space.

One lazy solution to the last problem is to just delete those unknown files, but I imagine people might want the option to look at the reads that didn't get assigned.

My solution is to run Cutadapt on all samples at the same time. I also have a module that creates the input files necessary for Cutadapt to be used this way. I use an mmv module to rename the files so that they contain their sample names. I create a new sample sheet with the original sample sheet data plus the paths to the new fastq files. Lastly, I run the new sample sheet through nf-core's samplesheet_check module (I believe Ampliseq uses parse_input instead), which outputs a channel in the [ meta, fastqs ] format that's compatible with other nf-core modules (e.g., fastqc). It feels a bit hacky to create a new sample sheet in the pipeline and then re-use the samplesheet_check module, but that was the easiest solution I found, and I lack the Nextflow experience to come up with a better one.
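To make the "all samples in one Cutadapt run" idea concrete, here is a minimal Python sketch, not the actual module from the pipeline. It assumes a hypothetical sample sheet with `sample` and `barcode` columns, single-end reads, and 5' anchored barcodes; the file names (`barcodes.fasta`, `pool.fastq.gz`, `demux/`) are placeholders. It relies on Cutadapt's documented demultiplexing mode, where adapters are read from a FASTA file (`file:` prefix) and `{name}` in the output path is replaced by the matching adapter name (or `unknown` for unassigned reads). In this sketch the FASTA record names are the sample names, so the outputs already carry them; the pipeline described above uses a separate mmv rename step instead.

```python
import csv
import io

def barcode_fasta(rows):
    """Return FASTA text with one record per sample (header = sample name)."""
    return "".join(f">{row['sample']}\n{row['barcode']}\n" for row in rows)

def cutadapt_command(fasta_path, reads, out_dir="demux"):
    """Build a single Cutadapt call that demultiplexes every sample at once.

    '{name}' is left literal on purpose: Cutadapt substitutes the adapter
    name from the FASTA file, or 'unknown' for reads matching no barcode.
    """
    return [
        "cutadapt",
        "-g", f"^file:{fasta_path}",           # anchored 5' barcodes from FASTA
        "-o", f"{out_dir}/{{name}}.fastq.gz",  # one output file per barcode
        reads,
    ]

def demux_samplesheet(rows, out_dir="demux"):
    """Original sample sheet rows plus the path of each demultiplexed file."""
    return [dict(row, fastq=f"{out_dir}/{row['sample']}.fastq.gz") for row in rows]

# Hypothetical sample sheet; column names are assumptions for illustration.
SHEET = """sample,barcode
s1,ACGTAC
s2,TTGCAA
"""

rows = list(csv.DictReader(io.StringIO(SHEET)))
fasta = barcode_fasta(rows)
cmd = cutadapt_command("barcodes.fasta", "pool.fastq.gz")
new_rows = demux_samplesheet(rows)
```

With this layout Cutadapt reads the pooled file once, and there is a single `unknown` file per run rather than one per sample.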

d4straub · 2023-08-02T12:31:49Z

Hm, maybe it would be possible to:

1. If any of the demultiplexing columns are provided, activate demultiplexing mode; the demultiplexing info is saved with the other metadata in the meta map per sample.
2. In demultiplexing mode, aggregate the channel by read files (because the samples in a pool all have the identical read file(s)), e.g. using groupTuple, so that there is one channel element per file.
3. Convert the grouped meta map (which contains each sample with its indexes etc.) to info that can be used by Cutadapt in the process script section, process the files, and output all demultiplexed read files together with the meta map.
4. Somehow group the meta maps with the files again, so that there is one channel element per sample.

Admittedly, I have no idea how the last two steps would work; whether they are possible needs to be tested with channel operators. If not, my whole idea is of course not feasible.
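The intended data flow can be illustrated in plain Python. This is only a model of the channel shapes, not Nextflow code, and it does not show whether the real channel operators can do it. The meta keys (`id`, `barcode`), the file names, and the `demultiplex` stand-in for the Cutadapt process are all hypothetical.

```python
from collections import defaultdict

# One (meta, reads) element per sample, as after sample sheet parsing.
samples = [
    ({"id": "s1", "barcode": "ACGTAC"}, "pool_A.fastq.gz"),
    ({"id": "s2", "barcode": "TTGCAA"}, "pool_A.fastq.gz"),
    ({"id": "s3", "barcode": "GGATCC"}, "pool_B.fastq.gz"),
]

def group_by_file(channel):
    """Collect all meta maps that share a read file (the groupTuple-like step)."""
    by_file = defaultdict(list)
    for meta, reads in channel:
        by_file[reads].append(meta)
    return [(metas, reads) for reads, metas in by_file.items()]

def demultiplex(metas, reads):
    """Stand-in for the Cutadapt process: one output path per sample in the pool."""
    return [(meta, f"{meta['id']}.demux.fastq.gz") for meta in metas]

def ungroup(grouped_channel):
    """Flatten back to one (meta, demultiplexed file) element per sample."""
    return [
        pair
        for metas, reads in grouped_channel
        for pair in demultiplex(metas, reads)
    ]

grouped = group_by_file(samples)
per_sample = ungroup(grouped)
```

Here `grouped` has one element per pooled file and `per_sample` has one element per sample again, each pairing the original meta map with its own demultiplexed file.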

a4000 added the enhancement New feature or request label Aug 2, 2023
