Potential problem with demultiplexing and proposed solution #613

Open
a4000 opened this issue Aug 2, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@a4000
Contributor

a4000 commented Aug 2, 2023

Description of feature

The early steps in the pipeline run multiple samples in parallel (e.g., if there are 10 samples, fastqc gets run 10 times). This is the normal way of doing things in nf-core and I think it's great in lots of situations, but I don't think it is an efficient way of handling demultiplexing (at least with Cutadapt). If you have 100 samples and two large fastq files, running Cutadapt 100 times might not be the most efficient use of memory. You would also create lots of redundant data: for each sample, the reads belonging to all the other samples get placed in unknown files, and you can quickly fill up storage space.
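To put a rough number on the storage concern, here is a back-of-the-envelope sketch. It assumes reads are split evenly across samples and that every read carries a known barcode, so each per-sample run leaves (n - 1) / n of the input in its unknown file. The function name and the figures (100 samples, 20 GB of input) are made up for illustration, and compression is ignored.

```python
def redundant_output_gb(n_samples: int, input_gb: float) -> float:
    """Combined size of the 'unknown' files when Cutadapt runs once per sample.

    Assumption: reads are evenly split across samples, so each run writes
    (n - 1) / n of the input to its unknown file. Over n runs that adds up to
    n * input * (n - 1) / n = input * (n - 1).
    """
    return input_gb * (n_samples - 1)


# Hypothetical numbers: 100 samples sharing 20 GB of multiplexed reads.
print(redundant_output_gb(100, 20.0))
```

Under those assumptions the unknown files alone come to roughly 2 TB for 20 GB of input, compared with essentially none when a single run assigns every read once.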

One lazy solution to the last problem is to just delete those unknown files, but I imagine people might want the option to look at the reads that didn't get assigned.

My solution is to run Cutadapt on all samples at the same time. I also have a module that creates the input files Cutadapt needs to be used this way. I use an mmv module to rename the files so that they contain their sample names, then create a new sample sheet with the original sample sheet data plus the paths to the new fastq files. Lastly, I run the new sample sheet through nf-core's samplesheet_check module (I believe Ampliseq uses parse_input instead), which outputs a channel in the [ meta, fastqs ] format that's compatible with other nf-core modules (e.g., fastqc). It feels a bit hacky to create a new sample sheet in the pipeline and then re-use the samplesheet_check module, but that was the easiest solution I found, and I lack the Nextflow experience to come up with a better one.
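For anyone unfamiliar with running Cutadapt this way, here is a minimal sketch of the single-run idea for the single-end case. It is not the module described above: the helper names, sample names, barcodes, and file names are all hypothetical. It relies on Cutadapt's `file:` syntax for loading adapters from a FASTA file and on the `{name}` placeholder in the output path.

```python
import shlex


def barcode_fasta(samples: dict[str, str]) -> str:
    """Return FASTA text with one record per sample (record name = sample name)."""
    return "".join(f">{name}\n{barcode}\n" for name, barcode in samples.items())


def cutadapt_command(barcodes_path: str, reads_path: str) -> list[str]:
    """Build one Cutadapt call that demultiplexes all samples in a single pass."""
    # "-g file:..." loads every 5' adapter (here: barcode) from the FASTA file.
    # "{name}" in the output path is replaced by the name of the matching record;
    # reads that match no barcode end up in a file named "unknown".
    return ["cutadapt", "-g", f"file:{barcodes_path}",
            "-o", "{name}.fastq.gz", reads_path]


# Hypothetical samples and barcodes, for illustration only.
samples = {"sampleA": "ACGTAC", "sampleB": "TGCATG"}
print(barcode_fasta(samples), end="")
print(shlex.join(cutadapt_command("barcodes.fasta", "multiplexed.fastq.gz")))
```

If the FASTA record names are the sample names, the output files already carry the sample names, which might make a separate renaming step unnecessary. Whether that holds for paired-end data and dual barcodes would need checking.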

@a4000 a4000 added the enhancement New feature or request label Aug 2, 2023
@d4straub
Collaborator

d4straub commented Aug 2, 2023

Hm, maybe it would be possible to:

  • if any of the demultiplexing columns are provided, activate demultiplexing mode; the demultiplexing info is saved with the other metadata in the meta map per sample
  • in demultiplexing mode, aggregate the channel by sample files (because every sample points to the identical read file(s)), e.g. using groupTuple, i.e. one channel element per file
  • convert the grouped meta map (which contains each sample with its indexes etc.) to info that cutadapt can use in the process script section, process the files, and output all demultiplexed read files together with the meta map
  • somehow group meta with files again, so that there is one channel element per sample

Admittedly, I have no idea how the last two steps would work; whether they are possible with channel operators needs to be tested. If not, my whole idea is of course not feasible.
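The data flow in the four steps above can be sketched in plain Python. This is only an analogy for the channel logic, not Nextflow code, and it says nothing about whether the operators behave this way in practice. The sample names, barcodes, file names, and the rule that each demultiplexed file is named after its sample id are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical samplesheet rows as (meta, reads file); both samples share one file.
rows = [
    ({"id": "sampleA", "barcode": "ACGTAC"}, "run1.fastq.gz"),
    ({"id": "sampleB", "barcode": "TGCATG"}, "run1.fastq.gz"),
]


def group_by_file(rows):
    """One element per reads file, carrying the metas of all samples in it."""
    grouped = defaultdict(list)
    for meta, reads in rows:
        grouped[reads].append(meta)
    return list(grouped.items())


def split_per_sample(metas, demux_files):
    """Re-pair each meta with its demultiplexed file, matched on the sample id."""
    # Assumes each file is named "<sample id>.fastq.gz".
    by_id = {path.split(".")[0]: path for path in demux_files}
    return [(meta, by_id[meta["id"]]) for meta in metas]


for reads, metas in group_by_file(rows):
    # Stand-in for the demultiplexing process: one output file per sample id.
    demux_files = [f"{meta['id']}.fastq.gz" for meta in metas]
    for meta, path in split_per_sample(metas, demux_files):
        print(meta["id"], path)
```

In Nextflow the regrouping step might map onto something like flatMap or transpose after the process, matching on the sample id in the file name, but that is untested here.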
