A pipeline to process Nanopore reads and transfer the results to the end users.
git clone git@github.com:maxplanck-ie/nanoporeReads_dataTransfer.git
cd nanoporeReads_dataTransfer
mamba env create -n ont -f env.yaml
mamba activate ont
pip install .
For Apple M1/M2 (arm64) many conda packages are not yet available. Use instead:
CONDA_SUBDIR=osx-64 mamba env create -n ont -f env.yaml
The key functionality is implemented as snakemake workflows. From version 2.0.0 onwards, two snakemake rule sets are supported, each centered on a different basecaller:
- rules: a guppy-based workflow (legacy). It also works with the dorado basecaller, but only for non-multiplexed data. These rules will not be maintained.
- rules_dorado: a dorado-based workflow. This does not work with guppy.
A wrapper Python script (ont.py) implements:
- the continuous screening of the source directory,
- the generation of a flowcell-specific configuration file, and
- the communication with end users (emails etc.).
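The screening step can be pictured with a minimal sketch. It is not taken from ont.py. It assumes that a flowcell counts as processed once its work directory contains the analysis.done flag, and that a flowcell directory has the same name in the source and work directories. The function name find_unprocessed_flowcells is hypothetical.

```python
from pathlib import Path


def find_unprocessed_flowcells(offload_dir, output_dir, flag="analysis.done"):
    """Return names of flowcell directories that have no done flag yet.

    Assumption (not taken from ont.py): a flowcell is treated as processed
    when output_dir/<flowcell>/<flag> exists.
    """
    offload_dir, output_dir = Path(offload_dir), Path(output_dir)
    pending = []
    for entry in sorted(offload_dir.iterdir()):
        if not entry.is_dir():
            continue  # ignore stray files in the source directory
        if not (output_dir / entry.name / flag).exists():
            pending.append(entry.name)
    return pending
```

A wrapper could call such a function in a loop, sleeping between rounds, and start one workflow per returned flowcell.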
The main configuration file (config.yaml) specifies:
- the path of the rule set to be used (rulesPath: rules or rules_dorado),
- the overall directory structure (see below),
- organism-specific paths (e.g. genome and transcriptome locations),
- communication settings (email, Parkour LIMS, sambahost),
- generic parameters (basecalling, mapping).
Note that the generic configuration defined by this file is expanded by project-specific entries for each incoming flowcell.
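One way to picture this expansion is a recursive merge of two dictionaries, sketched below. This is an illustration only and not the pipeline's actual code: the function expand_config and every key except rulesPath are hypothetical.

```python
import copy


def expand_config(generic, flowcell_entries):
    """Return a new dict: the generic config overlaid with flowcell entries.

    Nested dictionaries are merged key by key; any other value from
    flowcell_entries replaces the generic one. Inputs are left unchanged.
    """
    merged = copy.deepcopy(generic)
    for key, value in flowcell_entries.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = expand_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical generic settings; only rulesPath is a key named in this README.
generic = {
    "rulesPath": "rules_dorado",
    "mapping": {"threads": 8},
}
# Hypothetical entries derived from one incoming flowcell.
flowcell_entries = {
    "flowcell": "example_flowcell",
    "mapping": {"organism": "mouse"},
}
flowcell_config = expand_config(generic, flowcell_entries)
```

Returning a fresh dictionary keeps the generic settings untouched, so the same base configuration can be reused for the next flowcell.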
Additional configuration files are:
- env.yaml (for conda installation of all dependencies)
- multiqc_config.yaml (to customize multiqc output)
ont -c config.yaml
The workflow relies on three main data locations:
- A source directory (offloadDir) is screened for the arrival of new and unprocessed flowcells.
- A work directory (outputDir) is used for the various processing steps (merging, basecalling, demultiplexing, alignment, quality controls).
- A target directory (groupDir) receives the analysis results in a project-wise manner.
The details depend on the rule set. Annotated examples for rules_dorado are given below.
This directory is generated by the sequencing machine and may change in response to technological developments.
../path/to/flowcell/
.
├── bam_pass # from fast basecalling
├── barcode_alignment_PAS33554_6b0029ab_a0fbcf5b.tsv
├── fastq_pass # from fast basecalling
├── final_summary_PAS33554_6b0029ab_a0fbcf5b.txt
├── other_reports
├── pod5_pass # pod5 format
├── pore_activity_PAS33554_6b0029ab_a0fbcf5b.csv
├── report_PAS33554_20230928_1016_6b0029ab.html
├── report_PAS33554_20230928_1016_6b0029ab.json
├── report_PAS33554_20230928_1016_6b0029ab.md
├── SampleSheet.csv # sample sheet information
├── sample_sheet_PAS33554_20230928_1016_6b0029ab.csv
├── sequencing_summary_PAS33554_6b0029ab_a0fbcf5b.txt
└── throughput_PAS33554_6b0029ab_a0fbcf5b.csv
../path/to/flowcell
.
├── analysis.done # flag to signal that this flowcell has been fully processed
├── bam # output from basecalling in bam format (including modification calls)
├── bam_demux # demultiplexed samples (empty if no barcoding)
├── benchmarks # benchmarks for each rule
├── benchmarks_combined.tsv # combined benchmark file
├── flags # directory with flags from snakemake rules
├── log # log files (rule-specific)
├── pipeline_config.yaml # configfile (snakemake & more)
├── pod5 # directory with merged pod5 file (from offloadDir)
├── reports # directory with reports and SampleSheet.csv (from offloadDir)
├── summary # summary files (DAG, disk status)
└── transfer # analysis output that will be transferred
transfer/
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna # analysis directory (exists only if genome is known)
    │   ├── 23L000329_WT_rep1.align.bam # alignment
    │   ├── 23L000329_WT_rep1.align.bam.bai # index
    │   └── 23L000329_WT_rep1.align.bed.gz # modification calls
    ├── Data
    │   ├── 23L000329_WT_rep1.bam # basecalled sequences
    │   ├── 23L000329_WT_rep1.fastq.gz # basecalled sequences (fastq - deprecated)
    │   ├── 23L000329_WT_rep1_porechop.fastq.gz # adaptors, barcodes trimmed
    │   └── 23L000329_WT_rep1.seqsum # sequencing summaries (for pycoQC etc.)
    └── QC
        ├── multiqc
        │   ├── multiqc_data
        │   └── multiqc_report.html # multiqc report
        ├── sample_names.tsv # dictionary sampleID-sampleName
        └── Samples # sample-wise quality controls
            ├── 23L000329_WT_rep1.align.flagstat
            ├── 23L000329_WT_rep1.align_pycoqc.html
            ├── 23L000329_WT_rep1.align_pycoqc.json
            ├── 23L000329_WT_rep1_fastqc.html
            ├── 23L000329_WT_rep1_fastqc.zip
            ├── 23L000329_WT_rep1_kraken.report
            ├── 23L000329_WT_rep1_porechop.info
            ├── 23L000329_WT_rep1_pycoqc.html
            ├── 23L000329_WT_rep1_pycoqc.json
            ├── all_porechop.best_end
            ├── all_porechop.best_start
            └── all_porechop.trimmed
../user_path/to/flowcell/ (identical to outputDir/transfer)
.
├── metadata.yaml
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna
    ├── Data
    └── QC
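To check a delivered project folder against the layout above, a small helper can count the files below each top-level subdirectory (Analysis_*, Data, QC). This is a minimal sketch and not part of the pipeline; the function name summarize_project is hypothetical.

```python
from pathlib import Path


def summarize_project(project_dir):
    """Count regular files below each top-level subdirectory of a project folder.

    Files that sit directly in project_dir are not counted.
    """
    summary = {}
    for sub in sorted(Path(project_dir).iterdir()):
        if sub.is_dir():
            # rglob("*") walks the subdirectory recursively
            summary[sub.name] = sum(1 for p in sub.rglob("*") if p.is_file())
    return summary
```

An empty or missing Data or QC entry in the returned dictionary would be a quick hint that a transfer is incomplete.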
This is the legacy structure used by older, guppy-based versions of the pipeline.
The source directory (offloadDir) is as above.
../path/to/flowcell
.
├── analysis.done
├── fastq
├── FASTQC_Project_2913_Falk_DeepSeq
├── flags
├── log
├── pipeline_config.yaml
├── pod5
├── Project_2913_Falk_DeepSeq
├── reports
├── sequencing_summary_0.txt
├── sequencing_summary_barcode10.txt
├── sequencing_summary_barcode11.txt
├── sequencing_summary_unclassified.txt
└── tmp
../user_path/to/flowcell/
.
├── Analysis_projectID_user_group
│   └── dna_drosophila
│       ├── 2901_23S004286_control.bam
│       ├── 2901_23S004286_control.bam.bai
│       ├── 2901_23S004286_control.html
│       ├── 2901_23S004286_control.json
│       ├── 2901_23S004286_fragmented.bam
│       ├── 2901_23S004286_fragmented.bam.bai
│       ├── 2901_23S004286_fragmented.html
│       └── 2901_23S004286_fragmented.json
├── FASTQC_projectID_user_group
│   ├── multiqc
│   │   ├── multiqc_data
│   │   └── multiqc_report.html
│   ├── Sample_23L004429
│   │   ├── 2901_23S004286_fragmented_fastqc.html
│   │   ├── 2901_23S004286_fragmented_fastqc.zip
│   │   ├── 2901_23S004286_fragmented_kraken.report
│   │   ├── 2901_23S004286_fragmented_porechop.info
│   │   ├── 2901_23S004286_fragmented_pycoqc.html
│   │   └── 2901_23S004286_fragmented_pycoqc.json
│   └── Sample_23L004430
│       ├── 2901_23S004286_control_fastqc.html
│       ├── 2901_23S004286_control_fastqc.zip
│       ├── 2901_23S004286_control_kraken.report
│       ├── 2901_23S004286_control_porechop.info
│       ├── 2901_23S004286_control_pycoqc.html
│       └── 2901_23S004286_control_pycoqc.json
├── metadata.yaml
└── Project_projectID_user_group
    ├── Sample_23L004429
    │   ├── 2901_23S004286_fragmented.fastq.gz
    │   ├── pass
    │   └── sequencing_summary_2901_23S004286_fragmented.txt
    └── Sample_23L004430
        ├── 2901_23S004286_control.fastq.gz
        ├── pass
        └── sequencing_summary_2901_23S004286_control.txt