nanoporeReads_dataTransfer

A pipeline to process Nanopore reads and transfer the results to the end users.

Installation

git clone git@github.com:maxplanck-ie/nanoporeReads_dataTransfer.git
cd nanoporeReads_dataTransfer
mamba env create -n ont -f env.yaml 
mamba activate ont
pip install .

On Apple M1/M2 (arm64) many conda packages are not yet available. Create the environment with the osx-64 subdirectory instead:

CONDA_SUBDIR=osx-64 mamba create -n ont -f env.yaml

Implementation

The key functionality is implemented as snakemake workflows. From version 2.0.0 onwards, two snakemake rule sets are supported, each centered on a different basecaller:

rules: a guppy-based workflow (legacy). It also works with the dorado basecaller, but only for non-multiplexed data. These rules will not be maintained.
rules_dorado: a dorado-based workflow. This does not work with guppy.

A wrapper Python script (ont.py) implements:

the continuous screening of the source directory,
the generation of a flowcell-specific configuration file, and
the communication with the end user (emails etc.).
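
The screening step could look roughly like the sketch below. It assumes (this is not confirmed by the repository) that each flowcell is a sub-directory of offloadDir, that the work directory under outputDir reuses the flowcell directory name, and that the analysis.done flag shown in the directory listings further down marks a finished flowcell. The function name find_unprocessed_flowcells is hypothetical; ont.py may be organized differently.

```python
from pathlib import Path


def find_unprocessed_flowcells(offload_dir, output_dir):
    """List flowcell directories in offload_dir without an analysis.done flag.

    Hypothetical helper: the layout (one sub-directory per flowcell, same
    name in both locations) is an assumption made for illustration.
    """
    offload_dir, output_dir = Path(offload_dir), Path(output_dir)
    pending = []
    for flowcell in sorted(offload_dir.iterdir()):
        if not flowcell.is_dir():
            continue  # ignore stray files in the source directory
        done_flag = output_dir / flowcell.name / "analysis.done"
        if not done_flag.exists():
            pending.append(flowcell)
    return pending
```

A wrapper would call such a function repeatedly (for example, once per polling interval) and start the snakemake workflow for each returned directory.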

Configurations

The main configuration file (config.yaml) specifies:

the path of the rule set to be used (rulesPath: rules or rules_dorado),
the overall directory structure (see below),
organism-specific paths (e.g. genome and transcriptome locations),
communication settings (email, Parkour LIMS, sambahost), and
generic parameters (basecalling, mapping).

Note that the generic configuration defined by this file is expanded with project-specific entries for each incoming flowcell.
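
As an illustration of how a generic configuration can be expanded per flowcell, here is a minimal sketch of a recursive dictionary merge. The function name expand_config and every key except rulesPath are hypothetical; the actual keys and merge logic used by ont.py are not described in this README.

```python
import copy


def expand_config(generic, flowcell_entries):
    """Return a new dict: generic config overlaid with flowcell-specific entries.

    Nested dictionaries are merged key by key; any other value from
    flowcell_entries replaces the generic one. Inputs are left unchanged.
    """
    merged = copy.deepcopy(generic)
    for key, value in flowcell_entries.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = expand_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical example values, for illustration only.
generic = {"rulesPath": "rules_dorado", "mapping": {"threads": 8}}
flowcell = {"project": "Project_projectID_User_Group", "mapping": {"organism": "mouse"}}
pipeline_config = expand_config(generic, flowcell)
```

The merged dictionary keeps the generic settings and gains the flowcell-specific ones, which matches the idea of one pipeline_config.yaml per flowcell.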

Additional configuration files are:

env.yaml (for conda installation of all dependencies)
multiqc_config.yaml (to customize multiqc output)

Usage

ont -c config.yaml

Directory structures

The workflow connects and relies on three main data locations:

A source directory (offloadDir) is screened for newly arrived, unprocessed flowcells.
A work directory (outputDir) is used for the processing steps (merging, basecalling, demultiplexing, alignment, quality control).
A target directory (groupDir) receives the analysis results project by project.

The details depend on the rule set. Annotated examples for rules_dorado are given below.

Example input path (`offloadDir`)

This directory is generated by the sequencing machine and may change in response to technological developments.

../path/to/flowcell/
.
├── bam_pass            # from fast basecalling
├── barcode_alignment_PAS33554_6b0029ab_a0fbcf5b.tsv
├── fastq_pass          # from fast basecalling
├── final_summary_PAS33554_6b0029ab_a0fbcf5b.txt
├── other_reports
├── pod5_pass           # pod5 format
├── pore_activity_PAS33554_6b0029ab_a0fbcf5b.csv
├── report_PAS33554_20230928_1016_6b0029ab.html
├── report_PAS33554_20230928_1016_6b0029ab.json
├── report_PAS33554_20230928_1016_6b0029ab.md
├── SampleSheet.csv     # sample sheet information
├── sample_sheet_PAS33554_20230928_1016_6b0029ab.csv
├── sequencing_summary_PAS33554_6b0029ab_a0fbcf5b.txt
└── throughput_PAS33554_6b0029ab_a0fbcf5b.csv
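
The report file names above appear to encode run identifiers. The sketch below extracts them with a regular expression. The interpretation of the fields (flowcell ID, start date, start time, short run ID) is inferred from the example names only and is an assumption; the function name parse_report_name is hypothetical, and the naming scheme may change with the sequencing software.

```python
import re

# Pattern inferred from names such as report_PAS33554_20230928_1016_6b0029ab.html
REPORT_NAME = re.compile(
    r"^report_(?P<flowcell>[A-Za-z0-9]+)_(?P<date>\d{8})_(?P<time>\d{4})"
    r"_(?P<run>[0-9a-f]+)\.(?P<ext>html|json|md)$"
)


def parse_report_name(filename):
    """Return the fields encoded in a report file name, or None if it does not match."""
    match = REPORT_NAME.match(filename)
    return match.groupdict() if match else None


fields = parse_report_name("report_PAS33554_20230928_1016_6b0029ab.html")
# fields["flowcell"] is "PAS33554" and fields["date"] is "20230928"
```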

Example output path during processing (`outputDir`)

../path/to/flowcell
.
├── analysis.done            # flag to signal that this flowcell has been fully processed
├── bam                      # output from basecalling in bam format (including modification calls)
├── bam_demux                # demultiplexed samples (empty if no barcoding)
├── benchmarks               # benchmarks for each rule
├── benchmarks_combined.tsv  # combined benchmark file
├── flags                    # directory with flags from snakemake rules
├── log                      # log files (rule-specific)
├── pipeline_config.yaml     # configfile (snakemake & more)
├── pod5                     # directory with merged pod5 file (from offloadDir)
├── reports                  # directory with reports and SampleSheet.csv (from offloadDir)
├── summary                  # summary files (DAG, disk status)
└── transfer                 # analysis output that will be transferred

transfer/
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna                    # analysis directory (exists only if genome is known)
    │   ├── 23L000329_WT_rep1.align.bam       # alignment
    │   ├── 23L000329_WT_rep1.align.bam.bai   # index
    │   └── 23L000329_WT_rep1.align.bed.gz    # modification calls
    ├── Data
    │   ├── 23L000329_WT_rep1.bam             # basecalled sequences
    │   ├── 23L000329_WT_rep1.fastq.gz        # basecalled sequences (fastq - deprecated)
    │   ├── 23L000329_WT_rep1_porechop.fastq.gz # adapters and barcodes trimmed
    │   └── 23L000329_WT_rep1.seqsum          # sequencing summaries (for pycoQC etc.)
    └── QC
        ├── multiqc
        │   ├── multiqc_data
        │   └── multiqc_report.html            # multiqc report
        ├── sample_names.tsv                   # dictionary sampleID-sampleName
        └── Samples                            # sample-wise quality controls
            ├── 23L000329_WT_rep1.align.flagstat
            ├── 23L000329_WT_rep1.align_pycoqc.html
            ├── 23L000329_WT_rep1.align_pycoqc.json
            ├── 23L000329_WT_rep1_fastqc.html
            ├── 23L000329_WT_rep1_fastqc.zip
            ├── 23L000329_WT_rep1_kraken.report
            ├── 23L000329_WT_rep1_porechop.info
            ├── 23L000329_WT_rep1_pycoqc.html
            ├── 23L000329_WT_rep1_pycoqc.json
            ├── all_porechop.best_end
            ├── all_porechop.best_start
            └── all_porechop.trimmed

Example output path for an end user (`groupDir`)

../user_path/to/flowcell/  (identical to outputDir/transfer)
.
├── metadata.yaml
└── Project_projectID_User_Group
    ├── Analysis_mouse_dna
    ├── Data
    └── QC
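
Since the end-user directory mirrors outputDir/transfer, the delivery step can be pictured as a recursive copy. The following is only a sketch under that assumption: the function name deliver_results is hypothetical, the pipeline may use a different mechanism, and the sketch does not create the metadata.yaml file shown above.

```python
import shutil
from pathlib import Path


def deliver_results(flowcell_workdir, group_flowcell_dir):
    """Copy <flowcell_workdir>/transfer into group_flowcell_dir.

    Returns the names of the top-level directories found in the target
    afterwards (typically the Project_* directories).
    """
    source = Path(flowcell_workdir) / "transfer"
    target = Path(group_flowcell_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"no transfer directory in {flowcell_workdir}")
    # dirs_exist_ok requires Python 3.8+; files already present are overwritten.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return sorted(p.name for p in target.iterdir() if p.is_dir())
```

Copying into an existing target makes the step repeatable, at the cost of silently overwriting files from an earlier delivery.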

Directory structures (with rules)

This is a legacy structure for older, guppy-based versions of the pipeline.

Example input path (`offloadDir`)

Same as above.

Example output path during processing (`outputDir`)

../path/to/flowcell
.
├── analysis.done
├── fastq
├── FASTQC_Project_2913_Falk_DeepSeq
├── flags
├── log
├── pipeline_config.yaml
├── pod5
├── Project_2913_Falk_DeepSeq
├── reports
├── sequencing_summary_0.txt
├── sequencing_summary_barcode10.txt
├── sequencing_summary_barcode11.txt
├── sequencing_summary_unclassified.txt
└── tmp

Example output path for an end user (`groupDir`)

../user_path/to/flowcell/
.
├── Analysis_projectID_user_group
│   └── dna_drosophila
│       ├── 2901_23S004286_control.bam
│       ├── 2901_23S004286_control.bam.bai
│       ├── 2901_23S004286_control.html
│       ├── 2901_23S004286_control.json
│       ├── 2901_23S004286_fragmented.bam
│       ├── 2901_23S004286_fragmented.bam.bai
│       ├── 2901_23S004286_fragmented.html
│       └── 2901_23S004286_fragmented.json
├── FASTQC_projectID_user_group
│   ├── multiqc
│   │   ├── multiqc_data
│   │   └── multiqc_report.html
│   ├── Sample_23L004429
│   │   ├── 2901_23S004286_fragmented_fastqc.html
│   │   ├── 2901_23S004286_fragmented_fastqc.zip
│   │   ├── 2901_23S004286_fragmented_kraken.report
│   │   ├── 2901_23S004286_fragmented_porechop.info
│   │   ├── 2901_23S004286_fragmented_pycoqc.html
│   │   └── 2901_23S004286_fragmented_pycoqc.json
│   └── Sample_23L004430
│       ├── 2901_23S004286_control_fastqc.html
│       ├── 2901_23S004286_control_fastqc.zip
│       ├── 2901_23S004286_control_kraken.report
│       ├── 2901_23S004286_control_porechop.info
│       ├── 2901_23S004286_control_pycoqc.html
│       └── 2901_23S004286_control_pycoqc.json
├── metadata.yaml
└── Project_projectID_user_group
    ├── Sample_23L004429
    │   ├── 2901_23S004286_fragmented.fastq.gz
    │   ├── pass
    │   └── sequencing_summary_2901_23S004286_fragmented.txt
    └── Sample_23L004430
        ├── 2901_23S004286_control.fastq.gz
        ├── pass
        └── sequencing_summary_2901_23S004286_control.txt
