GPS Pipeline

The GPS Pipeline is a Nextflow pipeline designed for processing raw reads (FASTQ files) of Streptococcus pneumoniae samples. After preprocessing, the pipeline performs initial assessment based on the total bases in reads. Passed samples will be further assess based on assembly, mapping, and taxonomy. If the sample passes all quality controls (QC), the pipeline also provides the sample's serotype, multi-locus sequence typing (MLST), lineage (based on the Global Pneumococcal Sequence Cluster (GPSC)), and antimicrobial resistance (AMR) against multiple antimicrobials.

The pipeline is designed to be easy to set up and use, and is suitable for use on local machines and high-performance computing (HPC) clusters alike. Additionally, the pipeline only downloads essential files to enable the analysis, and no data is uploaded from the local environment, making it an ideal option for cases where the FASTQ files being analysed is confidential. After initialisation or the first successful complete run, the pipeline can be used offline unless you have changed the selection of any database or container image.

The development of this pipeline is part of the GPS Project (Global Pneumococcal Sequencing Project).

Workflow

Usage

Requirements

Software

A POSIX-compatible operating system (e.g. Linux, macOS, Windows with WSL) with Bash 3.2 or later
- Installation guide for WSL on Windows by Microsoft
Java 11 or later (up to 21) (OpenJDK/Oracle Java)
- Installation guide for OpenJDK by freeCodeCamp
Docker or Singularity/Apptainer
- Installation guides:
  - For Linux
    
    ℹ️ For Linux, Docker Engine or Singularity/Apptainer is recommended over Docker Desktop. The latter is known to cause permission issues when running the pipeline on Linux.
    - Docker Engine on Linux by Docker
    - Apptainer on Linux by Apptainer
  - For macOS
    - Docker Desktop on macOS by Docker
      
      ℹ️ After installation, you might need to allow Docker to access more system resources, especially CPU and Memory, to match the hardware requirement of the pipeline
  - For Windows with WSL
    - Docker Desktop on Windows with WSL by Docker

Hardware

It is recommended to have at least 16GB of RAM and 50GB of free storage

ℹ️ Details on storage

The pipeline core files use ~5MB

All default databases use ~8GB in total

All Docker images use ~13GB in total; alternatively, Singularity images use ~4.5GB in total

The pipeline generates ~1.8GB intermediate files for each sample on average
(These files can be removed when the pipeline run is completed, please refer to Clean Up)
(To further reduce storage requirement by sacrificing the ability to resume the pipeline, please refer to Experimental)

Accepted Inputs

Only Illumina paired-end short reads are supported
Each sample is expected to be a pair of raw reads following this file name pattern:
- *_{,R}{1,2}{,_001}.{fq,fastq}{,.gz}
  - example 1: SampleName_R1_001.fastq.gz, SampleName_R2_001.fastq.gz
  - example 2: SampleName_1.fastq.gz, SampleName_2.fastq.gz
  - example 3: SampleName_R1.fq, SampleName_R2.fq

Setup

Clone the repository (if Git is installed on your system)
```
git clone https://github.com/sanger-bentley-group/gps-pipeline.git
```
or

Download and unzip/extract the latest release
Go into the local directory of the pipeline and it is ready to use without installation (the directory name might be different)
```
cd gps-pipeline
```
(Optional) You could perform an initialisation to download all required additional files and container images, so the pipeline can be used at any time with or without the Internet afterwards.

⚠️ Docker or Singularity must be running, and an Internet connection is required.
- Using Docker as the container engine
```
./run_pipeline --init
```
- Using Singularity as the container engine
```
./run_pipeline --init -profile singularity
```

Run

⚠️ Docker or Singularity must be running.

⚠️ If this is the first run and initialisation was not performed, an Internet connection is required.

ℹ️ By default, Docker is used as the container engine and all the processes are executed by the local machine. See Profile for details on running the pipeline with Singularity or on a HPC cluster.

You can run the pipeline without options. It will attempt to get the raw reads from the default location (i.e. input directory inside the gps-pipeline local directory)
```
./run_pipeline
```
You can also specify the location of the raw reads by adding the --reads option
```
./run_pipeline --reads /path/to/raw-reads-directory
```
For a test run, you could obtain a small test dataset by running the included download_test_input script. The dataset will be saved to the test_input directory inside the pipeline local directory. You can then run the pipeline on the test data
```
./download_test_input
./run_pipeline --reads test_input
```
- 9870_5#52 will fail the Taxonomy QC and hence Overall QC, therefore without analysis results
- 17175_7#59 and 21127_1#156 should pass Overall QC, therefore with analysis results

Profile

By default, Docker is used as the container engine and all the processes are executed by the local machine. To change this, you could use Nextflow's built-in -profile option to switch to other available profiles

ℹ️ -profile is a built-in Nextflow option, it only has one leading -
```
./run_pipeline -profile [profile name]
```

Available profiles:

Profile Name	Details
`standard` (Default)	Docker is used as the container engine. Processes are executed locally.
`singularity`	Singularity is used as the container engine. Processes are executed locally.
`lsf`	The pipeline should be launched from a LSF cluster head node with this profile. Singularity is used as the container engine. Processes are submitted to your LSF cluster via `bsub` by the pipeline. (Tested on Wellcome Sanger Institute farm5 LSF cluster only) (Option `--kraken2_memory_mapping` default change to `false`.)

Resume

If the pipeline is interrupted mid-run, Nextflow's built-in -resume option can be used to resume the pipeline execution instead of starting from scratch again
You should use the same command of the original run, only add -resume at the end (i.e. all pipeline options should be identical)

ℹ️ -resume is a built-in Nextflow option, it only has one leading -
- If the original command is
```
./run_pipeline --reads /path/to/raw-reads-directory
```
- The command to resume the pipeline execution should be
```
./run_pipeline --reads /path/to/raw-reads-directory -resume
```

Clean Up

During the run of the pipeline, Nextflow generates a considerable amount of intermediate files
If the run has been completed and you do not intend to use the -resume option or those intermediate files, you can remove the intermediate files using one of the following ways:
- Run the included clean_pipeline script
  - It runs the commands in manual removal for you
  - It removes the work directory and log files within the gps-pipeline local directory
```
./clean_pipeline
```
- Manual removal
  - Remove the work directory and log files within the gps-pipeline local directory
```
rm -rf work
rm -rf .nextflow.log*
```
- Run nextflow clean command
  - This built-in command cleans up cache and work directories
  - By default, it only cleans up the latest run
  - For details and available options of nextflow clean, refer to the Nextflow documentation
```
./nextflow clean
```

Nextflow Tower (Optional)

The pipeline is compatible with Launchpad of Nextflow tower and Nextflow -with-tower option. For more information, please refer to the Nextflow Tower documentation.

Pipeline Options

The tables below contain the available options that can be used when you run the pipeline
Usage:
```
./run_pipeline [option] [value]
```

ℹ️ To permanently change the value of an option, edit the nextflow.config file inside the gps-pipeline local directory.

ℹ️ $projectDir is a Nextflow built-in implicit variables, it is defined as the local directory of gps-pipeline.

ℹ️ Pipeline options are not built-in Nextflow options, they are lead with -- instead of -

Alternative Workflows

Option	Values	Description
`--init`	`true` or `false` (Default: `false`)	Use alternative workflow for initialisation, which means downloading all required additional files and container images, and creating databases. Can be enabled by including `--init` without value.
`--version`	`true` or `false` (Default: `false`)	Use alternative workflow for showing versions of pipeline, container images, tools and databases. Can be enabled by including `--version` without value. (This workflow pulls the required container images if they are not yet available locally)
`--help`	`true` or `false` (Default: `false`)	Show help message. Can be enabled by including `--help` without value.

Input and Output

⚠️ --output overwrites existing results in the target directory if there is any

⚠️ --db does not accept user provided local databases, directory content will be overwritten

Option	Values	Description
`--reads`	Any valid path (Default: `"$projectDir/input"`)	Path to the input directory that contains the reads to be processed.
`--output`	Any valid path (Default: `"$projectDir/output"`)	Path to the output directory that save the results.
`--db`	Any valid path (Default: `"$projectDir/databases"`)	Path to the directory saving databases used by the pipeline.
`--assembly_publish`	`"link"` or `"symlink"` or `"copy"` (Default: `"link"`)	Method used by Nextflow to publish the generated assemblies. (The default setting `"link"` means hard link, therefore will fail if the output directory is set to outside of the working file system)

QC Parameters

ℹ️ Read QC does not have directly accessible parameters. The minimum base count in reads of Read QC is based on the multiplication of --length_low and --depth of Assembly QC (i.e. default value is 38000000).

Option	Values	Description
`--spneumo_percentage`	Any integer or float value (Default: `60.00`)	Minimum S. pneumoniae percentage in reads to pass Taxonomy QC.
`--non_strep_percentage`	Any integer or float value (Default: `2.00`)	Maximum non-Streptococcus genus percentage in reads to pass Taxonomy QC.
`--ref_coverage`	Any integer or float value (Default: `60.00`)	Minimum reference coverage percentage by the reads to pass Mapping QC.
`--het_snp_site`	Any integer value (Default: `220`)	Maximum non-cluster heterozygous SNP (Het-SNP) site count to pass Mapping QC.
`--contigs`	Any integer value (Default: `500`)	Maximum contig count in assembly to pass Assembly QC.
`--length_low`	Any integer value (Default: `1900000`)	Minimum assembly length to pass Assembly QC.
`--length_high`	Any integer value (Default: `2300000`)	Maximum assembly length to pass Assembly QC.
`--depth`	Any integer or float value (Default: `20.00`)	Minimum sequencing depth to pass Assembly QC.

Assembly

ℹ️ The output of SPAdes-based assembler is deterministic for a given count of threads. Hence, using --assembler_thread with a specific value can guarantee the generated assemblies will be reproducible for others using the same value.

Option	Values	Description
`--assembler`	`"shovill"` or `"unicycler"` (Default: `"shovill"`)	Using which SPAdes-based assembler to assemble the reads.
`--assembler_thread`	Any integer value (Default: `0`)	Number of threads used by the assembler. `0` means all available.
`--min_contig_length`	Any integer value (Default: `500`)	Minimum legnth of contig to be included in the assembly.

Mapping

Option	Values	Description
`--ref_genome`	Any valid path to a `.fa` or `.fasta` file (Default: `"$projectDir/data/ATCC_700669_v1.fa"`)	Path to the reference genome for mapping.

Taxonomy

Option	Values	Description
`--kraken2_db_remote`	Any valid URL to a Kraken2 database in `.tar.gz` or `.tgz` format (Default: Minikraken v1)	URL to a Kraken2 database.
`--kraken2_memory_mapping`	`true` or `false` (Default: `true`)	Using the memory mapping option of Kraken2 or not. `true` means not loading the database into RAM, suitable for memory-limited or fast storage environments.

Serotype

Option	Values	Description
`--seroba_db_remote`	Any valid URL to a SeroBA release in `.tar.gz` or `.tgz` format (Default: SeroBA v1.0.7)	URL to a SeroBA release.
`--seroba_kmer`	Any integer value (Default: `71`)	Kmer size for creating the KMC database of SeroBA.

Lineage

Option	Values	Description
`--poppunk_db_remote`	Any valid URL to a PopPUNK database in `.tar.gz` or `.tgz` format (Default: GPS v8 - Reference Only)	URL to a PopPUNK database.
`--poppunk_ext_remote`	Any valid URL to a PopPUNK external clusters file in `.csv` format (Default: GPS v8 GPSC Designation)	URL to a PopPUNK external clusters file.

Other AMR

Option	Values	Description
`--ariba_ref`	Any valid path to a `.fa` or `.fasta` file (Default: `"$projectDir/data/ariba_ref_sequences.fasta"`)	Path to the reference sequences for preparing ARIBA database.
`--ariba_metadata`	Any valid path to a `tsv` file (Default: `"$projectDir/data/ariba_metadata.tsv"`)	Path to the metadata file for preparing ARIBA database.
`--resistance_to_mic`	Any valid path to a `tsv` file (Default: `"$projectDir/data/resistance_to_MIC.tsv"`)	Path to the resistance phenotypes to MIC (minimum inhibitory concentration) lookup table.

Singularity

ℹ️ This section is only valid when Singularity is used as the container engine

Option	Values	Description
`--singularity_cachedir`	Any valid path (Default: `"$projectDir/singularity_cache"`)	Path to the directory where Singularity images should be saved to.

Experimental

Option	Values	Description
`--lite`	`true` or `false` (Default: `false`)	⚠️ Enable this option breaks Nextflow resume function. Reduce storage requirement by removing intermediate `.sam` and `.bam` files once they are no longer needed while the pipeline is still running. The quantity of reduction of storage requirement cannot be guaranteed. Can be enabled by including `--lite` without value.

Output

By default, the pipeline outputs the results into the output directory inside the gps-pipeline local directory

It can be changed by adding the option --output

./run_pipeline --output /path/to/output-directory

Output Content

The following directories and files are output into the output directory

Directory / File	Description
`assemblies`	This directory contains all assemblies (`.fasta`) generated by the pipeline
`results.csv`	This file contains all the information generated by the pipeline on each sample
`info.txt`	This file contains information regarding the pipeline and parameters of the run

Details of `results.csv`

The following fields can be found in the output results.csv

ℹ️ The output fields in Other AMR / Virulence type depends on the provided ARIBA reference sequences and metadata file, and resistance phenotypes to MIC lookup table, the below table is based on the defaults.

ℹ️ The inferred Minimum Inhibitory Concentration (MIC) range of an antimicrobial in "Other AMR" type is only provided if it is included in the resistance phenotypes to MIC lookup table. The default lookup table is based on 2014 CLSI guidelines.

ℹ️ For resistance phenotypes: S = Sensitive/Susceptible; I = Intermediate; R = Resistant

ℹ️ For virulence genes: POS = Positive; NEG = Negative

⚠️ If the result of Overall_QC of a sample is READ_ONE_CORRUPTED, READ_TWO_CORRUPTED or both, the specific read file is found to be corrupted (i.e. incomplete/damaged Gzip file, mis-match(s) in read length and quality-score length). You might want to reacquire the read file from its source, or discard the sample if the source file is corrupted as well.

⚠️ If the result of Overall_QC of a sample is ASSEMBLER FAILURE, the assembler has crashed when trying to assembly the reads. You might want to re-run the sample with another assembler, or discard the sample if it is a low quality one.

⚠️ If the result of Serotype of a sample is SEROBA FAILURE, SeroBA has crashed when trying to serotype the sample.

Field	Type	Description
`Sample_ID`	Identification	Sample ID based on the raw reads file name
`Read_QC`	QC	Read quality control result
`Assembly_QC`	QC	Assembly quality control result
`Mapping_QC`	QC	Mapping quality control result
`Taxonomy_QC`	QC	Taxonomy quality control result
`Overall_QC`	QC	Overall quality control result (Based on `Assembly_QC`, `Mapping_QC` and `Taxonomy_QC`)
`Bases`	Read	Number of bases in the reads (Default: ≥ 38 Mb to pass Read QC)
`Contigs#`	Assembly	Number of contigs in the assembly (Default: ≤ 500 to pass Assembly QC)
`Assembly_Length`	Assembly	Total length of the assembly (Default: 1.9 - 2.3 Mb to pass Assembly QC)
`Seq_Depth`	Assembly	Sequencing depth of the assembly (Default: ≥ 20x to pass Assembly QC)
`Ref_Cov_%`	Mapping	Percentage of reference covered by reads (Default: ≥ 60% to pass Mapping QC)
`Het-SNP#`	Mapping	Non-cluster heterozygous SNP (Het-SNP) site count (Default: ≤ 220 to pass Mapping QC)
`S.Pneumo_%`	Taxonomy	Percentage of reads assigned to Streptococcus pneumoniae (Default: ≥ 60% to pass Taxonomy QC)
`Top_Non-Strep_Genus`	Taxonomy	The most abundant non-Streptococcus genus in reads
`Top_Non-Strep_Genus_%`	Taxonomy	Percentage of reads assigned to the most abundant non-Streptococcus genus (Default: ≤ 2% to pass Taxonomy QC)
`GPSC`	Lineage	GPSC Lineage
`Serotype`	Serotype	Serotype
`ST`	MLST	Sequence Type (ST)
`aroE`	MLST	Allele ID of aroE
`gdh`	MLST	Allele ID of gdh
`gki`	MLST	Allele ID of gki
`recP`	MLST	Allele ID of recP
`spi`	MLST	Allele ID of spi
`xpt`	MLST	Allele ID of xpt
`ddl`	MLST	Allele ID of ddl
`pbp1a`	PBP AMR	Allele ID of pbp1a
`pbp2b`	PBP AMR	Allele ID of pbp2b
`pbp2x`	PBP AMR	Allele ID of pbp2x
`AMO_MIC`	PBP AMR	Estimated minimum inhibitory concentration (MIC) of amoxicillin (AMO)
`AMO_Res`	PBP AMR	Inferred resistance phenotype against AMO
`CFT_MIC`	PBP AMR	Estimated MIC of ceftriaxone (CFT)
`CFT_Res(Meningital)`	PBP AMR	Inferred resistance phenotype against CFT in meningital form
`CFT_Res(Non-meningital)`	PBP AMR	Inferred resistance phenotype against CFT in non-meningital form
`TAX_MIC`	PBP AMR	Estimated MIC of cefotaxime (TAX)
`TAX_Res(Meningital)`	PBP AMR	Inferred resistance phenotype against TAX in meningital form
`TAX_Res(Non-meningital)`	PBP AMR	Inferred resistance phenotype against TAX in non-meningital form
`CFX_MIC`	PBP AMR	Estimated MIC of cefuroxime (CFX)
`CFX_Res`	PBP AMR	Inferred resistance phenotype against CFX
`MER_MIC`	PBP AMR	Estimated MIC of meropenem (MER)
`MER_Res`	PBP AMR	Inferred resistance phenotype against MER
`PEN_MIC`	PBP AMR	Estimated MIC of penicillin (PEN)
`PEN_Res(Meningital)`	PBP AMR	Inferred resistance phenotype against PEN in meningital form
`PEN_Res(Non-meningital)`	PBP AMR	Inferred resistance phenotype against PEN in non-meningital form
`CHL_MIC`	Other AMR	Inferred MIC of Chloramphenicol (CHL)
`CHL_Res`	Other AMR	Estimated resistance phenotype against CHL
`CHL_Determinant`	Other AMR	Known determinants that estimated the CHL resistance phenotype
`CLI_MIC`	Other AMR	Inferred MIC of Clindamycin (CLI)
`CLI_Res`	Other AMR	Estimated resistance phenotype against CLI
`CLI_Determinant`	Other AMR	Known determinants that estimated the CLI resistance phenotype
`COT_MIC`	Other AMR	Inferred MIC of Co-Trimoxazole (COT)
`COT_Res`	Other AMR	Estimated resistance phenotype against COT
`COT_Determinant`	Other AMR	Known determinants that estimated the COT resistance phenotype
`DOX_MIC`	Other AMR	Inferred MIC of Doxycycline (DOX)
`DOX_Res`	Other AMR	Estimated resistance phenotype against DOX
`DOX_Determinant`	Other AMR	Known determinants that estimated the DOX resistance phenotype
`ERY_MIC`	Other AMR	Inferred MIC of Erythromycin (ERY)
`ERY_Res`	Other AMR	Estimated resistance phenotype against ERY
`ERY_Determinant`	Other AMR	Known determinants that estimated the ERY resistance phenotype
`ERY_CLI_Res`	Other AMR	Estimated resistance phenotype against Erythromycin (ERY) and Clindamycin (CLI)
`ERY_CLI_Determinant`	Other AMR	Known determinants that estimated the ERY and CLI resistance phenotype
`FQ_Res`	Other AMR	Estimated resistance phenotype against Fluoroquinolones (FQ)
`FQ_Determinant`	Other AMR	Known determinants that estimated the FQ resistance phenotype
`KAN_Res`	Other AMR	Estimated resistance phenotype against Kanamycin (KAN)
`KAN_Determinant`	Other AMR	Known determinants that estimated the KAN resistance phenotype
`LFX_MIC`	Other AMR	Inferred MIC of Levofloxacin (LFX)
`LFX_Res`	Other AMR	Estimated resistance phenotype against LFX
`LFX_Determinant`	Other AMR	Known determinants that estimated the LFX resistance phenotype
`RIF_MIC`	Other AMR	Inferred MIC of Rifampin (RIF)
`RIF_Res`	Other AMR	Estimated resistance phenotype against RIF
`RIF_Determinant`	Other AMR	Known determinants that estimated the RIF resistance phenotype
`SMX_Res`	Other AMR	Estimated resistance phenotype against Sulfamethoxazole (SMX)
`SMX_Determinant`	Other AMR	Known determinants that estimated the SMX resistance phenotype
`TET_MIC`	Other AMR	Inferred MIC of Tetracycline (TET)
`TET_Res`	Other AMR	Estimated resistance phenotype against TET
`TET_Determinant`	Other AMR	Known determinants that estimated the TET resistance phenotype
`TMP_Res`	Other AMR	Estimated resistance phenotype against Trimethoprim (TMP)
`TMP_Determinant`	Other AMR	Known determinants that estimated the TMP resistance phenotype
`VAN_MIC`	Other AMR	Inferred MIC of Vancomycin (VAN)
`VAN_Res`	Other AMR	Estimated resistance phenotype against VAN
`VAN_Determinant`	Other AMR	Known determinants that estimated the VAN resistance phenotype
`PILI1`	Virulence	Expression of PILI-1
`PILI1_Determinant`	Virulence	Known determinants that estimated the PILI-1 expression
`PILI2`	Virulence	Expression of PILI-2
`PILI2_Determinant`	Virulence	Known determinants that estimated the PILI-2 expression

Credits

This project uses open-source components. You can find the homepage or source code of their open-source projects along with license information below. I acknowledge and am grateful to these developers for their contributions to open source.

ARIBA

ARIBA: rapid antimicrobial resistance genotyping directly from sequencing reads Hunt M, Mather AE, Sánchez-Busó L, Page AJ, Parkhill J , Keane JA, Harris SR. Microbial Genomics 2017. doi: 110.1099/mgen.0.000131
License (GPL-3.0): https://github.com/sanger-pathogens/ariba/blob/master/LICENSE
This tool is used in GET_ARIBA_DB and OTHER_RESISTANCE processes of the amr.nf module

BCFtools and SAMtools

Twelve years of SAMtools and BCFtools. Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li. GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008
Licenses
- BCFtools (MIT/Expat or GPL-3.0): https://github.com/samtools/bcftools/blob/develop/LICENSE
- SAMtools (MIT/Expat): https://github.com/samtools/samtools/blob/develop/LICENSE
These tools are used in SAM_TO_SORTED_BAM and SNP_CALL processes of the mapping.nf module

BWA

Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]
License (GPL-3.0): https://github.com/lh3/bwa/blob/master/COPYING
This tool is used in GET_REF_GENOME_BWA_DB and MAPPING processes of the mapping.nf module

Docker Images of ARIBA, BCFtools, BWA, fastp, Kraken 2, mlst, PopPUNK, QUAST, SAMtools, Shovill, Unicycler

State Public Health Bioinformatics Workgroup (@StaPH-B)
License (GPL-3.0): https://github.com/StaPH-B/docker-builds/blob/master/LICENSE
These Docker images provide containerised environments with different bioinformatics tools for processes of multiple modules

Docker Image of network-multitool

Wbitt - We Bring In Tomorrow's Technolgies (@WBITT)
License (MIT): https://github.com/wbitt/Network-MultiTool/blob/master/LICENSE
This Docker image provides the containerised environment with Bash tools for processes of multiple modules

Docker Image of Pandas

Alexander Mancevice (@amancevice)
License (MIT): https://github.com/amancevice/docker-pandas/blob/main/LICENSE
This Docker image provides the containerised environment with Python and Pandas for GENERATE_OVERALL_REPORT process of the output.nf module, HET_SNP_COUNT process of the mapping.nf module and PARSE_OTHER_RESISTANCE process of the amr.nf module

fastp

Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560
License (MIT): https://github.com/OpenGene/fastp/blob/master/LICENSE
This tool is used in PREPROCESS process of the preprocess.nf module

GPSC_pipeline_nf

Victoria Dyster (@blue-moon22)
License (GPL-3.0): https://github.com/sanger-bentley-group/GPSC_pipeline_nf/blob/master/LICENSE
Code adapted into the get_lineage.sh script

Kraken 2

Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0
License (MIT): https://github.com/DerrickWood/kraken2/blob/master/LICENSE
This tool is used in TAXONOMY process of the taxonomy.nf module

mecA-HetSites-calculator

Narender Kumar (@kumarnaren)
License (GPL-3.0): https://github.com/kumarnaren/mecA-HetSites-calculator/blob/master/LICENSE
Code was rewritten into the het_snp_count.py script

mlst

Torsten Seemann (@tseemann)
License (GPL-2.0): https://github.com/tseemann/mlst/blob/master/LICENSE
Incorporates components of the PubMLST database
This tool is used in MLST process of the mlst.nf module

Nextflow

P. Di Tommaso, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017) doi:10.1038/nbt.3820
License (Apache 2.0): https://github.com/nextflow-io/nextflow/blob/master/COPYING
This project is a Nextflow pipeline; Nextflow executable nextflow is included in this repository

PopPUNK

Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Research 29:1-13 (2019). doi:10.1101/gr.241455.118
License (Apache 2.0): https://github.com/bacpop/PopPUNK/blob/master/LICENSE
This tool is used in LINEAGE process of the lineage.nf module

QUAST

Alla Mikheenko, Andrey Prjibelski, Vladislav Saveliev, Dmitry Antipov, Alexey Gurevich, Versatile genome assembly evaluation with QUAST-LG, Bioinformatics (2018) 34 (13): i142-i150. doi: 10.1093/bioinformatics/bty266. First published online: June 27, 2018
License (GPL-2.0): https://github.com/ablab/quast/blob/master/LICENSE.txt
This tool is used in ASSEMBLY_ASSESS process of the assembly.nf module

SeroBA

SeroBA: rapid high-throughput serotyping of Streptococcus pneumoniae from whole genome sequence data. Epping L, van Tonder, AJ, Gladstone RA, GPS Consortium, Bentley SD, Page AJ, Keane JA, Microbial Genomics 2018, doi: 10.1099/mgen.0.000186
License (GPL-3.0): https://github.com/sanger-pathogens/seroba/blob/master/LICENSE
This project uses a Docker image of a fork
- The fork provides SeroBA with the latest updates as the original repository is no longer maintained
- The Docker image provides the containerised environment with SeroBA for GET_SEROBA_DB and SEROTYPE processes of the serotype.nf module

resistanceDatabase

Narender Kumar (@kumarnaren)
License (GPL-3.0): https://github.com/kumarnaren/resistanceDatabase/blob/main/LICENSE
sequences.fasta is renamed to ariba_ref_sequences.fasta and modified
metadata.tsv is renamed to ariba_metadata.tsv and modified
The files are used as the default inputs of GET_ARIBA_DB process of the amr.nf module

Shovill

Torsten Seemann (@tseemann)
License (GPL-3.0): https://github.com/tseemann/shovill/blob/master/LICENSE
This tool is used in ASSEMBLY_SHOVILL process of the assembly.nf module

SPN-PBP-AMR

Pathogenwatch (@pathogenwatch-oss)
License (MIT): https://github.com/pathogenwatch-oss/spn-resistance-pbp/blob/main/LICENSE
This is a modified version of AMR predictor by Ben Metcalf (@BenJamesMetcalf) at the Centre for Disease Control (CDC)
This project uses a Docker image of a fork
- The fork changes the Docker image from a Docker executable image to a Docker environment for Nextflow integration
- The Docker image provides the containerised environment with SPN-PBP-MAR for PBP_RESISTANCE process of the amr.nf module

Unicycler

Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 2017.
License (GPL-3.0): https://github.com/rrwick/Unicycler/blob/main/LICENSE
This tool is used in ASSEMBLY_UNICYCLER process of the assembly.nf module

Name		Name	Last commit message	Last commit date
Latest commit History 687 Commits
bin		bin
data		data
doc		doc
modules		modules
workflows		workflows
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
clean_pipeline		clean_pipeline
download_test_input		download_test_input
main.nf		main.nf
nextflow		nextflow
nextflow.config		nextflow.config
nextflow_schema.json		nextflow_schema.json
run_pipeline		run_pipeline
tower.yml		tower.yml

License

sanger-bentley-group/gps-pipeline

Folders and files

Latest commit

History

Repository files navigation