Datapipe

Pipeline to process SARS-CoV-2 sequences and metadata, clean up irregularities, align and call variants, then publish matched subsets of FASTA sequences and metadata for groups with different levels of access to sensitive data.

Runs weekly on global sequences downloaded from GISAID.

Runs daily on COG-UK sequences, and combines with non-UK GISAID sequences.

Install and run

git clone --recurse-submodules https://github.com/COG-UK/grapevine_nextflow.git
cd grapevine_nextflow
conda env create -f environment.yml
conda activate grapevine_nextflow

NXF_VER=20.10.0 nextflow run workflows/process_cog_uk.nf <params>

Pipeline Overview

GISAID processing

Parse GISAID dump (export.json) and extract FASTA of sequences and associated metadata.
- Excludes known problematic sequences listed in gisaid_omissions.txt
- Excludes sequences where covv_host.lower() != 'human'
- Excludes sequences whose covv_collection_date is malformed (not YYYY-MM-DD) or impossible (earlier than 2019-11-30 or later than today)
- Reformats FASTA header
- Adds epi-week and epi-day columns to metadata
Run pangolin (https://github.com/cov-lineages/pangolin) on all new sequences. If a new version of pangolin has been released, run it on all sequences.
Calculate the unmapped_genome_completeness as the proportion of sequence length which is unambiguous (not N)
Deduplicate by date, keeping the earliest example
Align to the reference (Wuhan/WH04/2020) with minimap2
Variant call using gofasta and type specific mutations of interest listed in AAs.csv and dels.csv
Filter out low quality sequences with mapped completeness < 93%, and trim and pad alignment outside of reference coordinates 265:29674
Calculate the distance to the reference and exclude sequences whose distance deviates by more than 4.0 epi-week standard deviations.
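The exclusion rules and the completeness calculation described above can be sketched in Python. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and treating lowercase `n` as ambiguous is an assumption. Only the `covv_host` check, the YYYY-MM-DD format, the 2019-11-30 lower bound and the "not N" definition come from the text.

```python
from datetime import date, datetime
from typing import Optional

# Lower bound stated in the README; dates before this are treated as impossible.
EARLIEST_DATE = date(2019, 11, 30)


def parse_collection_date(raw: str, today: Optional[date] = None) -> Optional[date]:
    """Return the date if it is strictly YYYY-MM-DD and plausible, else None."""
    today = today or date.today()
    # strptime also accepts unpadded values such as "2020-1-5",
    # so require the full 10-character form first.
    if len(raw) != 10:
        return None
    try:
        parsed = datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError:
        return None
    if parsed < EARLIEST_DATE or parsed > today:
        return None
    return parsed


def is_human_host(covv_host: str) -> bool:
    """Mirror the README rule: keep only records where the host is 'human'."""
    return covv_host.lower() == "human"


def unmapped_genome_completeness(sequence: str) -> float:
    """Proportion of the sequence that is not N (case-insensitive by assumption)."""
    if not sequence:
        return 0.0
    n_count = sequence.upper().count("N")
    return 1.0 - n_count / len(sequence)
```

A record would be kept only if `is_human_host` is true and `parse_collection_date` returns a date; the completeness value is computed per sequence before alignment.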

COG-UK processing

Parse matched FASTA and metadata TSV output by Elan/Majora
- Reformats header and unaligns sequences which have already been aligned to the reference
- Manual date correction for samples listed in date_corrections.csv
- Excludes early sequences which have been resequenced as listed in resequencing_omissions.txt
- Adds GISAID accession if recently submitted
- Excludes sequences whose covv_collection_date is malformed (not YYYY-MM-DD) or impossible (earlier than 2019-11-30 or later than today)
- Adds epi-week and epi-day, source_id and pillar_2 columns to metadata
Run pangolin (https://github.com/cov-lineages/pangolin) on all new sequences. If a new version of pangolin has been released, run it on all sequences.
Calculate the unmapped_genome_completeness as the proportion of sequence length which is unambiguous (not N)
Deduplicate COG-ID by completeness and label samples with duplicate source_id
Align to the reference (Wuhan/WH04/2020) with minimap2
Variant call using gofasta and type specific mutations of interest listed in AAs.csv and dels.csv
Filter out low quality sequences with mapped completeness < 93%, and trim and pad alignment outside of reference coordinates 265:29674
Clean up geographical metadata (https://github.com/COG-UK/geography_cleaning)
Combine COG-UK sequences and metadata with non-UK GISAID sequences and metadata
Publish subsets of the data as described in publish_cog_global_recipes.json
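The COG-UK deduplication step can be illustrated with a short Python sketch. It assumes that "by completeness" means keeping the most complete record per COG-ID, and that records sharing a non-empty source_id are flagged rather than dropped. The field names (`cog_id`, `completeness`, `duplicate_source_id`) and the function name are hypothetical and are not taken from the pipeline.

```python
from collections import Counter
from typing import Dict, List


def deduplicate_by_completeness(records: List[Dict]) -> List[Dict]:
    """Keep the most complete record per COG-ID, then flag shared source_ids."""
    best: Dict[str, Dict] = {}
    for record in records:
        current = best.get(record["cog_id"])
        # Assumption: ties keep the first record seen.
        if current is None or record["completeness"] > current["completeness"]:
            best[record["cog_id"]] = record
    kept = list(best.values())

    # Count non-empty source_ids among the surviving records.
    source_counts = Counter(r["source_id"] for r in kept if r["source_id"])

    # Return copies with a boolean label; missing source_ids are never flagged.
    return [
        dict(r, duplicate_source_id=source_counts[r["source_id"]] > 1)
        for r in kept
    ]
```

Labelling instead of removing keeps every distinct COG-ID in the output while still letting downstream users filter on the flag.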

What is grapevine?

grapevine (https://github.com/COG-UK/grapevine) was the name of the original pipeline, which did all of the above, made phylogenetic trees and more. As the number of sequences has grown, the tree-building steps take increasingly long to complete. Because the majority of users only interact with the alignments and cleaned metadata, it was decided that a robust implementation of the alignment and metadata processing steps, run daily, would be more useful, and that is what is provided here.
