Rachel Colquhoun edited this page Aug 21, 2023 · 8 revisions

Scorpio

(serious constellations of reoccurring phylogenetically-independent origin)

Scorpio is a tool for classifying, haplotyping and defining Variants of Concern or Variants of Interest for a species. It was designed in the context of SARS-CoV-2 but is not species-specific: all SARS-CoV-2-specific information can be installed via constellations.

It currently includes the following commands:

  1. classify - takes a set of lineage-defining constellations with rules and classifies sequences by them.
  2. haplotype - takes a set of constellations and writes haplotypes (either as strings or individual columns).
  3. list - print the mrca_lineage and output_name of constellations as a single column to stdout.
  4. define - takes a CSV with a group column and a mutations column and extracts the common mutations within the group, optionally with reference to a specified outgroup

It takes as input a reference-coordinate-based multiple sequence alignment (MSA) in FASTA format. For this reason it currently only supports typing SNPs and deletions (not insertions). This style of MSA has been commonly used during the SARS-CoV-2 pandemic because it can be generated by combining consensus-to-reference mappings instead of all-against-all mappings, and therefore scales much better to millions of sequences. The MSA can be generated from unaligned sequences using the following command:

minimap2 -t <threads> -a --secondary=no -x asm20 --score-N=0 <reference_fasta> <sequence_fasta> \
 | gofasta sam toMultiAlign -t <threads> --reference <reference_fasta> --pad -o alignment.fasta

Alternatively, it may be possible to generate it using MAFFT with the --keeplength option ("Keep alignment length" in the web app).
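Because the alignment is in reference coordinates, every record should have exactly the same length as the reference. The following is a minimal sketch of a sanity check for that property; the function names and the toy records are hypothetical and are not part of Scorpio, minimap2 or gofasta.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def wrong_length_records(handle, reference_length):
    """Return the names of records whose length differs from the reference."""
    return [name for name, seq in read_fasta(handle) if len(seq) != reference_length]

# Toy alignment against a hypothetical 8-base reference.
example = StringIO(">ref_like\nACGTACGT\n>padded\nNNGTAC--\n>too_long\nACGTACGTA\n")
bad = wrong_length_records(example, reference_length=8)
print(bad)
```

In practice the handle would be an opened alignment.fasta and reference_length the length of the reference genome; any name reported here suggests the alignment is not in reference coordinates.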

Classify

Classify counts up the number of reference, alternative, ambiguous and other alleles at each of the defining sites of each constellation, and summarizes whether each sequence can be classified as belonging to each constellation based on sets of rules.

If a sequence meets the criteria set in the rules for several constellations, the winning constellation is chosen by default as the one with the most rules met and with the best support (#alt/#sites). The default output is a single summary file, with optional additional columns. Individual counts and True/False classifications for each constellation can be output in individual CSV files.
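The counting and winner selection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Scorpio's implementation: sites are simplified to SNPs given as (1-based position, ref, alt), the rule names min_alt and max_ref and the toy constellations are assumptions, and a deleted base ("-") simply falls into the "other" category here.

```python
AMBIGUOUS = set("NRYSWKMBDHV")  # IUPAC ambiguity codes

def count_alleles(sequence, sites):
    """Count ref/alt/ambig/other calls at SNP sites (1-based position, ref, alt)."""
    counts = {"ref": 0, "alt": 0, "ambig": 0, "other": 0}
    for position, ref, alt in sites:
        base = sequence[position - 1].upper()
        if base == ref:
            counts["ref"] += 1
        elif base == alt:
            counts["alt"] += 1
        elif base in AMBIGUOUS:
            counts["ambig"] += 1
        else:
            counts["other"] += 1
    return counts

def classify(sequence, constellations):
    """Return (name, support) of the winning constellation, or None if none match."""
    best = None
    for name, definition in constellations.items():
        counts = count_alleles(sequence, definition["sites"])
        rules = definition["rules"]
        # Assumed rule semantics: enough alt calls and not too many ref calls.
        if counts["alt"] < rules["min_alt"] or counts["ref"] > rules["max_ref"]:
            continue
        support = counts["alt"] / len(definition["sites"])
        candidate = (len(rules), support, name)
        # Prefer more rules met, then higher support (#alt/#sites).
        if best is None or candidate[:2] > best[:2]:
            best = candidate
    return None if best is None else (best[2], best[1])

# Hypothetical toy constellations, for illustration only.
toy_constellations = {
    "toy_A": {"sites": [(2, "C", "T"), (4, "T", "G"), (6, "C", "A")],
              "rules": {"min_alt": 2, "max_ref": 1}},
    "toy_B": {"sites": [(2, "C", "T"), (9, "A", "G")],
              "rules": {"min_alt": 1, "max_ref": 1}},
}
sample = "ATGGACGTGC"
print(count_alleles(sample, toy_constellations["toy_A"]["sites"]))
print(classify(sample, toy_constellations))
```

Both toy constellations pass their rules for this sample, so the tie on rule count is broken by support: toy_B has alt calls at 2 of 2 sites, toy_A at 2 of 3.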

Example commands:

  1. Create individual count files for each of the Omicron and Delta constellations. Note that the -n flag specifies a list of names in the format specified by the label in the constellation JSON files.
scorpio classify -i alignment.fa --prefix scorpio_classify --output-counts -n "Delta (B.1.617.2-like)" "Omicron (B.1.1.529-like)" "Omicron (BA.1-like)" "Omicron (BA.2-like)" "Omicron (BA.3-like)" "Omicron (Unassigned)"
  2. Create a single file with the winning classification for each sample, but include count information for the winner using --long.
scorpio classify -i alignment.fa --prefix scorpio_classify --long
  3. View the output as constellations are loaded but stop before classifying samples.
scorpio classify -i alignment.fa --prefix scorpio_classify --long --dry-run

Haplotype

Create barcode strings for each sample for each constellation - these strings are ordered by position in the definition files and can help to resolve why a sample is failing to be classified as a given constellation: amplicon dropout, potential recombination or contamination.

Options include combining constellations and creating a single barcode/set of haplotypes for the ordered list of defining sites of all constellations, splitting barcodes into a column per site, and outputting a file per constellation containing counts of ref, alt, ambig and other alleles.
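To make the idea of a barcode concrete, here is a minimal sketch, assuming SNP-only sites given as (1-based position, ref, alt) in definition order. The symbol encoding ("-" for a reference call, the alt base for an alternative call, "X" for an ambiguous base, "?" for anything else) is chosen for illustration and is not necessarily the encoding Scorpio writes.

```python
AMBIGUOUS = set("NRYSWKMBDHV")  # IUPAC ambiguity codes

def barcode(sequence, sites):
    """Build one barcode character per defining site, in definition order."""
    symbols = []
    for position, ref, alt in sites:
        base = sequence[position - 1].upper()
        if base == ref:
            symbols.append("-")   # reference allele (illustrative symbol)
        elif base == alt:
            symbols.append(alt)   # alternative allele
        elif base in AMBIGUOUS:
            symbols.append("X")   # ambiguous call, e.g. amplicon dropout
        else:
            symbols.append("?")   # any other allele
    return "".join(symbols)

def combined_sites(*site_lists):
    """Ordered union of the defining sites of several constellations."""
    merged = [site for sites in site_lists for site in sites]
    return list(dict.fromkeys(merged))  # drops duplicates, keeps first-seen order

# Hypothetical toy definitions and sample.
toy_a = [(2, "C", "T"), (4, "T", "G"), (6, "C", "A"), (7, "G", "A")]
toy_b = [(2, "C", "T"), (9, "A", "G")]
sample = "ATGGACNTGC"

print(barcode(sample, toy_a))                         # TG-X
print(barcode(sample, combined_sites(toy_a, toy_b)))  # TG-XG
```

Read this way, a run of "X" at neighbouring sites points towards dropout, while a block of reference calls inside an otherwise alt-rich barcode is the kind of pattern that may suggest recombination or contamination.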

Example commands

  1. Create a single summary file with a haplotype barcode for each of the Omicron and Delta constellations for each sample. Note that the -n flag specifies a list of names in the format specified by the label in the constellation JSON files.
scorpio haplotype -i alignment.fa --prefix scorpio_haplotype -n "Delta (B.1.617.2-like)" "Omicron (B.1.1.529-like)" "Omicron (BA.1-like)" "Omicron (BA.2-like)" "Omicron (BA.3-like)"
  2. Create a file per constellation with a column containing the genotype call for each defining mutation site, and a summary of the counts of ref, alt, ambig and other alleles.
scorpio haplotype -i alignment.fa --prefix scorpio_haplotype --append-genotypes --output-counts
  3. Create a single file with a barcode representing the union of Delta and the Omicron parent lineage (B.1.1.529).
scorpio haplotype -i alignment.fa --prefix scorpio_haplotype --combination -n "Omicron (Unassigned)" "Delta (B.1.617.2-like)"

List

Prints to stdout a single-column list of the mrca_lineage and output_name for each constellation. This can then be parsed for downstream analysis; for example, Pangolin uses it to get the list of lineages that have constellations, in order to remove false positive lineage assignments. The output_name corresponds to the label in the constellation JSON unless another field is specified with --label.

Define

Identify the common mutations within a group of sequences. This command assumes that the mutations for each sample have already been found and are provided as a pipe-separated list in a column called nucleotide_mutations. If required, the user can specify an outgroup, and mutations which are common to this outgroup are placed in a separate ancestral site list which is used by classify but not haplotype in order to retain sensitivity whilst removing noise from haplotype barcodes.
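The core idea can be sketched as below. This is a simplified illustration rather than Scorpio's algorithm: the 0.9 frequency threshold, the function names and the example mutation strings are assumptions, and each entry is taken to be the pipe-separated nucleotide_mutations value for one sample.

```python
from collections import Counter

def common_mutations(mutation_strings, threshold=0.9):
    """Return mutations present in at least `threshold` of the samples."""
    counts = Counter()
    for entry in mutation_strings:
        # Use a set so a mutation is counted once per sample.
        counts.update({m for m in entry.split("|") if m})
    needed = threshold * len(mutation_strings)
    return sorted(m for m, c in counts.items() if c >= needed)

def define(group, outgroup=None, threshold=0.9):
    """Split the group's common mutations into defining and ancestral sites."""
    common = common_mutations(group, threshold)
    ancestral = set(common_mutations(outgroup, threshold)) if outgroup else set()
    return {
        "sites": [m for m in common if m not in ancestral],
        "ancestral": [m for m in common if m in ancestral],
    }

# Hypothetical nucleotide_mutations values, one string per sample.
group = ["A23403G|C241T|G28881A", "A23403G|C241T", "A23403G|C241T|G28881A|T100C"]
outgroup = ["C241T|C3037T", "C241T"]

result = define(group, outgroup)
print(result)  # {'sites': ['A23403G'], 'ancestral': ['C241T']}
```

Here C241T is common to both the group and the outgroup, so it lands in the ancestral list, while A23403G remains a defining site; G28881A and T100C are too rare within the group to be kept.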

The following two examples show different ways to create a constellation definition file from a GISAID download named sequences.fasta. Please first check that your FASTA header names contain no spaces or unusual symbols (letters, digits, underscores, hyphens and pipes, i.e. [A-Za-z0-9_|-], are fine).

Example generating the variants column using gofasta

If it is not already installed, please install gofasta using conda install bioconda::gofasta. You will also require local reference files, e.g. MN908947.fa and MN908947.gff, which are the reference genome files for SARS-CoV-2 (GenBank accession MN908947). Note that the downloads have an extra newline at the end which has to be deleted.

minimap2 -a -x asm20 --sam-hit-only --secondary=no --score-N=0 MN908947.fa sequences.fasta -o aligned.sam
gofasta sam variants -a MN908947.gff -r MN908947.fa -s aligned.sam -o variants.csv

Example generating the variants column using datapipe

Download the COG-UK datapipe and install its dependencies using conda env create -f environment.yml && conda activate datapipe. We will be using nextflow to run the variant calling module (https://github.com/COG-UK/datapipe/blob/main/modules/align_and_variant_call.nf).

  1. We need a basic CSV for datapipe including a sequence_name column with names which correspond to the FASTA headers. Run e.g. cat sequences.fasta | grep ">" | cut -f1 > sequences.csv, then strip the ">" characters using find and replace, and manually add a header row and a lineage column.
  2. Create the nucleotide_mutations column using the following datapipe command:
NXF_VER=20.10.0 nextflow run modules/align_and_variant_call.nf --uk_fasta sequences.fasta --uk_metadata sequences.csv

and find the output from align_and_variant_call:add_nucleotide_mutations_to_metadata ending .with_nuc_mutations.csv.
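As an alternative to the manual edits in step 1, the metadata CSV could be written with a short script. This is a sketch only: the function names are hypothetical, and it assigns the same lineage label to every sequence, which is an assumption that fits only when the FASTA contains a single group.

```python
import csv
from io import StringIO

def fasta_headers_to_rows(fasta_handle, lineage):
    """Collect one row per FASTA header, without the leading '>'."""
    rows = []
    for line in fasta_handle:
        if line.startswith(">"):
            # Mirror `cut -f1`: keep only the first tab-separated field.
            name = line[1:].strip().split("\t")[0]
            rows.append({"sequence_name": name, "lineage": lineage})
    return rows

def write_metadata(rows, out_handle):
    """Write rows as CSV with a sequence_name,lineage header row."""
    writer = csv.DictWriter(out_handle, fieldnames=["sequence_name", "lineage"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

# Toy FASTA and a placeholder lineage label, for illustration only.
fasta = StringIO(">sample1\nACGT\n>sample2\textra\nACGT\n")
out = StringIO()
write_metadata(fasta_headers_to_rows(fasta, lineage="my_group"), out)
print(out.getvalue())
```

With real data the two StringIO objects would be replaced by open("sequences.fasta") and open("sequences.csv", "w", newline="").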

Example define commands

  1. Generate a new_constellation.json file based on the mutations in variants.csv, excluding those already defined in the parent constellation BA.2 (already installed in constellations) OR in the local constellation file cBA.2.json.
scorpio define -i variants.csv --outgroup-json BA.2
scorpio define -i variants.csv --outgroup-json cBA.2.json
  2. Generate constellation files for all groups defined by a lineage column in variants.csv (e.g. the output of datapipe).
scorpio define -i variants.csv