Timing of QVOA strains using RNA pre-treatment data

Updated Nov 2022 to provide compatibility with HyPhy 2.5.x and support additional options, like not compressing identical sequences Updated Mar 2024 to provide compatibility with HyPhy 2.5.5x, better handle alignment when none of the sequences are in frame (use longest ORF), and branch length re-estimation using HyPhy.

Overview

This pipeline times outgrowth virus (OGV) strains from a single host using longtitudinally sampled RNA data to provide temporal information. Four different approaches are used to time strains with unobserved dates (OGV or DNA strains). The phylogenetic tree for each genomic segment is rooted to maximize the root-to-tip to sampling time correlation coefficient first.

Patristic distance

Given a tree with OGV and RNA strains, this approach simply finds the nearest RNA neighbor (in patristic, i.e., path length sense) for each OGV sequence. To assess reliability of this assignment, we also identify all K≥1 sequences within a factor of 2x of this smallest distance, and extract the consensus vote to assign the date. The support value if the proportion of K neighboring sequences that come from the same timepoint. In the example below, there are 9 RNA neighbors to the blue QVOA sequence. The closest one is CAP288_4260_184WPI_ENV_C1C2_NGS_020_0_006 and the other two are with 2x distance. The blue sequence will be assigned to the 184WPI timepoint (WPI = weeks post infection), because most of the neighbors are from that timepoint, and the support value will be 8/9, because one of teh neighbors has a timestamp of 174WPI

Clade support

Here, we start with a QVOA sequence, and go up the tree until we encounter the first internal node that has both QVOA and RNA descendants and that has a minimum of 0.90 bootstrap support. We then assign a timestamp and support to the QVOA sequence using the same majority rule as for the patristic distance for all the nodes in the clade. In the example below, to date CAP288_NEF_1_W11_QVOA, we walk up the tree to the first supported node (blue) and look at its RNA descendants, of which there are 6 AND 5/6 come from 184WPI, yielding the assignment of 184WPI for the QVOA sequence and support of 5/6.

Placement

Phylogenetic placement using a modified IgSCUEAL tool. The support value is the model weight (in [0-1]) assigned to the branches with teh corresponding label

Regression

The linear regression molecular clock method, essentially the same as in the Jones et al paper.

Implementation

Dependancies

HyPhy ≥v2.5.55 -- or the develop branch from github.com/veg/hyphy
MAFFT for MSA generation
FastTree for phylogeny inference
phylotree.js (an npm package) for tree processing and visualization [note; there is a custom JS script for performing dating steps that is included in scripts]
snakemake for workflow management
node for executing JavaScript
BioPython

See notes on configuration below and please run

npm install

from the base directory

Workflow

The workflow for sample classification is implemented as a Snakemake workflow, encoded in the Snakemake file, and can be graphically summarized as follows (for a specific sample)...

The pipeline requires an input samples.json file, with pathnames of files to process specified relative to the data directory, e.g.

{
    'samples': [
        "CAP188/CAP188_ENV_2_all_hap.fasta",
        "CAP188/CAP188_ENV_3_all_hap.fasta",
        "CAP188/CAP188_ENV_4_all_hap.fasta",
        "CAP188/CAP188_NEF_1_all_hap.fasta",
        "CAP206/CAP206_ENV_2_all_hap.fasta",
        "CAP206/CAP206_ENV_4_all_hap.fasta",
        "CAP206/CAP206_GAG_1_all_hap.fasta",
        "CAP217/CAP217_ENV_2_all_hap.fasta",
        "CAP217/CAP217_ENV_3_all_hap.fasta",
        "CAP217/CAP217_ENV_4_all_hap.fasta",
        "CAP257/CAP257_ENV_2_all_hap.fasta",
        "CAP257/CAP257_ENV_3_all_hap.fasta",
        "CAP257/CAP257_ENV_4_all_hap.fasta",

The pipeline is executed by calling snakemake, e.g.

$snakemake --cores 4

After the pipeline has completed, you need to run

$python3 scripts/result-summary.py -d results/dating/ -j data/conversion.json -o results/summary.csv

to generate the overall summary CSV file.

The high level directory structure contains the following components

├── HBL							[HyPhy scripts]
│   ├── data_filter-2.bf
│   ├── data_filter.bf
│   ├── lib
│   └── scripts
├── LICENSE
├── package.json					[npm package manifest]
├── README.md						[This file]
├── Snakefile						[Snakemake configuration file]
├── data							[unprocessed FASTA files]
│   ├── CAP188
│   ├── CAP206
│   ├── CAP217
│   ├── CAP257
│   ├── CAP268
│   ├── CAP280
│   ├── CAP287
│   ├── CAP288
│   ├── CAP302
│   ├── CAP316
│   ├── CAP336
│   ├── CAP372
│   └── conversion.json	  [Color codes and WPI to pre-ART conversions]
├── img					  [Images for this file]
│   ├── 1.png
│   ├── 2.png
│   └── 3.png
├── results              [Results generated by the pipeline]
│   ├── alignments	     [Alignments]
│   ├── dating		     [QVOA/OGV dating results]
│   └── trees            [Trees inferred by the pipeline]
├── samples.json         [Manifest of files to be processed by the pipeline]
└── scripts
    ├── compute-distance.js 	[Classifier/timing script]
    └── result-summary.py		[Penultimate summary script]

Within specific resutls directories, the following files are created for each input FASTA

`alignments` directory

CAP316/
├── CAP316_ENV_3_all_hap.fasta_combined.msa           
    [RNA+QVOA nucleotide in-frame sequences, aligned]
├── CAP316_ENV_3_all_hap.fasta_combined_nuc.fas
    [RNA+QVOA nucleotide in-frame sequences, not aligned]
├── CAP316_ENV_3_all_hap.fasta_combined_protein.fas
    [Translated RNA+QVOA nucleotide in-frame sequences, not aligned]
├── CAP316_ENV_3_all_hap.fasta_combined_protein.msa
    [Translated RNA+QVOA nucleotide in-frame sequences, aligned]
├── CAP316_ENV_3_all_hap.fasta_nuc.fas
    [In frame RNA nucleotide in-frame sequences, not aligned]
├── CAP316_ENV_3_all_hap.fasta_protein.fas
    [Translated RNA nucleotide in-frame sequences, not aligned]
├── CAP316_ENV_3_all_hap.fasta_qvoa.fas
    [QVOA/OGV in-frame sequences, not aligned]
├── CAP316_ENV_3_all_hap.fasta_rna.msa
    [RNA+QVOA nucleotide in-frame sequences not aligned]
...

`trees` directory

CAP316/
├── CAP316_ENV_3_all_hap.fasta_combined_unscaled.nwk
    [Pseudo-ML FastTree topology for QVOA + RNA sequences]
├── CAP316_ENV_3_all_hap.fasta_rna_unscaled.nwk
    [Pseudo-ML FastTree topology for RNA sequences]
├── CAP316_ENV_3_all_hap.fasta_combined.nwk
    [FastTree topology with branch lengths retuned by HyPhy]
├── CAP316_ENV_3_all_hap.fasta_rna.nwk
    [FastTree topology with branch lengths retuned by HyPhy]

`dating` directory

CAP336/
├── CAP336_ENV_4_all_hap.fasta.scueal
	[alignment+tree file used by HyPhy for phylogenetic placement]
├── CAP336_ENV_4_all_hap.fasta.scueal.hyphy
	[auxiliary file generated by HyPhy during phylogenetic placement]
├── CAP336_ENV_4_all_hap.fasta.scueal.hyphy-data
	[auxiliary file generated by HyPhy during phylogenetic placement]
├── CAP336_ENV_4_all_hap.fasta_classification.csv
	[Summary classification (see below)]
├── CAP336_ENV_4_all_hap.fasta_full_placement.tsv
	[phylogenetic placement extended results]
├── CAP336_ENV_4_all_hap.fasta_placement.tsv
	[phylogenetic placement brief results]

Alignment preparation

For each individual and each genomic region, we performed the following steps to generate in-frame codon alignments for subsequent analysis. Sequence names in the input files are expected to encode their origin (RNA vs QVOA): any sequence whose name matched the regexp (QVOA)|(OGV)|(DNA)|(xxxx)|(BAR[0-9]+) are interpreted as outgrowth virus (see

ogv-dating/HBL/data_filter.bf

Line 41 in 128af4b

filter.regex = "(QVOA)|(OGV)|(DNA)|(xxxx)|(BAR[0-9]+)";

, for example).

CAP188_xxxx_000WPI_ENV_C2C3_OGV_B-W36
                            --- [OGV]
CAP188_4260_185WPI_ENV_C2C3_NGS_046_0.002
            ------ [Time stamp in weeks post-infection]

Translate each of the RNA nucleotide sequences into protein sequences using 3 forward reading frames. Keep track those that are in frame. Note : for some of the alignments none of the sequences were in frame, typically because the reads were stitched from discontinuous genomic regions [e.g. due to variable loop removal]. For such alignments, the code will find the longest ORF to use as a reference, meaning that only partial sequence data may be retained
Reads that were not in frame, and QVOA reads were codon aligned to each of the in-frame reads using a codon-aware Smith-Waterman algorithm implemented in HyPhy, picking the best-scoring pair to correct frame-shifts.
The resulting in-frame sequences were translated to amino-acids, mapping all ambiguous codons to ? (e.g. A-A).
A multiple sequence alignment (MSA) was generated from translated protein sequences using mafft
Original codon sequences are mapped to aligned protein sequences using a script in HyPhy. Identical sequences (which do not contribute any information to phylogenetic analyses) are removed at this stage.
ML phylognetic trees are reconstructed using FastTree (option) for RNA sequences only (these are later used for phylogenetic placement) and for RNA+QVOA. These topologies are then fed into hyphy to re-estimate the branch lengths.
The pipeline will compress identical RNA sequences, i.e. only one copy will be retained. The copy number will (including single copies, i.e. 1) will be appended to sequence names. It will NOT compress QVOA sequences, defined as those with QVOA or OGV in their name.

Important: adjust the paths to various executables, like hyphy and FastTree, in the Snakefile before running the pipeline

callables = {
    'hyphy' : '/usr/local/bin/hyphy',
    'mafft' : '/usr/local/bin/mafft --auto ',
    'raxml' : '/usr/local/bin/raxml-ng',
    'fasttree' : '/usr/local/bin/FastTree',
    'classifier' : 'scripts/compute-distance.js'
}

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
HBL		HBL
data		data
img		img
results		results
scripts		scripts
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
package.json		package.json
samples.json		samples.json

License

veg/ogv-dating

Folders and files

Latest commit

History

Repository files navigation

Timing of QVOA strains using RNA pre-treatment data

Overview

Patristic distance

Clade support

Placement

Regression

Implementation

Dependancies

Workflow

alignments directory

trees directory

dating directory

Alignment preparation

About

Resources

License

Stars

Watchers

Forks

Languages

`alignments` directory

`trees` directory

`dating` directory