NCBI-Hackathons/chervil

Computational Human Endogenous RetroViral Infection Labeler (CHERVIL) is a pipeline for the detection of endogenous retroviral expression patterns that correspond to current or previous viral infection.

This project was developed at the Rocky Mountain Genomics HackCon 2018 by Benjamin Lee (team lead), Jeremy Ash (team lead), Corinne Walsh, and Grant Vagle with support from Ben Busby and Michael Crusoe.

What is CHERVIL and why do we need it?

Human endogenous retroviral elements (HERVs) are retroviruses that have integrated themselves into the human germline. Usually, they remain latent in the human genome. However, previous work suggests that some HERVs become actively transcribed upon viral infection.

CHERVIL builds on RetroSpotter, an existing pipeline for HERV expression quantification, adding a machine learning component to identify patterns in HERV expression indicative of pre-symptomatic or historic viral infection.

How it works

(pipeline diagram)

At a high level, there are two major phases of the CHERVIL pipeline.

The first is the calculation of HERV expression in different populations. To do this, we use RetroSpotter and Magic-BLAST to align RNA-seq data to known HERVs to quantify HERV expression.

The second phase is the automatic development of a machine learning pipeline that uses expression data to predict disease status. We accomplish this using TPOT to identify HERV expression patterns specific to viral infection.
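To make the idea behind the second phase concrete, here is a toy sketch in Python. A hand-rolled nearest-centroid classifier stands in for the full scikit-learn pipeline that TPOT would actually discover, and the HERV count matrix below is entirely made up:

```python
# Toy illustration of phase two: predicting infection status from per-HERV
# read counts. This is NOT TPOT; a simple nearest-centroid rule stands in
# for the pipeline TPOT would search for. All numbers are invented.

# Rows = samples, columns = read counts for three hypothetical HERV elements.
train_X = [
    [120, 5, 80],   # infected
    [100, 8, 95],   # infected
    [10, 4, 12],    # control
    [14, 6, 9],     # control
]
train_y = ["infected", "infected", "control", "control"]

def centroid(rows):
    """Column-wise mean of a list of count vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# One centroid per class, computed from the training samples of that class.
centroids = {
    label: centroid([x for x, y in zip(train_X, train_y) if y == label])
    for label in set(train_y)
}

def predict(sample):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(sample, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

print(predict([110, 7, 85]))  # infected
print(predict([12, 5, 11]))   # control
```

TPOT replaces this fixed rule with an automated search over preprocessing steps, feature selection, and model choice, exporting the best-scoring pipeline as runnable Python code.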

Installation

Docker

From Docker Hub

We have provided a Docker image with our pipeline pre-installed. To download it (assuming you already have Docker installed), run:

$ docker pull benjamindlee/chervil

From Dockerfile

Alternatively, you can build the image yourself from our Dockerfile:

$ docker build -t chervil .

Manually

Before proceeding, ensure that you have the following installed and functional:

  1. R 3.5 or greater
  2. Python 3.6 or greater
  3. Magic-BLAST
  4. A towel

Next, clone a copy of the repository:

$ git clone https://github.com/NCBI-Hackathons/chervil.git

and then cd into it:

$ cd chervil

Odds are that you will want to run CHERVIL in a virtual environment. If you don't have virtualenv installed, run:

$ pip install virtualenv

And then to set up your shiny new virtual environment:

$ virtualenv env --python=python3.6
$ source env/bin/activate

Next, to install the Python components, run:

(env) $ pip install -r requirements.txt

How to use CHERVIL

  1. Create a BLAST database of HERV elements

    Users will need to create a FASTA file containing the nucleotide sequence of each HERV element. For convenience, we have included a set of known HERV sequences. The makeblastdb.sh command creates a BLAST database:

     $ makeblastdb.sh reference_genome/her_reference.fasta
    

    This creates a directory blastdb containing a reference database called referencedb.

  2. Input accession numbers and their classifications

    This should be in the form of a CSV file that looks something like this:

    SRR123456, infected
    SRR789101, infected
    SRR112131, infected
    SRR415161, control
    SRR718192, control
    SRR021222, control
    

    (note: these are made-up accessions)
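For illustration, a file in this two-column layout can be parsed with nothing but the Python standard library (reusing the same made-up accessions):

```python
import csv
import io

# Example accession CSV in the two-column format shown above
# (accession, classification); these accessions are made up.
csv_text = """\
SRR123456, infected
SRR789101, infected
SRR415161, control
"""

# Parse into (accession, label) pairs, stripping whitespace around fields.
rows = [(acc.strip(), label.strip())
        for acc, label in csv.reader(io.StringIO(csv_text))]

accessions = [acc for acc, _ in rows]
labels = [label for _, label in rows]
print(accessions)  # ['SRR123456', 'SRR789101', 'SRR415161']
print(labels)      # ['infected', 'infected', 'control']
```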

  3. Generate the HERV classification machine learning model

    Assuming you are in a directory with accessions and their classes, run:

     $ chervil.sh [path to SRR accession csv] [path to blast database] [number of cores] [output directory] [prefix for SAM files]
    

    Example usage:

     $ chervil.sh srr_inf_test.csv ../blast_dbs/referencedb 20 out "test"
    

    This command calls multiple scripts that execute the pipeline we have developed.

    • Uses the magicblast command to align RNA-seq reads to the reference BLAST database, generating a SAM file for each accession. (S1_make_acc_file.r, run_jobs.sh)

    • Takes the SAM files and counts the number of reads corresponding to each ERV gene. (count_hits.sh)

    • Organizes the counts into a dataframe that includes all of the samples (by SRR accession), their class (infected, not infected, etc.), and their read count for each ERV gene, written to a CSV file. (S2_orgCountsScript.r)

    • Feeds this dataframe into TPOT, an automated machine learning tool. The model and an HTML report containing a confusion matrix and performance measures for an external data set are then saved for analysis. (S3_generate_classifier.py)
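The read-counting step can be sketched in a few lines of Python. This is a simplified illustration, not the actual count_hits.sh logic; the HERV names and SAM records below are invented. In the SAM format, lines starting with `@` are headers, and field 3 (RNAME) names the reference sequence a read aligned to (`*` if unaligned):

```python
from collections import Counter

# Minimal sketch of per-HERV read counting from SAM records.
# All reads and reference names below are made up for illustration.
sam_lines = [
    "@SQ\tSN:HERV-K\tLN:9472",
    "read1\t0\tHERV-K\t100\t60\t50M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tHERV-K\t220\t60\t50M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tHERV-W\t10\t60\t50M\t*\t0\t0\tACGT\tIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",  # unaligned read
]

counts = Counter()
for line in sam_lines:
    if line.startswith("@"):
        continue                   # skip SAM header lines
    rname = line.split("\t")[2]    # RNAME: the HERV element aligned to
    if rname != "*":               # '*' means the read did not align
        counts[rname] += 1

print(dict(counts))  # {'HERV-K': 2, 'HERV-W': 1}
```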

Troubleshooting

Bug reports should be submitted via the repository's GitHub issue tracker.

If you run into any problems while using CHERVIL, feel free to email Benjamin Lee for support.

Example Dataset

  • PRJNA349748: Human Tracheobronchial Epithelial (HTBE) cells infected with Influenza
    • Data Type: RNA-seq
    • Samples:
      • 10 H1N1, H5N1, and H3N2 infected cells
      • 5 mock-infected controls

Example Report

2-fold cross-validation accuracy: 0.917
Validation accuracy: 0.75

                 Predicted
Actual           infected   not_infected
infected             3           0
not_infected         1           0
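The headline numbers in this report follow directly from the confusion matrix above (rows are actual classes, columns are predictions), which the following snippet verifies:

```python
# Counts from the example report's confusion matrix, taking "infected"
# as the positive class (rows = actual, columns = predicted).
tp, fn = 3, 0   # actual infected:     3 predicted infected, 0 predicted not_infected
fp, tn = 1, 0   # actual not_infected: 1 predicted infected, 0 predicted not_infected

total = tp + fn + fp + tn
accuracy = (tp + tn) / total                         # 0.75: validation accuracy / Overall_ACC
precision = tp / (tp + fp)                           # 0.75: PPV for the infected class
recall = tp / (tp + fn)                              # 1.0:  TPR for the infected class
f1 = 2 * precision * recall / (precision + recall)   # 0.85714: F1 for the infected class

print(accuracy, precision, recall, round(f1, 5))
```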

Overall Statistics:

95% CI (0.32565,1.17435)
Bennett_S 0.5
Chi-Squared None
Chi-Squared DF 1
Conditional Entropy None
Cramer_V None
Cross Entropy None
Gwet_AC1 0.68
Joint Entropy None
KL Divergence None
Kappa 0.0
Kappa 95% CI (-1.69741,1.69741)
Kappa No Prevalence 0.5
Kappa Standard Error 0.86603
Kappa Unbiased -0.14286
Lambda A None
Lambda B None
Mutual Information None
Overall_ACC 0.75
Overall_RACC 0.75
Overall_RACCU 0.78125
PPV_Macro None
PPV_Micro 0.75
Phi-Squared None
Reference Entropy 0.81128
Response Entropy None
Scott_PI -0.14286
Standard Error 0.21651
Strength_Of_Agreement(Altman) Poor
Strength_Of_Agreement(Cicchetti) Poor
Strength_Of_Agreement(Fleiss) Poor
Strength_Of_Agreement(Landis and Koch) Slight
TPR_Macro 0.5
TPR_Micro 0.75

Class Statistics:

Class infected not_infected Description
ACC 0.75 0.75 Accuracy
BM 0.0 0.0 Informedness or bookmaker informedness
DOR None None Diagnostic odds ratio
ERR 0.25 0.25 Error rate
F0.5 0.78947 0.0 F0.5 score
F1 0.85714 0.0 F1 score - harmonic mean of precision and sensitivity
F2 0.9375 0.0 F2 score
FDR 0.25 None False discovery rate
FN 0 1 False negative/miss/type 2 error
FNR 0.0 1.0 Miss rate or false negative rate
FOR None 0.25 False omission rate
FP 1 0 False positive/type 1 error/false alarm
FPR 1.0 0.0 Fall-out or false positive rate
G 0.86603 None G-measure geometric mean of precision and sensitivity
LR+ 1.0 None Positive likelihood ratio
LR- None 1.0 Negative likelihood ratio
MCC None None Matthews correlation coefficient
MK None None Markedness
N 1 3 Condition negative
NPV None 0.75 Negative predictive value
P 3 1 Condition positive
POP 4 4 Population
PPV 0.75 None Precision or positive predictive value
PRE 0.75 0.25 Prevalence
RACC 0.75 0.0 Random accuracy
RACCU 0.76562 0.01562 Random accuracy unbiased
TN 0 3 True negative/correct rejection
TNR 0.0 1.0 Specificity or true negative rate
TON 0 4 Test outcome negative
TOP 4 0 Test outcome positive
TP 3 0 True positive/hit
TPR 1.0 0.0 Sensitivity, recall, hit rate, or true positive rate
