NCBI-Hackathons/chervil

Computational Human Endogenous RetroViral Infection Labeler (CHERVIL) is a pipeline for the detection of endogenous retroviral expression patterns that correspond to current or previous viral infection.

This project was developed at the Rocky Mountain Genomics HackCon 2018 by Benjamin Lee (team lead), Jeremy Ash (team lead), Corinne Walsh, and Grant Vagle with support from Ben Busby and Michael Crusoe.

What is CHERVIL and why do we need it?

Human endogenous retroviral elements (HERVs) are retroviruses that have integrated themselves into the human germline. Usually, they remain latent in the human genome. However, previous work suggests that some HERVs become actively transcribed upon viral infection.

CHERVIL builds on RetroSpotter, an existing pipeline for HERV expression quantification, adding a machine learning component to identify patterns in HERV expression indicative of pre-symptomatic or historic viral infection.

How it works

(pipeline diagram)

At a high level, there are two major phases of the CHERVIL pipeline.

The first is the calculation of HERV expression in different populations. To do this, we use RetroSpotter and Magic-BLAST to align RNA-seq data to known HERVs to quantify HERV expression.

The second phase is the automatic development of a machine learning pipeline that uses expression data to predict disease status. We accomplish this using TPOT to identify HERV expression patterns specific to viral infection.
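To make the idea behind the second phase concrete, here is a toy sketch in Python. A hand-rolled nearest-centroid classifier stands in for the full scikit-learn pipeline that TPOT would actually discover, and the HERV count matrix below is entirely made up:

```python
# Toy illustration of phase two: predicting infection status from per-HERV
# read counts. This is NOT TPOT; a simple nearest-centroid rule stands in
# for the pipeline TPOT would search for. All numbers are invented.

# Rows = samples, columns = read counts for three hypothetical HERV elements.
train_X = [
    [120, 5, 80],   # infected
    [100, 8, 95],   # infected
    [10, 4, 12],    # control
    [14, 6, 9],     # control
]
train_y = ["infected", "infected", "control", "control"]

def centroid(rows):
    """Column-wise mean of a list of count vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# One centroid per class, computed from the training samples of that class.
centroids = {
    label: centroid([x for x, y in zip(train_X, train_y) if y == label])
    for label in set(train_y)
}

def predict(sample):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(sample, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

print(predict([110, 7, 85]))  # infected
print(predict([12, 5, 11]))   # control
```

TPOT replaces this fixed rule with an automated search over preprocessing steps, feature selection, and model choice, exporting the best-scoring pipeline as runnable Python code.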

Installation

Docker

From Docker Hub

We have provided a Docker image with our pipeline pre-installed. To download it (assuming you already have Docker installed), run:

$ docker pull benjamindlee/chervil

From Dockerfile

Alternatively, you can build the image yourself from our Dockerfile:

$ docker build -t chervil .

Manually

Before proceeding, ensure that you have the following installed and functional:

  1. R 3.5 or greater
  2. Python 3.6 or greater
  3. Magic-BLAST
  4. A towel

Next, clone a copy of the repository:

$ git clone https://github.com/NCBI-Hackathons/chervil.git

and then cd into it:

$ cd chervil

Odds are that you will want to run CHERVIL in a virtual environment. If you don't have virtualenv installed, run:

$ pip install virtualenv

And then to set up your shiny new virtual environment:

$ virtualenv env --python=python3.6
$ source env/bin/activate

Next, to install the Python components, run:

(env) $ pip install -r requirements.txt

How to use CHERVIL

  1. Create a BLAST database of HERV elements

    Users will need to create a FASTA file containing the nucleotide sequence of each HERV element. For convenience, we have included a set of known HERV sequences. The makeblastdb.sh command creates a BLAST database:

     $ makeblastdb.sh reference_genome/her_reference.fasta
    

    This creates a directory blastdb containing a reference database called referencedb.

  2. Input accession numbers and their classifications

    This should be in the form of a CSV file that looks something like this:

    SRR123456, infected
    SRR789101, infected
    SRR112131, infected
    SRR415161, control
    SRR718192, control
    SRR021222, control
    

    (note: these are made-up accessions)
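For illustration, a file in this two-column layout can be parsed with nothing but the Python standard library (reusing the same made-up accessions):

```python
import csv
import io

# Example accession CSV in the two-column format shown above
# (accession, classification); these accessions are made up.
csv_text = """\
SRR123456, infected
SRR789101, infected
SRR415161, control
"""

# Parse into (accession, label) pairs, stripping whitespace around fields.
rows = [(acc.strip(), label.strip())
        for acc, label in csv.reader(io.StringIO(csv_text))]

accessions = [acc for acc, _ in rows]
labels = [label for _, label in rows]
print(accessions)  # ['SRR123456', 'SRR789101', 'SRR415161']
print(labels)      # ['infected', 'infected', 'control']
```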

  3. Generate the HERV classification machine learning model

    Assuming you are in a directory with accessions and their classes, run:

     $ chervil.sh [path to SRR accession csv] [path to blast database] [number of cores] [output directory] [prefix for SAM files]
    

    Example usage:

     $ chervil.sh srr_inf_test.csv ../blast_dbs/referencedb 20 out "test"
    

    This command calls multiple scripts that execute the pipeline we have developed.

    • Uses the magicblast command to align RNA-seq reads to the reference BLAST database, generating a SAM file for each accession. (S1_make_acc_file.r, run_jobs.sh)

    • Takes the SAM files and counts the number of reads corresponding to each ERV gene. (count_hits.sh)

    • Organizes the counts into a dataframe that includes all of the samples (by SRR accession), their class (infected, not infected, etc.), and their read count for each ERV gene, written to a CSV file. (S2_orgCountsScript.r)

    • Feeds this dataframe into TPOT, an automated machine learning tool. The model and an HTML report containing a confusion matrix and performance measures for an external data set are then saved for analysis. (S3_generate_classifier.py)
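The read-counting step can be sketched in a few lines of Python. This is a simplified illustration, not the actual count_hits.sh logic; the HERV names and SAM records below are invented. In the SAM format, lines starting with `@` are headers, and field 3 (RNAME) names the reference sequence a read aligned to (`*` if unaligned):

```python
from collections import Counter

# Minimal sketch of per-HERV read counting from SAM records.
# All reads and reference names below are made up for illustration.
sam_lines = [
    "@SQ\tSN:HERV-K\tLN:9472",
    "read1\t0\tHERV-K\t100\t60\t50M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tHERV-K\t220\t60\t50M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tHERV-W\t10\t60\t50M\t*\t0\t0\tACGT\tIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",  # unaligned read
]

counts = Counter()
for line in sam_lines:
    if line.startswith("@"):
        continue                   # skip SAM header lines
    rname = line.split("\t")[2]    # RNAME: the HERV element aligned to
    if rname != "*":               # '*' means the read did not align
        counts[rname] += 1

print(dict(counts))  # {'HERV-K': 2, 'HERV-W': 1}
```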

Troubleshooting

Bug reports should be submitted via the repository's GitHub issue tracker.

If you run into any problems while using CHERVIL, feel free to email Benjamin Lee for support.

Example Dataset

  • PRJNA349748: Human Tracheobronchial Epithelial (HTBE) cells infected with Influenza
    • Data Type: RNA-seq
    • Samples:
      • 10 H1N1, H5N1, and H3N2 infected cells
      • 5 mock-infected controls

Example Report

2-fold cross-validation accuracy: 0.917
Validation accuracy: 0.75

                 Predicted
Actual           infected   not_infected
infected             3           0
not_infected         1           0
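The headline numbers in this report follow directly from the confusion matrix above (rows are actual classes, columns are predictions), which the following snippet verifies:

```python
# Counts from the example report's confusion matrix, taking "infected"
# as the positive class (rows = actual, columns = predicted).
tp, fn = 3, 0   # actual infected:     3 predicted infected, 0 predicted not_infected
fp, tn = 1, 0   # actual not_infected: 1 predicted infected, 0 predicted not_infected

total = tp + fn + fp + tn
accuracy = (tp + tn) / total                         # 0.75: validation accuracy / Overall_ACC
precision = tp / (tp + fp)                           # 0.75: PPV for the infected class
recall = tp / (tp + fn)                              # 1.0:  TPR for the infected class
f1 = 2 * precision * recall / (precision + recall)   # 0.85714: F1 for the infected class

print(accuracy, precision, recall, round(f1, 5))
```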

Overall Statistics:

95% CI (0.32565,1.17435)
Bennett_S 0.5
Chi-Squared None
Chi-Squared DF 1
Conditional Entropy None
Cramer_V None
Cross Entropy None
Gwet_AC1 0.68
Joint Entropy None
KL Divergence None
Kappa 0.0
Kappa 95% CI (-1.69741,1.69741)
Kappa No Prevalence 0.5
Kappa Standard Error 0.86603
Kappa Unbiased -0.14286
Lambda A None
Lambda B None
Mutual Information None
Overall_ACC 0.75
Overall_RACC 0.75
Overall_RACCU 0.78125
PPV_Macro None
PPV_Micro 0.75
Phi-Squared None
Reference Entropy 0.81128
Response Entropy None
Scott_PI -0.14286
Standard Error 0.21651
Strength_Of_Agreement(Altman) Poor
Strength_Of_Agreement(Cicchetti) Poor
Strength_Of_Agreement(Fleiss) Poor
Strength_Of_Agreement(Landis and Koch) Slight
TPR_Macro 0.5
TPR_Micro 0.75

Class Statistics:

Class infected not_infected Description
ACC 0.75 0.75 Accuracy
BM 0.0 0.0 Informedness or bookmaker informedness
DOR None None Diagnostic odds ratio
ERR 0.25 0.25 Error rate
F0.5 0.78947 0.0 F0.5 score
F1 0.85714 0.0 F1 score - harmonic mean of precision and sensitivity
F2 0.9375 0.0 F2 score
FDR 0.25 None False discovery rate
FN 0 1 False negative/miss/type 2 error
FNR 0.0 1.0 Miss rate or false negative rate
FOR None 0.25 False omission rate
FP 1 0 False positive/type 1 error/false alarm
FPR 1.0 0.0 Fall-out or false positive rate
G 0.86603 None G-measure geometric mean of precision and sensitivity
LR+ 1.0 None Positive likelihood ratio
LR- None 1.0 Negative likelihood ratio
MCC None None Matthews correlation coefficient
MK None None Markedness
N 1 3 Condition negative
NPV None 0.75 Negative predictive value
P 3 1 Condition positive
POP 4 4 Population
PPV 0.75 None Precision or positive predictive value
PRE 0.75 0.25 Prevalence
RACC 0.75 0.0 Random accuracy
RACCU 0.76562 0.01562 Random accuracy unbiased
TN 0 3 True negative/correct rejection
TNR 0.0 1.0 Specificity or true negative rate
TON 0 4 Test outcome negative
TOP 4 0 Test outcome positive
TP 3 0 True positive/hit
TPR 1.0 0.0 Sensitivity, recall, hit rate, or true positive rate
