VirLab

Contributors: Gabriel Steinberg (gsteinb1@binghamton.edu), Kenneth Chiu (kchiu@binghamton.edu), Anna Levenberg (alevenb1@binghamton.edu), Janis Louie (jlouie2@binghamton.edu), Len Kaupert (lkauper1@binghamton.edu)

Past Contributors: Hayden Brown (hbrown10@binghamton.edu), Yan Ma (yma73@binghamton.edu)

Vector identification in metagenomic data using k-mers

The goal is to identify the vector of a disease whose vector is unknown by training a support vector machine (SVM) on genomes whose vectors are known.
Training: complete genomes of diseases vectored by Aedes and Culex mosquitoes
Testing: simulated reads from complete genomes of diseases vectored by Aedes and Culex mosquitoes

Main Files

fasta_parser.py
- Parses input data and returns an array "genomes" with genome objects populated with vector, disease, and sequence for each genome
k_mer_creator.py
- Populates each genome object with all k-mers for that sequence
- For every k-mer found in any genome, adds a placeholder entry (k-mer: 0) to each genome that lacks it, so every genome has an entry for every k-mer that any other genome has
- Splits the genomes into training and testing sets
- Writes FASTA files of the testing data for BBMap
- Simulates reads from the test genomes
- Finds significant k-mers for the testing and training sets and writes them to CSV files
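
The counting and placeholder steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in k_mer_creator.py; the function names, the toy sequences, and k=3 are assumptions made for the example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def pad_with_placeholders(kmer_counts):
    """Give every genome an entry (count 0) for each k-mer seen in any genome."""
    vocabulary = sorted(set().union(*kmer_counts.values()))
    return {
        name: {kmer: counts.get(kmer, 0) for kmer in vocabulary}
        for name, counts in kmer_counts.items()
    }


# Toy sequences for illustration only, not real genomes.
genomes = {"genome_a": "ACGTAC", "genome_b": "GTACGG"}
counts = {name: count_kmers(seq, k=3) for name, seq in genomes.items()}
padded = pad_with_placeholders(counts)

for name, row in padded.items():
    print(name, row)
```

After padding, every genome shares the same set of k-mer keys, so the rows can be written to a CSV or stacked into a feature matrix without further alignment.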
kruskal_wallis.py
- Tests whether a k-mer can help distinguish two classes by analyzing the difference in the medians
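
A k-mer filter of this kind might look like the sketch below. It assumes SciPy is installed and uses scipy.stats.kruskal; whether kruskal_wallis.py relies on SciPy is not stated here. The function name, the 0.05 threshold, and the toy counts are hypothetical.

```python
from scipy.stats import kruskal


def significant_kmers(counts_by_class, alpha=0.05):
    """Return the k-mers whose counts differ between classes.

    counts_by_class maps each k-mer to {class_name: [count per genome]}.
    alpha is an illustrative significance threshold, not a project setting.
    """
    selected = []
    for kmer, groups in counts_by_class.items():
        samples = list(groups.values())
        # The test is undefined when every observation is identical, so skip.
        if len({value for sample in samples for value in sample}) < 2:
            continue
        _, p_value = kruskal(*samples)
        if p_value < alpha:
            selected.append(kmer)
    return selected


# Toy counts for illustration only.
toy_counts = {
    "ACG": {"Aedes": [10, 12, 11, 13, 12], "Culex": [1, 0, 2, 1, 3]},
    "TTT": {"Aedes": [5, 6, 5, 7, 6], "Culex": [6, 5, 7, 5, 6]},
}
kept = significant_kmers(toy_counts)
print(kept)
```

Only k-mers that pass the filter would then be kept as features for the classifier.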
SVM.py
- Classification using a support vector machine

Instructions

Clone this repository into your home directory
Install BioPython and scikit-learn
- https://biopython.org/
- https://scikit-learn.org/stable/
- To install the dependencies, use pip3 install -r requirements.txt
On Windows: install Java in your Linux subsystem so that the BBMap script works
Run python3 current_code/k_mer_creator.py to generate files
Run SVM.py to classify

TODO

Need a way for the SVM to return an "I don't know" - distance from the vector
Turn reads into contigs and test on those
Convert some nested for loops to list comprehensions for speed and readability
Add a script for one-hot encoding
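
For the first TODO item, one possible approach (a sketch, not something SVM.py is known to do) is to abstain whenever a sample lies too close to the decision boundary. The example uses scikit-learn's SVC.decision_function, whose absolute value is proportional to the distance from the separating hyperplane for a linear kernel. The helper name, the min_margin value, and the toy feature matrix are all hypothetical.

```python
import numpy as np
from sklearn.svm import SVC


def predict_with_abstention(model, X, min_margin=0.5):
    """Predict labels, replacing low-confidence ones with "unknown".

    min_margin is an illustrative cut-off on |decision_function|; a real
    value would need tuning on held-out data. Binary classifiers only.
    """
    scores = model.decision_function(X)
    labels = model.predict(X).astype(object)
    labels[np.abs(scores) < min_margin] = "unknown"
    return labels


# Toy k-mer count features for illustration only.
X_train = np.array(
    [[10, 0], [9, 1], [8, 0], [0, 10], [1, 9], [0, 8]], dtype=float
)
y_train = np.array(["Aedes"] * 3 + ["Culex"] * 3)
model = SVC(kernel="linear").fit(X_train, y_train)

X_test = np.array([[9.0, 0.0], [5.0, 5.0], [0.0, 9.0]])
result = predict_with_abstention(model, X_test)
print(list(result))
```

The middle test sample sits on the boundary between the two toy classes, so it is the one that falls under the margin and is reported as "unknown".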

Janis

Reproduce stats for full genomes and record them
Reproduce stats for reads (should have high error rate)
Produce results for classification with reads

Anna

Fix the CNN input shape sizes. Get the CNN working for 300 samples
Get it to work for a bunch of genomes of different sizes
