Virus Identification in metagenomics using k-mers

GSteinberg/VirLab

VirLab

Contributors: Gabriel Steinberg (gsteinb1@binghamton.edu), Kenneth Chiu (kchiu@binghamton.edu), Anna Levenberg (alevenb1@binghamton.edu), Janis Louie (jlouie2@binghamton.edu), Len Kaupert (lkauper1@binghamton.edu)

Past Contributors: Hayden Brown (hbrown10@binghamton.edu), Yan Ma (yma73@binghamton.edu)

Vector identification in metagenomic data using k-mers

  • Goal: predict the vector of a disease whose vector is unknown by training an SVM on genomes whose vectors are known.
  • Training data: complete genomes of diseases vectored by Aedes and Culex mosquitoes
  • Testing data: simulated reads from complete genomes of diseases vectored by Aedes and Culex mosquitoes

Main Files

  1. fasta_parser.py
    • Parses the input data and returns an array, "genomes", of genome objects, each populated with the vector, disease, and sequence for that genome
  2. k_mer_creator.py
    • Populates each genome object with all k-mers for that sequence
    • For every k-mer found in any genome object, adds a placeholder entry (k-mer: 0) to each other genome that lacks it, so every genome has an entry for every k-mer that any other genome has
    • Splits the genomes into training and testing sets
    • Makes FASTA files from the testing data for BBMap
    • Simulates reads from the test genomes
    • Finds significant k-mers for the testing and training sets and writes them to CSV files
  3. kruskal_wallis.py
    • Tests whether a k-mer can help distinguish two classes by analyzing the difference in the medians
  4. SVM.py
    • Classification using a Support Vector Machine
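
The k-mer counting and placeholder step described for k_mer_creator.py can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names count_kmers and pad_with_placeholders, the use of plain dictionaries instead of genome objects, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in one sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def pad_with_placeholders(counts_per_genome):
    """Give every genome an entry (count 0) for each k-mer seen in any genome."""
    # Union of all k-mers across genomes, sorted so every genome
    # ends up with its k-mers in the same column order.
    vocabulary = sorted(set().union(*counts_per_genome))
    return [
        {kmer: counts.get(kmer, 0) for kmer in vocabulary}
        for counts in counts_per_genome
    ]


# Hypothetical toy sequences, not real genomes.
genomes = ["ACGTAC", "GTACGG"]
counts = [count_kmers(seq, k=3) for seq in genomes]
padded = pad_with_placeholders(counts)

for row in padded:
    print(row)
```

After padding, every genome has the same keys in the same order, so each one can be written out as a row of a fixed-width feature table.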

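As a rough illustration of the test named in kruskal_wallis.py, the sketch below computes the Kruskal-Wallis H statistic for one k-mer's counts in two classes using only the standard library. It is an assumption-laden example, not the repository's implementation: the function names are hypothetical, no tie correction is applied to H, and the p-value uses the chi-square approximation with one degree of freedom, which is only approximate for small groups.

```python
import math


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_two_groups(group_a, group_b):
    """Return (H, p) for two groups of counts; H is not tie-corrected."""
    pooled = list(group_a) + list(group_b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(group_a)])
    rank_sum_b = sum(ranks[len(group_a):])
    h = (12 / (n * (n + 1))) * (
        rank_sum_a ** 2 / len(group_a) + rank_sum_b ** 2 / len(group_b)
    ) - 3 * (n + 1)
    # Chi-square survival function with 1 degree of freedom.
    p = math.erfc(math.sqrt(max(h, 0.0) / 2))
    return h, p


# Hypothetical counts of one k-mer in genomes from two vector classes.
h, p = kruskal_wallis_two_groups([1, 2, 3], [4, 5, 6])
print(h, p)
```

A k-mer with a small p-value would be a candidate "significant" k-mer; the cutoff to use is a design choice that this sketch does not fix.
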
Instructions

  1. Clone this repository into your home directory
  2. Install Biopython and scikit-learn
  3. For Windows: install Java in your Linux subsystem so that the BBMap script works
  4. Run python3 current_code/k_mer_creator.py to generate files
  5. Run SVM.py to classify

TODO

  • Need a way for the SVM to return an "I don't know" result - distance from the vector
  • Turn reads into contigs and test on those
  • Convert some nested for loops to list comprehensions for speed and readability
  • Add a script for one-hot encoding
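
For the one-hot encoding item above, a minimal sketch of what such a script might do is shown here. The function name one_hot_encode, the A, C, G, T column order, and the choice to encode ambiguous bases such as N as all zeros are assumptions for illustration, not decisions made in this repository.

```python
BASES = "ACGT"  # assumed column order


def one_hot_encode(sequence):
    """Map each base to a 4-element 0/1 row; unknown bases (e.g. N) become all zeros."""
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


# Hypothetical toy sequence.
for row in one_hot_encode("ACN"):
    print(row)
```

Each sequence becomes a matrix with one row per base and four columns, so sequences of different lengths produce matrices of different heights.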

Janis

  • Reproduce stats for full genomes and record them
  • Reproduce stats for reads (should have high error rate)
  • Produce results for classification with reads

Anna

  • Fix the CNN input shape sizes. Get the CNN working for 300 samples
  • Get it to work for a bunch of genomes of different sizes
