Contributors: Gabriel Steinberg (gsteinb1@binghamton.edu), Kenneth Chiu (kchiu@binghamton.edu), Anna Levenberg (alevenb1@binghamton.edu), Janis Louie (jlouie2@binghamton.edu), Len Kaupert (lkauper1@binghamton.edu)
Past Contributors: Hayden Brown (hbrown10@binghamton.edu), Yan Ma (yma73@binghamton.edu)
- The goal is to find the vector of a disease with an unknown vector by training an SVM on genomes with known vectors.
- Training on complete genomes of diseases vectored by Aedes and Culex mosquitoes
- Testing on simulated reads of complete genomes of diseases vectored by Aedes and Culex mosquitoes
- fasta_parser.py
- Parses input data and returns an array "genomes" with genome objects populated with vector, disease, and sequence for each genome
- k_mer_creator.py
- Populates each genome object with all k-mers for that sequence
- For every k-mer found in any genome object, adds a placeholder entry (k-mer: 0) to every other genome that lacks it, so every genome has an entry for every k-mer that any other genome has.
- Split into training & testing sets
- Make fasta files with testing data for BBMap
- Simulate reads from test genomes
- Find significant k-mers for the testing and training sets and create CSVs with them
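
The k-mer counting and placeholder steps described for k_mer_creator.py can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `count_kmers` and `pad_with_placeholders` are hypothetical, and plain dictionaries stand in for the genome objects.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def pad_with_placeholders(kmer_counts):
    """Give each genome a 0 entry for every k-mer it lacks but another genome has."""
    all_kmers = set()
    for counts in kmer_counts:
        all_kmers.update(counts)
    return [
        {kmer: counts.get(kmer, 0) for kmer in sorted(all_kmers)}
        for counts in kmer_counts
    ]


# Two toy sequences standing in for complete genomes (hypothetical data).
sequences = ["ACGTAC", "ACGGGT"]
padded = pad_with_placeholders([count_kmers(s, 3) for s in sequences])
```

After padding, every genome has the same set of keys, so the counts can be written out as rows of a CSV with one column per k-mer.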
- kruskal_wallis.py
- Tests whether a k-mer can help distinguish two classes by analyzing the difference in the medians
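
As an illustration of the test, here is a from-scratch sketch of the Kruskal-Wallis H statistic for two groups, using the chi-square approximation with one degree of freedom for the p-value. The function names and the toy counts are assumptions made for this example; kruskal_wallis.py may be implemented differently, and `scipy.stats.kruskal` provides a library equivalent.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for two samples, with tie correction."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (
        rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b)
    ) - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in Counter(pooled).values()) / (n ** 3 - n)
    if correction == 0:
        return 0.0, 1.0  # every value is identical, so nothing to distinguish
    h = max(h / correction, 0.0)
    # Survival function of chi-square with 1 degree of freedom.
    p = math.erfc(math.sqrt(h / 2))
    return h, p


# Hypothetical counts of one k-mer in four Aedes and four Culex genomes.
aedes_counts = [5, 6, 7, 8]
culex_counts = [1, 2, 3, 4]
h_stat, p_value = kruskal_wallis_two_groups(aedes_counts, culex_counts)
```

A k-mer would be kept as "significant" when `p_value` falls below a chosen threshold; the threshold itself is a project decision not stated here.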
- SVM.py
- Classification using a Support Vector Machine
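
The classification step might look roughly like the sketch below, which fits scikit-learn's `SVC` on a toy k-mer count matrix. The data, the linear kernel, and the variable names are assumptions for illustration only; SVM.py reads its features from the generated CSVs and may use different settings.

```python
from sklearn.svm import SVC

# Toy matrix: one row per genome, one column per significant k-mer (made-up counts).
X_train = [[5, 0, 2], [6, 1, 3], [4, 0, 2], [0, 5, 9], [1, 6, 8], [0, 4, 9]]
y_train = ["Aedes", "Aedes", "Aedes", "Culex", "Culex", "Culex"]

clf = SVC(kernel="linear")  # kernel choice is an assumption
clf.fit(X_train, y_train)

# Two unseen samples, e.g. k-mer counts derived from simulated reads.
predictions = [str(label) for label in clf.predict([[5, 1, 2], [0, 5, 8]])]
```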
- Clone this repository into your home directory
- Install Biopython and scikit-learn
- https://biopython.org/
- https://scikit-learn.org/stable/
- To install the dependencies, use
pip3 install -r requirements.txt
- For Windows: Install Java in your Linux subsystem for the BBMap script to work
- Run
python3 current_code/k_mer_creator.py
to generate files
- Run
SVM.py
to classify
- Need a way for the SVM to return an "I don't know" (e.g. based on distance from the separating hyperplane)
- Turn reads into contigs and test on those
- Convert some nested for loops to list comprehensions for speed and readability
- Add a script for one-hot encoding
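
One possible approach to the "I don't know" item above is to abstain whenever a sample's score from `decision_function` is close to zero, meaning it lies near the decision boundary. This is only a sketch under assumptions: the helper `predict_with_abstention`, the `min_margin` threshold, and the toy data are hypothetical, and it handles the two-class case only.

```python
from sklearn.svm import SVC


def predict_with_abstention(clf, samples, min_margin=0.5):
    """Return a class label per sample, or "unknown" when the score is near 0."""
    labels = []
    for score in clf.decision_function(samples):
        if abs(score) < min_margin:
            labels.append("unknown")
        elif score > 0:
            labels.append(str(clf.classes_[1]))  # positive side of the boundary
        else:
            labels.append(str(clf.classes_[0]))  # negative side of the boundary
    return labels


# One made-up feature per genome, well separated between the two vectors.
clf = SVC(kernel="linear").fit([[0], [1], [9], [10]],
                               ["Aedes", "Aedes", "Culex", "Culex"])
labels = predict_with_abstention(clf, [[0], [5], [10]])
```

The right value for `min_margin` would have to be tuned on held-out data; 0.5 is just a placeholder.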
Janis
- Reproduce stats for full genomes and record them
- Reproduce stats for reads (should have high error rate)
- Produce results for classification with reads
Anna
- Fix the CNN input shape sizes. Get the CNN working for 300 samples
- Get it to work for a bunch of genomes of different sizes