Project for the Machine Learning course of the "Bioinformatics for Computational Genomics" MSc.
The notebook can be viewed here.
This project uses codon usage frequencies from the genomes of various species to classify samples into their respective kingdoms. The primary goal was to explore how well codon usage frequencies can assign organisms to 11 distinct kingdoms: archaea, bacteria, bacteriophage, plasmid, plant, invertebrate, vertebrate, mammal, rodent, primate, and virus. The analysis applies and evaluates the clustering, classification, and regression techniques covered in the course.
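As a rough illustration of the idea (not the code used in the notebook), the sketch below turns a coding sequence into a 64-dimensional vector of codon fractions and classifies it with a simple nearest-centroid rule. The function names, the toy sequences, and their labels are all hypothetical and have no biological meaning. The nearest-centroid rule is a stand-in chosen for brevity and is not necessarily one of the models assessed in the project.

```python
from collections import Counter
from itertools import product
from math import dist

# All 64 codons in a fixed order so every sample shares the same feature layout.
# Assumption: codons are written in the RNA alphabet (A, C, G, U).
CODONS = ["".join(p) for p in product("ACGU", repeat=3)]


def codon_frequencies(cds):
    """Return the fraction of each of the 64 codons in a coding sequence."""
    seq = cds.upper().replace("T", "U")
    # Read the sequence in frame, ignoring a trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if c in CODONS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid codons found")
    return [counts[c] / total for c in CODONS]


def fit_centroids(X, y):
    """Average the feature vectors of each class (nearest-centroid training)."""
    grouped = {}
    for features, label in zip(X, y):
        grouped.setdefault(label, []).append(features)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical toy sequences and labels, invented purely for illustration.
train_seqs = [
    "AUGGCCGCCGCGGCC",
    "AUGGCGGCCGCCGCG",
    "AUGAAAUUUAAAUUA",
    "AUGUUUAAAUUAAAA",
]
train_labels = ["bacteria", "bacteria", "virus", "virus"]

X_train = [codon_frequencies(s) for s in train_seqs]
centroids = fit_centroids(X_train, train_labels)
prediction = predict(centroids, codon_frequencies("AUGGCCGCGGCCGCC"))
print(prediction)
```

In practice the features would come from the precomputed codon frequencies in the dataset rather than from raw sequences, and the same feature layout could be fed to any classifier or clustering method.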
Further information about the project can be found in the specification file. The data folder contains the training and test datasets.
Khomtchouk, Bohdan B. "Codon usage bias levels predict taxonomic identity and genetic composition." bioRxiv (2020)