Recent studies in DNA sequence classification have leveraged sophisticated machine learning techniques, achieving notable accuracy in categorizing complex genomic data. Methods such as k-mer counting have proven effective in distinguishing sequences from species such as chimpanzees, dogs, and humans, and are widely used in the recent literature. However, these approaches often demand extensive computational resources. Our study introduces a novel adaptation of Jiang et al.'s compressor-based, parameter-free classification method, specifically tailored for DNA sequence analysis. This approach not only matches the current state of the art in accuracy but also offers a more resource-efficient alternative to traditional machine learning methods. We evaluated the method with several compression algorithms (Gzip, Snappy, Brotli, LZ4, Zstandard, BZ2, and LZMA), and our results demonstrate that it is comparably effective in classifying DNA sequences.
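
To make the idea concrete, here is a minimal sketch of compressor-based classification in the spirit of Jiang et al.: the normalized compression distance (NCD, from Li et al., "The similarity metric") followed by a k-nearest-neighbor vote. It is an illustration under assumptions, not the code in this repository: the function names `compressed_len`, `ncd`, and `classify` are hypothetical, Gzip stands in for whichever compressor is being evaluated, and sequences are concatenated directly without a separator.

```python
import gzip
from collections import Counter


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # assumption: plain concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, train: list[tuple[str, int]], k: int = 3) -> int:
    """Majority label among the k training sequences closest to query."""
    nearest = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data only; real inputs would be the labeled DNA sequences.
    toy_train = [
        ("ACGT" * 100, 0),
        ("ACGT" * 90, 0),
        ("TTGACA" * 70, 1),
        ("TTGACA" * 60, 1),
    ]
    print(classify("ACGT" * 95, toy_train, k=3))
```

Every query is compared against every training sequence, so the cost grows with the product of the test and training set sizes. Swapping `gzip.compress` for another compressor's compress function changes only `compressed_len`.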
git clone https://github.com/sukruozan/dna-sequence-classification.git
Conda is a cross-platform package and environment manager. To replicate the environment used here, run the following:
conda env create -f environment.yml
- The original dataset is available in the Kaggle post "Demystify DNA Sequencing with Machine Learning".
- Here I combined all the data into a single training set and a single test set. To reproduce the same results, use the corresponding dataset in the `dataset` folder of this repository.
Species | Train Size | Test Size |
---|---|---|
Chimpanzee | 1345 | 337 |
Human | 3504 | 876 |
Dog | 656 | 164 |
Total | 5505 | 1377 |
Gene Family | Class Label | Chimpanzee | Human | Dog | Total |
---|---|---|---|---|---|
G protein coupled receptors | 0 | 234 | 531 | 131 | 896 |
Tyrosine kinase | 1 | 185 | 534 | 75 | 794 |
Tyrosine phosphatase | 2 | 144 | 349 | 64 | 557 |
Synthetase | 3 | 228 | 672 | 95 | 995 |
Synthase | 4 | 261 | 711 | 135 | 1107 |
Ion channel | 5 | 109 | 240 | 60 | 409 |
Transcription factor | 6 | 521 | 1343 | 260 | 2124 |
Figure: Class distribution for all species, human, chimpanzee, and dog.
Algorithm | Computation Time (seconds) | Accuracy | Recall | Precision | F1 Score |
---|---|---|---|---|---|
Gzip | 1735.81 | 0.962 | 0.962 | 0.963 | 0.962 |
Snappy | 1726.49 | 0.932 | 0.932 | 0.933 | 0.932 |
Brotli | 12551.60 | 0.966 | 0.966 | 0.967 | 0.966 |
LZ4 | 1618.27 | 0.942 | 0.942 | 0.943 | 0.942 |
Zstandard | 1560.72 | 0.930 | 0.930 | 0.935 | 0.931 |
BZ2 | 2657.35 | 0.924 | 0.924 | 0.924 | 0.924 |
LZMA | 9486.67 | 0.958 | 0.958 | 0.958 | 0.958 |
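
The trade-off between computation time and accuracy in the table above can be read off with a little arithmetic. The sketch below copies the time and accuracy columns from the table and expresses each compressor relative to Gzip; the helper name `compare_to` is hypothetical and Gzip is chosen as the baseline only for illustration.

```python
# (computation time in seconds, accuracy), copied from the table above
RESULTS = {
    "Gzip": (1735.81, 0.962),
    "Snappy": (1726.49, 0.932),
    "Brotli": (12551.60, 0.966),
    "LZ4": (1618.27, 0.942),
    "Zstandard": (1560.72, 0.930),
    "BZ2": (2657.35, 0.924),
    "LZMA": (9486.67, 0.958),
}


def compare_to(baseline: str, name: str) -> tuple[float, float]:
    """Return (time ratio, accuracy difference) of name relative to baseline."""
    base_time, base_acc = RESULTS[baseline]
    time_s, acc = RESULTS[name]
    return time_s / base_time, acc - base_acc


# List compressors from fastest to slowest, relative to Gzip.
for name in sorted(RESULTS, key=lambda n: RESULTS[n][0]):
    ratio, gain = compare_to("Gzip", name)
    print(f"{name:<10} {ratio:5.2f}x Gzip time, accuracy {gain:+.3f}")
```

By this measure Brotli, the most accurate compressor, takes roughly 7.2 times as long as Gzip for an accuracy gain of 0.004.
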
Algorithm | Chimpanzee Accuracy | Chimpanzee Precision | Chimpanzee Recall | Chimpanzee F1 Score | Human Accuracy | Human Precision | Human Recall | Human F1 Score | Dog Accuracy | Dog Precision | Dog Recall | Dog F1 Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Gzip | 0.997 | 0.997 | 0.997 | 0.997 | 0.944 | 0.945 | 0.944 | 0.944 | 0.988 | 0.989 | 0.988 | 0.988 |
Snappy | 0.994 | 0.994 | 0.994 | 0.994 | 0.943 | 0.944 | 0.943 | 0.943 | 0.744 | 0.748 | 0.744 | 0.745 |
Brotli | 1.000 | 1.000 | 1.000 | 1.000 | 0.947 | 0.950 | 0.947 | 0.948 | 0.994 | 0.994 | 0.994 | 0.994 |
LZ4 | 0.997 | 0.997 | 0.997 | 0.997 | 0.944 | 0.946 | 0.944 | 0.944 | 0.817 | 0.829 | 0.817 | 0.819 |
Zstandard | 0.994 | 0.994 | 0.994 | 0.994 | 0.900 | 0.909 | 0.900 | 0.901 | 0.963 | 0.967 | 0.963 | 0.963 |
BZ2 | 0.985 | 0.985 | 0.985 | 0.985 | 0.893 | 0.894 | 0.893 | 0.893 | 0.963 | 0.966 | 0.963 | 0.964 |
LZMA | 1.000 | 1.000 | 1.000 | 1.000 | 0.936 | 0.937 | 0.936 | 0.936 | 0.988 | 0.989 | 0.988 | 0.988 |
Figure: Confusion matrix for all species, human, chimpanzee, and dog.
Confusion matrices for the species-level classifications: human, chimpanzee, and dog DNA. These matrices detail the classifier's accuracy for each species, highlighting its precision in distinguishing between these specific genomic sequences. The results shown are from the experiments run with the Brotli compressor.
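
For readers who want to see how such a matrix relates to the scores in the tables, here is a small pure-Python sketch that builds a confusion matrix from true and predicted class labels and derives accuracy, precision, recall, and F1. It assumes the reported scores are support-weighted averages over the gene-family classes; the text does not state this, but it is consistent with the Accuracy and Recall columns being identical in every row. The function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def weighted_scores(matrix):
    """Return (accuracy, precision, recall, f1), weighted by class support."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    precision = recall = f1 = 0.0
    for i in range(n):
        support = sum(matrix[i])                         # true members of class i
        predicted = sum(matrix[r][i] for r in range(n))  # predictions of class i
        p = matrix[i][i] / predicted if predicted else 0.0
        r = matrix[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


if __name__ == "__main__":
    # Toy labels; the real task uses the seven class labels 0 to 6.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    print(weighted_scores(confusion_matrix(y_true, y_pred, n_classes=3)))
```

With support weighting, each class's recall is multiplied by its share of the samples, so the weighted recall reduces to the fraction of correct predictions, which is the accuracy.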
If you need help or have a question, raise an issue or contact me at sukruozan@gmail.com.
- Sukru Ozan
- Download the manuscript from arXiv.

To cite this work in your publications, use the following BibTeX entry:
@article{ozan2024dna,
title={DNA Sequence Classification with Compressors},
journal={arXiv},
author={Şükrü Ozan},
year={2024},
eprint={2401.14025},
archivePrefix={arXiv},
primaryClass={q-bio.GN},
doi={10.48550/arXiv.2401.14025},
}
- Jaddi, N. S., & Saniee Abadeh, M. (2022). "Cell separation algorithm with enhanced search behaviour in miRNA feature selection for cancer diagnosis." Information Systems, 104, 101906. DOI
- Khan, S., Khan, M., Iqbal, N., Li, M., & Khan, D. M. (2020). "Spark-Based Parallel Deep Neural Network Model for Classification of Large Scale RNAs into piRNAs and Non-piRNAs." IEEE Access, 8, 136978–136991. DOI
- Yagin, F. H., et al. (2023). "Explainable artificial intelligence model for identifying COVID-19 gene biomarkers." Computers in Biology and Medicine, 154, 106619. DOI
- Wen, J., et al. (2019). "A classification model for lncRNA and mRNA based on k-mers and a convolutional neural network." BMC Bioinformatics, 20(1). DOI
- Millán Arias, P., et al. (2022). "DeLUCS: Deep learning for unsupervised clustering of DNA sequences." PLOS ONE, 17(1), e0261531. DOI
- Bentley, J. L., et al. (1986). "A locally adaptive data compression scheme." Communications of the ACM, 29(4), 320–330. DOI
- Burrows, M., & Wheeler, D. J. (1994). "A block-sorting lossless data compression algorithm." SRC Research Report, 124.
- Huffman, D. (1952). "A Method for the Construction of Minimum-Redundancy Codes." Proceedings of the IRE, 40(9), 1098–1101. DOI
- Ziv, J., & Lempel, A. (1977). "A universal algorithm for sequential data compression." IEEE Transactions on Information Theory, 23(3), 337–343. DOI
- Alberts, B. (2014). "Molecular biology of the cell." 6th ed. New York, NY: Garland Publishing.
- Li, M., et al. (2004). "The similarity metric." IEEE Transactions on Information Theory, 50(12), 3250–3264. DOI
- Jiang, Z., et al. (2023). "Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors." Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada. Link
- Juneja, S., et al. (2022). "An Approach to DNA Sequence Classification Through Machine Learning: DNA Sequencing, K Mer Counting, Thresholding, Sequence Analysis." International Journal of Reliable and Quality E-Healthcare, 11(2), 1–15. DOI
- Orozco-Arias, S., et al. (2021). "K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes." PeerJ, 9, e11456. DOI
- Sarkar, B. K., et al. (2021). "Determination of k-mer density in a DNA sequence and subsequent cluster formation algorithm based on the application of electronic filter." Scientific Reports, 11(1). [DOI](http://dx.doi.org/10.
- Ozan, S. (2023). "DNA Sequence Classification." GitHub Repository
- Singh, N. (2023). "Demystify DNA Sequencing with Machine Learning." Kaggle Notebook
- Singh, N. (2023). "DNA Sequence Dataset." Kaggle Dataset