Skip to content

b1b2ttt/Bioinformatics_Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

Bioinformatics_Project

Compression Algorithms based on BWT and Wavelet Tree for Large K-mers Storage Problem

Calculating k-mers, a substring of length k in DNA sequence data, is an important part of many methods in bioinformatics. Although this rule is simple to understand, computing k-mers in large modern sequential data sets can easily exceed the storage capacity of standard computers. When k is large, the amount of storage is too large and thus calculation speed becomes slower because the value is too large.

In our project, we first studied and applied the BWT transform algorithm and Huffman coding to pre-process k-mer data, and then implemented wavelet tree structure and statistically analyzed the distribution of continuous 0 or 1 vectors in wavelet tree nodes. We designed a suitable Hybrid coding storage structure, in which we applied direct coding, run length and Elias Gamma coding methods, and conducted experiments on different single coding methods regarding to compression rate. Furthermore, k-mers compression was realized by adopting the multi-threaded parallel thinking and compression algorithm principle, and also we adjusted parameters of our program through test data to achieve better compression performance.

About

Compression Algorithms based on BWT and Wavelet Tree for Large K-mers Storage Problem

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages