Skip to content

felipelouza/bwsd

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

76 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

bwsd:

This software is an implementation of the algorithms described in [1] to compute all-pairs of Burrows-Wheeler similarity distributions (BWSD) for a string collection.

Given a collection of d strings, bwsd computes a matrix Mdxd with all pairs of BWSD-based distances.

The Burrows-Wheeler transform (BWT) and the Document array (DA) are computed using gsaca-k [2].

install

git clone https://github.com/felipelouza/bwsd
cd bwsd
make compile

Note: our implementaion needs sdsl-lite.

run

Given a collection of d strings in a single file FILE.

./bwsd [options] FILE d

Available options:

-h    this help message
-A a  preferred algorithm to use (default is alg. 1 BIT_sd)
-B b  BWSD-based distance to compute, options: 1. expectation (default), 2. Shannon entropy
-T t  use t parallel threads (default 1)
-o    write output matrix to FILE.output.bin
-p    print the output matrix (for debug)
-v    verbose output

Notes:

  • d=0 gives all strings as input.
  • Supported extensions are .txt, .fasta and .fastq.
  • If the input is changed, please run make remove DIR=dataset/, to rebuild the BWTs.

quick test

To run a test with d=10 strings from dataset/input.100.txt using Alg. 1 -A 1, writing the output to disk option -o, type:

./bwsd dataset/input.100.txt 10 -A 1 -o

Note: output matrix is written to FILE.output.bin

options

To see the resulting Mdxd, computing distances based on the Shannon entropy of BWSD option -B 2, use option -p:

./bwsd dataset/input.100.txt 10 -A 1 -o -B 2 -p

Result:

## BWSD_BIT_sd ##
writing 360 bytes to: input.100.txt.output.bin
0.000	0.000	0.684	1.459	2.156	2.128	1.750	2.156	1.896	1.299
0.000	0.000	0.684	1.459	2.156	2.128	1.750	2.156	1.896	1.299
0.684	0.684	0.000	1.614	1.811	1.224	1.750	1.750	1.614	1.299
1.459	1.459	1.614	0.000	2.250	2.059	2.156	2.156	0.959	1.750
2.156	2.156	1.811	2.250	0.000	2.322	0.944	2.020	2.522	1.686
2.128	2.128	1.224	2.059	2.322	0.000	1.792	2.252	1.961	1.459
1.750	1.750	1.750	2.156	0.944	1.792	0.000	1.627	2.522	1.392
2.156	2.156	1.750	2.156	2.020	2.252	1.627	0.000	1.753	1.506
1.896	1.896	1.614	0.959	2.522	1.961	2.522	1.753	0.000	2.000
1.299	1.299	1.299	1.750	1.686	1.459	1.392	1.506	2.000	0.000

Notes:

  • We compute only the (d2-d)/2 entries of Mdxd (upper triangular matrix), which can be accessed as here

alternatives (Alg. 1)

Using uncompressed bitvectors (bit_vector):

make SD_VECTOR=0
./bwsd [options] FILE d -A 1

Using wavelet tree:

make WT=1
./bwsd [options] FILE d -A 1

Note: Alg. 1 uses compressed bitvectors (BIT_sd) as default.

debug

To see a more detailed execution use:

make clean
make DEBUG=1

citation

Please, if you use this tool in an academic setting cite the following paper [1]:

@article{LouzaTGZ19,
  author    = {Louza, Felipe A. and Telles, Guilherme P. and Gog, Simon and Zhao, Liang},
  title     = {Algorithms to compute the Burrows-Wheeler Similarity Distribution},
  journal   = {Theor. Comput. Sci.},
  volume    = {782},
  pages     = {145-156},
  year      = {2019},
  url       = {https://www.sciencedirect.com/science/article/pii/S0304397519301653},
  doi       = {10.1016/j.tcs.2019.03.012},
}

datasets

We have used the following datasets in our experiments: https://github.com/felipelouza/bwsd/blob/master/experiments/dataset.tar.gz

references

[1] Louza, Felipe A., Telles, Guilherme P., Gog, Simon, Liang Zhao (2019): Algorithms to compute the Burrows-Wheeler Similarity Distribution. Theor. Comput. Sci. 782: 145-156 [ElsevierLink]

[2] Louza, Felipe A., Gog, Simon, Telles, Guilherme P. (2017). Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678: 22-39, github.

About

Algorithms to compute the Burrows-Wheeler Similarity Distributions [SPIRE'18, TCS 2019]

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published