bwsd:

This software is an implementation of the algorithms described in [1] to compute all-pairs of Burrows-Wheeler similarity distributions (BWSD) for a string collection.

Given a collection of d strings, bwsd computes a matrix M_dxd with all pairs of BWSD-based distances.

The Burrows-Wheeler transform (BWT) and the Document array (DA) are computed using gsaca-k [2].

install

git clone https://github.com/felipelouza/bwsd
cd bwsd
make compile

Note: our implementaion needs sdsl-lite.

run

Given a collection of d strings in a single file FILE.

./bwsd [options] FILE d

Available options:

-h    this help message
-A a  preferred algorithm to use (default is alg. 1 BIT_sd)
-B b  BWSD-based distance to compute, options: 1. expectation (default), 2. Shannon entropy
-T t  use t parallel threads (default 1)
-o    write output matrix to FILE.output.bin
-p    print the output matrix (for debug)
-v    verbose output

Notes:

d=0 gives all strings as input.
Supported extensions are .txt, .fasta and .fastq.
If the input is changed, please run make remove DIR=dataset/, to rebuild the BWTs.

quick test

To run a test with d=10 strings from dataset/input.100.txt using Alg. 1 -A 1, writing the output to disk option -o, type:

./bwsd dataset/input.100.txt 10 -A 1 -o

Note: output matrix is written to FILE.output.bin

options

To see the resulting M_dxd, computing distances based on the Shannon entropy of BWSD option -B 2, use option -p:

./bwsd dataset/input.100.txt 10 -A 1 -o -B 2 -p

Result:

## BWSD_BIT_sd ##
writing 360 bytes to: input.100.txt.output.bin
0.000	0.000	0.684	1.459	2.156	2.128	1.750	2.156	1.896	1.299
0.000	0.000	0.684	1.459	2.156	2.128	1.750	2.156	1.896	1.299
0.684	0.684	0.000	1.614	1.811	1.224	1.750	1.750	1.614	1.299
1.459	1.459	1.614	0.000	2.250	2.059	2.156	2.156	0.959	1.750
2.156	2.156	1.811	2.250	0.000	2.322	0.944	2.020	2.522	1.686
2.128	2.128	1.224	2.059	2.322	0.000	1.792	2.252	1.961	1.459
1.750	1.750	1.750	2.156	0.944	1.792	0.000	1.627	2.522	1.392
2.156	2.156	1.750	2.156	2.020	2.252	1.627	0.000	1.753	1.506
1.896	1.896	1.614	0.959	2.522	1.961	2.522	1.753	0.000	2.000
1.299	1.299	1.299	1.750	1.686	1.459	1.392	1.506	2.000	0.000

Notes:

We compute only the (d²-d)/2 entries of M_dxd (upper triangular matrix), which can be accessed as here

alternatives (Alg. 1)

Using uncompressed bitvectors (bit_vector):

make SD_VECTOR=0
./bwsd [options] FILE d -A 1

Using wavelet tree:

make WT=1
./bwsd [options] FILE d -A 1

Note: Alg. 1 uses compressed bitvectors (BIT_sd) as default.

debug

To see a more detailed execution use:

make clean
make DEBUG=1

citation

Please, if you use this tool in an academic setting cite the following paper [1]:

@article{LouzaTGZ19,
  author    = {Louza, Felipe A. and Telles, Guilherme P. and Gog, Simon and Zhao, Liang},
  title     = {Algorithms to compute the Burrows-Wheeler Similarity Distribution},
  journal   = {Theor. Comput. Sci.},
  volume    = {782},
  pages     = {145-156},
  year      = {2019},
  url       = {https://www.sciencedirect.com/science/article/pii/S0304397519301653},
  doi       = {10.1016/j.tcs.2019.03.012},
}

datasets

We have used the following datasets in our experiments: https://github.com/felipelouza/bwsd/blob/master/experiments/dataset.tar.gz

references

[1] Louza, Felipe A., Telles, Guilherme P., Gog, Simon, Liang Zhao (2019): Algorithms to compute the Burrows-Wheeler Similarity Distribution. Theor. Comput. Sci. 782: 145-156 [ElsevierLink]

[2] Louza, Felipe A., Gog, Simon, Telles, Guilherme P. (2017). Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678: 22-39, github.

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
dataset		dataset
experiments		experiments
external		external
lib		lib
tmp		tmp
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
main.cpp		main.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dataset

dataset

experiments

experiments

external

external

lib

lib

tmp

tmp

.gitignore

.gitignore

Makefile

Makefile

README.md

README.md

main.cpp

main.cpp

Repository files navigation

bwsd:

install

run

quick test

options

alternatives (Alg. 1)

debug

citation

datasets

references

About

Releases

Packages

Contributors 2

Languages

felipelouza/bwsd

Folders and files

Latest commit

History

Repository files navigation

bwsd:

install

run

quick test

options

alternatives (Alg. 1)

debug

citation

datasets

references

About

Topics

Resources

Stars

Watchers

Forks

Languages