Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Impact of bandwidth on alignment scores #16

Open
mohsenzakeri opened this issue Mar 31, 2020 · 0 comments
Open

Impact of bandwidth on alignment scores #16

mohsenzakeri opened this issue Mar 31, 2020 · 0 comments

Comments

@mohsenzakeri
Copy link

Dear Heng (@lh3),

First of all, I would like to thank you for developing such an awesome library for aligning biological sequences.

We (+ @rob-p and @fataltes) are using ksw2 as the core aligner engine of Puffaliger, which is an aligner for short read sequences. As we were looking at the alignment scores (specifically ez.mte and ez.mqe) generated by ksw2, we encountered some unexpected behavior such that the alignment score calculated by ksw2 on the same problem was different between different runs. I am going to explain the issue in this post.

The challenging part for explaining the issue is that it is only triggered while using the ksw2 aligner engine in a continuous procedure of aligning many queries rather than calling a single alignment problem from the command line. For example, consider the following case:

DNA-target : GTCACACTGTAGGAAACGATGGACCGTACCCTCCTTGTCACCAT
DNA-query: ATGAGACGCAACTATGGTGACGAA

The length of the DNA-target and DNA-query is 44 and 24 consecutively.

By running these cases individually using the ksw-test binary built from the ksw2 github repo (https://github.com/lh3/ksw2), the problem I mentioned is not triggered. However, while aligning a large number of queries in a continuous procedure (i. e. aligning many reads from a fastq file), sometimes we observe that ksw2 generates different alignment scores for the same alignment problem at different runs. This happens in both extz and gg2 mode, but not in extz2_sse.

After digging more, we realized that it only happens when the bandwidth value (w) is smaller than 20 (where 20 = 44 - 24). Looking at how the st and en value is defined, it seems that for w<20, in some iterations st becomes greater than en, and we believe this is not supposed to happen, especially because it causes out of bound access in the “eh” array in extz mode and “v” array in gg2 mode. This out of bounds access constitutes undefined behavior which, in our cases, never seems to manifest in a segmentation fault, but rather manifests in the access of and computation on uninitialized memory (containing values that are not meaningful).

So, we are wondering if adding a line which ensures st is never surpassing en (st>en), like the one at line 111 of ksw2_extz2_sse.c (https://github.com/lh3/ksw2/blob/master/ksw2_extz2_sse.c#L111) , is sufficient for dealing with this issue and generating correct and consistent scores in those cases, or if some other modification is necessary.

Besides from that, another potential unexpected behavior where the score changes just by changing bandwidth is the following:
ksw2-test -t extz2_sse -O 5 -E 3 -w 1 CTGAGAGGACAAATAAACT TCTGAGAGGACAAATAAACT
first second 38 38 18 19 38 38 38 18 6 2 1I19M
ksw2-test -t extz2_sse -O 5 -E 3 -w 2 CTGAGAGGACAAATAAACT TCTGAGAGGACAAATAAACT
first second 30 30 18 19 30 30 30 18 6 2 1I19M

Here, the only thing changing between the two queries is the bandwidth, but the score doesn’t seem consistent between the two runs. This particular example is triggered via the command line tool. In this example the right score is 30 and for every bandwidth greater than or equal to 2 the score is 30.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant