New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Potentially incorrect positions from fm-index backward_search #454
Comments
I'll mention this as well (https://twitter.com/nomad421/status/1444380062988578817?s=20) — it seems to me that at the basic suffix-array stage, everything checks out. |
The issue seems to be in an optimization of the backward search (#230): rust-bio/src/data_structures/fmindex.rs Lines 128 to 132 in a3ebaa8
The suffix array interval which is returned by the function is actually {l, r + 1} so the issue might be caused by an off-by-one error. When we change the line to check for l > r instead of l == r your code returns the correct position.
|
@rob-p thanks for pointing us to this issue. Do you think you could provide a PR including a test case that reproduces your scenario? |
If the test case is not so easy to generate, a PR with the fix would also do for now. |
Currently, the backward search returns the suffix array interval as soon as it is narrowed to to a size of one (exactly one match in the reference). So I think the reference position is actually the first position of the last (backward search) substring of the read that matches the reference exactly once. |
Right, so I think this is worth resolving at the same time. When search fails, what is the desired behavior? I thought an empty interval, but now it is the last non-empty interval, so you have to check explicity if the search failed or not (if patter equals the text at the given position). The other alternative would be to return both --- a structure that contains the interval of matches and a flag specifying the length of suffix matched (the whole pattern, or something less). @johanneskoester : the repo I link to has the repo case, is that data too big for a PR, or OK? |
Apologies, my previous comment was missing a bit of context. It was meant as a minimal test case. I'd have a branch with the fix and a test around that I could quickly file a PR for, if you don't mind? |
Please, by all means go ahead and make the PR! You discovered the source of and fix for the problem :). Regarding my second comment, I was just curious if @johanneskoester had any opinions or thoughts on potentially improving the interface / API of the |
That is a great idea. I think it will make sense to return an enum that models the different possible outcomes. |
@johanneskoester : I made a PR based on @jch-13's branch that you can look at here jch-13#1. I'm curious if you have any input on the interface or implementation. |
Thanks @johanneskoester and @jch-13 for the help and feedback. This is now on PR #457. |
* Fix backward search on FM-Index (#454) * add enum to track result type in fm_index backward search * format * address feedback on initial PR * Update lib.rs * fix doc examples * cargo fmt Co-authored-by: J. Christian Heide <christian_heide@eva.mpg.de> Co-authored-by: Rob Patro <rob@newton> Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
Hi all,
I noticed some strange behavior when doing locate queries with the fm-index. Specifically, the result gets me "close" to where the pattern occurs, but not to the correct location. I can't seem to figure out why it's not correct.
I created a basic minimal working example with data and code here. The code takes a sars cov2 genome and build the suffix array, bwt, and fmindex using rust-bio. Then, it reads in from a FASTA file one read that is a pattern of length
57
taken directly from position1199
of the geneome.The rust-bio query returns that the position of the string is at
1248
, but both basic BioPython find and some quick arithmetic from the results ofgrep
disagree with this, and confirm that the pattern exists only at position1199
.Hopefully, I'm just holding something wrong ;), but could anyone explain what might be going on here?
Thanks,
Rob
The text was updated successfully, but these errors were encountered: