Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potentially incorrect positions from fm-index backward_search #454

Closed
rob-p opened this issue Oct 2, 2021 · 13 comments · Fixed by #455
Closed

Potentially incorrect positions from fm-index backward_search #454

rob-p opened this issue Oct 2, 2021 · 13 comments · Fixed by #455

Comments

@rob-p
Copy link
Contributor

rob-p commented Oct 2, 2021

Hi all,

I noticed some strange behavior when doing locate queries with the fm-index. Specifically, the result gets me "close" to where the pattern occurs, but not to the correct location. I can't seem to figure out why it's not correct.

I created a basic minimal working example with data and code here. The code takes a sars cov2 genome and build the suffix array, bwt, and fmindex using rust-bio. Then, it reads in from a FASTA file one read that is a pattern of length 57 taken directly from position 1199 of the geneome.

The rust-bio query returns that the position of the string is at 1248, but both basic BioPython find and some quick arithmetic from the results of grep disagree with this, and confirm that the pattern exists only at position 1199.

Hopefully, I'm just holding something wrong ;), but could anyone explain what might be going on here?

Thanks,
Rob

@sjackman
Copy link
Contributor

sjackman commented Oct 2, 2021

@rob-p
Copy link
Contributor Author

rob-p commented Oct 2, 2021

I'll mention this as well (https://twitter.com/nomad421/status/1444380062988578817?s=20) — it seems to me that at the basic suffix-array stage, everything checks out.

@jch-13
Copy link
Contributor

jch-13 commented Oct 2, 2021

The issue seems to be in an optimization of the backward search (#230):

// The symbol was not found if we end up with an empty interval.
// Terminate the LF-mapping process.
if l == r {
break;
}

The suffix array interval which is returned by the function is actually {l, r + 1} so the issue might be caused by an off-by-one error. When we change the line to check for l > r instead of l == r your code returns the correct position.

@johanneskoester
Copy link
Contributor

johanneskoester commented Oct 3, 2021

@rob-p thanks for pointing us to this issue. Do you think you could provide a PR including a test case that reproduces your scenario?

@johanneskoester
Copy link
Contributor

If the test case is not so easy to generate, a PR with the fix would also do for now.

@jch-13
Copy link
Contributor

jch-13 commented Oct 3, 2021

Currently, the backward search returns the suffix array interval as soon as it is narrowed to to a size of one (exactly one match in the reference). So I think the reference position is actually the first position of the last (backward search) substring of the read that matches the reference exactly once.
So when we let pattern = text[..text.len() - 1]; for any text and search for pattern in text, the position should be [0] (but currently the bug is triggered).

@rob-p
Copy link
Contributor Author

rob-p commented Oct 3, 2021

Right, so I think this is worth resolving at the same time. When search fails, what is the desired behavior? I thought an empty interval, but now it is the last non-empty interval, so you have to check explicity if the search failed or not (if patter equals the text at the given position). The other alternative would be to return both --- a structure that contains the interval of matches and a flag specifying the length of suffix matched (the whole pattern, or something less).

@johanneskoester : the repo I link to has the repo case, is that data too big for a PR, or OK?

@jch-13
Copy link
Contributor

jch-13 commented Oct 3, 2021

Apologies, my previous comment was missing a bit of context. It was meant as a minimal test case.
I believe the correct behaviour would be to return the empty interval whenever a search fails. This is how it's been before #230 was merged, so replacing l == r with l > r would restore that while still short-circuting the symbol-not-in-text case.

I'd have a branch with the fix and a test around that I could quickly file a PR for, if you don't mind?

@rob-p
Copy link
Contributor Author

rob-p commented Oct 3, 2021

@jch-13,

Please, by all means go ahead and make the PR! You discovered the source of and fix for the problem :).

Regarding my second comment, I was just curious if @johanneskoester had any opinions or thoughts on potentially improving the interface / API of the backward_search method. While I certainly agree the current functionality is a bug that needs to be fixed, I'm suggesting it may often be useful to know where the search failed when it does fail. Given the way backward search works, we can get the maximal matching suffix essentially for free, if we're willing to change the interface of the search. I was curious if @johanneskoester had any valence toward that. Of course, it seems like a generally useful discussion and so I'd be happy to hear others' thoughts as well.

@johanneskoester
Copy link
Contributor

@jch-13,

Please, by all means go ahead and make the PR! You discovered the source of and fix for the problem :).

Regarding my second comment, I was just curious if @johanneskoester had any opinions or thoughts on potentially improving the interface / API of the backward_search method. While I certainly agree the current functionality is a bug that needs to be fixed, I'm suggesting it may often be useful to know where the search failed when it does fail. Given the way backward search works, we can get the maximal matching suffix essentially for free, if we're willing to change the interface of the search. I was curious if @johanneskoester had any valence toward that. Of course, it seems like a generally useful discussion and so I'd be happy to hear others' thoughts as well.

That is a great idea. I think it will make sense to return an enum that models the different possible outcomes.

@johanneskoester
Copy link
Contributor

Reopening since the API change proposed by @rob-p is not yet done. @rob-p would you like to provide a PR?

@rob-p
Copy link
Contributor Author

rob-p commented Oct 4, 2021

@johanneskoester : I made a PR based on @jch-13's branch that you can look at here jch-13#1. I'm curious if you have any input on the interface or implementation.

@rob-p
Copy link
Contributor Author

rob-p commented Oct 8, 2021

Thanks @johanneskoester and @jch-13 for the help and feedback. This is now on PR #457.

johanneskoester added a commit that referenced this issue Oct 19, 2021
* Fix backward search on FM-Index (#454)

* add enum to track result type in fm_index backward search

* format

* address feedback on initial PR

* Update lib.rs

* fix doc examples

* cargo fmt

Co-authored-by: J. Christian Heide <christian_heide@eva.mpg.de>
Co-authored-by: Rob Patro <rob@newton>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
@rob-p rob-p closed this as completed Nov 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants