Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: sampled suffix array #447

Merged
merged 8 commits into from Aug 23, 2021
Merged

Conversation

Daniel-Liu-c0deb0t
Copy link
Contributor

@Daniel-Liu-c0deb0t Daniel-Liu-c0deb0t commented Aug 20, 2021

The sampled suffix array implementation has some issues (closes #70 and closes #446) from like 2016, so the code was commented out and the issues were never fixed... until now.

The issue is that wrong positions are returned when doing a last to front lookup that leads to a sentinel in the FM-index. This issue only happens when there are multiple sentinels. I think is due to how SAIS ranks the sentinels so they are different despite being the same character $ so the final result will not be a fully lexicographical sorting. The FM-index doesn't know this so you get wrong results when looking up a sentinel in occ or less for the last to front mapping. Normally, you wouldn't look up sentinels so this issue wouldn't come up, but with a sampled suffix array, sometimes you need to follow the last to front mapping and cross a suffix that starts with the sentinel in order to reach the nearest sampled suffix.

The fix is to simply use a hashmap to store, for each sentinel in the text, the index for the suffix that comes immediately after. This will make sure that we never need to cross a sentinel suffix to reach the nearest sampled suffix. The memory and performance impact of this is minimal, since typically the number of characters in the text is much larger than the number of sentinels.

An alternative solution would be to use a different sentinel from the sentinel at the end, so all sentinels but the last one are all equal from the perspective of SAIS and they will be sorted correctly. However, this will require changes to user code to use different sentinels and that is annoying.

@Daniel-Liu-c0deb0t Daniel-Liu-c0deb0t changed the title Fix sampled suffix array fix: sampled suffix array Aug 20, 2021
@coveralls
Copy link

coveralls commented Aug 20, 2021

Coverage Status

Coverage increased (+0.1%) to 88.057% when pulling 9ff05fb on Daniel-Liu-c0deb0t:sampled-sa into 125bf20 on rust-bio:master.

@Daniel-Liu-c0deb0t
Copy link
Contributor Author

I had to add a couple of extra methods to the sampled suffix array and fmd index APIs to handle a special case in our project where the sampled suffix array and the fmd index has to be put within the same struct while respecting ownership rules. This is usually fine with a shared Rc, but serde makes copies of the referenced data for each reference. To avoid this, we need to be able to reconstruct an fmd index struct on the fly without any checks so it doesn't have to own any data.

Copy link
Contributor

@pmarks pmarks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for adding the additional tests.

@pmarks pmarks merged commit 00f9846 into rust-bio:master Aug 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Status of sampled SA Multiple sentinels and a sampled suffix array
3 participants