New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: sampled suffix array #447
Conversation
I had to add a couple of extra methods to the sampled suffix array and fmd index APIs to handle a special case in our project where the sampled suffix array and the fmd index has to be put within the same struct while respecting ownership rules. This is usually fine with a shared Rc, but serde makes copies of the referenced data for each reference. To avoid this, we need to be able to reconstruct an fmd index struct on the fly without any checks so it doesn't have to own any data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for adding the additional tests.
The sampled suffix array implementation has some issues (closes #70 and closes #446) from like 2016, so the code was commented out and the issues were never fixed... until now.
The issue is that wrong positions are returned when doing a last to front lookup that leads to a sentinel in the FM-index. This issue only happens when there are multiple sentinels. I think is due to how SAIS ranks the sentinels so they are different despite being the same character
$
so the final result will not be a fully lexicographical sorting. The FM-index doesn't know this so you get wrong results when looking up a sentinel inocc
orless
for the last to front mapping. Normally, you wouldn't look up sentinels so this issue wouldn't come up, but with a sampled suffix array, sometimes you need to follow the last to front mapping and cross a suffix that starts with the sentinel in order to reach the nearest sampled suffix.The fix is to simply use a hashmap to store, for each sentinel in the text, the index for the suffix that comes immediately after. This will make sure that we never need to cross a sentinel suffix to reach the nearest sampled suffix. The memory and performance impact of this is minimal, since typically the number of characters in the text is much larger than the number of sentinels.
An alternative solution would be to use a different sentinel from the sentinel at the end, so all sentinels but the last one are all equal from the perspective of SAIS and they will be sorted correctly. However, this will require changes to user code to use different sentinels and that is annoying.