Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Not documented that FM index / suffix array / BWT require terminal sentinel character #495

Open
corneliusroemer opened this issue Aug 1, 2022 · 1 comment

Comments

@corneliusroemer
Copy link
Contributor

I spent a good hour debugging backward_search using the code fragment from backward_search:

/// Perform backward search, yielding `BackwardSearchResult` enum that
/// contains the suffix array interval denoting exact occurrences of the given pattern
/// of length m in the text if it exists, or the suffix array interval denoting the
/// exact occurrences of a maximal matching suffix of the given pattern if it does
/// not exist. If none of the pattern can be matched, the `BackwardSearchResult` is
/// `Absent`.
/// Complexity: O(m).
///
/// # Arguments
///
/// * `pattern` - the pattern to search
///
/// # Example
///
/// ```
/// use bio::alphabets::dna;
/// use bio::data_structures::bwt::{bwt, less, Occ};
/// use bio::data_structures::fmindex::{BackwardSearchResult, FMIndex, FMIndexable};
/// use bio::data_structures::suffix_array::suffix_array;
///
/// let text = b"GCCTTAACATTATTACGCCTA$";
/// let alphabet = dna::n_alphabet();
/// let sa = suffix_array(text);
/// let bwt = bwt(text, &sa);
/// let less = less(&bwt, &alphabet);
/// let occ = Occ::new(&bwt, 3, &alphabet);
/// let fm = FMIndex::new(&bwt, &less, &occ);
///
/// let pattern = b"TTA";
/// let bsr = fm.backward_search(pattern.iter());
///
/// let positions = match bsr {
/// BackwardSearchResult::Complete(sai) => sai.occ(&sa),
/// BackwardSearchResult::Partial(sai, _l) => sai.occ(&sa),
/// BackwardSearchResult::Absent => Vec::<usize>::new()
/// };
///
/// assert_eq!(positions, [3, 12, 9]);
/// ```
fn backward_search<'b, P: Iterator<Item = &'b u8> + DoubleEndedIterator>(
&self,
pattern: P,
) -> BackwardSearchResult {

Turns out, the reference fasta I read in didn't have a sentinel character $ at the end. Lacking this terminal $ makes backward_search produce wrong results.

While the example texts do contain that character (that's how I got the idea to attach it to my reference) - it's not mentioned anywhere as far as I can see.

There should either be a note, or maybe even a check during creation of the FM index - unless there's a use case without terminal $. Or even more convenient - add a terminal $.

@corneliusroemer
Copy link
Contributor Author

Also, it would be nice if one didn't have to add that sentinel character manually. It forces me to do a .clone() simply so that I can tag on the $ to the end of my reference sequence when creating the index. Maybe that sentinel character could be hacked in avoiding that clone?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant