[Search] Query the text size #2934

Open
joergi-w opened this issue Jan 26, 2022 · 1 comment
Labels
feature/proposal a new feature or an idea of

Comments

@joergi-w
Member

For calculating e-values or other purposes it is often necessary to query the text size of a (bi) FM index.

I have experimented with the size() function of seqan3::bi_fm_index<dna4, text_layout::collection> in order to calculate it myself. According to the documentation, the value of size() includes sentinels. Assume that I have stored somewhere a list of sequence names, so I know the value nseq = number of indexed sequences. Then for nseq > 1, I can compute the text size nchar = index.size() - nseq. For nseq == 1 we have a special case with nchar = index.size() - 2 (because a single sequence has 2 sentinels).

I suggest providing a function get_text_size() for the index that performs these calculations. One issue is that we have to keep track of the number of sequences stored in the index (which I could solve with the length of the names list).
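A minimal sketch of the calculation described above, written as a free function so it does not depend on index internals. The name `text_size_from_index` is hypothetical and not part of seqan3; the sentinel counts (2 for a single sequence, otherwise one per sequence) are taken from the description in this issue.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of seqan3: derives the number of indexed
// characters from the value reported by index.size() and the number of
// sequences, using the sentinel counts described in this issue.
inline std::size_t text_size_from_index(std::size_t index_size, std::size_t nseq)
{
    assert(nseq > 0);
    // A single sequence carries 2 sentinels; otherwise there is one per sequence.
    std::size_t const sentinels = (nseq == 1) ? 2 : nseq;
    assert(index_size >= sentinels);
    return index_size - sentinels;
}
```

With a list of sequence names at hand, this could be called as `text_size_from_index(index.size(), names.size())`, where `names` is whatever container the caller keeps.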

@joergi-w joergi-w added the feature/proposal a new feature or an idea of label Jan 26, 2022
@eseiler
Member

eseiler commented Jan 26, 2022

Some ideas, without looking much at the code:

  • [number of texts] We store the text begin positions (text_begin), and we also have select and rank support for this vector. The number of texts would then be rank(text_begin, size()) // +1 ??. This should take constant time.
  • [number of texts] We know the number of texts during construction and could just store another size_t. This should be faster than rank, but will change the index serialisation.
  • [text size] Either store it, or have a function that does the nseq == 1 / nseq > 1 check.
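The rank-based idea from the first bullet can be sketched on a toy model. Everything here is an assumption for illustration: `text_begin` is modelled as a plain `std::vector<bool>` with a set bit at the first position of every text (including the first one, so no `+1` is needed), and `rank1` is a linear-time stand-in for the constant-time rank support mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a rank query: number of set bits in [0, pos).
// A succinct rank structure would answer this in constant time.
inline std::size_t rank1(std::vector<bool> const & bits, std::size_t pos)
{
    assert(pos <= bits.size());
    auto const last = bits.begin() + static_cast<std::ptrdiff_t>(pos);
    return static_cast<std::size_t>(std::count(bits.begin(), last, true));
}

// Hypothetical helper: with one set bit per text begin, the number of texts
// is the rank over the whole vector.
inline std::size_t number_of_texts(std::vector<bool> const & text_begin)
{
    return rank1(text_begin, text_begin.size());
}
```

Whether a `+1` is required in the real index depends on whether the begin of the first text is marked, which this sketch does not settle.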

Question: Do we also need the sizes of individual texts in the collection?

  • With text_begin as well as rank/select, we could determine the text size (text_size(x) == select(x + 1, text_begin) - select(x, text_begin); // probably off by one). This should also take constant time.
  • We know the text sizes and number of texts during construction. So, we could store the text_lengths in a vector. This might get "quite" big for big collections, and also changes the index serialisation.
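The select-based variant can be sketched the same way. The layout is an assumption for illustration, not the actual index layout: texts are concatenated, each followed by exactly one sentinel, and `text_begin` has a set bit at the first position of each text. `select1` is a linear-time stand-in for real select support, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a select query: position of the k-th set bit (0-based k).
inline std::size_t select1(std::vector<bool> const & bits, std::size_t k)
{
    for (std::size_t pos = 0; pos < bits.size(); ++pos)
    {
        if (bits[pos])
        {
            if (k == 0)
                return pos;
            --k;
        }
    }
    assert(false && "fewer than k + 1 set bits");
    return bits.size();
}

// Length of text x (0-based), assuming one trailing sentinel per text.
inline std::size_t text_size(std::vector<bool> const & text_begin, std::size_t x, std::size_t n_texts)
{
    assert(x < n_texts);
    std::size_t const begin = select1(text_begin, x);
    // The last text ends at the end of the concatenation.
    std::size_t const end = (x + 1 < n_texts) ? select1(text_begin, x + 1) : text_begin.size();
    return end - begin - 1; // subtract the sentinel
}
```

For the concatenation `ACGT$GG$T$`, the begin bits sit at positions 0, 5 and 8, and the `- 1` is the off-by-one correction for the sentinel in this toy layout.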

(cc @SGSSGene)

@eseiler eseiler changed the title [Index] Query the text size [Search] Query the text size Oct 20, 2022
2 participants