[Search] Query the text size #2934

joergi-w · 2022-01-26T13:28:04Z

For calculating e-values or other purposes it is often necessary to query the text size of a (bi) FM index.

I have experimented with the size() function of seqan3::bi_fm_index<dna4, text_layout::collection> in order to calculate it myself. According to the documentation, the value of size() includes sentinels. Assume that I have stored somewhere a list of sequence names, so I know the value nseq = number of indexed sequences. Then for nseq > 1, I can compute the text size nchar = index.size() - nseq. For nseq == 1 we have a special case with nchar = index.size() - 2 (because a single sequence has 2 sentinels).

I suggest providing a function get_text_size() for the index that performs these calculations. One issue is that we have to keep track of the number of sequences stored in the index (which I could solve with the length of the names list).
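
To make the arithmetic concrete, here is a minimal sketch of that calculation as a free function. The name text_size_from_index is hypothetical and not part of SeqAn3. The sentinel counts (one per sequence in a collection, two for a single sequence) are taken from the observations above, not from a documented guarantee.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper, not part of SeqAn3.
// index_size: the value returned by index.size(), which includes sentinels.
// nseq:       the number of indexed sequences, tracked by the caller.
inline std::size_t text_size_from_index(std::size_t index_size, std::size_t nseq)
{
    // Assumption from the observations above: a collection carries one
    // sentinel per sequence, but a single sequence carries two sentinels.
    std::size_t const sentinels = (nseq == 1) ? 2 : nseq;

    if (nseq == 0 || index_size < sentinels)
        throw std::invalid_argument{"index size and sequence count are inconsistent"};

    return index_size - sentinels;
}
```

A caller would pass index.size() and the length of its names list, e.g. text_size_from_index(index.size(), names.size()).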

eseiler · 2022-01-26T14:19:38Z

Some ideas, without looking much at the code:

[number of texts] We store the text begin positions (text_begin), and we also have select and rank support for this vector. The number of texts would then be rank(text_begin, size()) // +1 ??. This should be constant time.
[number of texts] We know the number of texts during construction and could just store another size_t. This should be faster than rank, but will change the index serialisation.
[text size] Either store it, or have a function that does the nseq == 1 / nseq > 1 check.
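
The rank idea can be illustrated with a plain std::vector<bool> standing in for text_begin. This is only a model: the real index would use a succinct bit vector with constant-time rank support, whereas the naive rank1 below is linear. Both function names are hypothetical. Under the assumptions that rank counts set bits in the half-open range [0, pos) and that the first position of every text is marked, no +1 is needed in this model.

```cpp
#include <cstddef>
#include <vector>

// Naive stand-in for a rank query: number of set bits in positions [0, pos).
inline std::size_t rank1(std::vector<bool> const & bits, std::size_t pos)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < pos && i < bits.size(); ++i)
        count += bits[i];
    return count;
}

// Hypothetical helper: text_begin has one bit per position of the
// concatenated text, set to 1 where a text starts.
inline std::size_t number_of_texts(std::vector<bool> const & text_begin)
{
    // Ranking over the whole vector counts every marked begin position.
    return rank1(text_begin, text_begin.size());
}
```

For three texts of lengths 3, 2 and 4, each followed by one sentinel, the begin positions are 0, 4 and 7 in a vector of size 12, and number_of_texts counts the three set bits.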

Question: Do we also need the sizes of individual texts in the collection?

With text_begin as well as rank/select, we could determine the text size (text_size(x) == select(x + 1, text_begin) - select(x, text_begin); // probably off by one). This should also be constant time.
We know the text sizes and number of texts during construction. So, we could store the text_lengths in a vector. Might get "quite" big for big collections, and also changes the index serialisation.
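
A small sketch of the select-based variant, with a sorted vector of begin positions playing the role of select on text_begin. The function name text_length and the layout (exactly one sentinel after each text) are assumptions made for illustration; the actual off-by-one depends on how the index lays out sentinels.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// begins:     sorted start positions of the texts in the concatenation,
//             i.e. what select(x, text_begin) would return for each x.
// total_size: length of the concatenation, including sentinels.
// x:          zero-based id of the text.
inline std::size_t text_length(std::vector<std::size_t> const & begins,
                               std::size_t total_size,
                               std::size_t x)
{
    if (x >= begins.size())
        throw std::out_of_range{"text id out of range"};

    // The next text's begin position, or the end of the concatenation for the last text.
    std::size_t const next_begin = (x + 1 < begins.size()) ? begins[x + 1] : total_size;

    // Assumed layout: one sentinel follows each text, hence the -1.
    return next_begin - begins[x] - 1;
}
```

With begins {0, 4, 7} and a total size of 12, the three texts have lengths 3, 2 and 4.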

(cc @SGSSGene)

joergi-w added the feature/proposal (a new feature or an idea) label Jan 26, 2022

eseiler changed the title ~~[Index] Query the text size~~ [Search] Query the text size Oct 20, 2022
