Enhance ContinuousIds optimisation to store the diff between docIds as a Vint #13228

Open
expani opened this issue Mar 27, 2024 · 0 comments

Description

One of the optimisations introduced by LUCENE-10233 was to compress continuous docIds (strictly sorted) by storing only the start docId, together with a flag indicating that the run is continuous.

This works well when the difference between consecutive docIds is 1.

I was testing datasets where only a few distinct points (enums) are present, each appearing in many docs, and the points are inserted in a cyclic fashion.

Consider the following insertion order:

=================================

  • Insert Doc Id 1 with 1d Point Value as 1
  • Insert Doc Id 2 with 1d Point Value as 2
  • Insert Doc Id 3 with 1d Point Value as 3

=================================

  • Insert Doc Id 4 with 1d Point Value as 1
  • Insert Doc Id 5 with 1d Point Value as 2
  • Insert Doc Id 6 with 1d Point Value as 3

=================================

  • Insert Doc Id 7 with 1d Point Value as 1
  • Insert Doc Id 8 with 1d Point Value as 2
  • Insert Doc Id 9 with 1d Point Value as 3

=================================

and so on.

In such scenarios, although the docIds for every point follow an arithmetic progression, the difference between them is 3 rather than 1, so the existing optimisation does not apply.
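To make the pattern concrete, here is a small standalone sketch (plain Java, not Lucene code; the class and method names are hypothetical) that replays the cyclic insertion above and reports the docIds collected for each point value along with their common difference:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public final class CyclicDocIds {

  /** Groups docIds 1..numDocs by point value, assigning values 1..numValues cyclically. */
  static Map<Integer, List<Integer>> groupByValue(int numDocs, int numValues) {
    Map<Integer, List<Integer>> docIdsByValue = new TreeMap<>();
    for (int docId = 1; docId <= numDocs; docId++) {
      int value = (docId - 1) % numValues + 1;
      docIdsByValue.computeIfAbsent(value, v -> new ArrayList<>()).add(docId);
    }
    return docIdsByValue;
  }

  /** Returns the common difference if docIds form an arithmetic progression, else -1. */
  static int commonDiff(List<Integer> docIds) {
    if (docIds.size() < 2) {
      return -1;
    }
    int diff = docIds.get(1) - docIds.get(0);
    for (int i = 2; i < docIds.size(); i++) {
      if (docIds.get(i) - docIds.get(i - 1) != diff) {
        return -1;
      }
    }
    return diff;
  }

  public static void main(String[] args) {
    // Nine docs, three point values, as in the insertion order above.
    groupByValue(9, 3).forEach((value, docIds) ->
        System.out.println("value " + value + " -> docIds " + docIds
            + ", diff " + commonDiff(docIds)));
  }
}
```

For the nine docs listed above, point value 1 maps to docIds [1, 4, 7], value 2 to [2, 5, 8], and value 3 to [3, 6, 9], each with a difference of 3.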

I tested a change to the implementation that also stores the diff along with the starting docId and observed high compression for such cases. My test indexed 1 million docs with one numeric field containing only 2 unique points, inserted in a cyclic fashion.

Without storing the diff, the KDD file took 276 KB, whereas with the diff it took around 34 KB.

My proposal is to store the diff along with the starting docId to ensure all arithmetic progressions of docIds can use this optimisation.
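A rough sketch of what such an encoding could look like is shown below. It is illustrative only and does not use Lucene's classes: the marker byte, the on-disk layout, and all names (`ProgressionDocIdsSketch`, `encode`, `decode`) are assumptions made for the example, and it assumes the reader already knows how many docIds the block holds. The variable-length integer helper uses the usual 7-bits-per-byte scheme with the high bit as a continuation flag.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public final class ProgressionDocIdsSketch {

  // Hypothetical marker byte for "start + diff" blocks; not a value Lucene uses.
  static final int PROGRESSION_MARKER = 0x7F;

  /** Writes a non-negative int using 7 bits per byte; high bit set means more bytes follow. */
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  /** Reads an int written by writeVInt. */
  static int readVInt(ByteArrayInputStream in) {
    int b = in.read();
    int value = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = in.read();
      value |= (b & 0x7F) << shift;
    }
    return value;
  }

  /** True if docIds are strictly increasing with one constant difference. */
  static boolean isArithmeticProgression(int[] docIds) {
    if (docIds.length < 2) {
      return false;
    }
    int diff = docIds[1] - docIds[0];
    if (diff <= 0) {
      return false;
    }
    for (int i = 2; i < docIds.length; i++) {
      if (docIds[i] - docIds[i - 1] != diff) {
        return false;
      }
    }
    return true;
  }

  /** Encodes marker + start + diff, or returns null so the caller can fall back. */
  static byte[] encode(int[] docIds) {
    if (!isArithmeticProgression(docIds)) {
      return null;
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(PROGRESSION_MARKER);
    writeVInt(out, docIds[0]);             // starting docId
    writeVInt(out, docIds[1] - docIds[0]); // diff between consecutive docIds
    return out.toByteArray();
  }

  /** Rebuilds count docIds from an encoded progression block. */
  static int[] decode(byte[] bytes, int count) {
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    if (in.read() != PROGRESSION_MARKER) {
      throw new IllegalArgumentException("not a progression block");
    }
    int start = readVInt(in);
    int diff = readVInt(in);
    int[] docIds = new int[count];
    for (int i = 0; i < count; i++) {
      docIds[i] = start + i * diff;
    }
    return docIds;
  }

  public static void main(String[] args) {
    int[] docIds = {1, 4, 7, 10};
    byte[] encoded = encode(docIds);
    System.out.println("encoded bytes: " + encoded.length);
    System.out.println("decoded: " + Arrays.toString(decode(encoded, docIds.length)));
  }
}
```

With this layout the size of a block no longer depends on how many docIds it holds, only on the magnitude of the start docId and the diff. A diff of 1 reproduces the existing continuous case, so one code path could in principle cover both.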
