Enhance ContinuousIds optimisation to store the diff between docIds as a Vint #13228

expani · 2024-03-27T14:25:46Z

Description

One of the optimisations introduced by LUCENE-10233 compresses continuous doc IDs (strictly sorted) by storing only the starting docId, along with a flag indicating that this encoding is in use.

This works well when the difference between consecutive docIds is 1.

I was testing datasets in which only a few distinct points (enums) are present across many docs and are inserted in a cyclic fashion.

Consider the following insertion order:

=================================

Insert Doc Id 1 with 1d Point Value as 1
Insert Doc Id 2 with 1d Point Value as 2
Insert Doc Id 3 with 1d Point Value as 3

=================================

Insert Doc Id 4 with 1d Point Value as 1
Insert Doc Id 5 with 1d Point Value as 2
Insert Doc Id 6 with 1d Point Value as 3

=================================

Insert Doc Id 7 with 1d Point Value as 1
Insert Doc Id 8 with 1d Point Value as 2
Insert Doc Id 9 with 1d Point Value as 3

=================================

and so on.

In such scenarios, the docIds for each point value still follow an arithmetic progression, but the common difference is 3 rather than 1, so the existing optimisation does not apply.
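To make the observation concrete, here is a minimal sketch that checks whether a sorted list of docIds forms an arithmetic progression and, if so, reports the common difference. The class and method names (`StrideCheck`, `constantStride`) are hypothetical and are not part of Lucene.

```java
import java.util.OptionalInt;

// Hypothetical helper, not Lucene code: detects a constant positive stride in docIds.
final class StrideCheck {

    /**
     * Returns the common difference if docIds (at least two entries) form an
     * arithmetic progression with a positive step, otherwise an empty OptionalInt.
     */
    static OptionalInt constantStride(int[] docIds) {
        if (docIds.length < 2) {
            return OptionalInt.empty();
        }
        int stride = docIds[1] - docIds[0];
        if (stride <= 0) {
            return OptionalInt.empty(); // not strictly increasing
        }
        for (int i = 2; i < docIds.length; i++) {
            if (docIds[i] - docIds[i - 1] != stride) {
                return OptionalInt.empty();
            }
        }
        return OptionalInt.of(stride);
    }

    public static void main(String[] args) {
        // docIds that hold point value 1 in the insertion order above
        int[] docsForValue1 = {1, 4, 7};
        System.out.println(constantStride(docsForValue1)); // prints OptionalInt[3]
    }
}
```

With the insertion order above, the docIds for point value 1 are 1, 4, 7, so the stride is 3; a stride of 1 corresponds to the case the existing optimisation already covers.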

I tested changing the implementation to also store the diff along with the starting docId and observed much better compression in such cases. My test involved indexing 1 million docs with one numeric field containing only 2 unique points that are inserted in a cyclic fashion.

Without storing the diff, the KDD file took 276 KB, whereas with the diff it took around 34 KB.

My proposal is to store the diff along with the starting docId to ensure all arithmetic progressions of docIds can use this optimisation.
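A rough sketch of what the proposed encoding could look like is given below. It is not Lucene's actual on-disk format: the class `StridedDocIdsSketch`, its methods, and the choice to also write the count are assumptions made so that the example is self-contained. The variable-length integer is a plain 7-bits-per-byte encoding written by hand rather than through Lucene's APIs.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: encodes an arithmetic progression of docIds as
// (start, diff, count), each written as a variable-length int.
final class StridedDocIdsSketch {

    // Writes a non-negative int using 7 data bits per byte, low bits first.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads a VInt starting at pos[0] and advances pos[0] past it.
    static int readVInt(byte[] bytes, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = bytes[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    static byte[] encode(int start, int diff, int count) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, start);
        writeVInt(out, diff);
        writeVInt(out, count); // assumption: count is stored here for simplicity
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes) {
        int[] pos = {0};
        int start = readVInt(bytes, pos);
        int diff = readVInt(bytes, pos);
        int count = readVInt(bytes, pos);
        int[] docIds = new int[count];
        for (int i = 0; i < count; i++) {
            docIds[i] = start + i * diff;
        }
        return docIds;
    }

    public static void main(String[] args) {
        byte[] encoded = encode(1, 3, 4); // docIds 1, 4, 7, 10
        System.out.println(encoded.length + " bytes");
        System.out.println(java.util.Arrays.toString(decode(encoded)));
    }
}
```

In this sketch a block of docIds with a constant diff costs a few bytes regardless of how many docIds it contains, which is the effect the proposal aims for; a diff of 1 reduces to the existing continuous case.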


expani added the type:enhancement label Mar 27, 2024
