One of the optimisations introduced by LUCENE-10233 was to compress continuous docIds (strictly sorted) by storing only the starting docId, together with a flag indicating that the docIds are continuous.
This works well when the difference between consecutive docIds is 1.
I was testing datasets where only a few distinct points (enums) are present, each shared by many docs, and the points are inserted in a cyclic fashion.
Consider the following insertion order:
=================================
Insert Doc Id 1 with 1d Point Value as 1
Insert Doc Id 2 with 1d Point Value as 2
Insert Doc Id 3 with 1d Point Value as 3
=================================
Insert Doc Id 4 with 1d Point Value as 1
Insert Doc Id 5 with 1d Point Value as 2
Insert Doc Id 6 with 1d Point Value as 3
=================================
Insert Doc Id 7 with 1d Point Value as 1
Insert Doc Id 8 with 1d Point Value as 2
Insert Doc Id 9 with 1d Point Value as 3
=================================
and so on.
In such scenarios, although the docIds for every point follow an arithmetic progression, the difference between them is 3 rather than 1, so the continuous-docId optimisation does not apply.
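To make the pattern concrete, here is a small sketch that reproduces the insertion order above and lists the docIds that end up under each point value. The class and method names (`CyclicDocIds`, `docIdsForValue`) are made up for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Lucene code: models the cyclic insertion order above.
final class CyclicDocIds {

    /**
     * Returns the docIds (1-based, as in the example) that carry the given
     * point value when values 1..numValues are assigned to docs cyclically.
     */
    static List<Integer> docIdsForValue(int value, int numValues, int maxDoc) {
        List<Integer> ids = new ArrayList<>();
        for (int docId = 1; docId <= maxDoc; docId++) {
            int assigned = (docId - 1) % numValues + 1;
            if (assigned == value) {
                ids.add(docId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Prints:
        // value 1 -> [1, 4, 7]
        // value 2 -> [2, 5, 8]
        // value 3 -> [3, 6, 9]
        for (int v = 1; v <= 3; v++) {
            System.out.println("value " + v + " -> " + docIdsForValue(v, 3, 9));
        }
    }
}
```

Each list is strictly sorted with a constant step of 3, so none of them qualifies as continuous under a step-of-1 check.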
I tested changing the implementation to also store the diff along with the starting docId and observed much better compression for such cases. My test involved indexing 1 million docs with one numeric field containing only 2 unique points that are inserted in a cyclic fashion.
Without storing the diff, the KDD file took 276 KB, whereas with the diff it took around 34 KB.
My proposal is to store the diff along with the starting docId to ensure all arithmetic progressions of docIds can use this optimisation.
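A rough sketch of the idea, not the actual patch: the names below (`StridedDocIds`, `commonStride`, `encode`, `decode`) are hypothetical, and the sketch assumes the reader already knows how many docIds a block holds, so only the start and the diff need to be stored.

```java
import java.util.Arrays;

// Hypothetical illustration of "start + diff" encoding; not Lucene's DocIdsWriter.
final class StridedDocIds {

    /**
     * Returns the common difference if docIds form a strictly increasing
     * arithmetic progression with at least two elements, otherwise -1.
     */
    static int commonStride(int[] docIds) {
        if (docIds.length < 2) {
            return -1;
        }
        int stride = docIds[1] - docIds[0];
        if (stride <= 0) {
            return -1;
        }
        for (int i = 2; i < docIds.length; i++) {
            if (docIds[i] - docIds[i - 1] != stride) {
                return -1;
            }
        }
        return stride;
    }

    /** Encodes the block as {start, stride} when possible, else returns a plain copy. */
    static int[] encode(int[] docIds) {
        int stride = commonStride(docIds);
        return stride > 0 ? new int[] {docIds[0], stride} : docIds.clone();
    }

    /** Rebuilds count docIds from a start and a stride (assumes no int overflow). */
    static int[] decode(int start, int stride, int count) {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = start + i * stride;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docIds = {1, 4, 7, 10};
        int[] encoded = encode(docIds);
        System.out.println(Arrays.toString(encoded)); // [1, 3]
        System.out.println(Arrays.toString(decode(encoded[0], encoded[1], docIds.length))); // [1, 4, 7, 10]
    }
}
```

With a diff of 1 this reduces to the existing continuous case, so a single code path could in principle cover both.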