[Feature Request] Improve BKD Tree DocIds Encoding for 24 and 32 bit variations #13686
Labels
enhancement
Enhancement or improvement to existing feature or request
Performance
This is for any performance related enhancements or bugs
Search:Performance
Is your feature request related to a problem? Please describe
Lucene represents the docIds using 24 bits if the maximum docId within a leaf is <=
16,777,215
here else it represents the docId as a regular integer using 32 bits without any encoding.A lucene segment can have around 2 billion docs and a significant amount of docIds might be represented using the 24 and 32 bits representation.
Most of the time spent in range queries are while collecting docIds readDelta16,
readInts24 and readInts32
If we can reduce the amount of bytes used to represent docIds, we can improve the reading time ( search ) especially for range queries over larger ranges and reduce the index size.
Some existing tickets that have already discussed this area are :
Describe the solution you'd like
I was trying to find the common prefix amongst the docIds when they are represented using 24 and 32 bit variation based on the NYC Taxi data for the field
fare_amount
with around 20 million docs.I found that a lot of leaf blocks that were represented using 24 bits had common prefixes ( from MSB/Leftmost bit ) ranging from 8bits to 15 bits. Having 8 bits common doesn't really help in encoding as the docIds will be represented using 24 bit. But, having 15 common bits can help us save 7 bits per docId in a leaf of 512 docIds reducing the number of bytes used to represent a leaf block by 445 bytes. ((512* 7 ) / 8 ) = 448- 3 bytes ( used to represent the common prefix as VInt)
Similarly when performing a force merge on the segment, Lucene also used 32 bits representation for around 1000 leaf blocks and I saw common prefixes ranging from 7-15 bits amongst them. The best case of 15 bit common prefix in this case would save 956 bytes per leaf block.
I am planning to introduce a new encoding for 24 and 32 bit representation which will also store the common prefix and benchmark the changes in indexing and search performance.
Related component
Search:Performance
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: