add instrumentation to json index getMatchingFlattenedDocsMap() #13164

Merged

@itschrispeck itschrispeck commented May 15, 2024

This PR contains two changes:

  1. Add instrumentation for OOM protection to json index's getMatchingFlattenedDocsMap()
  2. Use lazy initialization when generating the map in json_extract_index transform function

For the first change, we've seen that the map can be very large, and previously no sampling or interruption check could occur during map generation.
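As a rough illustration of the first change, the sketch below builds a value-to-doc-ids map and calls a resource check at a fixed interval inside the loop, so a long-running build can be sampled and aborted. Everything here is an assumption made for illustration: `ResourceCheck`, `InstrumentedMapBuilder`, and `CHECK_INTERVAL` are hypothetical names rather than Pinot APIs, and `java.util.BitSet` stands in for `RoaringBitmap` so the example needs only the JDK.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the accounting hook; not a Pinot API. */
interface ResourceCheck {
  /** Samples resource usage and throws if the query should stop. */
  void sampleAndCheck();
}

class InstrumentedMapBuilder {
  // Hypothetical interval: check once every CHECK_INTERVAL entries to keep overhead low.
  static final int CHECK_INTERVAL = 1024;

  /**
   * Builds a map from value to the ids of the docs holding that value.
   * values.get(docId) is the value for docId. BitSet stands in for RoaringBitmap.
   */
  static Map<String, BitSet> build(List<String> values, ResourceCheck check) {
    Map<String, BitSet> result = new HashMap<>();
    for (int docId = 0; docId < values.size(); docId++) {
      // Give the accountant a chance to sample memory and interrupt mid-build.
      if (docId % CHECK_INTERVAL == 0) {
        check.sampleAndCheck();
      }
      result.computeIfAbsent(values.get(docId), k -> new BitSet()).set(docId);
    }
    return result;
  }
}
```

If the check throws, the exception propagates out of `build` and the partially built map becomes unreachable, which is what allows a kill to take effect before the map is fully built.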

For the second, lazy initialization skips map generation entirely when filters exclude all results, which can save significant resources depending on the query.

Unit tests cover the logic. The change is also deployed in our prod instances, where we no longer see heap saturation from json_extract_index queries.

tags: enhancement performance (?)

codecov-commenter commented May 15, 2024

Codecov Report

Attention: Patch coverage is 91.66667%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 62.16%. Comparing base (59551e4) to head (6b7dff7).
Report is 450 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...rm/function/JsonExtractIndexTransformFunction.java | 90.90% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13164      +/-   ##
============================================
+ Coverage     61.75%   62.16%   +0.41%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2515      +79     
  Lines        133233   137876    +4643     
  Branches      20636    21335     +699     
============================================
+ Hits          82274    85713    +3439     
- Misses        44911    45776     +865     
- Partials       6048     6387     +339     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 62.13% <91.66%> (+0.42%) ⬆️ |
| java-21 | 62.04% <91.66%> (+0.41%) ⬆️ |
| skip-bytebuffers-false | 62.15% <91.66%> (+0.40%) ⬆️ |
| skip-bytebuffers-true | 62.02% <91.66%> (+34.29%) ⬆️ |
| temurin | 62.16% <91.66%> (+0.41%) ⬆️ |
| unittests | 62.16% <91.66%> (+0.41%) ⬆️ |
| unittests1 | 46.72% <83.33%> (-0.17%) ⬇️ |
| unittests2 | 27.79% <8.33%> (+0.06%) ⬆️ |

@yashmayya yashmayya left a comment

Thanks for the improvement @itschrispeck, and apologies for the newbie questions!

Comment on lines +411 to +414
/**
* Lazily initialize _valueToMatchingDocsMap, so that map generation is skipped when filtering excludes all values
*/
private Map<String, RoaringBitmap> getValueToMatchingDocsMap() {
Contributor

Just for my understanding, this isn't referring to the filtering done by the JsonExtractIndexTransformFunction itself, right (because that's only done after the map is generated in the JsonIndexReader)? So is this referring to query-level filters that might result in none of this transform function's transformToXyz methods being called (and I presume the transform function's init method is still called beforehand)?

Contributor Author

Yep exactly!

try (JsonIndexCreator offHeapIndexCreator = new OffHeapJsonIndexCreator(indexDir, colName, new JsonIndexConfig());
MutableJsonIndexImpl mutableJsonIndex = new MutableJsonIndexImpl(new JsonIndexConfig())) {
// build json indexes
for (int i = 0; i < 1000000; i++) {
Contributor

Why do we need such a large value here, even though we're setting the heap usage ratio at which the query killer kills queries to 0.00f?

Contributor Author

My guess is that the more frequent GCs in GitHub runners make measuring thread memory difficult. When logs are enabled, I can see the message "But all queries are below quota, no query killed" a couple of times before the query is picked. So my assumption is that GCs are causing the thread memory measurements to be lower than the initial/previous thread memory measurements.

Locally, it works reliably at 100k, but it failed when I initially pushed.

Contributor

Makes sense, that seems like a plausible explanation, thanks!

@yashmayya yashmayya left a comment

LGTM

@Jackie-Jiang Jackie-Jiang merged commit be6dd7e into apache:master May 23, 2024
19 of 20 checks passed
* Lazily initialize _valueToMatchingDocsMap, so that map generation is skipped when filtering excludes all values
*/
private Map<String, RoaringBitmap> getValueToMatchingDocsMap() {
if (_valueToMatchingDocsMap == null) {
Contributor

When this condition is true, does it mean that "filtering excludes all values"? Please clarify to match the method javadoc.

Contributor Author

It's a lazy initialization pattern: the first time the method is called, it builds the map, and subsequent calls reuse the already built map. The pattern lets us avoid pre-building the map in the transform function's init method, since we might never need it. Instead, the map is built when transforming the first ValueBlock.
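A minimal, self-contained sketch of the pattern described above may help readers who have not seen the diff. It is an illustration under stated assumptions and not the code from this PR: the class `LazyMatchingDocsMap` and the `Supplier`-based loader are hypothetical, `int[]` stands in for `RoaringBitmap`, and the getter is assumed to be called from a single thread, so there is no synchronization.

```java
import java.util.Map;
import java.util.function.Supplier;

class LazyMatchingDocsMap {
  // Hypothetical loader wrapping the expensive index read.
  private final Supplier<Map<String, int[]>> _loader;
  // Stays null until the first transform call actually needs the map.
  private Map<String, int[]> _valueToMatchingDocsMap;

  LazyMatchingDocsMap(Supplier<Map<String, int[]>> loader) {
    // Construction (the "init" step) does no expensive work.
    _loader = loader;
  }

  /** Builds the map on the first call; later calls return the cached instance. */
  Map<String, int[]> getValueToMatchingDocsMap() {
    if (_valueToMatchingDocsMap == null) {
      _valueToMatchingDocsMap = _loader.get();
    }
    return _valueToMatchingDocsMap;
  }
}
```

In this sketch the null check only means "not built yet". If query-level filters exclude every document, the getter is never called and the loader never runs.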

5 participants