Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader #179

Merged
merged 14 commits into from Sep 2, 2021

Conversation

gautamworah96
Copy link
Contributor

@gautamworah96 gautamworah96 commented Jun 10, 2021

Description

In LUCENE-9450 we switched the Taxonomy index from Stored Fields to BinaryDocValues. In the resulting implementation of the getPath code, we create a new BinaryDocValues's values instance for each ordinal.
It may happen that we may traverse over the same nodes over and over again if the getPath API is called multiple times for ordinals in the same segment/with the same readerIndex.

This PR takes advantage of that fact by sorting ordinals and then trying to find out if some of the ordinals are present in the same segment/have the same readerIndex (by trying to advanceExact to the correct position and not failing) thereby allowing us to reuse the previous BinaryDocValues object.

This PR was originally proposed in the older lucene-solr repo here. I discontinued it because we could not find a good "real" use case for it.

This new PR uses the getBulkPath API inside IntTaxonomyFacetCounts and FloatTaxonomyFacetCounts classes. In the getTopChildren method within these classes, we usually collect all the top-n ordinals first and then translate them into FacetLabels one by one using an iterative API. This call has been replaced with the getBulkPath method in this PR.

Solution

Steps:

  1. Sort all ordinals and remember their position so as to store the path in the correct position
  2. Try to advanceExact to the correct position with the previously calculated readerIndex. If the operation fails, try to find the correct segment for the ordinal and then advanceExact to the desired position.
  3. Store this position for future ordinals.

Tests

  1. Added a new randomized test for the API that compares the individual getPath results from ordinals with the bulk FacetLabels returned by the getBulkPath API
  2. Added a backwards compatibility test for the API that reads an older StoredFields based index

Benchmarks:

The stock luceneutil taxonomy tasks call getTopChildren(10, ..) for testing the performance of taxonomy facet counting classes. Since FastTaxonomyFacetCounts inherits from IntTaxonomyFacetCounts (and uses the same getTopChildren logic) the user should be able to reap the benefits of the bulk API directly.

luceneutil that tests the top 10 children:

                   TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                  Fuzzy1       44.07      (6.0%)       42.44      (7.2%)   -3.7% ( -16% -   10%) 0.078
       HighTermMonthSort       22.69     (12.1%)       21.99     (11.3%)   -3.1% ( -23% -   23%) 0.402
    HighTermTitleBDVSort       19.59     (12.4%)       18.99     (12.4%)   -3.1% ( -24% -   24%) 0.434
   HighTermDayOfYearSort       54.33     (15.7%)       53.11     (16.3%)   -2.3% ( -29% -   35%) 0.656
              TermDTSort       15.58      (9.1%)       15.24     (10.9%)   -2.2% ( -20% -   19%) 0.488
                HighTerm     1010.56      (4.0%)      988.75      (5.2%)   -2.2% ( -10% -    7%) 0.142
           OrHighNotHigh      422.64      (4.4%)      415.30      (5.6%)   -1.7% ( -11% -    8%) 0.274
                 MedTerm      734.85      (5.5%)      722.45      (4.8%)   -1.7% ( -11% -    9%) 0.301
                  Fuzzy2       14.51      (3.6%)       14.39      (2.9%)   -0.9% (  -7% -    5%) 0.399
             AndHighHigh       18.22      (3.4%)       18.07      (2.6%)   -0.8% (  -6% -    5%) 0.388
              AndHighLow      181.77      (1.7%)      180.51      (2.3%)   -0.7% (  -4% -    3%) 0.277
                 Respell       30.50      (2.0%)       30.36      (1.7%)   -0.5% (  -4% -    3%) 0.438
            OrNotHighLow      347.94      (4.7%)      346.46      (3.6%)   -0.4% (  -8% -    8%) 0.747
             LowSpanNear        3.83      (1.9%)        3.82      (2.2%)   -0.3% (  -4% -    3%) 0.626
              HighPhrase       39.95      (7.0%)       39.85      (7.3%)   -0.3% ( -13% -   15%) 0.909
            OrHighNotMed      405.04      (4.9%)      404.00      (5.4%)   -0.3% ( -10% -   10%) 0.874
               OrHighLow      110.15      (3.0%)      109.87      (3.2%)   -0.2% (  -6% -    6%) 0.800
               OrHighMed       61.11      (3.1%)       60.96      (2.9%)   -0.2% (  -6% -    5%) 0.795
         MedSloppyPhrase        5.47      (6.0%)        5.46      (3.1%)   -0.2% (  -8% -    9%) 0.890
             MedSpanNear       12.25      (1.6%)       12.24      (2.0%)   -0.1% (  -3% -    3%) 0.887
                  IntNRQ       25.39      (1.3%)       25.37      (2.0%)   -0.1% (  -3% -    3%) 0.880
              AndHighMed       64.96      (3.1%)       64.95      (2.7%)   -0.0% (  -5% -    6%) 0.977
            HighSpanNear       17.37      (2.0%)       17.38      (3.7%)    0.0% (  -5% -    5%) 0.963
         LowSloppyPhrase       55.39      (4.0%)       55.42      (2.6%)    0.1% (  -6% -    6%) 0.958
              OrHighHigh        9.85      (2.1%)        9.86      (2.9%)    0.1% (  -4% -    5%) 0.918
                Wildcard       29.06      (4.5%)       29.08      (3.4%)    0.1% (  -7% -    8%) 0.942
    HighIntervalsOrdered        7.97      (2.5%)        7.98      (3.1%)    0.1% (  -5% -    5%) 0.907
                 LowTerm      883.85      (4.4%)      884.97      (6.1%)    0.1% (  -9% -   11%) 0.940
                 Prefix3       40.49      (7.1%)       40.55      (5.2%)    0.2% ( -11% -   13%) 0.939
            OrHighNotLow      440.52      (5.1%)      441.49      (4.8%)    0.2% (  -9% -   10%) 0.888
               MedPhrase      138.15      (4.3%)      138.51      (4.7%)    0.3% (  -8% -    9%) 0.855
               LowPhrase      111.07      (2.9%)      111.40      (3.0%)    0.3% (  -5% -    6%) 0.751
            OrNotHighMed      375.26      (6.7%)      376.64      (6.9%)    0.4% ( -12% -   15%) 0.865
   BrowseMonthTaxoFacets        0.60      (5.2%)        0.60      (5.6%)    0.4% (  -9% -   11%) 0.806
        HighSloppyPhrase        2.11      (5.7%)        2.12      (3.5%)    0.5% (  -8% -   10%) 0.730
    BrowseDateTaxoFacets        0.57      (5.1%)        0.58      (5.5%)    0.5% (  -9% -   11%) 0.746
BrowseDayOfYearTaxoFacets        0.57      (5.2%)        0.58      (5.5%)    0.6% (  -9% -   11%) 0.742
           OrNotHighHigh      391.99      (5.2%)      394.52      (5.3%)    0.6% (  -9% -   11%) 0.701
                PKLookup      113.96      (3.1%)      115.11      (3.4%)    1.0% (  -5% -    7%) 0.320
   BrowseMonthSSDVFacets        2.62      (1.0%)        2.69      (1.8%)    3.0% (   0% -    5%) 0.000
BrowseDayOfYearSSDVFacets        2.46      (0.3%)        2.54      (2.7%)    3.3% (   0% -    6%) 0.000

Since, the previous results did not show any gain in the Browse*TaxoFacets tasks, I realized that maybe it was because the ordinalCache with a stock size of 4000 had all ordinals cached which led to the bulk API behaving the same as the iterative getPath call.

The next benchmark tested the getTopChildren call with 10,000 children

luceneutil that tests the top 10000 children:

                    TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                  Fuzzy2       35.18      (4.4%)       34.81      (7.4%)   -1.0% ( -12% -   11%) 0.588
       HighTermMonthSort       47.71     (12.1%)       47.28     (11.8%)   -0.9% ( -22% -   26%) 0.812
              AndHighMed       33.50      (3.4%)       33.35      (2.6%)   -0.4% (  -6% -    5%) 0.648
               OrHighMed       31.86      (2.2%)       31.75      (1.9%)   -0.3% (  -4% -    3%) 0.595
    HighIntervalsOrdered        2.59      (1.8%)        2.58      (1.4%)   -0.3% (  -3% -    2%) 0.540
               OrHighLow      175.82      (2.7%)      175.43      (3.4%)   -0.2% (  -6% -    5%) 0.816
             AndHighHigh       29.36      (3.9%)       29.33      (2.7%)   -0.1% (  -6% -    6%) 0.915
         MedSloppyPhrase       18.63      (2.0%)       18.61      (2.6%)   -0.1% (  -4% -    4%) 0.909
              TermDTSort       37.97     (10.0%)       37.97     (10.2%)    0.0% ( -18% -   22%) 0.997
                  Fuzzy1       24.97      (4.4%)       24.97      (4.1%)    0.0% (  -8% -    8%) 0.983
             MedSpanNear       51.30      (2.0%)       51.35      (1.6%)    0.1% (  -3% -    3%) 0.860
   HighTermDayOfYearSort       45.86     (16.5%)       45.92     (16.4%)    0.1% ( -28% -   39%) 0.980
              OrHighHigh        8.55      (2.1%)        8.56      (1.9%)    0.1% (  -3% -    4%) 0.826
            HighSpanNear        9.30      (1.7%)        9.32      (1.7%)    0.2% (  -3% -    3%) 0.766
    BrowseDateTaxoFacets        0.58      (4.7%)        0.58      (4.8%)    0.2% (  -8% -   10%) 0.915
   BrowseMonthTaxoFacets        0.60      (4.9%)        0.60      (4.8%)    0.2% (  -9% -   10%) 0.894
BrowseDayOfYearTaxoFacets        0.57      (4.7%)        0.58      (4.8%)    0.3% (  -8% -   10%) 0.861
                PKLookup      116.98      (3.6%)      117.30      (3.1%)    0.3% (  -6% -    7%) 0.794
         LowSloppyPhrase        5.58      (4.0%)        5.60      (3.9%)    0.3% (  -7% -    8%) 0.785
             LowSpanNear       11.11      (1.6%)       11.15      (1.7%)    0.4% (  -2% -    3%) 0.451
    HighTermTitleBDVSort       25.79      (7.2%)       25.89      (8.7%)    0.4% ( -14% -   17%) 0.875
                  IntNRQ       25.27      (1.5%)       25.37      (1.3%)    0.4% (  -2% -    3%) 0.347
               LowPhrase       40.67      (2.2%)       40.84      (1.4%)    0.4% (  -3% -    4%) 0.478
        HighSloppyPhrase        5.99      (4.1%)        6.01      (4.1%)    0.4% (  -7% -    8%) 0.745
                Wildcard       20.27      (3.5%)       20.41      (3.4%)    0.7% (  -5% -    7%) 0.500
                 Respell       34.15      (2.1%)       34.41      (1.4%)    0.8% (  -2% -    4%) 0.175
              AndHighLow      358.53      (3.8%)      361.84      (3.8%)    0.9% (  -6% -    8%) 0.438
            OrNotHighLow      369.77      (4.3%)      374.69      (2.7%)    1.3% (  -5% -    8%) 0.239
                 MedTerm      980.66      (4.2%)      994.40      (5.2%)    1.4% (  -7% -   11%) 0.347
           OrHighNotHigh      388.09      (4.6%)      394.10      (4.0%)    1.5% (  -6% -   10%) 0.258
            OrNotHighMed      419.20      (3.9%)      426.30      (3.9%)    1.7% (  -5% -    9%) 0.173
               MedPhrase      281.59      (4.2%)      287.03      (3.3%)    1.9% (  -5% -    9%) 0.107
                 LowTerm      758.64      (5.7%)      775.00      (4.9%)    2.2% (  -8% -   13%) 0.200
              HighPhrase       69.25      (6.5%)       70.85      (6.2%)    2.3% (  -9% -   16%) 0.250
           OrNotHighHigh      517.94      (5.7%)      530.26      (3.5%)    2.4% (  -6% -   12%) 0.111
            OrHighNotMed      438.39      (5.5%)      449.26      (3.8%)    2.5% (  -6% -   12%) 0.096
                HighTerm      650.61      (4.9%)      667.25      (4.4%)    2.6% (  -6% -   12%) 0.081
            OrHighNotLow      308.34      (4.3%)      317.07      (5.0%)    2.8% (  -6% -   12%) 0.053
                 Prefix3       88.13      (8.6%)       91.32      (8.5%)    3.6% ( -12% -   22%) 0.181
   BrowseMonthSSDVFacets        2.61      (1.0%)        2.71      (1.4%)    3.9% (   1% -    6%) 0.000
BrowseDayOfYearSSDVFacets        2.46      (0.4%)        2.56      (2.1%)    4.1% (   1% -    6%) 0.000

I surprisingly see some gain in BrowseDayOfYearSSDVFacets and BrowseMonthSSDVFacets tasks but I think that is noise (because taxonomy facets have a different impl. as compared to SortedSetDocValuesFacets).
Edit: Did an A/A test and BrowseDayOfYearSSDVFacets does not show any gain. So the BrowseDayOfYearSSDVFacets 4.1% perf improvement is unexplained..
Browse*TaxoFacets tasks don't show gain.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

@gautamworah96 gautamworah96 changed the title LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyWriter LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader Jun 10, 2021
@mikemccand
Copy link
Member

That is really odd that the SSDVFacets show ~4% gain! I wonder if there is a bug in luceneutil mis-labeling which facets task is using which impl!? You could run a benchmark, filtering tasks to only do these SSDVFacets tasks, and then check whether the taxo facet impls are being used at all maybe.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is close! I left a bunch of small comments. The performance is odd but I don't think that should block this improvement.

/*
If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that contains our ordinal
*/
if (values == null || ordinals[i] >= leafReaderMaxDoc) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, don't we need to compare ordinals[i] >= leafReaderMaxDoc + leafReaderDocBase instead? I.e. the ordinal is in the global IndexReader docid space, but the leafReaderMaxDoc is just for this leaf. How come tests are not failing due to this :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a big miss. Thanks for catching this.

Here is why the test was passing:
When the ordinals were all in the first leaf, the ordinal was < leafReaderMaxDoc and the wrong code worked correctly.

When the ordinals started spilling into the second leaf, the wrong code would recalculate the leaf each time and then find the correct Label. Thus were not actually making use of the increasing values of the ordinals (and were recalculating the leafContext each time).
This would go on for all other leaves as well.

I'll rerun benchmarks

@gautamworah96
Copy link
Contributor Author

Hey @mikemccand , my apologies for leaving this PR hanging! The unexpected benchmark results threw me off..

I've updated the PR to fix the bug you had previously discovered and updated the PR to bring it on par with mainline.

I re-ran benchmarks after fixing the leafReaderDocBase not added to the leafReaderMaxDoc bug:

                    TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               OrHighLow      394.22      (4.1%)      390.60      (4.7%)   -0.9% (  -9% -    8%) 0.508
           OrHighNotHigh      393.24      (5.0%)      393.11      (4.4%)   -0.0% (  -9% -    9%) 0.982
    HighIntervalsOrdered        2.23      (1.3%)        2.23      (1.2%)    0.1% (  -2% -    2%) 0.768
              HighPhrase      169.85      (4.9%)      170.09      (5.7%)    0.1% (  -9% -   11%) 0.932
    BrowseDateTaxoFacets        0.57      (1.6%)        0.57      (1.6%)    0.2% (  -2% -    3%) 0.684
            HighSpanNear        3.57      (2.2%)        3.58      (2.1%)    0.2% (  -3% -    4%) 0.711
             LowSpanNear       11.32      (1.8%)       11.35      (1.4%)    0.3% (  -2% -    3%) 0.604
            OrNotHighMed      398.53      (6.1%)      399.62      (7.5%)    0.3% ( -12% -   14%) 0.900
             MedSpanNear       11.37      (1.8%)       11.41      (1.2%)    0.3% (  -2% -    3%) 0.522
                 LowTerm      949.41      (5.7%)      952.69      (5.2%)    0.3% (  -9% -   11%) 0.841
              OrHighHigh       27.45      (2.5%)       27.55      (2.1%)    0.4% (  -4% -    5%) 0.615
                  IntNRQ       19.98     (30.1%)       20.06     (29.6%)    0.4% ( -45% -   85%) 0.967
                 Respell       24.48      (2.4%)       24.58      (2.5%)    0.4% (  -4% -    5%) 0.575
         LowSloppyPhrase       19.15      (2.6%)       19.23      (2.3%)    0.4% (  -4% -    5%) 0.569
               OrHighMed       23.02      (2.9%)       23.14      (1.8%)    0.5% (  -4% -    5%) 0.482
BrowseDayOfYearTaxoFacets     6311.77      (3.7%)     6349.39      (4.2%)    0.6% (  -7% -    8%) 0.634
                  Fuzzy1       42.80      (3.9%)       43.06      (5.0%)    0.6% (  -7% -    9%) 0.666
            OrHighNotMed      446.39      (5.5%)      449.32      (5.6%)    0.7% (  -9% -   12%) 0.707
        HighSloppyPhrase        3.57      (4.4%)        3.59      (3.8%)    0.7% (  -7% -    9%) 0.610
            OrNotHighLow      418.72      (3.6%)      421.68      (3.6%)    0.7% (  -6% -    8%) 0.535
               LowPhrase       12.47      (2.8%)       12.56      (2.8%)    0.8% (  -4% -    6%) 0.391
            OrHighNotLow      494.57      (4.9%)      498.46      (5.4%)    0.8% (  -9% -   11%) 0.629
   HighTermDayOfYearSort        7.07     (12.8%)        7.13     (10.2%)    0.9% ( -19% -   27%) 0.799
         MedSloppyPhrase       21.85      (2.8%)       22.06      (2.5%)    0.9% (  -4% -    6%) 0.266
                  Fuzzy2       35.04      (3.6%)       35.43      (3.7%)    1.1% (  -5% -    8%) 0.330
    HighTermTitleBDVSort       39.93      (8.7%)       40.40      (6.9%)    1.2% ( -13% -   18%) 0.638
                Wildcard       38.16      (4.4%)       38.60      (3.9%)    1.2% (  -6% -    9%) 0.370
              AndHighMed       47.87      (2.7%)       48.45      (3.6%)    1.2% (  -4% -    7%) 0.224
           OrNotHighHigh      367.90      (4.8%)      372.56      (3.9%)    1.3% (  -7% -   10%) 0.362
             AndHighHigh       20.63      (3.2%)       20.94      (4.6%)    1.5% (  -6% -    9%) 0.221
       HighTermMonthSort       25.15     (12.1%)       25.54     (13.8%)    1.6% ( -21% -   31%) 0.703
               MedPhrase       54.64      (6.1%)       55.54      (6.9%)    1.6% ( -10% -   15%) 0.424
              TermDTSort       16.01     (14.6%)       16.28     (16.7%)    1.7% ( -25% -   38%) 0.737
   BrowseMonthSSDVFacets        2.62      (0.9%)        2.66      (2.1%)    1.7% (  -1% -    4%) 0.001
              AndHighLow      252.16      (3.3%)      256.71      (3.3%)    1.8% (  -4% -    8%) 0.086
                HighTerm      998.70      (6.8%)     1020.66      (5.8%)    2.2% (  -9% -   15%) 0.271
   BrowseMonthTaxoFacets     6354.02      (5.9%)     6495.78      (5.4%)    2.2% (  -8% -   14%) 0.211
                PKLookup      113.56      (4.2%)      116.24      (4.4%)    2.4% (  -6% -   11%) 0.084
BrowseDayOfYearSSDVFacets        2.46      (0.5%)        2.52      (2.9%)    2.5% (   0% -    5%) 0.000
                 MedTerm      815.18      (4.9%)      837.70      (6.0%)    2.8% (  -7% -   14%) 0.109
                 Prefix3      144.90     (10.0%)      149.12     (11.5%)    2.9% ( -16% -   27%) 0.392

The gain in taxonomy based facets is about 2%. I ran it multiple times locally and it gave around 2% gain in those runs as well. SSDV facets still show a 3% gain which is unexplained.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the iteration @gautamworah96 -- we are getting closer!

Gautam Worah added 2 commits August 26, 2021 15:24
contention).

Modified the test case to be super random (threads,
iterations etc).

Modified TaxonomyReader with a getBulkPath method that
is overridden in DirTaxonomyReader.

Improved documentation, improved
method signatures to use the newer (...) style syntax, fixed the CHANGES
entry
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks super close! I left some small comments.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- I left tiny polish comments -- almost done! Thanks @gautamworah96!

/*
If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
*/
if (values == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's sort of weird to do this check per-segment, since we can know up front, based on indexed created version, whether it's stored fields or doc values? I.e. this is not a segment by segment decision. But let's do it in follow-on issue. I think this one is ready!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'll create it after this PR is merged. Otherwise, it just feels weird to create an issue for a fix in a piece of code that has not been merged :D

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gautamworah96 -- looks great! I'll try to merge soon! Then we should watch nightly Lucene benchmarks and see if facets got a bit faster.

@mikemccand mikemccand merged commit 1036c70 into apache:main Sep 2, 2021
@mikemccand
Copy link
Member

Merged! @gautamworah96 I think we should backport this one to 8.x? It is just adding a faster API.

mikemccand pushed a commit to mikemccand/lucene that referenced this pull request Sep 3, 2021
Resolved Conflicts:
	gradle/validation/validate-source-patterns.gradle
	solr/core/src/java/org/apache/solr/core/CoreContainer.java
@gautamworah96
Copy link
Contributor Author

I think we should backport this one to 8.x

This change relies on the older repo having the "LUCENE-9450 Switch taxonomy index to use BinaryDocValues" change.

We had initially decided that we will not backport it to 8x because it was a major change to the index type/format. Thus this change also can't be backported :/

@gautamworah96 gautamworah96 deleted the LUCENE-9476 branch September 7, 2021 17:57
@mikemccand
Copy link
Member

I think we should backport this one to 8.x

This change relies on the older repo having the "LUCENE-9450 Switch taxonomy index to use BinaryDocValues" change.

We had initially decided that we will not backport it to 8x because it was a major change to the index type/format. Thus this change also can't be backported :/

Ahh, yes, sorry! We cannot backport this change then. So this is 9.0 only optimization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants