Benchmark custom #2345

Open

balmukundblr wants to merge 7 commits into apache:master from balmukundblr:benchmark_custom

balmukundblr commented Feb 10, 2021

Description

Lucene Benchmark Scaling Problem with Reuters Corpus

While indexing 1 million documents from reuters21578 (plain-text documents derived from the reuters21578 corpus), we observed that index throughput does not scale with a higher number of index threads and instead degrades. The existing implementation uses a synchronized block that allows only one thread at a time to pick up a document/file from the list; this code is part of getNextDocData() in ReutersContentSource.java. With multiple index threads, this becomes a thread-contention bottleneck and prevents the system's CPU resources from being used efficiently.

Solution

We developed a strategy that distributes the input files across the indexing threads, so that the threads work independently and in parallel.
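
A minimal sketch of one way such a distribution could work, assuming a strided scheme in which worker slot s out of N handles files s, s + N, s + 2N, and so on. This follows the shape of the fIndex expression quoted in the review comments, but the class and method names (StridedAssignment, fileIndex, indicesForSlot) are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not the code in this pull request.
class StridedAssignment {

  /** Index of the k-th file handled by worker `slot` when `numThreads` workers share the list. */
  static int fileIndex(int slot, int k, int numThreads) {
    return slot + k * numThreads;
  }

  /** All file indices that one slot visits in a single pass over `numFiles` files. */
  static List<Integer> indicesForSlot(int slot, int numThreads, int numFiles) {
    List<Integer> out = new ArrayList<>();
    for (int k = 0; fileIndex(slot, k, numThreads) < numFiles; k++) {
      out.add(fileIndex(slot, k, numThreads));
    }
    return out;
  }

  public static void main(String[] args) {
    int numThreads = 3;
    int numFiles = 8;
    for (int slot = 0; slot < numThreads; slot++) {
      // Prints [0, 3, 6], then [1, 4, 7], then [2, 5]
      System.out.println("slot " + slot + " -> " + indicesForSlot(slot, numThreads, numFiles));
    }
  }
}
```

Under this assumption no two slots ever share a file, so no lock is needed when picking the next file. Every file is covered only if every slot from 0 to N - 1 is actually taken by some thread.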

Tests

We mainly modified the existing getNextDocData() without altering its functionality, so we have not added any new test cases.

Passed existing tests

Checklist

Please review the following and check all that apply:

[x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
[ ] I have created a Jira issue and added the issue ID to my pull request title.
[x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
[x] I have developed this patch against the master branch.
[x] I have run ./gradlew check.
[ ] I have added tests for my changes.
[ ] I have added documentation for the Ref Guide (for Solr changes only).

balmukundblr added 3 commits

February 10, 2021 06:13


          Added the code to read input files from multiple parallel threads for better resource utilizations

1a771bb


          Adding comments

1b653aa


          Fixed the formating issue

3e5bffd

mikemccand reviewed

Member

mikemccand left a comment

In general, I love this idea, to improve the concurrency of ReutersContentSource!

This source is very inefficient since it opens/closes a new file per document. This inefficiency is why we added the LineFileDocs representation instead, which uses a single file and one readLine() per document, and (in Lucene's nightly benchmarks) the chunked binary form of LineFileDocs to even further reduce single-thread contention in reading/creating documents to index.

In general when testing Lucene's indexing performance it is vital to make the source of the documents as absolutely trivial as possible, to create as pure an indexing performance test as we can.

I left some small code style feedback. I also did not understand how this concurrency optimization is functionally correct, since you cannot ensure that .getId() % N always distributes across all values 0 .. N - 1?

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java Outdated

+                   * name = f.toRealPath() + "_" +iteration;
+                   * }
+                   */
+                  if (!threadIndexCreated) {

Member

mikemccand Feb 23, 2021

if (threadIndexCreated == false) { instead (to reduce chance of accidental future refactoring bugs)? This likely won't pass our code style checker (gradle precommit).

Author

balmukundblr Mar 4, 2021

Sure, will do the required changes.

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java Outdated

+                  int inputFilesSize = inputFiles.size();
+                  /*
+                   * synchronized (this) {

Member

mikemccand Feb 23, 2021

Just delete this old code? You are replacing it with a more concurrent version, yay!

Author

balmukundblr Mar 4, 2021

Sure, will delete the commented-out code.

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java Outdated

+                  threadIndex[index]++;
+                  // Sanity check, if # threads is greater than # input files, wrap index
+                  if (index >= inputFilesSize) index %= inputFilesSize;

Member

mikemccand Feb 23, 2021

Can you move the index %= inputFilesSize to a new line, inside a { ... } body?

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java

+                  int index = (int) Thread.currentThread().getId() % threadIndex.length;
+                  int fIndex = index + threadIndex[index] * threadIndex.length;
+                  threadIndex[index]++;

Member

mikemccand Feb 25, 2021

I'm confused how this approach ensures that we will indeed index every document in the inputFiles?

Thread.currentThread().getId() % threadIndex.length is not guaranteed to reach every possible int from 0 .. threadIndex.length - 1?
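
To make the concern concrete, here is a small sketch that applies the same reduction to two made-up sets of thread IDs. The ID values and the names ThreadIdSlots and coveredSlots are assumptions chosen for illustration; real IDs are assigned by the JVM and depend on which other threads have been created.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: the thread id values below are invented, not observed.
class ThreadIdSlots {

  /** Slots reached when each id is reduced the same way as (int) getId() % n. */
  static Set<Integer> coveredSlots(long[] threadIds, int n) {
    Set<Integer> slots = new TreeSet<>();
    for (long id : threadIds) {
      slots.add((int) id % n);
    }
    return slots;
  }

  public static void main(String[] args) {
    long[] consecutive = {11, 12, 13, 14}; // four ids with no gaps
    long[] withGap = {11, 12, 14, 15};     // id 13 went to some unrelated thread

    System.out.println(coveredSlots(consecutive, 4)); // [0, 1, 2, 3]
    System.out.println(coveredSlots(withGap, 4));     // [0, 2, 3]
  }
}
```

In the second case slot 1 is never reached, so under a strided assignment the files belonging to that slot would not be read, while two threads would share slot 3 and update the same counter without synchronization.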

Author

balmukundblr Mar 4, 2021

Although getId() is controlled by the JVM, in our case all the indexing threads are created at once, so there is a high chance of getting a consecutive sequence of thread IDs, which is also what we observed. However, we understand your concern and have changed our code so that it is guaranteed to reach every possible int from 0 .. threadIndex.length - 1. We achieved this by setting a unique thread name and parsing it to calculate the index value.
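
The revised code is not shown in this thread, so the following is only a sketch of the general idea, assuming each worker is named with a numeric suffix that is parsed back into a slot. The prefix "IndexThread-" and the names NamedThreadSlots, slotOf, and runWorkers are hypothetical.

```java
import java.util.Arrays;

// Illustrative only: the naming convention here is an assumption, not the patch's.
class NamedThreadSlots {
  static final String PREFIX = "IndexThread-"; // hypothetical thread name prefix

  /** Parses the numeric suffix after the last '-' in the thread name. */
  static int slotOf(String threadName) {
    return Integer.parseInt(threadName.substring(threadName.lastIndexOf('-') + 1));
  }

  /** Starts n named workers and records which slots they resolved to. */
  static boolean[] runWorkers(int n) {
    boolean[] seen = new boolean[n];
    Thread[] workers = new Thread[n];
    for (int i = 0; i < n; i++) {
      workers[i] = new Thread(() -> {
        int slot = slotOf(Thread.currentThread().getName());
        seen[slot] = true; // each worker writes only its own element
      }, PREFIX + i);
      workers[i].start();
    }
    try {
      for (Thread t : workers) {
        t.join(); // join makes each worker's write visible to the caller
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    }
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(runWorkers(4))); // [true, true, true, true]
  }
}
```

Because the slot comes from a name chosen by the code that creates the threads, coverage of 0 .. n - 1 no longer depends on how the JVM hands out thread IDs. It does depend on every worker being created with the expected name.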

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java Outdated

                   nextFile = 0;
                   iteration = 0;
                 }
+                private synchronized void createThreadIndex() {
+                  if (!threadIndexCreated) {

Member

mikemccand Feb 25, 2021

== false instead? Or maybe change to assert threadIndexCreated == false since you also check this up above with a real if already?

Author

balmukundblr Mar 4, 2021

Sure, will do the required changes.

balmukundblr added 4 commits

March 5, 2021 08:38


          Incorporating formate issue and some changes for guaranteed file index

517bdea


          Incorporating formate issue and some changes for guaranteed file index

84566c1


          Resolved some gradle build issues and added the comments

0c43c86


          Resolved some gradle build issues as well added new comments

7d67961

balmukundblr mentioned this pull request

Parallel processing apache/lucene#132

Merged

6 tasks
