Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark custom #2345

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

balmukundblr
Copy link

Description

Lucene Benchmark Scaling Problem with Reuters Corpus

While Indexing 1 million documents with reuters21578 (plain text Document derived from reuters21578 corpus), we observed that with higher number of Index threads, the Index throughput does not scale and degrades. Existing implementation with synchronization block allows only one thread to pick up a document/file from list, at any given time – this code is part of getNextDocData() in ReutersContentSource.java. With multiple index threads, this becomes a thread contention bottleneck and does not allow the system CPU resource to be used efficiently.

Solution

We developed a strategy to distribute total number of files across multiple number of Indexing threads, so that these threads work independently and parallelly.

Tests

We mainly modified existing getNextDocData(), which is not altering functionality, hence not added any new test cases.

  • Passed existing tests

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
    [ ] I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ./gradlew check.
    [ ] I have added tests for my changes.
    [ ] I have added documentation for the Ref Guide (for Solr changes only).

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I love this idea, to improve the concurrency of ReutersContentSource!

This source is very inefficient since it opens/closes a new file per document. This inefficiency is why we added the LineFileDocs representation instead, which uses a single file and one readLine() per document, and (in Lucene's nightly benchmarks) the chunked binary form of LineFileDocs to even further reduce single-thread contention in reading/creating documents to index.

In general when testing Lucene's indexing performance it is vital to make the source of the documents as absolutely trivial as possible, to create as pure an indexing performance test as we can.

I left some small code style feedback, and also I did not understand how this concurrency optimization is functionally correct since you cannot ensure .getId() % N always distributes across all values 0 .. N?

* name = f.toRealPath() + "_" +iteration;
* }
*/
if (!threadIndexCreated) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (threadIndexCreated == false) { instead (to reduce chance of accidental future refactoring bugs)? This likely won't pass our code style checker (gradle precommit).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do the required changes.

int inputFilesSize = inputFiles.size();

/*
* synchronized (this) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just delete this old code? You are replacing it with a more concurrent version, yay!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will delete the commented codes.

threadIndex[index]++;

// Sanity check, if # threads is greater than # input files, wrap index
if (index >= inputFilesSize) index %= inputFilesSize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the index %= inputFilesSize to newline and inside { ... } body?


int index = (int) Thread.currentThread().getId() % threadIndex.length;
int fIndex = index + threadIndex[index] * threadIndex.length;
threadIndex[index]++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused how this approach ensures that we will indeed index every document in the inputFiles?

Thread.currentThread().getId() % threadIndex.length is not guaranteed to reach every possible int from 0 .. threadIndex.length?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, getId() is controlled by JVM but in our case, all threadIndex are getting initialized at once. Hence, there is high chance of getting guaranteed sequence of thread id, as we also observed. However, we understand your concern and tweaked our code in such a way that it guaranteed to reach every possible int from 0 .. threadIndex.length. We achieved it by setting a unique thread name and parsing the same for calculating the index value.

@@ -146,4 +172,11 @@ public synchronized void resetInputs() throws IOException {
nextFile = 0;
iteration = 0;
}

private synchronized void createThreadIndex() {
if (!threadIndexCreated) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

== false instead? Or maybe change to assert threadIndexCreated == false since you also check this up above with a real if already?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do the required changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants