Parallel processing #132

balmukundblr · 2021-05-11T07:29:19Z

Description

Please note: this is not a new PR. The original PR (apache/lucene-solr#2345) was raised against the old apache/lucene-solr GitHub repository; this is a copy of it in the new repository.

Lucene Benchmark Scaling Problem with Reuters Corpus

While indexing 1 million documents with reuters21578 (plain-text documents derived from the reuters21578 corpus), we observed that indexing throughput does not scale with a higher number of index threads; it degrades. The existing implementation uses a synchronized block that allows only one thread at a time to pick up a document/file from the list; this code is part of getNextDocData() in ReutersContentSource.java. With multiple index threads, this becomes a thread-contention bottleneck and prevents the system's CPU resources from being used efficiently.

Solution

We developed a strategy that distributes the total number of files across the indexing threads, so that the threads work independently and in parallel.
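As a rough illustration of the idea (not the PR's actual code), the sketch below assigns files to threads with a fixed stride: thread i visits files i, i + T, i + 2T, and so on for T threads, so no two threads ever compete for the same file and no lock is needed. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class StridedAssignment {
  /** Returns the file indices that the given thread visits in one pass over the input. */
  public static List<Integer> filesForThread(int threadIndex, int numThreads, int numFiles) {
    List<Integer> result = new ArrayList<>();
    // Start at the thread's own index and step by the thread count,
    // so the sets of files owned by different threads are disjoint.
    for (int f = threadIndex; f < numFiles; f += numThreads) {
      result.add(f);
    }
    return result;
  }

  public static void main(String[] args) {
    int numThreads = 3;
    int numFiles = 8;
    for (int t = 0; t < numThreads; t++) {
      System.out.println("thread " + t + " -> " + filesForThread(t, numThreads, numFiles));
    }
  }
}
```

With 3 threads and 8 files, thread 0 gets files 0, 3, 6; thread 1 gets 1, 4, 7; and thread 2 gets 2, 5.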

Tests

We mainly modified the existing getNextDocData() without altering its functionality, so we have not added any new test cases.

Passed existing tests

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.

balmukundblr · 2021-05-11T13:04:43Z

@mikemccand
We have raised a new PR in the new Lucene GitHub repo, as you suggested.

Thanks & Regards,
Balmukund

mikemccand

This looks close! I'm a little worried about the corner case when number of threads exceeds number of documents.

Can you share what speedup you saw on what kind of concurrent computer with this versus mainline?

mikemccand · 2021-05-11T14:49:04Z

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java

+    }
+
+    // Getting file index value which is set for each thread
+    int index = Integer.parseInt(Thread.currentThread().getName().substring(12));


Is TaskSequence / ParallelTask the only place where new Threads are created in benchmarks?

Could you add a comment here pointing to ParallelTask.java explaining that we named/numbered the threads carefully, and that's why this parsing to int is safe?

- Yes, TaskSequence.java is the only place where new index threads are created.

- We want the names of the index threads to follow a guaranteed sequence, so we explicitly set the thread names in TaskSequence.java. Each thread name follows the pattern "IndexThread-N", where N is an integer, so it is safe to parse the index out of the thread name as an int.
We'll also add the necessary comments in ReutersContentSource.java.

mikemccand · 2021-05-11T14:49:49Z

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java

+
+    // Getting file index value which is set for each thread
+    int index = Integer.parseInt(Thread.currentThread().getName().substring(12));
+    int fIndex = index + threadIndex[index] * threadIndex.length;


Maybe add assert index >= 0 && index < threadIndex.length above this? This way if there is some thread naming bug, and assertions are enabled, we hit AssertionError before AIOOBE.

Sure. We'll incorporate this check.

mikemccand · 2021-05-11T14:55:37Z

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java

    }

+    // Check if this thread has exhausted its files
+    if (fIndex >= inputFilesSize) {
+      threadIndex[index] = 0;


Hmm, in the case where number-of-threads is bigger than number-of-input-files, aren't we (always) setting the wrong index back to 0 here? Does that matter? Maybe add a dedicated test case so this new code is exercised?
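To make the corner case concrete, here is a small sketch that models only the two quoted lines: the file index formula from the earlier hunk and the reset to 0. Everything else (the class name, the helper methods, and how the real code proceeds after the reset) is an assumption for illustration.

```java
class StrideCornerCase {
  /** File index as in the quoted diff: index + threadIndex[index] * threadIndex.length. */
  public static int fileIndex(int index, int[] threadIndex) {
    return index + threadIndex[index] * threadIndex.length;
  }

  /** Returns true if the thread still has no valid file after its counter is reset to 0. */
  public static boolean stuckAfterReset(int index, int numThreads, int inputFilesSize) {
    int[] threadIndex = new int[numThreads]; // every per-thread counter starts at 0
    int fIndex = fileIndex(index, threadIndex);
    if (fIndex >= inputFilesSize) {
      threadIndex[index] = 0; // the reset from the quoted diff
      fIndex = fileIndex(index, threadIndex); // recomputed value is just `index` again
    }
    return fIndex >= inputFilesSize;
  }

  public static void main(String[] args) {
    int numThreads = 4;
    int inputFilesSize = 2;
    for (int i = 0; i < numThreads; i++) {
      System.out.println("thread " + i + " stuck: " + stuckAfterReset(i, numThreads, inputFilesSize));
    }
  }
}
```

Under this model, with 4 threads and 2 files, threads 2 and 3 start out of range and the reset does not help them, because a counter of 0 maps back to the thread's own index.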

mikemccand · 2021-06-04T13:50:39Z

Thanks for the updates!

It looks like gradle check is upset -- if you run ./gradlew tidy it will re-format your changes and it should pass again!

mikemccand

I think this is really close! I left a few small comments. Thanks @balmukundblr!

mikemccand · 2021-06-04T14:02:09Z

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java

+    int inFileSize = inputFiles.size();
+
+    //Modulo Operator covers all three possible senarios i.e. 1. If inputFiles.size() < Num Of Threads 2.inputFiles.size() == Num Of Threads 3.inputFiles.size() > Num Of Threads
+    int fileIndex = stride % inFileSize;


Hmm do we already guard for the (degenerate) case of inFileSize == 0? If not can we add some protection here, e.g. maybe throw a clear exception that there is nothing to index?

Mike, this is already handled in setConfig() of ReutersContentSource.java. Here is the relevant code snippet:

    if (inputFiles.size() == 0) {
      throw new RuntimeException("No txt files in dataDir: " + dataDir.toAbsolutePath());
    }

Sorry Mike, I forgot to mention that I've tested with inFileSize == 0 and it throws the expected exception.

mikemccand · 2021-06-04T14:03:05Z

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/TaskSequence.java

-        t[index++] = new ParallelTask(task);
+        t[index] = new ParallelTask(task);
+        //Setting unique ThreadName with index value which is used in ReuersContentSource.java's getNextDocData()
+        t[index].setName("IndexThread-" + index);


In general, parallel tasks might be running queries too right? Maybe we should pick a more generic name? Maybe ParallelTaskThread-N?

Thank you, Mike. We made the requested changes.

mikemccand · 2021-06-04T14:05:41Z

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/TaskSequence.java

    // prepare threads
    int index = 0;
    for (int k = 0; k < repetitions; k++) {
      for (int i = 0; i < tasksArray.length; i++) {
        final PerfTask task = tasksArray[i].clone();
-        t[index++] = new ParallelTask(task);
+        t[index] = new ParallelTask(task);
+        //Setting unique ThreadName with index value which is used in ReuersContentSource.java's getNextDocData()


Can you strengthen the comment to state that we should NOT change this thread name, unless we also fix the String -> int parsing logic in ReutersContentSource?

Actually, could we factor out this string part of the thread name into a static final String constant, e.g. static final String PARALLEL_TASK_THREAD_NAME_PREFIX = "ParallelTaskThread";, and reference that constant from both places?

We incorporated the requested changes by adding the constant to Constants.java and referencing it from both places.
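A minimal sketch of how a shared prefix constant can keep thread naming and index parsing in sync, including the range assertion suggested earlier in the review. The class below is a hypothetical stand-in: the constant mirrors the name proposed in the review, but the helper methods are assumptions and are not taken from the merged code.

```java
import java.util.Arrays;

class ThreadNameIndex {
  // Stand-in for the constant the reply says was added to Constants.java.
  public static final String PARALLEL_TASK_THREAD_NAME_PREFIX = "ParallelTaskThread";

  /** Builds the name used when a thread is created, e.g. "ParallelTaskThread-2". */
  public static String threadName(int index) {
    return PARALLEL_TASK_THREAD_NAME_PREFIX + "-" + index;
  }

  /** Recovers the index from a thread name built by threadName(). */
  public static int parseIndex(String name, int numThreads) {
    String expected = PARALLEL_TASK_THREAD_NAME_PREFIX + "-";
    if (!name.startsWith(expected)) {
      throw new IllegalStateException("Unexpected thread name: " + name);
    }
    int index = Integer.parseInt(name.substring(expected.length()));
    // With assertions enabled, a naming bug fails here instead of as an array index error later.
    assert index >= 0 && index < numThreads : "thread index out of range: " + index;
    return index;
  }

  public static void main(String[] args) throws InterruptedException {
    int numThreads = 3;
    int[] seen = new int[numThreads];
    Thread[] threads = new Thread[numThreads];
    for (int i = 0; i < numThreads; i++) {
      // Each thread touches only its own slot, so no synchronization is needed.
      threads[i] = new Thread(() -> seen[parseIndex(Thread.currentThread().getName(), numThreads)]++);
      threads[i].setName(threadName(i));
    }
    for (Thread t : threads) t.start();
    for (Thread t : threads) t.join();
    System.out.println(Arrays.toString(seen));
  }
}
```

Because both the naming and the parsing go through the same constant, renaming the prefix in one place cannot silently break the other.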

mikemccand · 2021-06-04T14:06:16Z

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java

+
+    //Modulo Operator covers all three possible senarios i.e. 1. If inputFiles.size() < Num Of Threads 2.inputFiles.size() == Num Of Threads 3.inputFiles.size() > Num Of Threads
+    int fileIndex = stride % inFileSize;
+    int iteration = stride / inFileSize;


Thank you for improving this logic -- much easier to understand now!
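The following sketch shows why the modulo form handles all three scenarios named in the code comment. The two lines computing fileIndex and iteration come from the quoted diff; how stride itself is computed is not shown in this thread, so the stride() helper below is an assumption that reuses the shape of the earlier formula (thread index plus documents already taken times the number of threads).

```java
class StrideModulo {
  /** Assumed definition of stride; the thread does not show the real one. */
  public static int stride(int threadIndex, int docsTaken, int numThreads) {
    return threadIndex + docsTaken * numThreads;
  }

  /** Returns {fileIndex, iteration}, computed as in the quoted diff. */
  public static int[] locate(int stride, int inFileSize) {
    if (inFileSize == 0) {
      throw new IllegalArgumentException("no input files");
    }
    int fileIndex = stride % inFileSize; // always a valid index into the file list
    int iteration = stride / inFileSize; // number of completed passes over the file list
    return new int[] {fileIndex, iteration};
  }

  public static void main(String[] args) {
    // More threads than files: thread 3 of 4, with only 2 files, wraps around to file 1.
    int[] a = locate(stride(3, 0, 4), 2);
    System.out.println("fileIndex=" + a[0] + " iteration=" + a[1]);

    // Fewer threads than files: thread 1 of 2, third document, 5 files.
    int[] b = locate(stride(1, 2, 2), 5);
    System.out.println("fileIndex=" + b[0] + " iteration=" + b[1]);
  }
}
```

Since the remainder is always smaller than inFileSize, a thread whose index exceeds the number of files still lands on a real file, which addresses the corner case raised in the first review round.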

dsmiley · 2021-06-06T03:22:58Z

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/FlushIndexTask.java

+  public int doLogic() throws Exception {
+    IndexWriter iw = getRunData().getIndexWriter();
+    if (iw != null) {
+      iw.flushNextBuffer();


This flushes one thread; not all. I'm honestly not sure what the use-case is of that method. Did you mean to call iw.flush()?

Sorry for the delayed response. We observed that post-processing was taking longer because it was single-threaded and depended on the per-thread indexed data. Hence, we explicitly flush each thread's data as soon as that thread finishes indexing.

Also, this task is optional and can be used purely as needed.

balmukundblr · 2021-06-11T06:35:39Z

@mikemccand
Mike, we committed the code 4 days ago and it is still in the checking phase. Ideally 'musedev' should not take this long; it looks like there is some problem in the checking process. It would be really helpful if you could advise us on the next steps.

balmukundblr · 2021-06-24T09:43:30Z

@mikemccand
Mike, are there any further suggestions to incorporate in this PR? We also need your help resolving the pending checks issue: it has been more than 15 days and "musedev" is still showing in pending state.

mikemccand · 2021-06-24T12:54:56Z

it has been more than 15 days and "musedev" is still showing in pending state.

Egads! I don't know why it's stuck in Pending. We can skip it -- I'll confirm gradlew check is happy.

mikemccand

Thanks @balmukundblr this is a nice improvement in concurrency. I'll try to push soon!

mikemccand · 2021-06-24T14:13:42Z

Thanks @balmukundblr -- I merged this with Lucene's main branch and also backported to 8.x (for eventual future 8.10.0 release).

balmukundblr · 2021-06-24T14:53:28Z

@mikemccand
Thank you very much for your great suggestions, which really helped us improve the code.

balmukundblr added 7 commits April 29, 2021 15:42

Added a explicit Flush Task to flush data at Thread level once it completes the processing

c52509b

Included explicit flush per Thread level

d9d95a9

Done changes for parallel processing

0122966

Removed extra brace

760604d

Removed unused variable

f660c9f

Removed unused variable initialization

2a871d5

Did the required formating

04e7d6d

mikemccand reviewed May 11, 2021

balmukundblr added 4 commits June 4, 2021 12:38

Refactored the code and added required comments & checks

8318a16

Merge branch 'parallel_processing' of https://github.com/balmukundblr/lucene into parallel_processing

31be935

Corrected the Variable name

473e7a0

Merge branch 'parallel_processing' of https://github.com/balmukundblr/lucene into parallel_processing

35080e6

mikemccand reviewed Jun 4, 2021

dsmiley reviewed Jun 6, 2021

balmukundblr added 2 commits June 7, 2021 13:22

Refactored the code and added required comments

11c80de

Merge branch 'parallel_processing' of https://github.com/balmukundblr/lucene into parallel_processing

a033a4b

mikemccand approved these changes Jun 24, 2021

mikemccand merged commit f1d54f7 into apache:main Jun 24, 2021
