
OAK-10778 - Support downloading from Mongo in parallel. #1435

Merged
merged 24 commits into apache:trunk on May 13, 2024

Conversation

nfsantos
Contributor

@nfsantos nfsantos commented Apr 25, 2024

A Mongo cluster (REPLICA_SET) usually consists of one primary and two secondaries. This PR adds support for downloading in parallel from the two secondaries.

Adds a new boolean system property, oak.indexer.pipelined.mongoParallelDump, which defaults to false.

Parallel download can be enabled only when oak.indexer.pipelined.retryOnConnectionErrors is true. (Parallel download requires ordered traversals on Mongo which are enabled by retry on connection errors.)
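
For example, parallel download could be enabled with JVM system properties like these (a sketch; the exact startup command depends on how the indexing job is launched):

    -Doak.indexer.pipelined.retryOnConnectionErrors=true \
    -Doak.indexer.pipelined.mongoParallelDump=true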

Goals of this PR

  • Do not download from the primary, because this is more likely to slow down the Mongo cluster and affect other workloads (reads/writes by Oak).
  • Use one and only one connection at a time to each secondary, to avoid overloading any given replica. More on this in the notes below.
  • Gracefully handle scale-up/down operations and failures in general, pausing the second download when there is only one secondary available and reconnecting when all secondaries are up.

Implementation

One of the difficulties of parallel downloading is partitioning the range of documents among the download threads. We do not know in advance how the documents are distributed over the range of keys (_modified, _id) used for the download, so it is challenging to split them evenly among the download threads. This PR sidesteps the problem by having one thread download in ascending order and the other in descending order. This limits us to 2 parallel threads, but that fits nicely with the typical configuration of a Mongo cluster with two secondaries. Adding more parallel downloads would provide a smaller increase in overall download speed and would risk overloading the replicas.

The two download threads coordinate to detect when the ranges they have downloaded cross, and stop the download at that point.
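
A minimal sketch of this coordination, using the DownloadPosition type that appears later in this PR (variable names are illustrative, not the PR's exact code):

    import java.util.concurrent.atomic.AtomicReference;

    // Positions are compared in traversal order: (_modified, _id).
    record DownloadPosition(long modified, String id) implements Comparable<DownloadPosition> {
        @Override
        public int compareTo(DownloadPosition other) {
            int c = Long.compare(this.modified, other.modified);
            return c != 0 ? c : this.id.compareTo(other.id);
        }
    }

    // Shared between the two threads, updated after every downloaded batch.
    AtomicReference<DownloadPosition> highestAscending = new AtomicReference<>();
    AtomicReference<DownloadPosition> lowestDescending = new AtomicReference<>();

    // In the ascending thread, after downloading a batch ending at 'pos':
    highestAscending.set(pos);
    DownloadPosition descending = lowestDescending.get();
    if (descending != null && pos.compareTo(descending) >= 0) {
        // The ranges have crossed: together the two threads have seen every
        // document at least once, so both can stop downloading.
    }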

To distribute the threads among the replicas, this PR provides a custom implementation of ServerSelector. The Mongo Java driver calls the registered ServerSelectors to obtain a list of eligible servers whenever it needs to open a connection to the cluster. The default implementations are derived from the readPreference settings, but it is possible to create and register a custom implementation. This PR uses an implementation that allows connections only to secondaries and keeps track of which thread last received a given secondary, so that this secondary is not given to any other thread.
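
A minimal sketch of such a selector (names are illustrative, and the bookkeeping is simplified relative to the PR):

    import com.mongodb.ServerAddress;
    import com.mongodb.connection.ClusterDescription;
    import com.mongodb.connection.ServerDescription;
    import com.mongodb.selector.ServerSelector;

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Collectors;

    // Offers only secondaries to the driver, and never a secondary that is
    // currently assigned to a different download thread.
    class SecondariesOnlyServerSelector implements ServerSelector {

        private final Map<Long, ServerAddress> secondaryByThread = new ConcurrentHashMap<>();

        @Override
        public List<ServerDescription> select(ClusterDescription clusterDescription) {
            long threadId = Thread.currentThread().getId();
            var takenByOthers = secondaryByThread.entrySet().stream()
                    .filter(e -> e.getKey() != threadId)
                    .map(Map.Entry::getValue)
                    .collect(Collectors.toSet());
            List<ServerDescription> eligible = clusterDescription.getServerDescriptions().stream()
                    .filter(ServerDescription::isSecondary)
                    .filter(sd -> !takenByOthers.contains(sd.getAddress()))
                    .collect(Collectors.toList());
            if (!eligible.isEmpty()) {
                // Remember which secondary this thread received, so that it
                // is not offered to any other thread.
                secondaryByThread.put(threadId, eligible.get(0).getAddress());
            }
            return eligible; // empty => the driver waits and retries selection
        }
    }

It could be registered with MongoClientSettings.builder().applyToClusterSettings(b -> b.serverSelector(new SecondariesOnlyServerSelector())). Returning an empty list makes the driver wait (up to the server selection timeout) and retry, which is one way the second download pauses while only one secondary is available.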

There is also a mechanism to disconnect from a replica that was promoted from secondary to primary. This happens during scale up/down: Mongo first takes one secondary down and replaces it with a new one, then does the same with the other secondary, and before taking down the primary it promotes one of the new secondaries to primary. We may therefore be connected to a secondary that has in the meantime been promoted to primary. Since this is very likely to happen during a scale up/down, it is important to detect promotions of secondaries to primary and disconnect. This is done by listening to ClusterListener events and having each download thread periodically check whether the replica it is using is now the primary. If it is, the thread disconnects and tries to establish a new connection, which will be redirected to a secondary.
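
A minimal sketch of the promotion detection (illustrative names; the PR combines this with per-thread periodic checks):

    import com.mongodb.ServerAddress;
    import com.mongodb.connection.ServerDescription;
    import com.mongodb.event.ClusterDescriptionChangedEvent;
    import com.mongodb.event.ClusterListener;

    import java.util.Set;
    import java.util.concurrent.atomic.AtomicReference;
    import java.util.stream.Collectors;

    class PrimaryTracker implements ClusterListener {

        private final AtomicReference<Set<ServerAddress>> primaries = new AtomicReference<>(Set.of());

        @Override
        public void clusterDescriptionChanged(ClusterDescriptionChangedEvent event) {
            // Keep an up-to-date view of which servers are currently primary.
            primaries.set(event.getNewDescription().getServerDescriptions().stream()
                    .filter(ServerDescription::isPrimary)
                    .map(ServerDescription::getAddress)
                    .collect(Collectors.toSet()));
        }

        // Called periodically by each download thread with the address it is
        // connected to; a true result means: disconnect and reconnect.
        boolean wasPromotedToPrimary(ServerAddress inUse) {
            return primaries.get().contains(inUse);
        }
    }

The listener can be registered alongside the server selector, via applyToClusterSettings(b -> b.addClusterListener(new PrimaryTracker())).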

Performance results

System 1

Sequential download:

Timings:
  Mongo dump: 12:09:11
  Merge sort: 00:05:28
  Build FFS (Dump+Merge): 12:14:40

Parallel download:

Timings:
  Mongo dump: 06:14:16
  Merge sort: 00:05:26
  Build FFS (Dump+Merge): 06:19:46

System 2

Sequential download:

Timings:
  Mongo dump: 00:14:10
  Merge sort: 00:02:19
  Build FFS (Dump+Merge): 00:16:29

Parallel download:

Timings:
  Mongo dump: 00:06:41
  Merge sort: 00:02:13
  Build FFS (Dump+Merge): 00:08:54
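
In both systems, parallel download roughly halves the Mongo dump time (12:09:11 → 06:14:16 and 00:14:10 → 00:06:41), while the merge sort phase is unchanged.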

Note on risk of slowing down the Mongo cluster

When parallel download is enabled, both secondaries will be under load. One concern is whether this will slow down writes, which need an ack from the primary and from a secondary. I do not think this would be a problem in common cases. A single download connection should not be enough to saturate a node: Mongo seems to allocate only one thread to handle a connection, so as long as the Mongo node has at least 2 cores, the download query will not take up all the CPU. The pressure on IOPS, disk throughput and network bandwidth created by the downloader may in some situations get close to the limit, but even then the pressure is not sustained continuously. The protocol used by the Mongo Java driver is synchronous request/response, and the client only sends the request for the next batch of results after the current one is parsed and iterated over, so during this time the Mongo server is idle. I have observed that IOPS and disk throughput usually stay below 80% on Mongo, which leaves some headroom to process other queries.

Additional changes

  • Improve the test of recovery from Mongo disconnections. The previous test relied on Mockito to simulate a connection failure, which was very complex and tedious to write. This PR replaces it with a test that uses a real Mongo server inside a Docker container and Toxiproxy to simulate a connection failure (see the sketch after this list).
  • Create a parameterized version of the PipelinedIT tests which tests all combinations of: regex path filtering, parallel download, and retry on connection errors.
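
A minimal sketch of such a test setup, assuming Testcontainers with its ToxiproxyContainer module and the toxiproxy-java client (image names, versions, and wiring here are illustrative; the PR's actual test may differ):

    import eu.rekawek.toxiproxy.Proxy;
    import eu.rekawek.toxiproxy.ToxiproxyClient;
    import org.testcontainers.containers.MongoDBContainer;
    import org.testcontainers.containers.Network;
    import org.testcontainers.containers.ToxiproxyContainer;

    Network network = Network.newNetwork();
    MongoDBContainer mongo = new MongoDBContainer("mongo:6.0")
            .withNetwork(network)
            .withNetworkAliases("mongo");
    ToxiproxyContainer toxiproxy = new ToxiproxyContainer("ghcr.io/shopify/toxiproxy:2.5.0")
            .withNetwork(network);
    mongo.start();
    toxiproxy.start();

    // All Mongo traffic from the test goes through the proxy.
    ToxiproxyClient client = new ToxiproxyClient(toxiproxy.getHost(), toxiproxy.getControlPort());
    Proxy proxy = client.createProxy("mongo", "0.0.0.0:8666", "mongo:27017");
    String mongoUri = "mongodb://" + toxiproxy.getHost() + ":" + toxiproxy.getMappedPort(8666);

    // Simulate a connection failure mid-download, then restore connectivity.
    proxy.disable();
    // ... assert that the downloader detects the failure and retries ...
    proxy.enable();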

This PR has a large change set in part because it refactors the PipelinedMongoDownloadTask in order to make it more concise and cohesive:

  • It moves the logic to handle regex filtering out to a separate class.
  • It creates a new DownloadTask class containing the logic to do the actual download from Mongo, separating it from the logic to set up and launch the download threads.

@nfsantos nfsantos changed the title OAK-10778 - Support downloading from Mongo in parallel. Adds new boolean system property: oak.indexer.pipelined.mongoParallelDump. OAK-10778 - Support downloading from Mongo in parallel. Apr 25, 2024
…nd $lte(Long.MAX_VALUE) instead of $exists(_modified). $exists also checks for the property being equal to null, which cannot be verified just by looking at an index, because indexes in MongoDB do not contain null values. Using $exists requires retrieving the full document from the column store, which dramatically slows down the traversal.
…on. Ensures that both download threads are shut down gracefully.

Small refactoring.
Comment on lines +141 to +151
connectedToPrimaryThreads.clear();
lastSeenClusterDescription.getServerDescriptions().stream()
        .filter(ServerDescription::isPrimary)
        .map(ServerDescription::getAddress)
        .forEach(primaryAddress -> {
            for (var entry : serverAddressHashMap.entrySet()) {
                if (entry.getValue().equals(primaryAddress)) {
                    connectedToPrimaryThreads.add(entry.getKey());
                }
            }
        });
Contributor

To be consistent in terms of style, you could write this logic in a fully functional way. This would require connectedToPrimaryThreads to be non-final, though. It's a matter of taste.

private Set<Long> connectedToPrimaryThreads = new HashSet<>();
Suggested change
connectedToPrimaryThreads.clear();
lastSeenClusterDescription.getServerDescriptions().stream()
        .filter(ServerDescription::isPrimary)
        .map(ServerDescription::getAddress)
        .forEach(primaryAddress -> {
            for (var entry : serverAddressHashMap.entrySet()) {
                if (entry.getValue().equals(primaryAddress)) {
                    connectedToPrimaryThreads.add(entry.getKey());
                }
            }
        });
connectedToPrimaryThreads = lastSeenClusterDescription.getServerDescriptions().stream()
        .filter(ServerDescription::isPrimary)
        .map(ServerDescription::getAddress)
        .flatMap(primaryAddress -> serverAddressHashMap.entrySet().stream()
                .filter(entry -> primaryAddress.equals(entry.getValue()))
                .map(Map.Entry::getKey))
        .collect(Collectors.toSet());

Contributor Author

I initially wrote the code entirely with streams, but I did not like the stream-within-a-stream nesting and the use of flatMap. I feel that the forEach loop is a bit clearer.

@@ -170,9 +170,9 @@ static List<String> mergeIndexAndCustomExcludePaths(List<String> indexExcludedPa
return indexExcludedPaths;
}

var excludedUnion = new HashSet<>(indexExcludedPaths);
Contributor

Out of curiosity: I have seen that you replaced var in multiple places. Any specific reason?

Contributor Author

@nfsantos nfsantos May 8, 2024


No specific reason, I'm still unsure about when to use var vs. an explicit type. In many cases, keeping the type annotation makes the code clearer and does not significantly increase verbosity, for instance:

var i = 1;

int i = 1;

The second option makes it clear that it is an int, while the first is ambiguous.

In other cases, even if the type annotation is longer, I still somewhat prefer that the type appears at least once. For instance, here I feel it's better to use var:

var bi = new DownloadPosition(batch[i].getModified(), batch[i].getId());

Because the type is explicit on the right-hand side.

But here:

            FindIterable<NodeDocument> mongoIterable = dbCollection
                    .find(findQuery)
                    .sort(sortOrder);

I think it's nicer to have the type annotation, because the right-hand side does not contain the type.

But it's just my gut feeling of what seems clearer and easier to read; I'm not following any set of best practices. It would be interesting to have a discussion about the use of var.

while (cursor.hasNext()) {
    NodeDocument next = cursor.next();
    String id = next.getId();
    this.nextLastModified = next.getModified();
Contributor

My editor warns me that this might produce a NullPointerException. And we use this in a few other places.

Contributor Author

Good catch. This would indeed be a problem when doing a column traversal, because the query used for that was downloading all documents, even those without the _modified field. We do not need these documents to build the FFS, so it is safe to filter them out in the Mongo query. That way, there is no need for a null check here. This code is on the critical path, so it should be kept as lean as possible.
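
A minimal sketch of the query-side change (illustrative, following the commit message below rather than the PR's exact code): replacing $exists with a $gte/$lte range on _modified, which the index can answer without fetching full documents:

    import static com.mongodb.client.model.Filters.and;
    import static com.mongodb.client.model.Filters.gte;
    import static com.mongodb.client.model.Filters.lte;

    import org.bson.conversions.Bson;

    // Matches exactly the documents with a numeric _modified value; documents
    // where the field is missing or null do not match a numeric range, so no
    // null check is needed downstream.
    Bson findQuery = and(
            gte("_modified", 0L),
            lte("_modified", Long.MAX_VALUE));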

…traversal (retry on connection errors false). Documents without the _modified field are not needed to build the FFS, and this way there is no need for a null check when calling getModified on the documents.
@nfsantos nfsantos merged commit 98206cb into apache:trunk May 13, 2024
1 of 2 checks passed
@nfsantos nfsantos deleted the OAK-10778 branch May 13, 2024 07:11