IGNITE-21805 Refactor TableManager and move all RAFT related pieces to Replica #3633

Open
wants to merge 68 commits into
base: main

JAkutenshi
Contributor

@JAkutenshi JAkutenshi commented Apr 18, 2024

Apache JIRA ticket's link

The goal

The goal of this PR is to remove RaftManager from TableManager and move it, together with its calls, into ReplicaManager.

The current issues

The main issues are currently related to the TableManager code in lines 967-993:

  1. The ordering of the internal table update and the replica creation and start is important.
  2. The internal table update should proceed in any case, while the replica should be started only if the commented-out condition on lines 971-973 isn't true.
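
To make the two points concrete, here is a minimal sketch of the intended control flow. It is not the TableManager code: the class, the method, the `skipReplicaStart` flag (a stand-in for the commented-out condition) and the recorded step names are all hypothetical, and the order shown (table update first) is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; none of these names are taken from the Ignite code base. */
final class PartitionStartSketch {
    /** Records executed steps so that their order can be inspected. */
    final List<String> steps = new ArrayList<>();

    /**
     * Always performs the internal table update, then starts the replica
     * unless {@code skipReplicaStart} (a stand-in for the real condition) is true.
     */
    CompletableFuture<Void> startPartition(int partId, boolean skipReplicaStart) {
        return CompletableFuture.completedFuture(partId)
                .thenAccept(id -> {
                    // Unconditional step: assumed here to come first.
                    steps.add("updateInternalTable:" + id);

                    if (skipReplicaStart) {
                        return; // The replica must not be started for this partition.
                    }

                    // Conditional step: runs only after the table update.
                    steps.add("startReplica:" + id);
                });
    }
}
```

Calling `startPartition(0, false)` records both steps in order, while `startPartition(1, true)` records only the table update.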

Related tests failures:

org.apache.ignite.internal.table.distributed.TableManagerRecoveryTest

The two following tests probably fail because of a null somewhere around ReplicaManager:L679.

testTableIgnoredOnRecovery

Caused by: java.lang.NullPointerException
  at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:992) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
  at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:868) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
  at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$33(TableManager.java:967) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
  at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?]
  ... 4 more

testTableStartedOnRecovery

Caused by: java.lang.NullPointerException
  at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:992) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
  at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:868) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
  at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$33(TableManager.java:967) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
  at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?]
  ... 4 more

org.apache.ignite.internal.rebalance.ItRebalanceDistributedTest

testRebalanceWithTheSameNodes

The reason for the failure is point 2 of the main issues: we should start a replica only once per node.

org.mockito.exceptions.verification.TooManyActualInvocations: 
replicaManager.startReplica(
    <any>,
    <any>,
    <any java.util.function.Function>,
    <any>
);
Wanted 1 time:
-> at org.apache.ignite.internal.replicator.ReplicaManager.startReplica(ReplicaManager.java:583)
But was 3 times:
-> at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:976)
-> at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:976)
-> at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:976)
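
For illustration only, one generic way to guarantee a single start per key is to guard the start with `ConcurrentHashMap.computeIfAbsent`. This is a sketch under assumptions, not what this PR does: the class, the `startReplicaOnce` method, the string group id and the counter are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; not the ReplicaManager API. */
final class StartOnceSketch {
    /** Futures of replicas already started on this node, keyed by a hypothetical group id. */
    private final ConcurrentHashMap<String, CompletableFuture<Void>> started = new ConcurrentHashMap<>();

    /** Counts how many times the actual start logic ran. */
    final AtomicInteger actualStarts = new AtomicInteger();

    /** Starts the replica on the first call for a group id; later calls reuse the same future. */
    CompletableFuture<Void> startReplicaOnce(String groupId) {
        // computeIfAbsent applies the mapping function at most once per key.
        return started.computeIfAbsent(groupId, id -> {
            actualStarts.incrementAndGet();
            // Placeholder for the real (asynchronous) start.
            return CompletableFuture.completedFuture(null);
        });
    }
}
```

With such a guard, three calls for the same group id would trigger a single actual start, which is what the `Wanted 1 time` verification expects.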

org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest

Both failed tests, testManualRebalanceIfPartitionIsLost and testManualRebalanceIfMajorityIsLost, are unfamiliar and unclear to me for now. The common failure looks like this:

java.lang.AssertionError: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
  at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78)
  at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35)
  at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
  at org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.testManualRebalanceIfPartitionIsLost(ItDisasterRecoveryReconfigurationTest.java:229)
  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
  at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
  at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
  at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
  at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
  at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:74)
  ... 8 more
Caused by: java.util.concurrent.TimeoutException
  at java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:834)

But I'm sure that isn't the root cause.


Thank you for submitting the pull request.

To streamline the review process of the patch and ensure better code quality
we ask both an author and a reviewer to verify the following:

The Review Checklist

  • Formal criteria: TC status, codestyle, mandatory documentation. Also make sure to complete the following:
    - There is a single JIRA ticket related to the pull request.
    - The web-link to the pull request is attached to the JIRA ticket.
    - The JIRA ticket has the Patch Available state.
    - The description of the JIRA ticket explains WHAT was made, WHY and HOW.
    - The pull request title is treated as the final commit message. The following pattern must be used: IGNITE-XXXX Change summary where XXXX - number of JIRA issue.
  • Design: new code conforms with the design principles of the components it is added to.
  • Patch quality: patch cannot be split into smaller pieces, its size must be reasonable.
  • Code quality: code is clean and readable, necessary developer documentation is added if needed.
  • Tests code quality: test set covers positive/negative scenarios, happy/edge cases. Tests are effective in terms of execution time and resources.

Notes

@JAkutenshi JAkutenshi marked this pull request as ready for review April 18, 2024 21:20
@JAkutenshi JAkutenshi changed the title IGNITE-21805 WIP Refactor TableManager and move all RAFT related pieces to Replica IGNITE-21805 Refactor TableManager and move all RAFT related pieces to Replica Apr 23, 2024
@JAkutenshi
Contributor Author

A comment about the test fix: before this ticket there was no .join() in TableManager, but now there is, and if startReplica() returns null, it fails with an NPE. In the context of the test, ReplicaManager is mocked, so e.g. busyLock is null and so on. Without stubbing the method, startReplica() returns null, and .join() then hits an NPE, which leads to the TimeoutException at the top of the stack trace. As a solution, I just mock startReplica() to return a future completed with a null value instead of plain null.
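
The difference between the two return values can be reproduced without Mockito. The sketch below uses a hand-written stub; the `ReplicaStarter` interface and the no-argument `startReplica()` are hypothetical simplifications, not the real ReplicaManager signature.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of "null future" versus "future completed with null". */
final class NullFutureSketch {
    /** Simplified stand-in for the mocked dependency. */
    interface ReplicaStarter {
        CompletableFuture<Object> startReplica();
    }

    /** Mimics a caller that joins on the returned future. */
    static Object startAndJoin(ReplicaStarter starter) {
        return starter.startReplica().join();
    }

    public static void main(String[] args) {
        // Behaves like the unstubbed method described above: the future itself is null.
        ReplicaStarter unstubbed = () -> null;
        // Behaves like the fixed stub: a completed future whose value is null.
        ReplicaStarter stubbed = () -> CompletableFuture.completedFuture(null);

        try {
            startAndJoin(unstubbed);
        } catch (NullPointerException e) {
            System.out.println("join() on a null future throws NPE");
        }

        System.out.println("join() on a completed future returns: " + startAndJoin(stubbed));
    }
}
```

In a Mockito-based test the second variant would typically be expressed by stubbing the method with `thenReturn(CompletableFuture.completedFuture(null))`.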

@JAkutenshi JAkutenshi requested a review from kgusakov May 14, 2024 17:51
…antiation into ReplicaManager; Rebalance scheduled executor and VolatileLogStorageFactoryCreator are moved too from TableManager
…changing replicas' starting condition on substraction between pending and stable assignments
…RaftClient in TableManagerRecoveryTest#startComponents and TableManagerTest#before
…d moving back reset peers from startReplica to TableManager
3 participants