Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AssertionError in AppendEntriesRequestProcessor #1091

Open
Cyrill opened this issue Mar 14, 2024 · 1 comment
Open

AssertionError in AppendEntriesRequestProcessor #1091

Cyrill opened this issue Mar 14, 2024 · 1 comment

Comments

@Cyrill
Copy link

Cyrill commented Mar 14, 2024

I managed to hit an AssertionError in AppendEntriesRequestProcessor. Apparently, there is a race.
The crash was observed on a custom branch, though the code in master is the same.

First, the code:

AppendEntriesRequestProcessor.PeerExecutorSelector has the following code (Intentionally removed unrelated lines):

public Executor select(final String reqClass, final Object reqHeader) {
            // ...

            final Node node = NodeManager.getInstance().get(groupId, peer);

            if (node == null || !node.getRaftOptions().isReplicatorPipeline()) {
                return executor();
            }

            // The node enable pipeline, we should ensure bolt support it.
            RpcFactoryHelper.rpcFactory().ensurePipeline();

            final PeerRequestContext ctx = getOrCreatePeerRequestContext(groupId, pairOf(peerId, serverId), null);

            return ctx.executor;
        }

getOrCreatePeerRequestContext looks as follows:

PeerRequestContext getOrCreatePeerRequestContext(final String groupId, final PeerPair pair, final Connection conn) {
        ConcurrentMap<PeerPair, PeerRequestContext> groupContexts = this.peerRequestContexts.get(groupId);
        // ....

        PeerRequestContext peerCtx = groupContexts.get(pair);
        if (peerCtx == null) {
            synchronized (Utils.withLockObject(groupContexts)) {
                peerCtx = groupContexts.get(pair);
                // double check in lock
                if (peerCtx == null) {
                    // only one thread to process append entries for every jraft node
                    final PeerId peer = new PeerId();
                    final boolean parsed = peer.parse(pair.local);
                    assert (parsed);
                    final Node node = NodeManager.getInstance().get(groupId, peer);
                    assert (node != null); // <<<<<<<<<<<<<<AssertionError here!
                    peerCtx = new PeerRequestContext(groupId, pair, node.getRaftOptions()
                        .getMaxReplicatorInflightMsgs());
                    groupContexts.put(pair, peerCtx);
                }
            }
        }
        // ...
 
        return peerCtx;
    }

Execution flow

I don't have a specific code to reproduce this issue, but the flow is simple. I observed a slight delay in messaging/threads which ended up with an error.

My assumptions regarding the execution flow are:

  • select is called. NodeManager.getInstance().get(groupId, peer) returns a non-null result, continue to getOrCreatePeerRequestContext
  • Another thread stops the app,NodeManager.getInstance().remove() is called for this node.
  • Inside getOrCreatePeerRequestContext the result of final Node node = NodeManager.getInstance().get(groupId, peer); is null, since the node has already been removed moments ago.
  • The execution crash on the following line assert (node != null);
@fengjiachun
Copy link
Contributor

I agree with you; that situation could definitely happen. I suspect the problem arises from sharing the RpcServer. The RaftGroupService closes before the RpcServer, causing a delay in processing new requests because the raft service is already closed by then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants
@Cyrill @fengjiachun and others