Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CASSANDRA-19042: Repair fuzz tests fail with paxos_variant: v2 #3100

Closed
wants to merge 1 commit into from

Conversation

dcapwell
Copy link
Contributor

No description provided.

@@ -385,13 +386,18 @@ public EndpointsForToken readCandidates()
}

static Participants get(TableMetadata table, Token token, ConsistencyLevel consistencyForConsensus)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this here as Paxos uses it... I updated the stuff that mattered for the existing tests to use the new API.

I did see PaxosRepair touches this as well, but I assume it didn't matter as we don't have data in paxos so that logic prob didn't get touched?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, I don't know why my IDE Intellij Idea emits a warning that the Config class is deprecated. Not after your patch, though; it's just observation.

{
EndpointState epState = Gossiper.instance.getEndpointStateForEndpoint(peer);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is duplicate logic from Gossiper, the only difference was

try
        {
            return value.value;
        }
        catch (IllegalArgumentException e)
        {
            return null;
        }

Not sure how a field access leads to IllegalArgumentException, so removed in favor of the gossip function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about the IllegalArgumentException, probably some leftover


public class PaxosCleanupSession extends AsyncFuture<Void> implements Runnable,
IEndpointStateChangeSubscriber,
IFailureDetectionEventListener,
RequestCallbackWithFailure<Void>
{
private static final Map<UUID, PaxosCleanupSession> sessions = new ConcurrentHashMap<>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to org.apache.cassandra.service.paxos.cleanup.PaxosRepairState

@@ -95,49 +93,40 @@ private static ScheduledFuture<?> schedule(PaxosCleanupSession session)
}
}

private final UUID session = UUID.randomUUID();
private final SharedContext ctx;
public final UUID session = UUID.randomUUID();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is non-deterministic, but this problem exists in repair as well... Repair uses random/time uuid outside of the random source/clock from SharedContext... since these are keys for a map, I have mostly looked the other way... it makes debugging annoying (as you can not block on the repair id) but its not been an issue so just ignore

@@ -77,93 +68,10 @@ public synchronized void onResponse(Message<Void> msg)
trySuccess(null);
}

static class PendingCleanup extends IntrusiveStack<PendingCleanup>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to org.apache.cassandra.service.paxos.cleanup.PaxosRepairState


public void setSession(PaxosCleanupSession session)
{
//TODO (review): I copy/pasted from PaxosCleanupSession... this is a check-then-act pattern which is not thread safe...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can fix if desired, I left alone so I could lower the risk in the refactor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incremental changes are ok. I guess it is also in 4.1 so not a regression? Let's open a ticket for the community so someone can pick it up

@@ -211,25 +208,4 @@ AbstractPaxosRepair createRepair(DecoratedKey key, Ballot incompleteBallot, Cons
{
return PaxosRepair.create(consistency, key, incompleteBallot, table);
}

private static final ConcurrentMap<TableId, PaxosTableRepairs> tableRepairsMap = new ConcurrentHashMap<>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to org.apache.cassandra.service.paxos.cleanup.PaxosRepairState

@@ -339,6 +346,12 @@ public Set<Faults> apply(Cluster.Node node, Message<?> message)
// these messages are not resilent to ephemeral issues
case STATUS_REQ:
case STATUS_RSP:
// paxos repair does not support faults and will cause a TIMEOUT error, failing the repair
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested each verb and we lead to timeout in different code paths... some times the user is told its paxos related, some times not... logs are always clear though at least...

This patch does not try to make these messages resilient to dropped messages or concurrent retries, so opts-out so these messages will be happy path only.

@@ -654,6 +667,7 @@ static class Cluster

// We run tests in an isolated JVM per class, so not cleaing up is safe... but if that assumption ever changes, will need to cleanup
Stage.ANTI_ENTROPY.unsafeSetExecutor(orderedExecutor);
Stage.MISC.unsafeSetExecutor(orderedExecutor);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one line of code did this rather than Stage.ANTI_ENTROPY... not sure why we do MISC rather than ANTI_ENTROPY given this is repair... but figured best not to change it... For this test they are the same executor so doesn't matter.

@@ -776,10 +790,46 @@ public void onFailure(InetAddressAndPort from, RequestFailureReason failureReaso
}
}

private static class CallbackKey
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repair doesn't reuse messages/callbacks so didn't see that this logic was slightly different than MessagingService... so need to have both (message_id, peer) as the key and not just message_id, else the paxos messages clear the callback and hang forever (since it doesn't enqueue work the test actually fails right away).

Copy link
Member

@bdeggleston bdeggleston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM (though limited knowledge in that part of the codebase, I admit). Some nits left which can be addressed on commit

Comment on lines +34 to +35
void notifyFailureDetector(Map<InetAddressAndPort, EndpointState> remoteEpStateMap);
void applyStateLocally(Map<InetAddressAndPort, EndpointState> epStateMap);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two can benefit from some javadoc

@@ -28,4 +29,8 @@ public interface MessageDelivery
public <REQ, RSP> void sendWithCallback(Message<REQ> message, InetAddressAndPort to, RequestCallback<RSP> cb, ConnectionType specifyConnection);
public <REQ, RSP> Future<Message<RSP>> sendWithResult(Message<REQ> message, InetAddressAndPort to);
public <V> void respond(V response, Message<?> message);
public default void respondWithFailure(RequestFailureReason reason, Message<?> message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the public modifiers from this interface, please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be consistent with the rest of the file, I left public there...

@@ -385,13 +386,18 @@ public EndpointsForToken readCandidates()
}

static Participants get(TableMetadata table, Token token, ConsistencyLevel consistencyForConsensus)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, I don't know why my IDE Intellij Idea emits a warning that the Config class is deprecated. Not after your patch, though; it's just observation.


public void setSession(PaxosCleanupSession session)
{
//TODO (review): I copy/pasted from PaxosCleanupSession... this is a check-then-act pattern which is not thread safe...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incremental changes are ok. I guess it is also in 4.1 so not a regression? Let's open a ticket for the community so someone can pick it up

import static org.apache.cassandra.exceptions.RequestFailureReason.UNKNOWN;
import static org.apache.cassandra.net.NoPayload.noPayload;

public class PaxosRepairState
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Javadoc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracks the state of paxos repair cleanup work

{
EndpointState epState = Gossiper.instance.getEndpointStateForEndpoint(peer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about the IllegalArgumentException, probably some leftover

test/unit/org/apache/cassandra/repair/FuzzTestBase.java Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants