New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CASSANDRA-19042: Repair fuzz tests fail with paxos_variant: v2 #3100
Conversation
@@ -385,13 +386,18 @@ public EndpointsForToken readCandidates() | |||
} | |||
|
|||
static Participants get(TableMetadata table, Token token, ConsistencyLevel consistencyForConsensus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left this here as Paxos
uses it... I updated the stuff that mattered for the existing tests to use the new API.
I did see PaxosRepair
touches this as well, but I assume it didn't matter as we don't have data in paxos so that logic prob didn't get touched?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird, I don't know why my IDE Intellij Idea emits a warning that the Config class is deprecated. Not after your patch, though; it's just observation.
{ | ||
EndpointState epState = Gossiper.instance.getEndpointStateForEndpoint(peer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is duplicate logic from Gossiper
, the only difference was
try
{
return value.value;
}
catch (IllegalArgumentException e)
{
return null;
}
Not sure how a field access leads to IllegalArgumentException
, so removed in favor of the gossip function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know about the IllegalArgumentException, probably some leftover
|
||
public class PaxosCleanupSession extends AsyncFuture<Void> implements Runnable, | ||
IEndpointStateChangeSubscriber, | ||
IFailureDetectionEventListener, | ||
RequestCallbackWithFailure<Void> | ||
{ | ||
private static final Map<UUID, PaxosCleanupSession> sessions = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to org.apache.cassandra.service.paxos.cleanup.PaxosRepairState
@@ -95,49 +93,40 @@ private static ScheduledFuture<?> schedule(PaxosCleanupSession session) | |||
} | |||
} | |||
|
|||
private final UUID session = UUID.randomUUID(); | |||
private final SharedContext ctx; | |||
public final UUID session = UUID.randomUUID(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is non-deterministic, but this problem exists in repair as well... Repair uses random/time uuid outside of the random source/clock from SharedContext
... since these are keys for a map, I have mostly looked the other way... it makes debugging annoying (as you can not block on the repair id) but its not been an issue so just ignore
@@ -77,93 +68,10 @@ public synchronized void onResponse(Message<Void> msg) | |||
trySuccess(null); | |||
} | |||
|
|||
static class PendingCleanup extends IntrusiveStack<PendingCleanup> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to org.apache.cassandra.service.paxos.cleanup.PaxosRepairState
|
||
public void setSession(PaxosCleanupSession session) | ||
{ | ||
//TODO (review): I copy/pasted from PaxosCleanupSession... this is a check-then-act pattern which is not thread safe... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can fix if desired, I left alone so I could lower the risk in the refactor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incremental changes are ok. I guess it is also in 4.1 so not a regression? Let's open a ticket for the community so someone can pick it up
src/java/org/apache/cassandra/service/paxos/cleanup/PaxosRepairState.java
Show resolved
Hide resolved
@@ -211,25 +208,4 @@ AbstractPaxosRepair createRepair(DecoratedKey key, Ballot incompleteBallot, Cons | |||
{ | |||
return PaxosRepair.create(consistency, key, incompleteBallot, table); | |||
} | |||
|
|||
private static final ConcurrentMap<TableId, PaxosTableRepairs> tableRepairsMap = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to org.apache.cassandra.service.paxos.cleanup.PaxosRepairState
@@ -339,6 +346,12 @@ public Set<Faults> apply(Cluster.Node node, Message<?> message) | |||
// these messages are not resilent to ephemeral issues | |||
case STATUS_REQ: | |||
case STATUS_RSP: | |||
// paxos repair does not support faults and will cause a TIMEOUT error, failing the repair |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested each verb and we lead to timeout
in different code paths... some times the user is told its paxos related, some times not... logs are always clear though at least...
This patch does not try to make these messages resilient to dropped messages or concurrent retries, so opts-out so these messages will be happy path only.
@@ -654,6 +667,7 @@ static class Cluster | |||
|
|||
// We run tests in an isolated JVM per class, so not cleaing up is safe... but if that assumption ever changes, will need to cleanup | |||
Stage.ANTI_ENTROPY.unsafeSetExecutor(orderedExecutor); | |||
Stage.MISC.unsafeSetExecutor(orderedExecutor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one line of code did this rather than Stage.ANTI_ENTROPY
... not sure why we do MISC rather than ANTI_ENTROPY given this is repair... but figured best not to change it... For this test they are the same executor so doesn't matter.
@@ -776,10 +790,46 @@ public void onFailure(InetAddressAndPort from, RequestFailureReason failureReaso | |||
} | |||
} | |||
|
|||
private static class CallbackKey |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repair doesn't reuse messages/callbacks so didn't see that this logic was slightly different than MessagingService
... so need to have both (message_id, peer)
as the key and not just message_id
, else the paxos messages clear the callback and hang forever (since it doesn't enqueue work the test actually fails right away).
8fa16d5
to
1d3abd9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
1d3abd9
to
fe9b7e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM (though limited knowledge in that part of the codebase, I admit). Some nits left which can be addressed on commit
void notifyFailureDetector(Map<InetAddressAndPort, EndpointState> remoteEpStateMap); | ||
void applyStateLocally(Map<InetAddressAndPort, EndpointState> epStateMap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two can benefit from some javadoc
@@ -28,4 +29,8 @@ public interface MessageDelivery | |||
public <REQ, RSP> void sendWithCallback(Message<REQ> message, InetAddressAndPort to, RequestCallback<RSP> cb, ConnectionType specifyConnection); | |||
public <REQ, RSP> Future<Message<RSP>> sendWithResult(Message<REQ> message, InetAddressAndPort to); | |||
public <V> void respond(V response, Message<?> message); | |||
public default void respondWithFailure(RequestFailureReason reason, Message<?> message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove the public modifiers from this interface, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be consistent with the rest of the file, I left public
there...
@@ -385,13 +386,18 @@ public EndpointsForToken readCandidates() | |||
} | |||
|
|||
static Participants get(TableMetadata table, Token token, ConsistencyLevel consistencyForConsensus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird, I don't know why my IDE Intellij Idea emits a warning that the Config class is deprecated. Not after your patch, though; it's just observation.
|
||
public void setSession(PaxosCleanupSession session) | ||
{ | ||
//TODO (review): I copy/pasted from PaxosCleanupSession... this is a check-then-act pattern which is not thread safe... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incremental changes are ok. I guess it is also in 4.1 so not a regression? Let's open a ticket for the community so someone can pick it up
import static org.apache.cassandra.exceptions.RequestFailureReason.UNKNOWN; | ||
import static org.apache.cassandra.net.NoPayload.noPayload; | ||
|
||
public class PaxosRepairState |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add Javadoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracks the state of paxos repair cleanup work
{ | ||
EndpointState epState = Gossiper.instance.getEndpointStateForEndpoint(peer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know about the IllegalArgumentException, probably some leftover
src/java/org/apache/cassandra/service/paxos/cleanup/PaxosRepairState.java
Show resolved
Hide resolved
src/java/org/apache/cassandra/service/paxos/cleanup/PaxosCleanupSession.java
Outdated
Show resolved
Hide resolved
8cc5229
to
8c906d9
Compare
No description provided.