New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IGNITE-22130 Fix retries. #3704
base: main
Are you sure you want to change the base?
Conversation
ascherbakoff
commented
May 6, 2024
•
edited
edited
- Retries both for implicit and explicit transactions are submitted to separate pool with a delay to avoid recursion in embedded mode and give some time to wait for lock release.
- Added retries for reads and data streamer.
- Removed wrapReplicationException method
@@ -210,7 +225,14 @@ private <R> CompletableFuture<R> sendToReplica(String targetNodeConsistentId, Re | |||
return null; | |||
}); | |||
} else { | |||
res.completeExceptionally(errResp.throwable()); | |||
if (retryExecutor != null && matchAny(unwrapCause(errResp.throwable()), ACQUIRE_LOCK_ERR, REPLICA_MISS_ERR)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think that using exception classes is better than error codes, in general.
By the way, should ACQUIRE_LOCK_TIMEOUT_ERR
be taken into account as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think that using exception classes is better than error codes, in general.
I disagree. The whole exception design is based on error codes. Checking error codes is more clean then comparing exception classes.
By the way, should ACQUIRE_LOCK_TIMEOUT_ERR be taken into account as well?
No it should not. This error code and related functionality should be removed, because we got retries from client side. I plan to create a ticket for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. The whole exception design is based on error codes. Checking error codes is more clean then comparing exception classes.
I disagree, using Java classes that represent an exception is a widespread practice. Error codes are a way to provide the user with an additional clue on critical situations, especially in the case of thin clients that are not supported exceptions.
No it should not. This error code and related functionality should be removed, because we got retries from client side. I plan to create a ticket for this.
Ok, I got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree, using Java classes that represent an exception is a widespread practice. Error codes are a way to provide the user with an additional clue on critical situations, especially in the case of thin clients that are not supported exceptions.
Looks like we have no agreement here, because changing error code checking to instanceof
doesn't make much sense to me. This however should not block PR changes. Or is this a merge blocker ? We can fix it later then we have strict rules on working with errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely not a blocker.
@@ -59,6 +68,8 @@ public class ReplicaService { | |||
|
|||
private final ReplicationConfiguration replicationConfiguration; | |||
|
|||
private final ScheduledExecutorService retryExecutor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add @Nullable
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🆗
/** | ||
* Transaction module exception mapper. | ||
*/ | ||
@AutoService(IgniteExceptionMappersProvider.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
retryExecutor.schedule( | ||
// Need to resubmit again to pool which is valid for synchronous IO execution. | ||
() -> partitionOperationsExecutor.execute(() -> res.completeExceptionally(errResp.throwable())), | ||
RETRY_TIMEOUT_MILLIS, MILLISECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understood it correctly,, we are holding response for several milliseconds to prevent instantaneous retry.
What will change if we don't do that? Because it is incorrect to delay a response instead of retrying, especially in cases where we don't make the retry again.
if (full) { // Full transaction retries are handled in postEnlist. | ||
return replicaSvc.invoke(primaryReplicaAndConsistencyToken.get1(), request); | ||
} else { | ||
if (write) { // Track only write requests from explicit transactions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else if (write) {
This code format is acceptable here.
@@ -1022,6 +1028,10 @@ private static boolean allSchemaVersionsSame(Collection<? extends BinaryRow> row | |||
boolean first = true; | |||
|
|||
for (BinaryRow row : rows) { | |||
if (row == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like a bug, how is it happening?