New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
spanner: retry non-idempotent admin RPCs #1964
Comments
The CreateDatabase RPC should be retried if it fails with a transient error code, such as UNAVAILABLE or DEADLINE_EXCEEDED. The actual RPC should only be retried if the initial RPC did not already start an operation for creating a new database. In case the initial call succeeded, but the response was lost, the retry logic will return a reference to the existing long-running operation. Updates #1964 Fixes #1922 and #1921 Change-Id: I2871b4ade440bd371ba717ca324f6542a6321a59 Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/55570 Reviewed-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Hengfeng Li <hengfeng@google.com>
@olavloite Can you confirm that this issue has been fixed? |
@hengfengli The change was implemented for the For the Java client we also implemented this retry logic for the |
I believe we only have an IT for I'm still a bit unsure about doing this since we wanted to minimize the hand-written code here. It's only worth considering if the test fails regularly. |
That's true. It also means that it would not be possible to implement it for |
Thanks @olavloite. We'll reopen this if it happens for |
@skuruppu @hengfengli It seems that the build error in #2393 was caused by a similar error. The error code is not |
The error in that case was:
The failure is at https://github.com/googleapis/google-cloud-go/blob/master/spanner/integration_test.go#L2728 and that's a |
@skuruppu You're right. Sorry, I should have checked the code to see which line it actually failed before commenting. Based on the tests and checks above, it seems that the test would only get to this point if the backup has actually successfully been created. The backup creation however probably took a long time and then caused this timeout. The flow in this case seems to be:
I don't think the Note that the test case has now been disabled, as it has been having multiple failures in the past. Those failures were mostly caused by the problem that not all test instances were cleaned up, which again caused |
Oh I see, 15 minutes is a really short time to wait for the backup. I think the timeouts we set in the configs are something like 1 hour to get the LRO and the LRO can be polled up to 48 hours. I suppose we wouldn't want to wait this long for an IT to run. I think @hengfengli found that the generated code in Go doesn't respect the timeouts in the config. Should we consider setting the deadline in the context to a higher value manually? Another question is, is it possible for GetBackup to use a new context? |
Yes, that would also be possible, but I think setting a higher initial value would be better. |
Should we just double it to 30 mins? or what value do you think is better? |
Let's try to double it to 30 minutes, but also use the same context for all operations, so it's clearer which operation causes the timeout. That would mean also using the context with the deadline for the wait operation. |
Non-idempotent admin RPCs such as
CreateDatabase
can fail when a transient error likeUNAVAILABLE
occur. This can cause unexpected failures in both user code and flaky test failures.For the Java client we implemented retry logic for these RPCs by querying the backend for any corresponding long-running operation when such a failure occurred, and based on the result of that query, the RPC will either be retried or the corresponding long-running operation will be returned to the user.
The text was updated successfully, but these errors were encountered: