spanner: retry non-idempotent admin RPCs #1964

Closed
olavloite opened this issue Apr 30, 2020 · 12 comments
Assignees
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@olavloite
Contributor

olavloite commented Apr 30, 2020

Non-idempotent admin RPCs such as CreateDatabase can fail when a transient error like UNAVAILABLE occurs. This can cause unexpected failures in user code as well as flaky test failures.

For the Java client we implemented retry logic for these RPCs: when such a failure occurs, the client queries the backend for any corresponding long-running operation, and based on the result of that query, the RPC is either retried or the existing long-running operation is returned to the user.
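For illustration, here is a rough sketch of what such retry logic could look like in Go. It is not the actual implementation: it assumes the standard `cloud.google.com/go/spanner/admin/database/apiv1` client, uses a hypothetical helper name `createDatabaseWithRetry`, and approximates the "query the backend for the long-running operation" step by simply checking with `GetDatabase` whether an earlier attempt already succeeded:

```go
package example

import (
	"context"

	database "cloud.google.com/go/spanner/admin/database/apiv1"
	adminpb "google.golang.org/genproto/googleapis/spanner/admin/database/v1"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// createDatabaseWithRetry is a hypothetical helper that retries CreateDatabase
// on transient errors. If an earlier attempt already created the database but
// the response was lost, it returns that database instead of retrying the RPC.
func createDatabaseWithRetry(ctx context.Context, c *database.DatabaseAdminClient, parent, dbID string) (*adminpb.Database, error) {
	req := &adminpb.CreateDatabaseRequest{
		Parent:          parent,
		CreateStatement: "CREATE DATABASE `" + dbID + "`",
	}
	for {
		op, err := c.CreateDatabase(ctx, req)
		if err == nil {
			return op.Wait(ctx)
		}
		if code := status.Code(err); code != codes.Unavailable && code != codes.DeadlineExceeded {
			return nil, err
		}
		// Transient failure: check whether the backend created (or is creating)
		// the database anyway before sending another CreateDatabase request.
		// The real logic would locate the pending long-running operation and
		// wait for it instead of returning the database record directly.
		db, getErr := c.GetDatabase(ctx, &adminpb.GetDatabaseRequest{
			Name: parent + "/databases/" + dbID,
		})
		if getErr == nil {
			return db, nil
		}
		if status.Code(getErr) != codes.NotFound {
			return nil, err
		}
		// The database does not exist yet, so it is safe to retry CreateDatabase.
	}
}
```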

@olavloite olavloite added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: spanner Issues related to the Spanner API. labels Apr 30, 2020
@olavloite olavloite self-assigned this Apr 30, 2020
gopherbot pushed a commit that referenced this issue May 6, 2020
The CreateDatabase RPC should be retried if it fails with a transient
error code, such as UNAVAILABLE or DEADLINE_EXCEEDED. The actual RPC
should only be retried if the initial RPC did not already start an
operation for creating a new database. In case the initial call
succeeded, but the response was lost, the retry logic will return a
reference to the existing long-running operation.

Updates #1964
Fixes #1922 and #1921

Change-Id: I2871b4ade440bd371ba717ca324f6542a6321a59
Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/55570
Reviewed-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Hengfeng Li <hengfeng@google.com>
@hengfengli
Contributor

@olavloite Can you confirm that this issue has been fixed?

@hengfengli hengfengli added the priority: p2 Moderately-important priority. Fix may not be included in next release. label May 25, 2020
@olavloite
Contributor Author

@hengfengli The change was implemented for the CreateDatabase RPC in this commit. As far as I know, we haven't experienced any flaky build failures caused by CreateDatabase failing with an UNAVAILABLE error since then, so it should now be fixed for CreateDatabase.

For the Java client we also implemented this retry logic for the CreateBackup and RestoreDatabase RPCs. I would recommend that we do the same for the Go client, provided that the flaky failures for CreateDatabase do not suddenly re-appear. Those RPCs are probably affected by the same problem, but we probably don't see the failures because the integration tests execute far fewer of those RPCs than CreateDatabase RPCs.

@skuruppu
Contributor

I believe we only have an integration test for CreateBackup. We didn't write much hand-written code for backups in Go since we only use the auto-generated code for admin APIs. If you were to modify the retry logic, that would need to be done here.

I'm still a bit unsure about doing this since we wanted to minimize the hand-written code here. It's only worth considering if the test fails regularly.

@olavloite
Contributor Author

We didn't write much hand-written code for Backups in Go since we only use the auto-generated code for admin APIs.

That's true. It also means that it would not be possible to implement it for RestoreDatabase, as we don't have any handwritten code for it (or we would need to introduce it). I haven't seen any failures caused by UNAVAILABLE errors for the CreateBackup RPC (yet), so I guess we could close this and re-evaluate if they do start to appear.

@skuruppu
Contributor

Thanks @olavloite. We'll reopen this if it happens for CreateBackup.

@olavloite
Contributor Author

@skuruppu @hengfengli It seems that the build error in #2393 was caused by a similar error. The error code is not UNAVAILABLE, but it seems that a temporary network error of some kind caused either the request or the response to get lost, which in turn caused a timeout while waiting for the backup operation to be returned.

@skuruppu
Contributor

skuruppu commented Jun 4, 2020

The error in that case was:

    TestIntegration_StartBackupOperation: integration_test.go:2728: backup metadata, got error context deadline exceeded, want nil
    --- FAIL: TestIntegration_StartBackupOperation (976.27s)

The failure is at https://github.com/googleapis/google-cloud-go/blob/master/spanner/integration_test.go#L2728 and that's a GetBackup call that timed out. I tried to look at other failures referenced by the parent of the issue, but most of those do not seem to fail in that test, at least from what I gathered from the logs.

@olavloite
Contributor Author

@skuruppu You're right. Sorry, I should have checked the code to see on which line it actually failed before commenting.

Based on the tests and checks above, it seems that the test would only get to this point if the backup had actually been created successfully. The backup creation, however, probably took a long time and then caused this timeout. The flow in this case seems to be:

  1. The test creates a context with a 15 minute deadline.
  2. The backup operation is started using this context. As long as a long-running operation is returned within the 15 minute deadline, no errors will occur.
  3. The test waits for the backup operation to finish. The test does not use any deadline for this.
  4. The test case tries to retrieve the metadata of the backup from the backend. For this, it uses the original context with a 15 minute deadline.

I don't think the GetBackup call in the last step is the operation that takes too long, but rather the wait for the backup operation to finish in step 3. That wait does not use any deadline, and will therefore succeed even if the 15 minute deadline is exceeded. The GetBackup operation will, however, return a DEADLINE_EXCEEDED error if it is called with a context that has already timed out, as sketched below.
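In code, the flow described above looks roughly like the following (a simplified reconstruction with hypothetical names such as `databaseAdmin` and `backupReq`, not the literal test code):

```go
// Step 1: the test creates a context with a 15-minute deadline.
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute)
defer cancel()

// Step 2: the backup operation is started with that context; as long as the
// long-running operation is returned within 15 minutes, this succeeds.
op, err := databaseAdmin.CreateBackup(ctx, backupReq)
if err != nil {
	t.Fatal(err)
}

// Step 3: the test waits for the backup to finish without any deadline, so
// this call can run well past the 15-minute mark without failing.
backup, err := op.Wait(context.Background())
if err != nil {
	t.Fatal(err)
}

// Step 4: the original context may already be expired at this point, so this
// call fails immediately with "context deadline exceeded" (DEADLINE_EXCEEDED).
_, err = databaseAdmin.GetBackup(ctx, &adminpb.GetBackupRequest{Name: backup.Name})
```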

Note that the test case has now been disabled, as it has had multiple failures in the past. Those failures were mostly caused by test instances not always being cleaned up, which in turn caused RESOURCE_EXHAUSTED errors when trying to create a test instance.

@skuruppu
Contributor

skuruppu commented Jun 4, 2020

Oh I see, 15 minutes is a really short time to wait for the backup. I think the timeouts we set in the configs are something like 1 hour to get the LRO, and the LRO can be polled for up to 48 hours. I suppose we wouldn't want to wait that long for an integration test to run.

I think @hengfengli found that the generated code in Go doesn't respect the timeouts in the config. Should we consider setting the deadline in the context to a higher value manually?

Another question is, is it possible for GetBackup to use a new context?

@olavloite
Contributor Author

Another question is, is it possible for GetBackup to use a new context?

Yes, that would also be possible, but I think setting a higher initial value would be better.

@hengfengli
Contributor

Should we just double it to 30 mins? or what value do you think is better?

@olavloite
Contributor Author

Should we just double it to 30 mins? or what value do you think is better?

Let's try to double it to 30 minutes, but also use the same context for all operations, so it's clearer which operation causes the timeout. That would mean also using the context with the deadline for the wait operation.
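A sketch of that suggestion, using the same hypothetical names as in the earlier snippet: one context with a 30-minute deadline shared by the start, wait, and GetBackup calls, so the step that exceeds the deadline is also the one that reports the error:

```go
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
defer cancel()

op, err := databaseAdmin.CreateBackup(ctx, backupReq)
if err != nil {
	t.Fatal(err)
}

// Wait with the same deadline-bound context instead of context.Background(),
// so a slow backup fails here rather than in a later, unrelated call.
backup, err := op.Wait(ctx)
if err != nil {
	t.Fatal(err)
}

_, err = databaseAdmin.GetBackup(ctx, &adminpb.GetBackupRequest{Name: backup.Name})
```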
