spanner: retry non-idempotent admin RPCs #1964

Closed
olavloite opened this issue Apr 30, 2020 · 12 comments
Assignees
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@olavloite
Contributor

olavloite commented Apr 30, 2020

Non-idempotent admin RPCs such as CreateDatabase can fail when a transient error like UNAVAILABLE occurs. This can cause unexpected failures in user code as well as flaky test failures.

For the Java client we implemented retry logic for these RPCs: when such a failure occurs, the client queries the backend for any corresponding long-running operation, and based on the result of that query, the RPC is either retried or the existing long-running operation is returned to the user.
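For illustration, here is a rough sketch of what such retry logic could look like in Go. It is not the actual implementation: it assumes the standard `cloud.google.com/go/spanner/admin/database/apiv1` client, uses a hypothetical helper name `createDatabaseWithRetry`, and approximates the "query the backend for the long-running operation" step by simply checking with `GetDatabase` whether an earlier attempt already succeeded:

```go
package example

import (
	"context"

	database "cloud.google.com/go/spanner/admin/database/apiv1"
	adminpb "google.golang.org/genproto/googleapis/spanner/admin/database/v1"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// createDatabaseWithRetry is a hypothetical helper that retries CreateDatabase
// on transient errors. If an earlier attempt already created the database but
// the response was lost, it returns that database instead of retrying the RPC.
func createDatabaseWithRetry(ctx context.Context, c *database.DatabaseAdminClient, parent, dbID string) (*adminpb.Database, error) {
	req := &adminpb.CreateDatabaseRequest{
		Parent:          parent,
		CreateStatement: "CREATE DATABASE `" + dbID + "`",
	}
	for {
		op, err := c.CreateDatabase(ctx, req)
		if err == nil {
			return op.Wait(ctx)
		}
		if code := status.Code(err); code != codes.Unavailable && code != codes.DeadlineExceeded {
			return nil, err
		}
		// Transient failure: check whether the backend created (or is creating)
		// the database anyway before sending another CreateDatabase request.
		// The real logic would locate the pending long-running operation and
		// wait for it instead of returning the database record directly.
		db, getErr := c.GetDatabase(ctx, &adminpb.GetDatabaseRequest{
			Name: parent + "/databases/" + dbID,
		})
		if getErr == nil {
			return db, nil
		}
		if status.Code(getErr) != codes.NotFound {
			return nil, err
		}
		// The database does not exist yet, so it is safe to retry CreateDatabase.
	}
}
```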

@olavloite olavloite added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: spanner Issues related to the Spanner API. labels Apr 30, 2020
@olavloite olavloite self-assigned this Apr 30, 2020
gopherbot pushed a commit that referenced this issue May 6, 2020
The CreateDatabase RPC should be retried if it fails with a transient
error code, such as UNAVAILABLE or DEADLINE_EXCEEDED. The actual RPC
should only be retried if the initial RPC did not already start an
operation for creating a new database. In case the initial call
succeeded, but the response was lost, the retry logic will return a
reference to the existing long-running operation.

Updates #1964
Fixes #1922 and #1921

Change-Id: I2871b4ade440bd371ba717ca324f6542a6321a59
Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/55570
Reviewed-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Hengfeng Li <hengfeng@google.com>
@hengfengli
Contributor

@olavloite Can you confirm that this issue has been fixed?

@hengfengli hengfengli added the priority: p2 Moderately-important priority. Fix may not be included in next release. label May 25, 2020
@olavloite
Contributor Author

@hengfengli The change was implemented for the CreateDatabase RPC in this commit. As far as I know, we haven't experienced any flaky build failures caused by CreateDatabase failing with an UNAVAILABLE error since then, so it should now be fixed for CreateDatabase.

For the Java client we also implemented this retry logic for the CreateBackup and RestoreDatabase RPCs. I would recommend that we do the same for the Go client, provided that the flaky failures for CreateDatabase do not suddenly re-appear. Those RPCs are probably affected by the same problem, but we probably don't see the failures because the integration tests execute far fewer of those RPCs than CreateDatabase RPCs.

@skuruppu
Contributor

I believe we only have an integration test for CreateBackup. We didn't write much hand-written code for backups in Go since we only use the auto-generated code for admin APIs. If you were to modify the retry logic, that would need to be done here.

I'm still a bit unsure about doing this since we wanted to minimize the hand-written code here. It's only worth considering if the test fails regularly.

@olavloite
Contributor Author

We didn't write much hand-written code for Backups in Go since we only use the auto-generated code for admin APIs.

That's true. It also means that it would not be possible to implement it for RestoreDatabase, as we don't have any handwritten code for it (or we would need to introduce it). I haven't seen any failures caused by UNAVAILABLE errors for the CreateBackup RPC (yet), so I guess we could close this and re-evaluate if they do start to appear.

@skuruppu
Contributor

Thanks @olavloite. We'll reopen this if it happens for CreateBackup.

@olavloite
Contributor Author

@skuruppu @hengfengli It seems that the build error in #2393 was caused by a similar error. The error code is not UNAVAILABLE, but it seems that a temporary network error of some kind caused either the request or the response to get lost, which in turn caused a timeout while waiting for the backup operation to be returned.

@skuruppu
Contributor

skuruppu commented Jun 4, 2020

The error in that case was:

    TestIntegration_StartBackupOperation: integration_test.go:2728: backup metadata, got error context deadline exceeded, want nil
    --- FAIL: TestIntegration_StartBackupOperation (976.27s)

The failure is at https://github.com/googleapis/google-cloud-go/blob/master/spanner/integration_test.go#L2728 and that's a GetBackup call that timed out. I tried to look at other failures referenced by the parent of the issue, but most of those do not seem to fail in that test, at least from what I gathered from the logs.

@olavloite
Contributor Author

@skuruppu You're right. Sorry, I should have checked the code to see on which line it actually failed before commenting.

Based on the tests and checks above, it seems that the test would only get to this point if the backup had actually been created successfully. The backup creation, however, probably took a long time and then caused this timeout. The flow in this case seems to be:

  1. The test creates a context with a 15 minute deadline.
  2. The backup operation is started using this context. As long as a long-running operation is returned within the 15 minute deadline, no errors will occur.
  3. The test waits for the backup operation to finish. The test does not use any deadline for this.
  4. The test case tries to retrieve the metadata of the backup from the backend. For this, it uses the original context with a 15 minute deadline.

I don't think the GetBackup call in the last step is the operation that takes too long, but rather the wait for the backup operation to finish in step 3. That wait does not use any deadline, and will therefore succeed even if the 15 minute deadline is exceeded. The GetBackup operation will, however, return a DEADLINE_EXCEEDED error if it is called with a context that has already timed out, as sketched below.
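In code, the flow described above looks roughly like the following (a simplified reconstruction with hypothetical names such as `databaseAdmin` and `backupReq`, not the literal test code):

```go
// Step 1: the test creates a context with a 15-minute deadline.
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute)
defer cancel()

// Step 2: the backup operation is started with that context; as long as the
// long-running operation is returned within 15 minutes, this succeeds.
op, err := databaseAdmin.CreateBackup(ctx, backupReq)
if err != nil {
	t.Fatal(err)
}

// Step 3: the test waits for the backup to finish without any deadline, so
// this call can run well past the 15-minute mark without failing.
backup, err := op.Wait(context.Background())
if err != nil {
	t.Fatal(err)
}

// Step 4: the original context may already be expired at this point, so this
// call fails immediately with "context deadline exceeded" (DEADLINE_EXCEEDED).
_, err = databaseAdmin.GetBackup(ctx, &adminpb.GetBackupRequest{Name: backup.Name})
```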

Note that the test case has now been disabled, as it has had multiple failures in the past. Those failures were mostly caused by test instances not always being cleaned up, which in turn caused RESOURCE_EXHAUSTED errors when trying to create a test instance.

@skuruppu
Contributor

skuruppu commented Jun 4, 2020

Oh I see, 15 minutes is a really short time to wait for the backup. I think the timeouts we set in the configs are something like 1 hour to get the LRO, and the LRO can be polled for up to 48 hours. I suppose we wouldn't want to wait that long for an integration test to run.

I think @hengfengli found that the generated code in Go doesn't respect the timeouts in the config. Should we consider setting the deadline in the context to a higher value manually?

Another question is, is it possible for GetBackup to use a new context?

@olavloite
Contributor Author

Another question is, is it possible for GetBackup to use a new context?

Yes, that would also be possible, but I think setting a higher initial value would be better.

@hengfengli
Contributor

Should we just double it to 30 mins? or what value do you think is better?

@olavloite
Contributor Author

Should we just double it to 30 mins? or what value do you think is better?

Let's try to double it to 30 minutes, but also use the same context for all operations, so it's clearer which operation causes the timeout. That would mean also using the context with the deadline for the wait operation.
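A sketch of that suggestion, using the same hypothetical names as in the earlier snippet: one context with a 30-minute deadline shared by the start, wait, and GetBackup calls, so the step that exceeds the deadline is also the one that reports the error:

```go
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
defer cancel()

op, err := databaseAdmin.CreateBackup(ctx, backupReq)
if err != nil {
	t.Fatal(err)
}

// Wait with the same deadline-bound context instead of context.Background(),
// so a slow backup fails here rather than in a later, unrelated call.
backup, err := op.Wait(ctx)
if err != nil {
	t.Fatal(err)
}

_, err = databaseAdmin.GetBackup(ctx, &adminpb.GetBackupRequest{Name: backup.Name})
```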
