New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: retry non-idempotent long-running RPCs #141
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
databaseAdminStubSettings | ||
.createBackupOperationSettings() | ||
.getInitialCallSettings() | ||
.getRetrySettings(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't specify retry settings for non-idempotent RPCs. Do you know what this will default to then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are defaults defined in the gapic configuration file as well, but those are already overridden by the client library. The defaults are set in the constructor of SpannerOptions.Builder
:
java-spanner/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java
Lines 242 to 295 in 9e5a1cd
private Builder() { | |
// Manually set retry and polling settings that work. | |
OperationTimedPollAlgorithm longRunningPollingAlgorithm = | |
OperationTimedPollAlgorithm.create( | |
RetrySettings.newBuilder() | |
.setInitialRpcTimeout(Duration.ofSeconds(60L)) | |
.setMaxRpcTimeout(Duration.ofSeconds(600L)) | |
.setInitialRetryDelay(Duration.ofSeconds(20L)) | |
.setMaxRetryDelay(Duration.ofSeconds(45L)) | |
.setRetryDelayMultiplier(1.5) | |
.setRpcTimeoutMultiplier(1.5) | |
.setTotalTimeout(Duration.ofHours(48L)) | |
.build()); | |
RetrySettings longRunningRetrySettings = | |
RetrySettings.newBuilder() | |
.setInitialRpcTimeout(Duration.ofSeconds(60L)) | |
.setMaxRpcTimeout(Duration.ofSeconds(600L)) | |
.setInitialRetryDelay(Duration.ofSeconds(20L)) | |
.setMaxRetryDelay(Duration.ofSeconds(45L)) | |
.setRetryDelayMultiplier(1.5) | |
.setRpcTimeoutMultiplier(1.5) | |
.setTotalTimeout(Duration.ofHours(48L)) | |
.build(); | |
databaseAdminStubSettingsBuilder | |
.createDatabaseOperationSettings() | |
.setPollingAlgorithm(longRunningPollingAlgorithm) | |
.setInitialCallSettings( | |
UnaryCallSettings | |
.<CreateDatabaseRequest, OperationSnapshot>newUnaryCallSettingsBuilder() | |
.setRetrySettings(longRunningRetrySettings) | |
.build()); | |
databaseAdminStubSettingsBuilder | |
.createBackupOperationSettings() | |
.setPollingAlgorithm(longRunningPollingAlgorithm) | |
.setInitialCallSettings( | |
UnaryCallSettings | |
.<CreateBackupRequest, OperationSnapshot>newUnaryCallSettingsBuilder() | |
.setRetrySettings(longRunningRetrySettings) | |
.build()); | |
databaseAdminStubSettingsBuilder | |
.restoreDatabaseOperationSettings() | |
.setPollingAlgorithm(longRunningPollingAlgorithm) | |
.setInitialCallSettings( | |
UnaryCallSettings | |
.<RestoreDatabaseRequest, OperationSnapshot>newUnaryCallSettingsBuilder() | |
.setRetrySettings(longRunningRetrySettings) | |
.build()); | |
databaseAdminStubSettingsBuilder | |
.deleteBackupSettings() | |
.setRetrySettings(longRunningRetrySettings); | |
databaseAdminStubSettingsBuilder | |
.updateBackupSettings() | |
.setRetrySettings(longRunningRetrySettings); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So does this mean that the Java client library has its own retrying settings which are different from the gapic config file?
@@ -291,7 +292,8 @@ private Builder() { | |||
.setRetrySettings(longRunningRetrySettings); | |||
databaseAdminStubSettingsBuilder | |||
.updateBackupSettings() | |||
.setRetrySettings(longRunningRetrySettings); | |||
.setRetrySettings(longRunningRetrySettings) | |||
.setRetryableCodes(StatusCode.Code.DEADLINE_EXCEEDED, StatusCode.Code.UNAVAILABLE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to manually specify the retryable codes here? In the gapic config file (https://github.com/googleapis/googleapis/blob/d741cd976975c745d0199987aff0e908b8352992/google/spanner/admin/database/v1/spanner_admin_database_grpc_service_config.json#L58-L67), updateBackup
will retry DEADLINE_EXCEEDED and UNAVAILABLE. Shouldn't this code be auto-generated?
I guess Java gapic code generator is using a different config file: https://github.com/googleapis/googleapis/blob/master/google/spanner/admin/database/v1/spanner_admin_database_gapic.yaml#L83-L86.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious: why do we manually set retrying settings instead of using the default values from the gapic config file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a lot of problems / transient errors using the default values when working on the backups feature. So I set the values manually to try to find settings that did work, so that we can set them back to the default config.
So the Java client should definitely not continue to use custom retry settings. Once we have verified that these settings work well, we should set these as the default in the config file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yes, the Java client (and I think also the other clients, but I'm not 100% sure) use the retry settings in https://github.com/googleapis/googleapis/blob/master/google/spanner/admin/database/v1/spanner_admin_database_gapic.yaml. So that means for example that the default config for UpdateBackup
is non-idempotent.
RPCs returning a long-running operation, such as CreateDatabase, CreateBackup and RestoreDatabase, are non-idempotent and cannot be retried automatically by gax. This means that these RPCs sometimes fail with transient errors, such as UNAVAILABLE or DEADLINE_EXCEEDED. This change introduces automatic retries of these RPCs using the following logic: 1. Execute the RPC and wait for the operation to be returned. 2. If a transient error occurs while waiting for the operation, the client library queries the backend for the corresponding operation. If the operation is found, the resumes the tracking of the existing operation and returns that to the user. 3. If no corresponding operation is found in step 2, the client library retries the RPC from step 1. Fixes GoogleCloudPlatform/java-docs-samples#2571
771ae90
to
605b672
Compare
🤖 I have created a release \*beep\* \*boop\* --- ## [1.53.0](https://www.github.com/googleapis/java-spanner/compare/v1.52.0...v1.53.0) (2020-04-22) ### Features * optimize maintainer to let sessions be GC'ed instead of deleted ([#135](https://www.github.com/googleapis/java-spanner/issues/135)) ([d65747c](https://www.github.com/googleapis/java-spanner/commit/d65747cbc704508f6f1bcef6eea53aa411d42ee2)) ### Bug Fixes * assign unique id's per test case ([#129](https://www.github.com/googleapis/java-spanner/issues/129)) ([a553b6d](https://www.github.com/googleapis/java-spanner/commit/a553b6d48c4f5ee2d0583e5b825d73a85f06216e)) * check for not null input for Id classes ([#159](https://www.github.com/googleapis/java-spanner/issues/159)) ([ecf5826](https://www.github.com/googleapis/java-spanner/commit/ecf582670818f32e85f534ec400d0b8d31cf9ca6)), closes [#145](https://www.github.com/googleapis/java-spanner/issues/145) * clean up test instance if creation failed ([#162](https://www.github.com/googleapis/java-spanner/issues/162)) ([ff571e1](https://www.github.com/googleapis/java-spanner/commit/ff571e16a45fbce692d9bb172749ff15fafe7a9c)) * fix flaky test and remove warnings ([#153](https://www.github.com/googleapis/java-spanner/issues/153)) ([d534e35](https://www.github.com/googleapis/java-spanner/commit/d534e350346b0c9ab8057ede36bc3aac473c0b06)), closes [#146](https://www.github.com/googleapis/java-spanner/issues/146) * increase test timeout and remove warnings ([#160](https://www.github.com/googleapis/java-spanner/issues/160)) ([63a6bd8](https://www.github.com/googleapis/java-spanner/commit/63a6bd8be08a56d002f58bc2cdb2856ad0dc5fa3)), closes [#158](https://www.github.com/googleapis/java-spanner/issues/158) * retry non-idempotent long-running RPCs ([#141](https://www.github.com/googleapis/java-spanner/issues/141)) ([4669c02](https://www.github.com/googleapis/java-spanner/commit/4669c02a24e0f7b1d53c9edf5ab7b146b4116960)) * retry restore if blocked by pending restore ([#119](https://www.github.com/googleapis/java-spanner/issues/119)) ([220653d](https://www.github.com/googleapis/java-spanner/commit/220653d8e25c518d0df447bf777a7fcbf04a01ca)), closes [#118](https://www.github.com/googleapis/java-spanner/issues/118) * StatementParser did not accept multiple query hints ([#170](https://www.github.com/googleapis/java-spanner/issues/170)) ([ef41a6e](https://www.github.com/googleapis/java-spanner/commit/ef41a6e503f218c00c16914aa9c1433d9b26db13)), closes [#163](https://www.github.com/googleapis/java-spanner/issues/163) * wait for initialization to finish before test ([#161](https://www.github.com/googleapis/java-spanner/issues/161)) ([fe434ff](https://www.github.com/googleapis/java-spanner/commit/fe434ff7068b4b618e70379c224e1c5ab88f6ba1)), closes [#146](https://www.github.com/googleapis/java-spanner/issues/146) ### Performance Improvements * increase sessions in the pool in batches ([#134](https://www.github.com/googleapis/java-spanner/issues/134)) ([9e5a1cd](https://www.github.com/googleapis/java-spanner/commit/9e5a1cdaacf71147b67681861f063c3276705f44)) * prepare sessions with r/w tx in-process ([#152](https://www.github.com/googleapis/java-spanner/issues/152)) ([2db27ce](https://www.github.com/googleapis/java-spanner/commit/2db27ce048efafaa3c28b097de33518747011465)), closes [#151](https://www.github.com/googleapis/java-spanner/issues/151) ### Dependencies * update core dependencies ([#109](https://www.github.com/googleapis/java-spanner/issues/109)) ([5753f1f](https://www.github.com/googleapis/java-spanner/commit/5753f1f4fed83df87262404f7a7ba7eedcd366cb)) * update core dependencies ([#132](https://www.github.com/googleapis/java-spanner/issues/132)) ([77c1558](https://www.github.com/googleapis/java-spanner/commit/77c1558652ee00e529674ac3a2dcf3210ef049fa)) * update dependency com.google.api:api-common to v1.9.0 ([#127](https://www.github.com/googleapis/java-spanner/issues/127)) ([b2c744f](https://www.github.com/googleapis/java-spanner/commit/b2c744f01a4d5a8981df5ff900f3536c83265a61)) * update dependency com.google.guava:guava-bom to v29 ([#147](https://www.github.com/googleapis/java-spanner/issues/147)) ([3fe3ae0](https://www.github.com/googleapis/java-spanner/commit/3fe3ae02376af552564c93c766f562d6454b7ac1)) * update dependency io.grpc:grpc-bom to v1.29.0 ([#164](https://www.github.com/googleapis/java-spanner/issues/164)) ([2d2ce5c](https://www.github.com/googleapis/java-spanner/commit/2d2ce5ce4dc8f410ec671e542e144d47f39ab40b)) * update dependency org.threeten:threetenbp to v1.4.3 ([#120](https://www.github.com/googleapis/java-spanner/issues/120)) ([49d1abc](https://www.github.com/googleapis/java-spanner/commit/49d1abcb6c9c48762dcf0fe1466ab107bf67146b)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please).
* updated CHANGELOG.md [ci skip] * updated README.md [ci skip] * updated versions.txt [ci skip] * updated samples/pom.xml [ci skip] * updated samples/install-without-bom/pom.xml [ci skip] * updated samples/snapshot/pom.xml [ci skip] * updated samples/snippets/pom.xml [ci skip] * updated pom.xml Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
RPCs returning a long-running operation, such as CreateDatabase, CreateBackup and RestoreDatabase, are non-idempotent and cannot be retried automatically by gax. This means that these RPCs sometimes fail with transient errors, such as
UNAVAILABLE
orDEADLINE_EXCEEDED
. This change introduces automatic retries of these RPCs using the following logic:Fixes GoogleCloudPlatform/java-docs-samples#2571