Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: respect PDML timeout when using streaming RPC #338

Merged
merged 3 commits into from Jul 15, 2020

Conversation

olavloite
Copy link
Collaborator

@thiagotnunes Would you mind trying the same test with this PR? I don't have a dataset that is big enough that it will run for more than one hour.

Fixes #336

@thiagotnunes
Copy link
Contributor

@olavloite will do so on Monday and will let you know.

Digging in a little on the fix, it seems we just needed to fix the keepalive for the rpc calls, so am I right to assume that the thinking is that the streamWaitTimeout is the one being respected for this particular call?

Thanks for looking into this!

@thiagotnunes
Copy link
Contributor

I got another exception when running the code. I think there is a bug on using the remainingTimeout (line 108) on the PartitionedDMLTransaction. If the timeout is <= 0, we do not cancel the retry. A simple fix would be to break in this case. Below is the stack trace:

Exception in thread "main" com.google.cloud.spanner.SpannerException: UNKNOWN: Invalid timeout: < 0 s
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted(SpannerExceptionFactory.java:211)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:57)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:151)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:94)
	at com.google.cloud.spanner.PartitionedDMLTransaction.executeStreamingPartitionedUpdate(PartitionedDMLTransaction.java:156)
	at com.google.cloud.spanner.SessionImpl.executePartitionedUpdate(SessionImpl.java:118)
	at com.google.cloud.spanner.SessionPool$PooledSession.executePartitionedUpdate(SessionPool.java:1349)
	at com.google.cloud.spanner.SessionPool$PooledSessionFuture.executePartitionedUpdate(SessionPool.java:1236)
	at com.google.cloud.spanner.DatabaseClientImpl$3.apply(DatabaseClientImpl.java:227)
	at com.google.cloud.spanner.DatabaseClientImpl$3.apply(DatabaseClientImpl.java:224)
	at com.google.cloud.spanner.DatabaseClientImpl.runWithSessionRetry(DatabaseClientImpl.java:241)
	at com.google.cloud.spanner.DatabaseClientImpl.executePartitionedUpdate(DatabaseClientImpl.java:222)
	at com.google.cloud.spanner.PdmlIssue.main(PdmlIssue.java:14)
Caused by: java.lang.IllegalArgumentException: Invalid timeout: < 0 s
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
	at com.google.api.gax.grpc.GrpcCallContext.withStreamWaitTimeout(GrpcCallContext.java:177)
	at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.executeStreamingPartitionedDml(GapicSpannerRpc.java:1108)
	at com.google.cloud.spanner.PartitionedDMLTransaction.executeStreamingPartitionedUpdate(PartitionedDMLTransaction.java:113)
	... 8 more

This is my executor:

final SpannerOptions options = SpannerOptions.newBuilder().build();
final Spanner spanner = options.getService();
final DatabaseId id = DatabaseId.of(PROJECT, INSTANCE, DATABASE);
final DatabaseClient client = spanner.getDatabaseClient(id);

client.executePartitionedUpdate(Statement.of(
    "UPDATE Table SET bool = true WHERE true"
));

@thiagotnunes
Copy link
Contributor

In any case, the failure occurs after the specified timeout (2 hours by default).

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Jul 13, 2020
@olavloite
Copy link
Collaborator Author

@thiagotnunes Thanks for the test results. I've updated the retry-loop to take a potential zero or negative deadline into account, and I've added some additional unit tests for this class. PTAL.

Copy link
Contributor

@thiagotnunes thiagotnunes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest tests that I've done have terminated with a com.google.cloud.spanner.SpannerException: DEADLINE_EXCEEDED, but the weird thing is that the process did not take 2 hours to fail, but around 68 mins, so it does seem that the 2 hours window is still not being respected.
I might have misspoke when I said that the previous one failed in 2 hours, I thought it did, but I had no timer in runner code. I added it now. These are the logs from the execution:

Started running at Tue Jul 14 12:41:21 AEST 2020
Exception in thread "main" com.google.cloud.spanner.SpannerException: DEADLINE_EXCEEDED
Finished running at Tue Jul 14 13:49:33 AEST 2020 (took 4092380ms)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted(SpannerExceptionFactory.java:211)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:57)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:52)
	at com.google.cloud.spanner.SpannerExceptionFactory.fromApiException(SpannerExceptionFactory.java:229)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:144)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:94)
	at com.google.cloud.spanner.PartitionedDMLTransaction.executeStreamingPartitionedUpdate(PartitionedDMLTransaction.java:162)
	at com.google.cloud.spanner.SessionImpl.executePartitionedUpdate(SessionImpl.java:118)
	at com.google.cloud.spanner.SessionPool$PooledSession.executePartitionedUpdate(SessionPool.java:1349)
	at com.google.cloud.spanner.SessionPool$PooledSessionFuture.executePartitionedUpdate(SessionPool.java:1236)
	at com.google.cloud.spanner.DatabaseClientImpl$3.apply(DatabaseClientImpl.java:227)
	at com.google.cloud.spanner.DatabaseClientImpl$3.apply(DatabaseClientImpl.java:224)
	at com.google.cloud.spanner.DatabaseClientImpl.runWithSessionRetry(DatabaseClientImpl.java:241)
	at com.google.cloud.spanner.DatabaseClientImpl.executePartitionedUpdate(DatabaseClientImpl.java:222)
	at com.google.cloud.spanner.PmlIssue.main(PmlIssue.java:19)

Process finished with exit code 1

@@ -157,6 +163,10 @@ long executeStreamingPartitionedUpdate(final Statement statement, Duration timeo
}
}

Stopwatch createStopwatchStarted() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would it be hard to inject the Stopwatch instead? Not a big deal, just wondering if we could use composition here. Disregard if it would be too hard to inject.


@SuppressWarnings("unchecked")
@RunWith(JUnit4.class)
public class PartitionedDmlTransactionTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tests!

@thiagotnunes
Copy link
Contributor

thiagotnunes commented Jul 14, 2020

I know what the problem is: multiple failures are occurring so whenever we are subtracting the elapsed time, we are subtracting the accumulated time of the stopWatch every time. In order to fix this, we should reset the stopWatch in the case that we are about to retry. We could add stopWatch.reset() when catching the UnavailableException and inside the if statement (after line 135 of the PartitionedDMLTransaction).

An example of the problem I am seeing:

  1. We start the PDML statement.
stopWatch = "PT0S"
remainingTimeout = "PT2H"; 
  1. After 10 mins there is a retry-able failure. We subtract the stopWatch from the remainingTimeout.
stopWatch = "PT10M";
remainingTimeout = remainingTimeout - stopWatch = "PT1H50M";
  1. After 20 mins from the start (1) there is another failure. We again subtract the stopWatch from the remainingTimeout, but instead of subtracting 10 mins (time elapsed from the first retry to the second retry), we subtract 20 mins (time elapsed from the start to the second retry).
stopWatch = "PT20M"
remainingTimeout = remainingTimeout - stopWatch = "PT1H30M"; // This should had been PT1H40M

Btw, @olavloite let me know if you this feedback loop is too big (me testing a day after your change). I could implement these minor fixes if you'd like so, otherwise I am happy on testing / reporting back the results to you.

@olavloite
Copy link
Collaborator Author

@thiagotnunes Thanks for the elaborate testing and report. Hopefully the timeout problem should be fixed now.

And yes, please do feel free to add any additional fixes that might be needed to get this 100%.

Copy link
Contributor

@thiagotnunes thiagotnunes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olavloite thanks for the fix. It worked as expected (timing out after 2 hours).

@olavloite olavloite merged commit d67f108 into master Jul 15, 2020
@olavloite olavloite deleted the pdml-streaming-retry-settings branch July 15, 2020 07:20
gcf-merge-on-green bot pushed a commit that referenced this pull request Jul 16, 2020
🤖 I have created a release \*beep\* \*boop\* 
---
## [1.59.0](https://www.github.com/googleapis/java-spanner/compare/v1.58.0...v1.59.0) (2020-07-16)


### Features

* add support for NUMERIC data type ([#193](https://www.github.com/googleapis/java-spanner/issues/193)) ([b38a91d](https://www.github.com/googleapis/java-spanner/commit/b38a91d8daac264b9dea327d6b31430d9599bd78))
* spanner NUMERIC type ([#349](https://www.github.com/googleapis/java-spanner/issues/349)) ([78c3192](https://www.github.com/googleapis/java-spanner/commit/78c3192266c474fc43277a8bf3f15caa968a0100))


### Bug Fixes

* check if emulator is running if env var is set ([#340](https://www.github.com/googleapis/java-spanner/issues/340)) ([597f501](https://www.github.com/googleapis/java-spanner/commit/597f501803e6d58717a6e3770e6fd3f34454e9a5))
* fix potential unnecessary transaction retry ([#337](https://www.github.com/googleapis/java-spanner/issues/337)) ([1a4f4fd](https://www.github.com/googleapis/java-spanner/commit/1a4f4fd675a1580c87ad1d53c650a20bd2ff4811)), closes [#327](https://www.github.com/googleapis/java-spanner/issues/327)
* respect PDML timeout when using streaming RPC ([#338](https://www.github.com/googleapis/java-spanner/issues/338)) ([d67f108](https://www.github.com/googleapis/java-spanner/commit/d67f108e86925c1296e695db8e78fa82e11fa4fa))
* runs sample tests in java 8 and java 11 ([#345](https://www.github.com/googleapis/java-spanner/issues/345)) ([b547e31](https://www.github.com/googleapis/java-spanner/commit/b547e31d095be3cf1646e0e9c07bfc467ecc3c22))
* set gRPC keep-alive to 120 seconds ([#339](https://www.github.com/googleapis/java-spanner/issues/339)) ([26be103](https://www.github.com/googleapis/java-spanner/commit/26be103da1117c4940550fad1672c66e6edfbdb3))


### Dependencies

* update dependency com.google.cloud:google-cloud-shared-dependencies to v0.8.3 ([#334](https://www.github.com/googleapis/java-spanner/issues/334)) ([45acd89](https://www.github.com/googleapis/java-spanner/commit/45acd8960c961d48e91a7b1546efa64d9e9ae576))
* update shared config to 0.9.2 ([#328](https://www.github.com/googleapis/java-spanner/issues/328)) ([75df62c](https://www.github.com/googleapis/java-spanner/commit/75df62c0176137fda1d0a9076b83be06f11228ce))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please).
ansh0l pushed a commit to ansh0l/java-spanner that referenced this pull request Nov 10, 2022
This is an auto-generated regeneration of the .pb.go files by
cloud.google.com/go/internal/gapicgen. Once this PR is submitted, genmgr will
update the corresponding CL at gocloud to depend on the newer version of
go-genproto, and assign reviewers. Whilst this or any regen PR is open in
go-genproto, gapicgen will not create any more regeneration PRs or CLs. If all
regen PRs are closed, gapicgen will create a new set of regeneration PRs and
CLs once per night.

If you have been assigned to review this CL, please:

- Ensure that CI is passing. If it's failing, it requires your manual attention.
- Approve and submit this PR if you believe it's ready to ship. That will prompt
  genmgr to assign reviewers to the gocloud CL.

Corresponding gocloud CL: https://code-review.googlesource.com/c/gocloud/+/54272
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PDML Timeout not Respected
3 participants