Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #4714

Closed
thiagotnunes opened this issue Jul 29, 2020 · 8 comments
Assignees
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Milestone

Comments

@thiagotnunes
Copy link
Contributor

This bug is related to the Spanner client library.

For long lived transactions (>= 30 minutes), in the case of large PDML changes, it is possible that the gRPC connection is terminated with an error "Received unexpected EOS on DATA frame from server".

In this case, we need to retry the transaction either with the received resume token obtained on reading the stream or from scratch. This will ensure that the PDML transaction continues to execute until it is successful or a hard timeout is reached.

We have already implemented such change in the Java client library, for more information see this PR: googleapis/java-spanner#360.

In order to test the fix, we can use a large spanner database. Please speak to @thiagotnunes for more details.

@thiagotnunes thiagotnunes added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Jul 29, 2020
@coryan coryan added the api: spanner Issues related to the Spanner API. label Jul 29, 2020
@devjgm devjgm added this to the 2020-Q3 milestone Aug 6, 2020
@mr-salty mr-salty self-assigned this Sep 8, 2020
@mr-salty
Copy link
Contributor

@thiagotnunes I see the java change looks for any of these strings, I assume we should do the same?

"HTTP/2 error code: INTERNAL_ERROR"
"Connection closed with unknown cause"
"Received unexpected EOS on DATA frame from server"

@thiagotnunes
Copy link
Contributor Author

@mr-salty There is another error that we have not seen in the Java client library, but we have seen in other libraries (which is curious): "RST_STREAM" (https://github.com/googleapis/python-spanner/pull/122/files#diff-81c1269f69a551cb02a056013d0db2e3R37).

If you'd like to be cover all grounds, I would retry on the 3 you mentioned and the RST_STREAM one.

@mr-salty
Copy link
Contributor

Hm, I think we'll need some larger changes to address this in C++ - currently we use a non-streaming ExecuteSql call with a timeout of 10 minutes.

IIUC Java uses a streaming call (does that imply we do periodically receive resume_token responses from the backend?), and has an overall timeout of 2 hours. I was also looking at #4528 which doesn't have a specific PDML timeout, so I assume we would use the ExecuteStreamingSql timeout of 1 hour? or, is the 2 hours important?

@thiagotnunes I normally work pretty late if you're available for a chat sometime later

@thiagotnunes
Copy link
Contributor Author

@mr-salty sorry I think I missed you. I scheduled a meeting for us to go over it next week.

mr-salty added a commit to mr-salty/google-cloud-cpp that referenced this issue Sep 18, 2020
mr-salty added a commit to mr-salty/google-cloud-cpp that referenced this issue Sep 18, 2020
mr-salty pushed a commit to mr-salty/google-cloud-cpp that referenced this issue Sep 18, 2020
mr-salty added a commit that referenced this issue Sep 21, 2020
…ed (#5087)

* fix: retry gRPC operations if the connection is unexpectedly terminated #5087

part of #4714

* address review comments

* clarify kInternal status handling.
@devjgm
Copy link
Contributor

devjgm commented Sep 29, 2020

I see PRs for this. Is this issue fixed?

@mr-salty
Copy link
Contributor

I still have a PR pending and need to test it with a real long-running query (Thiago added me to the relevant project and sent me instructions)

mr-salty added a commit to mr-salty/google-cloud-cpp that referenced this issue Oct 8, 2020
The streaming call allows us to properly resume the operation via
`PartialResultSet::resume_token` (already handled lower in the stack).

part of googleapis#4714 (it almost fixes it, I just need to tweak some timeouts).
mr-salty added a commit to mr-salty/google-cloud-cpp that referenced this issue Oct 9, 2020
The streaming call allows us to properly resume the operation via
`PartialResultSet::resume_token` (already handled lower in the stack).

part of googleapis#4714 (it almost fixes it, I just need to tweak some timeouts).
mr-salty added a commit that referenced this issue Oct 9, 2020
The streaming call allows us to properly resume the operation via
`PartialResultSet::resume_token` (already handled lower in the stack).

part of #4714 (it almost fixes it, I just need to tweak some timeouts).
@devjgm devjgm modified the milestones: 2020-Q3, 2020-Q4 Oct 13, 2020
@devjgm devjgm modified the milestones: 2020-Q4, 2021-Q1 Jan 8, 2021
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jan 25, 2021
@coryan
Copy link
Member

coryan commented Jan 25, 2021

Is this done? It is now out of SLO.

@mr-salty
Copy link
Contributor

greg and I discussed this last week. I think we can close this bug because what is (possibly) left to do is change the retry timeouts per #4528 , which we weren't able to reach consensus on.

If a user had long running-queries and manually set the timeouts long enough, they should not run into the issue (not properly resuming) that was the initial motivation for this issue. with the default timeouts, their query would time out before they ever saw this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

5 participants