spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #4714
Comments
@thiagotnunes I see the Java change looks for any of these strings; I assume we should do the same? "HTTP/2 error code: INTERNAL_ERROR"
@mr-salty There is another error that we have not seen in the Java client library, but have seen in other libraries (which is curious): "RST_STREAM" (https://github.com/googleapis/python-spanner/pull/122/files#diff-81c1269f69a551cb02a056013d0db2e3R37). If you'd like to cover all the bases, I would retry on the 3 you mentioned and the RST_STREAM one.
Hm, I think we'll need some larger changes to address this in C++ - currently we use a non-streaming IIUC Java uses a streaming call (does that imply we do periodically receive @thiagotnunes I normally work pretty late if you're available for a chat sometime later
@mr-salty sorry, I think I missed you. I scheduled a meeting for us to go over it next week.
I see PRs for this. Is this issue fixed?
I still have a PR pending and need to test it with a real long-running query (Thiago added me to the relevant project and sent me instructions).
The streaming call allows us to properly resume the operation via `PartialResultSet::resume_token` (already handled lower in the stack). part of googleapis#4714 (it almost fixes it, I just need to tweak some timeouts).
Is this done? It is now out of SLO.
Greg and I discussed this last week. I think we can close this bug, because what is (possibly) left to do is change the retry timeouts per #4528, which we weren't able to reach consensus on. If a user had long-running queries and manually set the timeouts long enough, they should not run into the issue (not properly resuming) that was the initial motivation for this bug. With the default timeouts, their query would time out before they ever saw this issue.
This bug is related to the Spanner client library.
For long-lived transactions (>= 30 minutes), in the case of large PDML changes, the gRPC connection may be terminated with the error "Received unexpected EOS on DATA frame from server".
In this case, we need to retry the transaction, either from the resume token received while reading the stream or from scratch. This ensures that the PDML transaction continues to execute until it succeeds or a hard timeout is reached.
We have already implemented this change in the Java client library; for more information see this PR: googleapis/java-spanner#360.
In order to test the fix, we can use a large spanner database. Please speak to @thiagotnunes for more details.
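To make the requested behavior concrete, here is a minimal sketch of the retry logic described above. It is not the google-cloud-cpp implementation: `Code`, `Status`, `StreamResult`, `StreamFn`, `IsRetryableInternalError`, and `RunWithResume` are all hypothetical names made up for illustration. It assumes the transient failures surface as an INTERNAL status whose message contains one of the strings quoted in this thread, and it uses a simple attempt limit in place of the hard timeout.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-ins for the client library's status types.
enum class Code { kOk, kInternal, kOther };
struct Status {
  Code code;
  std::string message;
};

// Returns true if the status looks like one of the transient stream
// failures mentioned in this issue (assumed to arrive as INTERNAL).
bool IsRetryableInternalError(Status const& status) {
  if (status.code != Code::kInternal) return false;
  static char const* const kTransientMessages[] = {
      "Received unexpected EOS on DATA frame from server",
      "HTTP/2 error code: INTERNAL_ERROR",
      "RST_STREAM",
  };
  for (auto const* needle : kTransientMessages) {
    if (status.message.find(needle) != std::string::npos) return true;
  }
  return false;
}

// Result of draining one streaming call: the final status and the last
// resume token seen on the stream (empty if none was received).
struct StreamResult {
  Status status;
  std::string last_resume_token;
};

// Hypothetical callable that starts (or resumes) the streaming call.
// An empty token means "start from scratch".
using StreamFn = std::function<StreamResult(std::string const&)>;

// Re-issues the streaming call on transient failures, resuming from the
// most recent token when one is available. An attempt limit stands in
// for the hard timeout here.
Status RunWithResume(StreamFn const& run_stream, int max_attempts) {
  std::string resume_token;
  Status last{Code::kOther, "no attempts made"};
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    StreamResult result = run_stream(resume_token);
    if (result.status.code == Code::kOk) return result.status;
    if (!IsRetryableInternalError(result.status)) return result.status;
    // Keep the previous token if this attempt failed before yielding one.
    if (!result.last_resume_token.empty()) {
      resume_token = result.last_resume_token;
    }
    last = result.status;
  }
  return last;
}
```

A real client would likely add backoff between attempts and bound the loop by elapsed time rather than by attempt count.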