
spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #7104

Closed
thiagotnunes opened this issue Jul 29, 2020 · 11 comments · Fixed by #7592
Assignees
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@thiagotnunes
Contributor

This bug is related to the Spanner client library.

For long-lived transactions (>= 30 minutes), such as large PDML changes, it is possible that the gRPC connection is terminated with the error "Received unexpected EOS on DATA frame from server".

In this case, we need to retry the transaction, either resuming from the resume token obtained while reading the stream or restarting from scratch. This ensures that the PDML transaction continues to execute until it succeeds or a hard timeout is reached.
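The retry-with-resume-token flow described above can be sketched in plain Ruby (no Spanner client required; `execute_with_resume`, the `IOError` stand-in, and the block protocol are all illustrative, not the library's actual API):

```ruby
# Illustrative retry loop: re-issue a streaming call, passing the last
# resume token seen so the server can continue from where the stream broke,
# or from scratch when no token has been received yet. Progress (a fresh
# token) resets the retry budget, so only consecutive failures give up.
def execute_with_resume max_attempts: 5
  resume_token = nil
  attempts = 0
  # Called by the streaming block whenever a partial result carries a token.
  on_token = lambda do |token|
    unless token.nil? || token.empty?
      resume_token = token
      attempts = 0
    end
  end
  begin
    yield resume_token, on_token
  rescue IOError # stand-in for the retriable gRPC stream errors
    attempts += 1
    raise if attempts >= max_attempts
    retry
  end
end
```

In the real client the rescue clause would cover the relevant gRPC error classes and the token would come from each partial result set, but the control flow is the same: resume from the last token when one exists, otherwise start over.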

We have already implemented such a change in the Java client library; for more information see this PR: googleapis/java-spanner#360.

In order to test the fix, we can use a large spanner database. Please speak to @thiagotnunes for more details.

@thiagotnunes thiagotnunes added the api: spanner Issues related to the Spanner API. label Jul 29, 2020
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jul 29, 2020
@skuruppu skuruppu added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Jul 30, 2020
@skuruppu skuruppu assigned jiren and unassigned jiren Jul 30, 2020
@thiagotnunes thiagotnunes self-assigned this Aug 14, 2020
@thiagotnunes
Contributor Author

thiagotnunes commented Aug 18, 2020

@jiren hey 👋 , could you help us debug this problem?

Here you can see the example I am trying to run and the error I am seeing: https://gist.github.com/thiagotnunes/1e18081d589ad690ebc962b2ed780244

I have tried tweaking the pool keepalive, which helps keep the PDML running for a longer period, but it ultimately fails with the same error. Let me know if you have any questions or if I can help in any other way.

@thiagotnunes thiagotnunes assigned jiren and unassigned thiagotnunes Aug 18, 2020
@jiren
Member

jiren commented Aug 18, 2020

@thiagotnunes, looking into it.

@jiren
Member

jiren commented Aug 19, 2020

@thiagotnunes I have tried the PDML statement from the provided gist script 20 times on 5 million records, but I am not getting any error.
Can you share how many records are present in the test database?
Also, how many nodes and replicas are you using?

@thiagotnunes
Contributor Author

@jiren thanks for trying this out.

  • The error occurs if the PDML runs for over 30 mins.
  • The test database has 500M (child) records.
  • In the instance provided there is a single node (3 replicas by default I think?).
  • The test I was running updated the full database (500M records).

@jiren
Member

jiren commented Aug 19, 2020

@thiagotnunes Thanks for the info. One more thing: is ChildTable an interleaved table?

@jiren
Member

jiren commented Aug 19, 2020

@thiagotnunes Please ignore the previous message. I have access to the appdev-soda-spanner-staging project, so I will get all of the required information and try it out on your test instance.

@thiagotnunes
Contributor Author

@jiren no worries, happy to help with anything!

@thiagotnunes
Contributor Author

@jiren while investigating this problem in the PHP client library, we were not able to reproduce it. Let me know if that is the case here as well.

@jiren
Member

jiren commented Aug 24, 2020

@thiagotnunes I am not able to reproduce the exact error above, but after a few minutes I am getting the error below.

#<GRPC::Internal: 13:Received RST_STREAM with error code 2. debug_error_string:{"created":"@1597912702.090361000","description":"Error received from peer ipv6:[2404:6800:4009:805::200a]:443","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Received RST_STREAM with error code 2","grpc_status":13}>

After this error, the client library tries to resume the UPDATE statement execution using the resume token.

I ran the tests a few times, but none succeeded. Every time there was a timeout error.

Issue location:

# Flush the buffered responses now that they are all handled
buffered_responses = []
end
rescue GRPC::Cancelled, GRPC::DeadlineExceeded, GRPC::Internal,
       GRPC::ResourceExhausted, GRPC::Unauthenticated,
       GRPC::Unavailable, GRPC::Core::CallError => err
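The rescue clause above catches GRPC::Internal broadly, while the Java fix referenced earlier inspects the error message to decide whether the stream can be resumed. A minimal sketch of such a predicate (the constant and method names are hypothetical, not the library's actual API):

```ruby
# Stream-level Internal error messages that can be treated as retriable;
# these match the messages reported in this issue (EOS on DATA frame, and
# the RST_STREAM variant seen during testing).
RETRIABLE_INTERNAL_MESSAGES = [
  "Received unexpected EOS on DATA frame from server",
  "Received RST_STREAM"
].freeze

# Returns true when the error message indicates a broken stream that can be
# resumed with the last resume token rather than failed permanently.
def retriable_internal_error? err
  RETRIABLE_INTERNAL_MESSAGES.any? { |msg| err.message.include? msg }
end
```

A matching GRPC::Internal would then trigger a resume from the last token instead of surfacing the error to the caller.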

@thiagotnunes
Contributor Author

Thanks for the feedback @jiren, I am testing a small fix on my side based on your findings to see if it solves the problem at hand. If it does, I will add you to the PR.

@thiagotnunes
Contributor Author

@jiren if you could provide some feedback on the following PR, I would appreciate it: #7592
