
spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #7104

Closed
thiagotnunes opened this issue Jul 29, 2020 · 11 comments · Fixed by #7592
Assignees
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@thiagotnunes
Contributor

This bug is related to the Spanner client library.

For long-lived transactions (>= 30 minutes), such as large PDML changes, it is possible that the gRPC connection is terminated with the error "Received unexpected EOS on DATA frame from server".

In this case, we need to retry the transaction, either resuming from the resume token obtained while reading the stream or restarting from scratch. This ensures that the PDML transaction continues to execute until it succeeds or a hard timeout is reached.
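The retry-with-resume-token flow described above can be sketched in plain Ruby (no Spanner client required; `execute_with_resume`, the `IOError` stand-in, and the block protocol are all illustrative, not the library's actual API):

```ruby
# Illustrative retry loop: re-issue a streaming call, passing the last
# resume token seen so the server can continue from where the stream broke,
# or from scratch when no token has been received yet. Progress (a fresh
# token) resets the retry budget, so only consecutive failures give up.
def execute_with_resume max_attempts: 5
  resume_token = nil
  attempts = 0
  # Called by the streaming block whenever a partial result carries a token.
  on_token = lambda do |token|
    unless token.nil? || token.empty?
      resume_token = token
      attempts = 0
    end
  end
  begin
    yield resume_token, on_token
  rescue IOError # stand-in for the retriable gRPC stream errors
    attempts += 1
    raise if attempts >= max_attempts
    retry
  end
end
```

In the real client the rescue clause would cover the relevant gRPC error classes and the token would come from each partial result set, but the control flow is the same: resume from the last token when one exists, otherwise start over.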

We have already implemented such a change in the Java client library; for more information see this PR: googleapis/java-spanner#360.

In order to test the fix, we can use a large spanner database. Please speak to @thiagotnunes for more details.

@thiagotnunes thiagotnunes added the api: spanner Issues related to the Spanner API. label Jul 29, 2020
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jul 29, 2020
@skuruppu skuruppu added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Jul 30, 2020
@skuruppu skuruppu assigned jiren and unassigned jiren Jul 30, 2020
@thiagotnunes thiagotnunes self-assigned this Aug 14, 2020
@thiagotnunes
Contributor Author

thiagotnunes commented Aug 18, 2020

@jiren hey 👋 , could you help us debug this problem?

Here you can see the example I am trying to run and the error I am seeing: https://gist.github.com/thiagotnunes/1e18081d589ad690ebc962b2ed780244

I have tried tweaking the pool keepalive, which helps keep the PDML running for a longer period, but it ultimately fails with the same error. Let me know if you have any questions or if I can help in any other way.

@thiagotnunes thiagotnunes assigned jiren and unassigned thiagotnunes Aug 18, 2020
@jiren
Member

jiren commented Aug 18, 2020

@thiagotnunes, looking into it.

@jiren
Member

jiren commented Aug 19, 2020

@thiagotnunes I have tried the PDML statement from the provided gist script 20 times on 5 million records, but I am not getting any error.
Can you share how many records are present in the test database?
Also, how many nodes and replicas are you using?

@thiagotnunes
Contributor Author

@jiren thanks for trying this out.

  • The error occurs if the PDML runs for over 30 mins.
  • The test database has 500M (child) records.
  • In the instance provided there is a single node (3 replicas by default I think?).
  • The test I was running updated the full database (500M records).

@jiren
Member

jiren commented Aug 19, 2020

@thiagotnunes Thanks for the info. One more thing: is ChildTable an interleaved table?

@jiren
Member

jiren commented Aug 19, 2020

@thiagotnunes Please ignore the previous message. I have access to the appdev-soda-spanner-staging project, so I will get all of the required information and try it out on your test instance.

@thiagotnunes
Contributor Author

@jiren no worries, happy to help with anything!

@thiagotnunes
Contributor Author

@jiren while investigating this problem in the PHP client library, we were not able to reproduce it. Let me know if that is the case here as well.

@jiren
Member

jiren commented Aug 24, 2020

@thiagotnunes I am not able to reproduce the exact error above, but after a few minutes I am getting the error below.

#<GRPC::Internal: 13:Received RST_STREAM with error code 2. debug_error_string:{"created":"@1597912702.090361000","description":"Error received from peer ipv6:[2404:6800:4009:805::200a]:443","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Received RST_STREAM with error code 2","grpc_status":13}>

After this error, the client library tries to resume the UPDATE statement execution using the resume token.

I ran the tests a few times, but none succeeded. Every time there was a timeout error.

Issue location:

# Flush the buffered responses now that they are all handled
buffered_responses = []
end
rescue GRPC::Cancelled, GRPC::DeadlineExceeded, GRPC::Internal,
       GRPC::ResourceExhausted, GRPC::Unauthenticated,
       GRPC::Unavailable, GRPC::Core::CallError => err
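The rescue clause above catches GRPC::Internal broadly, while the Java fix referenced earlier inspects the error message to decide whether the stream can be resumed. A minimal sketch of such a predicate (the constant and method names are hypothetical, not the library's actual API):

```ruby
# Stream-level Internal error messages that can be treated as retriable;
# these match the messages reported in this issue (EOS on DATA frame, and
# the RST_STREAM variant seen during testing).
RETRIABLE_INTERNAL_MESSAGES = [
  "Received unexpected EOS on DATA frame from server",
  "Received RST_STREAM"
].freeze

# Returns true when the error message indicates a broken stream that can be
# resumed with the last resume token rather than failed permanently.
def retriable_internal_error? err
  RETRIABLE_INTERNAL_MESSAGES.any? { |msg| err.message.include? msg }
end
```

A matching GRPC::Internal would then trigger a resume from the last token instead of surfacing the error to the caller.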

@thiagotnunes
Contributor Author

Thanks for the feedback @jiren, I am testing a small fix on my side based on your findings to see if it solves the problem at hand. If it does, I will add you to the PR.

@thiagotnunes
Contributor Author

@jiren if you could provide some feedback on the following PR, I would appreciate it: #7592
