New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
spanner-client: Retry PDML on "Received unexpected EOS on DATA frame from server" #3258
Comments
Not sure, but looks like this case is already handled and transaction is retried. Is there an easy way to check it? Maybe some server emulator that can be instructed to close connection in a few seconds? |
I've tried PDML updates on a table with one milloin rows using custom port forwarder that breaks connections before update operation is complete. In this case gRPC library throws |
@ava12 thanks for testing this. I can help you testing this next week with a larger database if you share with me some of the code. This will make sure that the problem does not occur in long running PDMLs. |
Sample code is like this: <?php
require_once __DIR__ . '/vendor/autoload.php';
use Google\Cloud\Spanner\SpannerClient;
const INSTANCE_NAME = 'my-instance';
const DATABASE_NAME = 'my-database';
const QUERY = "UPDATE my_table SET my_col = 'updated' WHERE true";
$config = [
// 'keyFilePath' => '/path/to/keyfile.json', // default: GOOGLE_APPLICATION_CREDENTIALS env var
// 'projectId' => 'my-project', // default: obtain from key file
];
$client = new SpannerClient($config);
$database = $client->instance(INSTANCE_NAME)->database(DATABASE_NAME);
echo 'updating' . PHP_EOL;
$rowCount = $db->executePartitionedUpdate(QUERY);
echo "done, $rowCount rows updated" . PHP_EOL; |
@ava12 thanks for the sample code! I am not being able to successfully run the operation, I keep getting the following error after ~10 minutes of the PDML running:
Could you try and see if you get the same error for a long running PDML? |
I cannot get this error because I cannot initiate a request that is long enough. composer show -i | grep grpc in the sample code directory and show its output? |
The output of the command is the following:
We will try to investigate the issue. It seems this is happening in the python library as well. |
@ava12 sorry, we will probably not have the resources to investigate this further, could you continue with the investigation? The PDML that takes a long time to execute is the following: <?php
require './vendor/autoload.php';
use Google\Cloud\Spanner\SpannerClient;
$spanner = new SpannerClient(array(
'projectId' => '' // staging project
));
$instance = $spanner->instance('thagotnunes-test-instance');
$database = $instance->database('thiagotnunes-pdml-testing');
printf("Running PDML started at " . date("Y-m-d H:i:s") . "\n");
$rowCount = $database->executePartitionedUpdate(
"UPDATE ChildTable c SET c.name=CONCAT('315',c.name) WHERE c.ChildId < 1000000 AND NOT c.name LIKE '315%'"
);
printf("Running PDML finished at " . date("Y-m-d H:i:s") . "\n");
?> |
@thiagotnunes, I don't know the name of staging project and I don't think I have credentials to access it. Your sample query is quite simple and updates only one million rows. I've tried similar query on dev project and my table with one million rows, and it took less than 30 seconds to complete. How come your query takes more than 10 minutes? Complex schema? Huge table rows? |
@ava12 oh I thought you had access to it. That would be the app-soda staging project. Let me talk to my peers to see how to get you access to it. In terms of the time to update, the schema has a parent / child relationship with interleaving. There are about 500 parents with 1M children, so 500M updates. |
Ah, the project is |
Cannot reproduce. I never see neither "Socket closed", nor "The operation was cancelled". I always get "Timeout Exceeded". Even when I set timeout to 3 hours, I get "Timeout Exceeded" in 3 hours, no other errors, no retries. |
@ava12 thanks for trying it out. I will see if this still happens on my end and get back to you. |
@ava12 you are correct, this problem does not seem to occur any longer. I was running my tests in a different environment, which might had been the culprit of the problem. Thanks for verifying this. I will close the issue. |
This bug is related to the Spanner client library.
For long lived transactions (>= 30 minutes), in the case of large PDML changes, it is possible that the gRPC connection is terminated with an error "Received unexpected EOS on DATA frame from server".
In this case, we need to retry the transaction either with the received resume token obtained on reading the stream or from scratch. This will ensure that the PDML transaction continues to execute until it is successful or a hard timeout is reached.
We have already implemented such change in the Java client library, for more information see this PR: googleapis/java-spanner#360.
In order to test the fix, we can use a large spanner database. Please speak to @thiagotnunes for more details.
The text was updated successfully, but these errors were encountered: