spanner: session pool destruction with disconnected client #3685
Comments
@olavloite if you could take a look when you have a moment, that would be great.
Closing the session pool when closing a Spanner client should be quick and should ignore any errors that are caused by an unresponsive backend or by loss of connectivity. Go does not support fire-and-forget RPC invocations. Instead this solution uses parallel invocations of the DeleteSession RPC with short timeouts that are ignored if they occur. Fixes #3685
@bombsimon Thanks for the detailed report. I agree with you that closing a client (and its session pool) should be quick, even if connectivity has been lost or the backend is non-responsive for any other reason. Fire-and-forget RPC invocations are not really supported, especially not if the gRPC channel is closed while those are still in flight, so I think the best solution is to parallelize the deletion of the sessions and use a relatively short deadline that should normally only be exceeded in extreme cases. This still means that your specific scenario will need 5 seconds to close down, but that should still be a lot more workable than the current 1500 seconds.
@olavloite Thank you so much, and thanks for creating the PR! 5 seconds sounds much more reasonable! :)
🤖 I have created a release \*beep\* \*boop\*

---

## [1.15.0](https://www.github.com/googleapis/google-cloud-go/compare/v1.14.1...v1.15.0) (2021-02-24)

### Features

* **spanner/admin/database:** add CMEK fields to backup and database ([47037ed](https://www.github.com/googleapis/google-cloud-go/commit/47037ed33cd36edfff4ba7c4a4ea332140d5e67b))
* **spanner/admin/database:** add CMEK fields to backup and database ([16597fa](https://www.github.com/googleapis/google-cloud-go/commit/16597fa1ce549053c7183e8456e23f554a5501de))

### Bug Fixes

* **spanner:** parallelize session deletion when closing pool ([#3701](https://www.github.com/googleapis/google-cloud-go/issues/3701)) ([75ac7d2](https://www.github.com/googleapis/google-cloud-go/commit/75ac7d2506e706869ae41cf186b0c873b146e926)), refs [#3685](https://www.github.com/googleapis/google-cloud-go/issues/3685)

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
**Client**

Spanner

**Environment**

Any, in this example the local Spanner emulator

**Go Environment**

`go env`

**Code**

MVP to reproduce issue

**Expected behavior**

Even with a lost connection to the Spanner instance, a call to `Close()` should not hang, at least not for 15 seconds per item in the session pool (which by default is set to 100).

**Actual behavior**

Since the connection is lost, calling `Close()` on a Spanner client hangs for 15 seconds before timing out for each session in the pool, so the operation defaults to 1500 seconds.

**Screenshots**

N/A

**Additional context**
To reproduce this, use the code example above combined with the Spanner emulator. The code will create the instance and the database, so all you have to do is start the emulator, run the code, exit the emulator, and then try to interrupt the code (Ctrl + C). The easiest way to start the Spanner emulator is probably with Docker:
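For reference, the emulator image published by Google can be started like this (9010 is the gRPC port, 9020 the REST port; `SPANNER_EMULATOR_HOST` makes the Go client target the emulator):

```shell
docker run -p 9010:9010 -p 9020:9020 gcr.io/cloud-spanner-emulator/emulator
export SPANNER_EMULATOR_HOST=localhost:9010
```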
I've noticed that it's important to ensure the database exists before creating the Spanner client. Initially I thought a connection was not created when calling `spanner.NewClient()`, but apparently the session pool is populated depending on whether the instance and database exist. By doing it in the order of the code above, the client will have a pool consisting of 100 items.

To summarise the flow when calling `Close()` on the client, we can see that it takes the following route:

1. Calling close will call `close()` on all the idle sessions
   (google-cloud-go/spanner/client.go, Lines 227 to 232 in bcac779)
2. Which will iterate over the pool one item at a time and call `destroy()`
   (google-cloud-go/spanner/session.go, Lines 743 to 748 in bcac779)
3. Which will create a timeout of 15 seconds for each item in the pool and call `delete()`
   (google-cloud-go/spanner/session.go, Lines 338 to 340 in bcac779)
4. Which will hang for 15 seconds each time, since the client cannot connect to the Spanner instance due to the lost connection
   (google-cloud-go/spanner/session.go, Lines 347 to 350 in bcac779)
This issue is mostly a question about the reasoning behind this timeout (15 seconds) and the sequential calls, and about what a proper way to handle this situation would be. I don't think a lost connection is that uncommon, and with a proper liveness check for a service, a call to `Close()` could very well be issued when the connection is detected to be lost (and not recoverable). When that happens, I would like my program to exit as quickly as possible.

I can see a few obvious solutions around this:

* Allow passing a deadline to `Close()`. This will probably not happen since it would change the signature and require a new major release of the package to conform to semver.
* Run the `destroy()` operations concurrently. I haven't checked if this is actually reasonable or could cause problems.

If any of these alternatives sounds suitable I'll gladly work on a PR for that!