pubsublite: Stalls on Sending & Delayed "The service is currently unavailable" #3813
Hi Stephen,

Sorry to hear that you have been experiencing these issues, and thank you for reporting. We have done some investigation: there is an operation on the server that is taking longer than expected, and we are looking into a fix or mitigation. The client receives an error while this is happening. As a workaround, could you please try increasing the timeout on the client as follows:
Let us know if this doesn't help, or if you have to increase the timeout even more.
We decided to increase the default timeout to 1 hr in PR #3821. If this had been a genuine service outage (this issue turned out to be a false positive), how would you prefer to handle it in your clients? Would you prefer to detect and alert, or to keep the tasks running until the Pub/Sub Lite service resumes normal operation?
I responded in #3814 to the question above.
@tmdiep So far so good; this change appears to have resolved the issue. We'll know for sure after another 24 hours.
Glad to hear that increasing the timeout has helped.
@tmdiep increasing the timeout helped, but eventually resulted in a failure like:
I'm assuming this is the same issue as the above? Based on the recommendation, should we try pushing the timeout up further? Times below are Eastern US, March 17th.

Based on memory utilization, the service only queued messages for a while before throwing the errors above. In previous scenarios the OOM killer would actually "recover" the service, but it looks like a 30-minute-plus timeout will be needed to keep it running, and we would need to be prepared to queue 30 minutes of messages, which at current message volumes is likely impossible. The problem is, we're only at about 1/50th of our target message volume per machine. Letting the OOM killer "solve" the issue is not ideal, as you might imagine, and I believe this is likely not the performance profile you want for the Pub/Sub Lite service. Is a backend issue contributing?

Assuming the above is ultimately unavoidable, the next question is: how do we detect recovery of the service and reconnect?
Hi Stephen, I'll break up this long response.

**Timeout value**

We have identified a likely cause of the server error, but the fix has yet to roll out to the us-east1 region (ETA: end of this week). Looking at the logs, there were several occurrences of the error affecting your topic (all ET):
I see that the publishing rate for your topic has remained steady since around 2021-03-17 9:10, so it looks like the client successfully survived these occurrences. A 10 minute timeout seems sufficient to overcome these errors, but you may want to increase it for more resilience.

**Event at 8:00**

Now, what happened around 7:30 - 8:00 I can't fully explain. These are observations from our internal monitoring:
Can you please confirm the value of your timeout setting? I will try to reproduce with an application publishing similar sized messages at a similar rate.

**Service stopped error**

The error indicates that the publisher client has terminated. Next time the PublisherClient terminates, please let me know:
Do your logs happen to contain (1) for the termination at 8:06?
@tmdiep I appreciate the thorough response. Also, I'm happy to help with memory dumps, extra logging, etc. if it helps get this running well. We're very keen to move to Pub/Sub Lite.

**Timeout value**

That sounds good. We'll adjust as needed. The publish volume per machine will increase significantly soon.

**Event at 8:00**

The timeout was definitely at 10 minutes. There is one client (currently) publishing messages; there will be as many as 30 in the future. We are not using any message ordering.

At 7:36 Eastern the service is still handling messages:
You can see that messages handled is increasing:
but stalls at 7:34
It stays stalled until 8:05 Eastern
Until just a few hundred messages later, when it transitions to "service has stopped". I assume that with a lot more message volume, the timeout segment would have been whatever was in flight erroring out before the client stopped accepting new messages.
It continues to pump these errors for quite a while. I assume these are retries, because they occur at a much higher rate than the service was handling messages? There are thousands of these in the 12:05:27 second.
Message handling seemed to not completely stop though:
but back to many (though fewer), hundreds rather than thousands, per second:
and message handling is stalled
Then the service was restarted and began running just fine:
We're glad to hear that. As this is a new service, we're actively improving its reliability as unexpected issues are observed in the wild. Thank you for your patience and for offering to help! The only change I would suggest relates to logging.

**Event at 8:00**

Thank you for the extra logs and info; they do clarify what was happening. Since your topic has 10 partitions, under the hood there are 10 parallel gRPC streams connected to the Pub/Sub Lite service and 10 corresponding publish queues. (Using log timestamps:)
Partition 5 was assigned to the same backend server task as a few other partitions, so it wasn't a case of the task being down. Somewhere along the route to that server task, the connection was stuck; unfortunately, I currently don't have the information to determine where this occurred. Coincidentally, the backend error occurred at 2021-03-17 8:05:55, just to confuse us :) Please let us know if you encounter this scenario again. 30 minutes of downtime is a concerning duration, which we definitely need to eliminate and will investigate.
@tmdiep I appreciate the info. I will continue to provide updates as I have them. Based on your info above, the "PubSub Lite Publish Error rpc error: code = Unavailable desc = 408:Request Timeout" error was actually partial, in that it affected only one partition; but since that surfaces as a PublishResult error, should I just shut down publishing there?
Yes, please shut down publishing. Although the publisher client consists of individual partition publishers, it tries to hide this implementation detail and terminates everything if any one partition encounters a fatal error.
Hi Stephen, As I mentioned in the other issue, v0.8 has some improvements that allow better handling of backend availability. We have also increased the default value of the timeout.

The 30 min window of unavailability (it's actually due to a stream reconnection getting stuck) is still under investigation. We have added some monitoring on the server to try to determine the root cause, and we are considering adding mitigations in the client libraries.
Closing this issue, as the remaining anomaly (gRPC stream connection stuck for 30min) hasn't been seen for over a month. Monitoring suggested that the connection request never reached Pub/Sub Lite servers. Historically, it has been caused by grpc-go. |
Client
PubSubLite
Environment
Debian on GCE
Go Environment
pubsublite@v0.7.0
```
$ go version
go version go1.16 linux/amd64

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/--user--/.cache/go-build"
GOENV="/home/--user--/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/--user--/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/--user--/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build922222807=/tmp/go-build -gno-record-gcc-switches"
```
Code
This is a fairly vanilla set of publishing code to pubsublite.
That channel is in a GoRoutine listening for the publishes to track internal metrics and report errors.
Expected behavior
Message delivery stays stable.

Note: this occurs with multiple clients on multiple machines at the same time, leading me to believe that, though there may be a client-side solution, the issue is a response to backend behavior. Starting a new client, or restarting, results in no connection to the Pub/Sub Lite service.
Actual behavior
We are seeing pubsublite behavior very similar to #1163.

The behavior seen is that all messages stop publishing: "messageCountHandled" (our internal metric on each publish) does not increase, but log.Printf("PubSub Lite Publish Error %s", err.Error()) also never fires. Finally there is a more useful message from the service about being down, then a segfault, and then a restart that seems to fix things. Our app logs the following over and over before that happens:
This might happen for 20 minutes; then it logs this error from pubsub:
then
Ultimately, I'd like to improve that error logic, but right now the behavior is unpredictable enough to make it seem like an upstream error needs to be resolved first.