pubsublite: Stalls on Sending & Delayed "The service is currently unavailable" #3813

Closed
GLStephen opened this issue Mar 14, 2021 · 13 comments

Labels: api: pubsublite (Issues related to the Pub/Sub Lite API), priority: p2 (Moderately-important priority; fix may not be included in next release), type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

GLStephen commented Mar 14, 2021

Client

PubSubLite

Environment

Debian on GCE

Go Environment

pubsublite@v0.7.0

go version go1.16 linux/amd64

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/--user--/.cache/go-build"
GOENV="/home/--user--/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/--user--/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/--user--/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build922222807=/tmp/go-build -gno-record-gcc-switches"

Code

This is a fairly vanilla set of publishing code to pubsublite.

publishedMessage <- publisher.Publish(pubSubCtx, &pubsub.Message{
  Data:       loggable,
  Attributes: attributeMap,
})

The publishedMessage channel is read in a goroutine that listens for the publish results to track internal metrics and report errors.

func startListeningForMessages() {
	//var results []pubsub.PublishResult
	log.Printf("Listening for Messages")
	for {
		r := <-publishedMessage
		_, err := r.Get(pubSubCtx)
		if err != nil {
			// TODO: Handle error.
			// NOTE: The publisher will terminate upon first error. Create a new
			// publisher to republish failed messages.
			log.Printf("PubSub Lite Publish Error %s", err.Error())
			createPubSubLitePublisher()
		} else {
			//log.Printf("Published %s", messagePub)
			currentStats.MessageCountHandled++
		}
	}
}

Expected behavior

Message delivery stays stable. Note: this occurs with multiple clients on multiple machines at the same time, which leads me to believe that although there may be a client-side fix, the issue is a reaction to backend behavior. Starting a new client, or restarting the process, still results in no connection to the Pub/Sub Lite service.

Actual behavior

We are seeing Pub/Sub Lite behavior very similar to #1163.

[screenshot]

The behavior we see is that all messages stop publishing. "MessageCountHandled" (our internal metric incremented on each publish) does not increase, but the log.Printf("PubSub Lite Publish Error %s", err.Error()) line also never fires. Eventually the service produces a more useful message about being unavailable, followed by a segfault, and a restart then seems to fix things. Our app logs the following over and over before that happens:

rpc error: code = Unavailable desc = The service is currently unavailable.

This might go on for 20 minutes; then it logs this error from the publisher:

publisher(projects/78829200929/locations/us-east1-c/topics/prod_lite_ingest_event_stream/partitions/0): rpc error: code = Unavailable desc = The service is currently unavailable.

then

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x9d7f66]
goroutine 4087608 [running]:
cloud.google.com/go/pubsublite/pscompat.(*PublisherClient).transformMessage(0x0, 0xc0758c7dc0, 0xc0758c7e30, 0x0, 0xc0006caba8)
/home/stephen_johnston/go/pkg/mod/cloud.google.com/go/pubsublite@v0.7.0/pscompat/publisher.go:170 +0x26
cloud.google.com/go/pubsublite/pscompat.(*PublisherClient).Publish(0x0, 0xcb6730, 0xc00011a000, 0xc0758c7dc0, 0xc0758f22c8)
/home/stephen_johnston/go/pkg/mod/cloud.google.com/go/pubsublite@v0.7.0/pscompat/publisher.go:121 +0xac
main.publishRequestToPubSub(0xc000404a00, 0xc000118090, 0x24, 0xc000118090, 0x6, 0xc0001180c0, 0x24, 0xc0001180f0, 0x24, 0x640, ...)
/opt/pubwise/cpminator-api-golang/main.go:405 +0x2f2
created by main.handleAddEvent
/opt/pubwise/cpminator-api-golang/main.go:293 +0x478

Ultimately, I'd like to improve that error logic, but right now the behavior is unpredictable enough to make it seem like an upstream error needs to be resolved first.

GLStephen added the 'triage me' label Mar 14, 2021
product-auto-label bot added the 'api: pubsublite' label Mar 14, 2021
tmdiep (Contributor) commented Mar 16, 2021

Hi Stephen,

Sorry to hear that you have been experiencing these issues, and thank you for reporting. We have done some investigation and there is an operation on the server that is taking longer than expected; we are looking into a fix or mitigation. The client receives an Unavailable error, which the library retries internally, but it gives up after 1 minute of persistent retrying.

As a workaround, could you please try increasing the timeout on the client as follows:

settings := pscompat.DefaultPublishSettings
settings.Timeout = 10 * time.Minute
publisher, err := pscompat.NewPublisherClientWithSettings(ctx, topic, settings)

Let us know if this doesn't help, or if you have to increase the timeout even more.

tmdiep (Contributor) commented Mar 16, 2021

We decided to increase the default timeout to 1 hr in PR #3821.

If this had been a genuine service outage (this issue turned out to be a false positive), how would you prefer to handle it in your clients? Would you prefer to detect it and alert, or keep the tasks running until the Pub/Sub Lite service resumes normal operation?

GLStephen (Author) commented:

I responded in #3814 to the question above.

GLStephen (Author) commented:

@tmdiep so far so good; this change appears to have resolved the issue. We'll know for sure after another 24 hours.

tmdiep removed the 'triage me' label Mar 16, 2021
tmdiep (Contributor) commented Mar 16, 2021

Glad to hear that increasing the timeout has helped.

meredithslota added the 'priority: p2' and 'type: bug' labels Mar 17, 2021
GLStephen (Author) commented Mar 17, 2021

@tmdiep increasing the timeout helped, but eventually resulted in a failure like:

PubSub Lite Publish Error pubsublite: service has stopped or is stopping

I'm assuming this is the same issue as above? Based on the recommendation above, should we try pushing the timeout up further? Times below are Eastern US, March 17th.

[screenshot]

Based on memory utilization, the service only queued messages for a while before throwing the errors above. In previous scenarios the OOM killer would actually "recover" the service, but it looks like a 30-minute-plus timeout will be needed to keep it running, and we may need to be prepared to queue 30 minutes of messages, which even at current message volumes is likely impossible. The problem is, we're only at about 1/50th of our eventual message volume per machine. Letting the OOM killer "solve" the issue is not ideal, as you might imagine.

[screenshot]

I believe this is likely not the performance profile you want for the Pub/Sub Lite service. Is a backend issue contributing here? And assuming the above is ultimately unavoidable, the next question is how we detect recovery of the service and reconnect.

tmdiep (Contributor) commented Mar 18, 2021

Hi Stephen,

I'll break up this long response:

Timeout value

We have identified a likely cause of the server error, but the fix is yet to roll out to the us-east1 region (ETA by the end of this week).

Looking at the logs, there were several occurrences of the error that affected your topic (all ET):

  • 2021-03-17 20:28:46 - 20:29:06
  • 2021-03-17 12:04:59 - 12:06:02
  • 2021-03-17 11:55:54 - 11:56:36

I see that the publishing rate for your topic has remained steady from around 2021-03-17 9:10, so it looks like the client successfully survived these occurrences. A 10 minute timeout seems sufficient to overcome these errors, but you may want to increase it for more resilience.

Event at 8:00

What happened around 7:30 - 8:00, I can't fully explain. These are observations from our internal monitoring:

  • Just prior, the server was receiving ~6 messages per partition per minute.
  • At 7:36, all partitions dropped to 1 message per partition per minute, with the exception of partition=5, which dropped to zero and appeared to go offline (i.e. stopped establishing Publish gRPC streams with the server).
    • No server errors or anything unusual observed.
  • By 8:06, all remaining partitions went offline, so the PublisherClient must have terminated at that point.
  • At 2021-03-17 8:05:55 there was an occurrence of the server error. I'm not 100% sure if it was the cause of the PublisherClient terminating, or just a coincidence, given the other anomalies observed.

Can you please confirm the value of PublishSettings.Timeout that was set (10 min)? Do you only have 1 PublisherClient instance publishing to this topic? Also, are you setting OrderingKey for messages at all?

I will try to reproduce with an application publishing similar sized messages at a similar rate.

Service stopped error

The error pubsublite: service has stopped or is stopping is thrown when attempting to publish a message after the PublisherClient has already terminated. So unfortunately we missed the actual error that caused it to terminate.

Next time the PublisherClient terminates, please let me know:

  1. The error message returned by PublishResult of the first message that failed; and
  2. The error message returned by PublisherClient.Error.

Do your logs happen to contain (1) for the termination at 8:06?

GLStephen (Author) commented Mar 18, 2021

@tmdiep I appreciate the thorough response. Also, I'm happy to help with memory dumps, extra logging, etc. if it helps get this running well. We're very keen to move to PubSubLite.

Timeout Value

That sounds good. We'll adjust as needed. The publish volume per machine will increase significantly soon.

Event at 8:00

The Timeout was definitely at 10 minutes. There is one client (currently) publishing messages. There will be as many as 30 in the future. We are not using any message ordering.

At 7:36 Eastern the service is still handling messages:

Mar 17 11:32:17 pw-api-v8-tn5k cpminator-api-golang[454]: {"message":"Message Status", "context":{"RequestCount":1301779,"MessageCount":7015396,"MessageCountHandled":7015948}}

You can see that messages handled is increasing:

Mar 17 11:33:17 pw-api-v8-tn5k cpminator-api-golang[454]: {"message":"Message Status", "context":{"RequestCount":1302375,"MessageCount":7016183,"MessageCountHandled":7016173}}

but stalls at 7:34

Mar 17 11:34:17 pw-api-v8-tn5k cpminator-api-golang[454]: {"message":"Message Status", "context":{"RequestCount":1302963,"MessageCount":7016767,"MessageCountHandled":7016173}}

It stays stalled until 8:05 Eastern

Mar 17 12:05:17 pw-api-v8-tn5k cpminator-api-golang[454]: {"message":"Message Status", "context":{"RequestCount":1322877,"MessageCount":7036598,"MessageCountHandled":7016173}}
Mar 17 12:05:25 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error rpc error: code = Unavailable desc = 408:Request Timeout
Mar 17 12:05:25 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error rpc error: code = Unavailable desc = 408:Request Timeout

This continues until, just a few hundred messages later, it transitions to "service has stopped". I assume that with a lot more message volume, the timeout errors would just be whatever was in flight failing before the client stopped accepting new messages:

Mar 17 12:05:25 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error rpc error: code = Unavailable desc = 408:Request Timeout
Mar 17 12:05:25 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping

It continues to pump out these errors for quite a while. I assume these are retries, because the error rate seems much higher than the message rate the service was even handling? There are thousands of these within the 12:05:27 second:

Mar 17 12:05:27 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:05:27 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
.... over and over ... thousands of them ....
and again in the 28th second

Message handling seemed to not completely stop though:

Mar 17 12:06:17 pw-api-v8-tn5k cpminator-api-golang[454]: {"message":"Message Status", "context":{"RequestCount":1323565,"MessageCount":7145016,"MessageCountHandled":7035145}}

but then back to many errors per second (fewer now: hundreds rather than thousands):

Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping
Mar 17 12:06:29 pw-api-v8-tn5k cpminator-api-golang[454]: PubSub Lite Publish Error pubsublite: service has stopped or is stopping

and message handling is stalled

Mar 17 12:07:17 pw-api-v8-tn5k cpminator-api-golang[454]: {"message":"Message Status", "context":{"RequestCount":1324167,"MessageCount":7148506,"MessageCountHandled":7035145}}

Then the service was restarted and began running just fine:

Mar 17 13:29:35 pw-api-v8-tn5k cpminator-api-golang[16468]: {"message":"Message Status", "context":{"RequestCount":18647,"MessageCount":105939,"MessageCountHandled":105951}}

tmdiep (Contributor) commented Mar 19, 2021

Also, I'm happy to help with memory dumps, extra logging, etc. if it helps get this running well. We're very keen to move to PubSubLite.

We're glad to hear that. As this is a new service, we're actively improving its reliability when unexpected issues are observed in the wild. Thank you for your patience and offering to help!

The only change that I would suggest is to log PublisherClient.Error once you encounter a PublishResult that returns an error, and stop publishing. A failed PublishResult indicates that the publisher client has decided to terminate permanently, so subsequent attempts to publish will all result in a pubsublite: service has stopped or is stopping error.
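
For illustration, here is a rough sketch of how your startListeningForMessages loop could be adapted. It assumes the same package-level publisher, pubSubCtx, publishedMessage and currentStats as in your code, and the recovery step is just a placeholder:

func startListeningForMessages() {
	log.Printf("Listening for Messages")
	for {
		r := <-publishedMessage
		if _, err := r.Get(pubSubCtx); err != nil {
			// The first failed PublishResult means the publisher client is
			// terminating permanently; log the root cause and stop publishing.
			log.Printf("PubSub Lite Publish Error %s", err.Error())
			log.Printf("PubSub Lite Publisher Error %v", publisher.Error())
			publisher.Stop()
			return // recreate the publisher before resuming publishing
		}
		currentStats.MessageCountHandled++
	}
}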

Event at 8:00

Thank you for the extra logs and info. It does clarify what was happening.

Since your topic has 10 partitions, under the hood there are 10 parallel gRPC streams connected to the Pub/Sub Lite service and 10 corresponding publish queues.

(Using log timestamps:)

  1. Mar 17 11:34:17 - The publish stream for partition=5 was disconnected and the library attempted to reconnect a new stream. The other partitions actually continued to publish messages successfully, but because the PublishResults were put in a FIFO channel, the consumer was blocked on a PublishResult.Get() call for a message destined for partition 5, which explains why MessageCountHandled stalled at 7016173 for a while (see the sketch after this list).
  2. Mar 17 12:05:25 - The reconnect attempt for partition=5 finally timed out. rpc error: code = Unavailable desc = 408:Request Timeout is a different issue from the backend error that we discovered. There must be a 30 min timeout somewhere along the path to the Pub/Sub Lite backend servers.
  3. Since more than 10 minutes had elapsed, the PublisherClient decided to permanently terminate, so it shut down the publish streams of all other partitions.
  4. Once the PublisherClient decided to terminate, it released all PublishResults, unblocking the Get() call and flushing the publishedMessage channel. Hence MessageCountHandled increased to 7035145.
  5. After the PublisherClient terminated, all subsequent calls to PublisherClient.Publish were rejected, resulting in the pubsublite: service has stopped or is stopping errors.
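
As an aside, the sketch below (purely illustrative; handled is a placeholder counter, not part of your code) shows how waiting on each PublishResult in its own goroutine avoids the head-of-line blocking described in point 1, where one slow partition stalls a shared FIFO channel:

var handled int64 // placeholder for currentStats.MessageCountHandled; must now be updated atomically

result := publisher.Publish(pubSubCtx, &pubsub.Message{Data: loggable, Attributes: attributeMap})
go func(r *pubsub.PublishResult) {
	if _, err := r.Get(pubSubCtx); err != nil {
		log.Printf("PubSub Lite Publish Error %s", err.Error())
		return
	}
	atomic.AddInt64(&handled, 1)
}(result)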

Partition 5 was assigned to the same backend server task as a few other partitions, so it wasn't a case of the task being down. Somewhere along the route to that server task, the connection was stuck. Unfortunately I currently don't have information to determine where this occurred.

Coincidentally, the backend error occurred at 2021-03-17 8:05:55 just to confuse us :)

Please let us know if you encounter this scenario again. 30 min downtime is a concerning duration, which we definitely need to eliminate and will investigate.

GLStephen (Author) commented:

@tmdiep I appreciate the info. I will continue to provide updates as I have them. Based on your info above, the "PubSub Lite Publish Error rpc error: code = Unavailable desc = 408:Request Timeout" error was actually partial in the sense that it affected only one partition, but since it is a PublishResult error I should just shut down publishing at that point?

tmdiep (Contributor) commented Mar 19, 2021

Yes, please shut down publishing. Although the publisher client consists of individual partitions, it tries to hide this implementation detail and terminates everything if any one partition encounters a fatal error.

tmdiep (Contributor) commented Mar 26, 2021

Hi Stephen,

As I mentioned in the other issue, v0.8 has some improvements that allow better handling of backend unavailability. We have increased the default value of PublishSettings.Timeout to a very high value, which allows the PublisherClient to survive the issues that you've observed. But if your application needs to detect prolonged backend unavailability, we suggest setting a lower timeout and handling it accordingly.
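
For example, a rough sketch of opting into a lower timeout (topicPath and the 15-minute value are placeholders, not recommendations):

settings := pscompat.DefaultPublishSettings
settings.Timeout = 15 * time.Minute // how long to tolerate backend unavailability before failing
publisher, err := pscompat.NewPublisherClientWithSettings(ctx, topicPath, settings)
if err != nil {
	log.Fatalf("failed to create publisher: %v", err)
}
// If the backend is unavailable for longer than Timeout, PublishResults start
// failing and PublisherClient.Error() reports the cause, so the application
// can alert and/or create a new publisher.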

The 30min window of unavailability (it's actually due to a stream reconnection getting stuck) is still under investigation. We have added some monitoring on the server to try to determine the root cause. We are considering adding mitigations in the client libraries.

tmdiep (Contributor) commented Jun 10, 2021

Closing this issue, as the remaining anomaly (gRPC stream connection stuck for 30min) hasn't been seen for over a month. Monitoring suggested that the connection request never reached Pub/Sub Lite servers. Historically, it has been caused by grpc-go.

tmdiep closed this as completed Jun 10, 2021