
Pubsub: Unbounded retry forced when publishing with message ordering enabled #1702

Open
poernahi opened this issue Aug 17, 2023 · 5 comments
Assignees
Labels
api: pubsub Issues related to the googleapis/java-pubsub API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@poernahi

Summary

When publishing to pubsub with message ordering enabled, the library overrides the gRPC retry settings and forces unbounded retry. The application using this library is then unable to detect failures. The workaround is to add timeout when waiting for the returned future, but this makes the retry/timeout policy scattered and surprising to configure.

This is the code that enforces infinite retry (Publisher.java, lines 174 to 180):
https://github.com/googleapis/java-pubsub/blob/main/google-cloud-pubsub/src/main/java/com/google/cloud/pubsub/v1/Publisher.java#L168

It is unclear why the infinite retry never eventually succeeds. We have seen multiple random occurrences where applications running in a GKE cluster stop publishing; in some instances we waited for the application to retry for up to a week without success. Unfortunately, we do not have a reliable way to trigger this type of failure, and we did not capture low-level gRPC logs when it happened.

My suggestion is to let the library user set a sane upper bound on retries, and to propagate failures through the onFailure callback once that bound is exhausted.
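The requested semantics can be sketched in plain Java, with no Pub/Sub client involved. `publishWithBound`, its parameters, and the callback shape are hypothetical names for illustration, not the library's API; the point is only the behavior: retry up to a cap, then surface the last error through an onFailure-style callback instead of looping forever.

```java
import java.util.concurrent.Callable;
import java.util.function.Consumer;

public class BoundedRetryDemo {

    // Retry `task` up to maxAttempts times; if every attempt fails, hand the
    // last error to `onFailure` instead of retrying forever. All names here
    // are hypothetical, not the java-pubsub API.
    public static <T> T publishWithBound(Callable<T> task, int maxAttempts,
                                         Consumer<Throwable> onFailure) {
        Throwable last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;  // remember the most recent failure and retry
            }
        }
        onFailure.accept(last);  // bounded: surface the error to the caller
        return null;
    }

    public static void main(String[] args) {
        // A task that always fails stands in for a publish that never succeeds.
        String result = publishWithBound(
            () -> { throw new RuntimeException("UNAVAILABLE"); },
            3,
            t -> System.out.println("onFailure: " + t.getMessage()));
        System.out.println("result = " + result);
    }
}
```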

Environment details

  1. API: Pubsub
  2. OS type and version: Ubuntu 22.04
  3. Java version: temurin-17.0.8
  4. Library version(s): google-cloud-pubsub 1.123.12

Steps to reproduce

  1. Enable message ordering in publisher
  2. Publish messages
  3. Simulate failure, such as network disconnect

Code example

Based on the sample publish code in the documentation (imports and a placeholder topic added here for completeness):

    import com.google.api.core.ApiFuture;
    import com.google.api.core.ApiFutureCallback;
    import com.google.api.core.ApiFutures;
    import com.google.api.gax.rpc.ApiException;
    import com.google.cloud.pubsub.v1.Publisher;
    import com.google.common.util.concurrent.MoreExecutors;
    import com.google.protobuf.ByteString;
    import com.google.pubsub.v1.PubsubMessage;
    import com.google.pubsub.v1.TopicName;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;

    TopicName topicName = TopicName.of("my-project-id", "my-topic");
    Publisher publisher =
        Publisher.newBuilder(topicName)
            // Sending messages to the same region ensures they are received in order
            // even when multiple publishers are used.
            .setEndpoint("us-east1-pubsub.googleapis.com:443")
            .setEnableMessageOrdering(true)
            .build();

    try {
      Map<String, String> messages = new LinkedHashMap<String, String>();
      messages.put("message1", "key1");
      // auto-generate many more messages here ...

      for (Map.Entry<String, String> entry : messages.entrySet()) {
        ByteString data = ByteString.copyFromUtf8(entry.getKey());
        PubsubMessage pubsubMessage =
            PubsubMessage.newBuilder().setData(data).setOrderingKey(entry.getValue()).build();
        ApiFuture<String> future = publisher.publish(pubsubMessage);

        // Add an asynchronous callback to handle publish success / failure.
        ApiFutures.addCallback(
            future,
            new ApiFutureCallback<String>() {

              // onFailure is never called, because the publisher will retry indefinitely

              @Override
              public void onFailure(Throwable throwable) {
                if (throwable instanceof ApiException) {
                  ApiException apiException = ((ApiException) throwable);
                  // Details on the API exception.
                  System.out.println(apiException.getStatusCode().getCode());
                  System.out.println(apiException.isRetryable());
                }
                System.out.println("Error publishing message : " + pubsubMessage.getData());
              }

              @Override
              public void onSuccess(String messageId) {
                // Once published, returns server-assigned message ids (unique within the topic).
                System.out.println(pubsubMessage.getData() + " : " + messageId);
              }
            },
            MoreExecutors.directExecutor());
      }
    } finally {
      // When finished with the publisher, shutdown to free up resources.
      publisher.shutdown();
      publisher.awaitTermination(1, TimeUnit.MINUTES);
    }

Stack trace

No stack trace is produced; the publisher retries indefinitely instead of failing.


@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/java-pubsub API. label Aug 17, 2023
@maitrimangal
Member

Hi, I tried to reproduce this error but was unable to do so. Could you please open a customer support issue so we can get more insight into why your publish calls are failing?

Regarding an upper bound on retries so they are not infinite: I will bring it up in my next client meeting and discuss it further!

I am keeping this bug open for now to post any updates on the upper bound for retries, but we will be unable to debug your exact issue until we get more insight into the publish calls. Thanks!

@maitrimangal maitrimangal added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Aug 18, 2023
@maitrimangal
Member

The reason we do not have an upper bound on retries is that if we stopped retrying, we would have to fail the message, and since ordering is enabled, we would also have to fail all subsequent messages on that ordering key to preserve message ordering. To prevent this, we retry the message indefinitely.

I think the best way forward here is to understand why the publishes are failing in the first place.
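The ordering trade-off described above can be illustrated with a self-contained toy (the class and method bodies are invented for illustration; only the name resumePublish mirrors the real client's recovery method): once a publish for an ordering key fails, later publishes on that key must be rejected until the user explicitly resumes, which is why the client currently prefers to retry forever.

```java
import java.util.HashSet;
import java.util.Set;

public class OrderingFailureDemo {

    // Keys whose earlier publish failed; further sends on them are rejected.
    private final Set<String> failedKeys = new HashSet<>();

    // Toy publish: the payload "boom" simulates a permanent failure. Invented
    // for illustration, not the client's implementation.
    public String publish(String orderingKey, String data) {
        if (failedKeys.contains(orderingKey)) {
            // Accepting this message would deliver it after a gap in the
            // sequence, violating the ordering guarantee.
            return "REJECTED";
        }
        if (data.equals("boom")) {
            failedKeys.add(orderingKey);
            return "FAILED";
        }
        return "OK";
    }

    // The user explicitly accepts the gap in the sequence, as with the real
    // client's Publisher.resumePublish(orderingKey).
    public void resumePublish(String orderingKey) {
        failedKeys.remove(orderingKey);
    }

    public static void main(String[] args) {
        OrderingFailureDemo d = new OrderingFailureDemo();
        System.out.println(d.publish("k1", "m1"));    // succeeds
        System.out.println(d.publish("k1", "boom"));  // fails, poisons k1
        System.out.println(d.publish("k1", "m2"));    // rejected: k1 is poisoned
        System.out.println(d.publish("k2", "m3"));    // other keys unaffected
        d.resumePublish("k1");
        System.out.println(d.publish("k1", "m4"));    // succeeds again
    }
}
```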

@poernahi
Author

I have been able to reproduce this condition with low-level gRPC logging enabled; the cleaned-up logs are below. The application publishes with ordering keys enabled, but in this scenario most messages have distinct keys.

(successful publish operations ...)
OUTBOUND HEADERS POST /google.pubsub.v1.Publisher/Publish ...
INBOUND HEADERS status: 200
INBOUND GO_AWAY errorCode=0 load_shed
OUTBOUND RST_STREAM errorCode=5 (repeated multiple times with different streamId)
(successful publish operations ...)

After this point, there are other successful publish operations. However, the future returned by the publish call that coincides with the stream termination above never completes, with or without an exception. If I wrap that future with a hardcoded timeout, for example 120 seconds, I consistently see a timeout exception 120 seconds after the GO_AWAY load_shed event.
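The timeout wrapper described above can be sketched with a stdlib stand-in (a plain CompletableFuture replaces the ApiFuture, since the hang only occurs inside the library, and the 120-second bound is shortened for the demo):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FutureTimeoutDemo {

    // Bound the wait on a publish future. Returns the message id, "TIMEOUT",
    // or an error string, so a stuck future can no longer hang the caller.
    public static String waitWithTimeout(Future<String> future, long millis) {
        try {
            return future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "TIMEOUT";
        } catch (Exception e) {
            return "ERROR: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A future that is never completed stands in for the stuck publish.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        System.out.println(waitWithTimeout(stuck, 500));  // the report used 120s
    }
}
```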

Looking at the source code of this java-pubsub library, my naive expectation is that a terminated stream like this should surface as an UNAVAILABLE gRPC error code, which would then be auto-retried. Or, if a non-retryable error code is produced instead, an error should be raised to the library user.

So far, all instances of this issue that I have observed are tightly correlated to server load_shed events.

The library version has since been updated to 1.125.5.

@maitrimangal maitrimangal reopened this Oct 26, 2023
@maitrimangal maitrimangal added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Oct 26, 2023
@maitrimangal maitrimangal added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Dec 19, 2023
@maitrimangal maitrimangal self-assigned this Dec 19, 2023
@michaelpri10
Contributor

A potential solution here would be to have a user-set maxAttempts field overrule any other configuration (i.e., skip forcing unbounded retry for ordered messages). However, this could cause message loss because, as mentioned above, we would need to fail all subsequent messages on the same ordering key. I'll need to have more discussions before deciding whether we should do this, as it could break the configuration of current users.

@michaelpri10
Contributor

Had some more discussion about this, and we did talk about the possibility of having user-specified retry settings take precedence over any other configuration. However, we would want to coordinate this change across all client libraries so that the behavior is consistent in every language.

@michaelpri10 michaelpri10 added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Feb 29, 2024