Improve Triaging of failing messages in Services / Components #1365

Open
weberjm opened this issue Nov 29, 2021 · 2 comments
Labels
enhancement New feature or request epic general affects multiple services or domains message oriented middleware

Comments

weberjm commented Nov 29, 2021

Problem

A message can fail in its consuming service for a variety of reasons. Sometimes the cause is transient, such as a temporary problem in one of the underlying systems the service relies on. In other cases, the message is incorrect or lacks information that downstream systems require, so it is unlikely ever to succeed.

Examples:

  • Ferryman calls reject() or nack() for many reasons, including when the requested action does not exist. These messages are requeued even though they have little chance of ever succeeding.
  • In the Snapshots service, any error while processing a message causes the service to call nack() with requeue explicitly set to true. The message is then re-processed indefinitely and takes priority over subsequent requests.
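To illustrate why unconditional requeueing is a problem, here is a minimal in-memory sketch. It is not the Snapshots or Ferryman code: `QueuedMessage` and `simulateRequeue` are hypothetical names, and the queue is a plain array in which a requeued message is assumed to return to the front. A real broker may order requeued messages differently.

```typescript
// Hypothetical in-memory model; it does not use the framework's or the broker's real API.
type QueuedMessage = { id: string; valid: boolean };

// Delivers messages from the front of the queue, stopping after maxDeliveries
// so that the simulation terminates even when a message can never succeed.
function simulateRequeue(queue: QueuedMessage[], maxDeliveries: number): string[] {
  const pending = queue.slice();
  const log: string[] = [];
  for (let delivery = 0; delivery < maxDeliveries && pending.length > 0; delivery++) {
    const message = pending.shift() as QueuedMessage;
    if (message.valid) {
      log.push("ack " + message.id);
    } else {
      // Assumed behaviour of nack() with requeue = true: back to the front of the queue.
      log.push("requeue " + message.id);
      pending.unshift(message);
    }
  }
  return log;
}

const deliveryLog = simulateRequeue(
  [
    { id: "bad", valid: false },
    { id: "good", valid: true }
  ],
  5
);
// deliveryLog holds five "requeue bad" entries; "good" is never delivered.
```

Under these assumptions the invalid message consumes every delivery attempt and the valid message behind it is never processed.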

Proposal

Recently, rebound and reject queues have been implemented in the framework. We now have multiple options for dealing with a failing message:

  • Call nack() or reject() to requeue the message immediately, ahead of later messages
  • Send the message to the rebound queue to delay the retry
  • Fail the message and optionally log the error
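The three options above could be modelled roughly as follows. This is only a sketch: `FailureAction`, `FailureChannel`, `sendToRebound` and `applyFailureAction` are hypothetical names, not the framework's actual API, and "fail" is assumed here to mean a nack() without requeue.

```typescript
// All names below are illustrative assumptions, not the framework's real interface.
type FailureAction =
  | { kind: "requeue" }
  | { kind: "rebound"; delayMs: number }
  | { kind: "fail"; logError: boolean };

// Minimal stand-in for whatever channel object a handler receives.
interface FailureChannel {
  nack(messageId: string, requeue: boolean): void;
  sendToRebound(messageId: string, delayMs: number): void;
}

function applyFailureAction(
  channel: FailureChannel,
  messageId: string,
  action: FailureAction,
  error: Error,
  log: (line: string) => void
): void {
  switch (action.kind) {
    case "requeue":
      // Immediate retry, ahead of later messages.
      channel.nack(messageId, true);
      break;
    case "rebound":
      // Delayed retry via the rebound queue.
      channel.sendToRebound(messageId, action.delayMs);
      break;
    case "fail":
      // Give up on the message; logging is optional.
      if (action.logError) {
        log("message " + messageId + " failed: " + error.message);
      }
      channel.nack(messageId, false);
      break;
  }
}

// Recording stand-in so the sketch runs without a broker.
const calls: string[] = [];
const recordingChannel: FailureChannel = {
  nack: (id, requeue) => { calls.push("nack " + id + " requeue=" + requeue); },
  sendToRebound: (id, delayMs) => { calls.push("rebound " + id + " delayMs=" + delayMs); }
};
const logged: string[] = [];
const logLine = (line: string) => { logged.push(line); };

applyFailureAction(recordingChannel, "m1", { kind: "requeue" }, new Error("busy"), logLine);
applyFailureAction(recordingChannel, "m2", { kind: "rebound", delayMs: 5000 }, new Error("upstream down"), logLine);
applyFailureAction(recordingChannel, "m3", { kind: "fail", logError: true }, new Error("unknown action"), logLine);
```

Keeping the action as plain data separates the decision (which action fits this failure) from its execution, which may make the per-handler review easier to document.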

Each queue message handler in the framework should be reviewed, and a decision made on how to handle different types of failures. Priority should be given to ensuring that messages do not find themselves being infinitely requeued. Following these decisions and implementation, a short document describing the reasoning and an implementation guide for future usage should be written.

In some cases, we may wish to make the behavior configurable, and the different actions could be built into wrapper functions and delivered via npm, like the event-bus.
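As a rough idea of what such a configurable wrapper might look like, consider the sketch below. Everything in it is an assumption for illustration: `withFailurePolicy`, `FailurePolicy`, the two failure kinds, and the default policy are hypothetical and not part of the event-bus or any existing package.

```typescript
// Hypothetical names throughout; a sketch of a configurable wrapper, not an existing API.
type FailureKind = "transient" | "permanent";
type RetryAction = "requeue" | "rebound" | "fail";
type FailurePolicy = Record<FailureKind, RetryAction>;
type HandlerResult = "ack" | RetryAction;

// Assumed defaults: delay transient failures, never requeue permanent ones.
const defaultPolicy: FailurePolicy = { transient: "rebound", permanent: "fail" };

function withFailurePolicy<T>(
  handler: (message: T) => void,
  classify: (error: unknown) => FailureKind,
  overrides: Partial<FailurePolicy> = {}
): (message: T) => HandlerResult {
  const policy: FailurePolicy = { ...defaultPolicy, ...overrides };
  return (message: T): HandlerResult => {
    try {
      handler(message);
      return "ack";
    } catch (error) {
      // The caller decides how to carry out the returned action.
      return policy[classify(error)];
    }
  };
}

// Example: a service that prefers immediate requeue for transient failures.
const handleAction = withFailurePolicy<{ action: string }>(
  (message) => {
    if (message.action === "slow") {
      throw new Error("timeout contacting upstream");
    }
    if (message.action !== "create") {
      throw new Error("unknown action: " + message.action);
    }
  },
  (error) =>
    error instanceof Error && error.message.indexOf("timeout") === 0 ? "transient" : "permanent",
  { transient: "requeue" }
);

const createdResult = handleAction({ action: "create" });
const slowResult = handleAction({ action: "slow" });
const unknownResult = handleAction({ action: "explode" });
```

Here the per-service configuration is just the `overrides` object, so a service could change its retry behavior without touching the handler itself.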

@weberjm weberjm added enhancement New feature or request general affects multiple services or domains message oriented middleware labels Nov 29, 2021
@weberjm weberjm added the epic label Mar 9, 2022
weberjm commented Mar 9, 2022

Many event listeners call nack() or reject() directly. This requeues the message immediately, as close to the front of the queue as possible.

Hans has recently implemented the rebound queue, so the response can now be chosen based on the failure state:

  • If an external service is not available, rebound the message to give the service time to recover to a healthy state
  • If a message is incorrect, reject it without automatic requeue
  • If a service is not yet available but is somehow still receiving messages, reject and requeue
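The mapping from state to response described in this list could be written down as a small lookup table. This is a sketch only: the state names, `FailureResponse`, and `chooseResponse` are hypothetical, and how a listener detects each state is left open.

```typescript
// Hypothetical state and response names; they only encode the three cases listed above.
type FailureState =
  | "external-service-unavailable"
  | "invalid-message"
  | "service-not-ready";

type FailureResponse =
  | { type: "rebound" }
  | { type: "reject"; requeue: boolean };

const responseByState: Record<FailureState, FailureResponse> = {
  // Give the external service time to recover.
  "external-service-unavailable": { type: "rebound" },
  // The message will never succeed, so do not requeue it.
  "invalid-message": { type: "reject", requeue: false },
  // Another instance may be ready, so put the message back.
  "service-not-ready": { type: "reject", requeue: true }
};

function chooseResponse(state: FailureState): FailureResponse {
  return responseByState[state];
}

const invalidMessageResponse = chooseResponse("invalid-message");
const unavailableResponse = chooseResponse("external-service-unavailable");
const notReadyResponse = chooseResponse("service-not-ready");
```

Because the table is typed as a `Record` over `FailureState`, adding a new state without deciding its response would be a compile-time error.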

weberjm commented Mar 9, 2022

TBD: Do we make retry behavior configurable for the system and, if so, when and how? Or do we decide on a specific behavior for each "type" of failure?

To Do:

  • Determine different failure classes
  • Determine appropriate response type by class
  • Implement the appropriate reject/retry/rebound logic for each decided response
    • In Services
    • In Ferryman
    • Potentially in Components themselves
