Improve Triaging of failing messages in Services / Components #1365

weberjm · 2021-11-29T16:03:39Z

Problem

A message can fail in its consuming service for a variety of reasons. Sometimes the cause is transient, such as a temporary problem in one of the underlying systems the service depends on. In other cases, the message itself is incorrect or lacks information required by downstream systems, so it is unlikely to ever succeed.

Examples:

Ferryman calls reject() or nack() for many reasons, including when the requested action does not exist. These messages are requeued even though they have little chance of ever succeeding.
In the Snapshots service, any error while processing a message results in a nack() call with requeue explicitly set to true. The service therefore tries to re-process the message indefinitely, and the message is prioritized above subsequent requests.

Proposal

Recently, rebound and reject queues have been implemented in the framework. We now have multiple options for dealing with a failing message:

Use the nack() or reject() function to requeue the message immediately, prioritized ahead of newer messages
Send the message to the rebound queue to delay the retry
Fail the message and optionally log the error
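
As a rough illustration of the three options above, a handler could return a small "disposition" value that is then applied in one place. This is only a sketch: `Disposition`, `QueueActions`, `applyDisposition`, and the `delayMs` parameter are hypothetical names, not the framework's actual API, and it assumes that failing a message means rejecting it without requeue.

```typescript
// Hypothetical names throughout; none of this is the framework's real API.
type Disposition =
  | { kind: "requeue" }                   // nack()/reject() with requeue: retried immediately
  | { kind: "rebound"; delayMs: number }  // sent to the rebound queue: retried after a delay
  | { kind: "fail"; log: boolean };       // not retried, optionally logged

// Minimal stand-in for whatever the queue client exposes to a handler.
interface QueueActions {
  nack(requeue: boolean): void;
  rebound(delayMs: number): void;
  logError(error: Error): void;
}

function applyDisposition(d: Disposition, error: Error, queue: QueueActions): void {
  switch (d.kind) {
    case "requeue":
      queue.nack(true);
      break;
    case "rebound":
      queue.rebound(d.delayMs);
      break;
    case "fail":
      if (d.log) {
        queue.logError(error);
      }
      // Assumption: "fail" means the message is rejected without requeue.
      queue.nack(false);
      break;
  }
}
```

Keeping the decision (which disposition) separate from the mechanics (how it is applied) would make it easier to review each handler's choices without touching queue code.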

Each queue message handler in the framework should be reviewed, and a decision made on how to handle the different types of failure. Priority should be given to ensuring that messages are never requeued infinitely. Once these decisions have been made and implemented, a short document describing the reasoning, together with an implementation guide for future usage, should be written.

In some cases, we may wish to make the behavior configurable, and the different actions could be built into wrapper functions and delivered via npm, like the event-bus.
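
One possible shape for such a configurable wrapper is sketched below. Everything in it is an assumption made for illustration: `withFailurePolicy`, `FailurePolicy`, the `classify` and `perform` callbacks, and the action names are hypothetical and do not describe the event-bus package or any existing framework code. Handlers are synchronous here for brevity; real handlers would likely be asynchronous.

```typescript
// All names below are hypothetical; this is not an existing framework API.
type Action = "requeue" | "rebound" | "fail";

interface FailurePolicy {
  // Maps a failure class name to the action taken for it.
  actions: { [failureClass: string]: Action | undefined };
  // Action used when a failure class has no entry.
  fallback: Action;
}

type Handler<M> = (message: M) => void;

// Wraps a handler so that any thrown error is classified and then
// handled according to the configured policy.
function withFailurePolicy<M>(
  handler: Handler<M>,
  classify: (error: unknown) => string,
  policy: FailurePolicy,
  perform: (action: Action, message: M, error: unknown) => void,
): Handler<M> {
  return (message: M): void => {
    try {
      handler(message);
    } catch (error) {
      const action = policy.actions[classify(error)] ?? policy.fallback;
      perform(action, message, error);
    }
  };
}
```

With this shape, each service would supply its own policy object, so the behavior could be changed through configuration without editing the handler itself. The fallback deserves care: choosing "requeue" there would reintroduce the infinite requeue problem for unclassified errors.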


weberjm · 2022-03-09T11:10:14Z

Many event listeners call nack() or reject() directly. This requeues the message immediately, as close to the head of the queue as possible.

Hans has recently implemented the rebound queue, so the response can now be chosen based on the situation:

If an external service is not available, rebound the message to give the service time to recover to a healthy state
If a message is incorrect, reject it without automatic requeue
If a service is not yet available but is somehow still receiving messages, reject the message and requeue it
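
The mapping above could be written down as a simple lookup table, which also makes a natural starting point for the planned documentation. The failure class and response names in this sketch are hypothetical labels chosen for illustration, not identifiers from the framework.

```typescript
// Hypothetical labels for the three situations listed above.
type FailureClass =
  | "external-service-unavailable"
  | "invalid-message"
  | "service-not-ready";

type Response = "rebound" | "reject" | "reject-and-requeue";

// One response per failure class; the Record type makes the compiler
// flag any failure class that has no entry.
const RESPONSE_BY_FAILURE: Record<FailureClass, Response> = {
  "external-service-unavailable": "rebound",   // give the service time to recover
  "invalid-message": "reject",                 // will never succeed, so do not requeue
  "service-not-ready": "reject-and-requeue",   // retry once the service is available
};

function chooseResponse(failure: FailureClass): Response {
  return RESPONSE_BY_FAILURE[failure];
}
```

Because the table is typed as a `Record` over every failure class, adding a new class without deciding on its response would fail to compile.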

weberjm · 2022-03-09T11:12:35Z

TBD: Do we make retry behavior configurable for the system and, if so, when and how? Or do we decide on a specific behavior for each "type" of failure?

To Do:

Determine different failure classes
Determine appropriate response type by class
Implement the appropriate reject/retry/rebound logic for each decided response
- In Services
- In Ferryman
- Potentially in Components themselves

weberjm added the enhancement (New feature or request), general (affects multiple services or domains), and message oriented middleware labels Nov 29, 2021

weberjm added the epic label Mar 9, 2022
