
[FR] Notification backoff - avoid flooding notifications during prolonged outage #317

Open
jinnatar opened this issue Nov 26, 2021 · 4 comments
Labels: enhancement, good first issue

Comments

@jinnatar
Contributor

TL;DR: a prolonged harvester outage of, say, 8 hours during sleep produces a notification every 5 minutes per harvester. This is annoying to wake up to, and it may also consume limited resources or credits of the notification service. Instead, an exponential backoff should be used, which can easily cut consumption by 90% while still providing much the same level of urgency to an operator.

Steps to produce suboptimal behavior:

  1. Full Node craps itself out at 01:00. The operator is sound asleep and does not react to the notifications received from harvesters: "Your harvester appears to be offline! No events for the past 400 seconds."
  2. Harvesters remain unable to connect and keep sending the notification every 5 minutes. With 2 remote harvesters, this adds up to 192 alerts over 8 hours of sleep. This is not hypothetical but exactly what happened to me. :-)

Better behavior:

  1. Full Node craps itself out at 01:00. Operator is still blissfully sleeping and ignoring notifications.
  2. Every harvester alerts: "Your harvester appears to be offline! No events for the past 400 seconds."
  3. Instead of a fixed 5-minute alert threshold, use an exponential backoff algorithm. Values can be experimented with at https://exponentialbackoffcalculator.com/; I'll use the following as an example:
    • Interval: 300 seconds
    • Max retries: 11 (just for illustrative purposes, no need to implement a max retry)
    • Exponential: 1.5
  4. With the values above, notifications would be triggered at the following elapsed times since the start of the incident (each measured from the incident start, not from the previous notification):
    • 5m
    • 12.5m
    • 23.75m
    • 40m
    • 66m
    • 104m
    • 2.7h
    • 4.1h
    • 6.2h
    • 9.4h
  5. As an end result, from 2 harvesters our operator would wake up to roughly 20 notifications instead of 192. It is still clear that the issue is long-term and ongoing, yet only about 10% of the resources of the naive approach have been consumed. At the start of the incident, notifications are still received fairly often.

Solution summary:

  • Decide on an exponential rate that makes sense; my example above uses 1.5, which produces ~90% savings in the example scenario with very little noticeable effect for most alert scenarios.
  • Calculate the delay before each successive notification with the following formula: T(retryNumber) = interval * exponential_rate ^ retryNumber
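
For illustration, here's a minimal Python sketch of that formula using the example values above (interval = 300 s, rate = 1.5); it's not chiadog code, it just reproduces the deltas listed in the scenario:

```python
# Illustrative only, not chiadog code: compute the per-retry backoff delay
# T(n) = interval * exponential_rate ** n and the elapsed time since the
# start of the incident at which each notification would fire.

INTERVAL = 300  # seconds, matching the existing keep-alive check interval
RATE = 1.5      # exponential backoff rate


def backoff_delay(retry: int) -> float:
    """Delay in seconds before notification number `retry + 1`."""
    return INTERVAL * RATE ** retry


elapsed = 0.0
for retry in range(10):
    elapsed += backoff_delay(retry)
    print(f"notification {retry + 1}: {elapsed / 60:.1f} min after the incident started")
```

Running that prints 5.0, 12.5, 23.8, 40.6, 65.9, 103.9, 160.9, 246.3, 374.4 and 566.7 minutes, matching the deltas listed in the scenario above.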
@martomi
Owner

martomi commented Dec 4, 2021

Good feature design 👍 I'd be happy to upstream this feature if someone has worked on it or plans to work on it!

martomi added the enhancement and good first issue labels on Dec 4, 2021
@jinnatar
Contributor Author

jinnatar commented Dec 4, 2021

Looking at the code responsible for this, it seems to me the keep-alive threshold should still define the check interval; the modification would then be to:

  1. In keep_alive_monitor, keep and increment an event counter while the threshold is met, and reset it once it's no longer met.
  2. Calculate the notification thresholds from the counter using the formula above.
  3. Only trigger event creation once a sufficient delta has passed, OR pass in a notify=False flag which causes the event to be logged but not sent to the notification channels. (That could be a useful feature for other things as well, and it might be less confusing for someone unaware of the backoff if the logs say "offline for 50000 seconds but not notifying until Y seconds to avoid flooding you.")

Doesn't seem too difficult to implement. If the above plan sounds OK, I should be able to eventually knock something out. My immediate issue is gone, though, thanks to automatic service restarts based on monitoring TCP 8444. :-)
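
To make the plan concrete, here's a rough sketch of what that counter could look like (names are made up for illustration and don't reflect the actual keep_alive_monitor code):

```python
class KeepAliveBackoff:
    """Illustrative sketch of the plan above, not chiadog's actual code.

    The keep-alive check itself still runs on its normal fixed schedule;
    this helper only decides whether a given failed check should also
    trigger a notification or merely be logged.
    """

    def __init__(self, interval: float = 300.0, rate: float = 1.5):
        self.interval = interval   # initial delay in seconds
        self.rate = rate           # exponential backoff rate
        self.retries = 0           # notifications sent during the current outage
        self.next_due = interval   # elapsed seconds at which the next notification fires

    def reset(self) -> None:
        """Call once keep-alive events are flowing again."""
        self.retries = 0
        self.next_due = self.interval

    def should_notify(self, seconds_since_last_event: float) -> bool:
        """Return True if this failed check should produce a notification."""
        if seconds_since_last_event < self.next_due:
            return False  # threshold still exceeded, but only log for now
        self.retries += 1
        self.next_due += self.interval * self.rate ** self.retries
        return True


# Hypothetical usage inside the periodic keep-alive check:
#
#   if seconds_since_last_event > keep_alive_threshold:
#       if backoff.should_notify(seconds_since_last_event):
#           ...create the "harvester appears to be offline" event...
#       else:
#           logging.info("Offline for %ds, next notification at %ds to avoid flooding",
#                        seconds_since_last_event, backoff.next_due)
#   else:
#       backoff.reset()
```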

@martomi
Owner

martomi commented Dec 5, 2021

Yes, sounds good. I like the idea of still logging all the events, with information about why the notification was skipped. I think it's OK to handle this entirely in the keep_alive_monitor component, because the notify_manager doesn't have any function besides notification at the moment (with regard to passing notify=False, if I understood that correctly).

That is, unless we want a more general notification throttling that works across all event types. But as I understand it, this is currently the worst-offending notification type.
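
If we ever did want the more general mechanism, one way to sketch the notify=False idea (hypothetical names only; not the actual event or notify_manager interfaces) would be to carry a flag on the event and have the manager log suppressed events instead of sending them:

```python
from dataclasses import dataclass
import logging


@dataclass
class ThrottledEvent:
    """Hypothetical event shape, for illustration only."""
    message: str
    notify: bool = True  # False => event is logged but no notification is sent


def handle_event(event: ThrottledEvent, send) -> None:
    """Sketch of a notify-manager hook that honours the flag.

    `send` stands in for whatever pushes messages to the configured
    notification channels; suppressed events still show up in the logs
    so the backoff isn't mistaken for silence.
    """
    if event.notify:
        send(event.message)
    else:
        logging.info("Notification suppressed by backoff: %s", event.message)
```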

@jinnatar
Contributor Author

jinnatar commented Dec 6, 2021

Alright, so I'll try to implement a really targeted fix for the keep-alive, but keep in mind that it could be refactored later into a more generic solution.

jinnatar added a commit to jinnatar/chiadog that referenced this issue Jan 14, 2022
At first notifications will arrive roughly at the same speed as before
but the delay starts increasing gradually over time. Keep-alive is still
tested with the same frequency but failures in between notifications are
only logged with the next notification threshold included.

Fixes martomi#317
jinnatar added a commit to jinnatar/chiadog that referenced this issue Jan 23, 2022
At first notifications will arrive roughly at the same speed as before
but the delay starts increasing gradually over time. Keep-alive is still
tested with the same frequency but failures in between notifications are
only logged with the next notification threshold included.

Fixes martomi#317