
Dedupe Webhooks for queue-processor #297

Open
bwagner5 opened this issue Nov 19, 2020 · 3 comments
Labels: Priority: Low · stalebot-ignore · Type: Enhancement

@bwagner5
Contributor

Events like spot rebalance recommendations, ASG lifecycle hooks, spot ITNs, and EC2 instance status changes can cause multiple webhooks to fire, one for each event on the same instance. NTH should dedupe webhooks that have already been sent for an instance, since the drain status has not changed.

@haugenj
Contributor

haugenj commented Dec 14, 2020

I'm gonna leave this for now. If this is affecting anyone, please comment here and we'll increase the priority.

@bwagner5
Contributor Author

When commenting on #353 I realized I forgot to write down some thoughts on mitigating this:

First, we could dedupe locally within the event store. If an event comes into the event store acting on a node that has already been successfully drained within a 10-minute window, the event store could immediately mark it as completed. This would only help within a single-replica deployment of NTH.
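A minimal sketch of what that in-replica dedupe could look like, assuming an in-memory map keyed by node name with a 10-minute window (the names and types below are illustrative, not NTH's actual event-store API):

```go
// Hypothetical sketch of local dedupe inside a single NTH replica.
package dedupe

import (
	"sync"
	"time"
)

// drainDeduper remembers when each node was last drained successfully so that
// later events for the same node within the window can be marked completed
// without firing another webhook.
type drainDeduper struct {
	mu      sync.Mutex
	drained map[string]time.Time // node name -> time of last successful drain
	window  time.Duration        // e.g. 10 * time.Minute
}

func newDrainDeduper(window time.Duration) *drainDeduper {
	return &drainDeduper{drained: make(map[string]time.Time), window: window}
}

// RecordDrain is called after a node has been cordoned and drained.
func (d *drainDeduper) RecordDrain(node string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.drained[node] = time.Now()
}

// IsDuplicate reports whether an incoming event targets a node that was
// already drained within the window and can be completed immediately.
func (d *drainDeduper) IsDuplicate(node string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	last, ok := d.drained[node]
	return ok && time.Since(last) < d.window
}
```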

Another approach could run the Event Store as a separate deployment. NTH worker pods could be scaled independently and the same dedupe logic discussed above would scale to more workers. I'm not sure the approach is warranted though considering the relatively low traffic and the increased complexity this would add to NTH.

A third approach could be to handle events with more graceful degradation. If the node doesn't exist or has already been cordoned and drained, then don't send the webhook. Since a single NTH pod can now process events in parallel, the cordon-and-drain check can race: one worker initiates a cordon-and-drain call while another worker is already in the middle of one, and in that situation the webhooks would be duplicated. To get around the concurrency issue, we could lock event processing for an individual node, but this would only be easy within a single pod, so duplicate webhooks could still fire if two different NTH pods are concurrently processing two events for the same node.
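A rough sketch of that per-node locking within one pod; the type and function names are invented for illustration and do not exist in NTH today:

```go
// Hedged sketch: serialize event handling per node inside a single NTH pod so
// two workers cannot cordon/drain (and therefore webhook) the same node at once.
package nodelock

import "sync"

// nodeLocks hands out one mutex per node name.
type nodeLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newNodeLocks() *nodeLocks {
	return &nodeLocks{locks: make(map[string]*sync.Mutex)}
}

func (n *nodeLocks) lockFor(node string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	l, ok := n.locks[node]
	if !ok {
		l = &sync.Mutex{}
		n.locks[node] = l
	}
	return l
}

// processEvent serializes handling per node: once the lock is held, re-check
// whether the node is already cordoned/drained and skip the webhook if so.
// Note this only helps within one pod; two NTH pods could still duplicate.
func (n *nodeLocks) processEvent(node string, alreadyDrained func(string) bool, cordonDrainAndNotify func(string) error) error {
	l := n.lockFor(node)
	l.Lock()
	defer l.Unlock()
	if alreadyDrained(node) {
		return nil // drain status unchanged: skip the duplicate webhook
	}
	return cordonDrainAndNotify(node)
}
```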

Another valid approach is to do nothing and accept duplicate webhooks. In practice, no one has complained about them so far. If you're out there and find them frustrating, let us know here!

@jillmon added the Priority: High and Priority: Low labels and removed the Priority: High label on Feb 11, 2021
@jillmon added this to the Webhook Enhancements milestone on Feb 11, 2021
@jillmon added the stalebot-ignore label on Oct 19, 2021
@evandam

evandam commented Jun 9, 2022

Hey @bwagner5, just wanted to bump this since I think it would be really nice to have notifications deduplicated to keep noise low in Slack channels for example.

In my case, I see notifications for "Spot Interruption event received", "ASG Lifecycle Termination event received", and "EC2 State Change event received": three notifications when one spot instance is interrupted.

I realize it's a complicated problem, especially with multiple replicas, but the third option you stated sounds good to me. Maybe it would make sense to have a flag to say "skip webhooks when draining" and call it a day? It may still have some issues around concurrency but might be a pretty successful and lower-effort fix.
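Purely as an illustration of the suggested flag, a tiny sketch of how it could gate the webhook send; the flag name and helper callbacks below are hypothetical and do not exist in NTH:

```go
// Hypothetical flag wiring: suppress webhook notifications for nodes that are
// already cordoned or draining. Names are made up for illustration only.
package main

import (
	"flag"
	"log"
)

var skipWebhookWhenDraining = flag.Bool("skip-webhook-when-draining", false,
	"suppress webhook notifications for nodes that are already cordoned or draining")

// maybeSendWebhook wraps a webhook call and drops duplicates when the flag is
// enabled and the node is already being drained.
func maybeSendWebhook(node string, alreadyDraining func(string) bool, send func(string) error) error {
	if *skipWebhookWhenDraining && alreadyDraining(node) {
		log.Printf("skipping duplicate webhook for node %s: already cordoned/draining", node)
		return nil
	}
	return send(node)
}

func main() {
	flag.Parse()
	// Example wiring with stand-in callbacks.
	_ = maybeSendWebhook("ip-10-0-0-1.ec2.internal",
		func(string) bool { return true },
		func(string) error { log.Println("webhook sent"); return nil })
}
```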
