Prevent misbehaving destinations from affecting healthy ones #2
So we define bad targets as targets with more than one attempt (we can ignore errors as, like you said, if you have more than 1 attempt, you failed with a retryable error):
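As a rough sketch, assuming a `job_queue` table with `target` and `attempt` columns (names illustrative, not the actual schema):

```sql
-- Sketch only: the job_queue table and column names are assumptions.
-- "Bad" targets are those with at least one job past its first attempt.
SELECT DISTINCT target
FROM job_queue
WHERE attempt > 1;
```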
Once we know who the bad targets are, we pass them to the dequeue query.
Why don't we just sort on `attempt`?
Given that bad targets are a function of `attempt`, I agree with the general approach; I'm just wondering if we can skip the first query entirely (as we have the number of attempts in the job table already). The downside of this simplification is that we don't know what the bad targets are, so the v2 goal would need some extra work, but I would argue we can worry about that when we work on v2 (maybe we track failures in a separate table to make it more permanent?).
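For illustration, that simplified dequeue could look something like this (a sketch; `job_queue`, `status`, and `created_at` are assumed names):

```sql
-- Sketch: no aggregating query; just prefer jobs with fewer attempts.
-- Table and column names are assumptions, not the real schema.
SELECT *
FROM job_queue
WHERE status = 'available'
ORDER BY attempt, created_at
LIMIT 100
FOR UPDATE SKIP LOCKED;
```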
One more thing to consider: a job's …
For 1: I implemented support for retrying jobs to a separate queue here: PostHog/rusty-hook#20
To make sure I understand the world correctly, here are my assumptions:
Based on that I'd propose we process things in this order:
We can have a background process (janitor) add slow targets to a cache / table.
Rollout considerations: we can initially roll this out with nothing populating `cached_slow_targets`, so we'll only prioritise based on attempts. This is already better than the existing system, where we just don't have retries at all, so we can start rolling this out on the plugin server side while building the `cached_slow_targets` population in parallel.
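For illustration, `cached_slow_targets` could be as simple as this (the schema is an assumption; only the name comes from the comment above):

```sql
-- Sketch: a table the janitor periodically rewrites with the targets
-- it has observed to be slow. Schema is an assumption.
CREATE TABLE cached_slow_targets (
    target     TEXT PRIMARY KEY,
    flagged_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```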
If you don't do the 1st aggregating query and you only dequeue jobs by sorting by `attempt`, a single popular-but-slow destination can still fill every worker with first attempts. That's the "what if a popular destination is slow" example I gave, which I think is what we need to protect against.
Yeah, I mostly agree. Some small tweaks:
Having something smart in charge of telling others who to delay would definitely work. That said, having to store, update, and pass around centralized knowledge is what I was hoping to avoid with my simple/dumb "stateless" query.
Well, we can easily sort by attempts, ship this, and work on the rest later, I agree. But as noted above, I feel like only handling errors/retries on a per-row/event basis leaves us totally weak against the most common failure scenario, and the one we built this project to avoid (AFAIK): one slow backend can choke up the whole pipeline on 1st attempts alone. It feels like we're using PG in a way we could have used Kafka. But, just to repeat: sorting by `attempt` and shipping that first works for me.
We have 2 unimplemented features on the v1 requirements list:
These feel pretty related, so I figured we could discuss them in one place. See prior discussion in Slack, but I plan to summarize/clarify here so you should be able to skip that unless you're really curious.
Separate Queues
This should be relatively easy and makes a lot of sense.
- New jobs land on the `webhooks` queue. When submitting a job for retry, also change its queue column (say, to `webhooks-degraded`); see the sketch below.
- Run a separate set of consumers against `webhooks-degraded`, probably with a lower number of max PG connections.
- Nothing in `app_metrics` has to do with the queue, so metrics are unaffected.

This prevents retries from blocking events that haven't been tried at all.
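The retry step from the first bullet, as a minimal sketch (column names assumed):

```sql
-- Sketch: on a retryable failure, bump the attempt counter and move
-- the job to the degraded queue. Column names are assumptions.
UPDATE job_queue
SET attempt = attempt + 1,
    queue   = 'webhooks-degraded',
    status  = 'available'
WHERE id = $1;
```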
Slow destination doesn't block up everything else
This one is more difficult, and in my opinion the primary goal of this project. If we didn't implement this, we may as well have used Kafka and pushed failures to a second topic (which would push failures back onto itself to retry).
To give a concrete example (and why "Separate Queues" above doesn't solve it alone): imagine a popular destination suddenly becomes very slow. First-attempt jobs for that one destination can occupy every available consumer, and deliveries to all other (healthy) destinations stall, even though nothing has failed or been retried yet.
(In the Slack thread linked above, we discussed some ideas around the janitor service collecting metrics about timeouts/connection-errors/etc and flagging certain destinations as degraded. The producer (and consumer?) would then need to get that information (via PG, either pub/sub or by polling a table). None of this is impossible but it's a lot of moving parts.)
I propose a consumer-only solution that I think will be easy to implement, cheap to run, and will solve the problem above:
- The dequeue query uses `FOR UPDATE SKIP LOCKED`, and it does not hold open a transaction for any other work.
- In the dequeue query's `ORDER BY`, build a list of `CASE`s (up to a reasonable limit) and assign a higher number to `target`s that have observed more errors, effectively pushing them back in the order (see the sketch below).

What this does is cheaply (AFAIK) prevent a slow destination from hurting the pipeline even on the first attempt for any given webhook payload.
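Concretely, the whole dequeue might look roughly like this (a sketch: table/column names and the placeholder targets are assumptions; only `FOR UPDATE SKIP LOCKED` and the `CASE` ordering come from the proposal above):

```sql
-- Sketch: give flagged targets a higher sort key so healthy targets
-- dequeue first. All names and placeholder targets are assumptions.
SELECT *
FROM job_queue
WHERE status = 'available'
  AND queue = 'webhooks'
ORDER BY
  CASE
    WHEN target = 'slow.example.com'  THEN 1  -- one arm per bad target,
    WHEN target = 'flaky.example.com' THEN 1  -- up to a reasonable limit
    ELSE 0
  END,
  attempt,
  created_at
LIMIT 100
FOR UPDATE SKIP LOCKED;
```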
The producer could even use the same query/cache and proactively put jobs into the degraded queue (they'd have `attempt=0` and so they'd be tried first in the degraded queue).
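For illustration, the producer side could route on the same list (a sketch; names assumed, with `$3` standing in for the bad-target list from the same query/cache):

```sql
-- Sketch: route new jobs for known-bad targets straight to the
-- degraded queue. $3 is the bad-target list; names are assumptions.
INSERT INTO job_queue (target, payload, attempt, queue)
VALUES (
  $1, $2, 0,
  CASE WHEN $1 = ANY ($3::text[])
       THEN 'webhooks-degraded'
       ELSE 'webhooks'
  END
);
```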
Downsides:
Upsides: