This repository has been archived by the owner on Sep 20, 2022. It is now read-only.

Implement spread out retries #4

Open
albrow opened this issue Apr 15, 2015 · 6 comments

Comments

@albrow
Owner

albrow commented Apr 15, 2015

Currently, if a job fails it will be immediately queued for retry. This is appropriate in some but not all circumstances. For example, if a third-party API is down for a few hours, retrying the job immediately would cause it to be retried many times before permanently failing. It would be better to spread out the retries over time. E.g. the first retry is immediate, the next one is 15 minutes later, the next one is 1 hour later, etc.

@epelc
Contributor

epelc commented May 14, 2015

I think you're looking for exponential backoff.

http://en.wikipedia.org/wiki/Exponential_backoff

@epelc
Contributor

epelc commented Sep 28, 2015

@albrow would you accept a PR to fix this?

We're running into this in production because we use APIs with unreliable uptime, especially on weekends. A ton of jobs are failing and we have to restart them manually.

I think it would require adding a parameter to the schedule and scheduleRecurring functions, which would be a breaking change. To avoid breaking changes like this in the future, we could switch those functions to accept a schedule struct instead. That way new options could be added later without changing the function signatures.

@albrow
Owner Author

albrow commented Sep 28, 2015

@epelc I'm not going to have time to implement this anytime soon, but I would be happy to review a PR for it :) Couldn't we make this a field (or fields) of PoolConfig with some sensible default values to make it a non-breaking change?

@epelc
Contributor

epelc commented Sep 28, 2015

@albrow I think that'd work well if you have a single job type or they are all similar. But if you're hitting different APIs, it would require a separate pool for each.

I think we could get away with the pool config in our app but I'm not sure how others are using this. If you have a lot of different job types it might be problematic. Let me know your thoughts though. I'll do either one.

@albrow
Owner Author

albrow commented Sep 29, 2015

Hmm... as I understand it, one of the great things about using exponential backoff is that it handles a variety of failure conditions pretty well. For example, it will handle both a temporary, one-time failure and a case where a service is down for the weekend. I think this is why delayed_job, a popular Ruby gem which I drew some inspiration from, doesn't let you tweak the exponential backoff parameters. My opinion is that we should add one or two parameters to PoolConfig for now. When I finally get the time to fix #14, it will be easier to express different options for individual job types, so we can consider changing this at that time.

@epelc
Contributor

epelc commented Sep 29, 2015

Sounds good. I'll add some sort of option to the PoolConfig like you said.
