Implement spread out retries #4

albrow · 2015-04-15T20:31:19Z

Currently, if a job fails it will be immediately queued for retry. This is appropriate in some but not all circumstances. For example, if a third-party API is down for a few hours, retrying the job immediately would cause it to be retried many times before permanently failing. It would be better to spread out the retries over time. E.g. the first retry is immediate, the next one is 15 minutes later, the next one is 1 hour later, etc.

epelc · 2015-05-14T18:51:05Z

I think you're looking for exponential backoff.

http://en.wikipedia.org/wiki/Exponential_backoff

epelc · 2015-09-28T21:39:30Z

@albrow would you accept a PR to fix this?

We're running into this in production because we use APIs with unreliable uptime, especially on weekends. A ton of jobs are failing and we have to restart them manually.

I think it would require adding a parameter to the schedule and scheduleRecurring functions, which would be a breaking change. To avoid breaking changes in the future, we could switch them to accept a schedule struct instead. That way you could add options later without breaking anything.

albrow · 2015-09-28T21:58:28Z

@epelc I'm not going to have time to implement this anytime soon, but I would be happy to review a PR for it :) Couldn't we make this a field (or fields) of PoolConfig with some sensible default values to make it a non-breaking change?

epelc · 2015-09-28T22:16:27Z

@albrow I think that'd work well if you have a single job type or if they are all similar. But if you're hitting different APIs, it would require separate pools.

I think we could get away with the pool config in our app but I'm not sure how others are using this. If you have a lot of different job types it might be problematic. Let me know your thoughts though. I'll do either one.

albrow · 2015-09-29T18:14:44Z

Hmm... as I understand it, one of the great things about using exponential backoff is that it handles a variety of failure conditions pretty well. For example, it will handle both a temporary, one-time failure and a case where a service is down for the weekend. I think this is why delayed_job, a popular Ruby gem which I drew some inspiration from, doesn't let you tweak the exponential backoff parameters. My opinion is that we should add one or two parameters to PoolConfig for now. When I finally get the time to fix #14, it will be easier to express different options for individual job types, so we can consider changing this at that time.

epelc · 2015-09-29T18:21:00Z

Sounds good. I'll add some sort of option to the PoolConfig like you said.

albrow added the enhancement label Apr 15, 2015
