Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consumer: FullJitterStrategy may increase backoff counter unexpectedly #195

Open
judwhite opened this issue Oct 3, 2016 · 8 comments
Open

Comments

@judwhite
Copy link
Contributor

judwhite commented Oct 3, 2016

go-nsq/consumer.go

Lines 781 to 786 in 3c0c5ed

case backoffFlag:
nextBackoff := r.config.BackoffStrategy.Calculate(int(backoffCounter) + 1)
if nextBackoff <= r.config.MaxBackoffDuration {
backoffCounter++
backoffUpdated = true
}

If FullJitterStrategy calculates a value less than config.MaxBackoffDuration then backoffCounter will increment where ExponentialStrategy wouldn't, potentially causing inBackoff() and RDY 1 to persist longer than intended.

This probably matters very little for most use cases, but during a prolonged outage and long-running handlers it could take a while to recover.

@judwhite
Copy link
Contributor Author

judwhite commented Oct 3, 2016

Ideas on how to approach this:

  • Add a max backoff counter which defaults to log2(cfg.MaxBackoffDuration / cfg.BackoffMultiplier). I doubt anyone wants to turn this knob and it assumes an implementation by default.
  • Break the signature of the BackoffStrategy interface's Calculate method to indicate if the backoff counter should increment.
  • Add another interface which adds this behavior, use type assertions to see if it's implemented? Requires internal knowledge and type magic.

I think I'd favor breaking the signature rather than a silent change in behavior. I'd also be surprised if many people are using custom backoff strategies. Thoughts?

@mreiferson
Copy link
Member

I see what you mean.

Break the signature of the BackoffStrategy interface's Calculate method to indicate if the backoff counter should increment.

How would FullJitterStrategy determine when to return true/false in this case?

@judwhite
Copy link
Contributor Author

judwhite commented Oct 5, 2016

@mreiferson Something like this

// Calculate returns a random duration of time [0, 2 ^ attempt) and a value
// indicating whether to increase the backoff counter.
func (s *FullJitterStrategy) Calculate(attempt int) (time.Duration, bool) {
    // lazily initialize the RNG
    s.rngOnce.Do(func() {
        if s.rng != nil {
            return
        }
        s.rng = rand.New(rand.NewSource(time.Now().UnixNano()))
    })

    backoffDuration := s.cfg.BackoffMultiplier *
        time.Duration(math.Pow(2, float64(attempt)))

    return time.Duration(s.rng.Intn(int(backoffDuration))),
           backoffDuration <= s.cfg.MaxBackoffDuration
}

@mreiferson
Copy link
Member

Yea, that makes sense.

I think I'd favor breaking the signature rather than a silent change in behavior. I'd also be surprised if many people are using custom backoff strategies. Thoughts?

We simply don't know, although I tend to agree. The other "right" thing to do would be to bump the major version number. But given Go's pathetic dependency management situation, I don't think that's all that important. What do you think @jehiah?

@judwhite
Copy link
Contributor Author

judwhite commented Oct 9, 2016

@mreiferson @jehiah I thought about the pros and cons of a MaxBackoffLevel or changing the BackoffStrategy interface some more. I think the added complexity of another option, and that a new option would overlap in meaning with existing options, makes changing the interface a better approach.

Thoughts on these two approaches:

  • A MaxBackoffLevel is probably something consumers care about - how many success levels do we need to get through before we're back at full throttle?
  • The BackoffStrategy interface takes a parameter named attempt; the way it's called with backoffCounter this value is (supposed to be) bounded (which is a good thing for full jitter, otherwise it'd often calculate higher values which would get capped at MaxBackoffDuration, and go against the purpose of full jitter). This supports that a MaxBackoffLevel is important to recovery and also backoff calculation in the strategy (to state the obvious).
  • Adding a MaxBackoffLevel would add complexity, and in a way overlap MaxBackoffDuration; currently MaxBackoffLevel implicitly exists (or should) by a combination of MaxBackoffDuration, BackoffMultiplier, and the strategy.

What do you think about changing the interface? Any other ideas?

If changing the interface is the way to go, we could:

  • Change the signature of Calculate to Calculate(attempt int) (time.Duration, bool), like above.
  • Add a MaxBackoffLevel() int method to the interface. That method might look something like this:
func (s *FullJitterStrategy) MaxBackoffLevel() int {
    x := float64(s.cfg.MaxBackoffDuration / s.cfg.BackoffMultiplier)
    return int(math.Ceil(math.Log2(x + 1)))
}

@mreiferson mreiferson changed the title Consumer: Backoff counter may increase beyond expected bounds when using FullJitterStrategy consumer: FullJitterStrategy may increase backoff counter unexpectedly Oct 10, 2016
@mreiferson
Copy link
Member

I agree with your analysis.

It's not great, and we should try to avoid it when possible, but I ultimately think that behavior of the library is more important than backwards compatibility and furthermore, given that compilation will break, end users will be notified before it can affect a production system.

Let's modify the interface. When we go to stamp a stable release and write up migration docs, we can decide about version numbers.

@judwhite
Copy link
Contributor Author

@mreiferson Thanks for taking the time to read through this. I'll raise a PR to change Calculate, I think it's more understandable than a method calculating the max backoff level.

I noticed the code starts the backoff counter at 1, which means the lowest backoff will be (2^1)*cfg.BackoffMultiplier. I think passing 1 to Calculate is expected, and the strategy should multiply by 2^(n-1). There are tests for this but they start attempt at 0, which it'll never be.

It's been like this for a while but the tests are more recent.

@mreiferson
Copy link
Member

SGTM re: off-by-one 😜

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants