
Make loadbalancer avoid sending requests to unhealthy nodes #2363

Draft
RomanZavodskikh wants to merge 2 commits into master from admissionControlUnhealthyNodes

Conversation

@RomanZavodskikh (Member) commented Jun 1, 2023

Updates #2346

proxy/proxy.go Outdated
@@ -870,6 +870,11 @@ func (p *Proxy) makeBackendRequest(ctx *context, requestContext stdlibcontext.Co
req = injectClientTrace(req, ctx.proxySpan)

response, err := roundTripper.RoundTrip(req)
if response.StatusCode >= 400 {
@szuecs (Member) commented Jun 1, 2023

We can't do this on >= 400. We need to check for errors that indicate timeouts and connection errors.
Everything that is a valid response we accept as a "success".
So a 500 could be a "success". I think you could start with err != nil as "fail".

Contributor:

I agree that error code >= 400 is not what we want, because 4xx is a valid response, it's a client-side error. But a 5xx, even being a valid response, means something is up with the backend; 5xx are always server-side errors. So maybe something like err != nil || response.StatusCode >= 500 would be better?
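
For illustration, a minimal sketch of the failure classification being discussed in this thread; isBackendFailure is an invented helper name, not part of the PR:

package proxy

import "net/http"

// isBackendFailure reports whether a round-trip result should count against
// endpoint health, following the rule proposed above: transport errors and
// 5xx responses are failures, everything else (including 4xx) is a success.
func isBackendFailure(resp *http.Response, err error) bool {
	if err != nil {
		return true // timeout, connection error, etc.
	}
	return resp.StatusCode >= 500
}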

defer m.m.Unlock()

m.failedRequests++
m.totalRequests++
@szuecs (Member) commented Jun 1, 2023

why not

atomic.AddInt64(&m.totalRequests, 1)
atomic.AddInt64(&m.failedRequests, 1)

Member Author:

Because in this section of code m.m is locked, we can rely on no other goroutine using those fields.
So both plain and atomic increments are correct.
However, the atomic ones are slower, because they add mfence instructions, as discussed here: https://stackoverflow.com/questions/51846894/what-is-the-performance-of-stdatomic-vs-non-atomic-variables

@lucastt (Contributor) commented Jun 2, 2023

I'm not sure this post covers the same case we are seeing here. They are explicitly comparing a normal operation vs an atomic one, but our performance constraints also involve the mutex lock. Atomic instructions busy-wait, while a mutex frees up the CPU for other threads. I can't tell which one is more efficient for our use case, but our use case seems different from the question in your link. Or I might be imagining things; either way it's a valid discussion. As food for thought: https://stackoverflow.com/questions/15056237/which-is-more-efficient-basic-mutex-lock-or-atomic-integer

Member:

I swapped the lines, because you want to add to total before failed, so that you never count failed ahead of total (which could lead to failed being bigger than total).
In general I am sure that the lock will be slower, as @lucastt summarized. If you would like to check it, create a micro benchmark. Do not trust random posts by people; even if they come from a super-hero-popular-10x-coder, they can be wrong.
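
A rough sketch of such a micro benchmark (counter types and names are invented for illustration, not taken from the PR); it would live in a _test.go file and run with go test -bench=.:

package loadbalancer

import (
	"sync"
	"sync/atomic"
	"testing"
)

// mutexCounter increments under a mutex, like the current PR code.
type mutexCounter struct {
	mu            sync.Mutex
	totalRequests int64
}

func (c *mutexCounter) inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.totalRequests++
}

// atomicCounter increments with sync/atomic, as suggested in the review.
type atomicCounter struct {
	totalRequests int64
}

func (c *atomicCounter) inc() {
	atomic.AddInt64(&c.totalRequests, 1)
}

func BenchmarkMutexIncrement(b *testing.B) {
	c := &mutexCounter{}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.inc()
		}
	})
}

func BenchmarkAtomicIncrement(b *testing.B) {
	c := &atomicCounter{}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.inc()
		}
	})
}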

Member Author:

My intuition tells me that atomics are faster in this case. :-)
In general, I would like to write a benchmark to prove it; if I have time, I will do this.
If I don't, I will treat atomics as the baseline.

Member:

Make sure you increase the total requests first and the failed ones after that.

m.m.Lock()
defer m.m.Unlock()

m.totalRequests++
Member:

why not

atomic.AddInt64(&m.totalRequests, 1)

Member Author:

Because in this section of code m.m is locked, we can rely on no other goroutine using those fields.
So both plain and atomic increments are correct.
However, the atomic ones are slower, because they add mfence instructions, as discussed here: https://stackoverflow.com/questions/51846894/what-is-the-performance-of-stdatomic-vs-non-atomic-variables

Member:

I guess we discuss this in the comment above: #2363 (comment)

@@ -208,7 +226,9 @@ func (r *random) Apply(ctx *routing.LBContext) routing.LBEndpoint {
return ctx.Route.LBEndpoints[i]
}

return withFadeIn(r.rand, ctx, r.notFadingIndexes, i, r)
i = withFadeIn(r.rand, ctx, r.notFadingIndexes, i, r)
i = withHealthCheck(r.rand, ctx, r.notFadingIndexes, i, r)
@lucastt (Contributor) commented Jun 2, 2023

I wonder if it would not be better to have a new LB algorithm instead of modifying random. Since you avoid unhealthy nodes, it is not exactly random anymore, which might be a bit misleading to a user configuring it.

@lucastt (Contributor) commented Jun 2, 2023

The same thing could be applied to round robin as well. Instead of a whole new LB algorithm, this could just be an option for both algorithms, configurable through some flag or parameter.

now := time.Now()
failedReqs, totalReqs := ctx.Route.LBEndpoints[choice].Metrics.GetFailedRequests()

if totalReqs == 0 || float64(failedReqs)/float64(totalReqs) < 0.95 {
@lucastt (Contributor) commented Jun 2, 2023

This behavior is OK, but it looks more like a CB (circuit breaker) than an admission-control behavior. I might be reading this wrong, but what I understand here is: once a backend reaches a 95% error rate, we just stop sending requests to it until there are no active requests to this backend, then we try again, and if the backend shows apparent recovery we keep sending requests to it; if it reaches a 95% error rate again we stop, and so on.

First, I don't have big problems with this behavior, but from the issue description what I would expect is load being drained from this backend until it reaches an equilibrium and is able to serve the requests sent to it. This is a much less aggressive behavior too, and it is an admission-control-like behavior.

Have you thought about the pros and cons of these two approaches? If yes, would you mind describing them a bit so I can understand the reasoning?

From a quick look, my thinking is: in case of a network partition the CB behavior would be better because it acts much faster, but in case of overload admission control seems to fare better, since it does not just dump the load of an overloaded backend onto other backends that might themselves be struggling. Basically, in the overload case the CB behavior could spread the failure to other backends.

If this behavior could be a choice for the user, it might make sense to make it configurable through some flag or parameter.

Member Author:

Thanks for the feedback, Lucas! The very first thing to do from my PoV is to split the withFadeIn and withHealthCheck parts so they are not so coupled together.
But at first glance I would prefer AC (not CB) here, because I would like some requests to keep going to an unhealthy node, just to find out the moment when it becomes healthy again.

@szuecs (Member) commented Jun 2, 2023

We need to have a strategy pattern somewhere.
I think what we want is (see the sketch after this list):

  • probability-based drops (or better, re-schedules), similar to the admissionControl reject calculation, to allow a small amount of traffic to broken endpoints so PHC (passive health check) can be observed
  • exponential backoff (CB style)
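
A rough sketch of the probability-based re-scheduling idea under discussion (names and constants are illustrative, not from the PR): the worse an endpoint's failure ratio, the less likely it is to be kept, while a small floor of traffic still reaches it so passive health checking can observe recovery.

package loadbalancer

import "math/rand"

// minObservationProbability keeps a small share of traffic flowing to an
// unhealthy endpoint so its recovery can still be observed (invented value).
const minObservationProbability = 0.05

// keepEndpoint decides whether a chosen endpoint is kept or should be
// re-scheduled, based on its observed failure ratio.
func keepEndpoint(failed, total int64, rnd *rand.Rand) bool {
	if total == 0 {
		return true
	}
	// keep with probability proportional to the success ratio,
	// but never drop below the observation floor
	p := 1 - float64(failed)/float64(total)
	if p < minObservationProbability {
		p = minObservationProbability
	}
	return rnd.Float64() < p
}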

@RomanZavodskikh RomanZavodskikh force-pushed the admissionControlUnhealthyNodes branch 6 times, most recently from e096d3f to 63c9ea0 on June 5, 2023 15:53
RomanZavodskikh pushed a commit that referenced this pull request Jun 5, 2023
Updates #2363

Signed-off-by: Roman Zavodskikh <roman.zavodskikh@zalando.de>
@RomanZavodskikh RomanZavodskikh force-pushed the admissionControlUnhealthyNodes branch 10 times, most recently from de7929c to 171360a on June 8, 2023 09:17
loadbalancer/algorithm.go Outdated (conversation resolved)

if ctx.Route.LBFadeInDuration <= 0 {
return ctx.Route.LBEndpoints[r.index]
endpoints := []routing.LBEndpoint{}
Member Author:

Can I avoid this repetition (it looks the same as in random and almost the same as in consistent-hash)?

}
goodHealthyEp := totalReqs <= 10 || float64(failedReqs)/float64(totalReqs) < 0.95 || rnd.Float64() < 0.05
Member Author:

Such constants look like good baseline values to me. How can I make them configurable in a good way?

@szuecs (Member) commented Jun 8, 2023

I guess we would have to store them in a new type, so you would not have a "pure" func but a "method".
Then you would need to create a constructor, called from proxy or skipper, and pass the config through.
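
A minimal sketch of that suggestion, using invented type and field names (not the actual PR code): the thresholds move into an options struct with a constructor so proxy/skipper setup can pass them in from configuration.

package loadbalancer

import "math/rand"

// healthCheckOptions holds the currently hard-coded thresholds; names and
// defaults here are illustrative only.
type healthCheckOptions struct {
	MinRequests        int64   // below this, an endpoint is always considered healthy (e.g. 10)
	MaxFailureRatio    float64 // e.g. 0.95
	ObserveProbability float64 // share of traffic still sent to unhealthy endpoints, e.g. 0.05
}

type healthChecker struct {
	opts healthCheckOptions
	rnd  *rand.Rand
}

// newHealthChecker would be called from proxy or skipper setup with values
// coming from configuration flags.
func newHealthChecker(opts healthCheckOptions, rnd *rand.Rand) *healthChecker {
	return &healthChecker{opts: opts, rnd: rnd}
}

// goodHealthyEndpoint mirrors the goodHealthyEp expression above, but reads
// the thresholds from the configured options.
func (h *healthChecker) goodHealthyEndpoint(failedReqs, totalReqs int64) bool {
	return totalReqs <= h.opts.MinRequests ||
		float64(failedReqs)/float64(totalReqs) < h.opts.MaxFailureRatio ||
		h.rnd.Float64() < h.opts.ObserveProbability
}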

if _, fadingIn := fadeInState(now, ctx.Route.LBFadeInDuration, ep[i].Detected); !fadingIn {
notFadingIndexes = append(notFadingIndexes, i)
}
failedReqs, totalReqs := (int64)(0), (int64)(0)
Member:

why not use the defaults:

var failedReqs, totalReqs int64

if ctx.Route.LBFadeInDuration <= 0 {
return ctx.Route.LBEndpoints[r.index]
endpoints := []routing.LBEndpoint{}
for _, e := range ctx.Route.LBEndpoints {
Member:

Apply is in the hot path, and it feels like instead of an index access into the slice we now have a loop over all endpoints, so we go from O(1) to O(n). That does not sound very good. Am I missing something?

@lucastt (Contributor) commented Jun 8, 2023

so from O(1) to O(n)

Indeed.

I feel like there are two options here: one is to take this off the hot path and update it independently through some goroutine; the other is to access the slice the same way as before, but if the endpoint is not healthy, shift to the next one and do the same evaluation. The second option needs a stop condition to avoid unending loops. I personally think it would be better to have a separate goroutine that updates this slice every x seconds (roughly as in the sketch below).
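
A rough sketch of that background-refresh option, with invented types (the real code would work on routing.LBEndpoint rather than strings): the healthy subset is recomputed periodically, so Apply can keep an O(1) pick from a prebuilt slice.

package loadbalancer

import (
	"sync"
	"time"
)

// healthyEndpoints holds the periodically refreshed healthy subset.
type healthyEndpoints struct {
	mu      sync.RWMutex
	healthy []string // endpoint addresses, illustrative only
}

// snapshot returns the current healthy subset for use on the hot path.
func (h *healthyEndpoints) snapshot() []string {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.healthy
}

// refreshEvery recomputes the healthy subset on a fixed interval until quit
// is closed; isHealthy stands in for the failure-ratio check from the PR.
func (h *healthyEndpoints) refreshEvery(d time.Duration, all []string, isHealthy func(string) bool, quit <-chan struct{}) {
	ticker := time.NewTicker(d)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			next := make([]string, 0, len(all))
			for _, e := range all {
				if isHealthy(e) {
					next = append(next, e)
				}
			}
			h.mu.Lock()
			h.healthy = next
			h.mu.Unlock()
		case <-quit:
			return
		}
	}
}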

endpoints = append(endpoints, e)
}
}
if len(endpoints) == 0 {
Contributor:

What will happen if only a very small number of endpoints is healthy? Shouldn't we have some kind of guardrail for this? In this case there are two possibilities:

  1. This algorithm sheds some load;
  2. This algorithm does not mind if skipper overloads some healthy backends.

The first option kind of breaks the separation of concerns between these features; I don't think an LB algorithm should shed load, it seems a bit out of place, and it is not what the author chose to do, which I agree with.

But there is some concern that if very few endpoints are healthy, this could make the system collapse. So maybe this algorithm should only be available for routes with some form of load-shedding strategy configured. I don't have a well-formed opinion about it; what do you think?

Well, at least we should document this risk very well.

t.Log("CSV " + fmt.Sprintf("%d,", i) + strings.Join(showStats, ","))
}
})
}

// For each endpoint, return a hash key which will make the consistent hash algorithm select it.
Contributor:

Why is this test not valid anymore?

Member Author:

After random keys are used here
#2363 (comment)
this function is not called anywhere anymore.

func() {
for {
ctx.Params[ConsistentHashKey] = hashKeys[len(stats)%len(hashKeys)]
ctx.Params[ConsistentHashKey] = strconv.Itoa(rnd.Intn(100000))
Contributor:

I also don't understand this one: why do we have to change this behavior from iterative to random? Shouldn't it be interchangeable?

Member Author:

The old version of the test code creates a limited number of keys: one key per host. So in the case of 6 hosts (3 "old" ones: a, b, c and 3 "fading-in" ones: d, e, f), this code sends requests with the following key values in a loop:

h_a
h_b
h_c
h_d
h_e
h_f
h_a
h_b
... (~100k reqs for a single test)

At the beginning of this test, nodes d, e, f process very few requests (like here: #2363 (comment)); instead, such requests are redirected to the corresponding host in the consistent hash circle (https://www.toptal.com/big-data/consistent-hashing).
For example, requests to d may be redirected to e, requests to e to f, and requests to f to a.
In the old version this correspondence is constant, because the same six keys are used, so there is a high chance that host d will receive very few requests.

So, the solution I propose here is to use different keys, so that the aforementioned redirection is not constant, but random instead.

Member Author:

This PR tends to become too huge; I moved this part to another PR: #2380

@@ -12,7 +12,7 @@ import (
)

const (
fadeInDuration = 100 * time.Millisecond
fadeInDuration = 500 * time.Millisecond
Contributor:

This is a rather specific change; why is 500ms better than 100ms here? Is this for visualization reasons? Because to me it seems it should work independently of the fadeInDuration here.

Member Author:

No, visualization works fine regardless of fadeInDuration. Such changes are needed for tests like TestFadeIn/consistent-hash,_5 (in loadbalancer/fadein_test.go), because at the beginning of that test some hosts are being "warmed up" and therefore receive very few requests (ones or tens), so there is a good chance that, for some random reason, the number of requests goes down by some small amount (ones or tens).

Basically, this test checks that the number of requests sent to every host is monotonic (decreasing for old hosts, increasing for fading-in hosts). So for fading-in hosts the condition looks like "for every period of time, the number of requests in the next period should be no less than 0.6 * the number of requests in the current period". When the number of requests is small, some random cause (in the Go scheduler or random generator, for example) may easily make this condition false even for a correctly implemented load balancer.

That's why I increased fadeInDuration: to enlarge the time window during which requests are collected, which increases the number of requests at every specific moment of time and therefore decreases the influence of random events here.
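
For illustration, a rough sketch of the per-window check described above (the helper name is invented; the actual test code differs):

package loadbalancer

// roughlyIncreasing checks the fading-in condition described above: each
// time window must receive at least 0.6x the requests of the previous one.
func roughlyIncreasing(requestsPerWindow []int) bool {
	for i := 1; i < len(requestsPerWindow); i++ {
		if float64(requestsPerWindow[i]) < 0.6*float64(requestsPerWindow[i-1]) {
			return false
		}
	}
	return true
}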

Member Author:

This PR tends to become too huge; I moved this part to another PR: #2380

Roman Zavodskikh added 2 commits June 8, 2023 18:15
Updates #2346

Signed-off-by: Roman Zavodskikh <roman.zavodskikh@zalando.de>
Updates #2346

Signed-off-by: Roman Zavodskikh <roman.zavodskikh@zalando.de>
3 participants