
No KeepAlive timeout set in load test clients #3466

Open
szuecs opened this issue Nov 22, 2023 · 9 comments
Labels: evaluation needed, feature, new-http, triage

szuecs commented Nov 22, 2023

Brief summary

We use the k6 JavaScript HTTP client to do load tests, and as we can see in CPU graphs it sticks with only one socket to the target.
This of course saturates only a single target even if the target is autoscaling, which is a severe bug for a load test tool.
I am not very experienced with the JavaScript side, but you can likely set a keep-alive (idle connection) timeout of something like 60s on the HTTP client to fix the issue. See also this Go issue, which shows the problem in more detail: golang/go#23427
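
For illustration, here is a minimal Go sketch of the kind of fix discussed in that Go issue (this is not k6 code, and newRecyclingClient is a made-up name): an http.Transport with an idle-connection timeout plus a goroutine that periodically calls CloseIdleConnections, so pooled sockets are recycled and newly scaled-out backends get picked up on the next dial.

package main

import (
    "net/http"
    "time"
)

// newRecyclingClient returns an *http.Client whose pooled connections are
// dropped regularly, so DNS is re-resolved and newly added backends start
// receiving traffic. Illustrative sketch only, not part of k6.
func newRecyclingClient(recycle time.Duration) *http.Client {
    tr := &http.Transport{
        MaxIdleConnsPerHost: 100,
        IdleConnTimeout:     60 * time.Second, // close keep-alive sockets that sit idle
    }

    // CloseIdleConnections only affects idle connections; sockets with
    // in-flight requests are untouched until they return to the pool.
    go func() {
        for range time.Tick(recycle) {
            tr.CloseIdleConnections()
        }
    }()

    return &http.Client{Transport: tr}
}

func main() {
    client := newRecyclingClient(30 * time.Second)
    if resp, err := client.Get("https://example.com/"); err == nil {
        resp.Body.Close()
    }
}

k6 configures the Go transport internally, so the sketch only shows the mechanism; whether and how to expose such a knob in k6 is what this issue is asking for.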

Example load test client:

import http from "k6/http";

function getLimits(order, baseUrl, token) {

    let url = `${baseUrl}/purchase-quantity-restrictions`;
    let payload = JSON.stringify({
        // ...
    });

    let params = {
        headers: {
            "Authorization": `Bearer ${token}`,
            "Content-Type": "application/json",
            // ...
        },
    };

    return http.post(url, payload, params);
}

k6 version

latest

OS

Ubuntu 22.04 LTS

Docker version and image (if applicable)

FROM loadimpact/k6:latest AS k6official

Steps to reproduce the problem

Run this client and increase the load against an autoscaled target.

Expected behaviour

all targets should have the same CPU usage

Actual behaviour

Newly started targets do not reach the same CPU usage as the existing ones.

@kullanici0606

Hi,

Could you please explain how autoscaling works in your case? For example, in AWS, I think autoscaling works by adding new nodes to the load balancer definition, in which case DNS is not changed.

In your case, are new nodes' IP addresses added to the DNS when autoscaling launches new nodes? Is load balancing handled by DNS?

In the load balancer case, sticky sessions may cause unbalanced node loads. Also, load balancers sometimes use the source IP address in their load balancing strategy, which is not ideal for a load testing scenario if all load is generated from one IP address.

Could you give more information about your setup, please?

codebien commented Dec 20, 2023

Hi @szuecs,
I can imagine some use cases, but, as @kullanici0606 said, it would be helpful to get a better explanation of your specific setup.
It would also be nice to know what limits you're hitting with the currently available approach, the DNS option: https://grafana.com/docs/k6/latest/using-k6/k6-options/reference/#dns

Why do you consider it a bug? It doesn't sound like one. I don't think we state anywhere that we support this case; it sounds to me simply like a feature to add. I removed the label for now and we can re-add it later if required.

@codebien added the "evaluation needed", "awaiting user", "new-http" and "feature" labels and removed the "bug" and "triage" labels on Dec 20, 2023
@codebien removed their assignment on Dec 20, 2023

szuecs commented Dec 21, 2023

@codebien @kullanici0606
The problem is not DNS! I referenced the Go issue to make you aware that TCP connections are not closed and recreated but are held open forever.

There are two problems.
The first is that if you run, for example, an NLB, packets are forwarded rather than proxied, which means your load test's connection pool won't connect to new nodes that get attached to the NLB TargetGroup.
The second is that you can take down an ALB, because you never connect to the scaled-out ALB instances. An ALB spanning 3 AZs initially has only 3 EC2 instances, one in each zone. If you hit their limit, they need to scale out, adding more EC2 instances and later also adding more IPs to the ALB. As the Go issue shows, you will never look up these new IPs.

In both cases you limit throughput to whatever was available at the start of the load test. That means it's not a scalable solution, and scaling out upfront is not really a realistic load test scenario.

I hope this clarifies your questions and why it's a bug and not a feature.

Our load test runs in one of our Kubernetes clusters, on nodes with public IPs attached, so all pods running on one node share the same source IP, but that seems to be fine. The ingress setup is best explained in https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#aws-deployment (but replace ALB with NLB). The second-layer load balancer is skipper, which is the only member of the NLB TargetGroup and is backed by an HPA, as shown in https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#deployment

szuecs commented Apr 2, 2024

@codebien I tried to do my due diligence, but I got no response, and the "awaiting user" label doesn't seem right when I have responded and received no comment.

mohitk05 commented Apr 2, 2024

Hello @codebien @kullanici0606, I work with Sandor and I looked into this issue since we are planning to use k6 to perform large load tests. I wanted to highlight a few points here and propose solutions for us to discuss.

As @szuecs has pointed out, the mentioned use case is indeed an issue, but it might be tricky to place it into the right category (bug/feature) because it only occurs in specific situations. I looked into projects other than k6 and checked public articles to see whether others face this issue, and I found the following:

The above links show that this is a common issue faced when persistent TCP connections are used against a target that is auto-scaling, though the discussions seem to be open and there is no single accepted solution.

I believe k6 should provide ways to solve this, since the use case is faced more frequently when running load tests. Disabling persistent connections can hurt k6's performance, hence I'm avoiding that as a potential solution.

To solve this issue, I can think of the following solutions:

  1. Provide an option in k6 to set the maximum number of requests per persistent TCP connection (a rough sketch of this idea follows after the list)
  2. Provide an option in k6 to close TCP connections every X seconds (a TTL for connections, similar to Sandor's suggestion of a goroutine)
  3. Send a Connection: close header from the load balancer at regular intervals (difficult to implement).
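
To make option 1 concrete, here is a rough, hypothetical Go sketch (cappedReuseTransport is a made-up name, and this is not k6 code): a RoundTripper wrapper that marks every Nth request with Close, so pooled connections are retired at a predictable rate. It only approximates a per-connection request cap, since it doesn't track exact per-connection counts.

package main

import (
    "net/http"
    "sync/atomic"
)

// cappedReuseTransport retires pooled connections by asking the transport
// not to reuse the connection of every N-th request. Sketch only.
type cappedReuseTransport struct {
    base  http.RoundTripper
    every uint64
    count uint64
}

func (t *cappedReuseTransport) RoundTrip(req *http.Request) (*http.Response, error) {
    if n := atomic.AddUint64(&t.count, 1); n%t.every == 0 {
        req = req.Clone(req.Context()) // RoundTrippers must not modify the caller's request
        req.Close = true               // close this connection after the response is read
    }
    return t.base.RoundTrip(req)
}

func main() {
    client := &http.Client{Transport: &cappedReuseTransport{base: http.DefaultTransport, every: 100}}
    if resp, err := client.Get("https://example.com/"); err == nil {
        resp.Body.Close()
    }
}

With every set to 100, roughly one connection per 100 requests is closed and re-dialed, which also re-resolves DNS at that rate.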

Solving this in the load test client seems like a better solution given the probability of facing this issue. Other load test tools that solve this:

@codebien

Hi @mohitk05 @szuecs,
we met today with the team to discuss this issue.

In summary, our position remains the same as previously communicated: we might add support for this use case as a feature in the next version of the HTTP module, which we're currently developing and where we expect to introduce a more configurable solution to this specific issue.

To give you more context:

Unfortunately, as we explained previously in another issue, any configuration that forces connections to close has to be as predictable as possible; otherwise it might impact metrics and create confusion during investigations, troubleshooting or debugging.

Yet, even if we were keen to introduce a well-documented, configurable change that would bring a solution to this, we would still encounter a significant technical challenge: there currently isn't any way for k6 to control the HTTP connection pool directly. It is mostly the same reason as in @szuecs’s original issue on the Go repository.
The API that Go's HTTP package provides for forcing connections to close is CloseIdleConnections, which, as the name makes clear, doesn't guarantee anything about closing active connections.

Supporting this use case would effectively involve building custom solutions that do not rely on the standard Go HTTP library, which we aim to use as much as possible. Moving away from it is a difficult decision that has to be properly tested and evaluated, considering all the corner cases HTTP, TCP and networking have in general. Continuing to use the Go standard library for the HTTP module is, for us, the guarantee of high standards in one of the most critical modules of k6.

Of course, as with all engineering decisions, it has trade-offs; unfortunately, this issue is one of them.

At this point, the alternative we can offer, other than the already available workaround of the noVUConnectionReuse option, is for you to use the extensions model and build your own implementation. We can offer our support there, helping to find solutions and reviewing code; for us, it might be useful as an exploration phase for httpv2.

Eventually, if you build something generic, it could even be considered as a first version for HTTP v2, and we may consider including it as an experimental module in k6 core, if that is something you are interested in benefitting from. But any evaluation has to be deferred until your extension is available. If this is already a high priority for you right now, feel free to start discussing a potential proposal here and how you would like to implement it.

I hope it helps.

@codebien removed the "awaiting user" and "triage" labels on Apr 19, 2024

szuecs commented Apr 19, 2024

Thanks for coming back to us!

In general I think we can try to use the noVUConnectionReuse option.
@mohitk05 can you test it and provide feedback on whether that helps?
I am happy to review all kinds of data on the skipper side.

For us (I run skipper as an ingress router handling 1M rps every day across 100 production clusters), the CloseIdleConnections loop never showed a negative impact on performance. I reviewed hundreds of latency issues, and none of them showed that connection establishment was really a problem. If you stay inside an AWS region, I never saw >1ms connection times as measured by the Go httptrace module. Here's a deeplink to our client's httptrace instrumentation: https://github.com/zalando/skipper/blob/master/net/httpclient.go#L404
The Go httptrace module is pure gold!
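
For anyone who wants to reproduce that kind of measurement without digging through the skipper code, here is a minimal self-contained Go sketch (not the skipper instrumentation itself; the target URL is just a placeholder) that uses net/http/httptrace to time connection establishment and to see whether a pooled keep-alive connection was reused:

package main

import (
    "context"
    "fmt"
    "net/http"
    "net/http/httptrace"
    "time"
)

func main() {
    var connStart time.Time
    trace := &httptrace.ClientTrace{
        // ConnectStart/ConnectDone only fire when a new TCP connection is dialed.
        ConnectStart: func(network, addr string) { connStart = time.Now() },
        ConnectDone: func(network, addr string, err error) {
            fmt.Printf("connect to %s took %s (err: %v)\n", addr, time.Since(connStart), err)
        },
        // GotConn fires for every request and reports whether the connection
        // was reused from the keep-alive pool.
        GotConn: func(info httptrace.GotConnInfo) {
            fmt.Printf("got connection: reused=%v wasIdle=%v idleTime=%s\n",
                info.Reused, info.WasIdle, info.IdleTime)
        },
    }

    req, err := http.NewRequestWithContext(
        httptrace.WithClientTrace(context.Background(), trace),
        http.MethodGet, "https://example.com/", nil)
    if err != nil {
        panic(err)
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    resp.Body.Close()
}

Issuing two requests in a row with the same client should report reused=true for the second one, unless something (an idle timeout, a Connection: close header, a server-side close) recycled the connection in between.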

Let me know if I can help more.

@RafaMalicia

Hi,

> Could you please explain how autoscaling works in your case? For example, in AWS, I think autoscaling works by adding new nodes to the load balancer definition, in which case DNS is not changed.
>
> In your case, are new nodes' IP addresses added to the DNS when autoscaling launches new nodes? Is load balancing handled by DNS?
>
> In the load balancer case, sticky sessions may cause unbalanced node loads. Also, load balancers sometimes use the source IP address in their load balancing strategy, which is not ideal for a load testing scenario if all load is generated from one IP address.
>
> Could you give more information about your setup, please?

Hello, I have exactly this issue. It seems that the sticky session option on AAG v2 is causing unbalanced application calls; is there any workaround I can use to prevent this? I'm using a single VM (one IP address) to run a local test against 2 web servers behind an AAG. Thanks in advance!

@codebien

Hi @RafaMalicia,

> It seems that the sticky session option on AAG v2 is causing unbalanced application calls

I'm not an expert on the Azure platform, but AWS seems to have a dedicated section about how to work around this case. The suggestions seem to be platform-agnostic, so you could consider applying the same strategy to Azure as well.

Check the "Manual rebalancing" section of the sticky sessions page in the AWS ALB docs.
