
Significantly high memory usage on 0.41.0? #2762

Closed
19shubham11 opened this issue Nov 9, 2022 · 12 comments

Comments

@19shubham11

Brief summary

We've been using k6 for a couple of months and were able to develop a test suite that gives us around 200k rps using 5 worker machines on GCP, each running the given suite. We use a ramping-vus executor, and this is how our config looks:

    export const options = {
        scenarios: {
            // the options/scenarios wrapper and the scenario name are added here for
            // completeness; the original snippet showed only the inner scenario object
            load: {
                executor: 'ramping-vus',
                gracefulStop: '1m',
                startVUs: 0,
                stages: [
                    { duration: '5m', target: 3000 },
                    { duration: '5m', target: 5000 },
                    { duration: '5m', target: 8000 },
                    { duration: '5m', target: 10000 },
                    { duration: '5m', target: 12000 },
                    { duration: '5m', target: 15000 },
                    { duration: '10m', target: 15000 },
                ],
                gracefulRampDown: '5m',
            },
        },
    };

The tests generally run for about 45 minutes and reach a maximum of 15,000 VUs.

Until yesterday we were running k6 version 0.40.0, and upon updating to 0.41.0, memory usage went up really high. We have a memory limit of 100 GB on the instance, and 0.41.0 reached 85% memory usage within 10 minutes of the test executing. I reverted to 0.40.0 and memory usage stayed at ~10% for the entire 45 minutes.

Is this a known issue, related to something introduced in the newer version, or maybe something got deprecated and I need to adjust the setup somehow?
Happy to provide more details if needed.

k6 version

0.41.0

OS

Debian GNU/Linux 11 (bullseye)

Docker version and image (if applicable)

No response

Steps to reproduce the problem

  • running a high-scale test (15k VUs) on 0.41.0 and on 0.40.0

Expected behaviour

  • memory usage should be roughly the same as on 0.40.0

Actual behaviour

  • significantly higher memory usage on 0.41.0
@19shubham11 19shubham11 added the bug label Nov 9, 2022
@na--
Member

na-- commented Nov 9, 2022

Can you share something about what your script actually does? Does it generate a lot of metrics with unique/high-cardinality tags? For example, do you have a ton of unique URLs or something like that?

If so, the high memory usage might be because of this change in k6 v0.41.0 and you may be able to ameliorate it by using URL grouping: https://k6.io/docs/using-k6/http-requests/#url-grouping

If not, then please share any other details about your script to help us diagnose what the issue might be.
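
For reference, a rough sketch of what URL grouping looks like in practice (example.com and the /users/${id} path below are placeholders, not taken from your script):

    import http from 'k6/http';

    export default function () {
        // stand-in for a path parameter that changes on every iteration
        const id = Math.floor(Math.random() * 100000);

        // setting the `name` tag groups all of these requests into a single time
        // series instead of one per unique URL; the single quotes are intentional,
        // so the tag value stays the literal text ".../users/${id}"
        http.get(`https://example.com/users/${id}`, {
            tags: { name: 'https://example.com/users/${id}' },
        });
    }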

@na-- na-- added the high prio and evaluation needed labels Nov 9, 2022
@19shubham11
Author

Oh interesting. Yes, the script actually has around 12 unique URLs, and a few with a path param that changes based on the previous response. They are mostly CRUD operations and are called in sequence over and over again (as you can see, it maxes out at 15k VUs).

So if I understand correctly based on https://k6.io/docs/using-k6/http-requests/#url-grouping, it will be generating unique metrics per URL? (i.e. users/1 and users/2 would be treated differently?)

I'm assuming a URL like /users/{:id} called 10k times will create 10k new metrics in the newer version? Any way to disable this?

@na--
Member

na-- commented Nov 9, 2022

So if I understand correctly based on https://k6.io/docs/using-k6/http-requests/#url-grouping, it will be generating unique metrics per URL? (i.e. users/1 and users/2 would be treated differently?)

I'm assuming a URL like /users/{:id} called 10k times will create 10k new metrics in the newer version? Any way to disable this?

Yes. Or, rather, it will create 10k (or more, if you have other differences in tags) time series.

This is probably the problem. Try to use the http.url helper (or manually set the name tag) for these requests and you should see your memory usage decrease significantly. Memory usage (with a reasonable number of time series) and the garbage collection CPU overhead should actually be lower than in v0.40.0 🤞
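
As a minimal sketch with the http.url helper, again with a placeholder endpoint (adapt the URL to your actual requests):

    import http from 'k6/http';

    export default function () {
        const id = Math.floor(Math.random() * 100000); // stand-in for the dynamic path param

        // the http.url template-literal helper derives a shared `name` tag from the
        // template itself, so every id ends up in the same time series
        http.get(http.url`https://example.com/users/${id}`);
    }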

@19shubham11
Author

19shubham11 commented Nov 9, 2022

Alright, thanks! I'll try adding the name tag to all of the paths and report back.

On an unrelated note, is it expected to have minor-ish "breaking" changes in normal releases? We just download the latest version, so we were unaware of the 0.41.0 release until today.

(I haven't really read the full release policy myself, so feel free to ignore this; it's not really a breaking change anyway, just a performance dip IMO. In any case, it's a great tool and I've loved using it so far.)

@19shubham11
Author

Added the name tag to the params and can confirm memory usage is no longer shooting up, thanks!

@na--
Member

na-- commented Nov 10, 2022

Awesome 🎉 Can you provide a rough estimate of how many unique URLs your script was hitting? Even 10k-15k unique URLs (and thus time series) shouldn't have caused such a huge increase in memory usage, according to our tests.

@19shubham11
Author

19shubham11 commented Nov 10, 2022

10-15k was just an example I gave 😅 so for some actual numbers: endpoints with path params (unique URLs) would be called around 10k/sec, so ~10000 * 60 * 45 = ~27M for the full 45-minute test that we run. But since on 0.41.0 we almost went OOM after around 10 minutes, that would be ~10000 * 60 * 10 = ~6M. So I'm assuming ~6M unique time series, and they kept adding up.

@na--
Member

na-- commented Nov 10, 2022

Ah, yeah, that would certainly do it 😅

Now that we can actually track the number of unique time series, we will probably add some sort of warning if a certain threshold is exceeded, e.g. 100k? 🤔 We'll need to do some benchmarking.

@19shubham11
Author

19shubham11 commented Nov 10, 2022

Yeah, I think that would be great. Logs might be hard to follow sometimes, but it would be something. (It's also not easy to hit those numbers on a local setup with limited CPU/memory, from what I've learnt.)

Is there a possibility to disable this time series tracking itself (or is something planned for the future)? I'm not really using these metrics too extensively; we just rely on the Prometheus metrics on the server side to validate our results, not on the load-test client.

@na--
Member

na-- commented Nov 10, 2022

Is there a possibility to disable this time series tracking itself (or is something planned for the future)?

Unfortunately you can't disable them and we probably won't add such a feature in the future, sorry 😞

It's not ideal, and it is a problem for some existing tests like yours, but on the other hand a whole bunch of core things that now work on top of the time series functionality are (or can be) way more efficient than before, and we also need time series for certain other features to be possible to implement at all.

And yeah, unfortunately, if there are millions of unique URLs in your test, you'd need to adjust your script slightly and add the name tag to group them, but it's a viable workaround. You needed to do that URL grouping with name even before, if you wanted to export your metrics to the k6 Cloud or InfluxDB, or basically any other output besides csv and json, or if you needed to set thresholds on the metrics from these requests. With the http.url helper it's not even that big of an overhead, it's just a template literal with a few extra characters. It sucks, but most other tools that deal with metrics also have cardinality restrictions, precisely for reasons similar to why we now need them... 😞
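
To make the "few extra characters" point concrete (placeholder URL again):

    import http from 'k6/http';

    export default function () {
        const id = __ITER; // any per-iteration value

        http.get(`https://example.com/users/${id}`);         // every unique id -> a new time series
        http.get(http.url`https://example.com/users/${id}`); // all ids grouped under one `name` tag
    }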

And for non-URL unique tags, we intend to have a JS API to support high-cardinality metric metadata in the future, i.e. basically something like tags that doesn't result in new time series being created for different values. Right now that part is only internal (i.e. usable from Go code in the core and xk6 extensions).

@19shubham11
Author

Yeah, totally makes sense. Thanks for the clarification and the quick help on this issue as well :)

@na--
Member

na-- commented Nov 10, 2022

I'll close this issue since I opened grafana/k6-docs#883, #2765 and #2766 for various parts of the things we touched here 😅
