Optimize memory consumption of Trend Metrics #1068

Sirozha1337 · 2019-06-29T14:13:34Z

Running a test that's 12 hours long crashes the program around 6 hours in, the memory consumption is tremendous, it can eat 64gb RAM. While searching for a memory leak, I've noticed, that the engine stores all samples for Trend metrics in memory until the end of the test. I think this is the main thing that leads to such a high memory consumption. I propose refactoring of the TrendSink, so that we wouldn't need to keep all the samples.

The Avg, Med and percentiles can be calculated by using Moving percentile algorithm, it only needs the last sample, new sample and current value. The only problem is that we have to know before-hand what percentiles we want to calculate, but looking at the current code there are only two percentiles in the output, they are 90th and 95th percentile.

na-- · 2019-06-30T05:46:52Z

Yeah, we know about the Trend issues with long-running tests... My proposal to completely solve this was to refactor the TrendSink to use HdrHistograms (#763), and/or for k6 to do smarter threshold and summary value calculations (#764) to reduce their impact. Unfortunately, those tasks have gotten delayed a few times, because of other bugs or important features 😞

As a potential workaround, there's currently a way to completely avoid using any TrendSinks in k6, but it only works if you're not interested in the end-of-test summary stats and you don't have any thresholds. This is done by running k6 with --no-thresholds --no-summary and is usually only useful when you use an external output like InfluxDB or Load Impact Insights.

Sirozha1337 · 2019-07-01T06:45:16Z

Your proposal is great, it would fix the memory leak and won't cost you much data. The thing is, right now we have all the data in memory and the only things that we need from that data are averages and percentiles. Maybe we can use the Rolling Average algorithm to fix it now? And if we want to show more data in the summary, we will implement the Hdr Histogram for that.

na-- · 2019-07-01T07:01:35Z

The averages are easy to calculate without storing all the values, that has never been the issue. Percentiles on the other hand are much trickier... I remember researching some memory-efficient percentile algorithms when I originally found the issue (I was searching for a memory leak 😅 ), but I then concluded that the HdrHistogram would be both easier to implement and more flexible. Can you share what specific moving percentile algorithm you recommend?

Sirozha1337 · 2019-07-01T07:12:55Z

I am talking about this algorithm: https://mjambon.com/2016-07-23-moving-percentile/

If xi<m_i−1 then
    m_i=m_{i−1}−δ/p
else if xi>mi−1 then
    m_i=m_{i−1}+δ/(1−p)
else xi=m_i−1 and keep the previous value:
    m_i=m_{i−1}

Here m_i is the current value of the percentile, p is the percentile itself and δ is a user chosen constant that expresses how fast the calculated percentile will converge to the real value, this constant matters only if you want to see realtime results, because in the end it converges to the real value anyway.

It seems much easier to implement than an Hdr Histogram, though not as flexible.

na-- · 2019-07-01T07:52:07Z

Ah, I remembered something else that precluded the use of such a moving percentile algorithm...To use it, we need to 1) know the specific percentiles we need to calculate from the start and 2) refactor the code quite a bit so that the various Trend sinks "know" from the start that they need to calculate these specific percentiles.

The first part, knowing the needed percentiles at the start of the test, is probably the harder one... There are currently two things that may require percentiles - the end-of-test summary stats and the thresholds.

The summary stats at the end of the test, as you've said, it's relatively easy - we know what we need. It's p(90) and p(95) by default, though that can be changed by the summaryTrendStats option, e.g. k6 run --summary-trend-stats="min,med,avg,max,p(99),p(99.9),p(99.99)" script.js. But in either case, we can easily know them at the start of the test.

The thresholds are the problem here... For example, if we have this k6 script:

import http from "k6/http";

export let options = {
  thresholds: {
    http_req_duration: ["p(95)<500"],
  }
};

export default function () {
  http.get("https://httpbin.org/");
}

k6 won't actually parse the expression p(95)<500 properly, in a way that we'll know that we need p(95). Instead, in the local k6 execution, we actually evaluate the thresholds by executing them in a restricted JS runtime that has the sink as a local variable and p() is just a simple helper function there: https://github.com/loadimpact/k6/blob/bdd417e8a4206c6bbbef36d7ce7bea48f34a0804/stats/thresholds.go#L33-L35

I assume that this was done as a quick fix to get thresholds working, since there was already a JS runtime in k6 anyway... But it prevents us from properly reasoning about what percentile values we'll need at the start of the test... 🤦‍♂️ There's a separate issue to fix this (#961), but we can't use the simple moving percentiles algorithms before we do... So, the HdrHistogram seemed (and it still does) like the simpler solution we can just plug into the current k6 architecture (because with it we won't need to know the percentiles we require, among other benefits) and it will allow us to fix the thresholds architecture separately, at a later time.

Sirozha1337 · 2019-07-01T08:24:14Z

Ah, that's what I was missing. I didn't take into account that you can specify any percentile in the thresholds.
Looks like HdrHistogram is the best solution then. Sorry for wasting your time and thank you for explanation, now the code makes more sense to me.

na-- · 2021-01-27T07:33:54Z

Closing this in favor of #763

na-- added bug duplicate performance labels Jun 30, 2019

na-- mentioned this issue Jul 10, 2019

Use HDR histograms for calculating percentiles in thresholds and summary stats #763

Open

na-- added this to the v1.0.0 milestone Aug 27, 2019

This was referenced Sep 2, 2019

Threshold improvements #1136

Closed

Investigate telegraf integration in k6 #1064

Closed

na-- mentioned this issue Oct 27, 2019

Track and improve k6 memory usage and performance #1167

Open

na-- mentioned this issue Dec 10, 2019

looks like there is memory leak in k6 #1267

Closed

mstoykov mentioned this issue Jul 12, 2020

Suspected memory leak #1549

Closed

na-- closed this as completed Jan 27, 2021

na-- removed this from the v1.0.0 milestone Feb 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize memory consumption of Trend Metrics #1068

Optimize memory consumption of Trend Metrics #1068

Sirozha1337 commented Jun 29, 2019 •

edited

na-- commented Jun 30, 2019

Sirozha1337 commented Jul 1, 2019

na-- commented Jul 1, 2019

Sirozha1337 commented Jul 1, 2019 •

edited

na-- commented Jul 1, 2019

Sirozha1337 commented Jul 1, 2019

na-- commented Jan 27, 2021

Optimize memory consumption of Trend Metrics #1068

Optimize memory consumption of Trend Metrics #1068

Comments

Sirozha1337 commented Jun 29, 2019 • edited

na-- commented Jun 30, 2019

Sirozha1337 commented Jul 1, 2019

na-- commented Jul 1, 2019

Sirozha1337 commented Jul 1, 2019 • edited

na-- commented Jul 1, 2019

Sirozha1337 commented Jul 1, 2019

na-- commented Jan 27, 2021

Sirozha1337 commented Jun 29, 2019 •

edited

Sirozha1337 commented Jul 1, 2019 •

edited