
Prometheus remote write flush time grows #3498

Closed
github-vincent-miszczak opened this issue Dec 11, 2023 · 5 comments

@github-vincent-miszczak

Brief summary

I'm using k6 with the https://github.com/mostafa/xk6-kafka extension to run some tests against Redpanda/Kafka.

Output is configured to use https://k6.io/docs/results-output/real-time/prometheus-remote-write/.

While running the test, I observe that after some time, I get warnings from k6 like:

WARN[1340] Successful flushed time series to remote write endpoint but it took 5.045489743s while flush period is 5s. Some samples may be dropped.  nts=24 output="Prometheus remote write"
[...]
WARN[1578] Successful flushed time series to remote write endpoint but it took 6.11823145s while flush period is 5s. Some samples may be dropped.  nts=24 output="Prometheus remote write"
[...]
WARN[2161] Successful flushed time series to remote write endpoint but it took 8.008273492s while flush period is 5s. Some samples may be dropped.  nts=24 output="Prometheus remote write"
[...]
WARN[3367] Successful flushed time series to remote write endpoint but it took 12.510661483s while flush period is 5s. Some samples may be dropped.  nts=24 output="Prometheus remote write"

The Prometheus backend I use is very capable: it's a production-grade Mimir cluster that handles millions of series and samples per second.

As time goes on, the flush time grows, which is unexpected because my bench produces only 24 series.

k6 version

v0.47.0

OS

Amazon Linux

Docker version and image (if applicable)

No response

Steps to reproduce the problem

Run k6 with a test script, writing to Prometheus, and wait for a while, e.g.:

k6 -o experimental-prometheus-rw run producer.js

Expected behaviour

Flush time to Prometheus stays constant and there are no warnings.

Actual behaviour

Flush time grows over time, and k6 outputs warnings.

@joanlopez
Contributor

joanlopez commented Dec 18, 2023

Hello @github-vincent-miszczak,

Thanks for reporting this issue. I'd like to help, but I'm struggling to reproduce it.

I tried to reproduce this issue with both a very simple k6 test script and a simple k6 test script that uses xk6-kafka, heavily inspired by the docs there. I ran both for up to 30m against the Prometheus (Mimir) instance from my personal Grafana Cloud account, with no warnings.

Have you been able to reproduce this issue on demand, thus confirming it wasn't a temporary one?
If so, could you share more details so I can reproduce it, please?

As soon as I reproduce it, I'll be very happy to work on it in order to ship a fix!

Reproduced! Will try to bring some clarity sooner rather than later.

Thanks!

@kullanici0606

Hi,

I know you are already investigating the issue, but in order to familiarize myself with the codebase, I also wanted to look into it. I hope this helps you solve the issue faster.

I have profiled the code, and it seems that because TrendSink saves all the data it has ever seen, percentile calculations take more and more time as the test runs:

(profiler screenshot)

As can be seen from heap dumps taken at different times, TrendSink grows as time passes:

heap dump top 3 at t1
     flat  flat%   sum%        cum   cum%
   64.97MB 53.42% 53.42%    64.97MB 53.42%  go.k6.io/k6/metrics.(*TrendSink).Add
    9.51MB  7.82% 61.24%     9.51MB  7.82%  github.com/mostafa/xk6-kafka.(*Kafka).reportWriterStats
    6.51MB  5.35% 66.59%     6.51MB  5.35%  github.com/dop251/goja.(*baseObject).setOwnStr

heap dump top 3 at t2
     flat  flat%   sum%        cum   cum%
  158.18MB 59.49% 59.49%   158.18MB 59.49%  go.k6.io/k6/metrics.(*TrendSink).Add
   40.04MB 15.06% 74.54%    40.04MB 15.06%  github.com/mostafa/xk6-kafka.(*Kafka).reportWriterStats
   10.51MB  3.95% 78.50%    10.51MB  3.95%  github.com/dop251/goja.(*baseObject).setOwnStr

heap dump top 3 at t3
    flat  flat%   sum%        cum   cum%
  309.31MB 74.15% 74.15%   309.31MB 74.15%  go.k6.io/k6/metrics.(*TrendSink).Add
   34.03MB  8.16% 82.31%    34.03MB  8.16%  github.com/mostafa/xk6-kafka.(*Kafka).reportWriterStats
    8.50MB  2.04% 84.35%     8.50MB  2.04%  github.com/dop251/goja.(*baseObject)._put

heap dump top 3 at t4
     flat  flat%   sum%        cum   cum%
 1485.70MB 84.61% 84.61%  1485.70MB 84.61%  go.k6.io/k6/metrics.(*TrendSink).Add
  149.14MB  8.49% 93.11%   149.14MB  8.49%  github.com/mostafa/xk6-kafka.(*Kafka).reportWriterStats
   28.01MB  1.60% 94.70%    28.01MB  1.60%  github.com/dop251/goja.(*baseObject).setOwnStr

(profiler screenshot)
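For reference, profile tops in this flat/cum format come from the standard Go tooling, e.g.:

go tool pprof -top ./k6 heap.pprof

(the binary and profile file names above are just illustrative).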

Here is where the percentile calculation is called:

k6/metrics/sample.go

Lines 149 to 153 in f49b98a

dynamicResolver := func(percentile float64) func(s *TrendSink) float64 {
	return func(s *TrendSink) float64 {
		return s.P(percentile / 100)
	}
}

TrendSink has to sort the values it has recorded to calculate a percentile; however, as time passes, the values slice grows, and sorting takes more and more time:

sort.Float64s(t.values)
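For illustration only, here is a standalone sketch (not the k6 source) of a sink that appends every sample and sorts the whole slice on each percentile query; the per-query time keeps growing as samples accumulate across flush periods:

package main

import (
	"fmt"
	"math/rand"
	"sort"
	"time"
)

// percentile sorts the full slice and returns the value at the requested rank,
// mimicking a sort-on-demand trend sink (illustrative, not the k6 code).
func percentile(values []float64, pct float64) float64 {
	sort.Float64s(values)
	return values[int(pct*float64(len(values)-1))]
}

func main() {
	var values []float64
	for flush := 1; flush <= 5; flush++ {
		// Each "flush period" appends more samples; nothing is ever truncated.
		for i := 0; i < 2_000_000; i++ {
			values = append(values, rand.Float64())
		}
		start := time.Now()
		_ = percentile(values, 0.95)
		fmt.Printf("flush %d: %d samples, p95 took %v\n", flush, len(values), time.Since(start))
	}
}

Each successive p95 query has more data to sort, which mirrors the growing flush times in the warnings above.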

TrendSink.Add never truncates old data:

k6/metrics/sink.go

Lines 108 to 124 in f49b98a

func (t *TrendSink) Add(s Sample) {
	if t.count == 0 {
		t.max, t.min = s.Value, s.Value
	} else {
		if s.Value > t.max {
			t.max = s.Value
		}
		if s.Value < t.min {
			t.min = s.Value
		}
	}
	t.values = append(t.values, s.Value)
	t.sorted = false
	t.count++
	t.sum += s.Value
}

For a long running load test, this can cause the problem described in the issue.

What is the best way to truncate old data?
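One hypothetical direction, purely as a sketch of a general technique and not an API k6 provides, would be to keep a fixed-size reservoir sample instead of every value, trading exact percentiles for bounded memory and bounded sort cost:

package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// reservoirSink keeps at most `capacity` values via classic reservoir sampling,
// so percentiles become approximate but memory and sort time stay bounded.
type reservoirSink struct {
	capacity int
	seen     int64
	values   []float64
}

func (r *reservoirSink) add(v float64) {
	r.seen++
	if len(r.values) < r.capacity {
		r.values = append(r.values, v)
		return
	}
	// Replace a random slot with probability capacity/seen.
	if j := rand.Int63n(r.seen); j < int64(r.capacity) {
		r.values[j] = v
	}
}

func (r *reservoirSink) p(pct float64) float64 {
	if len(r.values) == 0 {
		return 0
	}
	s := append([]float64(nil), r.values...) // sort a copy, keep the reservoir untouched
	sort.Float64s(s)
	return s[int(pct*float64(len(s)-1))]
}

func main() {
	sink := &reservoirSink{capacity: 10_000}
	for i := 0; i < 5_000_000; i++ {
		sink.add(rand.Float64())
	}
	fmt.Printf("approx p95 over %d observed values, %d kept: %.3f\n",
		sink.seen, len(sink.values), sink.p(0.95))
}

HDR/DDSketch-style histograms are another common way to bound per-metric state with better accuracy guarantees; either way, the key point is that the sink's state stops growing without bound.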

@joanlopez
Contributor

joanlopez commented Dec 19, 2023

That's it @kullanici0606, thanks for such a great summary!

I know you are already investigating the issue, but in order to familiarize myself with the codebase, I also wanted to look into it. I hope this helps you solve the issue faster.

Absolutely, really appreciated. Indeed, it's great to see that we both reached the same conclusion.


So, yeah, the root cause of this issue is the Trend metrics (the xk6-kafka extension defines some), which hold all the metric samples collected during the whole execution. This causes huge memory usage in long-running tests (find more information here and here) and makes the flush process slower as the test goes on (as detailed by @kullanici0606).

In this case, the only recommendation I know of is to use the Native Histograms support described here. In fact, I've re-run some of the tests that helped me reproduce this issue with K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true, and although I have to admit it didn't bring any significant reduction in memory usage, it definitely helped with the flush issues (no more warnings).
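For reference, assuming the same producer.js script from the reproduction steps, such a run would look something like:

K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true k6 -o experimental-prometheus-rw run producer.js

(note that the remote-write target also has to accept native histograms).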

So, could you @github-vincent-miszczak check if that also works for you, please? Thanks!

@github-vincent-miszczak
Author

Hi!
Thank you for the amazing investigation! Unfortunately, my upstream Prometheus does not support native histograms, and as it's a production cluster, I can't easily enable them.
Standing up another backend just for testing would be too time-consuming for me; I have already moved away from k6 for my bench, because of this issue among others with xk6-kafka.
Thank you for providing a workaround for people who can use native histograms, and thank you for the detailed explanation. Hopefully the root issue can be fixed.

@joanlopez
Contributor

joanlopez commented Dec 20, 2023

Hey @github-vincent-miszczak,

I'm sad to read that. I hope Native Histograms become more stable and widely adopted soon, so that more users with similar issues can rely on them, and I also hope the team finds time to work on the overall memory improvements. Ideally, at some point you can reconsider k6, or at least use it for other sorts of tests.

I'm going to close this issue for now, as I don't see any remaining action here beyond the aforementioned issues, which will stay open. Please feel free to open another issue if you have any other requests.

Thanks!
