Prometheus remote write flush time grows #3498
Comments
Hello @github-vincent-miszczak, thanks for reporting this issue.
Reproduced! Will try to bring some clarity sooner rather than later. Thanks!
Hi, I know you are already investigating the issue, but in order to familiarize myself with the codebase I wanted to look into it as well; I hope this helps you solve it faster. I have profiled the code, and heap dumps taken at different points in time show that memory usage keeps growing as the test runs: the percentile calculation operates on a list of samples that only ever grows.

Here the percentile calculation is called: Lines 149 to 153 in f49b98a

Line 135 in f49b98a

Lines 108 to 124 in f49b98a

For a long-running load test, this can cause the problem described in the issue. What is the best way to truncate old data?
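To make the mechanism concrete, here is a minimal, self-contained Go sketch (not the actual k6 code; the trendSink type and its methods are made up for illustration) of why per-flush cost grows when percentiles are recomputed over every sample collected so far:

```go
// Sketch: a Trend-style sink keeps every sample ever observed, and each
// flush sorts the whole slice to derive percentiles. With a constant
// sample rate, flush number n works over n flush periods' worth of data,
// so flush time keeps growing for the lifetime of the test.
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"time"
)

type trendSink struct {
	values []float64 // grows unbounded over the test run
}

func (s *trendSink) Add(v float64) { s.values = append(s.values, v) }

// Percentile sorts all accumulated samples on every call, which is
// O(n log n) in the total number of samples, not in one flush period.
func (s *trendSink) Percentile(p float64) float64 {
	sorted := make([]float64, len(s.values))
	copy(sorted, s.values)
	sort.Float64s(sorted)
	return sorted[int(p*float64(len(sorted)-1))]
}

func main() {
	sink := &trendSink{}
	for flush := 1; flush <= 5; flush++ {
		// Simulate the samples collected during one flush period.
		for i := 0; i < 200_000; i++ {
			sink.Add(rand.Float64() * 100)
		}
		start := time.Now()
		p99 := sink.Percentile(0.99)
		fmt.Printf("flush %d: %d samples, p99=%.2f, took %s\n",
			flush, len(sink.values), p99, time.Since(start))
	}
}
```

A bounded reservoir or a streaming quantile estimator would cap the per-flush work, which is one possible direction for the truncation question above.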
That's it @kullanici0606, thanks for such a great summary!
Absolutely, really appreciated. Indeed, it's great to see that we both reached the same conclusion. So the root cause of this issue is the Trend metrics (the xk6-kafka extension defines some), which hold all the metric samples collected during the whole execution. That causes huge memory usage in long-running tests (find more information here and here), and it also makes the flush process slower as the test goes on (as detailed by @kullanici0606).

In this case, the only recommendation I know of is to use the Native Histograms support described here. In fact, I've executed some of the tests that helped me reproduce this issue, but with Native Histograms enabled, and in that mode the flush time stayed stable for me.

So, could you @github-vincent-miszczak check whether that also works for you, please? Thanks!
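For context on why that helps, below is a rough sketch of the aggregation model behind histograms: a fixed set of bucket counters replaces the unbounded sample list, so memory and per-flush work stay constant however long the test runs. Fixed buckets are used only to keep the example short (Prometheus native histograms use sparse exponential buckets), and the histogramSink type is made up. If I read the docs right, the toggle for the experimental-prometheus-rw output is the K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true environment variable, but double-check that against the docs for your k6 version.

```go
// Sketch: a histogram sink only increments counters, so adding a sample
// costs the same whether it is the first or the billionth.
package main

import "fmt"

type histogramSink struct {
	bounds []float64 // upper bucket bounds, assumed sorted ascending
	counts []uint64  // one counter per bucket, plus an overflow bucket
	sum    float64
	count  uint64
}

func newHistogramSink(bounds []float64) *histogramSink {
	return &histogramSink{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// Add is O(number of buckets), independent of how many samples came before.
func (h *histogramSink) Add(v float64) {
	i := 0
	for i < len(h.bounds) && v > h.bounds[i] {
		i++
	}
	h.counts[i]++
	h.sum += v
	h.count++
}

func main() {
	h := newHistogramSink([]float64{10, 50, 100, 500, 1000})
	for _, v := range []float64{3, 42, 730, 12000} {
		h.Add(v)
	}
	fmt.Printf("count=%d sum=%.0f buckets=%v\n", h.count, h.sum, h.counts)
}
```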
Hi!
I'm sad to read such news. I hope Native Histograms become more stable and widely adopted soon, so that more users with similar issues can rely on them, and I also hope the team finds time to work on the overall memory improvements. So, ideally, at some point you can reconsider k6, or at least use it for other sorts of tests.

I'm going to close this issue for now, as I don't see any other remaining action here apart from the aforementioned issues, which will remain open. Please feel free to open another issue if you have any other requests. Thanks!
Brief summary
I'm using k6 with the https://github.com/mostafa/xk6-kafka extension to run some tests against Redpanda/Kafka. Output is configured to use https://k6.io/docs/results-output/real-time/prometheus-remote-write/.
While running the test, I observe that after some time I get warnings from k6 that flushing is taking too long. The Prometheus backend I use is very capable: it's a production-grade Mimir cluster that manages millions of series and samples per second.
As time goes on, flush time grows, and that's unexpected because there are only 24 series from my bench.
k6 version
v0.47.0
OS
Amazon Linux
Docker version and image (if applicable)
No response
Steps to reproduce the problem
Run k6 with some tests writing to Prometheus, and wait for a while, e.g.:
k6 run -o experimental-prometheus-rw producer.js
Expected behaviour
Flush time to Prometheus is constant and there are no warnings.
Actual behaviour
Flush time grows over time, and k6 outputs warnings.