Suspected memory leak #1549

Closed
georgekcchung opened this issue Jul 12, 2020 · 11 comments

@georgekcchung

georgekcchung commented Jul 12, 2020

Environment

  • k6 version: k6 v0.26.2 (2020-03-18T11:45:39+0000/v0.26.2-0-g459da79e, go1.13.8, linux/amd64)
  • OS and version: CentOS Linux release 7.7.1908

Expected Behavior

The k6 process's memory usage should stay roughly constant.

Actual Behavior

The k6 memory usage keeps increasing until the process is killed by the OS due to OOM.

Steps to Reproduce the Problem

command line to run k6: ./bin/k6 run -u 100 -d 48h --insecure-skip-tls-verify --discard-response-bodies -q --no-connection-reuse --no-vu-connection-reuse --no-color run-https.js

Content of run-https.js:

import http from "k6/http";
import crypto from 'k6/crypto';
import { Counter } from 'k6/metrics';

let allErrors = new Counter('error_counter');
let allSuccess = new Counter('success_counter');

export let options = {
  discardResponseBodies: true,
};

export default function() {

    const url = 'https://192.16.100.133:8443/1.0/logging';
    const auth = true;
    const behindLoadBalancer = true;

    let data = { time: 492363409673, app: "abc", "logs": [{
            "level": "critical",
            "message": "blah blah",
            "clientUserId": "abc"
        },
        {
            "level": "critical",
            "message": "blah blah 2",
            "clientUserId": "abc"
        }]};

    let headers = {};

    if ( auth ) {
        const secret = "sddsfefsdgsdfg";
        const appid = "app1";
        const GMTDate = "Thu, 09 Jul 2020 04:40:00 GMT"

        let combinedData = GMTDate + secret + JSON.stringify(data);
        let hash = crypto.sha256(combinedData, 'hex');

        headers = {'Content-Type': 'application/json', 'Date' : GMTDate, 'X-AAA-AUTH' : hash, 'X-AAA-APPID': appid};
    } else {
        headers = {'Content-Type': 'application/json'};
    }

    if ( behindLoadBalancer ) {
        let ip = (Math.floor(Math.random() * 255) + 1)+"."+(Math.floor(Math.random() * 255))+"."+(Math.floor(Math.random() * 255))+"."+(Math.floor(Math.random() * 255));
        headers["X-Forwarded-For"] = ip;
    }

    let res = http.post(url, JSON.stringify(data), {headers: headers});
    if (res.status >= 400) {
        allErrors.add(1);
    } else {
        allSuccess.add(1);
    }
};
@mstoykov
Collaborator

Hi @georgekcchung, I suspect you are hitting a combination of Go's GC not being able to keep up with the garbage generation, some of the things k6 doesn't (yet) do to make that less likely, and a script that generates a lot of objects :).

  1. Can you provide us with a timeline? E.g. k6 starts with X MB of usage, goes to Y MB after 1 hour, and dies after Z hours, but the machine has W MB free, so it used all of that?
  2. It's a good thing you are using discardResponseBodies, as that ... helps :D
  3. As far as I can see you never change data, but you stringify it up to two times on each iteration, which generates more objects that then need to be GCed (see the sketch after this list).
  4. Using compatibility-mode=base will significantly reduce the amount of memory each VU requires, which in turn significantly reduces the amount of work the GC needs to do constantly; that alone will likely be good enough for you.
  5. In general, we recommend having at least a little bit of sleep at the end of each iteration, both so the GC has time to catch up and to have some pacing of the requests. With the soon-to-be-released v0.27.0 and the infamous 1007 PR (not linking it, as it is huge with references already), there will be better ways for pacing ;)
  6. If none of the above help (or aren't viable): the output summary in k6 (IMO) isn't useful for more than the simplest of checks, and currently the Trend metrics do kind of ... leak memory (there are more issues which are linked), so I advise using another metrics output in combination with --no-thresholds --no-summary, so that k6 does not save metrics indefinitely. I would recommend the json or csv one (the latter is apparently not documented 🤦) if you have a lot of results (or the cloud, for that matter ;) ), as the other outputs generally get overwhelmed. The json one supports adding .gz at the end of the file name (the csv one turned out not to), which automatically gzips the output and saves A LOT of space ;)
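
To illustrate points 3 and 5, here is a minimal sketch of the script above with the constant request body and auth hash hoisted into the init context and a short sleep at the end of each iteration (this assumes, as in the posted script, that the payload, secret and date stay constant for the whole test):

import http from "k6/http";
import crypto from 'k6/crypto';
import { sleep } from 'k6';
import { Counter } from 'k6/metrics';

let allErrors = new Counter('error_counter');
let allSuccess = new Counter('success_counter');

export let options = { discardResponseBodies: true };

// The payload never changes, so stringify and hash it once in the init context
// instead of on every iteration.
const body = JSON.stringify({
    time: 492363409673,
    app: "abc",
    logs: [
        { level: "critical", message: "blah blah", clientUserId: "abc" },
        { level: "critical", message: "blah blah 2", clientUserId: "abc" },
    ],
});
const GMTDate = "Thu, 09 Jul 2020 04:40:00 GMT";
const hash = crypto.sha256(GMTDate + "sddsfefsdgsdfg" + body, 'hex');
const baseHeaders = { 'Content-Type': 'application/json', 'Date': GMTDate, 'X-AAA-AUTH': hash, 'X-AAA-APPID': 'app1' };

export default function() {
    // Only the per-iteration part (the random X-Forwarded-For header) is built here.
    let headers = Object.assign({}, baseHeaders);
    headers["X-Forwarded-For"] = (Math.floor(Math.random() * 255) + 1) + "." +
        Math.floor(Math.random() * 255) + "." +
        Math.floor(Math.random() * 255) + "." +
        Math.floor(Math.random() * 255);

    let res = http.post('https://192.16.100.133:8443/1.0/logging', body, { headers: headers });
    if (res.status >= 400) {
        allErrors.add(1);
    } else {
        allSuccess.add(1);
    }

    sleep(1); // point 5: gives the GC time to catch up and paces the requests
}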

@georgekcchung
Author

@mstoykov , thanks for your swift reply.

  1. Here is the timeline for your reference:
    Time, memory usage
    8:48, 615M
    8:54, 784M
    9:04, 1112M
    9:11, 1567M
    9:19, 1699M
    9:33, 2355M
    10:01, 3374M

My testing machine has 12G of RAM. I didn't witness the death of the process, but I saw a "killed" message in the command prompt. I expect it used all the memory and was killed by the OS.

  2. The above test used only one stringify.

  3. compatibility-mode=base breaks my test script. I will figure out how to correct it later.

  4. I count on k6 to give me the best load test result. If I add sleep in the script, the load test result will be worse than the actual result.

  5. I will try it later.

Thanks.

@mstoykov
Collaborator

@georgekcchung any update on this? Can we close this or do you still think there is a memory leak?

@luckybroman5

luckybroman5 commented Oct 2, 2020

I do think there is a memory leak here. Consider the following:

'use strict'

var ws = require("k6/ws")
var k6 = require("k6")
var check = k6.check
var sleep = k6.sleep
var Counter = require('k6/metrics').Counter

var FetchRecordsSuccessRate = new Counter('FetchRecords_SuccessRate');
var connections = new Counter('Client_Read_Connections');
var RecordNumber = new Counter("RecordNumber");
var ErrorCounter = new Counter("ErrorCounter");

module.exports.FetchRecordsSuccessRate = FetchRecordsSuccessRate
module.exports.connections = connections
module.exports.RecordNumber = RecordNumber
module.exports.ErrorCounter = ErrorCounter

module.exports.default = function() {
  var url = "ws://" + TARGET_HOST + "/longpoll/" + TOPIC

  var res = ws.connect(url, {}, function(socket) {
    var t = Math.floor(Math.random() * 5000)
    socket.setTimeout(function() {
      console.log(t, "milliseconds passed, closing the socket");
      socket.close();
    }, t);

    socket.on("open", function() { console.log("connected") });

    socket.on("message", function(data) {
      try {
        if (data && data.length) {
          console.log(data)
          var num = parseInt(data, 10)
          RecordNumber.add(Number(num))
        }
      } catch (e) {
        console.log(e)
        ErrorCounter.add(true)
      }

      FetchRecordsSuccessRate.add(!!data)
    });

    socket.on("close", function() { console.log("disconnected") });
  });

  sleep(1)
}

This is a very simple test that just downloads data from a websocket. I've followed the advice of @mstoykov, and it didn't seem to yield any results. For instance:

[screenshot: graph of k6 memory usage over time for the three test runs described below]

The first spike is with no changes, the second is with --discard-response-bodies, and the third is with response bodies discarded along with base compatibility mode. Also note that I'm looking at the slope of the lines; if the tests had run for equal amounts of time, they'd have consumed the same amount of RAM.

My machine has 32 GB of RAM, so seeing k6 take up that much with 300 VUs and never GC makes me really think there is a memory leak. It sucks that I'm having this problem. It's such a small roadblock, yet it might be expensive to overcome!

Any ideas? Thanks in advance!

@mstoykov
Collaborator

mstoykov commented Oct 2, 2020

Hi @luckybroman5,

I did try to reproduce this using this script and the server.go implementation. With -u 300 -d 10m it stabilized at 1590 MB (after around 1-2 minutes) without --compatibility-mode=base and at 217 MB with it (after a few seconds), and stayed there for the following 2-3 minutes, at which point I stopped it as I didn't see a reason to continue.

This was using v0.28.0. It should be noted that pre-v0.27.0 there were a lot of issues in the ws package, and it is far from "good" even now, but at least memory leaking no longer seems to be one of them. Running with v0.26.2 it does seem to be leaking, or at least to be using more memory, but not at the rate you are showing - I got to 600 MB after 4 minutes with --compatibility-mode=base.

Please provide more information (after updating to v0.28.0, if you are not already using it), such as:

  • the actual parameters you are running it with
  • how big the messages are
  • how many messages/s you are receiving (I get around 20k+/s)
  • anything else that you find relevant

It would be best if you can make it break reliably and share the script and a server implementation, as I did. I can help you with both, but I will need more information :).

P.S. I removed the logging as it was really noisy, and in my experience that also increases the request rate and, with it, the memory usage :)

@luckybroman5

luckybroman5 commented Oct 2, 2020

@mstoykov Sure.

There are about 300 messages/sec coming in over the websocket, and the messages are relatively small:
2020-10-2 9:54 123123213123 Foooooooo Barrrrrr

Here are the flags I'm running my script with: --vus 300 --compatibility-mode=base --duration 15m --discard-response-bodies --out influxdb=my.influxdb.host

I'm also running k6 via docker.

Hopefully that helps. I will also try and remove the logging.

@mstoykov
Collaborator

mstoykov commented Oct 2, 2020

Can you try without the --out influxdb?

@luckybroman5

@mstoykov that seems to have done it! Unfortunately, though, not being able to use influxdb is not ideal for my tests. Is there anything I can do to help?

@mstoykov
Collaborator

mstoykov commented Oct 3, 2020

Run with -v, but with fewer console.logs, in order to see the influxdb debug information.
I expect that what is happening is that influxdb isn't ingesting metrics as fast as k6 produces them, so in the end k6 just holds too many metrics in memory while waiting for influxdb to ingest them.
I can propose that you:

  • Demote some tags to fields in influxdb
  • Play with the system tags - you can likely remove at least half of them without any kind of problem for your test (see the sketch after this list)
  • Try to play with the options added in this PR.
  • Maybe you can increase the influxdb performance as well
  • You can also obviously drop some of your custom metrics, if none of the above works
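
For the system-tags point, here is a minimal sketch of what that could look like in the script above; the particular subset of tags is only an illustration and should be whatever you actually query in influxdb (systemTags is a regular k6 option, and setting it replaces the default set):

// Keep only the system tags you actually use in your influxdb queries/dashboards;
// every tag on every data point costs memory in k6 and cardinality in influxdb.
// The selection below is an example, not a recommendation.
module.exports.options = {
  systemTags: ['status', 'subproto', 'group', 'check', 'error'],
}

For the tags-to-fields point, the influxdb output has a tags-as-fields setting (if I remember the name correctly, the K6_INFLUXDB_TAGS_AS_FIELDS environment variable), which keeps high-cardinality values out of the influxdb index.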

@mstoykov
Collaborator

mstoykov commented Oct 8, 2020

@luckybroman5 did any of this help you?

@mstoykov
Collaborator

mstoykov commented Nov 6, 2020

I am closing this as the original user hasn't responded in quite some time, and the problem seems to have been that the script was using too much memory, not that k6 has a memory leak ...

@mstoykov mstoykov closed this as completed Nov 6, 2020