Backend memory leak #514

Open
chibenwa opened this issue May 6, 2024 · 8 comments
Labels: bug (Something isn't working), tech

chibenwa (Member) commented May 6, 2024

Describe the bug

Over time, tdrive consumes more and more memory.

At 10 am:


root@tdrive-backend-5d8bc8d946-t2z4p:/usr/src/app# ps ux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          20  0.9  8.6 3227084 1327243 ?     Sl   May04  23:36 node --unhandled-rejections=warn --max-http-header-size=30000 dist/server.js

At 12 pm:

root@tdrive-backend-5d8bc8d946-t2z4p:/usr/src/app# ps ux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          20  0.9  8.6 3227084 1414260 ?     Sl   May04  23:36 node --unhandled-rejections=warn --max-http-header-size=30000 dist/server.js

This eventually causes the 2 GB quota to be exceeded and the pod to be OOM-killed:

Screenshot from 2024-05-06 12-04-31

Screenshot from 2024-05-06 12-04-49

To Reproduce

Run and monitor Twake Drive in production.
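For illustration only (the PID below is the one from the ps output above), the resident set size of the server process can be sampled periodically from inside the container:

# Print the RSS (in kB) of the node process every 5 minutes
while true; do ps -o rss= -p 20; sleep 300; done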

Expected behavior

No memory leak: I expect tdrive's memory consumption to reach a stable point.

chibenwa added the bug label on May 6, 2024
chibenwa (Member Author) commented May 6, 2024

The memory graph clearly shows that memory keeps increasing and is never released.

Note that the drop at the end of the period is a K8s reboot.

shepilov added the tech label on May 6, 2024
ericlinagora (Contributor) commented:

This is normal behaviour for node: it has its own managed memory and tries to use the maximum of it. The machine it's hosted on must have more RAM than that maximum GC memory. We can either reduce node's space (e.g. with --max-old-space-size) or increase the memory of the host machine (which I recommend, because node's default is already quite conservative).
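For illustration, a minimal sketch of the first option, reusing the start command visible in the ps output above; the 1536 MB value is only an example and would need to be tuned to the container's actual limit:

# Cap V8's old-generation heap (value in MB) so it stays below the container quota
node --max-old-space-size=1536 --unhandled-rejections=warn --max-http-header-size=30000 dist/server.js

# The same cap can be injected through the environment instead of editing the command
export NODE_OPTIONS="--max-old-space-size=1536"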

chibenwa (Member Author) commented May 7, 2024

Hi @ericlinagora

I come from the Java world, and the mail servers operate on the CNB architecture with a 2 GB heap without issue (of course, the -Xmx setting defining the memory used by the JVM is set in accordance with the Kubernetes limits).

So yes, Node.js does not run alone there, and we need to set both limits in a reasonable way.

I do think that a couple of GBs should be enough for a not-that-stressed web app.

I encourage you to update the tdrive Helm charts accordingly (a sketch follows at the end of this comment).

Second, again coming from the Java world, an over-sized old generation is suspicious at best, especially when implementing a REST API, which is supposed to employ a per-request, disconnected model.

Is tdrive doing caching? Improperly configured caching can lead to such behaviour...

Finally, I am not a node expert, but some other node applications we have around do not exhibit such abnormal behaviour: to mention the OpenPaaS nodes on the same deployment, they stay well below 500 MB.

All this points toward TDrive's memory needing better management.
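A purely illustrative sketch of keeping both limits consistent; the deployment name tdrive-backend and the numbers are assumptions, not values taken from the actual Helm chart:

# Container memory limit (what the kernel's OOM killer enforces)
kubectl set resources deployment/tdrive-backend --limits=memory=2Gi
# Node heap cap, kept a few hundred MB below the container limit to leave room
# for buffers and native allocations outside the V8 heap
kubectl set env deployment/tdrive-backend NODE_OPTIONS="--max-old-space-size=1536"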

chibenwa (Member Author) commented May 7, 2024

Please proceed to a memory dump and analyse the memory content.

(If it were Java, I would happily do a heap dump and analyse it with VisualVM...)

ericlinagora (Contributor) commented May 7, 2024

Hi,
So, like I said, the V8 GC is only remotely comparable to the JVM. If there were a memory leak, it would exhaust the space inside node and cause a node error. If the OOM killer is killing node, as here, it's a mismatch: node's maximum memory is bigger than the host's. You can take a look at the internal memory as described at https://nodejs.org/en/learn/diagnostics/memory/using-heap-snapshot, but I don't think it will be very useful; nor is looking at node's process memory from the outside, since all you would see is indeed what looks like a memory leak. Think of it like looking at the memory used by a VM from the host machine. Possibly the other node instances have that flag set?
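For reference, a minimal sketch of capturing such a snapshot from the running server, based on the signal approach described on that page (the PID is the one from the ps output above; the resulting .heapsnapshot file can be loaded into the Memory tab of Chrome DevTools):

# Start the server with a signal-triggered heap snapshot handler
node --heapsnapshot-signal=SIGUSR2 --unhandled-rejections=warn --max-http-header-size=30000 dist/server.js

# Later, from inside the container: writes a Heap.*.heapsnapshot file in the working directory
kill -USR2 20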

ericlinagora (Contributor) commented:

(And if there is a leak, we should look at it once node is the one catching it, because the OOM kill will probably be unrelated to our code, since node expects that space to be available.)

chibenwa (Member Author) commented May 7, 2024

I think the word "you" is misused here. I am not part of the tdrive team, nor an ops person, just a fellow dev working on another product, reporting some issues he saw.

Agreed on pushing the memory limit down to the tdrive application.

I still believe there's a deeper issue lurking in there...

ericlinagora (Contributor) commented:

Perhaps there is, but this is the expected behaviour even if there weren't a deeper issue. If you want to make such an issue more visible, then instead of increasing the VM we should decrease node's space to stress it earlier, until it's truly deployed (provided we remember to change it back before then). This is probably a better approach from a diagnostic perspective.
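As a sketch of that diagnostic setup (the 512 MB cap is an arbitrary, deliberately small example, to be reverted before a real deployment):

# With a small heap, a genuine leak hits V8's own limit quickly and crashes node
# with a "JavaScript heap out of memory" error, instead of a silent OOM kill by the kernel
node --max-old-space-size=512 --unhandled-rejections=warn --max-http-header-size=30000 dist/server.js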
