Backend memory leak #514

Open
chibenwa opened this issue May 6, 2024 · 8 comments
Labels: bug (Something isn't working), tech

chibenwa (Member) commented May 6, 2024

Describe the bug

Over time, tdrive consumes more and more memory.

At 10 am:


root@tdrive-backend-5d8bc8d946-t2z4p:/usr/src/app# ps ux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          20  0.9  8.6 3227084 1327243 ?     Sl   May04  23:36 node --unhandled-rejections=warn --max-http-header-size=30000 dist/server.js

At 12 pm:

root@tdrive-backend-5d8bc8d946-t2z4p:/usr/src/app# ps ux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          20  0.9  8.6 3227084 1414260 ?     Sl   May04  23:36 node --unhandled-rejections=warn --max-http-header-size=30000 dist/server.js

This eventually causes the 2 GB quota to be exceeded and the pod to be OOM-killed:

Screenshot from 2024-05-06 12-04-31

Screenshot from 2024-05-06 12-04-49

To Reproduce

Run and monitor Twake Drive in production.
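For illustration only (the PID below is the one from the ps output above), the resident set size of the server process can be sampled periodically from inside the container:

# Print the RSS (in kB) of the node process every 5 minutes
while true; do ps -o rss= -p 20; sleep 300; done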

Expected behavior

No memory leak: I expect tdrive's memory consumption to reach a stable point.

chibenwa added the bug label on May 6, 2024
chibenwa (Member Author) commented May 6, 2024

The memory graph clearly shows that memory keeps increasing and is never released.

Note that the drop at the end of the period is a K8s reboot.

shepilov added the tech label on May 6, 2024
ericlinagora (Contributor) commented:

This is normal behaviour for node: it has its own managed memory and tries to use the maximum of it. The machine it's hosted on must have more RAM than that maximum GC memory. We can either reduce node's space (e.g. with --max-old-space-size) or increase the memory of the host machine (which I recommend, because node's default is already quite conservative).
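For illustration, a minimal sketch of the first option, reusing the start command visible in the ps output above; the 1536 MB value is only an example and would need to be tuned to the container's actual limit:

# Cap V8's old-generation heap (value in MB) so it stays below the container quota
node --max-old-space-size=1536 --unhandled-rejections=warn --max-http-header-size=30000 dist/server.js

# The same cap can be injected through the environment instead of editing the command
export NODE_OPTIONS="--max-old-space-size=1536"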

chibenwa (Member Author) commented May 7, 2024

Hi @ericlinagora

I come from the Java world, and the mail servers operate on the CNB architecture with a 2 GB heap without issue (of course, the -Xmx setting defining the memory used by the JVM is set in accordance with the Kubernetes limits).

So yes, Node.js does not run alone there, and we need to set both limits in a reasonable way.

I do think that a couple of GBs should be enough for a not-that-stressed web app.

I encourage you to update the tdrive Helm charts accordingly (a sketch follows at the end of this comment).

Second, again coming from the Java world, an over-sized old generation is suspicious at best, especially when implementing a REST API, which is supposed to employ a per-request, disconnected model.

Is tdrive doing caching? Improperly configured caching can lead to such behaviour...

Finally, I am not a node expert, but some other node applications we have around do not exhibit such abnormal behaviour: to mention the OpenPaaS nodes on the same deployment, they stay well below 500 MB.

All this points toward TDrive's memory needing better management.
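A purely illustrative sketch of keeping both limits consistent; the deployment name tdrive-backend and the numbers are assumptions, not values taken from the actual Helm chart:

# Container memory limit (what the kernel's OOM killer enforces)
kubectl set resources deployment/tdrive-backend --limits=memory=2Gi
# Node heap cap, kept a few hundred MB below the container limit to leave room
# for buffers and native allocations outside the V8 heap
kubectl set env deployment/tdrive-backend NODE_OPTIONS="--max-old-space-size=1536"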

chibenwa (Member Author) commented May 7, 2024

Please proceed to a memory dump and analyse the memory content.

(If it were Java, I would happily do a heap dump and analyse it with VisualVM...)

ericlinagora (Contributor) commented May 7, 2024

Hi,
So, like I said, the V8 GC is only remotely comparable to the JVM. If there were a memory leak, it would exhaust the space inside node and cause a node error. If the OOM killer is killing node, as here, it's a mismatch: node's maximum memory is bigger than the host's. You can take a look at the internal memory as described at https://nodejs.org/en/learn/diagnostics/memory/using-heap-snapshot, but I don't think it will be very useful; nor is looking at node's process memory from the outside, since all you would see is indeed what looks like a memory leak. Think of it like looking at the memory used by a VM from the host machine. Possibly the other node instances have that flag set?
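For reference, a minimal sketch of capturing such a snapshot from the running server, based on the signal approach described on that page (the PID is the one from the ps output above; the resulting .heapsnapshot file can be loaded into the Memory tab of Chrome DevTools):

# Start the server with a signal-triggered heap snapshot handler
node --heapsnapshot-signal=SIGUSR2 --unhandled-rejections=warn --max-http-header-size=30000 dist/server.js

# Later, from inside the container: writes a Heap.*.heapsnapshot file in the working directory
kill -USR2 20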

ericlinagora (Contributor) commented:

(And if there is a leak, we should look at it once node is the one catching it, because the OOM kill will probably be unrelated to our code, since node expects that space to be available.)

chibenwa (Member Author) commented May 7, 2024

I think the word "you" is misused here. I am not part of the tdrive team, nor an ops person, just a fellow dev working on another product, reporting some issues he saw.

Agreed on pushing the memory limit down to the tdrive application.

I still believe there's a deeper issue lurking in there...

ericlinagora (Contributor) commented:

Perhaps there is, but this is the expected behaviour even if there weren't a deeper issue. If you want to make such an issue more visible, then instead of increasing the VM we should decrease node's space to stress it earlier, until it's truly deployed (provided we remember to change it back before then). This is probably a better approach from a diagnostic perspective.
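As a sketch of that diagnostic setup (the 512 MB cap is an arbitrary, deliberately small example, to be reverted before a real deployment):

# With a small heap, a genuine leak hits V8's own limit quickly and crashes node
# with a "JavaScript heap out of memory" error, instead of a silent OOM kill by the kernel
node --max-old-space-size=512 --unhandled-rejections=warn --max-http-header-size=30000 dist/server.js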
