
Memory Leak in EdgeHub module #6997

Open
KhantNayan opened this issue Apr 17, 2023 · 8 comments

@KhantNayan

Expected Behavior

Memory utilization by the EdgeHub module should remain constant

Current Behavior

Memory utilization by the EdgeHub module keeps increasing over time

Steps to Reproduce

  1. Run the EdgeHub module for 2-3 days
  2. Observe memory utilization of the EdgeHub module (see the sketch after this list)
  3. Restart the EdgeHub module and check memory utilization again
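A minimal sketch of how steps 2 and 3 can be carried out from the host, assuming the standard iotedge CLI and the Docker/Moby tooling listed in the environment section (the exact monitoring method used by the reporter is not stated):

```bash
# One-shot memory snapshot of the Edge containers (no streaming).
docker stats --no-stream edgeHub edgeAgent

# Restart only the EdgeHub module, then take another snapshot to compare.
sudo iotedge restart edgeHub
docker stats --no-stream edgeHub
```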

Context (Environment)

Output of iotedge check

damen@damen:~$ sudo iotedge check
[sudo] password for damen:

Configuration checks (aziot-identity-service)

√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
‼ aziot-identity-service package is up-to-date - Warning
Installed aziot-identity-service package has version 1.4.1 but 1.4.3 is the latest stable version available.
Please see https://aka.ms/aziot-update-runtime for update instructions.
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)

√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK

Configuration checks

√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
‼ aziot-edge package is up-to-date - Warning
Installed IoT Edge daemon has version 1.4.3 but 1.4.9 is the latest stable version available.
Please see https://aka.ms/iotedge-update-runtime for update instructions.
√ container time is close to host time - OK
√ DNS server - OK
‼ production readiness: logs policy - Warning
Container engine is not configured to rotate module logs which may cause it run out of disk space.
Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
You can ignore this warning if you are setting log policy per module in the Edge deployment.
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
√ Agent image is valid and can be pulled from upstream - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks

√ container on the default network can connect to upstream AMQP port - OK
√ container on the default network can connect to upstream HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to upstream AMQP port - OK
√ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK
32 check(s) succeeded.
3 check(s) raised warnings. Re-run with --verbose for more details.
2 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.
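Regarding the logs-policy warning above: a commonly used way to address it (a sketch, not necessarily what is configured on this device) is to enable log rotation in the container engine's /etc/docker/daemon.json and then restart the engine:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```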

Device Information

  • Host OS : Ubuntu 20.04
  • Architecture : amd64
  • Container OS : Linux containers

Runtime Versions

  • aziot-edged: 1.4.3
  • Edge Agent: mcr.microsoft.com/azureiotedge-agent:1.4.2-linux-amd64
  • Edge Hub: mcr.microsoft.com/azureiotedge-hub:1.4.2-linux-amd64
  • Docker/Moby: 20.10.18+azure-2

Additional Information

Memory usage increases by 10-20 MB roughly every 4 hours.
Below is the memory utilization of the Edge Hub module over time:

******************************************
*   Mon Mar 13 16:00:01 UTC 2023   *
******************************************
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
737343ad5d99   devicemonitor            1.11%     1.906MiB / 500MiB     0.38%     27.4kB / 30.1kB   0B / 0B          3
3d3b2f853eb8   edgeHub                  0.04%     163.4MiB / 500MiB     32.68%    26.6MB / 32.6MB   4.1kB / 2.45GB   25
d3cc99067218   devicemanagement         1.11%     2.23MiB / 500MiB      0.45%     0B / 0B           0B / 0B          3
c6025fa13fcc   edgeAgent                0.00%     51.72MiB / 500MiB     10.34%    561kB / 600kB     8.19kB / 291kB   18
******************************************
*   Mon Mar 13 20:00:01 UTC 2023   *
******************************************
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
737343ad5d99   devicemonitor            1.11%     1.91MiB / 500MiB      0.38%     55.4kB / 64.3kB   0B / 0B          3
3d3b2f853eb8   edgeHub                  0.15%     182.7MiB / 500MiB     36.54%    38.7MB / 49.9MB   4.1kB / 3.31GB   25
d3cc99067218   devicemanagement         1.14%     2.223MiB / 500MiB     0.44%     0B / 0B           0B / 0B          3
c6025fa13fcc   edgeAgent                0.00%     50.85MiB / 500MiB     10.17%    1.21MB / 1.32MB   8.19kB / 537kB   17
******************************************
*   Tue Mar 14 00:01:01 UTC 2023   *
******************************************
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
737343ad5d99   devicemonitor            1.10%     1.914MiB / 500MiB     0.38%     78.5kB / 92.2kB   0B / 0B          3
3d3b2f853eb8   edgeHub                  0.04%     197.6MiB / 500MiB     39.52%    48.3MB / 63.8MB   4.1kB / 3.99GB   24
d3cc99067218   devicemanagement         1.16%     2.227MiB / 500MiB     0.45%     0B / 0B           0B / 0B          3
c6025fa13fcc   edgeAgent                1.37%     51.69MiB / 500MiB     10.34%    1.72MB / 1.9MB    8.19kB / 733kB   18
******************************************
*   Tue Mar 14 04:01:01 UTC 2023   *
******************************************
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
737343ad5d99   devicemonitor            1.11%     1.922MiB / 500MiB     0.38%     101kB / 120kB     0B / 0B          3
3d3b2f853eb8   edgeHub                  0.06%     211.9MiB / 500MiB     42.38%    58MB / 77.8MB     4.1kB / 4.68GB   26
d3cc99067218   devicemanagement         1.07%     2.23MiB / 500MiB      0.45%     0B / 0B           0B / 0B          3
c6025fa13fcc   edgeAgent                0.00%     51.9MiB / 500MiB      10.38%    2.23MB / 2.47MB   8.19kB / 930kB   18
******************************************
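The snapshots above look like periodic docker stats captures. A minimal sketch of a script that produces this kind of log, assuming it is run from cron every 4 hours (the script name, log path, and banner format are illustrative):

```bash
#!/bin/bash
# log_edge_mem.sh - append a timestamped container stats snapshot to a log file.
# Assumed cron entry: 0 */4 * * * /usr/local/bin/log_edge_mem.sh
LOG=/var/log/edge_mem_usage.log
{
  echo "******************************************"
  echo "*   $(date -u)   *"
  echo "******************************************"
  docker stats --no-stream --format \
    "table {{.ID}}\t{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}\t{{.BlockIO}}\t{{.PIDs}}"
} >> "$LOG"
```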
@vipeller self-assigned this Apr 17, 2023
@vipeller added the "bug" (Something isn't working) label Apr 17, 2023
@vipeller
Contributor

Hi @KhantNayan, sorry for the late answer. Do you see this growing indefinitely (i.e. it never stops)? We have long-haul tests and we did not notice this, but let me set it up and see. My guess is that it is related to caching and that the memory gets freed when it becomes scarce.

@KhantNayan
Author

Thank you @vipeller for the response.
The memory utilization is growing continuously. We set the memory limit to 500 MB, so the OutOfMemory killer is invoked once that limit is reached.
Due to the OOM kills, the system sometimes misbehaves, e.g. it is not connected to IoT Hub or exceptions are thrown in the EdgeHub/EdgeAgent modules.
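For context, a per-module memory cap like the 500 MB limit described above is typically applied through the module's createOptions in the deployment manifest; a sketch under that assumption (the actual manifest used here is not shown; 524288000 bytes = 500 MiB):

```json
"edgeHub": {
  "settings": {
    "image": "mcr.microsoft.com/azureiotedge-hub:1.4.2-linux-amd64",
    "createOptions": "{\"HostConfig\":{\"Memory\":524288000}}"
  }
}
```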

@KhantNayan
Author

Attached support bundle
support_bundle_2023_05_08_16_03_10_UTC.zip

@vipeller
Contributor

vipeller commented May 8, 2023

@KhantNayan Hi, I see from the support bundle that there are many modules. Can you give me a hint about the message pattern, e.g. whether these modules use module-to-module messages, and what the message size and message rate are? I don't need the exact schema; I just want to set up a similar test so I can find the spot where it leaks. We don't see leaking in our long-running tests.

@github-actions

github-actions bot commented Jun 8, 2023

This issue is being marked as stale because it has been open for 30 days with no activity.

@spark-iiot

Microsoft has reproduced this problem on multiple occasions without succeeding in resolving it.

@spark-iiot

Any updates on this incident?

@vipeller
Contributor

vipeller commented Sep 5, 2023

Hi @spark-iiot, this issue is not being actively investigated. I asked for some additional information about your setup on May 8th so that I could run a test with a similar number of modules and similar behavior.

We have long-running tests and those don't show a memory leak. They send several tens of thousands of messages over several days.

We have had memory leak problems in the past, caused by different bugs (RocksDB, etc.); some of them were found and fixed, others were worked around. These were triggered by specific use cases.

Without knowing your use case, I will not be able to repro it and see what may be causing this. I need to run something similar to what you do, so that I can then use memory profiling to check what is holding on to the memory.

Please give some information about what you actually do:

  • From the logs you attached I saw a high number of modules. The stats in the initial report, however, show only two additional modules (devicemonitor/devicemanagement). Does this mean that the leak occurs with only two modules running?
  • What is the messaging pattern? Again, from the stats I see edgeHub net I/O in the several-MB range, but the two active modules barely have any. Is that because modules are stopping/starting? If so, it would be important to know; maybe the leak is triggered by something like that.

Let me know more about your setup so we can create better tests to repro it.
