
ClickHouse memory usage far above MAX_MEMORY_USAGE_RATIO and Sentry keeps crashing randomly #5512

Open · hzjsti opened this issue Feb 5, 2024 · 2 comments
Labels: duplicate (This issue or pull request already exists)

Comments

@hzjsti

hzjsti commented Feb 5, 2024

Self-Hosted Version

24.1.0.dev0

CPU Architecture

x86_64

Docker Version

25.0.1

Docker Compose Version

2.24.2

Steps to Reproduce

  1. Installed a new Sentry self-hosted instance
  2. Started it in production

Expected Result

Sentry should stay healthy.

Actual Result

The virtual server (8 vCPU, 16 GB RAM) becomes unhealthy and unreachable from time to time. A first investigation suggests that ClickHouse is not respecting the environment variable MAX_MEMORY_USAGE_RATIO = 0.3. See the output of top:

top - 14:42:13 up  3:55,  4 users,  load average: 1.07, 0.47, 0.42
Tasks: 350 total,   1 running, 347 sleeping,   0 stopped,   2 zombie
%Cpu(s):  8.1 us,  1.0 sy,  0.0 ni, 90.8 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  15654.4 total,    289.1 free,  14302.4 used,   1455.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1352.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  17765 message+  20   0 9995.2m 339928  28936 S   4.3   2.1  22:17.12 clickhouse-serv
   5015 root      20   0 7999244   1.0g  11148 S   2.3   6.6   5:36.22 java
 180716 999       20   0 1474836 233692  18356 S   2.0   1.5   0:45.63 sentry
   7250 999       20   0 1717872 177496   4784 S   1.0   1.1   1:56.53 sentry
    554 root      20   0 6043556 114496  11560 S   0.7   0.7   1:37.98 dockerd
   4976 999       20   0 1272704 170152   4676 S   0.7   1.1   1:14.56 sentry
   5966 admin     20   0  405868 174864   3492 S   0.7   1.1   1:02.72 snuba
   8820 999       20   0 1052804 174428   4920 S   0.7   1.1   1:48.96 sentry
  11593 999       20   0 1568796 170144   4932 S   0.7   1.1   1:16.15 sentry
  15989 999       20   0  936016 165160   4800 S   0.7   1.0   0:38.38 sentry
  17669 999       20   0  936760 166200   4720 S   0.7   1.0   0:40.87 sentry
  18258 999       20   0 1485856 237768  15180 S   0.7   1.5   2:40.78 sentry
  ...

Sometimes the server comes back after about 20 minutes; often a hard reboot is necessary. It sounds like heavy swapping. I am now trying to capture logs of an outage event and will post more information as soon as they are available.
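
For anyone triaging the same symptom, here is a quick diagnostic sketch (assuming the stock self-hosted compose service name clickhouse and the default ClickHouse config path; adjust to your setup) to check whether the ratio actually reached the server and what ClickHouse itself reports holding:

  # Is the variable set inside the container? (service name is an assumption)
  docker compose exec clickhouse env | grep MAX_MEMORY_USAGE_RATIO

  # Is it wired into the server config? (default ClickHouse config path)
  docker compose exec clickhouse grep -r max_server_memory_usage /etc/clickhouse-server/

  # What does ClickHouse itself report for its memory?
  docker compose exec clickhouse clickhouse-client --query \
    "SELECT metric, formatReadableSize(value) FROM system.asynchronous_metrics WHERE metric LIKE 'Memory%'"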

Event ID

No response

@azaslavsky

I believe this is a dupe of getsentry/self-hosted#2700?

@azaslavsky added the Waiting for: Community and duplicate labels on Feb 7, 2024
@hzjsti

hzjsti commented Feb 7, 2024

I can now confirm it is a memory (RAM) issue. I was able to watch top while the server was crashing. The server starts swapping at some point, and since it didn't have much swap space, it ran out of memory. I have increased the swap space to a high value and hope that prevents the crashes.
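
For reference, a minimal sketch of adding swap on a typical Linux host (the 8G size is an example, not a recommendation):

  sudo fallocate -l 8G /swapfile                              # allocate an 8 GiB swap file
  sudo chmod 600 /swapfile                                    # swapon requires restrictive permissions
  sudo mkswap /swapfile                                       # format the file as swap
  sudo swapon /swapfile                                       # enable it immediately
  echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab  # persist across reboots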

I am not sure whether this is a dupe of getsentry/self-hosted#2700, or whether ClickHouse is causing the problem at all. At the point of swapping, ClickHouse's memory usage didn't grow very much. Perhaps I will see more during the next such event, since there is now more room to grow into swap. As RAM consumption sits around 95% all the time, it could be any service. I also cannot see a higher volume of incoming Sentry events or anything suspicious in the global throughput statistics just before the crashes.
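
To narrow down which container is actually consuming the RAM, a per-container snapshot is a good next step (plain Docker, no assumptions beyond a running stack):

  docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"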

In any case, I think it should be investigated why ClickHouse doesn't respect MAX_MEMORY_USAGE_RATIO.

@azaslavsky transferred this issue from getsentry/self-hosted on Feb 8, 2024