-
Hello. Please bear with my English, I'm writing through a translator. My company built an XMPP server on ejabberd a long time ago, and since I don't know the related history, the distributed source code, ejabberd, or Erlang, I'm investigating all of them now. There are three servers, and set-top boxes (the clients) connect to them. First, let me describe the problem I'm having.
```
2024-01-19 06:05:28.548 [info] <0.24578.81>@ejabberd_c2s:bind:411 (tls|<0.24578.81>) Opened c2s session for 9877e7-extender-bs1008051a002386@krms.commufa.jp/XMPPConn1
```
There are about 90,000 devices that we distributed, but the sm table has already exceeded 1.9 million rows... and it is still increasing. A complaint came in saying that MySQL is eating too much memory, and I think this table has something to do with it, so I'm investigating.
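This is roughly how I am watching the table (column names as in ejabberd's sql/mysql.sql schema; please correct me if ours is customized):

```sql
-- total rows in the session table
SELECT COUNT(*) FROM sm;

-- rows per ejabberd node, to see whether one node is leaking
SELECT node, COUNT(*) AS sessions FROM sm GROUP BY node;
```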
From here on I will keep monitoring the ejabberd.log of the test server, which is recorded at debug level, and describe one cause I suspect.

A client connects to the ejabberd server, the connection is handled by a process with a pid managed by Erlang (GPT told me the pid is managed by Erlang itself, not an OS process ID), and that pid is registered in the sm table. However, for some reason the Erlang process dies after a few seconds. As a result the connection to the client that held the pid of the dead process is lost, so the client requests a connection again, and my theory is that the row with the dead pid is not being deleted from the sm table properly.

I did my best to download ejabberd's open source and analyze the code and logs, but Erlang is too difficult for me; I think I've done everything I can. The ejabberd in use is 18.12. Getting a version upgrade onto the live server is very difficult... but I could try to convince my boss if someone tells me this is almost certainly a bug in this old version. Any guess, experience, or advice would be great. I need help.

And there's one thing I was really curious about while analyzing the source myself. Looking at the code, it seems the priority is updated when a session is closing, so I understand a row with priority "" to be a closed client. But in the normal situation, a client disconnecting deletes the row from the sm table. I'm really curious which situations lead to each of these two different outcomes.

Thank you for reading the long post. It's my first time asking on GitHub, so I'm sorry if anything is wrong.

====== additional ======
ejabberdctl.cfg
======= ejabberd.yml
-
Soo old. At least attach a gist of the config, and take a look at mod_stream_mgmt: https://docs.ejabberd.im/admin/configuration/modules/#mod-stream-mgmt
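If mod_stream_mgmt is enabled, detached sessions legitimately stay in the sm table until the resume timeout expires, so some growth after disconnects is expected. A sketch of the relevant ejabberd.yml fragment (option names per the docs above; the value is just an example):

```yaml
modules:
  mod_stream_mgmt:
    # how long (seconds) a detached session stays resumable;
    # during this window it still occupies a row in the sm table
    resume_timeout: 300
```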
-
How are your sessions stored? Do you use SQL for that, or some other backend? Do you run it in a cluster, or just on a single machine?
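That is, does your ejabberd.yml contain something like the following? (sm_db_type is the option that selects the session backend; sql here is only a guess based on your MySQL symptoms:)

```yaml
# session table backend: mnesia (the default) or sql
sm_db_type: sql
```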
-
In newer versions we added code that tries to clean up the session table of dead processes (it was added mostly for cluster issues, where a communication problem between nodes could break synchronization of the session table state). There were also some changes around session closure in general that could possibly fix this bug in a newer version.
-
There's something I've checked additionally.

First, I read the source code of 18.12. When ejabberd_c2s.erl's process_terminated/6 is called by the OTP framework (i.e. when the client disconnects), it goes on to call delete_session/1 in ejabberd_sm_sql.erl. Since the "Closing c2s session for ~" log is constantly being written on the problematic production server, the rows with those pids should be getting deleted from the sm table.

Next, I decided to look for the problem in MySQL this time. When I ran SHOW PROCESSLIST; there were many connections in the Sleep state: about 800 on the test server and 300 on the production server. When the test server shut down the 40,000 virtual clients connected to it, I confirmed that many of the sleeping connections in the process list changed to Query and were performing DELETEs on sm. Since thousands of devices keep repeating connection close and open on the production server, I expected to see processes running the sm DELETE in its process list in the same way, but no matter how many times I checked there were none; they just kept sleeping. In the source code the algorithm executes the sm DELETE unconditionally, so I think something is wrong and the production server never reaches the point of executing that SQL statement.

One more thing: the current production server allocates 4 CPUs to ejabberd, but it is using 400% CPU. I'm thinking this might be causing the problem, but I haven't found a way to prove it yet. We will also have to find out why system resources are exploding in the first place; as far as I know, ejabberd keeps about 20,000 connections stable. I'll share any additional information I learn.
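For reference, this is the query I have been using instead of eyeballing SHOW PROCESSLIST (information_schema.processlist is standard MySQL; the LIKE filter is just a rough heuristic):

```sql
-- show non-sleeping connections and anything touching the sm table
SELECT id, user, command, time, state, info
FROM information_schema.processlist
WHERE command <> 'Sleep'
   OR info LIKE '%sm%';
```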
-
Yes, the terminate() hook deletes entries from the sm tables, but there are situations where this hook is not called. This may happen when the session process is killed externally with a signal, and it may also happen when the out-of-memory handler is triggered: that handler looks for processes that take a lot of memory and kills them, and if a session process is killed this way it may not clear its entry from the sm table (and it is usually c2s processes that hold a lot of memory). As an emergency measure you could also try cleaning this table manually; maybe see if there are entries older than, say, N days and try to delete them. Maybe something like this:
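A sketch of that idea in Erlang, assuming ejabberd_sm exports get_vh_session_list/1, get_session_sid/3 and close_session/4 (they exist in recent versions, but please check your 18.12 tree before running); paste it into `ejabberdctl debug` on each node:

```erlang
Host = <<"example.com">>,  %% replace with your XMPP domain
%% Collect {SID, USR} pairs. SID = {Timestamp, Pid}, so sorting the
%% pairs puts the oldest sessions first.
All = [{ejabberd_sm:get_session_sid(U, S, R), {U, S, R}}
       || {U, S, R} <- ejabberd_sm:get_vh_session_list(Host)],
Oldest = lists:sublist(lists:sort([P || {SID, _} = P <- All, SID =/= none]),
                       10000),
%% Close every session whose c2s process is local and dead;
%% close_session/4 removes the row through the configured sm backend
%% (ejabberd_sm_sql in your case). The result is the number cleaned.
length([ejabberd_sm:close_session(SID, U, S, R)
        || {{_Ts, Pid} = SID, {U, S, R}} <- Oldest,
           node(Pid) =:= node(),
           not is_process_alive(Pid)]).
```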
This checks the oldest 10,000 sessions and, for any of those sessions that is no longer alive, deletes it (it only tests processes on the current node, so you will need to run it on each node separately), and it returns the number of sessions cleaned up. As it checks a maximum of 10k sessions, you will probably need to run it multiple times.
-
I think the 18.12 version has a bug: when some kind of connection failure occurs, ejabberd infinitely resends the presence message.