-
Hello. Please bear with my English, I'm writing through a translator. My company built an XMPP server on ejabberd a long time ago, and since I don't know the related history, the distributed source code, ejabberd, or Erlang, I'm investigating all of them now. There are three servers, and set-top boxes (the clients) connect to them. First, let me describe the problem I'm having.
```
2024-01-19 06:05:28.548 [info] <0.24578.81>@ejabberd_c2s:bind:411 (tls|<0.24578.81>) Opened c2s session for 9877e7-extender-bs1008051a002386@krms.commufa.jp/XMPPConn1
```
There are about 90,000 devices that we distributed, but the sm table has already exceeded 1.9 million rows... and it is still increasing. A complaint came in saying that MySQL is eating too much memory, and I think this table has something to do with it, so I'm investigating.
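This is roughly how I am watching the table (column names as in ejabberd's sql/mysql.sql schema; please correct me if ours is customized):

```sql
-- total rows in the session table
SELECT COUNT(*) FROM sm;

-- rows per ejabberd node, to see whether one node is leaking
SELECT node, COUNT(*) AS sessions FROM sm GROUP BY node;
```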
From here on I will keep monitoring the ejabberd.log of the test server, which is recorded at debug level, and describe one cause I suspect.

A client connects to the ejabberd server, the connection is handled by a process with a pid managed by Erlang (GPT told me the pid is managed by Erlang itself, not an OS process ID), and that pid is registered in the sm table. However, for some reason the Erlang process dies after a few seconds. As a result the connection to the client that held the pid of the dead process is lost, so the client requests a connection again, and my theory is that the row with the dead pid is not being deleted from the sm table properly.

I did my best to download ejabberd's open source and analyze the code and logs, but Erlang is too difficult for me; I think I've done everything I can. The ejabberd in use is 18.12. Getting a version upgrade onto the live server is very difficult... but I could try to convince my boss if someone tells me this is almost certainly a bug in this old version. Any guess, experience, or advice would be great. I need help.

And there's one thing I was really curious about while analyzing the source myself. Looking at the code, it seems the priority is updated when a session is closing, so I understand a row with priority "" to be a closed client. But in the normal situation, a client disconnecting deletes the row from the sm table. I'm really curious which situations lead to each of these two different outcomes.

Thank you for reading the long post. It's my first time asking on GitHub, so I'm sorry if anything is wrong.

====== additional ======
ejabberdctl.cfg
======= ejabberd.yml
-
Soo old. At least attach a gist of the config, and take a look at mod_stream_mgmt: https://docs.ejabberd.im/admin/configuration/modules/#mod-stream-mgmt
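If mod_stream_mgmt is enabled, detached sessions legitimately stay in the sm table until the resume timeout expires, so some growth after disconnects is expected. A sketch of the relevant ejabberd.yml fragment (option names per the docs above; the value is just an example):

```yaml
modules:
  mod_stream_mgmt:
    # how long (seconds) a detached session stays resumable;
    # during this window it still occupies a row in the sm table
    resume_timeout: 300
```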
-
How are your sessions stored? Do you use SQL for that, or some other backend? Do you run it in a cluster, or just on a single machine?
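That is, does your ejabberd.yml contain something like the following? (sm_db_type is the option that selects the session backend; sql here is only a guess based on your MySQL symptoms:)

```yaml
# session table backend: mnesia (the default) or sql
sm_db_type: sql
```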
-
In newer versions we added code that tries to clean up the session table of dead processes (it was added mostly for cluster issues, where a communication problem between nodes could break synchronization of the session table state). There were also some changes around session closure in general that could possibly fix this bug in a newer version.
-
There's something I've checked additionally.

First, I read the source code of 18.12. When ejabberd_c2s.erl's process_terminated/6 is called by the OTP framework (i.e. when the client disconnects), it goes on to call delete_session/1 in ejabberd_sm_sql.erl. Since the "Closing c2s session for ~" log is constantly being written on the problematic production server, the rows with those pids should be getting deleted from the sm table.

Next, I decided to look for the problem in MySQL this time. When I ran SHOW PROCESSLIST; there were many connections in the Sleep state: about 800 on the test server and 300 on the production server. When the test server shut down the 40,000 virtual clients connected to it, I confirmed that many of the sleeping connections in the process list changed to Query and were performing DELETEs on sm. Since thousands of devices keep repeating connection close and open on the production server, I expected to see processes running the sm DELETE in its process list in the same way, but no matter how many times I checked there were none; they just kept sleeping. In the source code the algorithm executes the sm DELETE unconditionally, so I think something is wrong and the production server never reaches the point of executing that SQL statement.

One more thing: the current production server allocates 4 CPUs to ejabberd, but it is using 400% CPU. I'm thinking this might be causing the problem, but I haven't found a way to prove it yet. We will also have to find out why system resources are exploding in the first place; as far as I know, ejabberd keeps about 20,000 connections stable. I'll share any additional information I learn.
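For reference, this is the query I have been using instead of eyeballing SHOW PROCESSLIST (information_schema.processlist is standard MySQL; the LIKE filter is just a rough heuristic):

```sql
-- show non-sleeping connections and anything touching the sm table
SELECT id, user, command, time, state, info
FROM information_schema.processlist
WHERE command <> 'Sleep'
   OR info LIKE '%sm%';
```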
-
Yes, the terminate() hook deletes entries from the sm tables, but there are situations where this hook is not called. This may happen when the session process is killed externally with a signal, and it may also happen when the out-of-memory handler is triggered: that handler looks for processes that take a lot of memory and kills them, and if a session process is killed this way it may not clear its entry from the sm table (and it is usually c2s processes that hold a lot of memory). As an emergency measure you could also try cleaning this table manually; maybe see if there are entries older than, say, N days and try to delete them. Maybe something like this:
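A sketch of that idea in Erlang, assuming ejabberd_sm exports get_vh_session_list/1, get_session_sid/3 and close_session/4 (they exist in recent versions, but please check your 18.12 tree before running); paste it into `ejabberdctl debug` on each node:

```erlang
Host = <<"example.com">>,  %% replace with your XMPP domain
%% Collect {SID, USR} pairs. SID = {Timestamp, Pid}, so sorting the
%% pairs puts the oldest sessions first.
All = [{ejabberd_sm:get_session_sid(U, S, R), {U, S, R}}
       || {U, S, R} <- ejabberd_sm:get_vh_session_list(Host)],
Oldest = lists:sublist(lists:sort([P || {SID, _} = P <- All, SID =/= none]),
                       10000),
%% Close every session whose c2s process is local and dead;
%% close_session/4 removes the row through the configured sm backend
%% (ejabberd_sm_sql in your case). The result is the number cleaned.
length([ejabberd_sm:close_session(SID, U, S, R)
        || {{_Ts, Pid} = SID, {U, S, R}} <- Oldest,
           node(Pid) =:= node(),
           not is_process_alive(Pid)]).
```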
This checks the oldest 10,000 sessions and, for any of those sessions that is no longer alive, deletes it (it only tests processes on the current node, so you will need to run it on each node separately), and it returns the number of sessions cleaned up. As it checks a maximum of 10k sessions, you will probably need to run it multiple times.
-
I think the 18.12 version has a bug: when some kind of connection failure occurs, ejabberd infinitely resends the presence message.