
[Bug]: After the VerneMQ cluster starts, CPU and memory usage are abnormal #2146

Open
GoneGo1ng opened this issue Apr 25, 2023 · 14 comments

@GoneGo1ng

Environment

  • VerneMQ Version: 1.12.4
  • OS: Alibaba Cloud Linux 3 (Soaring Falcon)
  • Erlang/OTP version (if building from source):
  • Cluster size/standalone: 3 nodes

While verifying the VerneMQ cluster, we found that even with no client connections, the CPU and memory usage of VerneMQ is still high.
(screenshots of CPU and memory usage attached)

Current Behavior

The CPU and memory usage of VerneMQ is very high. Looking at the logs, it seems that some data is being synchronized, but we cannot tell what data it is.

Expected behaviour

The CPU and memory usage of VerneMQ should not be this high when no clients are connected.

Configuration, logs, error output, etc.

Configuration:

accept_eula = yes
allow_anonymous = off
allow_register_during_netsplit = on
allow_publish_during_netsplit = on
allow_subscribe_during_netsplit = on
allow_unsubscribe_during_netsplit = on
allow_multiple_sessions = off
coordinate_registrations = on
max_inflight_messages = 20
max_online_messages = 1000
max_offline_messages = 1000
max_message_size = 0
upgrade_outgoing_qos = off
listener.max_connections = 10000
listener.nr_of_acceptors = 10
listener.tcp.default = 10.50.5.31:1883
listener.vmq.clustering = 10.50.5.31:44053
listener.http.default = 10.50.5.31:8888
systree_enabled = on
systree_interval = 20000
graphite_enabled = off
graphite_host = localhost
graphite_port = 2003
graphite_interval = 20000
shared_subscription_policy = prefer_local
plugins.vmq_passwd = on
plugins.vmq_acl = off
plugins.vmq_diversity = on
plugins.vmq_webhooks = off
plugins.vmq_bridge = off
topic_max_depth = 20
metadata_plugin = vmq_swc
vmq_acl.acl_file = /etc/vernemq/vmq.acl
vmq_acl.acl_reload_interval = 10
vmq_passwd.password_file = /etc/vernemq/vmq.passwd
vmq_passwd.password_reload_interval = 10
vmq_diversity.script_dir = /usr/share/vernemq/lua
vmq_diversity.auth_postgres.enabled = off
vmq_diversity.postgres.ssl = off
vmq_diversity.postgres.password_hash_method = crypt
vmq_diversity.auth_cockroachdb.enabled = off
vmq_diversity.cockroachdb.ssl = on
vmq_diversity.cockroachdb.password_hash_method = bcrypt
vmq_diversity.auth_mysql.enabled = off
vmq_diversity.mysql.password_hash_method = password
vmq_diversity.auth_mongodb.enabled = off
vmq_diversity.mongodb.ssl = off
vmq_diversity.auth_redis.enabled = on
vmq_diversity.redis.host = xxxxxx
vmq_diversity.redis.port = 6379
vmq_diversity.redis.password = xxxxxx
vmq_diversity.redis.database = 0
retry_interval = 60
vmq_bcrypt.pool_size = 1
vmq_bcrypt.nif_pool_size = 4
vmq_bcrypt.nif_pool_max_overflow = 10
vmq_bcrypt.default_log_rounds = 12
vmq_bcrypt.mechanism = port
log.console = file
log.console.level = info
log.console.file = /var/log/vernemq/console.log
log.error.file = /var/log/vernemq/error.log
log.syslog = off
log.crash = on
log.crash.file = /var/log/vernemq/crash.log
log.crash.maximum_message_size = 64KB
log.crash.size = 10MB
log.crash.rotation = $D0
log.crash.rotation.keep = 5
nodename = VerneMQ01@10.50.5.31
distributed_cookie = vmq
erlang.async_threads = 64
erlang.max_ports = 262144
leveldb.maximum_memory.percent = 70
include conf.d/*.conf

Console:

2023-04-25 00:04:33.309 [info] <0.17900.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta8: AE exchange with 'VerneMQ03@10.50.5.33' synced 77 objects
2023-04-25 00:04:37.184 [info] <0.19843.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta3: AE exchange with 'VerneMQ03@10.50.5.33' synced 68 objects
2023-04-25 00:04:38.813 [info] <0.17315.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta4: AE exchange with 'VerneMQ02@10.50.5.32' synced 155 objects
2023-04-25 00:04:48.484 [info] <0.23052.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta10: AE exchange with 'VerneMQ02@10.50.5.32' synced 119 objects
2023-04-25 00:04:48.491 [info] <0.24431.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta2: AE exchange with 'VerneMQ03@10.50.5.33' synced 51 objects
2023-04-25 00:04:50.110 [info] <0.23731.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta5: AE exchange with 'VerneMQ02@10.50.5.32' synced 127 objects
2023-04-25 00:04:50.368 [info] <0.24462.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta3: AE exchange with 'VerneMQ02@10.50.5.32' synced 422 objects
2023-04-25 00:04:50.427 [info] <0.23544.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta9: AE exchange with 'VerneMQ02@10.50.5.32' synced 163 objects
2023-04-25 00:04:50.442 [info] <0.19110.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta8: AE exchange with 'VerneMQ02@10.50.5.32' synced 142 objects
2023-04-25 00:04:50.805 [info] <0.24433.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta6: AE exchange with 'VerneMQ02@10.50.5.32' synced 134 objects
2023-04-25 00:04:50.954 [info] <0.24411.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta4: AE exchange with 'VerneMQ03@10.50.5.33' synced 63 objects
2023-04-25 00:04:50.958 [info] <0.24363.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta7: AE exchange with 'VerneMQ03@10.50.5.33' synced 72 objects
2023-04-25 00:04:59.636 [info] <0.25652.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta10: AE exchange with 'VerneMQ02@10.50.5.32' synced 119 objects
2023-04-25 00:05:03.661 [info] <0.24500.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta1: AE exchange with 'VerneMQ03@10.50.5.33' synced 61 objects
2023-04-25 00:05:05.030 [info] <0.26230.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta8: AE exchange with 'VerneMQ03@10.50.5.33' synced 77 objects
2023-04-25 00:05:05.080 [info] <0.27193.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta2: AE exchange with 'VerneMQ02@10.50.5.32' synced 402 objects
2023-04-25 00:05:05.193 [info] <0.27246.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta5: AE exchange with 'VerneMQ02@10.50.5.32' synced 127 objects
2023-04-25 00:05:05.621 [info] <0.26599.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta9: AE exchange with 'VerneMQ02@10.50.5.32' synced 163 objects
2023-04-25 00:05:06.209 [info] <0.27603.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta4: AE exchange with 'VerneMQ03@10.50.5.33' synced 63 objects
2023-04-25 00:05:06.229 [info] <0.27305.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta7: AE exchange with 'VerneMQ03@10.50.5.33' synced 72 objects
2023-04-25 00:05:10.841 [info] <0.27330.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta6: AE exchange with 'VerneMQ02@10.50.5.32' synced 134 objects
2023-04-25 00:05:11.294 [info] <0.27640.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta3: AE exchange with 'VerneMQ03@10.50.5.33' synced 68 objects
2023-04-25 00:05:11.481 [info] <0.29367.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta9: AE exchange with 'VerneMQ03@10.50.5.33' synced 68 objects
2023-04-25 00:05:18.128 [info] <0.30776.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta6: AE exchange with 'VerneMQ03@10.50.5.33' synced 85 objects
2023-04-25 00:05:18.229 [info] <0.29267.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta10: AE exchange with 'VerneMQ02@10.50.5.32' synced 119 objects
2023-04-25 00:05:19.835 [info] <0.29066.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta8: AE exchange with 'VerneMQ03@10.50.5.33' synced 77 objects
2023-04-25 00:05:20.402 [info] <0.30419.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta2: AE exchange with 'VerneMQ03@10.50.5.33' synced 51 objects
2023-04-25 00:05:21.043 [info] <0.29530.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta5: AE exchange with 'VerneMQ02@10.50.5.32' synced 127 objects
2023-04-25 00:05:21.576 [info] <0.30801.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta3: AE exchange with 'VerneMQ02@10.50.5.32' synced 422 objects
2023-04-25 00:05:22.176 [info] <0.30398.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta7: AE exchange with 'VerneMQ02@10.50.5.32' synced 149 objects
2023-04-25 00:05:23.510 [info] <0.28819.1087>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta4: AE exchange with 'VerneMQ03@10.50.5.33' synced 63 objects
2023-04-25 00:05:27.217 [info] <0.3148.1088>@vmq_swc_exchange_fsm:teardown:{112,13} Replica meta9: AE exchange with 'VerneMQ03@10.50.5.33' synced 68 objects

Error:

2023-04-21 02:03:48.123 [error] <0.268.0> Supervisor vmq_swc_exchange_sup_meta3 had child {vmq_swc_exchange_fsm,'VerneMQ02@10.50.5.32'} started with {vmq_swc_exchange_fsm,start_link,undefined} at <0.24745.582> exit with reason {{nodedown,'VerneMQ02@10.50.5.32'},{gen_server,call,[{vmq_swc_edist_meta3,'VerneMQ02@10.50.5.32'},{apply,vmq_swc_store,rpc_node_clock,[]},infinity]}} in context child_terminated
2023-04-21 02:03:48.123 [error] <0.346.0> Supervisor vmq_swc_exchange_sup_meta8 had child {vmq_swc_exchange_fsm,'VerneMQ02@10.50.5.32'} started with {vmq_swc_exchange_fsm,start_link,undefined} at <0.25338.582> exit with reason {{nodedown,'VerneMQ02@10.50.5.32'},{gen_server,call,[{vmq_swc_edist_meta8,'VerneMQ02@10.50.5.32'},{apply,vmq_swc_store,rpc_node_clock,[]},infinity]}} in context child_terminated
2023-04-21 02:03:48.123 [error] <0.243.0> Supervisor vmq_swc_exchange_sup_meta1 had child {vmq_swc_exchange_fsm,'VerneMQ02@10.50.5.32'} started with {vmq_swc_exchange_fsm,start_link,undefined} at <0.23505.582> exit with reason {{nodedown,'VerneMQ02@10.50.5.32'},{gen_server,call,[{vmq_swc_edist_meta1,'VerneMQ02@10.50.5.32'},{apply,vmq_swc_store,rpc_sync_missing,['VerneMQ01@10.50.5.31',[{{'VerneMQ03@10.50.5.33',<<164,146,163,148,243,48,192,18,144,11,142,80,165,238,181,39,17,195,98,249>>},218063},{{'VerneMQ03@10.50.5.33',<<164,146,163,148,243,48,192,18,144,11,142,80,165,238,181,39,17,195,98,249>>},218062},{{'VerneMQ03@10.50.5.33',<<164,146,163,148,243,48,192,18,144,11,142,80,165,238,181,39,17,195,98,249>>},218061},{{...},...},...]]},...]}} in context child_terminated
2023-04-21 02:03:48.123 [error] <0.305.0> Supervisor vmq_swc_exchange_sup_meta6 had child {vmq_swc_exchange_fsm,'VerneMQ02@10.50.5.32'} started with {vmq_swc_exchange_fsm,start_link,undefined} at <0.25004.582> exit with reason {{nodedown,'VerneMQ02@10.50.5.32'},{gen_server,call,[{vmq_swc_edist_meta6,'VerneMQ02@10.50.5.32'},{apply,vmq_swc_store,rpc_node_clock,[]},infinity]}} in context child_terminated
2023-04-21 02:03:48.124 [error] <0.368.0> Supervisor vmq_swc_exchange_sup_meta9 had child {vmq_swc_exchange_fsm,'VerneMQ02@10.50.5.32'} started with {vmq_swc_exchange_fsm,start_link,undefined} at <0.23687.582> exit with reason {{nodedown,'VerneMQ02@10.50.5.32'},{gen_server,call,[{vmq_swc_edist_meta9,'VerneMQ02@10.50.5.32'},{apply,vmq_swc_store,rpc_sync_missing,['VerneMQ01@10.50.5.31',[{{'VerneMQ03@10.50.5.33',<<164,146,163,148,243,48,192,18,144,11,142,80,165,238,181,39,17,195,98,249>>},185623},{{'VerneMQ03@10.50.5.33',<<164,146,163,148,243,48,192,18,144,11,142,80,165,238,181,39,17,195,98,249>>},185622},{{'VerneMQ03@10.50.5.33',<<164,146,163,148,243,48,192,18,144,11,142,80,165,238,181,39,17,195,98,249>>},185621},{{...},...},...]]},...]}} in context child_terminated
2023-04-21 02:03:48.124 [error] <0.253.0> Supervisor vmq_swc_exchange_sup_meta2 had child {vmq_swc_exchange_fsm,'VerneMQ02@10.50.5.32'} started with {vmq_swc_exchange_fsm,start_link,undefined} at <0.24045.582> exit with reason {{nodedown,'VerneMQ02@10.50.5.32'},{gen_server,call,[{vmq_swc_edist_meta2,'VerneMQ02@10.50.5.32'},{apply,vmq_swc_store,rpc_sync_missing,['VerneMQ01@10.50.5.31',[{{'VerneMQ03@10.50.5.33',<<164,146,163,148,243,48,192,18,144,11,142,80,165,238,181,39,17,195,98,249>>},119967},{{'VerneMQ03@10.50.5.33',<<164,146,163,148,243,48,192,18,144,11,142,80,165,238,181,39,17,195,98,249>>},119966},{{'VerneMQ03@10.50.5.33',<<164,146,163,148,243,48,192,18,144,11,142,80,165,238,181,39,17,195,98,249>>},119965},{{...},...},...]]},...]}} in context child_terminated
2023-04-21 02:03:48.124 [error] <0.393.0> Supervisor vmq_swc_exchange_sup_meta10 had child {vmq_swc_exchange_fsm,'VerneMQ02@10.50.5.32'} started with {vmq_swc_exchange_fsm,start_link,undefined} at <0.24991.582> exit with reason {{nodedown,'VerneMQ02@10.50.5.32'},{gen_server,call,[{vmq_swc_edist_meta10,'VerneMQ02@10.50.5.32'},{apply,vmq_swc_store,rpc_node_clock,[]},infinity]}} in context child_terminated
2023-04-21 07:51:39.736 [error] <0.12064.608>@vmq_mqtt_fsm:auth_on_publish:{739,13} can't auth publish [<<"20670539F543">>,{[],<<"ZYZH-YT407CAT-YTWSKJ-1|20670539F543_temp">>},1,[<<>>,<<"s">>,<<"ZYZH-YT407CAT-YTWSKJ-1">>,<<"20670539F543">>,<<"r">>,<<"57">>,<<"rq">>,<<"1111">>],<<"{\"d\":{\"pv\":1,\"ds\":128,\"sn\":\"20670539F543\"}}">>,false] due to not_authorized
2023-04-21 07:57:03.685 [error] <0.23941.608>@vmq_mqtt_fsm:auth_on_publish:{739,13} can't auth publish [<<"20670539F543">>,{[],<<"ZYZH-YT407CAT-YTWSKJ-1|20670539F543_temp">>},1,[<<>>,<<"s">>,<<"ZYZH-YT407CAT-YTWSKJ-1">>,<<"20670539F543">>,<<"r">>,<<"57">>,<<"rq">>,<<"1111">>],<<"{\"d\":{\"pv\":1,\"ds\":128,\"sn\":\"20670539F543\"}}">>,false] due to not_authorized
2023-04-21 09:46:10.518 [error] <0.368.0>@gen_server:call Supervisor vmq_swc_exchange_sup_meta9 had child {vmq_swc_exchange_fsm,'VerneMQ02@10.50.5.32'} started with {vmq_swc_exchange_fsm,start_link,undefined} at <0.12695.550> exit with reason no such process or port in call to gen_server:call({vmq_swc_edist_meta9,'VerneMQ02@10.50.5.32'}, {apply,vmq_swc_store,rpc_node_clock,[]}, infinity) in context child_terminated
2023-04-21 09:46:11.116 [error] <0.393.0>@gen_server:call Supervisor vmq_swc_exchange_sup_meta10 had child {vmq_swc_exchange_fsm,'VerneMQ02@10.50.5.32'} started with {vmq_swc_exchange_fsm,start_link,undefined} at <0.5150.612> exit with reason no such process or port in call to gen_server:call({vmq_swc_edist_meta10,'VerneMQ02@10.50.5.32'}, {apply,vmq_swc_store,rpc_sync_missing,['VerneMQ01@10.50.5.31',[{{'VerneMQ03@10.50.5.33',<<164,...>>},...},...]]}, infinity) in context child_terminated


@GoneGo1ng GoneGo1ng added the bug label Apr 25, 2023
@ioolkos
Contributor

ioolkos commented Apr 25, 2023

Can you specify what you mean by "after VerneMQ starts the cluster?"
What you see is an issue that is being investigated ("empty synchronisation" after nodes have left and joined the cluster; reported by multiple use cases).
But you should not see this when VerneMQ initially clusters, or when you just start and stop nodes.

Starting from 3 empty clustered nodes: can you tell me what you actually test to arrive at the behaviour described?


👉 Thank you for supporting VerneMQ: https://github.com/sponsors/vernemq
👉 Using the binary VerneMQ packages commercially (.deb/.rpm/Docker) requires a paid subscription.

@GoneGo1ng
Author

I mean that after all the nodes of the VerneMQ cluster are started, the CPU usage soars. When only one node is started, the CPU usage is normal.
We tested with 6,000 clients connecting to VerneMQ and started testing in mid-March.
The CPU usage of VerneMQ was normal at first. It has been running continuously for more than a month now, and this problem only began to appear a few days ago.

@GoneGo1ng
Author

We also modified redis.lua.

redis.lua:

-- Licensed under the Apache License, Version 2.0 (the "License");
-- you may not use this file except in compliance with the License.
-- You may obtain a copy of the License at
--
--     http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.

-- Redis Configuration, read the documentation below to properly
-- provision your database.
require "auth/auth_commons"

local function split(str, reps)
    local resultStrList = {}
    string.gsub(str, '[^'..reps..']+', function (w)
        table.insert(resultStrList, w)
        end
    )
    return resultStrList
end

-- In order to use this Lua plugin you must store a JSON Object containing 
-- the following properties as Redis Value:
--
--  - passhash: STRING (bcrypt)
--  - publish_acl: [ACL]  (Array of ACL JSON Objects)
--  - subscribe_acl: [ACL]  (Array of ACL JSON Objects)
--
--  The JSON array passed as publish/subscribe ACL contains the ACL objects topic
--  for this particular user. MQTT wildcards as well as the variable 
--  substitution for %m (mountpoint), %c (client_id), %u (username) are allowed
--  inside a pattern. 
--
-- The Redis Key is the JSON Array [mountpoint, client_id, username]
-- 
-- IF YOU USE THE KEY/VALUE SCHEMA PROVIDED ABOVE NOTHING HAS TO BE CHANGED 
-- IN THE FOLLOWING SCRIPT.
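--
-- For illustration only (hypothetical values, not taken from this deployment),
-- a stored entry under this default key/value schema could look like:
--   Key:   ["","myclient","myuser"]
--   Value: {"passhash":"$2a$12$...",
--           "publish_acl":[{"pattern":"some/topic"}],
--           "subscribe_acl":[{"pattern":"some/other/topic"}]}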
function auth_on_register(reg)
    if reg.username ~= nil and reg.password ~= nil then
        key = json.encode({reg.mountpoint, reg.client_id, reg.username})
        res = redis.cmd(pool, "get " .. key)
        if res then
            res = json.decode(res)
            if res.passhash == reg.password then
                cache_insert(
                    reg.mountpoint, 
                    reg.client_id, 
                    reg.username,
                    res.publish_acl,
                    res.subscribe_acl
                    )
                return true
            end
        else
            pdk = split(reg.client_id, "|")
            res1 = redis.cmd(pool, "get secret:product:" .. pdk[1])
            if res1 then
                res1 = json.decode(res1)
                if not res1 or res1.secret ~= reg.password then
                    return false
                end
            end
            publish_acl = {
                {
                    ["pattern"] = "/sys/" .. pdk[1] .. "/" .. reg.username .. "/thing/event/connect"
                },
                {
                    ["pattern"] = "/s/" .. pdk[1] .. "/" .. reg.username .. "/e/c"
                }
            }
            subscribe_acl = {
                {
                    ["pattern"] = "/sys/" .. pdk[1] .. "/" .. reg.username .. "/device/rrpc/connect/response"
                },
                {
                    ["pattern"] = "/s/" .. pdk[1] .. "/" .. reg.username .. "/r/c/rp"
                }
            }
            acl = {
                ["passhash"] = reg.password,
                ["publish_acl"] = publish_acl,
                ["subscribe_acl"] = subscribe_acl
            }
            value = json.encode(acl)
            key = json.encode({reg.mountpoint, reg.client_id, reg.username})
            res2 = redis.cmd(pool, "setex " .. key .. " 300 ".. value)
            if res2 then
                cache_insert(
                    reg.mountpoint, 
                    reg.client_id, 
                    reg.username,
                    publish_acl,
                    subscribe_acl
                    )
                return true
            end
        end
    end
    return false
end

pool = "auth_redis"
config = {
    pool_id = pool
}

redis.ensure_pool(config)
hooks = {
    auth_on_register = auth_on_register,
    auth_on_publish = auth_on_publish,
    auth_on_subscribe = auth_on_subscribe,
    on_unsubscribe = on_unsubscribe,
    on_client_gone = on_client_gone,
    on_client_offline = on_client_offline,

    auth_on_register_m5 = auth_on_register_m5,
    auth_on_publish_m5 = auth_on_publish_m5,
    auth_on_subscribe_m5 = auth_on_subscribe_m5,
}

@ioolkos
Contributor

ioolkos commented Apr 25, 2023

Re "a few days ago": did you do anything specific to the cluster then? like make one or multiple nodes leave and rejoin the cluster?
Btw: this is not on Kubernetes, right?



@GoneGo1ng
Author

Correct, this is not on Kubernetes; it is running on three virtual machines.
We did not manually change the cluster, but last Friday the load on one node was so high that the operating system killed the VerneMQ process on that node.
To emphasize: the CPU usage of VerneMQ was already high before the process was killed.

@ioolkos
Contributor

ioolkos commented Apr 25, 2023

Wouldn't a process only be killed/OOMed with high RAM, not high CPU?
But yes, this seems related to what you're seeing.
The cluster size shows as 3, though? (vmq-admin cluster show)



@GoneGo1ng
Author

(screenshot attached)

@GoneGo1ng
Author

When the process was killed, memory usage was not particularly high, but CPU usage had been high for some time.
(screenshot attached)

@ioolkos
Contributor

ioolkos commented Apr 25, 2023

Ok. I wonder what happened during the VM takedown. Was the disk wiped?
Operationally, now, you probably have to rebuild the cluster to get out of empty sync.
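For reference, a rebuild would roughly mean making the affected nodes leave and rejoin with a clean state, something along these lines (a sketch only, using the node names from this issue; please check the VerneMQ clustering documentation before doing this on a live system):

vmq-admin cluster leave node=VerneMQ02@10.50.5.32
(stop the node that left, clear its VerneMQ data directory if a completely fresh metadata store is wanted, start it again, then on that node:)
vmq-admin cluster join discovery-node=VerneMQ01@10.50.5.31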

I don't know why you see high CPU before the crash; you might want to look at this separately. But it must somehow be load related (the input we give to the system).
You could also test vmq_plumtree as an alternative metadata_plugin (and check whether you use randomly generated one-time ClientIDs).
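Switching the metadata plugin would just be a one-line change in the vernemq.conf shown above, applied to all nodes and followed by a node restart (a sketch, assuming no other changes):

metadata_plugin = vmq_plumtree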



@GoneGo1ng
Author

This shows the status of the disk, memory, and CPU when the process was killed. There are obvious fluctuations in IO, but we are also not sure what data is being processed.
We didn't wipe any data from the disk.
Thanks for your patience; I will try your suggestion.
(screenshot of disk, memory, and CPU metrics attached)

@ioolkos
Contributor

ioolkos commented Jun 3, 2023

potential fix: #2162



@ioolkos
Contributor

ioolkos commented Jun 16, 2023

@GoneGo1ng What is your status on this? Can you possibly test the 1.13.0 release?



@GoneGo1ng
Author

@ioolkos Thx, I will test.

@localbubble

Hi, I can confirm the issue is still happening on VerneMQ version 1.13.0.
