[Bug]: Log spammed with vmq_metrics:met2idx({mqtt5_disconnect_sent,disconnect_migration}) #2087

Open · mths1 opened this issue Feb 16, 2023 · 6 comments

@mths1
Contributor

mths1 commented Feb 16, 2023

Environment

  • VerneMQ Version:
  • OS:
  • Erlang/OTP version (if building from source):
  • Cluster size/standalone:

Current Behavior

After an update of one of our nodes the log is full of the following error messages:

with 0 neighbours crashed with reason: no function clause matching vmq_metrics:met2idx({mqtt5_disconnect_sent,disconnect_migration}) line 1950

First, I thought: Easy fix. This seems to have been forgotten in met2idx. After looking into the code, it was not so obvious anymore.

The origin seems to be:
https://github.com/vernemq/vernemq/blob/master/apps/vmq_server/src/vmq_mqtt5_fsm.erl

gen_disconnect_(RCN, Props) ->
    _ = vmq_metrics:incr({?MQTT5_DISCONNECT_SENT, RCN}),
    serialise_frame(#mqtt5_disconnect{reason_code = rcn2rc(RCN), properties = Props}).

The code crashes in vmq_metrics, i.e. before serialise_frame is called (which would crash in rcn2rc anyway). At this point I realized that adding it to the metrics is most likely not the right solution. What is the purpose of disconnect_migration: is it just some internal flag, or does it really have to be sent to the clients (which does not happen at the moment)? Should it just be translated to an administrative disconnect, be eaten somewhere and never reach gen_disconnect, or something completely different?
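
For illustration only, a minimal sketch of the "translate it to an administrative disconnect" option, i.e. normalizing the internal reason code before it reaches the metrics and frame path. The helper normalize_rcn/1 and the atom administrative_action are assumed names, not VerneMQ's actual identifiers or fix:

%% Hypothetical sketch: map the internal-only reason code to a spec-defined
%% one before it reaches vmq_metrics:incr/1 or rcn2rc/1.
%% normalize_rcn/1 and administrative_action are assumed names.
normalize_rcn(disconnect_migration) -> administrative_action;
normalize_rcn(RCN) -> RCN.

gen_disconnect_(RCN0, Props) ->
    RCN = normalize_rcn(RCN0),
    _ = vmq_metrics:incr({?MQTT5_DISCONNECT_SENT, RCN}),
    serialise_frame(#mqtt5_disconnect{reason_code = rcn2rc(RCN), properties = Props}).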

Expected behaviour

Clean log.

Configuration, logs, error output, etc.

No response

Code of Conduct

  • I agree to follow VerneMQ's Code of Conduct
@mths1 mths1 added the bug label Feb 16, 2023
@ioolkos
Contributor

ioolkos commented Feb 16, 2023

disconnect_migration is an internal reason code for a disconnect. I guess it should not be sent to the MQTT client. It originates from an online queue (where online means it still has a session attached, that is, a connected MQTT client) that receives a migrate request. That queue then disconnects the session with a disconnect_migration code.

It's a little weird that this happens here at all. I can't remember seeing this logged often, but it seems to need more attention.



@mths1
Contributor Author

mths1 commented Feb 16, 2023

The weird thing is that I can constantly see the log filling up with it, and it goes on for hours and hours. Even a restart did not help.

2023-02-16 15:15:32.777 [error] <0.22206.0>@vmq_queue:drain:{455,5} got unknown sync event in drain state {cleanup,session_taken_over}
2023-02-16 15:15:32.877 [error] <0.22206.0>@vmq_queue:drain:{455,5} got unknown sync event in drain state {cleanup,session_taken_over}
2023-02-16 15:15:32.978 [error] <0.22206.0>@vmq_queue:drain:{455,5} got unknown sync event in drain state {cleanup,session_taken_over}
2023-02-16 15:15:33.079 [error] <0.22206.0>@vmq_queue:drain:{455,5} got unknown sync event in drain state {cleanup,session_taken_over}
2023-02-16 15:15:33.180 [error] <0.22206.0>@vmq_queue:drain:{455,5} got unknown sync event in drain state {cleanup,session_taken_over}
2023-02-16 15:15:33.282 [error] <0.22206.0>@vmq_queue:drain:{455,5} got unknown sync event in drain state {cleanup,session_taken_over}
2023-02-16 15:15:33.383 [error] <0.22206.0>@vmq_queue:drain:{455,5} got unknown sync event in drain state {cleanup,session_taken_over}
2023-02-16 15:15:33.484 [error] <0.22206.0>@vmq_queue:drain:{455,5} got unknown sync event in drain state {cleanup,session_taken_over}
2023-02-16 15:15:33.585 [error] <0.22206.0>@vmq_queue:drain:{455,5} got unknown sync event in drain state {cleanup,session_taken_over}
2023-02-16 15:15:35.768 [error] <0.22248.0>@vmq_metrics:met2idx:1950 CRASH REPORT Process <0.22248.0> with 0 neighbours crashed with reason: no function clause matching vmq_metrics:met2idx({mqtt5_disconnect_sent,disconnect_migration}) line 1950

@mths1
Contributor Author

mths1 commented Feb 16, 2023

So I stopped the nodes, deleted the data directories, and rejoined them... one of the nodes keeps doing this, even after a reboot of the machine. This is getting really strange.

@ioolkos
Contributor

ioolkos commented Feb 16, 2023

How long do your MQTT clients wait for the server CONNACK before they attempt a reconnect?
Do you have persistent sessions (clean_session=false) and a round-robin load balancer strategy? Ideally, persistent clients end up on the node where their offline queue waits for them; then we can avoid queue migration completely.
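
(As a reference for the clean_session=false case: a persistent MQTT 5 session is requested from the client side roughly as below. This is a sketch using the emqtt Erlang client; host, port, client id and credentials are placeholders.)

%% Sketch with the emqtt client; all connection details are placeholders.
{ok, C} = emqtt:start_link([{host, "broker.example.internal"},
                            {port, 1883},
                            {clientid, <<"test">>},
                            {proto_ver, v5},
                            {clean_start, false},  %% ask for a persistent session
                            {properties, #{'Session-Expiry-Interval' => 3600}}]),
{ok, _ConnAckProps} = emqtt:connect(C).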



@mths1
Contributor Author

mths1 commented Feb 16, 2023

Meanwhile, I found the culprit. It goes in the direction you mentioned. So the bug is not that my log is spammed, but that it brings down one of my nodes (it was completely unavailable at one point, which triggered the warning in the first place).

2023-02-16T15:17:35Z <8705.12353.0> Trace session for test stopped
2023-02-16T15:17:35Z New session with PID <8705.12384.0> found for client "test"
2023-02-16T15:17:35Z <8705.12384.0> MQTT RECV: CID: "test" CONNECT(c: test, v: 5, u: test-dev, p: testpw, cs: true, ka: 60)
2023-02-16T15:17:35Z <8705.12384.0> Calling auth_on_register_m5({{10,241,28,243},52976},{[],<<"test">>},test-dev,testpw,true)
2023-02-16T15:17:35Z <8705.12384.0> Hook returned "ok"
2023-02-16T15:17:35Z <8705.12384.0> MQTT SEND: CID: "test" CONNACK(sp: 0, rc: success(0))
2023-02-16T15:17:36Z New session with PID <8705.12391.0> found for client "test"
2023-02-16T15:17:36Z <8705.12391.0> MQTT RECV: CID: "test" CONNECT(c: test, v: 5, u: test-dev, p: testpw, cs: false, ka: 60)
    with properties: #{p_session_expiry_interval => 3600}
2023-02-16T15:17:36Z <8705.12391.0> Calling auth_on_register_m5({{10,241,29,35},34708},{[],<<"test">>},test-dev,testpw,false)
    with properties: #{p_session_expiry_interval => 3600}
2023-02-16T15:17:36Z <8705.12391.0> Hook returned "ok"
2023-02-16T15:17:36Z <8705.12384.0> MQTT SEND: CID: "test" DISCONNECT(rc:session_taken_over(142))
2023-02-16T15:17:36Z <8705.12384.0> Trace session for test stopped
2023-02-16T15:17:36Z <8705.12391.0> MQTT SEND: CID: "test" CONNACK(sp: 1, rc: success(0))
2023-02-16T15:17:36Z <8705.12391.0> Trace session for test stopped
2023-02-16T15:17:37Z New session with PID <8705.12417.0> found for client "test"
2023-02-16T15:17:37Z <8705.12417.0> MQTT RECV: CID: "test" CONNECT(c: test, v: 5, u: test-dev, p: testpw, cs: true, ka: 60)
2023-02-16T15:17:37Z <8705.12417.0> Calling auth_on_register_m5({{10,241,28,243},52980},{[],<<"test">>},test-dev,testpw,true)
2023-02-16T15:17:37Z <8705.12417.0> Hook returned "ok"
2023-02-16T15:17:37Z <8705.12417.0> MQTT SEND: CID: "test" CONNACK(sp: 0, rc: success(0))
2023-02-16T15:17:38Z New session with PID <8705.12433.0> found for client "test"
2023-02-16T15:17:38Z <8705.12433.0> MQTT RECV: CID: "test" CONNECT(c: test, v: 5, u: test-dev, p: testpw, cs: false, ka: 60)
    with properties: #{p_session_expiry_interval => 3600}
2023-02-16T15:17:38Z <8705.12433.0> Calling auth_on_register_m5({{10,241,29,35},34720},{[],<<"test">>},test-dev,testpw,false)
    with properties: #{p_session_expiry_interval => 3600}
2023-02-16T15:17:38Z <8705.12433.0> Hook returned "ok"
2023-02-16T15:17:38Z <8705.12417.0> Trace session for test stopped
2023-02-16T15:17:38Z <8705.12433.0> MQTT SEND: CID: "test" CONNACK(sp: 1, rc: success(0))

@ioolkos
Contributor

ioolkos commented Feb 17, 2023

For future readers: the trace above shows a ClientId (<<"test">>) racing itself on the broker, connecting with cs=true and cs=false in alternation. This will trigger session takeovers.

I suppose we need more data on whether this is able to block a whole node over time or not.

Verne does not do any sliding-window accounting for TCP connection setup, which would allow per-client-ID or per-client-IP connection rate limiting.

Proxy components like HAProxy could do this (HAProxy even supports the MQTT ClientID since 2.4: https://www.haproxy.com/blog/announcing-haproxy-2-4/), although this likely implies that we'd need to terminate TLS in the proxy too, which is fine.
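
For illustration, a rough, untested HAProxy (2.4+) sketch of that idea: pin each MQTT ClientID to one VerneMQ node via a stick table, which also keeps persistent clients on the node holding their queue. Frontend/backend names and server addresses are placeholders:

frontend mqtt_in
    mode tcp
    bind :1883
    tcp-request inspect-delay 10s
    # only admit traffic that parses as valid MQTT
    tcp-request content reject unless { req.payload(0,0),mqtt_is_valid }
    default_backend vernemq_nodes

backend vernemq_nodes
    mode tcp
    # route a given ClientID to the same node every time
    stick-table type string len 64 size 100k expire 30m
    stick on req.payload(0,0),mqtt_field_value(connect,client_identifier)
    server vmq1 10.0.0.1:1883 check
    server vmq2 10.0.0.2:1883 check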


