[Bug]: Loss of retained QoS 1 messages across cluster nodes #2251

Open · dlanzafame opened this issue Feb 15, 2024 · 1 comment

@dlanzafame

Environment

  • VerneMQ Version: 1.12.6.2
  • OS: Docker container on an Amazon Linux 2 EC2 instance (kernel 4.14.198-152.320.amzn2.x86_64)
  • Erlang/OTP version: 24.3.4.5
  • Cluster size/standalone: 2-node cluster on separate EC2 instances

Current Behavior

If the publisher and subscriber are connected to different cluster nodes, the last retained message is lost when it is published less than one second before another client's subscription completes. With publish and subscribe more than one second apart, 100% of messages are delivered; below one second, the loss rate grows as the gap shrinks, reaching up to 70% when publish and subscribe are concurrent. If both clients are connected to the same node, no messages are lost, even with concurrent publish and subscribe.

Steps to Reproduce:

  1. Client A publishes a QoS 1 retained message with payload "offline" to topic T and waits 5 seconds.
  2. Client A then publishes a QoS 1 retained message with payload "online" to topic T.
  3. Introduce a variable sleep of between 0 and 1 second.
  4. Client B subscribes to topic T and waits up to 15 seconds for either the "online" message alone, or "offline" followed by "online" (see the reproduction sketch below).
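
For reference, a minimal reproduction sketch using the Python paho-mqtt client (1.x callback API). The hostnames `node-a`/`node-b`, port 1883, and the topic are placeholders standing in for the two cluster nodes, not values taken from this report:

```python
# repro.py: minimal sketch of the reported race (assumptions: paho-mqtt 1.x,
# plain TCP listeners on 1883, placeholder hostnames node-a / node-b).
import time
import paho.mqtt.client as mqtt

NODE_A, NODE_B, TOPIC = "node-a", "node-b", "device/42/status"

def run_trial(gap_s):
    """Publish retained 'offline' then 'online' to node A, subscribe on
    node B after gap_s seconds, and report whether 'online' arrived."""
    pub = mqtt.Client(client_id="client-a")
    pub.connect(NODE_A, 1883)
    pub.loop_start()
    pub.publish(TOPIC, "offline", qos=1, retain=True).wait_for_publish()
    time.sleep(5)
    pub.publish(TOPIC, "online", qos=1, retain=True).wait_for_publish()

    time.sleep(gap_s)  # the variable publish-to-subscribe window from step 3

    got = []
    sub = mqtt.Client(client_id="client-b")
    sub.on_message = lambda cl, userdata, msg: got.append(msg.payload)
    sub.connect(NODE_B, 1883)
    sub.subscribe(TOPIC, qos=1)
    sub.loop_start()
    deadline = time.time() + 15  # step 4: wait up to 15 seconds
    while time.time() < deadline and b"online" not in got:
        time.sleep(0.1)
    pub.loop_stop(); sub.loop_stop()
    pub.disconnect(); sub.disconnect()
    return b"online" in got

if __name__ == "__main__":
    for gap in (0.0, 0.25, 0.5, 1.0):
        delivered = sum(run_trial(gap) for _ in range(20))
        print(f"gap={gap:.2f}s delivered {delivered}/20")
```

Sweeping the gap from 0 to 1 second should reproduce the loss curve described above when NODE_A and NODE_B point at different cluster members, and full delivery when both point at the same node.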

Expected Behavior

A retained message published with QoS 1 should always be delivered to subscribers, regardless of which cluster node they connect to.

Configuration, logs, error output, etc.

listener.ssl.use_identity_as_username=on
max_inflight_messages=20
max_offline_messages=100000
vmq_bridge.ssl.sbr0.topic.2=device/+/foo/bar out 1
vmq_bridge.ssl.sbr0.topic.1=device/+/baz/foobar/barbaz out 1
vmq_bridge.ssl.sbr0.keepalive_interval=60
vmq_acl.acl_file=/etc/vernemq/acl/vmq.acl
vmq_bridge.ssl.sbr0.try_private=on
listener.max_connections=350000
vmq_bridge.ssl.sbr0.keyfile=/vernemq/ssl/key.pem
log.console.level=debug
accept_eula=yes
listener.tcp.pproxy=0.0.0.0:1884
listener.ssl.crlfile=/vernemq/ssl_crl/crl.pem
plugins.vmq_bridge=on
distributed_cookie=cluster1
listener.tcp.default=0.0.0.0:1883
listener.tcp.pproxy.proxy_protocol=on
vmq_bridge.ssl.sbr0.cleansession=on
listener.http.metrics=0.0.0.0:8990
listener.ssl.default=0.0.0.0:8883
listener.ssl.require_certificate=on
listener.vmq.clustering=10.1.1.7:44053
listener.ssl.keyfile=/vernemq/ssl/cloudkey.pem
listener.ssl.cafile=/vernemq/ssl/ca-chain.pem
vmq_bridge.ssl.sbr0.client_id=7snn1n0nos930oqrqrpq6ps0355n60n58q225s08ospr441496n52or37
erlang.distribution.port_range.minimum=6000
log.console=console
max_online_messages=10000
vmq_bridge.ssl.sbr0.cafile=/vernemq/ssl/RootCA.pem
listener.tcp.pproxy.proxy_protocol_use_cn_as_username=on
vmq_bridge.ssl.sbr0=foobarbaz.iot.somewhere-1.amazonaws.com:8883
allow_anonymous=on
listener.ssl.certfile=/vernemq/ssl/cloud.pem
nodename=mqtt1@10.1.1.7
listener.http.default=0.0.0.0:8888
erlang.distribution.port_range.maximum=7999
vmq_bridge.ssl.sbr0.certfile=/vernemq/ssl/certificate.pem
listener.ws.default=10.1.1.7:8080
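
For diagnosis, the divergence between the two nodes' retain stores can be made visible by reading the retained payload from each node independently. A sketch under the same placeholder assumptions as above:

```python
# check_retain.py: ask each node which retained payload it would deliver
# to a brand-new subscriber (placeholder hostnames and topic, paho-mqtt 1.x).
import time
import paho.mqtt.client as mqtt

def read_retained(host, topic, timeout=5.0):
    """Return the retained payload delivered to a fresh subscriber, or None."""
    seen = []
    c = mqtt.Client()  # clean session, so only the retained message arrives
    c.on_message = lambda cl, userdata, msg: seen.append(msg.payload)
    c.connect(host, 1883)
    c.subscribe(topic, qos=1)
    c.loop_start()
    deadline = time.time() + timeout
    while time.time() < deadline and not seen:
        time.sleep(0.1)
    c.loop_stop()
    c.disconnect()
    return seen[-1] if seen else None

for host in ("node-a", "node-b"):
    print(host, read_retained(host, "device/42/status"))
```

If the two nodes print different payloads shortly after a publish, the subscriber's node simply has not yet received the replicated retained message, which matches the eventual-consistency explanation in the comment below.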

@dlanzafame dlanzafame added the bug label Feb 15, 2024
@ioolkos (Contributor) commented Feb 15, 2024

@dlanzafame Thanks for your report. The retain store is eventually consistent; this has been noted before.
You can try this PR to see whether it lowers the rate of missed PUBLISHes: #2219

Ultimately, though, the proper solution is to introduce consensus into the distribution of retained messages. This will drastically lower the retain store's performance, but it will better match the expectations of users who reason about the wall-clock ordering of events.
I'm working on a solution for this.
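
Until such a change lands, one possible client-side mitigation (an editor's sketch, not something proposed in this thread) is to re-subscribe when no retained message arrives within a short window; per MQTT 3.1.1, each SUBSCRIBE re-triggers delivery of matching retained messages, so retrying gives replication time to converge:

```python
# Workaround sketch: retry the subscription until a retained message arrives.
# Same placeholder host/topic assumptions as the earlier sketches.
import time
import paho.mqtt.client as mqtt

def subscribe_with_retry(host, topic, attempts=5, window=1.0):
    seen = []
    c = mqtt.Client()
    c.on_message = lambda cl, userdata, msg: seen.append(msg.payload)
    c.connect(host, 1883)
    c.loop_start()
    for _ in range(attempts):
        c.subscribe(topic, qos=1)   # each SUBSCRIBE re-requests retained state
        deadline = time.time() + window
        while time.time() < deadline and not seen:
            time.sleep(0.05)
        if seen:
            break
        c.unsubscribe(topic)        # drop and retry for a fresh retained send
    c.loop_stop()
    return seen[0] if seen else None
```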

@ioolkos ioolkos removed the bug label Mar 6, 2024