[Bug]: Loss of retained QoS 1 messages across cluster nodes #2251

Open · dlanzafame opened this issue Feb 15, 2024 · 1 comment

@dlanzafame

Environment

  • VerneMQ Version: 1.12.6.2
  • OS: Docker container on an Amazon Linux 2 EC2 instance (kernel 4.14.198-152.320.amzn2.x86_64)
  • Erlang/OTP version: 24.3.4.5
  • Cluster size/standalone: 2-node cluster on separate EC2 instances

Current Behavior

If the publisher and subscriber are connected to different cluster nodes, the last retained message is lost when it is published less than one second before another client's subscription completes. With publish and subscribe more than one second apart, 100% of messages are delivered; below one second, the loss rate grows as the gap shrinks, reaching up to 70% when publish and subscribe are concurrent. If both clients are connected to the same node, no messages are lost, even with concurrent publish and subscribe.

Steps to Reproduce:

  1. Client A publishes a QoS 1 retained message with payload "offline" to topic T and waits 5 seconds.
  2. Client A then publishes a QoS 1 retained message with payload "online" to topic T.
  3. Introduce a variable sleep of between 0 and 1 second.
  4. Client B subscribes to topic T and waits up to 15 seconds for either the "online" message alone, or "offline" followed by "online" (see the reproduction sketch below).
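
For reference, a minimal reproduction sketch using the Python paho-mqtt client (1.x callback API). The hostnames `node-a`/`node-b`, port 1883, and the topic are placeholders standing in for the two cluster nodes, not values taken from this report:

```python
# repro.py: minimal sketch of the reported race (assumptions: paho-mqtt 1.x,
# plain TCP listeners on 1883, placeholder hostnames node-a / node-b).
import time
import paho.mqtt.client as mqtt

NODE_A, NODE_B, TOPIC = "node-a", "node-b", "device/42/status"

def run_trial(gap_s):
    """Publish retained 'offline' then 'online' to node A, subscribe on
    node B after gap_s seconds, and report whether 'online' arrived."""
    pub = mqtt.Client(client_id="client-a")
    pub.connect(NODE_A, 1883)
    pub.loop_start()
    pub.publish(TOPIC, "offline", qos=1, retain=True).wait_for_publish()
    time.sleep(5)
    pub.publish(TOPIC, "online", qos=1, retain=True).wait_for_publish()

    time.sleep(gap_s)  # the variable publish-to-subscribe window from step 3

    got = []
    sub = mqtt.Client(client_id="client-b")
    sub.on_message = lambda cl, userdata, msg: got.append(msg.payload)
    sub.connect(NODE_B, 1883)
    sub.subscribe(TOPIC, qos=1)
    sub.loop_start()
    deadline = time.time() + 15  # step 4: wait up to 15 seconds
    while time.time() < deadline and b"online" not in got:
        time.sleep(0.1)
    pub.loop_stop(); sub.loop_stop()
    pub.disconnect(); sub.disconnect()
    return b"online" in got

if __name__ == "__main__":
    for gap in (0.0, 0.25, 0.5, 1.0):
        delivered = sum(run_trial(gap) for _ in range(20))
        print(f"gap={gap:.2f}s delivered {delivered}/20")
```

Sweeping the gap from 0 to 1 second should reproduce the loss curve described above when NODE_A and NODE_B point at different cluster members, and full delivery when both point at the same node.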

Expected Behavior

A retained message published with QoS 1 should always be delivered to subscribers, regardless of which cluster node they connect to.

Configuration, logs, error output, etc.

listener.ssl.use_identity_as_username=on
max_inflight_messages=20
max_offline_messages=100000
vmq_bridge.ssl.sbr0.topic.2=device/+/foo/bar out 1
vmq_bridge.ssl.sbr0.topic.1=device/+/baz/foobar/barbaz out 1
vmq_bridge.ssl.sbr0.keepalive_interval=60
vmq_acl.acl_file=/etc/vernemq/acl/vmq.acl
vmq_bridge.ssl.sbr0.try_private=on
listener.max_connections=350000
vmq_bridge.ssl.sbr0.keyfile=/vernemq/ssl/key.pem
log.console.level=debug
accept_eula=yes
listener.tcp.pproxy=0.0.0.0:1884
listener.ssl.crlfile=/vernemq/ssl_crl/crl.pem
plugins.vmq_bridge=on
distributed_cookie=cluster1
listener.tcp.default=0.0.0.0:1883
listener.tcp.pproxy.proxy_protocol=on
vmq_bridge.ssl.sbr0.cleansession=on
listener.http.metrics=0.0.0.0:8990
listener.ssl.default=0.0.0.0:8883
listener.ssl.require_certificate=on
listener.vmq.clustering=10.1.1.7:44053
listener.ssl.keyfile=/vernemq/ssl/cloudkey.pem
listener.ssl.cafile=/vernemq/ssl/ca-chain.pem
vmq_bridge.ssl.sbr0.client_id=7snn1n0nos930oqrqrpq6ps0355n60n58q225s08ospr441496n52or37
erlang.distribution.port_range.minimum=6000
log.console=console
max_online_messages=10000
vmq_bridge.ssl.sbr0.cafile=/vernemq/ssl/RootCA.pem
listener.tcp.pproxy.proxy_protocol_use_cn_as_username=on
vmq_bridge.ssl.sbr0=foobarbaz.iot.somewhere-1.amazonaws.com:8883
allow_anonymous=on
listener.ssl.certfile=/vernemq/ssl/cloud.pem
nodename=mqtt1@10.1.1.7
listener.http.default=0.0.0.0:8888
erlang.distribution.port_range.maximum=7999
vmq_bridge.ssl.sbr0.certfile=/vernemq/ssl/certificate.pem
listener.ws.default=10.1.1.7:8080
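
For diagnosis, the divergence between the two nodes' retain stores can be made visible by reading the retained payload from each node independently. A sketch under the same placeholder assumptions as above:

```python
# check_retain.py: ask each node which retained payload it would deliver
# to a brand-new subscriber (placeholder hostnames and topic, paho-mqtt 1.x).
import time
import paho.mqtt.client as mqtt

def read_retained(host, topic, timeout=5.0):
    """Return the retained payload delivered to a fresh subscriber, or None."""
    seen = []
    c = mqtt.Client()  # clean session, so only the retained message arrives
    c.on_message = lambda cl, userdata, msg: seen.append(msg.payload)
    c.connect(host, 1883)
    c.subscribe(topic, qos=1)
    c.loop_start()
    deadline = time.time() + timeout
    while time.time() < deadline and not seen:
        time.sleep(0.1)
    c.loop_stop()
    c.disconnect()
    return seen[-1] if seen else None

for host in ("node-a", "node-b"):
    print(host, read_retained(host, "device/42/status"))
```

If the two nodes print different payloads shortly after a publish, the subscriber's node simply has not yet received the replicated retained message, which matches the eventual-consistency explanation in the comment below.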

@dlanzafame dlanzafame added the bug label Feb 15, 2024
@ioolkos (Contributor) commented Feb 15, 2024

@dlanzafame Thanks for your report. The retain store is eventually consistent; this has been noted before.
You can try this PR to see whether it lowers the rate of missed PUBLISHes: #2219

Ultimately, though, the proper solution is to introduce consensus into the distribution of retained messages. This will drastically lower the retain store's performance, but it will better match the expectations of users who reason about the wall-clock ordering of events.
I'm working on a solution for this.
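
Until such a change lands, one possible client-side mitigation (an editor's sketch, not something proposed in this thread) is to re-subscribe when no retained message arrives within a short window; per MQTT 3.1.1, each SUBSCRIBE re-triggers delivery of matching retained messages, so retrying gives replication time to converge:

```python
# Workaround sketch: retry the subscription until a retained message arrives.
# Same placeholder host/topic assumptions as the earlier sketches.
import time
import paho.mqtt.client as mqtt

def subscribe_with_retry(host, topic, attempts=5, window=1.0):
    seen = []
    c = mqtt.Client()
    c.on_message = lambda cl, userdata, msg: seen.append(msg.payload)
    c.connect(host, 1883)
    c.loop_start()
    for _ in range(attempts):
        c.subscribe(topic, qos=1)   # each SUBSCRIBE re-requests retained state
        deadline = time.time() + window
        while time.time() < deadline and not seen:
            time.sleep(0.05)
        if seen:
            break
        c.unsubscribe(topic)        # drop and retry for a fresh retained send
    c.loop_stop()
    return seen[0] if seen else None
```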

@ioolkos ioolkos removed the bug label Mar 6, 2024