Environment
OS: Docker container on an Amazon Linux 2 EC2 instance (Kernel 4.14.198-152.320.amzn2.x86_64)
Erlang/OTP version: 24.3.4.5
Cluster size/standalone: 2 nodes on different EC2 instances
Current Behavior
If the publisher and subscriber are connected to different cluster nodes, the latest retained message can be lost when it is published less than 1 second before another client completes its subscription. When the publish and subscribe actions are more than 1 second apart, 100% of messages are delivered. As the interval drops below 1 second, the share of lost messages grows, reaching up to 70% when the publish and subscribe actions are concurrent. If the publisher and subscriber are connected to the same node, no message loss occurs, even with concurrent publish and subscribe actions.
Steps to Reproduce:
1. Client A publishes a QoS 1 retained message with the payload "offline" to topic T and waits 5 seconds.
2. Client A then publishes a QoS 1 retained message with the payload "online" to topic T.
3. Sleep for a variable duration between 0 and 1 second.
4. Client B, connected to the other cluster node, subscribes to topic T and waits up to 15 seconds for either the "online" message alone or "offline" followed by "online".
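The steps above could be scripted roughly as follows. This is a minimal sketch, not the reporter's actual test: it assumes the `paho-mqtt` package is installed and uses its `publish.single` and `subscribe.callback` helpers; `run_trial`, `classify`, and the host arguments are hypothetical names introduced for illustration.

```python
import threading
import time


def classify(payloads):
    """Return "delivered" if the subscriber saw one of the accepted sequences."""
    if payloads == ["online"] or payloads == ["offline", "online"]:
        return "delivered"
    return "lost"


def run_trial(host_a, host_b, delay_s, topic="T", wait_s=15.0):
    """Run one publish/subscribe trial across two cluster nodes (hypothetical helper)."""
    # paho-mqtt is imported lazily so classify() stays usable without it.
    from paho.mqtt import publish, subscribe

    # Client A, connected to node A: "offline", a 5 second pause, then "online".
    publish.single(topic, "offline", qos=1, retain=True, hostname=host_a)
    time.sleep(5)
    publish.single(topic, "online", qos=1, retain=True, hostname=host_a)

    # Variable gap between the last publish and the subscription.
    time.sleep(delay_s)

    # Client B, connected to node B: collect payloads until "online" arrives.
    seen = []

    def on_message(client, userdata, message):
        seen.append(message.payload.decode())
        if seen[-1] == "online":
            client.disconnect()  # ends the blocking loop inside subscribe.callback

    worker = threading.Thread(
        target=subscribe.callback,
        args=(on_message, topic),
        kwargs={"qos": 1, "hostname": host_b},
        daemon=True,
    )
    worker.start()
    worker.join(wait_s)  # stop waiting after wait_s seconds
    return classify(list(seen))
```

Calling `run_trial` repeatedly with `delay_s` values between 0 and 1, once with two different node addresses and once with the same address for both, should give a loss rate per delay that can be compared against the numbers reported here.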
Expected behaviour
A retained message published with QoS 1 should always be delivered to subscribers.
@dlanzafame Thanks for your report. The retain store is eventually consistent; it has often been noted.
You can try this PR to see whether it lowers the rate of missed Publish: #2219
But ultimately, the proper solution is to introduce consensus into the distribution of retained messages. This will lower the performance of the retain store drastically, but it will better serve users who rely on the wall-clock order of events.
I'm working on a solution for this.
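To illustrate why an eventually consistent retain store produces this symptom, here is a toy model in plain Python. It is not VerneMQ's replication code: the `Cluster` class, the logical clock, and the fixed 1 second replication delay are all assumptions chosen only to mirror the window described in the report.

```python
class Cluster:
    """Toy two-node retain store with delayed replication (illustrative only)."""

    def __init__(self, replication_delay):
        self.delay = replication_delay
        self.retained = {"a": {}, "b": {}}  # per-node retained payload by topic
        self.pending = []  # (apply_at, node, topic, payload), in publish order

    def _apply_due(self, now):
        # Apply every replicated write whose delay has elapsed, keep the rest.
        still_pending = []
        for apply_at, node, topic, payload in self.pending:
            if apply_at <= now:
                self.retained[node][topic] = payload
            else:
                still_pending.append((apply_at, node, topic, payload))
        self.pending = still_pending

    def publish(self, node, topic, payload, now):
        # The local node sees the write at once; other nodes see it later.
        self._apply_due(now)
        self.retained[node][topic] = payload
        for other in self.retained:
            if other != node:
                self.pending.append((now + self.delay, other, topic, payload))

    def subscribe(self, node, topic, now):
        # A new subscriber receives whatever the local node currently retains.
        self._apply_due(now)
        return self.retained[node].get(topic)


cluster = Cluster(replication_delay=1.0)
cluster.publish("a", "T", "offline", now=0.0)
cluster.publish("a", "T", "online", now=5.0)
same_node = cluster.subscribe("a", "T", now=5.2)   # reads the fresh value
other_node = cluster.subscribe("b", "T", now=5.2)  # reads the stale value
after_sync = cluster.subscribe("b", "T", now=6.0)  # replication has caught up
```

In this model a subscriber on the publishing node always reads the latest value, while a subscriber on the other node reads the previous one until replication catches up. A real broker would also forward the live "online" publish to an already-established subscription, which this sketch does not model.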