
Cluster HA improvements #1769

Open · wants to merge 9 commits into base: main

Conversation

MarcVanOevelen
Contributor

Proposed Changes

These changes fix QoS 1 message loss in an HA VerneMQ cluster,
as reported in this issue:
Message loss on node fault with QoS1 (#1700)

The following problems were discovered and addressed:

  1. Make enqueue a synchronous call:
    When the broker receives a message from a publisher, it was written into the subscriber queue(s)
    using an asynchronous function call. This caused the ack to be sent before the message was written to persistent storage,
    leaving a small time window in which a crash of the broker loses the already acknowledged message.

Solved by using a synchronous call (see the sketch below).
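For illustration only, a minimal sketch of the async-vs-sync difference; the module, function, and message names are hypothetical, not the actual vmq_queue API:

```erlang
%% Illustrative sketch only -- hypothetical names, not the VerneMQ API.
-module(enqueue_sketch).
-export([enqueue_async/2, enqueue_sync/2]).

%% Asynchronous: the cast returns immediately, so a PUBACK can be sent
%% before the queue process has actually persisted the message.
enqueue_async(QueuePid, Msg) ->
    gen_server:cast(QueuePid, {enqueue, Msg}),
    ok.

%% Synchronous: the call only returns after the queue process has handled
%% (and persisted) the message, so the PUBACK is sent afterwards.
enqueue_sync(QueuePid, Msg) ->
    gen_server:call(QueuePid, {enqueue, Msg}, infinity).
```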

  2. Add a handshake on the inter-node TCP connection:
    When the broker receives messages for a subscriber connected to another node, the messages are sent via the dedicated inter-node TCP connection. If the receiving node crashes while a message sits in its TCP receive buffer,
    acknowledged to the sender but not yet processed by the broker, that message is lost:
    the publisher received an ack as soon as the message was delivered to the sender's TCP stack.

Solved by introducing an application-level handshake on the inter-node TCP connection (see the sketch after this list):

  • The message block "in transit" is buffered until acknowledged by the receiving node.
  • In case the receiving node crashed and hence no ack was received, the buffer is resent when the connection is re-established.
  • In case the subscriber reconnected to another node, the resent messages are forwarded to that new node.
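For illustration, a minimal sketch of the sender side of such a handshake; the state shape and function names are made up, not the actual vmq_cluster_node implementation:

```erlang
%% Illustrative sketch only -- hypothetical names, not the actual
%% vmq_cluster_node implementation.
-module(internode_handshake_sketch).
-export([send_block/3, handle_ack/1, handle_reconnect/2]).

%% Keep the block "in transit" until the peer sends an application-level ack.
send_block(Socket, Block, State) ->
    ok = gen_tcp:send(Socket, Block),
    State#{pending => Block}.

%% The receiving node confirmed processing; the buffered block can be dropped.
handle_ack(State) ->
    State#{pending => undefined}.

%% On reconnect, resend anything that was never acknowledged; the receiving
%% side forwards messages for subscribers that moved to another node.
handle_reconnect(Socket, #{pending := Block} = State) when Block =/= undefined ->
    ok = gen_tcp:send(Socket, Block),
    State;
handle_reconnect(_Socket, State) ->
    State.
```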
  3. Catch crashes during queue migration:
    During HA testing, some crashes were observed when queue migration was triggered,
    e.g. in the rpc call fetching the queue pid from the crashed node and in ets:lookup.

Solved by handling the exceptions (see the sketch below).
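For illustration, the kind of defensive handling meant here; the remote module and function are hypothetical, not the actual VerneMQ migration code:

```erlang
%% Illustrative sketch only -- hypothetical names, not the actual
%% VerneMQ migration code.
-module(queue_migration_sketch).
-export([remote_queue_pid/2]).

remote_queue_pid(Node, SubscriberId) ->
    %% rpc:call towards a crashed node returns {badrpc, Reason} instead of a
    %% pid; pattern-match defensively and catch anything else that blows up
    %% (e.g. a racing ets:lookup on a table that no longer exists).
    try rpc:call(Node, queue_registry, get_queue_pid, [SubscriberId]) of
        Pid when is_pid(Pid) -> {ok, Pid};
        _Other -> not_found
    catch
        _Class:_Reason -> not_found
    end.
```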

  4. Only release messages when acknowledged:
    When the broker delivers messages to a connected subscriber, the messages from the queue are saved in a
    backup queue and "mailed" to the fsm process that performs the interaction with the client (retries, acks, ...).
    But the backup messages were released (i.e. deleted from persistent storage) before they were actually acknowledged
    by the subscriber.
    Again, when the broker crashes those messages are lost.

Solved by:

  • releasing each message individually when the fsm receives the ack (see the sketch after this list);
  • releasing QoS 0 messages immediately (unless potentially upgraded by "maybe_upgrade_qos");
  • adding the next message batch to the backup queue instead of replacing it.
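For illustration, a minimal sketch of releasing a message only when its ack arrives; the state shape and helper are made up, not the actual vmq_mqtt_fsm/vmq_queue code:

```erlang
%% Illustrative sketch only -- hypothetical names, not the actual fsm code.
-module(release_on_ack_sketch).
-export([handle_puback/2]).

%% A message is removed from the backup queue and deleted from persistent
%% storage only once the subscriber acknowledges it.
handle_puback(MsgId, #{backup := Backup} = State) ->
    case maps:take(MsgId, Backup) of
        {MsgRef, Backup1} ->
            ok = delete_from_store(MsgRef),   %% hypothetical release helper
            State#{backup => Backup1};
        error ->
            State                             %% unknown or duplicate ack
    end.

delete_from_store(_MsgRef) ->
    ok.                                       %% stub for the sketch
```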

Types of Changes

Checklist

  • I have read the CODE_OF_CONDUCT.md document
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if needed)
  • Any dependent changes have been merged and published in related repositories
  • I have updated changelog (At the bottom of the release version)
  • I have squashed all my commits into one before merging

Further Comments

These fixes do not yet cover all cases of message loss, but they already drastically reduce its occurrence:

Without the fixes, about 1 in 3 HA test runs showed message loss.
With the above fixes, this dropped to about 1 lost message in 1000 test runs.

marc added 4 commits April 7, 2021 10:19
The message block "in transit" is buffered until acknowledged by the
receiving node.
In case the receiving node crashed, the buffer is sent again when
the connection is re-established.
In case the subscriber reconnected to another node, the resent messages
are forwarded to that new node.
Messages from the queue that are published to the subscriber were saved
in a backup queue and released (=deleted from persistent storage) before
they were delivered/acknowledged by the client.
Now the messages are only released when the acknowledgement is received.
@ioolkos
Contributor

ioolkos commented Apr 7, 2021

@MarcVanOevelen thanks... what a great effort! :)
As it's a rather big change it'll take me a while to review. We also need to make sure that any "change of strategy" (from async to sync processing, for instance) has no downstream effects.

If you feel I need any more specific information for an effective review, let me know. (I don't think so, for now at least.)

@MarcVanOevelen
Contributor Author

@ioolkos Thanks!

It took me quite some time to get familiar with the codebase but I must say it is very well structured,
and this is such a great project that it was really worth the effort!

Indeed such changes are very tricky given the many features and use-cases.
Hopefully they don't cause any side-effects that are not covered by the tests.

Our use case is focused on a small number of publishers/subscribers but we really need
reliable message delivery (QoS1) in a HA cluster.

I still have to assess the inevitable performance cost of those changes.

The release smoketest fails now but I suppose this is caused by the rebar3_cuttlefish dependency
which points to master. I had to specify the previous version to get this working again.

I also see that a test is failing here, which did not happen when I ran the tests locally.

If you need more info please let me know.
Thanks.

apps/vmq_server/src/vmq_cluster_node.erl (outdated, resolved)
Comment on lines +285 to +287
%% Note that currently the backup is kept only in memory, a future improvement
%% would be to persist the backup on disk to also protect against message loss
%% in case the transmitting node crashes.

Maybe we should put this in a ticket, @ioolkos, so it doesn't get lost in the code comments.

apps/vmq_server/src/vmq_mqtt5_fsm.erl (outdated, resolved)
apps/vmq_server/src/vmq_mqtt5_fsm.erl (outdated, resolved)
apps/vmq_server/test/vmq_cluster_com_SUITE.erl (outdated, resolved)
apps/vmq_server/test/vmq_queue_SUITE.erl (outdated, resolved)
MarcVanOevelen and others added 2 commits April 9, 2021 09:49
Remove excess blank lines

Co-authored-by: Dairon Medina Caro <dairon.medina@gmail.com>
@ioolkos
Contributor

ioolkos commented Apr 12, 2021

Common Test suites and Dialyzer pass for me locally.

@ioolkos
Contributor

ioolkos commented Apr 12, 2021

@MarcVanOevelen no in-depth review yet, but I like your code/commenting style, very much aligned with existing code.

It'd be great to have an idea of the performance impact, as you already mentioned. We also need to protect any RAM-grabbing component (buffer/queue/backup), or at least ensure that it will hit an upstream config limit (I haven't yet looked into the details of that point, though).
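For illustration, the kind of cap meant here; a minimal sketch with made-up names, not an actual VerneMQ config option:

```erlang
%% Illustrative sketch only -- hypothetical names; shows the idea of capping
%% an in-memory buffer at a configured limit instead of growing unboundedly.
-module(buffer_limit_sketch).
-export([maybe_buffer/2]).

maybe_buffer(Msg, #{pending := Pending, max_pending := Max} = State)
  when length(Pending) < Max ->
    {ok, State#{pending => [Msg | Pending]}};
maybe_buffer(_Msg, State) ->
    %% buffer full: refuse and let the caller apply back-pressure or drop
    {error, buffer_full, State}.
```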

@ioolkos
Contributor

ioolkos commented Apr 16, 2021

@MarcVanOevelen I'll try to run some basic performance comparison over the weekend. Anything specific you'd want me to test?

@MarcVanOevelen
Contributor Author

MarcVanOevelen commented Apr 16, 2021

@MarcVanOevelen I'll try to run some basic performance comparison over the weekend. Anything specific you'd want me to test?

I have successfully completed the HA behavior tests in our production system.
Now I am also running performance tests relative to a vernemq:1.11.0-alpine deployment using the helm chart in k8s.
These tests try to assess maximum sustainable message rates and latency using 1 publisher and 1 subscriber with the following parameters:

  • QoS: [0, 1, 2]
  • message sizes: [100 bytes, 64K bytes]
  • setup: [single node, 2-node cluster]
    clients connect via a k8s service, hence in the 2-node cluster the publisher and subscriber connect to different nodes

I had almost completed testing, but unfortunately we just suffered a power outage.
Results will be ready in the course of next week.

Since performance tests are very dependent on the test environment (resources, ...), we can learn most from relative tests.
But it would be nice if you could test using the same parameters so we can compare our results.

@ioolkos
Contributor

ioolkos commented Apr 16, 2021

@MarcVanOevelen great, thanks. Hopefully no power outages any more :)

@ioolkos
Contributor

ioolkos commented Apr 18, 2021

Did a basic test to get a feeling for latencies. This is a quick test with 50 publishers on node 1 and 1 subscriber on node 2, sending 1500 100-byte messages in total per second (QoS 1).
The highest line is max latency, then the 95th percentile. The max line is at 90ms vs 30ms, so this adds 60ms.

[Latency graph: 1769 branch]
[Latency graph: main branch]

EDIT: I also noticed that CPU on both nodes stays very high after the test finishes with the 1769 branch. This needs to be investigated too.
EDIT2: The above happens as soon as you cluster nodes. vmq_cluster_node is needlessly looping somehow.

@ioolkos
Contributor

ioolkos commented Apr 18, 2021

@MarcVanOevelen the loop in vmq_cluster_node is racing through loop(internal_flush) because you've disabled the check for Pending in the first clause.
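For illustration, the kind of guard being referred to; the state shape and helpers below are made up, not the actual vmq_cluster_node loop:

```erlang
%% Illustrative sketch only -- hypothetical state and helpers, not the actual
%% vmq_cluster_node code.
-module(flush_loop_sketch).
-export([loop/1]).

%% First clause: nothing pending, so block in receive instead of spinning
%% through internal_flush (dropping this guard causes the high CPU).
loop(#{pending := []} = State) ->
    receive
        {enqueue, Msg} -> loop(buffer(Msg, State))
    end;
%% Second clause: something is pending, flush it and continue.
loop(State) ->
    receive
        {enqueue, Msg} -> loop(buffer(Msg, State))
    after 0 ->
        loop(internal_flush(State))
    end.

buffer(Msg, #{pending := Pending} = State) ->
    State#{pending => [Msg | Pending]}.

internal_flush(#{pending := Pending} = State) ->
    %% the real code would send Pending over the inter-node socket here
    _ = lists:reverse(Pending),
    State#{pending => []}.
```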

Marc Van Oevelen added 2 commits April 18, 2021 20:06
This avoids high CPU usage due to needless looping.
@MarcVanOevelen
Contributor Author

@MarcVanOevelen the loop in vmq_cluster_node is racing through loop(internal_flush) because you've disabled the check for Pending in the first clause.

Nice catch!
I included the original "Pending" check again.
Thanks!

@MarcVanOevelen
Contributor Author

@ioolkos the performance test comparison results are in the attached PDF:
VerneMQ_performance_test.pdf

@ioolkos
Contributor

ioolkos commented Apr 21, 2021

@MarcVanOevelen thanks... could you leave me your e-mail address at the contact email here: https://vernemq.com/community.html

@MarcVanOevelen
Contributor Author

@ioolkos and here are the HA test results:
VerneMQ_cluster_HA_test.pdf
