
Cluster HA improvements #1769

Open · wants to merge 9 commits into base: main

Conversation

MarcVanOevelen
Contributor

Proposed Changes

These changes fix QoS 1 message loss in an HA VerneMQ cluster,
as reported in this issue:
Message loss on node fault with QoS1 (#1700)

The following problems were discovered and addressed:

  1. Make enqueue a synchronous call:
    When the broker receives a message from a publisher, it was written into the subscriber queue(s)
    using an asynchronous function call. This caused the ack to be sent before the message was written to persistent storage,
    leaving a small time window in which a crash of the broker loses the already acknowledged message.

Solved by using a synchronous call (see the sketch below).
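For illustration only, a minimal sketch of the async-vs-sync difference; the module, function, and message names are hypothetical, not the actual vmq_queue API:

```erlang
%% Illustrative sketch only -- hypothetical names, not the VerneMQ API.
-module(enqueue_sketch).
-export([enqueue_async/2, enqueue_sync/2]).

%% Asynchronous: the cast returns immediately, so a PUBACK can be sent
%% before the queue process has actually persisted the message.
enqueue_async(QueuePid, Msg) ->
    gen_server:cast(QueuePid, {enqueue, Msg}),
    ok.

%% Synchronous: the call only returns after the queue process has handled
%% (and persisted) the message, so the PUBACK is sent afterwards.
enqueue_sync(QueuePid, Msg) ->
    gen_server:call(QueuePid, {enqueue, Msg}, infinity).
```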

  2. Add a handshake on the inter-node TCP connection:
    When the broker receives messages for a subscriber connected to another node, the messages are sent via the dedicated inter-node TCP connection. If the receiving node crashes while a message sits in its TCP receive buffer,
    acknowledged to the sender but not yet processed by the broker, that message is lost:
    the publisher received an ack as soon as the message was delivered to the sender's TCP stack.

Solved by introducing an application-level handshake on the inter-node TCP connection (see the sketch after this list):

  • The message block "in transit" is buffered until acknowledged by the receiving node.
  • In case the receiving node crashed and hence no ack was received, the buffer is resent when the connection is re-established.
  • In case the subscriber reconnected to another node, the resent messages are forwarded to that new node.
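For illustration, a minimal sketch of the sender side of such a handshake; the state shape and function names are made up, not the actual vmq_cluster_node implementation:

```erlang
%% Illustrative sketch only -- hypothetical names, not the actual
%% vmq_cluster_node implementation.
-module(internode_handshake_sketch).
-export([send_block/3, handle_ack/1, handle_reconnect/2]).

%% Keep the block "in transit" until the peer sends an application-level ack.
send_block(Socket, Block, State) ->
    ok = gen_tcp:send(Socket, Block),
    State#{pending => Block}.

%% The receiving node confirmed processing; the buffered block can be dropped.
handle_ack(State) ->
    State#{pending => undefined}.

%% On reconnect, resend anything that was never acknowledged; the receiving
%% side forwards messages for subscribers that moved to another node.
handle_reconnect(Socket, #{pending := Block} = State) when Block =/= undefined ->
    ok = gen_tcp:send(Socket, Block),
    State;
handle_reconnect(_Socket, State) ->
    State.
```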
  3. Catch crashes during queue migration:
    During HA testing, some crashes were observed when queue migration was triggered,
    e.g. in the rpc call fetching the queue pid from the crashed node and in ets:lookup.

Solved by handling the exceptions (see the sketch below).
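For illustration, the kind of defensive handling meant here; the remote module and function are hypothetical, not the actual VerneMQ migration code:

```erlang
%% Illustrative sketch only -- hypothetical names, not the actual
%% VerneMQ migration code.
-module(queue_migration_sketch).
-export([remote_queue_pid/2]).

remote_queue_pid(Node, SubscriberId) ->
    %% rpc:call towards a crashed node returns {badrpc, Reason} instead of a
    %% pid; pattern-match defensively and catch anything else that blows up
    %% (e.g. a racing ets:lookup on a table that no longer exists).
    try rpc:call(Node, queue_registry, get_queue_pid, [SubscriberId]) of
        Pid when is_pid(Pid) -> {ok, Pid};
        _Other -> not_found
    catch
        _Class:_Reason -> not_found
    end.
```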

  4. Only release messages when acknowledged:
    When the broker delivers messages to a connected subscriber, the messages from the queue are saved in a
    backup queue and "mailed" to the fsm process that performs the interaction with the client (retries, acks, ...).
    But the backup messages were released (i.e. deleted from persistent storage) before they were actually acknowledged
    by the subscriber.
    Again, when the broker crashes those messages are lost.

Solved by:

  • releasing each message individually when the fsm receives the ack (see the sketch after this list);
  • releasing QoS 0 messages immediately (unless potentially upgraded by "maybe_upgrade_qos");
  • adding the next message batch to the backup queue instead of replacing it.
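For illustration, a minimal sketch of releasing a message only when its ack arrives; the state shape and helper are made up, not the actual vmq_mqtt_fsm/vmq_queue code:

```erlang
%% Illustrative sketch only -- hypothetical names, not the actual fsm code.
-module(release_on_ack_sketch).
-export([handle_puback/2]).

%% A message is removed from the backup queue and deleted from persistent
%% storage only once the subscriber acknowledges it.
handle_puback(MsgId, #{backup := Backup} = State) ->
    case maps:take(MsgId, Backup) of
        {MsgRef, Backup1} ->
            ok = delete_from_store(MsgRef),   %% hypothetical release helper
            State#{backup => Backup1};
        error ->
            State                             %% unknown or duplicate ack
    end.

delete_from_store(_MsgRef) ->
    ok.                                       %% stub for the sketch
```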

Types of Changes

Checklist

  • I have read the CODE_OF_CONDUCT.md document
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if needed)
  • Any dependent changes have been merged and published in related repositories
  • I have updated changelog (At the bottom of the release version)
  • I have squashed all my commits into one before merging

Further Comments

These fixes do not yet cover all cases of message loss, but they already drastically reduce its occurrence:

Without the fixes, about 1 in 3 HA test runs showed message loss.
With the above fixes, this dropped to about 1 lost message in 1000 test runs.

marc added 4 commits April 7, 2021 10:19
The message block "in transit" is buffered until acknowledged by the
receiving node.
In case the receiving node crashed, the buffer is sent again when
the connection is re-established.
In case the subscriber reconnected to another node, the resent messages
are forwarded to that new node.
Messages from the queue that are published to the subscriber were saved
in a backup queue and released (=deleted from persistent storage) before
they were delivered/acknowledged by the client.
Now the messages are only released when the acknowledgement is received.
@ioolkos
Contributor

ioolkos commented Apr 7, 2021

@MarcVanOevelen thanks... what a great effort! :)
As it's a rather big change it'll take me a while to review. We also need to make sure that any "change of strategy" (from async to sync processing, for instance) has no downstream effects.

If you feel I need any more specific information for an effective review, let me know. (I don't think so, for now at least.)

@MarcVanOevelen
Contributor Author

@ioolkos Thanks!

It took me quite some time to get familiar with the codebase but I must say it is very well structured,
and this is such a great project that it was really worth the effort!

Indeed such changes are very tricky given the many features and use-cases.
Hopefully they don't cause any side-effects that are not covered by the tests.

Our use case is focused on a small number of publishers/subscribers but we really need
reliable message delivery (QoS1) in a HA cluster.

I still have to assess the inevitable performance cost of those changes.

The release smoketest fails now but I suppose this is caused by the rebar3_cuttlefish dependency
which points to master. I had to specify the previous version to get this working again.

I also see that a test is failing here, which did not happen when I ran the tests locally.

If you need more info please let me know.
Thanks.

apps/vmq_server/src/vmq_cluster_node.erl (outdated, resolved)
Comment on lines +285 to +287
%% Note that currently the backup is kept only in memory, a future improvement
%% would be to persist the backup on disk to also protect against message loss
%% in case the transmitting node crashes.

Maybe we should put this in a ticket, @ioolkos, so it doesn't get lost in the code comments.

apps/vmq_server/src/vmq_mqtt5_fsm.erl (outdated, resolved)
apps/vmq_server/src/vmq_mqtt5_fsm.erl (outdated, resolved)
apps/vmq_server/test/vmq_cluster_com_SUITE.erl (outdated, resolved)
apps/vmq_server/test/vmq_queue_SUITE.erl (outdated, resolved)
MarcVanOevelen and others added 2 commits April 9, 2021 09:49
Remove excess blank lines

Co-authored-by: Dairon Medina Caro <dairon.medina@gmail.com>
@ioolkos
Contributor

ioolkos commented Apr 12, 2021

Common Test suites and Dialyzer pass for me locally.

@ioolkos
Contributor

ioolkos commented Apr 12, 2021

@MarcVanOevelen no in-depth review yet, but I like your code/commenting style, very much aligned with existing code.

It'd be great to have an idea of the performance impact, as you already mentioned. We also need to protect any RAM-grabbing component (buffer/queue/backup), or at least ensure that it will hit an upstream config limit (I haven't yet looked into the details of that point, though).
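For illustration, the kind of cap meant here; a minimal sketch with made-up names, not an actual VerneMQ config option:

```erlang
%% Illustrative sketch only -- hypothetical names; shows the idea of capping
%% an in-memory buffer at a configured limit instead of growing unboundedly.
-module(buffer_limit_sketch).
-export([maybe_buffer/2]).

maybe_buffer(Msg, #{pending := Pending, max_pending := Max} = State)
  when length(Pending) < Max ->
    {ok, State#{pending => [Msg | Pending]}};
maybe_buffer(_Msg, State) ->
    %% buffer full: refuse and let the caller apply back-pressure or drop
    {error, buffer_full, State}.
```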

@ioolkos
Contributor

ioolkos commented Apr 16, 2021

@MarcVanOevelen I'll try to run some basic performance comparison over the weekend. Anything specific you'd want me to test?

@MarcVanOevelen
Contributor Author

MarcVanOevelen commented Apr 16, 2021

@MarcVanOevelen I'll try to run some basic performance comparison over the weekend. Anything specific you'd want me to test?

I have successfully completed the HA behavior tests in our production system.
Now I am also running performance tests relative to a vernemq:1.11.0-alpine deployment using the helm chart in k8s.
These tests try to assess maximum sustainable message rates and latency using 1 publisher and 1 subscriber with the following parameters:

  • QoS: [0, 1, 2]
  • message sizes: [100 bytes, 64K bytes]
  • setup: [single node, 2-node cluster]
    clients connect via a k8s service, hence in the 2-node cluster the publisher and subscriber connect to different nodes

I had almost completed testing, but unfortunately we just suffered a power outage.
Results will be ready in the course of next week.

Since performance tests are very dependent on the test environment (resources, ...), we can learn most from relative tests.
But it would be nice if you could test using the same parameters so we can compare our results.

@ioolkos
Contributor

ioolkos commented Apr 16, 2021

@MarcVanOevelen great, thanks. Hopefully no power outages any more :)

@ioolkos
Contributor

ioolkos commented Apr 18, 2021

Did a basic test to get a feeling for latencies. This is a quick test with 50 publishers on node 1 and 1 subscriber on node 2, sending 1500 100-byte messages in total per second (QoS 1).
The highest line is max latency, then the 95th percentile. The max line is at 90ms vs 30ms, so this adds 60ms.

[Latency graph: 1769 branch]
[Latency graph: main branch]

EDIT: I also noticed that CPU on both nodes stays very high after the test finishes with the 1769 branch. This needs to be investigated too.
EDIT2: The above happens as soon as you cluster nodes. vmq_cluster_node is needlessly looping somehow.

@ioolkos
Contributor

ioolkos commented Apr 18, 2021

@MarcVanOevelen the loop in vmq_cluster_node is racing through loop(internal_flush) because you've disabled the check for Pending in the first clause.
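For illustration, the kind of guard being referred to; the state shape and helpers below are made up, not the actual vmq_cluster_node loop:

```erlang
%% Illustrative sketch only -- hypothetical state and helpers, not the actual
%% vmq_cluster_node code.
-module(flush_loop_sketch).
-export([loop/1]).

%% First clause: nothing pending, so block in receive instead of spinning
%% through internal_flush (dropping this guard causes the high CPU).
loop(#{pending := []} = State) ->
    receive
        {enqueue, Msg} -> loop(buffer(Msg, State))
    end;
%% Second clause: something is pending, flush it and continue.
loop(State) ->
    receive
        {enqueue, Msg} -> loop(buffer(Msg, State))
    after 0 ->
        loop(internal_flush(State))
    end.

buffer(Msg, #{pending := Pending} = State) ->
    State#{pending => [Msg | Pending]}.

internal_flush(#{pending := Pending} = State) ->
    %% the real code would send Pending over the inter-node socket here
    _ = lists:reverse(Pending),
    State#{pending => []}.
```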

Marc Van Oevelen added 2 commits April 18, 2021 20:06
This avoids high CPU usage due to needless looping.
@MarcVanOevelen
Contributor Author

@MarcVanOevelen the loop in vmq_cluster_node is racing through loop(internal_flush) because you've disabled the check for Pending in the first clause.

Nice catch!
I included the original "Pending" check again.
Thanks!

@MarcVanOevelen
Contributor Author

@ioolkos the performance test comparison results are in the attached PDF:
VerneMQ_performance_test.pdf

@ioolkos
Contributor

ioolkos commented Apr 21, 2021

@MarcVanOevelen thanks... could you leave me your e-mail address at the contact email here: https://vernemq.com/community.html

@MarcVanOevelen
Contributor Author

@ioolkos and here are the HA test results:
VerneMQ_cluster_HA_test.pdf
