
Handle database timeouts from Khepri minority #10915

Draft · wants to merge 23 commits into main from md/khepri/database-operations-in-minority

Conversation

the-mikedavis (Member) commented:

Operations like declaring or deleting queues fail when sent to a node that is part of a cluster minority. We need to let the database failures (`{error, timeout}`) bubble up to the callers - usually the channel - so that these operations don't cause needless crash reports.

Closes #10753
This depends on a change upstream in Khepri: rabbitmq/khepri#256
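
The bubbling-up idea can be sketched roughly as follows (a minimal sketch; `handle_declare/1` and the `rabbit_db_queue:create_or_get/1` call shape are illustrative, not the actual API):

```erlang
%% Sketch: let {error, timeout} from the database layer surface to the
%% channel instead of crashing it. Module/function names here are
%% illustrative assumptions, not the real rabbit_db_queue API.
handle_declare(Q) ->
    case rabbit_db_queue:create_or_get(Q) of
        {created, Queue} ->
            {ok, Queue};
        {error, timeout} ->
            %% Surface a protocol-level error to the client rather
            %% than producing a needless crash report.
            {error, {resource_error, timeout}};
        {error, _} = Err ->
            Err
    end.
```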

@the-mikedavis the-mikedavis self-assigned this Apr 3, 2024
@mergify mergify bot added the bazel label Apr 3, 2024
@the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch 3 times, most recently from 3207119 to 60e06ee on May 6, 2024
@the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch from 1e47bb5 to cfad5d7 on May 8, 2024
This is a mix of a few changes:

* Suppress the compiler warning from the `export_all` attribute.
* Lower Khepri's command handling timeout value. By default this is
  set to 30s in rabbit which makes each of the cases in
  `client_operations` take an excessively long time. Before this change
  the suite took around 10 minutes to complete. Now it takes between two
  and three minutes.
* Swap the order of the client and broker teardown steps in the
  `end_per_group` hook. The client teardown steps always fail if run
  after the broker teardown steps because they rely on a value in
  `Config` that the broker teardown deletes.
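
The timeout-lowering step above might look roughly like this in a common_test hook (a sketch: the `default_timeout` parameter name, the 5-second value, and the use of `rabbit_ct_broker_helpers:rpc_all/4` are assumptions):

```erlang
%% Sketch: lower Khepri's command handling timeout on every broker node
%% so the minority cases in `client_operations` fail fast instead of
%% waiting out the 30s default.
init_per_group(client_operations, Config) ->
    rabbit_ct_broker_helpers:rpc_all(
      Config, application, set_env, [khepri, default_timeout, 5000]),
    Config.
```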
The prior code skirted transactions because the filter function might
cause Khepri to call itself. We want to use the same idea as the old
code - get all queues, filter them, then delete them - but we want to
perform the deletion in a transaction and fail the transaction if any
queues changed since we read them.

This fixes a bug - the call to `delete_in_khepri/2` could return an
error tuple that was improperly recognized as `Deletions` - and should
also make deleting transient queues atomic and fast.
Each call to `delete_in_khepri/2` needed to wait on Ra to replicate
because the deletion is an individual command sent from one process.
Performing all deletions at once means we only need to wait for one
command to be replicated across the cluster.
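
The read-filter-delete-in-one-transaction idea can be sketched like this (`list_queues/0`, `queue_path/1`, and the exact return shapes of the `khepri_tx` calls are assumptions for illustration):

```erlang
%% Sketch: get all queues, filter them outside the transaction, then
%% delete them inside a single transaction that aborts if any queue
%% changed since the read. The whole thing replicates as one Ra command.
delete_transient_queues(FilterFun) ->
    Queues = [Q || Q <- list_queues(), FilterFun(Q)],
    khepri:transaction(
      fun() ->
              lists:foreach(
                fun(Q) ->
                        Path = queue_path(Q),
                        case khepri_tx:get(Path) of
                            {ok, Q} ->
                                %% Unchanged since we read it: delete.
                                ok = khepri_tx:delete(Path);
                            _Changed ->
                                %% The queue changed (or vanished) since
                                %% the read; abort so the whole deletion
                                %% rolls back atomically.
                                khepri_tx:abort(queue_changed)
                        end
                end, Queues)
      end).
```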

We also bubble up any errors from the delete operation now rather than
storing them as deletions. This fixes a crash that occurred on node
down when Khepri is in a minority.
The clause of the spec that allowed passing a list of queue name
resources is out of date: the guard prevents a list from ever matching.
Previously a failing transaction would go unnoticed. Now we return an
error tuple.
`khepri_tx:abort/1` is only meant for use within a transaction - I
assume this was a relic of implementing this function with a transaction
previously.
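
For contrast, a sketch of where `khepri_tx:abort/1` does belong - inside the transaction fun, where it rolls the whole transaction back (the `{error, Reason}` return shape shown here is an assumption):

```erlang
%% Sketch: abort from within the transaction fun. Calling
%% khepri_tx:abort/1 outside a transaction, as the old code effectively
%% did, makes no sense.
abort_demo() ->
    case khepri:transaction(
           fun() -> khepri_tx:abort(validation_failed) end) of
        {error, validation_failed} ->
            aborted_as_expected;
        Other ->
            Other
    end.
```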

The only caller already wraps this function in a `try`/`catch` block
that logs the error and re-raises.
All callers assume that this operation will succeed.
This function is only used by the test suites. A backtrace should make
the thrown error clearer though.
Note that we don't refactor the `throw/1` to an `erlang:error/1` since
it's caught by `rabbit_vhost:add/3`.
This function is only used by a test suite which matches on the 'ok'
return.
@the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch from cfad5d7 to f38326b on May 13, 2024
@the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch from f38326b to 6add459 on May 13, 2024
Successfully merging this pull request may close this issue: "Khepri: timeouts when one of the nodes stops responding".