
Handle database timeouts from Khepri minority #10915

Draft · wants to merge 23 commits into main from md/khepri/database-operations-in-minority

Conversation

the-mikedavis (Member) commented:

Operations like declaring or deleting queues fail when sent to a node that is part of a cluster minority. We need to let the database failures (`{error, timeout}`) bubble up to the callers - usually the channel - so that these operations don't cause needless crash reports.

Closes #10753
This depends on a change upstream in Khepri: rabbitmq/khepri#256
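
The bubbling-up idea can be sketched roughly as follows (a minimal sketch; `handle_declare/1` and the `rabbit_db_queue:create_or_get/1` call shape are illustrative, not the actual API):

```erlang
%% Sketch: let {error, timeout} from the database layer surface to the
%% channel instead of crashing it. Module/function names here are
%% illustrative assumptions, not the real rabbit_db_queue API.
handle_declare(Q) ->
    case rabbit_db_queue:create_or_get(Q) of
        {created, Queue} ->
            {ok, Queue};
        {error, timeout} ->
            %% Surface a protocol-level error to the client rather
            %% than producing a needless crash report.
            {error, {resource_error, timeout}};
        {error, _} = Err ->
            Err
    end.
```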

@the-mikedavis the-mikedavis self-assigned this Apr 3, 2024
@mergify mergify bot added the bazel label Apr 3, 2024
@the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch 3 times, most recently from 3207119 to 60e06ee on May 6, 2024
@the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch from 1e47bb5 to cfad5d7 on May 8, 2024
This is a mix of a few changes:

* Suppress the compiler warning from the `export_all` attribute.
* Lower Khepri's command handling timeout value. By default this is
  set to 30s in rabbit which makes each of the cases in
  `client_operations` take an excessively long time. Before this change
  the suite took around 10 minutes to complete. Now it takes between two
  and three minutes.
* Swap the order of the client and broker teardown steps in the
  `end_per_group` hook. The client teardown steps always fail if run
  after the broker teardown steps because they rely on a value in
  `Config` that the broker teardown deletes.
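
The timeout-lowering step above might look roughly like this in a common_test hook (a sketch: the `default_timeout` parameter name, the 5-second value, and the use of `rabbit_ct_broker_helpers:rpc_all/4` are assumptions):

```erlang
%% Sketch: lower Khepri's command handling timeout on every broker node
%% so the minority cases in `client_operations` fail fast instead of
%% waiting out the 30s default.
init_per_group(client_operations, Config) ->
    rabbit_ct_broker_helpers:rpc_all(
      Config, application, set_env, [khepri, default_timeout, 5000]),
    Config.
```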
The prior code skirted transactions because the filter function might
cause Khepri to call itself. We want to use the same idea as the old
code - get all queues, filter them, then delete them - but we want to
perform the deletion in a transaction and fail the transaction if any
queues changed since we read them.

This fixes a bug - the call to `delete_in_khepri/2` could return an
error tuple that was improperly recognized as `Deletions` - and should
also make deleting transient queues atomic and fast.
Each call to `delete_in_khepri/2` needed to wait on Ra to replicate
because the deletion is an individual command sent from one process.
Performing all deletions at once means we only need to wait for one
command to be replicated across the cluster.
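
The read-filter-delete-in-one-transaction idea can be sketched like this (`list_queues/0`, `queue_path/1`, and the exact return shapes of the `khepri_tx` calls are assumptions for illustration):

```erlang
%% Sketch: get all queues, filter them outside the transaction, then
%% delete them inside a single transaction that aborts if any queue
%% changed since the read. The whole thing replicates as one Ra command.
delete_transient_queues(FilterFun) ->
    Queues = [Q || Q <- list_queues(), FilterFun(Q)],
    khepri:transaction(
      fun() ->
              lists:foreach(
                fun(Q) ->
                        Path = queue_path(Q),
                        case khepri_tx:get(Path) of
                            {ok, Q} ->
                                %% Unchanged since we read it: delete.
                                ok = khepri_tx:delete(Path);
                            _Changed ->
                                %% The queue changed (or vanished) since
                                %% the read; abort so the whole deletion
                                %% rolls back atomically.
                                khepri_tx:abort(queue_changed)
                        end
                end, Queues)
      end).
```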

We also bubble up any errors from the delete operation now rather than
storing them as deletions. This fixes a crash that occurred on node
down when Khepri is in a minority.
The clause of the spec that allowed passing a list of queue name
resources is out of date: the guard prevents a list from ever matching.
Previously a failing transaction would go unnoticed. Now we return an
error tuple.
`khepri_tx:abort/1` is only meant for use within a transaction - I
assume this was a relic of implementing this function with a transaction
previously.
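
For contrast, a sketch of where `khepri_tx:abort/1` does belong - inside the transaction fun, where it rolls the whole transaction back (the `{error, Reason}` return shape shown here is an assumption):

```erlang
%% Sketch: abort from within the transaction fun. Calling
%% khepri_tx:abort/1 outside a transaction, as the old code effectively
%% did, makes no sense.
abort_demo() ->
    case khepri:transaction(
           fun() -> khepri_tx:abort(validation_failed) end) of
        {error, validation_failed} ->
            aborted_as_expected;
        Other ->
            Other
    end.
```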

The only caller already wraps this function in a `try`/`catch` block
that logs the error and re-raises.
All callers assume that this operation will succeed.
This function is only used by the test suites. A backtrace should make
the thrown error clearer though.
Note that we don't refactor the `throw/1` to an `erlang:error/1` since
it's caught by `rabbit_vhost:add/3`.
This function is only used by a test suite which matches on the 'ok'
return.
@the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch from cfad5d7 to f38326b on May 13, 2024
@the-mikedavis force-pushed the md/khepri/database-operations-in-minority branch from f38326b to 6add459 on May 13, 2024
Successfully merging this pull request may close this issue: "Khepri: timeouts when one of the nodes stops responding".