Khepri: timeouts when one of the nodes stops responding #10753

mkuratczyk · 2024-03-15T10:08:28Z

Describe the bug

During chaos tests where one of the VMs/nodes is suddenly restarted, timeouts like this occur:

   crasher:
     initial call: rabbit_prequeue:init/1
     pid: <0.1007.0>
     registered_name: []
     exception exit: {{badrecord,
                          {error,
                              {timeout,
                                  {rabbitmq_metadata,
                                      'rabbit@foobar-s5000-server-1.foobar-s5000-nodes.chaos-tests'}}}},
                      [{dict,map_dict,2,[{file,"dict.erl"},{line,467}]},
                       {rabbit_amqqueue,internal_delete,3,
                           [{file,"rabbit_amqqueue.erl"},{line,1805}]},
                       {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',
                           7,
                           [{file,"rabbit_amqqueue_process.erl"},{line,332}]},
                       {rabbit_amqqueue_process,terminate_shutdown,2,
                           [{file,"rabbit_amqqueue_process.erl"},{line,362}]},
                       {gen_server2,terminate,3,
                           [{file,"gen_server2.erl"},{line,1158}]},
                       {gen_server2,handle_msg,2,
                           [{file,"gen_server2.erl"},{line,1048}]},
                       {proc_lib,wake_up,3,
                           [{file,"proc_lib.erl"},{line,251}]}]}

   crasher:
     initial call: rabbit_channel:init/1
     pid: <0.90831.0>
     registered_name: []
     exception exit: {{case_clause,
                          {error,
                              {timeout,
                                  {rabbitmq_metadata,
                                      'rabbit@foobar-s5000-server-2.foobar-s5000-nodes.chaos-tests'}}}},
                      [{rabbit_channel,binding_action,10,
                           [{file,"rabbit_channel.erl"},{line,1825}]},
                       {rabbit_channel,handle_method,3,
                           [{file,"rabbit_channel.erl"},{line,1614}]},
                       {rabbit_channel,handle_cast,2,
                           [{file,"rabbit_channel.erl"},{line,631}]},
                       {gen_server2,handle_msg,2,
                           [{file,"gen_server2.erl"},{line,1056}]},
                       {proc_lib,init_p_do_apply,3,
                           [{file,"proc_lib.erl"},{line,241}]}]}
       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)

Of course, timeouts are not unexpected when machines disappear, but we need to think through these scenarios and decide what to do. Either way, we should probably not log such stack traces.

Reproduction steps

This happened during a chaos test running a workload that included queue deletions and random node restarts.

Expected behavior

?

Additional context

No response

mkuratczyk added the bug label Mar 15, 2024

the-mikedavis self-assigned this Mar 26, 2024

the-mikedavis linked a pull request Apr 3, 2024 that will close this issue

Handle database timeouts from Khepri minority #10915

Draft
