Khepri: timeouts when one of the nodes stops responding #10753

Open
mkuratczyk opened this issue Mar 15, 2024 · 0 comments · May be fixed by #10915

Describe the bug

During chaos tests where one of the VMs/nodes is suddenly restarted, timeouts like this occur:

   crasher:
     initial call: rabbit_prequeue:init/1
     pid: <0.1007.0>
     registered_name: []
     exception exit: {{badrecord,
                          {error,
                              {timeout,
                                  {rabbitmq_metadata,
                                      'rabbit@foobar-s5000-server-1.foobar-s5000-nodes.chaos-tests'}}}},
                      [{dict,map_dict,2,[{file,"dict.erl"},{line,467}]},
                       {rabbit_amqqueue,internal_delete,3,
                           [{file,"rabbit_amqqueue.erl"},{line,1805}]},
                       {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',
                           7,
                           [{file,"rabbit_amqqueue_process.erl"},{line,332}]},
                       {rabbit_amqqueue_process,terminate_shutdown,2,
                           [{file,"rabbit_amqqueue_process.erl"},{line,362}]},
                       {gen_server2,terminate,3,
                           [{file,"gen_server2.erl"},{line,1158}]},
                       {gen_server2,handle_msg,2,
                           [{file,"gen_server2.erl"},{line,1048}]},
                       {proc_lib,wake_up,3,
                           [{file,"proc_lib.erl"},{line,251}]}]}
   crasher:
     initial call: rabbit_channel:init/1
     pid: <0.90831.0>
     registered_name: []
     exception exit: {{case_clause,
                          {error,
                              {timeout,
                                  {rabbitmq_metadata,
                                      'rabbit@foobar-s5000-server-2.foobar-s5000-nodes.chaos-tests'}}}},
                      [{rabbit_channel,binding_action,10,
                           [{file,"rabbit_channel.erl"},{line,1825}]},
                       {rabbit_channel,handle_method,3,
                           [{file,"rabbit_channel.erl"},{line,1614}]},
                       {rabbit_channel,handle_cast,2,
                           [{file,"rabbit_channel.erl"},{line,631}]},
                       {gen_server2,handle_msg,2,
                           [{file,"gen_server2.erl"},{line,1056}]},
                       {proc_lib,init_p_do_apply,3,
                           [{file,"proc_lib.erl"},{line,241}]}]}
       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)

Timeouts are of course expected when machines disappear, but we need to think through these scenarios and decide how to handle them. Either way, we should probably not log such stack traces.
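
For illustration only: the `badrecord` and `case_clause` exits above suggest that the callers received `{error, {timeout, _}}` where they expected a different value. Below is a minimal sketch of matching that tuple explicitly and logging a single warning instead of crashing. The module name `timeout_sketch`, the function `run_store_op/1`, and the `StoreFun` argument are hypothetical and are not part of RabbitMQ or Khepri; whether the right reaction is to return an error, retry, or something else is exactly the open question in this issue.

```erlang
-module(timeout_sketch).
-export([run_store_op/1]).

%% Hypothetical wrapper, not RabbitMQ code. StoreFun stands in for any
%% metadata store call that may return {error, {timeout, StoreId}}.
-spec run_store_op(fun(() -> term())) -> term().
run_store_op(StoreFun) ->
    case StoreFun() of
        {error, {timeout, StoreId}} ->
            %% Log one concise line rather than letting the caller
            %% crash with a full stack trace.
            logger:warning("Metadata store operation timed out: ~tp",
                           [StoreId]),
            {error, timeout};
        Other ->
            %% Pass every other result (success or error) through unchanged.
            Other
    end.
```

A caller would then need a clause for `{error, timeout}`, for example to reply with an error to the client instead of terminating the process.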

Reproduction steps

It was a chaos test with a workload, including queue deletions and random restarts.

Expected behavior

?

Additional context

No response

mkuratczyk added the bug label Mar 15, 2024
the-mikedavis self-assigned this Mar 26, 2024
the-mikedavis linked a pull request Apr 3, 2024 that will close this issue