Prevent excessive logging in certain failure scenarios #10723

Open
mkuratczyk opened this issue Mar 12, 2024 · 3 comments

@mkuratczyk
Contributor

Describe the bug

This is an umbrella issue covering a number of smaller, related issues.

Some RabbitMQ components can log excessively when they crash or when some other relatively unusual event happens (which can in fact be quite common - a cluster node failure, for example).

The goal is to avoid output like this in the logs:
[Screenshot 2024-03-12 at 12 53 41]

or this:

 ** Last message in was {'$gen_cast',terminate}
 ** When Server state == {ch,
                          {conf,running,rabbit_framing_amqp_0_9_1,1,
                           <0.7674.0>,<0.7681.0>,<0.7674.0>,
                           <<"10.52.4.3:36244 -> 10.52.3.58:5672">>,
                           {user,<<"default_user_KVbiqyQs7TiNc32SbiD">>,
                            [administrator],
                            [{rabbit_auth_backend_internal,
                              #Fun<rabbit_auth_backend_internal.3.111050101>}]},
                           <<"/">>,<<>>,<0.7675.0>,
                           [{<<"exchange_exchange_bindings">>,bool,true},
                            {<<"connection.blocked">>,bool,true},
                            {<<"authentication_failure_close">>,bool,true},
                            {<<"basic.nack">>,bool,true},
                            {<<"publisher_confirms">>,bool,true},
                            {<<"consumer_cancel_notify">>,bool,true}],
                           none,0,1800000,#{},1000000000},
                          {lstate,<0.7682.0>,false},
                          none,82301,
                          {263,
                           [{pending_ack,82300,
                             <<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
                             1710247548856,
                             {resource,<<"/">>,queue,<<"one-3">>},
                             82299},
                            {pending_ack,82299,
                             <<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
                             1710247548852,
                             {resource,<<"/">>,queue,<<"one-3">>},
                             82298},
                            {pending_ack,82298,
                             <<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
                             1710247548852,
                             {resource,<<"/">>,queue,<<"one-3">>},
                             82297},
... (many more pending_ack)

Reproduction steps

I'm running chaos tests to see what kinds of situations like this I can trigger. So far there are two:

  1. osiris_replica can fail, in which case its entire mailbox gets logged. This is potentially a huge chunk of binary data coming from a TCP socket. For example, after running out of disk space I get this:
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>     reason: {{badmatch,{error,enospc}},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>              [{osiris_log,write_chunk,6,[{file,"osiris_log.erl"},{line,2348}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {osiris_replica,'-handle_incoming_data/3-fun-0-',2,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                               [{file,"osiris_replica.erl"},{line,537}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {lists,foldl,3,[{file,"lists.erl"},{line,1594}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {osiris_replica,handle_incoming_data,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                               [{file,"osiris_replica.erl"},{line,536}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {gen_server,try_handle_info,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                           [{file,"gen_server.erl"},{line,1095}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {proc_lib,init_p_do_apply,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                         [{file,"proc_lib.erl"},{line,241}]}]}
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>     offender: [{pid,<0.15412.0>},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {id,"__stream-1_1710241152660281401"},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {mfargs,{osiris_replica,start_link,undefined}},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {restart_type,temporary},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {significant,false},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {shutdown,5000},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {child_type,worker}]
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> 
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> ** Generic server <0.15439.0> terminating
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> ** Last message in was {tcp,#Port<0.5937>,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0>                             <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0>                               136,0,0,1,142,50,127,114,211,0,0,0,0,0,0,0,0,0,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0>                               117,176,0,0,19,136,0,0,1,142,50,127,114,211,0,0,

followed by lots and lots more data. Moreover, this process is continuously restarted, so this data is printed over and over as the stream leader keeps trying to send it to this replica.

  2. A channel can have a large state that gets printed in some cases (see the pending_ack entries above); a minimal sketch of this pattern follows the list.
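
For context, here is a tiny standalone sketch of the pattern described above (a hypothetical module, not RabbitMQ or osiris code): a gen_server whose state and mailbox both contain large binaries. When it crashes, the default crash report formats both with ~p and dumps everything.

    -module(big_term_crash).
    -behaviour(gen_server).
    -export([start/0, init/1, handle_call/3, handle_cast/2, handle_info/2]).

    start() ->
        {ok, Pid} = gen_server:start(?MODULE, [], []),
        gen_server:cast(Pid, crash),
        %% queued behind the crash-inducing cast, so it is still in the mailbox
        %% when the crash report is produced and gets printed verbatim
        Pid ! {tcp, fake_port, binary:copy(<<0>>, 1024 * 1024)},
        ok.

    init([]) ->
        %% 1 MB of binary state, dumped in full by "** When Server state =="
        {ok, binary:copy(<<0>>, 1024 * 1024)}.

    handle_call(_Req, _From, State) ->
        {reply, ok, State}.

    handle_cast(crash, _State) ->
        timer:sleep(100),   %% give the tcp message time to arrive
        error(boom);
    handle_cast(_Msg, State) ->
        {noreply, State}.

    handle_info(_Info, State) ->
        {noreply, State}.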

Expected behavior

General recommendations for logging:

  1. ~p should not be used for log formatting
  2. processes should implement format_status/2
  3. the maximum size of log messages should be configured (see the sketch below)
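
To make these recommendations concrete, here is a minimal sketch of points 1 and 2 (hypothetical module and field names, not actual RabbitMQ code):

    -module(my_channel).
    -behaviour(gen_server).
    -export([init/1, handle_call/3, handle_cast/2, format_status/2]).

    -include_lib("kernel/include/logger.hrl").

    init([]) ->
        {ok, #{pending_acks => [], conf => #{}}}.

    handle_call(_Req, _From, State) ->
        %% Point 1: ~P (or ~W) takes a depth limit, so a huge term is elided
        %% with '...' instead of being printed in full as ~p would do.
        ?LOG_DEBUG("channel state: ~P", [State, 10]),
        {reply, ok, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    %% Point 2: format_status/2 is consulted by OTP when building crash
    %% reports (and sys:get_status/1), so large fields can be summarised
    %% instead of dumped verbatim. Newer OTP releases also offer format_status/1.
    format_status(terminate, [_PDict, #{pending_acks := Acks} = State]) ->
        State#{pending_acks := {num_pending_acks, length(Acks)}};
    format_status(normal, [_PDict, #{pending_acks := Acks} = State]) ->
        [{data, [{"State",
                  State#{pending_acks := {num_pending_acks, length(Acks)}}}]}].

Point 3 can be approximated with the stock OTP logger_formatter options (RabbitMQ's own logging configuration may expose different knobs), which cap how much any single log event is allowed to produce:

    ok = logger:update_handler_config(default, formatter,
             {logger_formatter, #{depth => 20,          %% max term nesting
                                  chars_limit => 8192,  %% soft formatting limit
                                  max_size => 16384}}). %% hard cap per message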

Additional context

No response

@michaelklishin changed the title from "Prevent excessive logging" to "Prevent excessive logging in certain failure scenarios" on Mar 12, 2024
@mkuratczyk
Contributor Author

Another one, from the channel process - this time it's the mailbox, not the state:

crasher:
  initial call: rabbit_channel:init/1
  pid: <0.11850.0>
  registered_name: []
  exception exit: noproc
    in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)
  ancestors: [<0.11847.0>,<0.11845.0>,<0.11839.0>,<0.11838.0>,<0.966.0>,
                <0.965.0>,<0.964.0>,<0.962.0>,<0.961.0>,rabbit_sup,
                <0.262.0>]
  message_queue_len: 151
  messages: [{'EXIT',<0.11847.0>,shutdown},
                {'$gen_cast',
                 {queue_event,
                  {resource,<<"/">>,queue,<<"q-3">>},
                  {deliver,<<"amq.ctag-hrgXqfnbKzvHGRhnaoLYog">>,true,
                   [{{resource,<<"/">>,queue,<<"q-3">>},
                     <15143.922.0>,2461847,false,
                     {mc,mc_amqpl,
                      {content,60,none,
                       <<16,0,2>>,
                       rabbit_framing_amqp_0_9_1,
                       [<<0,2,77,48,0,0,1,142,50,202,34,181,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

@mkuratczyk
Contributor Author

osiris:

 ** When Server state == #{name => <<"__stream-1_1710247245159438000">>,
                           log =>
                               #{file => "00000000000118320636.segment",
                                 mode =>
                                     #{type => acceptor,mode => write,
                                       tail =>
                                           {118420442,
                                            {11,118420257,1710257805308}},
                                       epoch => 11},
                                 directory =>
                                     "/var/lib/rabbitmq/mnesia/rabbit@main-s5000-server-0.main-s5000-nodes.chaos-tests/stream/__stream-1_1710247245159438000",
                                 max_segment_size_chunks => 256000,
                                 retention =>
                                     [{max_bytes,20000000000},{max_age,600000}],
                                 filter_size => 16,
                                 max_segment_size_bytes => 500000000,
                                 tracking_config => #{}},
                           committed_offset => 122425221,
                           external_reference =>
                               {resource,<<"/">>,queue,<<"stream-1">>},
                           has_parse_state => true,num_offset_listeners => 0}
 ** Reason for termination ==
 ** {{accept_chunk_out_of_order,118520700,118420442},
     [{osiris_log,accept_chunk,2,[{file,"osiris_log.erl"},{line,793}]},
      {osiris_replica,'-handle_incoming_data/3-fun-0-',2,
                      [{file,"osiris_replica.erl"},{line,537}]},
      {lists,foldl,3,[{file,"lists.erl"},{line,1594}]},
      {osiris_replica,handle_incoming_data,3,
                      [{file,"osiris_replica.erl"},{line,536}]},
      {gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},
      {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},
      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}
 
   crasher:
     initial call: osiris_replica:init/1
     pid: <0.20591.0>
     registered_name: []
     exception exit: {accept_chunk_out_of_order,118520700,118420442}
       in function  osiris_log:accept_chunk/2 (osiris_log.erl, line 793)
       in call from osiris_replica:'-handle_incoming_data/3-fun-0-'/2 (osiris_replica.erl, line 537)
       in call from lists:foldl/3 (lists.erl, line 1594)
       in call from osiris_replica:handle_incoming_data/3 (osiris_replica.erl, line 536)
       in call from gen_server:try_handle_info/3 (gen_server.erl, line 1095)
       in call from gen_server:handle_msg/6 (gen_server.erl, line 1183)
     ancestors: [osiris_server_sup,osiris_sup,<0.239.0>]
     message_queue_len: 3
     messages: [{tcp,#Port<0.2098>,
                       <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

@mkuratczyk
Contributor Author

This one is somewhat different, but still worth fixing - it contains a lot of whitespace for some reason (screenshot attached to preserve the strange formatting):
[Screenshot 2024-03-13 at 18 39 29]
