Prevent excessive logging in certain failure scenarios #10723

Open
mkuratczyk opened this issue Mar 12, 2024 · 3 comments

@mkuratczyk
Contributor

Describe the bug

This is an umbrella issue covering a number of smaller, related issues.

Some RabbitMQ components can log excessively when they crash or when some other relatively unusual event happens (which can in fact be quite common - a cluster node failure, for example).

The goal is to avoid output like this in the logs:
[Screenshot 2024-03-12 at 12 53 41]

or this:

 ** Last message in was {'$gen_cast',terminate}
 ** When Server state == {ch,
                          {conf,running,rabbit_framing_amqp_0_9_1,1,
                           <0.7674.0>,<0.7681.0>,<0.7674.0>,
                           <<"10.52.4.3:36244 -> 10.52.3.58:5672">>,
                           {user,<<"default_user_KVbiqyQs7TiNc32SbiD">>,
                            [administrator],
                            [{rabbit_auth_backend_internal,
                              #Fun<rabbit_auth_backend_internal.3.111050101>}]},
                           <<"/">>,<<>>,<0.7675.0>,
                           [{<<"exchange_exchange_bindings">>,bool,true},
                            {<<"connection.blocked">>,bool,true},
                            {<<"authentication_failure_close">>,bool,true},
                            {<<"basic.nack">>,bool,true},
                            {<<"publisher_confirms">>,bool,true},
                            {<<"consumer_cancel_notify">>,bool,true}],
                           none,0,1800000,#{},1000000000},
                          {lstate,<0.7682.0>,false},
                          none,82301,
                          {263,
                           [{pending_ack,82300,
                             <<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
                             1710247548856,
                             {resource,<<"/">>,queue,<<"one-3">>},
                             82299},
                            {pending_ack,82299,
                             <<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
                             1710247548852,
                             {resource,<<"/">>,queue,<<"one-3">>},
                             82298},
                            {pending_ack,82298,
                             <<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
                             1710247548852,
                             {resource,<<"/">>,queue,<<"one-3">>},
                             82297},
... (many more pending_ack)

Reproduction steps

I'm running chaos tests to see what kinds of situations like this I can trigger. So far there are two:

  1. osiris_replica can fail, in which case its entire mailbox gets logged. This is potentially a huge chunk of binary data coming from a TCP socket. For example, after running out of disk space I get this:
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>     reason: {{badmatch,{error,enospc}},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>              [{osiris_log,write_chunk,6,[{file,"osiris_log.erl"},{line,2348}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {osiris_replica,'-handle_incoming_data/3-fun-0-',2,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                               [{file,"osiris_replica.erl"},{line,537}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {lists,foldl,3,[{file,"lists.erl"},{line,1594}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {osiris_replica,handle_incoming_data,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                               [{file,"osiris_replica.erl"},{line,536}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {gen_server,try_handle_info,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                           [{file,"gen_server.erl"},{line,1095}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {proc_lib,init_p_do_apply,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                         [{file,"proc_lib.erl"},{line,241}]}]}
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>     offender: [{pid,<0.15412.0>},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {id,"__stream-1_1710241152660281401"},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {mfargs,{osiris_replica,start_link,undefined}},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {restart_type,temporary},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {significant,false},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {shutdown,5000},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {child_type,worker}]
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> 
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> ** Generic server <0.15439.0> terminating
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> ** Last message in was {tcp,#Port<0.5937>,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0>                             <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0>                               136,0,0,1,142,50,127,114,211,0,0,0,0,0,0,0,0,0,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0>                               117,176,0,0,19,136,0,0,1,142,50,127,114,211,0,0,

followed by lots and lots more data. Moreover, this process is continuously restarted, so this data is printed over and over as the stream leader keeps trying to send it to this replica.

  2. A channel can have a large state that gets printed in some cases (see the pending_ack entries above); a minimal sketch of this pattern follows the list.
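
For context, here is a tiny standalone sketch of the pattern described above (a hypothetical module, not RabbitMQ or osiris code): a gen_server whose state and mailbox both contain large binaries. When it crashes, the default crash report formats both with ~p and dumps everything.

    -module(big_term_crash).
    -behaviour(gen_server).
    -export([start/0, init/1, handle_call/3, handle_cast/2, handle_info/2]).

    start() ->
        {ok, Pid} = gen_server:start(?MODULE, [], []),
        gen_server:cast(Pid, crash),
        %% queued behind the crash-inducing cast, so it is still in the mailbox
        %% when the crash report is produced and gets printed verbatim
        Pid ! {tcp, fake_port, binary:copy(<<0>>, 1024 * 1024)},
        ok.

    init([]) ->
        %% 1 MB of binary state, dumped in full by "** When Server state =="
        {ok, binary:copy(<<0>>, 1024 * 1024)}.

    handle_call(_Req, _From, State) ->
        {reply, ok, State}.

    handle_cast(crash, _State) ->
        timer:sleep(100),   %% give the tcp message time to arrive
        error(boom);
    handle_cast(_Msg, State) ->
        {noreply, State}.

    handle_info(_Info, State) ->
        {noreply, State}.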

Expected behavior

General recommendations for logging:

  1. ~p should not be used for log formatting
  2. processes should implement format_status/2
  3. the maximum size of log messages should be configured (see the sketch below)
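
To make these recommendations concrete, here is a minimal sketch of points 1 and 2 (hypothetical module and field names, not actual RabbitMQ code):

    -module(my_channel).
    -behaviour(gen_server).
    -export([init/1, handle_call/3, handle_cast/2, format_status/2]).

    -include_lib("kernel/include/logger.hrl").

    init([]) ->
        {ok, #{pending_acks => [], conf => #{}}}.

    handle_call(_Req, _From, State) ->
        %% Point 1: ~P (or ~W) takes a depth limit, so a huge term is elided
        %% with '...' instead of being printed in full as ~p would do.
        ?LOG_DEBUG("channel state: ~P", [State, 10]),
        {reply, ok, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    %% Point 2: format_status/2 is consulted by OTP when building crash
    %% reports (and sys:get_status/1), so large fields can be summarised
    %% instead of dumped verbatim. Newer OTP releases also offer format_status/1.
    format_status(terminate, [_PDict, #{pending_acks := Acks} = State]) ->
        State#{pending_acks := {num_pending_acks, length(Acks)}};
    format_status(normal, [_PDict, #{pending_acks := Acks} = State]) ->
        [{data, [{"State",
                  State#{pending_acks := {num_pending_acks, length(Acks)}}}]}].

Point 3 can be approximated with the stock OTP logger_formatter options (RabbitMQ's own logging configuration may expose different knobs), which cap how much any single log event is allowed to produce:

    ok = logger:update_handler_config(default, formatter,
             {logger_formatter, #{depth => 20,          %% max term nesting
                                  chars_limit => 8192,  %% soft formatting limit
                                  max_size => 16384}}). %% hard cap per message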

Additional context

No response

@michaelklishin changed the title from "Prevent excessive logging" to "Prevent excessive logging in certain failure scenarios" on Mar 12, 2024
@mkuratczyk
Contributor Author

Another one, from the channel process - this time it's the mailbox, not the state:

crasher:
  initial call: rabbit_channel:init/1
  pid: <0.11850.0>
  registered_name: []
  exception exit: noproc
    in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)
  ancestors: [<0.11847.0>,<0.11845.0>,<0.11839.0>,<0.11838.0>,<0.966.0>,
                <0.965.0>,<0.964.0>,<0.962.0>,<0.961.0>,rabbit_sup,
                <0.262.0>]
  message_queue_len: 151
  messages: [{'EXIT',<0.11847.0>,shutdown},
                {'$gen_cast',
                 {queue_event,
                  {resource,<<"/">>,queue,<<"q-3">>},
                  {deliver,<<"amq.ctag-hrgXqfnbKzvHGRhnaoLYog">>,true,
                   [{{resource,<<"/">>,queue,<<"q-3">>},
                     <15143.922.0>,2461847,false,
                     {mc,mc_amqpl,
                      {content,60,none,
                       <<16,0,2>>,
                       rabbit_framing_amqp_0_9_1,
                       [<<0,2,77,48,0,0,1,142,50,202,34,181,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

@mkuratczyk
Contributor Author

osiris:

 ** When Server state == #{name => <<"__stream-1_1710247245159438000">>,
                           log =>
                               #{file => "00000000000118320636.segment",
                                 mode =>
                                     #{type => acceptor,mode => write,
                                       tail =>
                                           {118420442,
                                            {11,118420257,1710257805308}},
                                       epoch => 11},
                                 directory =>
                                     "/var/lib/rabbitmq/mnesia/rabbit@main-s5000-server-0.main-s5000-nodes.chaos-tests/stream/__stream-1_1710247245159438000",
                                 max_segment_size_chunks => 256000,
                                 retention =>
                                     [{max_bytes,20000000000},{max_age,600000}],
                                 filter_size => 16,
                                 max_segment_size_bytes => 500000000,
                                 tracking_config => #{}},
                           committed_offset => 122425221,
                           external_reference =>
                               {resource,<<"/">>,queue,<<"stream-1">>},
                           has_parse_state => true,num_offset_listeners => 0}
 ** Reason for termination ==
 ** {{accept_chunk_out_of_order,118520700,118420442},
     [{osiris_log,accept_chunk,2,[{file,"osiris_log.erl"},{line,793}]},
      {osiris_replica,'-handle_incoming_data/3-fun-0-',2,
                      [{file,"osiris_replica.erl"},{line,537}]},
      {lists,foldl,3,[{file,"lists.erl"},{line,1594}]},
      {osiris_replica,handle_incoming_data,3,
                      [{file,"osiris_replica.erl"},{line,536}]},
      {gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},
      {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},
      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}
 
   crasher:
     initial call: osiris_replica:init/1
     pid: <0.20591.0>
     registered_name: []
     exception exit: {accept_chunk_out_of_order,118520700,118420442}
       in function  osiris_log:accept_chunk/2 (osiris_log.erl, line 793)
       in call from osiris_replica:'-handle_incoming_data/3-fun-0-'/2 (osiris_replica.erl, line 537)
       in call from lists:foldl/3 (lists.erl, line 1594)
       in call from osiris_replica:handle_incoming_data/3 (osiris_replica.erl, line 536)
       in call from gen_server:try_handle_info/3 (gen_server.erl, line 1095)
       in call from gen_server:handle_msg/6 (gen_server.erl, line 1183)
     ancestors: [osiris_server_sup,osiris_sup,<0.239.0>]
     message_queue_len: 3
     messages: [{tcp,#Port<0.2098>,
                       <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

@mkuratczyk
Contributor Author

This one is somewhat different, but still worth fixing - it contains a lot of whitespace for some reason (screenshot attached to preserve the strange formatting):
[Screenshot 2024-03-13 at 18 39 29]
