
Fluent-bit stops processing logs under high load, locks writing in ch_manager pipe #2661

@grep4error

Description


Bug Report

My fluent-bit 1.5.7 runs in a container in a Kubernetes (AKS) environment. It is configured to collect Docker logs (33 tail inputs), apply a few filters, and send them to Elasticsearch (33 outputs).
Recently, as the volume of logs per node increased, fluent-bit started sporadically freezing. The process keeps running, consuming 0% CPU, and stops processing both new logs and the filesystem-storage backlog. It does, however, still respond to monitoring queries on its HTTP port.
After some debugging with strace and gdb, I found that it locks up while attempting to write to the ch_manager pipe.
Here's the stack trace:

0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055a127ecad01 in flb_output_return (ret=1, th=0x7fac548cd240) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:545
#2  0x000055a127ecade9 in flb_output_return_do (x=1) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:576
#3  0x000055a127eccca1 in cb_es_flush (data=0x7fac5991e0b2, bytes=3842,
    tag=0x7fac5437fd20 "kube.gws.gws-platform-datacollector-blue-6c474d7c84-wsdcw.gws-platform-datacollector-blue.2db22adbd3daaaf836a3f6311f4b3e5ad9ec7727280458ac68868419fb758ab9", tag_len=154,
    ins=0x7fac5949e480, out_context=0x7fac56c6bc00, config=0x7fac59432c80) at /tmp/fluent-bit/plugins/out_es/es.c:748
#4  0x000055a127e72649 in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:449
#5  0x000055a1282a6907 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6  0x3039663238613165 in ?? ()
#

strace output (filtered to reads from fd 13 and writes to fd 14, which are the ch_manager pipe fds):

...
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
write(14, "\0\200\25\20\2\0\0\0", 8)    = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
write(14, "\0\200\17\20\2\0\0\0", 8)    = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\0\33\20\2\0\0\0", 8)       = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = 8
write(14, "\0\300\24\20\2\0\0\0", 8)    = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
write(14, "\0\0\33\20\2\0\0\0", 8)      = 8
read(13, "\0\200\25\20\2\0\0\0", 8)     = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
write(14, "\0\200\17\20\2\0\0\0", 8)    = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
write(14, "\0@\16\20\2\0\0\0", 8)       = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
read(13, "\0\0\33\20\2\0\0\0", 8)       = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
read(13, "\0@\16\20\2\0\0\0", 8)        = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
write(14, "\0\300\24\20\2\0\0\0", 8)    = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
write(14, "\0@\16\20\2\0\0\0", 8)       = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
write(14, "\0\0\33\20\2\0\0\0", 8)      = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
read(13, "\0@\16\20\2\0\0\0", 8)        = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = ? ERESTARTSYS (To be restarted if SA_RESTART is set)

It looks like the Elasticsearch outputs can send so many responses to the inputs at once that the pipe fills up and write() blocks. But the inputs run in the same thread, so they can never drain the pipe, and fluent-bit locks up.

I produced a dirty fix by making the ch_manager pipe non-blocking. I also tried extending the size of the pipe (or at least reading it), but ioctl fails both to get and to set the pipe size. See the snippet below; I added the last line.

flb_engine.c:

    /*
     * Create a communication channel: this routine creates a channel to
     * signal the Engine event loop. It's useful to stop the event loop
     * or to instruct anything else without break.
     */
    ret = mk_event_channel_create(config->evl,
                                  &config->ch_manager[0],
                                  &config->ch_manager[1],
                                  config);
    if (ret != 0) {
        flb_error("[engine] could not create manager channels");
        return -1;
    }

    flb_pipe_set_nonblocking(&config->ch_manager[1]); /* <----- I made it non-blocking ------- */

There's probably a cleaner way to fix it, but this one-liner worked for me. I now occasionally see a "resource not available" error in the log, but fluent-bit survives and keeps crunching logs.

Environment

Version used: 1.5.7 (container fluent/fluent-bit:1.5.7)
kubernetes 1.16.13
docker 3.0.10+azure
Ubuntu 16.04.1

Labels

enhancement, waiting-for-user (Waiting for more information, tests or requested changes)