Bug Report

My fluent-bit 1.5.7 runs in a container in a k8s (AKS) environment. It is configured to collect Docker logs (33 tail inputs), run them through a few filters, and send them to Elasticsearch (33 outputs).
Recently, as the amount of logs per node increased, fluent-bit started sporadically freezing up. The process would keep running while consuming 0% CPU and would stop processing both new logs and the filesystem storage backlog. It would, however, still respond to monitoring queries on its HTTP port.
After some debugging with strace and gdb, I found that it locks up while attempting to write to the ch_manager pipe.
Here’s the stack trace:
0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1 0x000055a127ecad01 in flb_output_return (ret=1, th=0x7fac548cd240) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:545
#2 0x000055a127ecade9 in flb_output_return_do (x=1) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:576
#3 0x000055a127eccca1 in cb_es_flush (data=0x7fac5991e0b2, bytes=3842,
tag=0x7fac5437fd20 "kube.gws.gws-platform-datacollector-blue-6c474d7c84-wsdcw.gws-platform-datacollector-blue.2db22adbd3daaaf836a3f6311f4b3e5ad9ec7727280458ac68868419fb758ab9", tag_len=154,
ins=0x7fac5949e480, out_context=0x7fac56c6bc00, config=0x7fac59432c80) at /tmp/fluent-bit/plugins/out_es/es.c:748
#4 0x000055a127e72649 in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:449
#5 0x000055a1282a6907 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6 0x3039663238613165 in ?? ()
#
strace output (filtered to reads from fd 13 and writes to fd 14, which are the ch_manager pipe fds):
...
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
read(13, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\0\27\20\2\0\0\0", 8) = 8
read(13, "\0\300\37\20\2\0\0\0", 8) = 8
write(14, "\0\200\25\20\2\0\0\0", 8) = 8
read(13, "\0\200\17\20\2\0\0\0", 8) = 8
write(14, "\0\200\17\20\2\0\0\0", 8) = 8
write(14, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\0\33\20\2\0\0\0", 8) = 8
read(13, "\0\300\r\20\2\0\0\0", 8) = 8
write(14, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\300\24\20\2\0\0\0", 8) = 8
write(14, "\0\300\r\20\2\0\0\0", 8) = 8
write(14, "\0\300\24\20\2\0\0\0", 8) = 8
read(13, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0\0\33\20\2\0\0\0", 8) = 8
read(13, "\0\200\25\20\2\0\0\0", 8) = 8
read(13, "\0\200\17\20\2\0\0\0", 8) = 8
write(14, "\0\200\17\20\2\0\0\0", 8) = 8
read(13, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0@\16\20\2\0\0\0", 8) = 8
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
read(13, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\0\36\20\2\0\0\0", 8) = 8
write(14, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\0\36\20\2\0\0\0", 8) = 8
write(14, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\300\r\20\2\0\0\0", 8) = 8
read(13, "\0\300\24\20\2\0\0\0", 8) = 8
read(13, "\0\0\33\20\2\0\0\0", 8) = 8
write(14, "\0\300\r\20\2\0\0\0", 8) = 8
read(13, "\0\200\17\20\2\0\0\0", 8) = 8
read(13, "\0@\16\20\2\0\0\0", 8) = 8
read(13, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0\300\24\20\2\0\0\0", 8) = 8
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0@\16\20\2\0\0\0", 8) = 8
read(13, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\0\33\20\2\0\0\0", 8) = 8
read(13, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\300\37\20\2\0\0\0", 8) = 8
write(14, "\0\0\27\20\2\0\0\0", 8) = 8
write(14, "\0\0\36\20\2\0\0\0", 8) = 8
read(13, "\0\300\r\20\2\0\0\0", 8) = 8
write(14, "\0\300\37\20\2\0\0\0", 8) = 8
read(13, "\0\300\24\20\2\0\0\0", 8) = 8
read(13, "\0\200\1\20\2\0\0\0", 8) = 8
read(13, "\0@\16\20\2\0\0\0", 8) = 8
write(14, "\0\200\1\20\2\0\0\0", 8) = 8
write(14, "\0\300\r\20\2\0\0\0", 8) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
It looks like the Elasticsearch outputs can send so many responses to the inputs at the same time that the pipe fills up and write() blocks. But the inputs run in the same thread, so they never get a chance to read the responses from the pipe, and fluent-bit locks up.
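To put numbers on that: each flush notification in the strace above is exactly 8 bytes, and on Linux a pipe holds 64 KiB by default, so roughly 8192 pending notifications are enough to fill the channel. The standalone sketch below is not fluent-bit code; the 8-byte message size is just taken from the trace. It counts how many such messages fit into a fresh pipe before a blocking write() would stall (the write end is made non-blocking here only so the demo stops instead of hanging the way fluent-bit did):

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    uint64_t msg = 0;   /* stand-in for one 8-byte flush notification */
    long count = 0;
    int flags;

    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }

    /* Non-blocking only so this demo terminates instead of blocking */
    flags = fcntl(fds[1], F_GETFL);
    fcntl(fds[1], F_SETFL, flags | O_NONBLOCK);

    while (write(fds[1], &msg, sizeof(msg)) == (ssize_t) sizeof(msg)) {
        count++;
    }

    printf("%ld messages of %zu bytes fit before EAGAIN (%s)\n",
           count, sizeof(msg), strerror(errno));
    return 0;
}

On a stock Linux kernel this prints 8192 messages, i.e. the channel saturates after a few thousand outstanding flush results.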
I produced a dirty fix by making the ch_manager pipe non-blocking. I also tried extending the size of the pipe (or at least querying it), but the ioctl failed to get or set the pipe size. See the snippet below from flb_engine.c; I added the last line.
/*
 * Create a communication channel: this routine creates a channel to
 * signal the Engine event loop. It's useful to stop the event loop
 * or to instruct anything else without break.
 */
ret = mk_event_channel_create(config->evl,
                              &config->ch_manager[0],
                              &config->ch_manager[1],
                              config);
if (ret != 0) {
    flb_error("[engine] could not create manager channels");
    return -1;
}
flb_pipe_set_nonblocking(&config->ch_manager[1]);  /* <----- I made it non-blocking ------- */
There's probably a cleaner way to fix it, but this one-liner worked for me. I now see an occasional “resource not available” error in the log, but fluent-bit survives and continues crunching logs.
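For what it's worth, on Linux (kernel 2.6.35+) the pipe capacity is normally queried and changed through fcntl() with F_GETPIPE_SZ / F_SETPIPE_SZ rather than ioctl(). Below is a standalone sketch of that approach; I have not wired it into fluent-bit, and the grow_pipe() helper and the 1 MiB target are just placeholders, so treat it as an untested alternative to the non-blocking hack:

#define _GNU_SOURCE     /* for F_GETPIPE_SZ / F_SETPIPE_SZ */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: try to grow a pipe's capacity; returns the new
 * capacity in bytes, or -1 on error. */
static int grow_pipe(int fd, int wanted)
{
    int cur = fcntl(fd, F_GETPIPE_SZ);
    if (cur < 0) {
        return -1;
    }
    if (cur >= wanted) {
        return cur;
    }
    /* The kernel rounds the request up to a power of two; unprivileged
     * processes are capped by /proc/sys/fs/pipe-max-size. */
    return fcntl(fd, F_SETPIPE_SZ, wanted);
}

int main(void)
{
    int fds[2];

    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }
    printf("capacity now: %d bytes\n", grow_pipe(fds[1], 1024 * 1024));
    return 0;
}

In fluent-bit the same call would presumably target config->ch_manager[1] right after mk_event_channel_create(), but a larger pipe only postpones the stall; it does not remove the single-threaded read/write dependency described above.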
Environment
Version used: 1.5.7 (container fluent/fluent-bit:1.5.7)
Kubernetes 1.16.13
Docker 3.0.10+azure
Ubuntu 16.04.1