[Bug]: netdata plugins segfaults after update 1.45.3 to 1.45.4 #17671

Closed
ekexcello opened this issue May 15, 2024 · 7 comments · Fixed by #17690
Labels
area/collectors (Everything related to data collection), bug

Comments

@ekexcello

Bug description

On some servers, netdata started to crash on start-up.
The error message is very brief:

[Wed May 15 20:41:19 2024] PD[qmail][22407]: segfault at 68 ip 000055725db9d98e sp 00007fd0e79fe9c0 error 6 in netdata[55725da17000+3fb000] likely on CPU 5 (core 10, socket 0)
[Wed May 15 20:41:19 2024] Code: e8 e7 8d 03 00 48 83 ec 08 4c 89 f6 4c 89 ef 8b 54 24 14 44 0f b6 c8 52 4c 8b 44 24 28 4c 89 fa 48 8b 4c 24 20 e8 52 7c 02 00 <80> 60 68 fd 49 89 c6 58 5a 4d 85 e4 0f 84 40 01 00 00 41 80 3c 24
[Wed May 15 20:41:24 2024] PD[qmail][23388]: segfault at 68 ip 000056147895b98e sp 00007e66033fe9c0 error 6 in netdata[5614787d5000+3fb000] likely on CPU 3 (core 6, socket 0)
[Wed May 15 20:41:24 2024] Code: e8 e7 8d 03 00 48 83 ec 08 4c 89 f6 4c 89 ef 8b 54 24 14 44 0f b6 c8 52 4c 8b 44 24 28 4c 89 fa 48 8b 4c 24 20 e8 52 7c 02 00 <80> 60 68 fd 49 89 c6 58 5a 4d 85 e4 0f 84 40 01 00 00 41 80 3c 24
[Wed May 15 20:41:27 2024] PD[qmail][24301]: segfault at 68 ip 0000568c3764598e sp 00007f2cc1dfe9c0 error 6 in netdata[568c374bf000+3fb000] likely on CPU 1 (core 2, socket 0)
[Wed May 15 20:41:27 2024] Code: e8 e7 8d 03 00 48 83 ec 08 4c 89 f6 4c 89 ef 8b 54 24 14 44 0f b6 c8 52 4c 8b 44 24 28 4c 89 fa 48 8b 4c 24 20 e8 52 7c 02 00 <80> 60 68 fd 49 89 c6 58 5a 4d 85 e4 0f 84 40 01 00 00 41 80 3c 24

The segfault messages refer to the custom plugin we use, but when we run this plugin manually, it collects the data and runs without issues.
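The three segfault lines above can be compared mechanically. The following is a minimal sketch, not part of the original report: the regular expression, the field names, and the helper functions are assumptions made for illustration, and the timestamps are dropped from the sample lines. It subtracts the mapping base from the instruction pointer, so that crashes from different runs (with different randomized load addresses) can be compared, and it decodes the x86 page-fault error code.

```python
import re

# Assumed shape of the kernel's x86 segfault report; group names are our own.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Extract the interesting fields from one dmesg segfault line, or return None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return {
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        # ip relative to the start of the mapping shown in brackets
        "offset_in_mapping": int(m["ip"], 16) - int(m["base"], 16),
        "error": int(m["err"], 16),
    }


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "protection_fault": bool(code & 1),  # 0 means the page was not present
        "write": bool(code & 2),             # 0 means the access was a read
        "user_mode": bool(code & 4),         # 0 means the fault happened in kernel mode
    }


LINES = [
    "PD[qmail][22407]: segfault at 68 ip 000055725db9d98e sp 00007fd0e79fe9c0 error 6 in netdata[55725da17000+3fb000]",
    "PD[qmail][23388]: segfault at 68 ip 000056147895b98e sp 00007e66033fe9c0 error 6 in netdata[5614787d5000+3fb000]",
    "PD[qmail][24301]: segfault at 68 ip 0000568c3764598e sp 00007f2cc1dfe9c0 error 6 in netdata[568c374bf000+3fb000]",
]

reports = [parse_segfault(line) for line in LINES]
for report in reports:
    print(hex(report["offset_in_mapping"]), hex(report["fault_addr"]), decode_error(report["error"]))
```

All three crashes land at the same offset inside the mapping and at the same fault address, which suggests a single faulting instruction rather than random memory corruption. Error code 6 decodes to a user-mode write to a page that is not present, and a fault address as small as 0x68 is typical of a write through a NULL pointer plus a small structure offset.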

Expected behavior

no segfault

Steps to reproduce

  1. upgrade 1.45.3 to 1.45.4
  2. don't disable any plugin
  3. start netdata

Installation method

other

System info

Linux 6.7.12-gentoo-x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 11 17:30:02 CEST 2024 x86_64 AMD EPYC 7402P 24-Core Processor AuthenticAMD GNU/Linux
/etc/gentoo-release:Gentoo Base System release 2.15
/etc/lsb-release:DISTRIB_ID="Gentoo"
/etc/os-release:NAME=Gentoo
/etc/os-release:ID=gentoo
/etc/os-release:PRETTY_NAME="Gentoo Linux"
/etc/os-release:ANSI_COLOR="1;32"
/etc/os-release:VERSION_ID="2.15"

Netdata build info

Packaging:
    Netdata Version ____________________________________________ : v1.45.4
    Installation Type __________________________________________ : custom
    Package Architecture _______________________________________ : unknown
    Package Distro _____________________________________________ : unknown
    Configure Options __________________________________________ : dummy-configure-command
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /usr/share/netdata/web
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 6.7.12-gentoo-x86_64
    Operating System ___________________________________________ : Gentoo
    Operating System ID ________________________________________ : gentoo
    Operating System ID Like ___________________________________ : unknown
    Operating System Version ___________________________________ : 2.15
    Operating System Version ID ________________________________ : none
    Detection __________________________________________________ : Mixed
Hardware:
    CPU Cores __________________________________________________ : 12
    CPU Frequency ______________________________________________ : 2800000000
    RAM Bytes __________________________________________________ : 77747142656
    Disk Capacity ______________________________________________ : 161061273600
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : xen
    Virtualization Detection ___________________________________ : lscpu
Container:
    Container __________________________________________________ : unknown
    Container Detection ________________________________________ : none
    Container Orchestrator _____________________________________ : none
    Container Operating System _________________________________ : none
    Container Operating System ID ______________________________ : none
    Container Operating System ID Like _________________________ : none
    Container Operating System Version _________________________ : none
    Container Operating System Version ID ______________________ : none
    Container Operating System Detection _______________________ : none
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : NO (unavailable)
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip brotli)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine ___________________________________________________ : YES
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : NO
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Brotli (generic-purpose lossless compression algorithm) ____ : YES
    protobuf (platform-neutral data serialization protocol) ____ : NO
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : YES
    libcrypto (cryptographic functions) ________________________ : YES
    libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : NO
    ebpf (monitor system calls) ________________________________ : NO
    freeipmi (monitor enterprise server H/W) ___________________ : NO
    nfacct (gather netfilter accounting) _______________________ : YES
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : NO
    Xen VBD Error Tracking _____________________________________ : NO
    Logs Management ____________________________________________ : YES
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : NO
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : NO
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO

Additional info

Unfortunately, no log is available except the segfault message.

@ekexcello added the bug and needs triage labels on May 15, 2024
@ilyam8
Member

ilyam8 commented May 16, 2024

Hi, @ekexcello. As you can guess, the information provided is not enough to identify the problem.

  • Does Netdata crash if you disable your plugin?
  • Can you provide the output of your plugin? Multiple data collection intervals will do.
  • Can you get a core dump?

@ilyam8 added the cannot reproduce label and removed the needs triage label on May 16, 2024
@ekexcello
Author

Hello.
Sorry for the delay.
Netdata doesn't crash with plugins disabled. Will provide other details soon

@ekexcello
Author

Please find the plugin output and the core dumps for pid 19500 and pid 26213.
Note that the core dumps are large when uncompressed.

@ekexcello
Author

While taking the core dumps, we found that another custom plugin of ours also crashes, but it doesn't cause netdata itself to crash. So the problem seems to be somewhat broader.

@arkamar
Contributor

arkamar commented May 17, 2024

I wasn't able to reproduce this issue either, but yesterday, before the logs and core dumps were available, I spent some time on the Code line from dmesg:

[Wed May 15 20:41:19 2024] Code: e8 e7 8d 03 00 48 83 ec 08 4c 89 f6 4c 89 ef 8b 54 24 14 44 0f b6 c8 52 4c 8b 44 24 28 4c 89 fa 48 8b 4c 24 20 e8 52 7c 02 00 <80> 60 68 fd 49 89 c6 58 5a 4d 85 e4 0f 84 40 01 00 00 41 80 3c 24

It disassembles to this:

 0x00000000      e8e78d0300     call 0x38dec
 0x00000005      4883ec08       sub rsp, 8
 0x00000009      4c89f6         mov rsi, r14
 0x0000000c      4c89ef         mov rdi, r13
 0x0000000f      8b542414       mov edx, dword [rsp + 0x14]
 0x00000013      440fb6c8       movzx r9d, al
 0x00000017      52             push rdx
 0x00000018      4c8b442428     mov r8, qword [rsp + 0x28]
 0x0000001d      4c89fa         mov rdx, r15
 0x00000020      488b4c2420     mov rcx, qword [rsp + 0x20]
 0x00000025      e8527c0200     call 0x27c7c
>0x0000002a      806068fd       and byte [rax + 0x68], 0xfd ; [0xfd:1]=255 ; 253
 0x0000002e      4989c6         mov r14, rax
 0x00000031      58             pop rax
 0x00000032      5a             pop rdx
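The faulting instruction in the listing above can be cross-checked by hand. This is a small sketch, assuming only the layout of the dmesg Code line (space-separated hex bytes, with the byte at the faulting instruction pointer wrapped in angle brackets); the helper names are made up for this example. It locates the marked byte and splits the ModRM byte of the instruction that follows into its fields.

```python
def parse_code_line(code):
    """Return (raw bytes, index of the byte marked <..>) from a dmesg 'Code:' dump."""
    data = bytearray()
    fault_index = None
    for token in code.split():
        if token.startswith("<") and token.endswith(">"):
            fault_index = len(data)  # the marked byte sits at the faulting ip
            token = token[1:-1]
        data.append(int(token, 16))
    return bytes(data), fault_index


def decode_modrm(byte):
    """Split an x86 ModRM byte into its (mod, reg, rm) fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


CODE = (
    "e8 e7 8d 03 00 48 83 ec 08 4c 89 f6 4c 89 ef 8b 54 24 14 44 0f b6 c8 52 "
    "4c 8b 44 24 28 4c 89 fa 48 8b 4c 24 20 e8 52 7c 02 00 <80> 60 68 fd 49 89 "
    "c6 58 5a 4d 85 e4 0f 84 40 01 00 00 41 80 3c 24"
)

data, fault_index = parse_code_line(CODE)
opcode, modrm, disp8, imm8 = data[fault_index:fault_index + 4]
mod, reg, rm = decode_modrm(modrm)

print(f"faulting offset: {fault_index:#x}")
print(f"opcode={opcode:#x} mod={mod} reg={reg} rm={rm} disp8={disp8:#x} imm8={imm8:#x}")
```

Opcode 0x80 with reg field 4 is AND on a byte operand with an 8-bit immediate; mod 1 with rm 0 addresses [rax + disp8]. That agrees with the disassembler's `and byte [rax + 0x68], 0xfd` at offset 0x2a.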

I searched my netdata binary and found only one instruction with the same operands:

 0x00199a08      e8e38f0300     call dbg.rrd_algorithm_id
 0x00199a0d      4883ec08       sub rsp, 8
 0x00199a11      4c89fe         mov rsi, r15                ; char *arg2
 0x00199a14      4c89f7         mov rdi, r14                ; int64_t arg1
 0x00199a17      4155           push r13
 0x00199a19      4c8b442428     mov r8, qword [var_18h]     ; int64_t arg5
 0x00199a1e      440fb6c8       movzx r9d, al               ; int64_t arg_50h
 0x00199a22      488b4c2420     mov rcx, qword [var_10h]    ; int64_t arg4
 0x00199a27      488b542418     mov rdx, qword [var_8h]     ; int64_t arg3
 0x00199a2c      e8ef7c0200     call dbg.rrddim_add_custom
>0x00199a31      806068fd       and byte [rax + 0x68], 0xfd ; [0xfd:1]=0
 0x00199a35      4989c5         mov r13, rax
 0x00199a38      58             pop rax
 0x00199a39      5a             pop rdx

It is from the pluginsd_dimension function, which crashes on this line:

rrddim_option_clear(rd, RRDDIM_OPTION_DONT_DETECT_RESETS_OR_OVERFLOWS);

Everything fits: 0xfd is ~(1 << 1) truncated to one byte, where (1 << 1) is the value of RRDDIM_OPTION_DONT_DETECT_RESETS_OR_OVERFLOWS; 0x68 is the offset of collector.options in the rd structure, and the rd pointer is held in the rax register.
rrddim_add (rrddim_add_custom in the assembly) can return NULL in some situations, but that case is not checked here, so the write goes to address 0x68, matching the "segfault at 68" in dmesg. This is the reason for the crash. At least I think so :)
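The arithmetic in this explanation is easy to verify. The sketch below is illustrative only: the constants are the values quoted in the comment above (flag value 1 << 1, field offset 0x68), not values read from the Netdata headers, and the function name is invented.

```python
# Values as stated in the comment above; not taken from Netdata's source.
RRDDIM_OPTION_DONT_DETECT_RESETS_OR_OVERFLOWS = 1 << 1
COLLECTOR_OPTIONS_OFFSET = 0x68

# Clearing a flag in a one-byte field is an AND with the inverted flag.
clear_mask = ~RRDDIM_OPTION_DONT_DETECT_RESETS_OR_OVERFLOWS & 0xFF


def options_address(rd_pointer):
    """Address written by 'and byte [rax + 0x68], 0xfd' when rax holds rd_pointer."""
    return rd_pointer + COLLECTOR_OPTIONS_OFFSET


print(hex(clear_mask))          # immediate operand of the AND instruction
print(hex(options_address(0)))  # address touched when rd is NULL
```

With rd equal to NULL the write targets address 0x68, which is exactly the fault address the kernel reported, and the mask equals the 0xfd immediate in the disassembly.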

@ilyam8
Member

ilyam8 commented May 17, 2024

Hey, @ekexcello. The problem is a missing DIMENSION id:

CHART qmail.limit_maxconnip '' 'Qmail SMTPD maxconnip limit' '# reaches' 'tcpserver' 'qmail.qmail_smtpd_limits' line
DIMENSION  '' absolute 1 1

BEGIN qmail.limit_maxconnip 1000016
SET  = 1
END

The fact that Netdata crashes is a bug, and we will fix it. After the fix, Netdata will stop your plugin because of its incorrect output, so you'll need to fix the plugin as well.
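A plugin author could catch this kind of mistake before Netdata sees it. The following is a rough sketch of such a check, not a Netdata tool: the function name and the messages are made up, and it assumes only that the first word after DIMENSION, and the text before `=` in a SET line, is the dimension id. The id `reaches` in the second sample is hypothetical.

```python
import shlex


def find_missing_ids(output):
    """Return (line_number, message) pairs for DIMENSION/SET lines with an empty id."""
    problems = []
    for lineno, raw in enumerate(output.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        keyword, _, rest = line.partition(" ")
        if keyword == "DIMENSION":
            try:
                words = shlex.split(rest)  # keeps '' as an empty word
            except ValueError:
                problems.append((lineno, "DIMENSION line has unbalanced quotes"))
                continue
            if not words or words[0] == "":
                problems.append((lineno, "DIMENSION has an empty id"))
        elif keyword == "SET":
            dim_id = rest.partition("=")[0].strip()
            if dim_id in ("", "''", '""'):
                problems.append((lineno, "SET has an empty id"))
    return problems


BAD_OUTPUT = "\n".join([
    "CHART qmail.limit_maxconnip '' 'Qmail SMTPD maxconnip limit' '# reaches' 'tcpserver' 'qmail.qmail_smtpd_limits' line",
    "DIMENSION  '' absolute 1 1",
    "",
    "BEGIN qmail.limit_maxconnip 1000016",
    "SET  = 1",
    "END",
])

# Same output with a hypothetical dimension id filled in.
GOOD_OUTPUT = BAD_OUTPUT.replace("DIMENSION  ''", "DIMENSION reaches ''").replace("SET  =", "SET reaches =")

for lineno, message in find_missing_ids(BAD_OUTPUT):
    print(f"line {lineno}: {message}")
```

Running the check on the plugin's output during development, for example by piping a manual run of the plugin into it, would flag the DIMENSION and SET lines shown above while leaving the corrected output untouched.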

@stelfrag
Collaborator

RRDDIM_OPTION_DONT_DETECT_RESETS_OR_OVERFLOWS. 0x68 is offset of collector.options in rd structure and the rd pointer is stored in rax register. rrddim_add (rrddim_add_custom in the assembly) can return NULL in some situations, but this case is not checked here and this is the reason of the crash. At least I think so :)

@arkamar You are correct! Thanks for the detailed investigation (and the assembly code 😃)

@ilyam8 added the area/collectors label and removed the cannot reproduce and need feedback labels on May 17, 2024