Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

supervisor running unconfined (w/o apparmor profile) after reboot THEN w/ insufficient perms to monitor udev after recreate #4839

Open
donaldguy opened this issue Jan 26, 2024 · 7 comments
Labels
bug no-stale Avoids issue going stale

Comments

@donaldguy
Copy link

donaldguy commented Jan 26, 2024

TL;DR basically like the title says: I have discovered:

  1. that the supervisor has only ever(? recently-ever?) been working/starting for me because some bug on (re)boot is preventing the hassio-supervisor apparmor policy from being actually applied/enforced on any process in an existing-on-boot hassio_supervisor container (with e.g. ps auZ showing them "unrestricted", and `apparmor_status listing none of them)
  2. As soon as something causes the policy to actually be applied (most reproducibly a re-creation of the container, but possibly others) - the system cannot be healthy due to udev permission failures; as in Supervisor keeps going “Unhealthy” #4381

Describe the issue you are experiencing

Until pretty recently after many hours of investigating and writing I did not have a specific proposed pathology (had only the "disease"), but I have found it (the worrying "cure") and it raises potentially greater concerns, including both how it interacts with the udev issue and that it occurs at all.

the "disease"

As captured in the log-snippets below (and at root in #4381 and potentially related to continuing issues with #4827 after release of docker 25.0.1), the eventual problem is that (reliably after container recreation, so e.g. supervisor update, as well as potentially after some supervisor restarts in my testing, and also apparently just in letting HA run sometimes?) this code goes down the unhappy path:

async def load(self) -> None:
"""Start hardware monitor."""
try:
self.monitor = pyudev.Monitor.from_netlink(self.context, "kernel")
self.monitor.set_receive_buffer_size(32 * 1024 * 1024)
self.observer = pyudev.MonitorObserver(
self.monitor,
callback=lambda x: self.sys_loop.call_soon_threadsafe(
self._udev_events, x
),
)
except OSError:
self.sys_resolution.unhealthy = UnhealthyReason.PRIVILEGED
_LOGGER.critical("Not privileged to run udev monitor!")

producing a false-positive (or at least transiently-valid) claim of running an unsupported/unhealthy configuration, failing by ~short-circuit under the "privileged" check, owing to what I would assert is an erroneous re-usage of that error code in the network code, as

$ docker inspect hassio_supervisor -f '{{.HostConfig.Privileged}}'
true

remains the case at every point.

Against an already running core, this also blocks add-on (and core) upgrades even as the supervisor may be de facto functioning adequately (for most purposes, presumably it would fail at any responsibilities specifically related to hardware (change) monitoring)

the worrying "cure"

With limited exceptions (when running Docker 25.0.0 and hitting other problems) I have generally found that rebooting the entire machine will solve my problem… for a while.

I did not know why, even as I started to dig into this udev issue and hone in on the specifics within libudev

but as it turns out, after I reboot, the loaded hassio-supervisor profile is simply not applied to the container

$ docker inspect hassio_supervisor -f '{{.HostConfig.SecurityOpt}}'
[apparmor=hassio-supervisor label=disable]

$ aa-status | grep hassio-supervisor
   hassio-supervisor
   hassio-supervisor///usr/bin/gdbus
   hassio-supervisor///usr/bin/git

# the shortness of which alone should hint the issue, but we can confirm

$ docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid
LABEL                               PID TTY      STAT   TIME COMMAND
LABEL                               PID TTY      STAT   TIME COMMAND
unconfined                         2638 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
unconfined                         2666 ?        S      0:00 s6-supervise s6rc-fdholder
unconfined                         2667 ?        S      0:00 s6-supervise s6rc-oneshot-runner
unconfined                         2722 ?        S      0:00 s6-supervise supervisor
unconfined                         2724 ?        S      0:00 s6-supervise watchdog
# with a more elaborate pipeline used below indicating that it does not in fact somehow come into effect by `python`

this being in contrast to the other hassio_* containers which are instead/at least under docker-default profile. (by virtue having no specified alternative profile)

What type of installation are you running?

Home Assistant Supervised

affecting supported systems?

  • I am of the impression that supervised-installer run on Debian Bookworm is considered supported, so ... then yes

  • But even if it is not, whereas there seems to be a reproducible critical failure upon the application of the stable apparmor profile at all, the issue may very well be waiting in the wings for HAOS

    • pending some upgrade to apparmor, systemd/sytemd-udev ;
    • but it is also possible it is a kernel bug that has already been fixed upstream and taken by y'all

I hope y'all take this seriously given the many hours [more than I meant to] I have put into it

but I will put my cards on the table, admitting I might be of spec somewhere (I believe only in an older kernel, but let me know if you see something else):

uname -a and filtered dpkg-query -l
$ uname -a
Linux odroidn2 6.1.63-current-meson64 #1 SMP PREEMPT Mon Nov 20 10:52:19 UTC 2023 aarch64 GNU/Linux

$ sudo dpkg-query -l | grep -E 'apparmor|cifs-utils|curl|dbus|jq|libglib2.0-bin|lsb-release|network-manager|nfs-common|systemd|udisks2|wget|^linux|docker|homeassistant'
hii  apparmor                                              3.0.8-3                        arm64        user-space parser utility for AppArmor
ii  cifs-utils                                            2:7.0-2                        arm64        Common Internet File System utilities
ii  curl                                                  7.88.1-10+deb12u5              arm64        command line tool for transferring data with URL syntax
ii  dbus                                                  1.14.10-1~deb12u1              arm64        simple interprocess messaging system (system message bus)
ii  dbus-bin                                              1.14.10-1~deb12u1              arm64        simple interprocess messaging system (command line utilities)
ii  dbus-broker                                           33-1                           arm64        Linux D-Bus Message Broker
ii  dbus-daemon                                           1.14.10-1~deb12u1              arm64        simple interprocess messaging system (reference message bus)
ii  dbus-session-bus-common                               1.14.10-1~deb12u1              all          simple interprocess messaging system (session bus configuration)
ii  dbus-system-bus-common                                1.14.10-1~deb12u1              all          simple interprocess messaging system (system bus configuration)
ii  dbus-user-session                                     1.14.10-1~deb12u1              arm64        simple interprocess messaging system (systemd --user integration)
ii  docker-ce                                             5:25.0.1-1~debian.12~bookworm  arm64        Docker: the open-source application container engine
ii  docker-ce-cli                                         5:25.0.1-1~debian.12~bookworm  arm64        Docker CLI: the open-source application container engine
ii  homeassistant-supervised                              1.5.0                          all          Home Assistant Supervised
ii  jq                                                    1.6-2.1                        arm64        lightweight and flexible command-line JSON processor
ii  libapparmor1:arm64                                    3.0.8-3                        arm64        changehat AppArmor library
ii  libcurl3-gnutls:arm64                                 7.88.1-10+deb12u5              arm64        easy-to-use client-side URL transfer library (GnuTLS flavour)
ii  libcurl4:arm64                                        7.88.1-10+deb12u5              arm64        easy-to-use client-side URL transfer library (OpenSSL flavour)
ii  libdbus-1-3:arm64                                     1.14.10-1~deb12u1              arm64        simple interprocess messaging system (library)
ii  libglib2.0-bin                                        2.74.6-2                       arm64        Programs for the GLib library
ii  libjq1:arm64                                          1.6-2.1                        arm64        lightweight and flexible command-line JSON processor - shared library
ii  libpam-systemd:arm64                                  252.19-1~deb12u1               arm64        system and service manager - PAM module
ii  libsystemd-shared:arm64                               252.19-1~deb12u1               arm64        systemd shared private library
ii  libsystemd0:arm64                                     252.19-1~deb12u1               arm64        systemd utility library
ii  libudisks2-0:arm64                                    2.9.4-4                        arm64        GObject based library to access udisks2
ii  lsb-release                                           12.0-1                         all          Linux Standard Base version reporting utility (minimal implementation)
ii  network-manager                                       1.42.4-1                       arm64        network management framework (daemon and userspace tools)
ii  network-manager-openvpn                               1.10.2-2                       arm64        network management framework (OpenVPN plugin core)
ii  networkd-dispatcher                                   2.2.3-1                        all          Dispatcher service for systemd-networkd connection status changes
ii  nfs-common                                            1:2.6.2-4                      arm64        NFS support files common to client and server
ii  python3-dbus                                          1.3.2-4+b1                     arm64        simple interprocess messaging system (Python 3 interface)
ii  systemd                                               252.19-1~deb12u1               arm64        system and service manager
ii  systemd-journal-remote                                252.19-1~deb12u1               arm64        tools for sending and receiving remote journal logs
ii  systemd-resolved                                      252.19-1~deb12u1               arm64        systemd DNS resolver
ii  systemd-sysv                                          252.19-1~deb12u1               arm64        system and service manager - SysV compatibility symlinks
ii  udisks2                                               2.9.4-4                        arm64        D-Bus service to access and manipulate storage devices
ii  wget                                                  1.21.3-1+b1                    arm64        retrieves files from the web

Which operating system are you running on?

Debian


Steps to reproduce the issue

  1. Reboot host machine (with a hassio_supervisor container already created. If I stop and delete before reboot, the initial creation is "broken". An additional reboot achieves)

  2. Observer that supervisor comes up reporting as in healthy state;

    odroidn2:~:# uptime
     18:44:13 up 1 min,  1 user,  load average: 1.65, 0.75, 0.28
    odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -2
    24-01-25 18:44:04 INFO (MainThread) [supervisor.resolution.fixup] System autofix complete
    24-01-25 18:44:04 INFO (MainThread) [supervisor.host.manager] Host information reload completed
    odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
    healthy: true
    supported: true
    version: 2023.12.1
    version_latest: 2023.12.1
    

    optionally: proceed in blissful ignorance as to why/how

  3. Alternatively ruin your sense of well-being by looking into the apparmor status of that-there container

    odroidn2:~:# docker inspect hassio_supervisor -f '{{.HostConfig.SecurityOpt}}'
    [apparmor=hassio-supervisor label=disable]
    
    docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
        xargs pstree -p |  grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
        xargs ps Z
    LABEL                               PID TTY      STAT   TIME COMMAND
    unconfined                         2503 ?        Ss     0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
    unconfined                         2638 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
    unconfined                         2641 ?        Ss     0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
    unconfined                         2666 ?        S      0:00 s6-supervise s6rc-fdholder
    unconfined                         2667 ?        S      0:00 s6-supervise s6rc-oneshot-runner
    unconfined                         2680 ?        Ss     0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
    unconfined                         2722 ?        S      0:00 s6-supervise supervisor
    unconfined                         2724 ?        S      0:00 s6-supervise watchdog
    unconfined                         2726 ?        Ss     0:00 bash /usr/bin/bashio ./run watchdog
    unconfined                         2727 ?        Ssl    0:06 python3 -m supervisor
    unconfined                         7424 ?        S      0:00 sleep 300
    
    • 3.5. Confirm that in fact there is such profile actually

      odroidn2:~:# aa-status | grep hassio
         hassio-supervisor
         hassio-supervisor///usr/bin/gdbus
         hassio-supervisor///usr/bin/git
      odroidn2:~:# docker logs hassio_supervisor 2>&1 | grep apparmor
      24-01-25 19:31:42 INFO (MainThread) [supervisor.host.apparmor] Loading AppArmor Profiles: {'hassio-supervisor'}
      odroidn2:~:# head /usr/share/hassio/apparmor/hassio-supervisor
      #include <tunables/global>
      
      profile hassio-supervisor flags=(attach_disconnected,mediate_deleted) {
        #include <abstractions/base>
        #include <abstractions/python>
      
        network,
        deny network raw,
      
        signal (send) set=(kill,term,int,hup,cont),
      
    • 3.75. Try various methods to simply restart the container [ha supervisor restart, systemctl restart hassio-supervisor or even systemctl stop hassio-supervisor wait a while systemctl start hassio-supervisor ... and oops actually everything was "fine";

      I kinda swear at least once it broke on restart, but I might be misremembering, or hit one of the edges suggested below)

      odroidn2:~:# ha supervisor restart
      Command completed successfully.
      odroidn2:~:# docker ps | grep  hassio_supervisor
      974b16d01e95   ghcr.io/home-assistant/aarch64-hassio-supervisor:latest     "/init"                  14 minutes ago   Up 13 minutes                                                                                                                                  hassio_supervisor
      odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
      healthy: true
      supported: true
      version: 2023.12.1
      version_latest: 2023.12.1
      

      still "fine" is all the processes remain unconstrained from the apparmor profile intended for the,

  4. Okay, but anything which actually recreates the container (as /usr/sbin/hassio-supervisor will do when the startup-marker (/run/supervisor/startup-marker) is left in place (so that is the supervisor crashes early?), or indeed as outlined above when the profile is missing initially, or on supervisor update; or if simply someone/something runs docker rm on the stopped container)

    So yeah we can do

    systemctl stop hassio-supervisor && touch /run/supervisor/startup-marker && systemctl start hassio-superviosr

    ... for example

    And now we will find our error state of interest

    odroidn2:~:# docker ps | grep hassio_supervisor
    325c2a447241   ghcr.io/home-assistant/aarch64-hassio-supervisor:latest     "/init"                  About a minute ago   Up About a minute                                                                                                                                  hassio_supervisor
    odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
    healthy: false
    supported: true
    version: 2023.12.1
    version_latest: 2023.12.1
    
    odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3
    24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed
    24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
    24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged
    
    odroidn2:~:# docker logs hassio_supervisor 2>&1 | grep 'CRITICAL\|ERROR\|WARNING'
    24-01-25 20:07:42 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
    24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.rauc] Host has no rauc support. OTA updates have been disabled.
    24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.agent] No OS-Agent support on the host. Some Host functions have been disabled.
    24-01-25 20:07:52 CRITICAL (MainThread) [supervisor.core] System is running in an unhealthy state and needs manual intervention!
    
  5. What has changed? I mean you know, but its

    odroidn2:~:# aa-status | grep hassio-supervisor
       hassio-supervisor
       hassio-supervisor///usr/bin/gdbus
       hassio-supervisor///usr/bin/git
       /package/admin/s6-2.12.0.2/command/s6-svscan (9035) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-supervise (9077) hassio-supervisor
       /package/admin/s6-linux-init-1.1.2.0/command/s6-linux-init-shutdownd (9079) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-supervise (9086) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-supervise (9087) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-ipcserverd (9095) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-supervise (9135) hassio-supervisor
       /package/admin/s6-2.12.0.2/command/s6-supervise (9137) hassio-supervisor
       /usr/local/bin/python3.11 (9138) hassio-supervisor
       /usr/bin/bash (9142) hassio-supervisor
       /bin/busybox (9773) hassio-supervisor
    

    i.e.

    odroidn2:~:# docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
      xargs pstree -p |  grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
      xargs ps Z
    LABEL                               PID TTY      STAT   TIME COMMAND
    hassio-supervisor (enforce)        9035 ?        Ss     0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
    hassio-supervisor (enforce)        9077 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
    hassio-supervisor (enforce)        9079 ?        Ss     0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
    hassio-supervisor (enforce)        9086 ?        S      0:00 s6-supervise s6rc-fdholder
    hassio-supervisor (enforce)        9087 ?        S      0:00 s6-supervise s6rc-oneshot-runner
    hassio-supervisor (enforce)        9095 ?        Ss     0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
    hassio-supervisor (enforce)        9135 ?        S      0:00 s6-supervise supervisor
    hassio-supervisor (enforce)        9137 ?        S      0:00 s6-supervise watchdog
    hassio-supervisor (enforce)        9138 ?        Ssl    0:03 python3 -m supervisor
    hassio-supervisor (enforce)        9142 ?        Ss     0:00 bash /usr/bin/bashio ./run watchdog
    hassio-supervisor (enforce)        9773 ?        S      0:00 sleep 30
    
  6. We can go deeper actually. You can see most the same thing by starting with debug and debug_wait and using the and attaching from vscode; but to really get there - lets more awkwardly do

    systemctl stop hassio-supervisor
    docker rm hassio_supervisor
    sh <(sed 's/docker container create/docker run --rm -it --entrypoint \/bin\/bash/' /usr/sbin/hassio-supervisor)

    This should (still1) give you into an interactive shell that is set up like the supervisor, including (which running our docker inspect ... pipeline can verify) the apparmor profile.

    From here its instructive to do 3 things (restarting the shell between or not as you prefer):

    1. actually run the supervisor (e.g. python3 -m supervisor ) and see it fail same as when deamonized (including CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!)

    2. run udevadm monitor, and discover that it works perfectly (prints the headers and will respond if I go e.g. plug and unplug USB devices)

    3. Attempt my minimal repro case, expanded from the real line via the pyudev source
      by starting a python repl, then

      import pyudev
      import os
      
      from ctypes import get_errno
      
      c = pyudev.Context()
      m = c._libudev.udev_monitor_new_from_netlink(c, b'kernel')
      
      if bool(m):
        print("successfully got pointer to a `struct udev_monitor`")
      else:
        print("udev_monitor_new_from_netlink(3) returned NULL pointer")
        e = get_errno()
        print(f"Errno set was: {os.strerror(e)} ({e})")

      and like the supervisor, you'll get

      udev_monitor_new_from_netlink(3) returned NULL pointer
      Errno set was: Permission denied (13)

    Why do these things? Well, it reveals that the apparmor policy is in effect AND is killing udev connect for the supervisor (and the python interpreter in general, and most things).

    How do I know that? Because udevadm still works. It is calling the same function (udev_monitor_new_from_netlink) of the same library (from alpine's eudev-libs 3.2.11-r8): https://github.com/eudev-project/eudev/blob/611b6fbae2cca9ceeacb820befb3a0e8caec88b8/src/udev/udevadm-monitor.c#L214

    (Unless in fact pyudev's wheel is bringing along its own copy of e.g. systemd libudev; but I don't see evidence of that)

    So it follows that the only reason that udevadm still works when the other two don't is because the apparmor policy gives it an unrestricted pass:

    /bin/udevadm Ux,
    

    where python gets no such


Additional information

So what is actually failing?

the libudev call

it's apparently udev_monitor_new_from_netlink(3) which is unexpectedly returning a NULL pointer and setting errno to 13 ("Access denied")

That's pretty much, at core, 3 syscalls:

  1. access on /run/udev/control - which, ls -l shows it srw------- root root and a stat also passed

  2. socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT)

    Indeed I can just now say with some confidence it is this syscall, because I belatedly discovered that pyudev does hook a few log priority functions and

    >>> import syslog
    >>> c.log_priority = syslog.LOG_DEBUG
    >>> m = c._libudev.udev_monitor_new_from_netlink(c, b'kernel')
    error getting socket: Permission denied

    That error appearing to be generated at https://github.com/eudev-project/eudev/blob/v3.2.11/src/libudev/libudev-monitor.c#L203-L206 :

            if (fd < 0) {
                    udev_monitor->sock = socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT);
                    if (udev_monitor->sock < 0) {
                            log_debug_errno(errno, "error getting socket: %m");
  3. getsockname(udev_monitor->sock, &snl.sa, &addrlen);

So... what's the problem? I don't know. Most of the internet agrees that a net_admin capability ought to clear the way for interfacing with netlink

If I had to guess, I'd point at maybe the:

network,
deny network raw,

at https://github.com/home-assistant/version/blob/master/apparmor_stable.txt#L7-L8

hitting up against the SOCK_RAW flag on the call ... but again nothing here has changed recently? or has it?

Anyway ... this turned into 8 hours of investigation & writing today on top of the 5 of investigation and 3 of probs useless PR (#4834) on Tuesday; ADHD be like that.

the rest of the standard bug form stuff

Anything in the Supervisor logs that might be useful for us?

...
CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
...
System is running in an unhealthy state and needs manual intervention!

#after restart
...
CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
...
(MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged
odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3
24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed
24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged

System Health information

System Information

version core-2024.1.5
installation_type Home Assistant Supervised
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.11.6
os_name Linux
os_version 6.1.63-current-meson64
arch aarch64
timezone America/Denver
config_dir /config
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
GitHub API Calls Remaining 5000
Installed Version 1.34.0
Stage running
Available Repositories 1380
Downloaded Repositories 19
Home Assistant Supervisor
host_os Debian GNU/Linux 12 (bookworm)
update_channel stable
supervisor_version supervisor-2023.12.1
agent_version 1.6.0
docker_version 25.0.1
disk_total 915.8 GB
disk_used 10.8 GB
healthy failed to load: Unhealthy
supported true
supervisor_api failed to load: unreachable
version_api ok
installed_addons Studio Code Server (5.15.0), ESPHome (2023.12.8), Mosquitto broker (6.4.0), Tailscale (0.14.0), Music Assistant BETA (2.0.0b84), Zigbee2MQTT (1.35.1-1), Govee to MQTT Bridge (2024.01.21-33e8cd7c)
Dashboards
dashboards 1
resources 12
views 5
mode storage
Recorder
oldest_recorder_run January 16, 2024 at 9:45 PM
current_recorder_run January 25, 2024 at 7:53 PM
estimated_db_size 290.71 MiB
database_engine sqlite
database_version 3.41.2

Supervisor diagnostics

config_entry-hassio-230a4ddd53de32307921bbc57f86054b.json.txt

Footnotes

  1. When I first investigated this on Tuesday (with docker 25.0.0 installed) it was sufficient to do docker exec -it hassio_supervisor bash to repro as below, but something has changed and now that generates a shell sibling to the pid1 as unrestricted instead of also getting the apparmor profile. Bafflingly at one point Tuesday this behavior seemed to change after I ran systemd-analyze set-log-level debug )

  2. Up to writing this section that didn't click and I was looking at systemd-udev's version ; and it seems possible that the mismatch (with the offered udevd) could be in play as the issue. But the code between it (eudev) and the systemd versions of it and the nested device_monitor_new_full, do not differ greatly, though perhaps systemd's version's if (DEBUG_LOGGING) { gated stanza's may end up instructive]

@donaldguy donaldguy added the bug label Jan 26, 2024
@donaldguy
Copy link
Author

donaldguy commented Jan 26, 2024

Extracting out this bit of ~speculation to its own comment:

why isn't the policy applied on boot?

I didn't know. It might be a kernel bug, or a docker bug, or a systemd bug. Or it might just be a good old fashioned race condition?

At the highest level these are activating in the right order

odroidn2:~:# systemctl status apparmor | grep active
     Active: active (exited) since Thu 2024-01-25 18:42:39 MST; 19min ago
odroidn2:~:# systemctl status hassio-apparmor | grep active
     Active: active (exited) since Thu 2024-01-25 18:42:40 MST; 19min ago
odroidn2:~:# systemctl status hassio-supervisor | grep active
     Active: active (running) since Thu 2024-01-25 18:42:44 MST; 19min ag

this chain is there:

And I can confirm that after a systemctl stop hassio-supervisor && docker rm -f $(docker ps -aq) && aa-remove-unknown && systemctl start hassio-supervisor, you end up at a-different-than-this-issue supervisor restart loop involving such journal lines as

audit[8358]: AVC apparmor="DENIED" operation="change_onexec" info="label not found" error=-2 profile="unconfined" hassio-supervisor" pid=15881 comm="runc:[2:INIT]"
audit: type=1400 audit(1706235970.127:70): apparmor="DENIED" operation="change_onexec" info="label not found" error=-2 profile="unconfined" name="hassio-supervisor" pid=15881 comm="runc:[2:INIT]"
...
hassio-supervisor[15823]: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: unable to apply apparmor profile: apparmor failed to apply profile: write /proc/self/attr/apparmor/exec: no such file or directory: unknown
hassio-supervisor[15823]: Error: failed to start containers: hassio_supervisor

followed by re-pulling the image and trying again fruitlessly in a loop which terminates only when you run systemctl start hassio-apparmor (at which point you end up landing in the profile-applied-to-supervisor-so-it-doesn't work state, until system reboot)

But clearly there is something going (enabalingly) wrong in the application of profile from cold start of an existing hassio_supervisor container

So there may be value (once the profile is not breaking the udev function) in adding to hassio-apparmor.service (and indeed potentially hassos-apparmor.service something like a

ExecStartPre=/usr/bin/aa-enabled -q

and/or indeed fully move the sort of apparmor_parser call which is the heart of these services to be instead a ExecStartPre of the supervisor

@voron
Copy link

voron commented Feb 22, 2024

Having the same issue

24-02-22 15:57:00 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!

Removal of app-armor from supervisor compose file eliminates this error, HA updates work as expected too. Thus I suppose the issue is in app-armor supervisor's profile, as originally reported.

@donaldguy
Copy link
Author

donaldguy commented Feb 29, 2024

specifically, as I sort of said, believe the issue with the policy is,

for some reason now-and-not-before, the interpretation of deny network raw
https://github.com/home-assistant/version/blob/master/apparmor_stable.txt#L8

now applying also to netlink domain, meaning the

I don't believe just adding an allow network netlink raw will work ... (but maybe it will with the right ordering?)

I think giving it a cut out will require instead saying like...

deny network inet raw,
deny network ax25 raw,
deny network ipx raw,
deny network appletalk raw,
deny network netrom raw,
deny network bridge raw,
deny network atpmvc raw,
deny network x25 raw,
deny network inet6 raw,
deny network rose raw,
deny network netbui raw,
deny network security raw,
deny network key raw,
deny network packet raw,
deny network ash raw,
deny network econet raw,
deny network atmsvc raw,
deny network sna raw,
deny network irda raw,
deny network ppox raw,
deny network wanpipe raw,
deny network bluetooth raw,
deny network unix raw,
deny network rds raw,
deny network llc raw,
deny network can raw,
deny network tipc raw,
deny network iucy raw,
deny network rxrpc raw,
deny network isdn raw,
deny network phonet raw,
deny network ieee802154 raw,
deny network caif raw,
deny network alg raw,
deny network nfc raw,
deny network vsock raw,
deny network mpls raw,
deny network ib raw,

Of course ... the logical question there is ... does it actually need non-raw access to most of those domains? (probs no)

so probs the sweeping [allow] network [] [] [*] of line 7 probs should go

maybe want like(?)

allow network inet tcp,
allow network inet6 tcp,
allow network bluetooth,
allow network netlink,

(maybe bridge? unix?)

Copy link

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 30, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 6, 2024
@agners agners added no-stale Avoids issue going stale and removed stale labels Apr 8, 2024
@agners agners reopened this Apr 8, 2024
@agners
Copy link
Member

agners commented Apr 8, 2024

Thanks for this in depth analyzes!

I was able to reproduce the original symptom (Supervisor getting unhealthy) on Supervised on a Debian 12 Bookworm.

I then was actually wondering if I can reproduce this with HAOS 12.2.rc1, specifically because it comes with new Buildroot which pushes it closer to what Debian Bookworm is using (specifically, Docker 25). However, I was not able to reproduce the problem on HAOS 12.2.rc1.

Which really made me question, what is different between Debian 12 and HAOS 12.2.rc1 🤔

Just like in Supervised, the AppArmor profile only gets applied on initial Supervisor container start on HAOS 12.2.rc1 as well. So this part seems to (mis)behave exactly the same way.

HAOS uses AppArmor 3.1.2 already in HAOS 11, no update has taken place there. Debian 12 uses 3.0.8. I looked through the changes between the two versions, but nothing caught my eyes.

So far I am a bit clueless 😢

I feel it should always have not worked (as in, deny network raw should always lead to deny network netlink raw).

In any case, I think applying an explicit allow protocols as you suggested make sense. We don't need Bluetooth in Supervisor, so I think I'd go with:

network inet  stream,
network inet6 stream,
network inet  dgram,
network inet6 dgram,
network netlink raw,

@agners
Copy link
Member

agners commented Apr 9, 2024

I am still lost as to why Debian behaves different than HAOS here 😢

A few things to note: Docker exec seems to not apply the AppArmor profile:

hassio-supervisor (enforce)     root       17131  0.2 13.8 192868 136592 ?       Ssl  15:32   0:00 python3 -m supervisor
hassio-supervisor (enforce)     root       17132  0.0  0.3   4160  3516 ?        Ss   15:32   0:00 bash /usr/bin/bashio ./run watchdog
hassio-supervisor (enforce)     root       17151  0.0  0.0   1612     4 ?        S    15:32   0:00 sleep 300
unconfined                      root       17325  0.0  0.4  10008  4488 pts/1    S+   15:33   0:00 sudo docker exec -it hassio_supervisor /bin/bash
unconfined                      root       17326  0.0  0.0  10008   480 pts/6    Ss   15:33   0:00 sudo docker exec -it hassio_supervisor /bin/bash
unconfined                      root       17327  0.0  2.8 1770644 28048 pts/6   Sl+  15:33   0:00 docker exec -it hassio_supervisor /bin/bash
unconfined                      root       17342  0.0  0.2   2616  2288 pts/0    Ss   15:33   0:00 /bin/bash

Hence my udevadm command always worked, even with the uncorrected profile.

The AppArmor rules processing seems not very well defined to me, especially for networking. The syntax is:

NETWORK RULE = [ QUALIFIERS ] 'network' [ DOMAIN ] [ TYPE | PROTOCOL ]

The two rules so far:

network,
deny network raw,

Again, technically allowed all networking, but then disallowed all raw protocol for all domains, even for netlink.

In any case, raw seems to be disallowed by default, so allowing only netlink explicitly seems to be the way to go. It seems that unix domain is also allowed by default. I did not found where this defaults are specified. But it is probably better to enable it explicitly too, so I end up with:

network unix stream,
network inet stream,
network inet6 stream,
network inet dgram,
network inet6 dgram,
network netlink raw,

I'll create a PR to update the profile, for dev and beta channel first.

@RichardPar
Copy link

just to confirm I was bitten by the above deny rule. I changed to what @agners suggested and all is healthy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug no-stale Avoids issue going stale
Projects
None yet
Development

No branches or pull requests

4 participants