restarting i3 forgets i3bar as a zombie process #5756

Virv12 opened this issue Nov 4, 2023

Virv12 commented Nov 4, 2023

Current Behavior

When restarting i3, a new i3bar process is spawned and the old one is left alive as a zombie.

Expected Behavior

The old i3bar process is correctly closed.

Reproduction Instructions

Press $mod+Shift+r to restart i3.


Output of i3 --moreversion 2>&-:

$ i3 --moreversion 2>&-
Binary i3 version:  4.23 (2023-10-29) © 2009 Michael Stapelberg and contributors
Running i3 version: 4.23 (2023-10-29) (pid 1156)
Loaded i3 config:
  /home/filippo/.config/i3/config (main) (last modified: Mon 23 Oct 2023 10:38:31 PM CEST, 1030139 seconds ago)

The i3 binary you just called: /usr/bin/i3
The i3 binary you are running: i3
- Linux Distribution & Version: Archlinux
- Are you using a compositor (e.g., xcompmgr or compton): No

Tomorrow I will try to provide a log file.
Thanks for your help!

i3bot commented Nov 4, 2023

I don’t see a link to a log file. Did you follow the debugging instructions? (In case you actually provided a link to a logfile, please ignore me.)

For me, the zombie processes get cleaned up after a while; in 4.22, however, there were no zombie processes at all.

I ran a git bisect for this, and it reports 3ae5f31 as the first bad commit.

@kolayne can you take a look?

kolayne commented Nov 5, 2023

I can reproduce the issue. As far as I can tell, i3bar stays in the zombie state until any other child of i3 terminates (I reckon that triggers libev to reap all the zombies).


Before 3ae5f31, when the double forking logic was used, every child process of i3 was immediately reparented, so taking care of that process's zombieness was never i3's responsibility. After double forking was removed, child processes of i3 started dying, becoming zombies, and getting reaped by libev, regardless of how they were spawned and even regardless of whether they were spawned by this current i3 or one of its previous incarnations (previous processes which execed into the current one). That's why, normally, zombie processes don't appear even if one restarts i3 (and the old i3 is replaced with its new incarnation).

The problem now is a race condition: if a child of i3 is still alive when i3 is restarting but dies before i3 creates the libev event loop, the event of that child's termination is lost. This is what seems to be happening with i3bar.

How to fix

I see a couple of possible solutions, but I am not sure which one fits best.

  1. Ostrich algorithm. Zombie processes are unlikely to be left behind (it only happens to processes that terminate during the very short window in which the new incarnation of i3 is initializing, which, in a common setup, means only i3bar), and these zombies do not stay around for long, as they are reaped as soon as any other child dies. In an individual setup this can be worked around with a config directive such as

    exec_always --no-startup-id killall -SIGCHLD i3

    which cleans zombies up.

    What I dislike about this solution is that overall the behavior of i3 has degraded, although it's not very noticeable.

  2. Make sure all zombies are reaped when i3 starts. This can be achieved very easily and naturally by doing a raise(SIGCHLD) in the main of i3 before entering the event loop, which will make libev reap all previous zombies. Note that, by design of libev, there is no race condition between raise(SIGCHLD) and entering the event loop: the event loop only has to be created, not entered, before the signal is raised. From ev(3):

    Libev grabs "SIGCHLD" as soon as the default event loop is initialised. This is necessary to guarantee proper behaviour even if the first child watcher is started after the child exits. The occurrence of "SIGCHLD" is recorded asynchronously, but child reaping is done synchronously as part of the event loop processing. Libev always reaps all children, even ones not watched.

    A flaw I see in this solution is that it changes the current behavior of i3. Suppose I write my own wrapper for i3 that creates a child but doesn't reap it (so that my main process ends up with a zombie child) and then execs i3. With the current version of i3, that zombie child doesn't get reaped until another child of i3 dies; with this change, it would be reaped right at startup. On the one hand, the current behavior kind of makes sense, as that child is not i3's responsibility; on the other hand, there's no one else to reap that zombie anyway. Could there be anyone out there relying on the current behavior?

  3. Finally, another option is to, on i3 restart, block the SIGCHLD signal first, then exec the new incarnation, then unblock the signal, which would cause the pending SIGCHLDs (if any) to be delivered to the new process. This keeps the behavior described in the previous point the same as in the current version of i3, but seems like an overcomplication.

kolayne commented Nov 5, 2023

Okay, after writing it all out, I have come to the conclusion that the second option probably makes more sense than the other two.

The following patch fixes it for me:

diff --git a/src/main.c b/src/main.c
index 6fae7e41..e294f50e 100644
--- a/src/main.c
+++ b/src/main.c
@@ -1202,6 +1202,17 @@ int main(int argc, char *argv[]) {
      * when calling exit() */
+    /* There might be children who died before we initialized the event loop,
+     * e.g., when restarting i3 (see #5756).
+     * To not carry zombie children around, raise the signal to invite libev to
+     * reap them.
+     *
+     * Note that there is no race condition between raising the signal and
+     * entering the event loop below: the signal is just to notify libev that
+     * zombies might already be there. The child reaping will happen in the
+     * event loop anyway. */
+    (void)raise(SIGCHLD);
     sd_notify(1, "READY=1");
     ev_loop(main_loop, 0);

kolayne added a commit to kolayne/i3 that referenced this issue Feb 7, 2024
One case when this might be useful is when i3 is restarted and there are
children that terminate after the previous i3 instance shut down but
before the new one set things up.

Fixes i3#5756
kolayne added a commit to kolayne/i3 that referenced this issue May 7, 2024
One case when this might be useful is when i3 is restarted and there are
children that terminate after the previous i3 instance shut down but
before the new one set things up.

Fixes i3#5756
