logind: lxc payload no longer ran in scope under root user session #32929

Closed
bluca opened this issue May 19, 2024 · 1 comment · Fixed by #32934
Labels
login lxc/lxd regression ⚠️ A bug in something that used to work correctly and broke through some recent commit

bluca commented May 19, 2024

5099a50 (the first commit of PR #30884) introduced a regression that went unnoticed due to a test bug.
Previously, a payload executed by lxc, such as an autopkgtest test runner, would be run in a scope under the root user session: $XDG_SESSION_ID was set, and /proc/self/cgroup would show:

0::/user.slice/user-0.slice/session-9.scope

$ loginctl list-sessions
SESSION UID USER SEAT TTY STATE  IDLE SINCE
      9   0 root -    -   active no   -
1 sessions listed.

Since that commit, there is no root user session (or any session) at all, $XDG_SESSION_ID is not set, and /proc/self/cgroup now shows:

0::/.lxc
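
As a rough illustration of the check described above, the following Python sketch extracts a logind session ID from the contents of /proc/self/cgroup. The function name and the regular expression are assumptions made for this example and are not part of systemd; only the unified-hierarchy `0::` line is considered.

```python
import re

# Assumed pattern for a logind session scope component, e.g. "session-9.scope".
_SESSION_SCOPE = re.compile(r"/session-(?P<id>[^/]+)\.scope(?:/|$)")


def session_id_from_cgroup(cgroup_text):
    """Return the session ID found in a /proc/<pid>/cgroup dump, or None."""
    for line in cgroup_text.splitlines():
        # Each line has the form "hierarchy-ID:controllers:path".
        parts = line.split(":", 2)
        if len(parts) != 3 or parts[0] != "0":
            continue
        match = _SESSION_SCOPE.search(parts[2])
        if match:
            return match.group("id")
    return None
```

With the two cgroup paths quoted in this report, the first yields a session ID and the second yields `None`, which matches the presence or absence of $XDG_SESSION_ID.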

In the logs we can see:

May 19 23:14:45 autopkgtest-lxc-fxmuec su[251]: (to root) root on none
May 19 23:14:45 autopkgtest-lxc-fxmuec su[251]: pam_unix(su:session): session opened for user root(uid=0) by (uid=0)
May 19 23:14:45 autopkgtest-lxc-fxmuec su[251]: pam_systemd(su:session): Failed to create session: No such device or address
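
To spot this failure in a captured journal, a small scanner along these lines may help. It is only a sketch: the function name is hypothetical, and the pattern is derived solely from the pam_systemd line quoted above, so other message variants would not be matched.

```python
import re

# Pattern modelled on the single log line quoted in this report.
_FAILURE = re.compile(
    r"pam_systemd\((?P<service>[^:)]+):session\): "
    r"Failed to create session: (?P<reason>.+)$"
)


def find_session_failures(log_text):
    """Return (service, reason) pairs for each failed session creation."""
    failures = []
    for line in log_text.splitlines():
        match = _FAILURE.search(line)
        if match:
            failures.append((match.group("service"), match.group("reason").strip()))
    return failures
```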

From the full debug log, it is CreateSessionWithPIDFD that returns ENXIO. Adding a fallback to CreateSession doesn't help, as that also fails with ENXIO.

CreateSession gets called with: create_session(uid=0, leader_pid=0, leader_pidfd=10, service=su, type=unspecified, class=background, desktop=, cseat=, vtnr=0, tty=, display=, remote=0, remote_user=root, remote_host=, flags=0)

In the successful case, the session is created as expected.

The test bug was that the testbed booted with the old systemd and logind version (whatever was in the base distribution) and then installed the new packages at runtime, so the lxc payload was already running and had been assigned to a session by the code from the distribution. Fixing the test by rebooting after installing the code built from the branch is enough to make the issue show up.
For some reason this is not a problem under qemu; I have no idea why. I can reproduce this on the Semaphore CI very easily.

Successful run, at the commit before the one mentioned above:

https://the-real-systemd.semaphoreci.com/jobs/3a4a365c-4851-4a12-b9a0-85d81412a594

Failing run, at the commit mentioned above:

https://the-real-systemd.semaphoreci.com/jobs/70dd30b9-fe83-4adc-b179-b37d06df6ad7

Failing run on latest main, with full logind debug level logs:

https://the-real-systemd.semaphoreci.com/jobs/53096557-7f4c-4100-8720-cdd031f93536

To reproduce, enable Semaphore CI on a GitHub fork, then cherry-pick this commit, which ensures the right build and test options are used (a shortened build that only runs the logind test, with the reboot fix):

bluca@3c371ec

@bluca bluca added login regression ⚠️ A bug in something that used to work correctly and broke through some recent commit lxc/lxd labels May 19, 2024
@bluca bluca added this to the v256 milestone May 19, 2024

bluca commented May 20, 2024

Found the issue: 5099a50 switched from manager_get_user_by_pid() to manager_get_session_by_pidref(), but due to changes in #29976 the latter is not fully compatible: it fails with ENXIO if cg_pidref_get_unit() fails, instead of returning "not found" as it should. As a result, the su login was no longer tracked as a logind session via pam_systemd.
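
The difference between the two behaviours can be modelled with a simplified Python sketch. This is not systemd's code: `lookup_unit`, `get_session_strict` and `get_session_lenient` are hypothetical stand-ins, and the unit lookup is reduced to inspecting the last cgroup path component. The point is only that a failed unit lookup should map to "not found" (here `None`) rather than propagate ENXIO.

```python
import errno


def lookup_unit(cgroup_path):
    """Hypothetical stand-in for a cgroup-to-unit lookup.

    Raises OSError(ENXIO) when the path does not end in a unit name.
    """
    last = cgroup_path.rstrip("/").rsplit("/", 1)[-1]
    if not last.endswith((".scope", ".service", ".slice")):
        raise OSError(errno.ENXIO, "No such device or address")
    return last


def get_session_strict(sessions_by_unit, cgroup_path):
    """Models the regressed behaviour: the lookup error propagates."""
    unit = lookup_unit(cgroup_path)
    return sessions_by_unit.get(unit)


def get_session_lenient(sessions_by_unit, cgroup_path):
    """Models the intended behaviour: no unit simply means 'not found'."""
    try:
        unit = lookup_unit(cgroup_path)
    except OSError as e:
        if e.errno == errno.ENXIO:
            return None
        raise
    return sessions_by_unit.get(unit)
```

In this model, a process at `/.lxc` makes the strict variant raise ENXIO, while the lenient variant reports that no session exists yet, which is the answer a caller about to create a new session needs.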

bluca added a commit to bluca/systemd that referenced this issue May 20, 2024
When running inside an LXC container the 'su' process will not be part of
any unit or slice.

manager_get_user_by_pid() which was used until v255 (included) does not fail
if it cannot find a unit/slice, but simply returns 'not found'. Do the same
in manager_get_session_by_pidref().

This was not detected as Semaphore CI does not reboot the testbed before
the logind test, so the session is started by the old logind from the base
distro, instead of the one being tested.

Follow-up for 8494f56
Follow-up for 5099a50

Fixes systemd#32929
bluca added a commit to bluca/systemd-stable that referenced this issue May 26, 2024
When running inside an LXC container the 'su' process will not be part of
any unit or slice.

manager_get_user_by_pid() which was used until v255 (included) does not fail
if it cannot find a unit/slice, but simply returns 'not found'. Do the same
in manager_get_session_by_pidref().

This was not detected as Semaphore CI does not reboot the testbed before
the logind test, so the session is started by the old logind from the base
distro, instead of the one being tested.

Follow-up for 8494f56
Follow-up for 5099a50

Fixes systemd/systemd#32929

(cherry picked from commit eb56b56)
keszybz pushed a commit to systemd/systemd-stable that referenced this issue May 27, 2024