Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kanidm-unixd stops returning users after a while #2632

Open
jcrawfordor opened this issue Mar 7, 2024 · 4 comments
Open

kanidm-unixd stops returning users after a while #2632

jcrawfordor opened this issue Mar 7, 2024 · 4 comments

Comments

@jcrawfordor
Copy link
Contributor

jcrawfordor commented Mar 7, 2024

I have a set of machines that use kanidm-unixd for authentication. One of them is running Dovecot with PAM as userdb/passdb. Occasionally I get a user report that Dovecot authentication has stopped working. When I check on the machine, I find that methods of looking up a user like getent passwd <user> or doveadm user <user> hang for a short time (5-10s) and then say that the user could not be found. Restarting kanidm-unixd fixes it.

I've only seen this happen on the mail server. It could be that something is different on the mail server from other machines, but all the auth config is done by an Ansible module so it should be very consistent. I suspect this might be a race condition or something that has a chance of happening anywhere, but because of mail clients polling the mail server does a lot more auth cycles than any other machine does so it encounters the problem far more often. The time period between the problem recurring is random but probably averages around two days.

I've had a hard time getting good info on where exactly it's going wrong, but I turned on debug logging on kanidm-unixd and observed the following. When it is in the broken state, I run doveadm user <me> (which hangs for a bit and then says no such user) and see the following in the kanidm logs:

Mar 06 18:19:16 mx.waffle.tech kanidm_unixd[94979]: 00000000-0000-0000-0000-000000000000 DEBUG    🐛 [debug]: Accepted connection
Mar 06 18:19:16 mx.waffle.tech kanidm_unixd[94979]: 00000000-0000-0000-0000-000000000000 DEBUG    🐛 [debug]: nssaccountbyname req
Mar 06 18:19:16 mx.waffle.tech kanidm_unixd[94979]: 00000000-0000-0000-0000-000000000000 DEBUG    🐛 [debug]: get_usertoken

And that's it, nothing after that. After I restart kanidm-unixd and try the same command again, it returns the user info as expected and the kanidm-unixd logs show:

Mar 06 18:25:18 mx.waffle.tech kanidm_unixd[101107]: 00000000-0000-0000-0000-000000000000 DEBUG    🐛 [debug]: Accepted connection
Mar 06 18:25:18 mx.waffle.tech kanidm_unixd[101107]: 00000000-0000-0000-0000-000000000000 DEBUG    🐛 [debug]: nssaccountbyname req
Mar 06 18:25:18 mx.waffle.tech kanidm_unixd[101107]: 00000000-0000-0000-0000-000000000000 DEBUG    🐛 [debug]: get_usertoken
Mar 06 18:25:18 mx.waffle.tech kanidm_unixd[101107]: 00000000-0000-0000-0000-000000000000 DEBUG    🐛 [debug]: online expired, refresh cache
Mar 06 18:25:19 mx.waffle.tech kanidm_unixd[101107]: 630638a8-2257-4e0f-b154-895cb161d420 DEBUG    perform_get_request [ <...>

Or, alternately, it might show that cached info was used.

In other words, when the system is in the "broken state," it seems like kanidm-unixd doesn't actually try to look up the user, or for some other reason never gets to where it emits the log entry about whether or not the cache is expired.

Kanidm version details

  • Output of kanidm(d) version: kanidm_unixd 1.1.0-rc.16
  • Are you running it in a container? If so, which image/tag?: no
  • If not a container, how'd you install it: From an RPM for Fedora that I built using a specfile based on the OpenSuSE specfile
  • Operating System / Version (On Unix please post the output of uname -a): Fedora release 39 (Thirty Nine). Linux mx.waffle.tech 6.5.6-300.fc39.x86_64 Support labelling system protected objects #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:57:21 UTC 2023 x86_64 GNU/Linux

System configuration

System configuration is directly based on the examples in the kanidm book.

nsswitch.conf:

passwd:     files kanidm sss systemd
shadow:     files
group:      files kanidm sss systemd
hosts:      files myhostname resolve [!UNAVAIL=return] dns
services:   files sss
netgroup:   files sss
automount:  files sss

aliases:    files
ethers:     files
gshadow:    files
networks:   files dns
protocols:  files
publickey:  files
rpc:        files

Relevant PAM file:

auth        required                                     pam_env.so
auth        required                                     pam_faildelay.so delay=2000000
auth        [default=1 ignore=ignore success=ok]         pam_usertype.so isregular
auth        [default=1 ignore=ignore success=ok]         pam_localuser.so
auth        sufficient                                   pam_unix.so nullok
auth        [default=1 ignore=ignore success=ok]         pam_usertype.so isregular
auth        sufficient                                   pam_kanidm.so ignore_unknown_user
auth        required                                     pam_deny.so

account     sufficient                                   pam_unix.so
account     sufficient                                   pam_localuser.so
account     sufficient                                   pam_usertype.so issystem
account     sufficient                                   pam_kanidm.so ignore_unknown_user
account     required                                     pam_permit.so

password    requisite                                    pam_pwquality.so local_users_only
password    sufficient                                   pam_unix.so yescrypt shadow nullok use_authtok
password    sufficient                                   pam_kanidm.so
password    required                                     pam_deny.so

session     optional                                     pam_keyinit.so revoke
session     required                                     pam_limits.so
-session    optional                                     pam_systemd.so
session     [success=1 default=ignore]                   pam_succeed_if.so service in crond quiet use_uid
session     required                                     pam_unix.so
session     optional                                     pam_kanidm.so

Any other comments

I am watching to see if this happens on any other systems and haven't seen it so far. It very much has the feeling of a race condition to me from the fact that it happens intermittently and only on the more heavily loaded system, but of course that's just a hunch. I haven't gotten useful logging out of anything else so far but PAM can be a royal pain to troubleshoot. Still, the logs I see from kanidm-unixd (or moreso the lack of logs when the system is broken) make me think that the problem is happening within unixd rather than somewhere else.

@Firstyear
Copy link
Member

The 10s timeout will be from the pam/nss module not receiving a response in time from the unixd daemon, so that is the part to look at here.

The most likely cause is recursion - this has happened previously where a call to kanidm unixd can accidentally trigger recursion via another nss module. Since the the worker lock is still held it causes a break. We may need to handle the recursion case better and throw a huge error.

For passwd and group in /etc/nsswitch.conf can you check if you still need sss and systemd?

Alternately, can you set the /etc/nsswitch order to: group: files sss systemd kanidm and the same for passwd. This way everything else has a chance to order first, and if it does recurse, the other modules will get the call.

Otherwise, we sohuld also likely list what the request contains at that point,

@jcrawfordor
Copy link
Contributor Author

I'm testing out removing sss and systemd from nsswitch.conf since they aren't in use. Of course the annoying thing here is that I'm testing for a negative, but if it goes a few days without issues that would suggest you're right. I'll keep an eye on it.

@Firstyear
Copy link
Member

I still would like to leave this open so we can improve this in other ways.

@jcrawfordor
Copy link
Contributor Author

Good news, the problem hasn't recurred in a week, so I'm pretty confident that change fixed it. You're probably right that it's something around recursion. If you add additional logging in the future I'm happy to put the config back and induce the break again to see what happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🐛 Bugs
Development

No branches or pull requests

2 participants