kanidm-unixd stops returning users after a while #2632
Comments
The 10s timeout comes from the pam/nss module not receiving a response in time from the unixd daemon, so that is the part to look at here. The most likely cause is recursion: this has happened previously, where a call to kanidm unixd accidentally triggers recursion via another nss module. Since the worker lock is still held, it causes a break. We may need to handle the recursion case better and throw a huge error.
For passwd and group in /etc/nsswitch.conf, can you check if you still need sss and systemd? Alternatively, can you set the /etc/nsswitch.conf order to:
Otherwise, we should also likely list what the request contains at that point,
I'm testing out removing sss and systemd from nsswitch.conf since they aren't in use. Of course the annoying thing here is that I'm testing for a negative, but if it goes a few days without issues that would suggest you're right. I'll keep an eye on it.
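For anyone checking their own machine, here is a minimal sketch that lists which NSS sources are configured for passwd and group, so leftover entries such as sss or systemd are easy to spot. The `SAMPLE` text is a hypothetical nsswitch.conf made up for illustration, not the file from this report, and `nss_sources` is a helper name invented here.

```python
import re

# Hypothetical nsswitch.conf content, for illustration only.
SAMPLE = """\
# example file
passwd: files kanidm sss systemd
group:  files kanidm [SUCCESS=merge] sss systemd
shadow: files
"""

def nss_sources(conf_text, databases=("passwd", "group")):
    """Return {database: [source, ...]} for the requested databases."""
    result = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if ":" not in line:
            continue
        db, _, rest = line.partition(":")
        db = db.strip()
        if db not in databases:
            continue
        # Remove action items such as [NOTFOUND=return]; keep source names only.
        rest = re.sub(r"\[[^\]]*\]", " ", rest)
        result[db] = rest.split()
    return result

for db, sources in nss_sources(SAMPLE).items():
    print(db, "->", ", ".join(sources))
```

To inspect a real system, pass the contents of /etc/nsswitch.conf instead of `SAMPLE`.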
I still would like to leave this open so we can improve this in other ways.
Good news, the problem hasn't recurred in a week, so I'm pretty confident that change fixed it. You're probably right that it's something around recursion. If you add additional logging in the future I'm happy to put the config back and induce the break again to see what happens.
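To make the suspected recursion failure concrete, here is a toy sketch of a request handler that holds a non-reentrant lock while resolving a name. It is not kanidm-unixd's actual code; `worker_lock`, `handle_request`, and both resolvers are hypothetical names, and the short timeout stands in for the 10s client timeout discussed above.

```python
import threading

# Non-reentrant lock standing in for the daemon's worker lock (hypothetical).
worker_lock = threading.Lock()

def handle_request(name, resolve, timeout=0.2):
    """Serve one lookup while holding the worker lock.

    Returns the resolver's answer, or None if the lock could not be
    acquired within `timeout` seconds.
    """
    if not worker_lock.acquire(timeout=timeout):
        return None
    try:
        return resolve(name)
    finally:
        worker_lock.release()

def direct_resolver(name):
    # Answers without calling back into the handler.
    return f"entry for {name}"

def recursive_resolver(name):
    # Simulates another NSS module calling back into the same daemon
    # while the original request still holds the lock.
    return handle_request(name, direct_resolver)

print(handle_request("alice", direct_resolver))     # lookup succeeds
print(handle_request("alice", recursive_resolver))  # nested lookup times out
```

In this sketch the nested request can never get the lock, so it gives up after the timeout and the caller sees no result, which resembles the "hang, then user not found" symptom.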
I have a set of machines that use kanidm-unixd for authentication. One of them is running Dovecot with PAM as userdb/passdb. Occasionally I get a user report that Dovecot authentication has stopped working. When I check on the machine, I find that methods of looking up a user like
getent passwd <user>
or
doveadm user <user>
hang for a short time (5-10s) and then say that the user could not be found. Restarting kanidm-unixd fixes it.
I've only seen this happen on the mail server. It could be that something is different on the mail server from other machines, but all the auth config is done by an Ansible module so it should be very consistent. I suspect this might be a race condition or something that has a chance of happening anywhere, but because of mail clients polling the mail server does a lot more auth cycles than any other machine does so it encounters the problem far more often. The time period between the problem recurring is random but probably averages around two days.
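Since the failure is intermittent, a small probe that times a lookup command can help catch the broken state early. The following is a rough sketch under assumptions: `probe` and `looks_broken` are names invented here, the 5 second threshold is taken from the hang duration described above, and treating any non-zero exit code as a failed lookup is a simplification.

```python
import subprocess
import sys
import time

def probe(cmd, timeout=15.0):
    """Run a lookup command and classify the outcome.

    Returns (status, seconds), where status is "ok", "failed", or "timeout".
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        status = "ok" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"
    return status, time.monotonic() - start

def looks_broken(status, seconds, slow_after=5.0):
    """A slow failure matches the reported symptom; a fast failure is
    more likely a user that genuinely does not exist."""
    return status == "timeout" or (status == "failed" and seconds >= slow_after)

# Self-check with a command that always succeeds.
status, seconds = probe([sys.executable, "-c", "pass"])
print(status, looks_broken(status, seconds))
```

On an affected machine the command would be something like `["getent", "passwd", "<user>"]`, run periodically from cron or a timer.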
I've had a hard time getting good info on where exactly it's going wrong, but I turned on debug logging on kanidm-unixd and observed the following. When it is in the broken state, I run
doveadm user <me>
(which hangs for a bit and then says no such user) and see the following in the kanidm logs:
And that's it, nothing after that. After I restart kanidm-unixd and try the same command again, it returns the user info as expected and the kanidm-unixd logs show:
Or, alternately, it might show that cached info was used.
In other words, when the system is in the "broken state," it seems like kanidm-unixd doesn't actually try to look up the user, or for some other reason never gets to where it emits the log entry about whether or not the cache is expired.
Kanidm version details
kanidm(d) version: kanidm_unixd 1.1.0-rc.16
uname -a: Fedora release 39 (Thirty Nine). Linux mx.waffle.tech 6.5.6-300.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:57:21 UTC 2023 x86_64 GNU/Linux
System configuration
System configuration is directly based on the examples in the kanidm book.
nsswitch.conf:
Relevant PAM file:
Any other comments
I am watching to see if this happens on any other systems and haven't seen it so far. It very much has the feeling of a race condition to me, given that it happens intermittently and only on the more heavily loaded system, but of course that's just a hunch. I haven't gotten useful logging out of anything else so far, but PAM can be a royal pain to troubleshoot. Still, the logs I see from kanidm-unixd (or more so the lack of logs when the system is broken) make me think that the problem is happening within unixd rather than somewhere else.