Recently a user of Kanidm had an outage where their fault-tolerant load balancer setup failed and both nodes went down. This caused their client machines to be unable to contact Kanidm for authentication.
Ordinarily this is not an issue since the unixd user cache would allow offline auth, but the user in this case had not yet logged into the machine and as a result did not have cached credentials.
The question is whether there are ways we could make this more robust. Some initial ideas:
- Use SRV records or similar for load balancing, rather than a dedicated load balancer
- Allow a discovery URL that points to instances directly (which can be discovered via replication etc.)
- Allow the client to list multiple direct URLs to the various instances
- Nominate a group of users that are "pre-cached" into the unixd cache on critical machines
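For the multiple-URL idea, the client-side behaviour could be a simple ordered failover: try each configured instance until one responds. This is only a sketch; `first_reachable` and the example hostnames are hypothetical, not part of any real Kanidm API.

```python
# Hypothetical sketch of client-side failover across several directly
# configured instance URLs (idea 3 above). The probe is injected so the
# example stays self-contained; a real client would attempt an HTTPS
# request against each URL instead.

def first_reachable(urls, probe):
    """Return the first URL whose probe succeeds, or None if all fail."""
    for url in urls:
        try:
            if probe(url):
                return url
        except OSError:
            # Treat connection errors as "instance down"; try the next one.
            continue
    return None

# Usage with a fake probe: pretend only the second instance is up.
urls = ["https://idm1.example.com", "https://idm2.example.com"]
up = {"https://idm2.example.com"}
print(first_reachable(urls, lambda u: u in up))  # → https://idm2.example.com
```

A refinement would be to shuffle or weight the list so load spreads across instances rather than always hammering the first entry.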
Something to consider here could be future "site discovery" with distributed replicas, allowing clients to look up which nodes are in their site.
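The SRV-record idea could also dovetail with site discovery: per-site service records would let a client find its local replicas first. A hypothetical zone might look like this (the `_kanidm` service label and site subdomains are illustrative, not an established convention):

```
; Hypothetical SRV records for instance discovery.
; priority weight port target
_kanidm._tcp.example.com.      3600 IN SRV 10 50 443 idm1.example.com.
_kanidm._tcp.example.com.      3600 IN SRV 10 50 443 idm2.example.com.

; Per-site records could support future "site discovery":
_kanidm._tcp.syd.example.com.  3600 IN SRV 10 50 443 idm-syd1.example.com.
```

Clients would fall back from the site-scoped name to the global one when no site records exist.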