
Replicating the services database #119

Open · progval opened this issue Apr 20, 2024 · 1 comment

progval (Contributor) commented Apr 20, 2024

Currently, sable_services writes its database as a single JSON file on disk. This is similar to what Atheme does, so we know it works at least at Libera.Chat's scale.
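For context, here is a minimal sketch of how such a single-file store can be persisted safely. This is purely illustrative, not Sable's actual code; the write-to-temp-then-rename pattern keeps a crash mid-write from corrupting the database:

```rust
// Illustrative only: atomically replace the on-disk JSON database.
use std::fs;
use std::io::Write;

fn save_database(state: &serde_json::Value, path: &str) -> std::io::Result<()> {
    let json = serde_json::to_vec_pretty(state)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
    let tmp_path = format!("{path}.tmp");
    let mut tmp = fs::File::create(&tmp_path)?;
    tmp.write_all(&json)?;
    tmp.sync_all()?; // ensure the bytes hit disk before the rename
    fs::rename(&tmp_path, path) // atomic replacement on POSIX filesystems
}
```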

While this can easily be replicated to other services, it means sable_services going down causes an outage where people cannot log in, channel ops cannot be opped, etc. This happens on Libera.Chat from time to time.

Given Sable's distributed architecture, we can do better here. @spb's idea is to run multiple sable_services nodes, one of which would be the leader and would stream its database to the others.
The database could remain a single JSON file, but copying this file over and over might become a scaling concern. We see a few options to solve this:

  1. Use a database that supports streaming replication, like PostgreSQL.
  2. Make sable_services nodes coordinate over the Sable network, each keeping its own independent database.
  3. Make sable_services nodes share a single replicated database (Cassandra, something on top of Ceph, CockroachDB, ...).

With options 1 and 2, if we want high availability, sable_services needs some form of leader election, because we can't allow writes to the same objects from multiple nodes at the same time. PostgreSQL does not solve this for us; it expects users to tell it when to switch a node between follower and leader roles.
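To make that concrete, here is a minimal sketch of one way leader election could work with option 1: use a single row in the shared PostgreSQL instance as a time-limited lease. Everything here is an assumption for illustration, not anything that exists in Sable: the `services_leader` table, the 10-second lease, and the use of tokio-postgres.

```rust
// Hypothetical lease-based leader election: a node is leader while it
// holds a time-limited lease stored in a single database row.
use tokio_postgres::Client;

async fn try_acquire_lease(
    db: &Client,
    node_id: &str,
) -> Result<bool, tokio_postgres::Error> {
    // The upsert touches the row only if the lease is free, expired, or
    // already held by this node, so at most one node can win at a time.
    let rows = db
        .execute(
            "INSERT INTO services_leader (id, holder, lease_until)
             VALUES (1, $1, now() + interval '10 seconds')
             ON CONFLICT (id) DO UPDATE
                 SET holder = EXCLUDED.holder,
                     lease_until = EXCLUDED.lease_until
                 WHERE services_leader.holder = EXCLUDED.holder
                    OR services_leader.lease_until < now()",
            &[&node_id],
        )
        .await?;
    Ok(rows == 1)
}
```

The active node would renew the lease well before it expires and stop acting as leader if a renewal fails. Note this only elects the active sable_services node; failing over the PostgreSQL primary itself remains the separate manual step mentioned above.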

And option 3 may be unsustainable for Libera, as all solutions I'm aware of in this space require extensive specialized knowledge of that particular system (maybe not CockroachDB? I've never tried it). In particular, Cassandra and Ceph are designed for petabyte-scale data, which is far beyond what we need here. Additionally, they often come with constraints and caveats on what software developers can do with the database.

spb (Collaborator) commented Apr 20, 2024

> While this can easily be replicated to other services, it means sable_services going down causes an outage where people cannot log in, channel ops cannot be opped, etc. This happens on Libera.Chat from time to time.

Minor correction: in the current code, if services are down then you can't log in and can't add new channel access, but anyone already logged in can keep using the channel access they already have. That already makes short downtime less of a problem than it is with the current setup, so what I'm most concerned about is preventing data loss when switching over.

More generally:

Option 3 would have the advantage that the rest of the network need not care which node is active; anything that requires services involvement could just be sent to any node with that capability. The main requirement we'd place on the database in that scenario is that changes are committed immediately and can't later fail or be rolled back; I suspect the main roadblocks here would be the operational ones you mentioned.
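As a sketch of that contract (the trait and types below are hypothetical, not Sable's real interface), the write path any node exposes would need to look something like this, where returning Ok means the change can no longer be lost:

```rust
// Hypothetical write interface for option 3: Ok(()) may only be returned
// once the change is durable on enough replicas that it cannot be lost
// or rolled back, so any sable_services node can safely accept writes.
pub struct AccountUpdate {
    pub account: String,
    pub change: String, // stand-in for a real change description
}

#[allow(async_fn_in_trait)]
pub trait ServicesDb {
    type Error;

    // Must resolve only after a quorum of replicas has durably committed
    // the update; a later read from any node must observe it.
    async fn commit_account_update(
        &self,
        update: AccountUpdate,
    ) -> Result<(), Self::Error>;
}
```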

Between options 1 and 2 I suspect it comes down to trading off development versus operational effort, which is probably a conversation to have internally before committing to either route.
