
Roadmap consideration: rely on database layer for transaction, replication, and durability #270

Open
hixichen opened this issue Apr 6, 2024 · 5 comments

hixichen commented Apr 6, 2024

Claim: The statement is my own and does not represent my company

As someone who has operated Vault for a few years in production environments, I feel that Vault's design has become trapped in a cocoon of its own making.

In modeling a generic product on an API - Controller - DB framework, Vault places itself partially within the DB's responsibilities.

This includes investing significantly in features like Raft/replication and transactions, which, in my opinion, add an unnecessary burden.

Taking a quick look at the production database requirements for secrets, keys, and certificates,

  • Durability: Guarantees that keys and secrets are preserved without loss, using advanced data persistence methods to withstand failures. For example, S3 is designed for eleven nines of durability, while Spanner offers an availability SLA of up to five nines.
  • Reliability: Ensures continuous service availability, engineered to be always accessible to meet the demands of critical operations.
  • Read Efficiency: Facilitates the swift retrieval of secrets and decryption of keys, minimizing latency to enable immediate access and use of secure data.
  • Write/Update Consistency: Prioritizes strong consistency for write and update operations. Although these processes may be slower, they must ensure complete data integrity and consistency across the system.
  • Multi-Tenancy and Data Isolation: Supports access for multiple users or tenants while ensuring strict data isolation, providing secure and separate environments for each tenant’s data.

it's clear that there are numerous databases on the market that already support these features, like Spanner or FoundationDB.

Vault's built-in Raft and its approach to locking its controller layer and database hinder the foundational database's ability to perform cross-region replication for reading. This seems to be a strategy to push for a business license that includes cross-region replication.

I hold high hopes for this project, especially because it is truly open source. I wish it would rely more on databases for durability, reliability, and replication, rather than on Vault itself. I suggest pushing Vault towards functioning mainly as a controller layer, where each node can handle reads and all nodes can write, assuming the database supports transactions. This could eliminate the need for a leader-election model and the fake high availability (why do you even need standbys?) that doesn't necessarily contribute to the system's robustness.

@hixichen hixichen changed the title Rely on database layer for transaction, replication, and durability Roadmap consideration: rely on database layer for transaction, replication, and durability Apr 6, 2024
@cipherboy
Contributor

cipherboy commented Apr 6, 2024

\o Hello @hixichen,

Thanks for the issue! This came at a good time, as we are considering a GA roadmap as a community; we welcome any additional thoughts you might have in that direction (perhaps in a separate issue?).

This reply also got rather long, so apologies in advance. :-)


This has definitely been on my mind a lot lately. There are three sets of issues that I had in mind when removing all of the storage backends (see #64):

  1. Operator experience: an integrated storage mechanism means fewer dependencies to run -- too many storage backend choices force operators to make a decision (and often, in the case of the Vault community, they end up preferring their favorite backend rather than the supported option(s)).
  2. Prevent least-common-denominator stagnation: too many choices prevents contributors from pushing OpenBao forward to solve some of these issues (like paginated lists in RFC - Support Paginated Lists in Storage, APIs #140 or as you point out, transactional storage).
  3. Attract operators: by making the initial migration story easy, if they're already aligned with best practices (see Proposal: Migration Path #55), operators can switch from Vault to OpenBao.

(I could definitely expand more on each of these, but trying to keep it brief!)

The issue with Spanner is that it is a cloud-only offering and thus unattractive to on-prem, airgapped users. And it is a wholly proprietary offering, so by the rationale of #64, it'd be best served as an external plugin, not part of the core offering (OSI-licensed integrations only!).

FoundationDB is open source, yes. But from a cursory search, there's no vendor ecosystem of managed offerings. Who do I pay when my FoundationDB instance goes sideways? IMO, if we were to choose an external database, we'd want one that is OSI-licensed (so we can have native first-class support for it, and FoundationDB is!)... But also one that has a broader vendor ecosystem so that it is accessible to all categories of potential users: to run it at home, for smaller businesses who don't view infrastructure as a critical cost center and will want to rely on vendor support, and for large businesses that view careful infrastructure management as a critical requirement and will want to have experts in the technology on staff.

(As an aside, we don't have external plugin support for storage backends yet, but once we have a few more storage-related core features, I'd be amenable to adding it: what I don't want is to have the community again be burdened by attempting to support both the supported storage modes and an arbitrary external storage plugin ecosystem too early, before the final interfaces are ready.)

Many of the existing storage backends stagnated because upstream did not maintain them. The only maintained ones were Consul (now non-OSI) and Raft. We'd have to take on a lot of work to update, modernize, and performance test any other backends, so selection should be done carefully, IMO.


I wholeheartedly agree about the lack of transactions (Write/Update Consistency above). Note, though, that while "raft" is nominally the name of the storage backend, it derives that name from its consistency algorithm: its actual backing database is bbolt. I think we can argue either way about some of the production requirements and whether bbolt meets them. :-)

Arguably it is simpler (than most fully-fledged databases) in certain aspects, so perhaps... :-)

Notably, we already have transactions for critical core storage.

The present shortcomings are thus:

  1. They aren't widely used, even within Core (in OpenBao).
  2. They are not exposed to plugins at all.
  3. The design is... interesting. :-)

Part of this is due to shortcomings of the interface itself: we can only send a series of predetermined operations! This does not match widespread programmer expectations of an interface that provides all operations within the context of a transaction (e.g., Go's model of returning another transactional handle). However, BoltDB does provide semantics like that!

So the initial gap is about design, more so than fundamental shortcomings with the system. :-) Here's where an RFC would come in, once the problem space is understood better.
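For concreteness, here is a minimal sketch of the transaction-handle style against a toy in-memory store. Everything here (memStore, BeginTx, memTx) is hypothetical and for illustration only -- these are not OpenBao's real interfaces. The point is simply that the handle exposes the same operations as the store itself, rather than accepting a predetermined batch of writes.

```go
package main

import (
	"errors"
	"fmt"
)

// memStore is a toy in-memory K/V store with copy-on-write transactions.
type memStore struct{ data map[string]string }

func newMemStore() *memStore { return &memStore{data: map[string]string{}} }

func (m *memStore) Get(k string) (string, bool) { v, ok := m.data[k]; return v, ok }
func (m *memStore) Put(k, v string) error       { m.data[k] = v; return nil }

// memTx is the "transactional handle" style: it offers the same Get/Put
// operations as the store itself, so arbitrary logic (including reads of
// the transaction's own pending writes) can run inside the transaction.
type memTx struct {
	parent  *memStore
	staged  map[string]string
	aborted bool
}

func (m *memStore) BeginTx() *memTx {
	return &memTx{parent: m, staged: map[string]string{}}
}

func (t *memTx) Get(k string) (string, bool) {
	if v, ok := t.staged[k]; ok {
		return v, true
	}
	return t.parent.Get(k)
}

func (t *memTx) Put(k, v string) error { t.staged[k] = v; return nil }

// Commit applies all staged writes at once; Rollback discards them.
func (t *memTx) Commit() error {
	if t.aborted {
		return errors.New("transaction aborted")
	}
	for k, v := range t.staged {
		t.parent.data[k] = v
	}
	return nil
}

func (t *memTx) Rollback() {
	t.aborted = true
	t.staged = map[string]string{}
}

func main() {
	s := newMemStore()

	tx := s.BeginTx()
	tx.Put("secret/a", "1")
	tx.Put("secret/b", "2")
	tx.Commit() // both writes become visible atomically

	tx2 := s.BeginTx()
	tx2.Put("secret/a", "oops")
	tx2.Rollback() // staged write is discarded

	v, _ := s.Get("secret/a")
	fmt.Println(v) // prints "1"
}
```

In the predetermined-operations style, by contrast, the caller hands the backend a fixed list of writes up front, which rules out read-then-write logic inside the transaction; the handle style above is what Go programmers tend to expect and what BoltDB natively provides.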


Digression on clustering.

You might already be aware of this, but for the benefit of the rest of the community...

Upstream has three clustering modes:

  1. High Availability, present in their community offering and thus also in OpenBao. In this mode, the other nodes do no work and simply forward all calls to the active leader. This is a flat, single cluster with no hierarchy.
  2. Performance Secondary, which has a tiered hierarchy (a primary cluster with several leaf clusters: each of these leaf clusters have different local data, which potentially allows scaling certain operations better if they don't need to write to globally shared storage). In this mode, standby nodes can make read operations, but cannot write. But each leaf cluster has its own leader to be able to write to local storage, and the active primary cluster can write to global storage. This allows vastly scaling reads and potentially (depending on each plugin's architecture) scaling writes as well.
  3. Disaster Recovery, which is like HA but for Perf Secondaries: it replicates more data and keeps a fully started cluster with all plugins already mounted.

These last two are Vault Enterprise only.

Assuming a good choice of database, OpenBao can thus achieve fairly good availability; HA mode ensures that at least as well as Perf clusters would, and DR clusters are (for this discussion) merely an extension of HA mode for Perf clusters.

The requirements are relatively modest too: there's only one active node, so a multi-writer scenario would not occur; the existing (now removed) storage backends ensured this even on large databases that supported multiple writers.

However, Perf Secondary clusters impose a lot of infrastructure problems as each is its own cluster with some shared data syncing.

Instead, it'd make more sense for OpenBao to expand HA mode to allow reading on standby nodes. This gives the read scaling of Perf Secondaries without the additional overhead and is a net-better version. Allowing multiple writers would likely be too much work to be worthwhile: it would require all plugins and storage operations to use transactions, and it would require a substantial update to the entire plugin ecosystem (plugins cache data and can assume data will not change under them unless they're the active node and thus performed the operation that invalidated the cache).

All this to say, I don't think improving clustering modes necessarily changes our database/storage requirements any.

But, w.r.t.:

Vault's built-in Raft and its approach to locking its controller layer and database hinder the foundational database's ability to perform cross-region replication for reading. This seems to be a strategy to push for a business license that includes cross-region replication.

while certainly there is a differentiated offering, as no doubt required by an open core model (and IMO, HashiCorp making money off of Vault via Enterprise was certainly a good thing on the whole, even if I personally had wished they had stayed with an OSI-approved license), I don't think this is quite as true as we'd think. It is only true from an Enterprise support perspective, I'd posit.

In particular, HA in Vault Community supports arbitrary storage backends as long as they provide their own HA mode. This is not Raft in many places and is completely transparent to Vault/OpenBao (e.g., pointing at a shared Postgres instance with replication, or some other natively distributed DB). Raft is just a convenient way to build this with their choice of backing datastore, bbolt.
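To make that pattern concrete, here is a hedged sketch (illustrative names only, not the actual physical.HABackend API) of backend-provided HA: the storage layer exposes a lock, the node that acquires it becomes active, and everyone else stays standby and forwards requests. A real backend would implement this with, say, a Postgres advisory lock or a Consul session rather than an in-process mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// memLock stands in for the lock a storage backend would provide.
// Whichever node holds it is the single active node; the application
// never needs its own consensus algorithm for this.
type memLock struct {
	mu     sync.Mutex
	holder string
}

// TryAcquire grants the lock only if no one currently holds it.
func (l *memLock) TryAcquire(nodeID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" {
		l.holder = nodeID
		return true
	}
	return false
}

// Release is a no-op unless called by the current holder.
func (l *memLock) Release(nodeID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == nodeID {
		l.holder = ""
	}
}

func (l *memLock) Holder() string {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.holder
}

func main() {
	lock := &memLock{}
	fmt.Println(lock.TryAcquire("node-a")) // true: node-a becomes active
	fmt.Println(lock.TryAcquire("node-b")) // false: node-b stays standby
	lock.Release("node-a")                 // active node goes away...
	fmt.Println(lock.TryAcquire("node-b")) // true: failover to node-b
}
```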

However, since we have no enterprise licensing concerns, taking the simplest approach to allowing scaling (the natural extension to HA mode that upstream would not consider to preserve the open core model) IMO makes the best sense.


Would you mind elaborating on this point:

Multi-Tenancy and Data Isolation: Supports access for multiple users or tenants while ensuring strict data isolation, providing secure and separate environments for each tenant’s data.

Does data isolation necessarily come from multiple databases (which seems expensive technologically to maintain under a single instance of the app)? Or can it come from layered seal+barrier mechanisms per-tenant, writing into the same database?

IMHO, if tenants require strict database isolation, it becomes too much of an operational challenge (for a community supported open source project) to build a single platform to do so. Instead, it'd be easier to run multiple instances of the software for each of these customers, and use crypto-level separation rather than requiring multiple parallel database management, for the rest.


All this to say...

What is your view on Postgres?

If I were to suggest an alternative to your suggestions, I'd suggest Postgres. It is a widely adopted, widely supported, widely understood database. It is a boring choice, in the way that light, neutral-colored walls are standard. There are many vendors (cloud or traditional) that offer paid support for Postgres. There are variations upon it for various scenarios (Percona comes to mind as one, the managed Postgres or Postgres-compatible offerings of clouds as another). It has one of the largest communities of open database users (besides perhaps MariaDB and SQLite), and it is very widely deployed and distributed, which makes infrastructure setup easy. And it won't restrict our future ability to ship in Linux distributions in any way.

This is the direction I was leaning, but my personal roadmap was something closer to:

  1. Better transactional storage semantics,
  2. See about extending HA mode to have distributed read support,
  3. Think about resurrecting the Postgres backend.

Curious to hear your thoughts!

@hixichen
Author

Thank you for your comprehensive response. It has provided me with a deeper understanding of the decision-making behind Vault's technological direction. I might have been a bit hasty in my initial judgments. Essentially, the primary aim seems to be to delve into the problem space and clarify it.

To rephrase, I now grasp the high availability (HA) motivation and the concept of abstracting the key-value (KV) interface for physical storage.

From my perspective, I see it as follows:

The simplicity and generic interface of the KV model restricts transactional capabilities since a single Vault write operation necessitates multiple write calls.
This necessitates that Vault must have a robust HA mechanism with a Leader model to manage write operations.
My question is: given that replication and performance optimization are well established in database technologies, why doesn't Vault separate itself, as a functional client, from the database layer instead of taking on this burden directly?

However, I do concur with your analysis of Spanner and FoundationDB. Their limitations could restrict the choices available to many users, especially considering the varied environments in which they operate.

That said, if we can prioritize features that drive adoption and allow for easy updates, modernization, and performance tuning, it would be beneficial.

In this context, PostgreSQL emerges as a strong candidate for the backend.

To clarify my thought:

In an ideal scenario, users would deploy PostgreSQL as a single instance for local development and as a clustered setup for production, with the ability to replicate across multiple regions or zones as needed. Vault, functioning more like a client, would be deployed flexibly but primarily handle read operations, directing write operations to the master node of PostgreSQL.

@cipherboy
Contributor

\o hello again @hixichen!

The simplicity and generic interface of the KV model restricts transactional capabilities since a single Vault write operation necessitates multiple write calls.

Just to clarify, a single write HTTP request to OpenBao could (depending on the plugin's code) result in multiple write operations to the underlying storage. This today isn't transactional, but could and should be.

I think the simplicity will likely remain, transactions aside. Adding transactions does admittedly complicate the interface, but we'll still probably keep the same core operations and not add, say, relational data/queries, as those would be hard to support in a storage-backend-agnostic manner.
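A toy demonstration of the problem, with purely hypothetical names (this is not actual plugin code): one logical write fans out into two storage writes, and a failure between them leaves a torn state that a transaction would have prevented.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a toy K/V backend whose Nth write can be made to fail,
// simulating a crash or storage error mid-request.
type store struct {
	data   map[string]string
	writes int
	failAt int // fail the Nth write (1-based); 0 disables injection
}

func (s *store) Put(k, v string) error {
	s.writes++
	if s.failAt != 0 && s.writes == s.failAt {
		return errors.New("simulated storage failure")
	}
	s.data[k] = v
	return nil
}

// createRole models a plugin handler for one HTTP write that fans out
// into two storage writes: the role entry and an index entry.
func createRole(s *store, name string) error {
	if err := s.Put("roles/"+name, "{...}"); err != nil {
		return err
	}
	// If this second write fails, the first is already durable: the
	// role exists but is missing from the index -- a torn state that
	// a storage transaction would have rolled back.
	return s.Put("index/roles/"+name, "present")
}

func main() {
	s := &store{data: map[string]string{}, failAt: 2}
	err := createRole(s, "web")
	_, roleExists := s.data["roles/web"]
	_, indexed := s.data["index/roles/web"]
	fmt.Println(err != nil, roleExists, indexed) // true true false
}
```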

My question is, given that replication and performance optimization are well-established in database technologies, why Vault doesn't separate itself as a functional client from the database layer, instead of taking on this burden directly.

I cannot speak to the founding of upstream, nor do I wish to. However, I can make some general observations...

OpenBao maintains application-level encryption of its backend storage. Which is to say, the backend itself (whether Raft or Postgres or what have you) is not aware of the decryption keys and only ever sees encrypted data. This makes the threat model easier: it is localized to compromise of OpenBao itself. Compared to, say, database-native row-level encryption or full-disk encryption of the database host, this limits your ability to have meaningful data structures in a relational database, and thus K/V looks like the least common denominator... unless you build complex relational queries on top of that, in OpenBao (because it is the only one with the encryption keys).

(In short, with this encryption model, a K/V interface is very attractive and hence a technology pairing like Raft+bbolt is rather attractive).

However! I will say, upstream's Vault is more like this, in the community edition offering. I guess HashiCorp's Vault Enterprise doesn't allow Postgres as a backend for Performance Secondaries, but for the Vault Community's HA mode, it does essentially rely on the database's replication. (With a slight asterisk: HA mode is only a single active node and other standby nodes simply forward requests; they don't even handle read operations like Performance Standby nodes do).

We, in OpenBao, removed Postgres and everything else, thus deviating from that client-like model (for those improvements I mentioned above -- ListPages already landed). And thanks to this discussion, I've finally gotten around to writing up this RFC on transactions that I had been thinking about.

But I think it is risky to consider multi-writer, even when backed by a capable backend (like Postgres). Plugin authoring is already hard, and many plugins are inherently stateful. E.g., should node A revoke the credentials, or should node B? Which one will be doing CRL rebuilding? How do they communicate about this or other things (currently there is no node-to-node, plugin-specific communication mechanism other than storage)? &c.

By having a single-writer setup, it becomes the default that one node will handle these operations and others will service fewer (but no less important! -- PKI cert issuance without storage or Transit encryption operations, &c).

I like talking about how OpenBao lacks a cross-plugin communication mechanism. However, it also lacks a cross-node communication mechanism, outside of storage. There's no gRPC channel between instances of the same plugin running on different nodes, and there's no discovery of other nodes at the plugin level -- so even if you had embedded gRPC in a plugin, you wouldn't know where to connect; and if you had each node write its address to storage, you couldn't be sure whether that entry was stale, whether the node was temporarily down and would come back, or whether it was permanently lost. &c.

In short, I think a multi-writer scheme would require active coordination between nodes, which OpenBao isn't necessarily suited to solve in the medium term. Long term, perhaps, anything is possible. :-)

All this to say, for transactions, OpenBao will definitely be a client of the underlying storage technology. For data integrity, it will again be a client of the underlying storage technology. But the top-level application cannot easily be made to be multi-writer without, IMHO, substantial work.


Looking forward, I think my immediate next goal (after transaction support lands) is trying to make the existing HA mode multi-reader. Once we have this, I think we'll be in a good place to start re-introducing other storage backends, if that's the community's desire, in a limited, maintainable fashion.

Your help modernizing the Postgres storage backend, then, would be much appreciated, if you're so willing! :-)

@alberk8

alberk8 commented Apr 19, 2024

Can also consider YugabyteDB, as it is wire-compatible with Postgres and is open-source and cloud-native, with the option of a paid managed DB.

@cipherboy
Contributor

@alberk8 if it is wire compatible, I assume there would be no work to do? You can find the old backend here: https://github.com/openbao/openbao/blob/before-plugin-removal/physical/postgresql/postgresql.go
