Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More efficient support for linearizable reads #5636

Open
heidihoward opened this issue Sep 8, 2023 · 8 comments
Open

More efficient support for linearizable reads #5636

heidihoward opened this issue Sep 8, 2023 · 8 comments

Comments

@heidihoward
Copy link
Member

I believe the only way to achieve linearizable reads at the moment is for the app logic to create a phantom write to the kv to ensure that the primary replicates and reaches consensus on the associated txn. Without this, the read might be stale, even if forwarding is set to always and the client waits until the read txn ID is committed. This is because the primary who served the read may no longer be the current primary.

This is related to but somewhat separate from the discussion around wait for commit.

Here's some early ideas about how to make linearizable reads more efficient:

  1. Allow txns with empty write sets - this probably the simplest approach. The app logic would still need to create a transaction for the read but at least with no write set, the transaction will have few conflicts and be smaller. It's a cleaner solution but I'm not sure how much is gained by this approach
  2. Return the last sequence number +1 to the client for linearizable reads. This means that the client will not consider the read committed unless the primary commits another transaction (in the same term). The key issue here is that if the previous txn is a signature and no other write txns arrive then the client could be waiting indefinitely. This could be worked around by forcing another signature or using (1)
  3. Implement PreVote or some type of leader stickiness such that if we trust clocks then the primary can serve linearizable reads without a round trip to a majority
  4. Implement wait for commit and then delay responses to linearizable reads until after the next successful AppendEntries. This is closest to existing systems like etcd

Aside: This could be an interesting to model in TLA+ using a separate high-level specification such as https://github.com/Azure/azure-cosmos-tla/blob/master/general-model/cosmos_client.tla

@lemmy
Copy link
Collaborator

lemmy commented Sep 8, 2023

Aside: This could be an interesting to model in TLA+ using a separate high-level specification such as https://github.com/Azure/azure-cosmos-tla/blob/master/general-model/cosmos_client.tla

The latest versions of the Cosmos specs are hosted at https://github.com/tlaplus/azure-cosmos-tla/ (including a completely new spec discussed in Understanding Inconsistency in Azure Cosmos DB with TLA+). Lorin Hochstein's Linearizability spec could also prove useful.

@eddyashton
Copy link
Member

This is related to but somewhat separate from the discussion around wait for commit.

I think we'd need a phantom write and wait-for-commit behaviour to produce a linearizable read? If we only add a write to a read operation, then it could still be executed over stale state on an isolated (no longer current) primary. It would at least get a fresh transaction ID, that could later be recognised as invalid/uncommittable, but I don't see how that helps the initial response linearizability, since that read result may report stale state.

@heidihoward
Copy link
Member Author

Agreed, the client must wait for commit for a linearizable read (or write for that matter). This comment was referring to the discussion around CCF withholding client responses until the txn has been committed.

@achamayou
Copy link
Member

I don't think we need a phantom write to detect the next instance of a successful AppendEntries quorum. If we maintained in a counter next to the CheckQuorum response time for example, we could just check for that going up following commit?

@heidihoward
Copy link
Member Author

Yes, a successful appendEntries to a majority, even if just a heartbeat, is sufficient for the leader to know that a read is linearizable

@eddyashton
Copy link
Member

To use the AppendEntries roundtrip, do we need to include a new counter value in AEs (and responses)? I'm specifically worried about delayed arrival of ACKs. You can't currently distinguish a heartbeat-ACK emitted before a user's request arrived, from one emitted after. These are misleading but non-fatal for CheckQuorum, but if a node believes an ACK is fresh (since user request) when it is not, that could break linearizability?

@heidihoward
Copy link
Member Author

To use the AppendEntries roundtrip, do we need to include a new counter value in AEs (and responses)? I'm specifically worried about delayed arrival of ACKs. You can't currently distinguish a heartbeat-ACK emitted before a user's request arrived, from one emitted after. These are misleading but non-fatal for CheckQuorum, but if a node believes an ACK is fresh (since user request) when it is not, that could break linearizability?

Yes you're right. The leader will need to ensure that the AE (heartbeat or otherwise) was requested after the read-only transaction was received

@heidihoward
Copy link
Member Author

An example of the non-linearizability of read-only transactions is now provided in the TLA+ CI https://github.com/microsoft/CCF/actions/runs/6381743027/job/17318854604

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants