Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent a possible race condition in blocking requests #3656

Open
dnephin opened this issue Nov 16, 2022 · 0 comments
Open

Prevent a possible race condition in blocking requests #3656

dnephin opened this issue Nov 16, 2022 · 0 comments
Labels
area/api Issue or PR related to the Infra API. kind/improvement A report of a quality problem, or a change that addresses a quality problem. status/never-stale Indicates to actions/stale that the issue or PR should never be marked stale.

Comments

@dnephin
Copy link
Contributor

dnephin commented Nov 16, 2022

The new blocking requests (aka requests used for long polling) have (I believe) a possible race condition between retrieving the next ID from the seq_update_index sequence, and committing the transaction that uses that ID. Note I haven't actually reproduced this race, so it's possible that it's unlikely in practice. We should try to reproduce before attempting to fix the problem.

Problem

There is no synchronization between nextval('seq_update_index') and committing of the transaction that uses the value. This means that if two CreateGrant or DeleteGrant requests are made at the same time, it's possible for them to be ordered like this:

  1. CreateGrant1 calls nextval('seq_update_index') to get value 1000
  2. CreateGrant2 calls nextval('seq_update_index') to get value 1001
  3. CreateGrant2 transaction commits, which sends a notify, and the blocking query reads value 1001
  4. CreateGrant1 transaction commits, which sends a notify, and the blocking query reads value 1001 (which means it has missed the update with id 1000)

This is only a problem if both requests (CreateGrant1 and CreateGrant2) are for the same organization and same destination (or in the future same org same user may also be a problem). If either the org or destination are different, then the race doesn't break anything.

Given that the query has to receive the notification and then repeat a query, it seems like this race is very unlikely, because even if it happens, as long as the second commit completes before the blocking request does its query, the problem will be hidden. None the less, I think it's important we have some way to fix it.

Proposed Solution

One way I've found to fix this is by using locks. https://www.postgresql.org/docs/current/explicit-locking.html says this:

PostgreSQL provides various lock modes to control concurrent access to data in tables. These modes can be used for application-controlled locking in situations where MVCC does not give the desired behavior.

Which sounds like exactly the scenario we are in.

I'm not sure yet exactly which locking mode we need, but I think what we want is a lock that synchronizes the nextval('seq_update_index') with the commit of the transaction. Since locks are automatically released on commit, all we need is to acquire the right lock before doing a CreateGrant or DeleteGrant.

The lock does not need to be global, it should be per-organization, so one org will never block other orgs. We could get even more fine grained locks by destination and user, but I think it's unlikely we need that complexity.

I don't think we want to lock the grants tables, and we can't lock individual rows in the grants table, because inserts will not be on an existing row.

One idea is to either use the organizations table (since the lock is scoped to an org), or to create a new table used exclusively for this purpose. The table would have at least one row per org (but if we end up with different updateIndex sequences maybe we add more).

The solution would be for every CreateGrant and DeleteGrant to do this:

-- this should get a per-organization row lock
SELECT FOR UPDATE * FROM organization_locks WHERE organization = ?

-- Create or Delete grant
INSERT ... nextval('seq_update_index')

-- release the lock
COMMIT

I believe that will prevent any races, and also ensure that one org can never block another.

If an org starts running into problems, we can direct them to the new bulk grants edit endpoint. That endpoint allows multiple CREATE/DELETE in the same transaction, which allows an org to get a lot more throughput without lock contention.

@dnephin dnephin added area/api Issue or PR related to the Infra API. kind/improvement A report of a quality problem, or a change that addresses a quality problem. status/never-stale Indicates to actions/stale that the issue or PR should never be marked stale. labels Nov 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api Issue or PR related to the Infra API. kind/improvement A report of a quality problem, or a change that addresses a quality problem. status/never-stale Indicates to actions/stale that the issue or PR should never be marked stale.
Projects
None yet
Development

No branches or pull requests

1 participant