Add locking mechanism to prevent parallel execution as a safety measure #335

Closed
dahu33 opened this issue Apr 21, 2022 · 17 comments

@dahu33 (Contributor) commented Apr 21, 2022

I know this topic has already been discussed in #191 and #258 and while I understand the argument that migrations should not run in parallel in the first place, I still think that goose should have a locking mechanism to prevent parallel execution as a safety measure. This would not only allow new use cases but also make goose a much safer tool to use!

@mfridman (Collaborator)

I can't speak for @VojtechVitek, but I've come around on this. After chatting with other folks and gathering feedback, the consensus is "it'd be a nice feature to have".

I'm not sure what this would actually look like to the end user, but I suggest we keep this issue open and continue the discussion.

@mfridman (Collaborator)

Not an exhaustive list, but some things to think through:

  • How is this implemented, and which database(s) are supported?
  • Is this implemented as a default or as opt-in?
  • Can it be implemented in a backwards-compatible way?

I'm sure more questions and implementation details will come up.

@jessedearing

To solve this we had to write a wrapper executable for goose that obtains an advisory lock in Postgres. It's pretty common in Kubernetes deployments to use an init container to perform database migrations. The issue arises because init containers are part of deployments, which run multiple copies of the container. With no locking in place, goose migrations interleave with each other.

It's a race condition, so we don't see the issue every time, but with no locking in place we run into failures at random.

We use advisory locks, but the implementation is pretty specific to each DB backend.
A more general approach could be to use LOCK TABLE on the goose migrations table with an ACCESS EXCLUSIVE lock in a transaction while the migrations run.

I'd expect the end-user experience for either of these methods to not look any different than it does now. Goose runs all the same; the only difference is that another copy of goose cannot run until the first session is complete.
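
For illustration, a minimal Go sketch of that table-lock variant, assuming database/sql and goose's default goose_db_version table name (both assumptions; this helper is not part of goose today):

package migratelock

import (
	"context"
	"database/sql"
)

// withMigrationTableLock is a hypothetical helper: it serializes concurrent
// migrators by taking an ACCESS EXCLUSIVE lock on goose's version table for
// the duration of fn, then commits. Assumes the default "goose_db_version"
// table name.
func withMigrationTableLock(ctx context.Context, db *sql.DB, fn func(tx *sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Blocks until no other session holds a conflicting lock on the table.
	if _, err := tx.ExecContext(ctx, "LOCK TABLE goose_db_version IN ACCESS EXCLUSIVE MODE"); err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		return err
	}
	return tx.Commit()
}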

@mfridman (Collaborator)

Oddly enough, I was just chatting with folks about this.

@jessedearing, was your approach similar to the one described by @dahu33 in this #191 (comment)?

I've seen this come up a lot recently, probably because more and more environments are punting this problem to the tool itself.

@jessedearing

was your approach similar to the one described by @dahu33 in this #191 (comment)?

It was. We moved it to its own executable that launches Goose, so we can still pull in upstream Goose updates easily. We just use os/exec to call Goose. Like @dahu33 describes, we use select pg_advisory_lock(1) and select pg_advisory_unlock(1) in this executable. Since it is a separate executable, we just create a single DB connection, so we don't need to worry about pooling.

The implementation is very specific to Postgres, but MySQL has GET_LOCK and RELEASE_LOCK functions too.
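
For anyone who wants to roll their own in the meantime, a rough sketch of that kind of wrapper (not the actual tooling described above; the lock key 1, the DATABASE_URL env var, and the driver choice are all placeholders):

package main

import (
	"context"
	"database/sql"
	"log"
	"os"
	"os/exec"

	_ "github.com/lib/pq" // any Postgres driver works
)

func main() {
	ctx := context.Background()

	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pin a single connection: a session-level advisory lock must be released
	// on the same session (connection) that acquired it.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Blocks until the lock is free. The key (1) is an arbitrary placeholder.
	if _, err := conn.ExecContext(ctx, "SELECT pg_advisory_lock(1)"); err != nil {
		log.Fatal(err)
	}

	// Run goose with whatever arguments this wrapper was given.
	cmd := exec.CommandContext(ctx, "goose", os.Args[1:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	runErr := cmd.Run()

	// Explicit unlock; the lock is also released when the session ends.
	if _, err := conn.ExecContext(ctx, "SELECT pg_advisory_unlock(1)"); err != nil {
		log.Printf("unlock failed: %v", err)
	}
	if runErr != nil {
		log.Fatal(runErr)
	}
}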

@tomwhipple

To add another scenario: I'm embedding Goose into the test suite to ensure that code and database (Postgres) remain in sync. However, the Go test runner introduces concurrency, so migrations can intermittently end up running simultaneously.

@mfridman mfridman self-assigned this Nov 30, 2022
@mfridman (Collaborator)

Thanks all for the feedback. Going to bump this up and add a label as something we should add to this library. It'll likely be opt-in behaviour to avoid unexpected issues for existing users.

@mfridman mfridman pinned this issue Jan 22, 2023
@mfridman mfridman added this to the v3.10.0 milestone Jan 22, 2023
@mfridman (Collaborator) commented Jan 30, 2023

After some thought, I think an advisory lock (at least for postgres) is the right approach, granted this won't always work (see cockroachdb/cockroach#13546). We can document which databases are safe for concurrent use and which aren't.

Then again, this may be a reason to use a table lock, similar to liquibase, which would help not only in the cockroachdb example but also where PgBouncer is being used with transaction pooling.

Maybe the current goose behavior should be the default and users remain responsible for serializing their deployments (as an example), but if the database supports it they can opt in to advisory locks. TBD whether this should be a session-level or a transaction-level advisory lock.

We could implement both and leave it up to the user to configure based on their needs, something along the lines of:

-lock={none, session, transaction}?


I really like @dahu33's comment from above, which is why I'd like to see goose add support for this feature, and with some wood behind this arrow hopefully we can get an implementation we all like.

This would not only allow new use cases but also make goose a much safer tool to use!

If folks have further thoughts, suggestions or ideas on an implementation please drop your comments!

@mfridman (Collaborator)

Dropping some thoughts here. I think it'd be ideal if we implemented a transaction-level advisory lock, instead of a session-level advisory lock. But there are some notable gotchas with transaction-level locks, such as CREATE INDEX CONCURRENTLY (among others).

These statements wait for other transactions to complete before executing (reference):

.. and in addition it must wait for all existing transactions that could potentially modify or use the index to terminate.

This means if we took a transaction-level lock and then attempted to run a migration it'd hang indefinitely. Not good.

Now, there is another option. For each migration, we know whether it is safe to run within a transaction or not (with the -- +goose NO TRANSACTION annotation). By default, we run a migration using a transaction-level lock, but for these special-case migrations, we promote to a session-level lock.

The downside with session-level advisory locks is they MUST be released by the same session (read connection), and if anything happens to that session you're in for a bad time trying to figure out why subsequent operations are hanging.

Once acquired at session level, an advisory lock is held until explicitly released or the session ends. Unlike standard lock requests, session-level advisory lock requests do not honor transaction semantics: a lock acquired during a transaction that is later rolled back will still be held following the rollback, and likewise an unlock is effective even if the calling transaction fails later.
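
For contrast, a transaction-level lock needs no explicit unlock at all, since it is released with the transaction. A rough sketch of that variant (not goose's actual implementation; the lock key is a placeholder):

package migratelock

import (
	"context"
	"database/sql"
)

// migrateWithXactLock runs a single migration statement under a
// transaction-level advisory lock. The lock is released automatically on
// COMMIT or ROLLBACK; there is no unlock function for xact-level locks.
func migrateWithXactLock(ctx context.Context, db *sql.DB, migrationSQL string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Waits for any other holder of the same (placeholder) key.
	if _, err := tx.ExecContext(ctx, "SELECT pg_advisory_xact_lock(1)"); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, migrationSQL); err != nil {
		return err
	}
	return tx.Commit()
}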


Another thing I've been thinking about is whatever the implementation is, we should keep the door open to supporting this request: #222

tl;dr it'd be nice if multiple migrations could be applied within a single atomic transaction (assuming none of those files have the -- +goose NO TRANSACTION annotation).

I'd like to explore what a separate goose_lock table could look like, where we run a SELECT ... FOR UPDATE to acquire a lock. But there are some gotchas with this approach as well. I'll drop them in another comment.
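
For the sake of discussion, the row-lock idea could boil down to something like this (entirely hypothetical; the goose_lock table and its single row do not exist in goose and would need to be created ahead of time):

package migratelock

import (
	"context"
	"database/sql"
)

// acquireRowLock is a hypothetical sketch: lock a dedicated single-row table
// with SELECT ... FOR UPDATE inside the migration transaction. The row lock
// blocks other migrators and is released on COMMIT or ROLLBACK.
func acquireRowLock(ctx context.Context, tx *sql.Tx) error {
	var id int
	return tx.QueryRowContext(ctx,
		"SELECT id FROM goose_lock WHERE id = 1 FOR UPDATE").Scan(&id)
}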

@mfridman (Collaborator)

For the postgres session-level advisory lock, I changed the implementation to use pg_try_advisory_lock (instead of pg_advisory_lock) with retries attempting to acquire the lock.

The retry has sane defaults but can be configured by the user.
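
Roughly speaking, the acquire path looks like the following sketch (not the actual goose code; the key, interval, and attempt count are placeholders):

package migratelock

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// acquireLockWithRetry polls pg_try_advisory_lock, which returns immediately
// with true/false instead of queueing behind other waiters. Not queueing is
// what avoids the CREATE INDEX CONCURRENTLY deadlock described below.
func acquireLockWithRetry(ctx context.Context, conn *sql.Conn) error {
	const lockID = 1 // placeholder key
	for attempt := 0; attempt < 60; attempt++ {
		var locked bool
		if err := conn.QueryRowContext(ctx,
			"SELECT pg_try_advisory_lock($1)", lockID).Scan(&locked); err != nil {
			return err
		}
		if locked {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
		}
	}
	return errors.New("failed to acquire advisory lock after retries")
}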

This implementation avoids an edge case that can lead to a deadlock. It's quite easy to reproduce with any tool that uses pg_advisory_lock exclusively:

  • Create 10 workers trying to acquire a lock with pg_advisory_lock
  • Attempt to apply the following 2 migrations (as separate operations, e.g., 2 up files)

migration 1

CREATE TABLE owners (
    owner_id BIGSERIAL PRIMARY KEY,
    owner_name text NOT NULL
);

migration 2

CREATE UNIQUE INDEX CONCURRENTLY ON owners(owner_name);

This will lead to the following error: deadlock detected (SQLSTATE 40P01)

If I have some cycles, I'll add more details here.

@peterldowns

@mfridman if/when you have the cycles, I'd be very interested to learn how taking a lock with pg_advisory_lock led to deadlock but acquiring the lock by looping over pg_try_advisory_lock fixed the problem. In my testdb project for spinning up cheap postgres databases in tests I acquire locks with pg_advisory_lock to serialize writes from potentially many tests running in parallel, and I'd be quite worried about running into this problem. Any chance you could post a repro that I could look into?

@peterldowns commented May 30, 2023

Oh, I was able to reproduce this with just two psql connections to a database.

Connection 1:

postgres=# select pg_advisory_lock(10101);
 pg_advisory_lock
------------------

(1 row)

Connection 2:

postgres=# select pg_advisory_lock(10101);

This hangs, waiting for Connection 1 to release the lock.

Then, back in Connection 1:

postgres=# CREATE TABLE owners (name text);
CREATE TABLE
postgres=# CREATE INDEX CONCURRENTLY foobar ON owners (name);
ERROR:  deadlock detected
DETAIL:  Process 68 waits for ShareLock on virtual transaction 3/2; blocked by process 67.
Process 67 waits for ExclusiveLock on advisory lock [5,0,10101,1]; blocked by process 68.
HINT:  See server log for query details

That's... pretty crazy, and not what I would have expected! The documentation on CREATE INDEX CONCURRENTLY and on advisory locks does not clearly explain that they would conflict with each other. I guess the virtual transaction waiting on the advisory lock (Connection 2 in my example) is counted as a "transaction that has a snapshot", even though its virtual transaction began before the owners table was even created?

After the second scan, the index build must wait for any transactions that have a snapshot (see Chapter 13) predating the second scan to terminate, including transactions used by any phase of concurrent index builds on other tables, if the indexes involved are partial or have columns that are not simple column references.

Or, like you quoted earlier, maybe it is a virtual transaction that "could use the index"

it must wait for all existing transactions that could potentially modify or use the index to terminate.

Looks like the Flyway project also switched their implementation to spinning on pg_try_advisory_lock as a result, per this thread flyway/flyway#1654 (comment)

Appreciate you mentioning it here, I wouldn't have been aware of the issue otherwise.

@jessedearing

We never saw this in our tooling because our tables are small enough that we haven't added CONCURRENTLY to CREATE INDEX. Good catch

@mfridman (Collaborator) commented May 31, 2023

@peterldowns Coincidentally, I recently discovered a series of blog posts from Sourcegraph (huge kudos to @efritz) that outlines their migration woes with golang-migrate. Specifically, they describe the same issue as the #335 (comment) above.

https://about.sourcegraph.com/blog/introducing-migrator-service .. scroll down to A deeper contributing factor

The good news is the /v4 version of goose handles this case and there are a bunch of tests to capture the behaviour. It's quite reproducible.

@peterldowns commented May 31, 2023

@mfridman thanks for the links to those blog posts. Where I last worked, we would build large indexes concurrently "out of band", and then check in a migration that creates-if-not-exists.

For example, let's say we needed to create some index on a really large table where it would probably take a few hours. We would normally run any new migrations as part of our deployment process, as a prelude to replacing the existing app containers with the new version. We couldn't create that index during deployment, even with CREATE INDEX CONCURRENTLY, because we would set strict timeouts on the deploy process. So we'd write a migration like this:

CREATE INDEX -- CONCURRENTLY
  IF NOT EXISTS my_example_idx
  ON some_really_large_table (primary_key, created_at, some_other_key);

Once the PR with this migration was approved, we'd use a script to manually execute the concurrent version of that query:

CREATE INDEX CONCURRENTLY
  IF NOT EXISTS my_example_idx
  ON some_really_large_table (primary_key, created_at, some_other_key);

Only once that concurrent index creation had succeeded would we merge the PR. After merging the PR, redeploying would involve running migrations before starting up the app, but it would take no time at all because the index was already created!

In a testing scenario where we would need to migrate from an empty database to the current state, this would be fast because there wouldn't be any data in the tables, and we didn't do concurrent index creation.

I'm actually planning on releasing a migration library/framework soon based on this example and a few other things I learned. Because of this dance around concurrent migrations I had never run into this problem that you mentioned. Thank you again for helping me understand the problem and for teaching me something new about Postgres!

@efritz commented Jun 1, 2023

Saw my blog post mentioned above, thought I'd drop in and say 👋! The way we do migrations at Sourcegraph assumes:

  • we attempt to grab an advisory lock around the migration run
  • we log attempts into a migration_logs table
  • each migration (except for create index concurrently) is wrapped in a transaction by the migration machinery
  • we update the attempt rows with success/failure once done (transactionally)

This affords us certain properties, such as: an incomplete attempt while no migrator holds a lock means a previous migration died before completion, but it didn't affect anything (since it was applied atomically in a transaction).

Concurrent index creation throws a wrench in the works here because you can't hold an advisory lock and complete the index creation at the same time. The way we get around this is by special-casing migrations with concurrent index creation (a rough sketch of the catalog check follows the list):

  1. Drop the advisory lock before attempting a CIC (CREATE INDEX CONCURRENTLY) transaction
  2. Poll the index status from the pg catalog tables
  3. If the index is 100% complete, you're done
  4. If the index is invalid, drop it
  5. If the index is not being created, apply the CREATE INDEX CONCURRENTLY statement
  6. Go to 2

Note that the reason we could do this (and golang-migrate couldn't easily) is that we only support Postgres.
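
For anyone wanting to try something similar, the index-status check in step 2 could be roughly this (a hypothetical catalog query; Sourcegraph's actual implementation may differ):

package migratelock

import (
	"context"
	"database/sql"
	"errors"
)

// indexStatus reports whether the named index exists and whether it is valid.
// A CREATE INDEX CONCURRENTLY that failed partway leaves an INVALID index
// behind, which is the "drop and retry" case in the loop above.
func indexStatus(ctx context.Context, db *sql.DB, indexName string) (exists, valid bool, err error) {
	err = db.QueryRowContext(ctx, `
		SELECT i.indisvalid
		FROM pg_class c
		JOIN pg_index i ON i.indexrelid = c.oid
		WHERE c.relname = $1`, indexName).Scan(&valid)
	if errors.Is(err, sql.ErrNoRows) {
		return false, false, nil
	}
	if err != nil {
		return false, false, err
	}
	return true, valid, nil
}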

@mfridman (Collaborator)

This functionality has been added in v3.16.0 and higher. Requires a *goose.Provider and the goose.WithSessionLocker option.

A bit more detail in this blog post:

https://pressly.github.io/goose/blog/2023/goose-provider/#database-locking
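
For reference, usage looks roughly like the sketch below. It is based on my reading of the v3.16.0 API and the blog post above, so treat the exact package paths and signatures as assumptions and check the docs:

package main

import (
	"context"
	"database/sql"
	"log"
	"os"

	_ "github.com/lib/pq" // any Postgres driver works
	"github.com/pressly/goose/v3"
	"github.com/pressly/goose/v3/database"
	"github.com/pressly/goose/v3/lock"
)

func main() {
	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Postgres session-level advisory locking with built-in retries.
	locker, err := lock.NewPostgresSessionLocker()
	if err != nil {
		log.Fatal(err)
	}

	provider, err := goose.NewProvider(
		database.DialectPostgres,
		db,
		os.DirFS("migrations"), // directory with .sql migration files
		goose.WithSessionLocker(locker),
	)
	if err != nil {
		log.Fatal(err)
	}

	if _, err := provider.Up(context.Background()); err != nil {
		log.Fatal(err)
	}
}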

Thank you to everyone for the feedback, fruitful discussions, and contributions! Going to close this issue, but feel free to continue the discussion; looking forward to your feedback.
