
fix(client): Better throttle snapshot stop strategy #1251

Merged
merged 13 commits into electric-sql:main from race_cond on May 15, 2024

Conversation

@davidmartos96 (Contributor) commented May 9, 2024

We've encountered a race condition that can occur if, after cancelling the throttle when stopping the process, some other piece of code calls the throttled snapshot. That's because cancelling the throttle doesn't prevent new invocations from being scheduled.

The proposed solution is to assign undefined to the function, so that the places that would normally trigger a snapshot, like the poll interval, simply do nothing once the process is being stopped.

On top of this, we found that setClientListeners is only called when instantiating the process, but it is cleaned up when calling stop via the disconnect function. As a result, stopping and then restarting the process leaves some of the client listeners unset. The proposed solution is to move the client listener initialization into the start function.

I'm not sure how to test this. It would probably need some way to mock client events after a stop and a start.
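
Here is a minimal sketch of the throttle part of this idea; the class and member names are illustrative rather than the actual ones in process.ts, and it assumes the throttled function is created with lodash's throttle:

import throttle from 'lodash.throttle'

class SatelliteLike {
  // Optional so that it can be unset when stopping.
  private throttledSnapshot?: (() => void) & { cancel: () => void }

  start(): void {
    this.throttledSnapshot = throttle(() => this.performSnapshot(), 5000)
  }

  // Called from the poll interval and the notifier/client subscriptions.
  maybeSnapshot(): void {
    // After stop() this is a no-op instead of scheduling against a closed db.
    this.throttledSnapshot?.()
  }

  stop(): void {
    // cancel() only drops the pending trailing invocation; a later call to
    // the throttled function would still schedule a new snapshot, which is
    // why the function itself is unset as well.
    this.throttledSnapshot?.cancel()
    this.throttledSnapshot = undefined
  }

  private performSnapshot(): void {
    // run the actual snapshot
  }
}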

@msfstef (Contributor) commented May 13, 2024

Thanks for spotting this issue and submitting a fix!

Although unsetting the throttled snapshot resolves the issue, I am a bit reluctant to use that as the solution because, if it's getting called after termination, we ultimately have a deeper timing issue that should probably be solved instead.

The way I understand it, the throttled snapshot is called in four places:

  1. when the process is started
  2. on a polling interval
  3. as a response to the notifier "potential data changes" subscription
  4. as a response to the client "outbound started" subscription

I think one main issue right now is that in _stop we first wait for the lock to be acquired before clearing the polling interval, notifier subscriptions, and client subscriptions. So, until any current snapshot is done running and releases the lock, all the schedulers can still schedule snapshots (or do any other work they are scheduled to do), which is why I think it's better to fix the order of operations.

To solve 2) and 3), I think it should work to move the lock acquisition, which is there to wait on any current snapshots, to after clearing the interval and notifier subscriptions:

// Ensure that no snapshot is left running in the background
// by acquiring the mutex and releasing it immediately.
const releaseMutex = await this.snapshotMutex.acquire()
releaseMutex()

To solve 4), I think there needs to be an additional, separate call to unsubscribe from the client listeners (since client.shutdown is not necessarily called and thus does not necessarily clear the scheduler for snapshots), similar to the additional fix you've made to call setClientListeners in the start method.
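
Putting 2), 3), and 4) together, the reordered _stop could then look roughly like the sketch below; the field and helper names here are placeholders, not the actual ones in process.ts:

async _stop(shutdown?: boolean): Promise<void> {
  // Stop all schedulers first so nothing new can be queued:
  if (this._pollingInterval !== undefined) {
    clearInterval(this._pollingInterval) // case 2)
    this._pollingInterval = undefined
  }
  this._unsubscribeFromPotentialDataChanges?.() // notifier, case 3)
  this._unsubscribeFromClientEvents?.()         // client listeners, case 4)

  // Only then wait for any in-flight snapshot to finish,
  // by acquiring the mutex and releasing it immediately.
  const releaseMutex = await this.snapshotMutex.acquire()
  releaseMutex()

  if (shutdown) {
    await this.client.shutdown()
  }
}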

The 1) case is only an issue if stop is called before start finishes executing, which is really an edge case. There are a few ways we could handle it, but I think it can be split out as a separate fix; I suspect the issue you're running into stems from the scheduling of calls, as you've said.

Let me know if this sounds reasonable to you! The solution you've proposed should work, but we'd rather solve the underlying issues first, as they could cause problems elsewhere.

P.S. A test would indeed be great for catching this; I believe a timing mock for 2), a notifier event mock for 3), or a client event mock for 4) should be able to capture this error.
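
For case 3), for instance, a rough AVA sketch could look something like the snippet below; the notifier accessor and the potentiallyChanged call are assumptions about the mock setup rather than the actual test helpers:

test('no snapshot is scheduled after stop', async (t) => {
  const { satellite, authState } = t.context

  await startSatellite(satellite, authState, insecureAuthToken({ sub: 'test-user' }))
  await satellite.stop()

  await t.notThrowsAsync(async () => {
    // Emitting a "potential data changes" event after stop should now be a
    // no-op rather than scheduling a snapshot against a stopped process.
    satellite.notifier.potentiallyChanged() // assumed accessor/method
    // Give any (incorrectly) scheduled throttled snapshot a chance to fire.
    await new Promise((resolve) => setTimeout(resolve, 200))
  })
})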

msfstef self-assigned this May 13, 2024
msfstef self-requested a review May 13, 2024 09:41
@davidmartos96 (Contributor, Author)

@msfstef Thank you for reviewing!
You are right that moving the "wait for snapshot" code to the end would solve the scheduling issues.

While writing the test I noticed that I wasn't able to properly trigger the issue. The problem is that the clean function in the context doesn't actually close the underlying db connection, so even if the db files are removed the connection still works, which prevented us from seeing the issue of snapshotting against a closed db.

I used the same mechanism as for Postgres and attached the stop function to the SQLite tests context. Making this change already manifested problem 1) in some of the tests, so I included my proposed solution for that here as well.
I opted for a startProcess promise which needs to be awaited before stopping.
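
A minimal sketch of that startProcess promise idea, with illustrative names rather than the actual ones in process.ts:

class SatelliteLike {
  private startPromise?: Promise<void>

  start(): Promise<void> {
    this.startPromise = this.doStart()
    return this.startPromise
  }

  async stop(): Promise<void> {
    // Wait for start to finish before tearing anything down, so that
    // stop cannot race a still-running start (problem 1 above).
    if (this.startPromise !== undefined) {
      await this.startPromise
    }
    // ...clear schedulers, wait for any in-flight snapshot, etc.
  }

  private async doStart(): Promise<void> {
    // connect, set client listeners, trigger the initial snapshot, ...
  }
}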

@davidmartos96 (Contributor, Author)

It looks like some pglite tests can now fail because the db connection is actually closed when tearing down each test.

For instance, in the simple test below the connection promise is not awaited, so by the time startReplication runs the db is already closed and reading the meta table fails. Should we update those tests to actually wait for the connection to finish? Or should this be fixed somehow in the source code? Maybe, if the process is stopped, we shouldn't propagate the errors from the initializing promise.

test('can use sub in JWT', async (t) => {
  const { satellite, authState } = t.context

  await t.notThrowsAsync(async () => {
    await startSatellite(
      satellite,
      authState,
      insecureAuthToken({ sub: 'test-userB' })
    )
  })
})

@msfstef (Contributor) commented May 14, 2024

@davidmartos96 thank you for updating the tests and everything!

You're right, there are some tests that are not waiting for the connection promise to fulfil and that causes issues. From my testing, it seems the following tests in process.ts require this kind of fix:

await t.notThrowsAsync(async () => {
  const { connectionPromise } = await startSatellite(
    satellite,
    authState,
    insecureAuthToken({ user_id: 'test-userA' })
  )
  await connectionPromise
})
  • set persistent client id
  • can use sub in JWT
  • require user_id or sub in JWT
  • cannot update user id (needs to await connection after failure assertion)

There seems to be one additional failing test for Postgres, which might require the following fix in drivers/node-postgres/database.ts:

  let stopPromise: Promise<void> | undefined

  // We use the database directory as the name
  // because it uniquely identifies the DB
  return {
    db,
    stop: () => {
      // Memoize the stop promise so that calling stop() more than once
      // doesn't try to stop the embedded Postgres again.
      if (stopPromise) return stopPromise
      stopPromise = pg.stop()
      return stopPromise
    },
  }

With those fixes in place it seems the tests are good! I will review the updated changes separately.

@msfstef (Contributor) left a comment

Looks good! Left some comments about ordering and some nits, if you could add a changeset for this as well that would be great! (pnpm changeset in the root directory, should be a patch change)

Review comments on clients/typescript/src/satellite/process.ts (resolved)
@davidmartos96 (Contributor, Author)

@msfstef Tests seem to be ok now, thank you for the comments!

@msfstef (Contributor) left a comment

Looks good, thank you for your contribution! This was a good catch!

Review comment on clients/typescript/src/satellite/process.ts (resolved)
@kevin-dp (Contributor) left a comment

Great catch and solution, awesome! 💯
Left some minor comments.

Review comments on clients/typescript/src/drivers/node-postgres/database.ts and clients/typescript/src/satellite/process.ts (resolved)
@davidmartos96 (Contributor, Author)

@kevin-dp Done!

msfstef merged commit 237e323 into electric-sql:main May 15, 2024
8 of 10 checks passed
davidmartos96 deleted the race_cond branch May 16, 2024 08:06
alco pushed a commit that referenced this pull request May 19, 2024