
fix(client): Better throttle snapshot stop strategy #1251

Merged
merged 13 commits into electric-sql:main from race_cond on May 15, 2024

Conversation

@davidmartos96 (Contributor) commented May 9, 2024

We've encountered a race condition that can occur if, after cancelling the throttle when stopping the process, some other piece of code calls the throttled snapshot. That's because cancelling the throttle doesn't prevent new invocations from being scheduled.

The proposed solution is to assign undefined to the function, so that the places that would normally trigger a snapshot, like the poll interval, simply do nothing once the process is being stopped.

On top of this, we found that setClientListeners is only called when instantiating the process, but it is cleaned up when calling stop via the disconnect function. As a result, stopping and then restarting the process leaves some of the client listeners unset. The proposed solution is to move the client listener initialization into the start function.

I'm not sure how to test this. It would probably need some way to mock client events after a stop and a start.
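
Here is a minimal sketch of the throttle part of this idea; the class and member names are illustrative rather than the actual ones in process.ts, and it assumes the throttled function is created with lodash's throttle:

import throttle from 'lodash.throttle'

class SatelliteLike {
  // Optional so that it can be unset when stopping.
  private throttledSnapshot?: (() => void) & { cancel: () => void }

  start(): void {
    this.throttledSnapshot = throttle(() => this.performSnapshot(), 5000)
  }

  // Called from the poll interval and the notifier/client subscriptions.
  maybeSnapshot(): void {
    // After stop() this is a no-op instead of scheduling against a closed db.
    this.throttledSnapshot?.()
  }

  stop(): void {
    // cancel() only drops the pending trailing invocation; a later call to
    // the throttled function would still schedule a new snapshot, which is
    // why the function itself is unset as well.
    this.throttledSnapshot?.cancel()
    this.throttledSnapshot = undefined
  }

  private performSnapshot(): void {
    // run the actual snapshot
  }
}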

@msfstef (Contributor) commented May 13, 2024

Thanks for spotting this issue and submitting a fix!

Although unsetting the throttled snapshot resolves the issue, I am a bit reluctant to use that as the solution because, if it's getting called after termination, we ultimately have a deeper timing issue that should probably be solved instead.

The way I understand it, the throttled snapshot is called in four places:

  1. when the process is started
  2. on a polling interval
  3. as a response to the notifier "potential data changes" subscription
  4. as a response to the client "outbound started" subscription

I think one main issue right now is that in _stop we first wait for the lock to be acquired before clearing the polling interval, notifier subscriptions, and client subscriptions. So, until any current snapshot is done running and releases the lock, all the schedulers can still schedule snapshots (or do any other work they are scheduled to do), which is why I think it's better to fix the order of operations.

To solve 2) and 3), I think it should work to move the lock acquisition, which is there to wait on any current snapshots, to after clearing the interval and notifier subscriptions:

// Ensure that no snapshot is left running in the background
// by acquiring the mutex and releasing it immediately.
const releaseMutex = await this.snapshotMutex.acquire()
releaseMutex()

To solve 4), I think there needs to be an additional, separate call to unsubscribe from the client listeners (since client.shutdown is not necessarily called and thus does not necessarily clear the scheduler for snapshots), similar to the additional fix you've made to call setClientListeners in the start method.
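
Putting 2), 3), and 4) together, the reordered _stop could then look roughly like the sketch below; the field and helper names here are placeholders, not the actual ones in process.ts:

async _stop(shutdown?: boolean): Promise<void> {
  // Stop all schedulers first so nothing new can be queued:
  if (this._pollingInterval !== undefined) {
    clearInterval(this._pollingInterval) // case 2)
    this._pollingInterval = undefined
  }
  this._unsubscribeFromPotentialDataChanges?.() // notifier, case 3)
  this._unsubscribeFromClientEvents?.()         // client listeners, case 4)

  // Only then wait for any in-flight snapshot to finish,
  // by acquiring the mutex and releasing it immediately.
  const releaseMutex = await this.snapshotMutex.acquire()
  releaseMutex()

  if (shutdown) {
    await this.client.shutdown()
  }
}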

The 1) case is only an issue if stop is called before start finishes executing, which is really an edge case. There are a few ways we could handle it, but I think it can be split out as a separate fix; I suspect the issue you're running into stems from the scheduling of calls, as you've said.

Let me know if this sounds reasonable to you! The solution you've proposed should work, but we'd rather solve the underlying issues first, as they could cause problems elsewhere.

P.S. A test would indeed be great for catching this; I believe a timing mock for 2), a notifier event mock for 3), or a client event mock for 4) should be able to capture this error.
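
For case 3), for instance, a rough AVA sketch could look something like the snippet below; the notifier accessor and the potentiallyChanged call are assumptions about the mock setup rather than the actual test helpers:

test('no snapshot is scheduled after stop', async (t) => {
  const { satellite, authState } = t.context

  await startSatellite(satellite, authState, insecureAuthToken({ sub: 'test-user' }))
  await satellite.stop()

  await t.notThrowsAsync(async () => {
    // Emitting a "potential data changes" event after stop should now be a
    // no-op rather than scheduling a snapshot against a stopped process.
    satellite.notifier.potentiallyChanged() // assumed accessor/method
    // Give any (incorrectly) scheduled throttled snapshot a chance to fire.
    await new Promise((resolve) => setTimeout(resolve, 200))
  })
})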

msfstef self-assigned this May 13, 2024
msfstef self-requested a review May 13, 2024 09:41
@davidmartos96 (Contributor, Author)

@msfstef Thank you for reviewing!
You are right that moving the "wait for snapshot" code to the end would solve the scheduling issues.

While writing the test I noticed that I wasn't able to properly trigger the issue. The problem is that the clean function in the context doesn't actually close the underlying db connection, so even if the db files are removed the connection still works, which prevented us from seeing the issue of snapshotting against a closed db.

I used the same mechanism as for Postgres and attached the stop function to the SQLite tests context. Making this change already manifested problem 1) in some of the tests, so I included my proposed solution for that here as well.
I opted for a startProcess promise which needs to be awaited before stopping.
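
A minimal sketch of that startProcess promise idea, with illustrative names rather than the actual ones in process.ts:

class SatelliteLike {
  private startPromise?: Promise<void>

  start(): Promise<void> {
    this.startPromise = this.doStart()
    return this.startPromise
  }

  async stop(): Promise<void> {
    // Wait for start to finish before tearing anything down, so that
    // stop cannot race a still-running start (problem 1 above).
    if (this.startPromise !== undefined) {
      await this.startPromise
    }
    // ...clear schedulers, wait for any in-flight snapshot, etc.
  }

  private async doStart(): Promise<void> {
    // connect, set client listeners, trigger the initial snapshot, ...
  }
}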

@davidmartos96 (Contributor, Author)

It looks like some pglite tests can now fail because the db connection is actually closed when tearing down each test.

For instance, in the simple test below the connection promise is not awaited, so by the time startReplication runs the db is already closed and reading the meta table fails. Should we update those tests to actually wait for the connection to finish? Or should this be fixed somehow in the source code? Maybe, if the process is stopped, we shouldn't propagate the errors from the initializing promise.

test('can use sub in JWT', async (t) => {
  const { satellite, authState } = t.context

  await t.notThrowsAsync(async () => {
    await startSatellite(
      satellite,
      authState,
      insecureAuthToken({ sub: 'test-userB' })
    )
  })
})

@msfstef (Contributor) commented May 14, 2024

@davidmartos96 thank you for updating the tests and everything!

You're right, there are some tests that are not waiting for the connection promise to fulfil and that causes issues. From my testing, it seems the following tests in process.ts require this kind of fix:

await t.notThrowsAsync(async () => {
  const { connectionPromise } = await startSatellite(
    satellite,
    authState,
    insecureAuthToken({ user_id: 'test-userA' })
  )
  await connectionPromise
})
  • set persistent client id
  • can use sub in JWT
  • require user_id or sub in JWT
  • cannot update user id (needs to await connection after failure assertion)

There seems to be one additional failing test for Postgres, which might require the following fix in drivers/node-postgres/database.ts:

  let stopPromise: Promise<void> | undefined

  // We use the database directory as the name
  // because it uniquely identifies the DB
  return {
    db,
    stop: () => {
      // Memoize the stop promise so that calling stop() more than once
      // doesn't try to stop the embedded Postgres again.
      if (stopPromise) return stopPromise
      stopPromise = pg.stop()
      return stopPromise
    },
  }

With those fixes in place it seems the tests are good! I will review the updated changes separately.

@msfstef (Contributor) left a comment

Looks good! Left some comments about ordering and some nits, if you could add a changeset for this as well that would be great! (pnpm changeset in the root directory, should be a patch change)

Review comments on clients/typescript/src/satellite/process.ts (resolved)
@davidmartos96 (Contributor, Author)

@msfstef Tests seem to be ok now, thank you for the comments!

@msfstef (Contributor) left a comment

Looks good, thank you for your contribution! This was a good catch!

Review comment on clients/typescript/src/satellite/process.ts (resolved)
@kevin-dp (Contributor) left a comment

Great catch and solution, awesome! 💯
Left some minor comments.

Review comments on clients/typescript/src/drivers/node-postgres/database.ts and clients/typescript/src/satellite/process.ts (resolved)
@davidmartos96 (Contributor, Author)

@kevin-dp Done!

msfstef merged commit 237e323 into electric-sql:main May 15, 2024
8 of 10 checks passed
davidmartos96 deleted the race_cond branch May 16, 2024 08:06
alco pushed a commit that referenced this pull request May 19, 2024