Refactor DocumentRoom, throttle saves instead of debouncing saves #250

Draft · dlqqq wants to merge 9 commits into main
Conversation

@dlqqq (Collaborator) commented Mar 15, 2024

Description

  • Refactors the DocumentRoom class to not rely on the asyncio.Lock context manager. This greatly improves readability by reducing nesting and reliance on the asyncio API.

  • Throttles saves instead of debouncing them, ensuring that document changes are flushed to disk at a regular minimum interval. Related: Automatic file save strategies #244

    • This PR may fix that issue, but I think consensus is needed on that.
  • All existing unit tests pass locally.

Change summary

  • Removes self._initialization_lock.

    • This is not necessary as long as the initialize() method is only called once and awaited, as is the case in this extension. This is a very reasonable usage constraint. Furthermore, in asyncio, locks do not provide thread safety, contrary to the original docstring of this method.
  • Removes self._update_lock.

    • The purpose of this lock was to prevent self._on_document_change() from running while the lock was held. Naively removing the lock results in a save loop, since without it, self._maybe_save_document() would trigger self._on_document_change().
    • However, the save loop was caused by a single self._document.dirty = False statement, which could have been removed without consequence. See below.
  • Throttles saves, rather than debouncing them by cancelling the previous task

    • Throttling seems preferable to debouncing here, as debouncing could result in the document never being saved while it is changing too frequently, which may happen in rooms with many collaborators.

    • Previously, every time a new self._maybe_save_document() task was started, the previous task was cancelled if it was in progress. The method required the previous task as an argument, which resulted in an awkward call pattern:

      self._saving_document = asyncio.create_task(
          self._maybe_save_document(self._saving_document)
      )
    • The new implementation of self._maybe_save_document() does not require an extra argument or task.cancel(); instead it relies on knowledge of its own state, stored in a couple of new instance attributes. Here is an overview of the algorithm (see the sketch after this list):

      • If a previous self._maybe_save_document() task is waiting on self._save_delay, then the current task can return early, as that previous task will save the Ydoc anyway.
      • If a previous self._maybe_save_document() task is currently saving via FileLoader, then the current task should set self._should_resave = True and then return. Later, when the previous task is done saving, if this attribute is True, it will re-run itself via asyncio.create_task(self._maybe_save_document()).
  • Removed self._document.dirty = False statements

    • This dirty attribute was only referenced in a single unit test, and I could not find it mentioned in pycrdt or pycrdt-websocket. I removed this because setting self._document.dirty triggers the _on_document_change() observer, causing a save loop in this branch. Removing the statement from self._maybe_save_document() allows self._update_lock to be removed.
    • It seems preferable to avoid triggering a document change when handling one, rather than relying on a lock.
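
To make the save algorithm above concrete, here is a minimal sketch of the throttling logic. This is not the exact code in this branch: _save_delay and _should_resave are named in the description above, but _waiting, _saving, and _save_to_disk() are illustrative stand-ins (the real implementation saves via FileLoader).

    import asyncio

    class DocumentRoom:
        def __init__(self, save_delay: float = 1.0):
            self._save_delay = save_delay
            self._waiting = False        # a task is sleeping through the delay
            self._saving = False         # a task is writing to disk
            self._should_resave = False  # changes arrived while saving

        async def _maybe_save_document(self) -> None:
            if self._waiting:
                # A pending task will save soon and will pick up this change.
                return
            if self._saving:
                # A save is in flight; ask it to run one more save when done.
                self._should_resave = True
                return

            self._waiting = True
            await asyncio.sleep(self._save_delay)  # throttle window
            self._waiting = False

            self._saving = True
            try:
                await self._save_to_disk()
            finally:
                self._saving = False
                if self._should_resave:
                    self._should_resave = False
                    asyncio.create_task(self._maybe_save_document())

        async def _save_to_disk(self) -> None:
            ...  # stand-in for the FileLoader save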

welcome bot commented Mar 15, 2024

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.

You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

Contributor commented:

Binder 👈 Launch a Binder on branch dlqqq/jupyter-collaboration/refactor-room

@davidbrochart (Collaborator) left a comment:
At first sight, this breaks the dirty indicator in JupyterLab's UI.
I created a test that passes on main and fails with this PR.

@davidbrochart (Collaborator) commented:
> Removes self._initialization_lock.
>
>   • This is not necessary as long as the initialize() method is only called once and awaited, as is the case in this extension. This is a very reasonable usage constraint. Furthermore, in asyncio, locks do not provide thread safety, contrary to the original docstring of this method.

How do you handle the case where the room is created at the same time from two different clients?

> Removes self._update_lock.
>
>   • The purpose of this lock was to prevent self._on_document_change() from running while the lock was held. Naively removing the lock results in a save loop, since without it, self._maybe_save_document() would trigger self._on_document_change().
>   • However, the save loop was caused by a single self._document.dirty = False statement, which could have been removed without consequence. See below.

Removing self._document.dirty = False had a consequence.

> Throttles saves, rather than debouncing them by cancelling the previous task
>
>   • The new implementation of self._maybe_save_document() does not require an extra argument or task.cancel(); instead it relies on knowledge of its own state, stored in a couple of new instance attributes. Here is an overview of the algorithm:
>
>     • If a previous self._maybe_save_document() task is waiting on self._save_delay, then the current task can return early, as that previous task will save the Ydoc anyway.
>     • If a previous self._maybe_save_document() task is currently saving via FileLoader, then the current task should set self._should_resave = True and then return. Later, when the previous task is done saving, if this attribute is True, it will re-run itself via asyncio.create_task(self._maybe_save_document()).

I don't think it's any simpler than previously, as there is now a possible recursive call to self._maybe_save_document().
I don't quite see how the class attributes this method is using are not going to collide with nested calls of this method.

> Removed self._document.dirty = False statements
>
>   • This dirty attribute was only referenced in a single unit test, and I could not find it mentioned in pycrdt or pycrdt-websocket. I removed this because setting self._document.dirty triggers the _on_document_change() observer, causing a save loop in this branch. Removing the statement from self._maybe_save_document() allows self._update_lock to be removed.

The dirty attribute is used by frontends such as JupyterLab to set/clear a dirty indicator, just like Google Docs shows "Saving..."/"Saved to Drive".

@dlqqq (Collaborator, Author) commented Mar 19, 2024

> The dirty attribute is used by frontends such as JupyterLab to set/clear a dirty indicator, just like Google Docs shows "Saving..."/"Saved to Drive".

Where is dirty being set to True? I cannot find any mention of this in the source.

@davidbrochart (Collaborator) commented:

To clarify, JupyterLab's UI sets a graphical dirty indicator whenever a change occurs, and listens to the shared model dirty attribute to clear this dirty indicator (on False). So even though the shared model dirty attribute is never set to True, it cannot be removed.

dlqqq marked this pull request as draft on March 20, 2024 at 00:28, then marked it as ready for review at 00:44.
@dlqqq (Collaborator, Author) commented Mar 20, 2024

@davidbrochart Thank you for your feedback! It has allowed me to simplify and correct the implementation in this branch. Let me address the rest of your feedback here:

> How do you handle the case where the room is created at the same time from two different clients?

The current implementation doesn't handle this case either. Having a lock within DocumentRoom.initialize() won't stop the main body of the method from being run more than once concurrently, since the lock is local to each DocumentRoom instance. Consider the example:

import asyncio

room1 = DocumentRoom(...)
room2 = DocumentRoom(...)

async with asyncio.TaskGroup() as tg:  # asyncio.TaskGroup requires Python 3.11+
    task1 = tg.create_task(room1.initialize())
    task2 = tg.create_task(room2.initialize())

This will still result in the main body of initialize() being run twice concurrently in the event loop. If you really want to prevent this, you would need to pass a lock from an outer scope as an argument to the constructor, then have the rooms wait on that outer lock before running the rest of initialize().
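
For illustration, a shared outer lock could look like this (a hypothetical sketch, not part of this PR; the creation_lock name and constructor signature are assumptions):

    import asyncio

    # One lock shared by all rooms, created in an outer scope.
    creation_lock = asyncio.Lock()

    class DocumentRoom:
        def __init__(self, creation_lock: asyncio.Lock):
            self._creation_lock = creation_lock

        async def initialize(self) -> None:
            # Even distinct room instances now serialize their initialization,
            # because they all wait on the same lock object.
            async with self._creation_lock:
                ...  # main body of initialize()

    room1 = DocumentRoom(creation_lock)
    room2 = DocumentRoom(creation_lock)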

I don't think this is an issue, since Tornado awaits the WebsocketHandler.open() method where DocumentRoom.initialize() is called: https://github.com/tornadoweb/tornado/blob/464f6084a5e18a01b46f77c011b1bbffe1f4d96c/tornado/websocket.py#L937-L940. This means that at most one DocumentRoom.initialize() method should be running at any time on the event loop, which means that we shouldn't need to implement an outer lock.

However, we should probably open an issue with Tornado to verify that this method isn't ultimately run as a concurrent task somewhere upstream.

> The dirty attribute is used by frontends such as JupyterLab to set/clear a dirty indicator, just like Google Docs shows "Saving..."/"Saved to Drive".

Thanks for clarifying! I've reverted that change and added some logic to prevent a save loop without needing a lock.

> I don't think it's any simpler than previously, as there is now a possible recursive call to self._maybe_save_document().
> I don't quite see how the class attributes this method is using are not going to collide with nested calls of this method.

Only one self._maybe_save_document() task runs at any given time in the current implementation, so using instance attributes is safe here. BTW, I changed the logic to be clearer, so please take another look! I think the implementation proposed by this PR is definitely simpler and more correct with regard to how DocumentRoom saves the Ydoc in response to user updates. 👍

@dlqqq (Collaborator, Author) commented Mar 20, 2024

Proof that this PR correctly sets the dirty attribute:

Screen.Recording.2024-03-19.at.5.43.22.PM.mov

@dlqqq (Collaborator, Author) commented Mar 20, 2024

This PR seems to set the dirty indicator more responsively than the main branch, shown below:

Screen.Recording.2024-03-19.at.6.12.15.PM.mov

I think this is mainly due to the performance benefits of simplifying the async code and removing the lock waits.

@davidbrochart (Collaborator) commented:

Just opening JupyterLab and creating a new notebook, it seems the notebook is saved twice without any modification:

[I 2024-03-20 09:36:49.612 ServerApp] Connecting to kernel 727323aa-f8b0-4cac-aff6-dcccabb68e84.
[I 2024-03-20 09:36:49.892 ServerApp] Saving the content from room json:notebook:e952e112-3652-40b2-a4e6-27f90cd35f4c
[I 2024-03-20 09:36:49.895 YDocExtension] Saving file: Untitled.ipynb
[I 2024-03-20 09:36:50.918 ServerApp] Saving the content from room json:notebook:e952e112-3652-40b2-a4e6-27f90cd35f4c
[I 2024-03-20 09:36:50.922 YDocExtension] Saving file: Untitled.ipynb

Can you explain why?

@davidbrochart (Collaborator) commented:

> The current implementation doesn't handle this case either. Having a lock within DocumentRoom.initialize() won't stop the main body of the method from being run more than once concurrently, since the lock is local to each DocumentRoom instance.

Good point. I opened #255 which includes a test for concurrent room creation, which fails on this PR.

> BTW, I changed the logic to be clearer, so please take another look!

I still think that the recursive call to _maybe_save_document() is a code smell.

@dlqqq (Collaborator, Author) commented Mar 20, 2024

@davidbrochart Thanks for the feedback! Let me address it here:

> I still think that the recursive call to _maybe_save_document() is a code smell.

I'm happy to change this, since I'm personally indifferent. I've changed this in the latest revision; this branch now uses a Task.add_done_callback() to avoid a recursive coroutine call. See 9f55400.
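
Roughly, the shape of that change is the following (a sketch, not the exact code in 9f55400; _schedule_save() and _on_save_done() are illustrative names, while _should_resave and _maybe_save_document() come from the discussion above):

    import asyncio

    class DocumentRoom:
        def __init__(self) -> None:
            self._should_resave = False

        async def _maybe_save_document(self) -> None:
            ...  # throttled save, as described earlier in the thread

        def _schedule_save(self) -> None:
            task = asyncio.create_task(self._maybe_save_document())
            task.add_done_callback(self._on_save_done)

        def _on_save_done(self, task: asyncio.Task) -> None:
            # Runs when the save task completes. If changes arrived mid-save,
            # schedule exactly one follow-up save instead of recursing from
            # inside the coroutine itself.
            if self._should_resave:
                self._should_resave = False
                self._schedule_save()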

> Just opening JupyterLab and creating a new notebook, it seems the notebook is saved twice without any modification:
> Can you explain why?

Sure! The implementation on main had a self._update_lock, and whenever self._on_document_change() was called while this lock was held, that update was ignored entirely and not written to disk until another update occurred while the lock was free. This meant that on main, some document updates would not trigger self._maybe_save_document().

The current implementation ensures that all updates are written to disk by using the self._should_resave flag. Multiple Ydoc updates occur in quick succession when the room is initialized, so this ultimately results in two saves.

I tried to determine what Ydoc updates were occurring when the room is initialized (to see if they can be removed), but for some reason, I wasn't able to access every MapEvent and ArrayEvent object. I added the following log statements:

    def _on_document_change(self, target: str, event: Any) -> None:
        """
        ...
        """
        # Temporary debug logging; assumes MapEvent was imported,
        # e.g. `from pycrdt import MapEvent`.
        self.log.error(target)
        self.log.error(event)
        if isinstance(event, MapEvent):
            self.log.error(event.keys)
For some reason, isinstance(event, MapEvent) evaluates to False, even though it seems like some of these events should be MapEvent objects. Here are the logged document updates that are resulting in a double-write:

[E 2024-03-20 11:03:32.171 ServerApp] cells
[E 2024-03-20 11:03:32.171 ServerApp] [<pycrdt.array.ArrayEvent object at 0x10cae9900>]
[E 2024-03-20 11:03:32.171 ServerApp] cells
[E 2024-03-20 11:03:32.171 ServerApp] [<pycrdt.array.ArrayEvent object at 0x10cae9900>]
[E 2024-03-20 11:03:32.171 ServerApp] meta
[E 2024-03-20 11:03:32.171 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9e40>]
[E 2024-03-20 11:03:32.171 ServerApp] meta
[E 2024-03-20 11:03:32.171 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9e40>]
[E 2024-03-20 11:03:32.172 ServerApp] meta
[E 2024-03-20 11:03:32.172 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9bc0>]
[E 2024-03-20 11:03:32.172 ServerApp] meta
[E 2024-03-20 11:03:32.172 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9bc0>]
[E 2024-03-20 11:03:32.172 ServerApp] meta
[E 2024-03-20 11:03:32.173 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9880>]
[E 2024-03-20 11:03:32.173 ServerApp] meta
[E 2024-03-20 11:03:32.173 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9880>]

I don't think this issue should block this PR, however; we should track it in a separate issue and address it later if possible.

@Zsailer (Member) commented Mar 20, 2024

High level comments...

I do think this PR dramatically improves the readability of the code. Great job there, @dlqqq!

Reviewing the code, though, it makes me a bit uneasy to change so much under the hood of the DocumentRoom without more unit test coverage (and possibly some integration tests). The current codebase has been "cooking" in a released state for a little while, so I have some confidence—albeit limited—that it works relatively well. We've heard edge-case issues, but don't have unit tests to measure whether we're getting any closer to a better state with these changes.

Before merging impactful rewrites like this, I'd prefer we increase our test coverage first.

> I'm happy to change this, since I'm personally indifferent. I've changed this in the latest revision; this branch now uses a Task.add_done_callback() to avoid a recursive coroutine call. See 9f55400.

Personally—if we're aiming for readability—I like this new commit over the recursive _maybe_save_document. I agree with @davidbrochart that this recursion triggers some concern for me as well, though it's hard to tell without some better ways to test.

@dlqqq (Collaborator, Author) commented Mar 20, 2024

@Zsailer I can add more unit test coverage in this PR. 👍

dlqqq marked this pull request as draft on March 20, 2024 at 23:22.
@davidbrochart (Collaborator) commented:

@dlqqq Thanks for identifying the bug in concurrent room initialization. A fix for it was merged in #255, and this PR now has conflicts, but it shouldn't be a problem since you are essentially rewriting the DocumentRoom. Please go ahead and make the changes you want, keeping in mind that the test added in #255 must pass with this PR.
