Refactor DocumentRoom, throttle saves instead of debouncing saves #250

Draft · dlqqq wants to merge 9 commits into main
Conversation

@dlqqq (Collaborator) commented Mar 15, 2024

Description

  • Refactors the DocumentRoom class to not rely on the asyncio.Lock context manager. This greatly improves readability by reducing nesting and reliance on the asyncio API.

  • Throttles saves instead of debouncing them, ensuring that document changes are flushed to disk at a regular minimum interval. Related: Automatic file save strategies #244

    • This PR may fix that issue, but I think consensus is needed on that.
  • All existing unit tests pass locally.

Change summary

  • Removes self._initialization_lock.

    • This is not necessary as long as the initialize() method is only called once and awaited, as is the case in this extension. This is a very reasonable usage constraint. Furthermore, in asyncio, locks do not provide thread safety, contrary to the original docstring of this method.
  • Removes self._update_lock.

    • The purpose of this lock was to prevent self._on_document_change() from running while the lock was held. Naively removing the lock results in a save loop, since without it, self._maybe_save_document() would trigger self._on_document_change().
    • However, the save loop was caused by a single self._document.dirty = False statement, which could have been removed without consequence. See below.
  • Throttles saves, rather than debouncing them by cancelling the previous task

    • Throttling seems preferable to debouncing here, as debouncing could result in the document never being saved while it is changing too frequently, which may happen in rooms with many collaborators.

    • Previously, every time a new self._maybe_save_document() task was started, the previous task was cancelled if it was in progress. The method required the previous task as an argument, which resulted in an awkward call pattern:

      self._saving_document = asyncio.create_task(
          self._maybe_save_document(self._saving_document)
      )
    • The new implementation of self._maybe_save_document() does not require an extra argument or task.cancel(); instead it relies on knowledge of its own state, stored in a couple of new instance attributes. Here is an overview of the algorithm (see the sketch after this list):

      • If a previous self._maybe_save_document() task is waiting on self._save_delay, then the current task can return early, as that previous task will save the Ydoc anyway.
      • If a previous self._maybe_save_document() task is currently saving via FileLoader, then the current task should set self._should_resave = True and then return. Later, when the previous task is done saving, if this attribute is True, it will re-run itself via asyncio.create_task(self._maybe_save_document()).
  • Removed self._document.dirty = False statements

    • This dirty attribute was only referenced in a single unit test, and I could not find it mentioned in pycrdt or pycrdt-websocket. I removed this because setting self._document.dirty triggers the _on_document_change() observer, causing a save loop in this branch. Removing the statement from self._maybe_save_document() allows self._update_lock to be removed.
    • It seems preferable to avoid triggering a document change when handling one, rather than relying on a lock.
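
To make the save algorithm above concrete, here is a minimal sketch of the throttling logic. This is not the exact code in this branch: _save_delay and _should_resave are named in the description above, but _waiting, _saving, and _save_to_disk() are illustrative stand-ins (the real implementation saves via FileLoader).

    import asyncio

    class DocumentRoom:
        def __init__(self, save_delay: float = 1.0):
            self._save_delay = save_delay
            self._waiting = False        # a task is sleeping through the delay
            self._saving = False         # a task is writing to disk
            self._should_resave = False  # changes arrived while saving

        async def _maybe_save_document(self) -> None:
            if self._waiting:
                # A pending task will save soon and will pick up this change.
                return
            if self._saving:
                # A save is in flight; ask it to run one more save when done.
                self._should_resave = True
                return

            self._waiting = True
            await asyncio.sleep(self._save_delay)  # throttle window
            self._waiting = False

            self._saving = True
            try:
                await self._save_to_disk()
            finally:
                self._saving = False
                if self._should_resave:
                    self._should_resave = False
                    asyncio.create_task(self._maybe_save_document())

        async def _save_to_disk(self) -> None:
            ...  # stand-in for the FileLoader save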

welcome bot commented Mar 15, 2024

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.

You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

Contributor commented:

Binder 👈 Launch a Binder on branch dlqqq/jupyter-collaboration/refactor-room

@davidbrochart (Collaborator) left a comment:
At first sight, this breaks the dirty indicator in JupyterLab's UI.
I created a test that passes on main and fails with this PR.

@davidbrochart (Collaborator) commented:
> Removes self._initialization_lock.
>
>   • This is not necessary as long as the initialize() method is only called once and awaited, as is the case in this extension. This is a very reasonable usage constraint. Furthermore, in asyncio, locks do not provide thread safety, contrary to the original docstring of this method.

How do you handle the case where the room is created at the same time from two different clients?

> Removes self._update_lock.
>
>   • The purpose of this lock was to prevent self._on_document_change() from running while the lock was held. Naively removing the lock results in a save loop, since without it, self._maybe_save_document() would trigger self._on_document_change().
>   • However, the save loop was caused by a single self._document.dirty = False statement, which could have been removed without consequence. See below.

Removing self._document.dirty = False had a consequence.

> Throttles saves, rather than debouncing them by cancelling the previous task
>
>   • The new implementation of self._maybe_save_document() does not require an extra argument or task.cancel(); instead it relies on knowledge of its own state, stored in a couple of new instance attributes. Here is an overview of the algorithm:
>
>     • If a previous self._maybe_save_document() task is waiting on self._save_delay, then the current task can return early, as that previous task will save the Ydoc anyway.
>     • If a previous self._maybe_save_document() task is currently saving via FileLoader, then the current task should set self._should_resave = True and then return. Later, when the previous task is done saving, if this attribute is True, it will re-run itself via asyncio.create_task(self._maybe_save_document()).

I don't think it's any simpler than previously, as there is now a possible recursive call to self._maybe_save_document().
I don't quite see how the class attributes this method is using are not going to collide with nested calls of this method.

> Removed self._document.dirty = False statements
>
>   • This dirty attribute was only referenced in a single unit test, and I could not find it mentioned in pycrdt or pycrdt-websocket. I removed this because setting self._document.dirty triggers the _on_document_change() observer, causing a save loop in this branch. Removing the statement from self._maybe_save_document() allows self._update_lock to be removed.

The dirty attribute is used by frontends such as JupyterLab to set/clear a dirty indicator, just like Google Docs shows "Saving..."/"Saved to Drive".

@dlqqq (Collaborator, Author) commented Mar 19, 2024

> The dirty attribute is used by frontends such as JupyterLab to set/clear a dirty indicator, just like Google Docs shows "Saving..."/"Saved to Drive".

Where is dirty being set to True? I cannot find any mention of this in the source.

@davidbrochart (Collaborator) commented:

To clarify, JupyterLab's UI sets a graphical dirty indicator whenever a change occurs, and listens to the shared model dirty attribute to clear this dirty indicator (on False). So even though the shared model dirty attribute is never set to True, it cannot be removed.

dlqqq marked this pull request as draft on March 20, 2024 at 00:28, then marked it as ready for review at 00:44.
@dlqqq (Collaborator, Author) commented Mar 20, 2024

@davidbrochart Thank you for your feedback! It has allowed me to simplify and correct the implementation in this branch. Let me address the rest of your feedback here:

> How do you handle the case where the room is created at the same time from two different clients?

The current implementation doesn't handle this case either. Having a lock within DocumentRoom.initialize() won't stop the main body of the method from being run more than once concurrently, since the lock is local to each DocumentRoom instance. Consider the example:

import asyncio

room1 = DocumentRoom(...)
room2 = DocumentRoom(...)

async with asyncio.TaskGroup() as tg:  # asyncio.TaskGroup requires Python 3.11+
    task1 = tg.create_task(room1.initialize())
    task2 = tg.create_task(room2.initialize())

This will still result in the main body of initialize() being run twice concurrently in the event loop. If you really want to prevent this, you would need to pass a lock from an outer scope as an argument to the constructor, then have the rooms wait on that outer lock before running the rest of initialize().
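
For illustration, a shared outer lock could look like this (a hypothetical sketch, not part of this PR; the creation_lock name and constructor signature are assumptions):

    import asyncio

    # One lock shared by all rooms, created in an outer scope.
    creation_lock = asyncio.Lock()

    class DocumentRoom:
        def __init__(self, creation_lock: asyncio.Lock):
            self._creation_lock = creation_lock

        async def initialize(self) -> None:
            # Even distinct room instances now serialize their initialization,
            # because they all wait on the same lock object.
            async with self._creation_lock:
                ...  # main body of initialize()

    room1 = DocumentRoom(creation_lock)
    room2 = DocumentRoom(creation_lock)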

I don't think this is an issue, since Tornado awaits the WebsocketHandler.open() method where DocumentRoom.initialize() is called: https://github.com/tornadoweb/tornado/blob/464f6084a5e18a01b46f77c011b1bbffe1f4d96c/tornado/websocket.py#L937-L940. This means that at most one DocumentRoom.initialize() method should be running at any time on the event loop, which means that we shouldn't need to implement an outer lock.

However, we should probably open an issue with Tornado to verify that this method isn't ultimately run as a concurrent task somewhere upstream.

> The dirty attribute is used by frontends such as JupyterLab to set/clear a dirty indicator, just like Google Docs shows "Saving..."/"Saved to Drive".

Thanks for clarifying! I've reverted that change and added some logic to prevent a save loop without needing a lock.

> I don't think it's any simpler than previously, as there is now a possible recursive call to self._maybe_save_document().
> I don't quite see how the class attributes this method is using are not going to collide with nested calls of this method.

Only one self._maybe_save_document() task runs at any given time in the current implementation, so using instance attributes is safe here. BTW, I changed the logic to be clearer, so please take another look! I think the implementation proposed by this PR is definitely simpler and more correct with regard to how DocumentRoom saves the Ydoc in response to user updates. 👍

@dlqqq (Collaborator, Author) commented Mar 20, 2024

Proof that this PR correctly sets the dirty attribute:

Screen.Recording.2024-03-19.at.5.43.22.PM.mov

@dlqqq (Collaborator, Author) commented Mar 20, 2024

This PR seems to set the dirty indicator more responsively than the main branch, shown below:

Screen.Recording.2024-03-19.at.6.12.15.PM.mov

I think this is mainly due to the performance benefits of simplifying the async code and removing the lock waits.

@davidbrochart (Collaborator) commented:

Just opening JupyterLab and creating a new notebook, it seems the notebook is saved twice without any modification:

[I 2024-03-20 09:36:49.612 ServerApp] Connecting to kernel 727323aa-f8b0-4cac-aff6-dcccabb68e84.
[I 2024-03-20 09:36:49.892 ServerApp] Saving the content from room json:notebook:e952e112-3652-40b2-a4e6-27f90cd35f4c
[I 2024-03-20 09:36:49.895 YDocExtension] Saving file: Untitled.ipynb
[I 2024-03-20 09:36:50.918 ServerApp] Saving the content from room json:notebook:e952e112-3652-40b2-a4e6-27f90cd35f4c
[I 2024-03-20 09:36:50.922 YDocExtension] Saving file: Untitled.ipynb

Can you explain why?

@davidbrochart (Collaborator) commented:

> The current implementation doesn't handle this case either. Having a lock within DocumentRoom.initialize() won't stop the main body of the method from being run more than once concurrently, since the lock is local to each DocumentRoom instance.

Good point. I opened #255 which includes a test for concurrent room creation, which fails on this PR.

> BTW, I changed the logic to be clearer, so please take another look!

I still think that the recursive call to _maybe_save_document() is a code smell.

@dlqqq (Collaborator, Author) commented Mar 20, 2024

@davidbrochart Thanks for the feedback! Let me address it here:

> I still think that the recursive call to _maybe_save_document() is a code smell.

I'm happy to change this, since I'm personally indifferent. I've changed this in the latest revision; this branch now uses a Task.add_done_callback() to avoid a recursive coroutine call. See 9f55400.
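
Roughly, the shape of that change is the following (a sketch, not the exact code in 9f55400; _schedule_save() and _on_save_done() are illustrative names, while _should_resave and _maybe_save_document() come from the discussion above):

    import asyncio

    class DocumentRoom:
        def __init__(self) -> None:
            self._should_resave = False

        async def _maybe_save_document(self) -> None:
            ...  # throttled save, as described earlier in the thread

        def _schedule_save(self) -> None:
            task = asyncio.create_task(self._maybe_save_document())
            task.add_done_callback(self._on_save_done)

        def _on_save_done(self, task: asyncio.Task) -> None:
            # Runs when the save task completes. If changes arrived mid-save,
            # schedule exactly one follow-up save instead of recursing from
            # inside the coroutine itself.
            if self._should_resave:
                self._should_resave = False
                self._schedule_save()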

> Just opening JupyterLab and creating a new notebook, it seems the notebook is saved twice without any modification:
> Can you explain why?

Sure! The implementation on main had a self._update_lock, and whenever self._on_document_change() was called while this lock was held, that update was ignored entirely and not written to disk until another update occurred while the lock was free. This meant that on main, some document updates would not trigger self._maybe_save_document().

The current implementation ensures that all updates are written to disk by using the self._should_resave flag. Multiple Ydoc updates occur in quick succession when the room is initialized, so this ultimately results in two saves.

I tried to determine what Ydoc updates were occurring when the room is initialized (to see if they can be removed), but for some reason, I wasn't able to access every MapEvent and ArrayEvent object. I added the following log statements:

    def _on_document_change(self, target: str, event: Any) -> None:
        """
        ...
        """
        # Temporary debug logging; assumes MapEvent was imported,
        # e.g. `from pycrdt import MapEvent`.
        self.log.error(target)
        self.log.error(event)
        if isinstance(event, MapEvent):
            self.log.error(event.keys)
For some reason, isinstance(event, MapEvent) evaluates to False, even though it seems like some of these events should be MapEvent objects. Here are the logged document updates that are resulting in a double-write:

[E 2024-03-20 11:03:32.171 ServerApp] cells
[E 2024-03-20 11:03:32.171 ServerApp] [<pycrdt.array.ArrayEvent object at 0x10cae9900>]
[E 2024-03-20 11:03:32.171 ServerApp] cells
[E 2024-03-20 11:03:32.171 ServerApp] [<pycrdt.array.ArrayEvent object at 0x10cae9900>]
[E 2024-03-20 11:03:32.171 ServerApp] meta
[E 2024-03-20 11:03:32.171 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9e40>]
[E 2024-03-20 11:03:32.171 ServerApp] meta
[E 2024-03-20 11:03:32.171 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9e40>]
[E 2024-03-20 11:03:32.172 ServerApp] meta
[E 2024-03-20 11:03:32.172 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9bc0>]
[E 2024-03-20 11:03:32.172 ServerApp] meta
[E 2024-03-20 11:03:32.172 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9bc0>]
[E 2024-03-20 11:03:32.172 ServerApp] meta
[E 2024-03-20 11:03:32.173 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9880>]
[E 2024-03-20 11:03:32.173 ServerApp] meta
[E 2024-03-20 11:03:32.173 ServerApp] [<pycrdt.map.MapEvent object at 0x10cae9880>]

I don't think this issue should block this PR, however; we should track it in a separate issue and address it later if possible.

@Zsailer (Member) commented Mar 20, 2024

High level comments...

I do think this PR dramatically improves the readability of the code. Great job there, @dlqqq!

Reviewing the code, though, it makes me a bit uneasy to change so much under the hood of the DocumentRoom without more unit test coverage (and possibly some integration tests). The current codebase has been "cooking" in a released state for a little while, so I have some confidence—albeit limited—that it works relatively well. We've heard edge-case issues, but don't have unit tests to measure whether we're getting any closer to a better state with these changes.

Before merging impactful rewrites like this, I'd prefer we increase our test coverage first.

> I'm happy to change this, since I'm personally indifferent. I've changed this in the latest revision; this branch now uses a Task.add_done_callback() to avoid a recursive coroutine call. See 9f55400.

Personally—if we're aiming for readability—I like this new commit over the recursive _maybe_save_document. I agree with @davidbrochart that this recursion triggers some concern for me as well, though it's hard to tell without some better ways to test.

@dlqqq (Collaborator, Author) commented Mar 20, 2024

@Zsailer I can add more unit test coverage in this PR. 👍

dlqqq marked this pull request as draft on March 20, 2024 at 23:22.
@davidbrochart (Collaborator) commented:

@dlqqq Thanks for identifying the bug in concurrent room initialization. A fix for it was merged in #255, and this PR now has conflicts, but it shouldn't be a problem since you are essentially rewriting the DocumentRoom. Please go ahead and make the changes you want, keeping in mind that the test added in #255 must pass with this PR.
