Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve OOB change handling #243

Open
dlqqq opened this issue Feb 29, 2024 · 1 comment
Open

Improve OOB change handling #243

dlqqq opened this issue Feb 29, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@dlqqq
Copy link
Collaborator

dlqqq commented Feb 29, 2024

Introduction

(this section is mostly copied from #240)

An out-of-band (abbrev. OOB) change is an edit of a collaborative file not done through JupyterLab's UI. For example, suppose user A runs git reset --hard <other-branch> while another user B is editing a collaborative document within the same repository. A's git reset deletes all edits made by B and sets the contents of that document to some fixed value. Then the YStore table containing B's edits needs to be dropped, and this event must be somehow communicated to all Yjs clients connected to the document. Here, we say that an out-of-band change occurred on B's document.

Please see #240 for information on how OOB changes are detected & signaled. This issue focuses exclusively on how OOB changes are handled by the backend.

Description

The current implementation of how OOB changes can be improved to potentially reduce data loss and to greatly improve performance on long-running collaboration sessions.

Context: how OOB changes are handled currently

The handling of OOB changes is implemented in DocumentRoom, mostly in the _on_outofband_change() method. The relevant portions of its source are:

class DocumentRoom(YRoom):
    """A Y room for a possibly stored document (e.g. a notebook)."""

    def __init__(
        self, ...
    ):
        ...
        
        # Listen for document changes
        self._document.observe(self._on_document_change)
        self._file.observe(self.room_id, self._on_outofband_change)

    async def _on_outofband_change(self) -> None:
        """
        Called when the file got out-of-band changes.
        """
        self.log.info("Out-of-band changes. Overwriting the content in room %s", self._room_id)
        self._emit(LogLevel.INFO, "overwrite", "Out-of-band changes. Overwriting the room.")

        try:
            model = await self._file.load_content(self._file_format, self._file_type)
        except Exception as e:
            msg = f"Error loading content from file: {self._file.path}\n{e!r}"
            self.log.error(msg, exc_info=e)
            self._emit(LogLevel.ERROR, None, msg)
            return

        async with self._update_lock:
            self._document.source = model["content"]
            self._document.dirty = False

From the above source, we see that when an OOB change occurs, DocumentRoom will fetch the latest file contents. If that does not raise an exception, then the most important block is reached, which is highlighted below. When an OOB change occurs, this block sets the source of the YDoc to the latest file contents:

async with self._update_lock:
    self._document.source = model["content"]
    self._document.dirty = False

To understand this, we need to know what A) what self._document is, and B) what happens when self._document.source is set.

self._document is a Ydoc from jupyter_ydoc. jupyter_ydoc provides YDoc-like classes that are wrappers around pycrdt.Doc, each of which implement a unique interface for a specific file type.

jupyter_ydoc currently provides three types of Ydoc-like classes:

  1. YUnicode/YFile: a YDoc that represents a UTF-8 plaintext file
  2. YNotebook: a YDoc that represents a notebook file
  3. YBlob: a YDoc that represents a blob file

We will refer to these classes provided by jupyter_ydoc as Jupyter Ydocs, for specificity.

We now know what self._document is. It is one of the 3 Jupyter Ydocs, all of which inherit from jupyter_ydoc.YBaseDoc. Now we need to know what happens when self._document.source is set.

Here are the relevant portions of YBaseDoc's source that determine the behavior of setting its source attribute:

class YBaseDoc(ABC):
    """
    Base YDoc class.
    This class, defines the minimum API that any document must provide
    to be able to get and set the content of the document as well as
    subscribe to changes in the document.
    """
    
    ...
    
    @property
    def source(self) -> Any:
        """
        Returns the content of the document.

        :return: The content of the document.
        :rtype: Any
        """
        return self.get()

    @source.setter
    def source(self, value: Any):
        """
        Sets the content of the document.

        :param value: The content of the document.
        :type value: Any
        """
        return self.set(value)

    @abstractmethod
    def get(self) -> Any:
        """
        Returns the content of the document.

        :return: Document's content.
        :rtype: Any
        """

    @abstractmethod
    def set(self, value: Any) -> None:
        """
        Sets the content of the document.

        :param value: The content of the document.
        :type value: Any
        """

As shown above, get() and set() are implemented in the YDoc class directly inheriting from YBaseDoc. This means that ultimately, the handling of an OOB change depends on the file type, and is determined by the definition of set() in the corresponding Jupyter Ydoc.

Issue part 1: Verify that OOB change handling doesn't lead to data loss

I have looked through the latest implementation of the set() method on YUnicode, YNotebook, YBlob, and could not find any immediate concerns. Each implementation flags all of the existing items for deletion, and then inserts the new content from the beginning. This does not delete the Ydoc and therefore should not result in any data loss. I've also tested the OOB change handling in the latest version of jupyter_collaboration with this one-liner shell script:

sleep 5; cat Untitled.ipynb | jq '.cells[0].source[0] = "i am an out-of-band edit!"' > temp.ipynb && mv  temp.ipynb Untitled.ipynb

This script sleeps 5 seconds, then makes an OOB change that sets the content of the first cell to a fixed string. I ran this script and immediately began making edits to the first cell until the OOB change occurred. This did not lead to any data loss for me locally.

While I have not yet found any specific code paths that lead to data loss in the latest implementation, I think that it remains debatable whether the current OOB change handling leads to data loss in certain scenarios. This is supported by the fact that myself and others have noticed OOB changes being logged by the extension immediately prior to data loss incidents. See #219 as an example.

Issue part 2: Each OOB change grows the Ydoc by at least the new file size

The implementations of set(), while correct, have the unfortunate consequence of growing the Ydoc's size by a factor of the new file size. That is, an OOB change on a 0.5 MB notebook will grow the Ydoc by at least 0.5 MB.

You can confirm this locally with these steps:

  1. Save 1kb.txt locally.
  2. Open JupyterLab with jupyter_collaboration installed.
  3. Open 1kb.txt through JupyterLab.
  4. Record the file size of .jupyter_ystore.db.
    • You can do so via ls -lah . | grep 'ystore.db'.
  5. Make 5 OOB changes by running this one-liner script in your terminal:
    touchit() { sleep 1; touch -m 1kb.txt }; for i in {1..5}; do touchit; done
  6. Wait a few seconds, then record the file size of .jupyter_ystore.db again.

On my machine, the Ystore starts at 12K and grows to 20K after making 5 OOB changes. I expected a growth in size of at least 5K, and observed a growth of 8K. This test confirms that currently, each OOB change grows the Ydoc and Ystore by at least the new file size.

Larger Ydocs lead to poorer performance. The most significant consequence is that a larger Ydoc results in a slower connection for new users, as the entire Ydoc is streamed to new clients. I will elaborate on this in a future issue concerning the YStore specifically.

Proposed next steps

  1. Increase confidence in the claim that OOB changes do not lead to data loss. To do so, we should:

    • Add test coverage of OOB change handling.
    • Simplify OOB change handling code.
  2. Reduce the growth in YStore size on OOB changes

    • For example, for "small" changes, it is preferable to compute a diff and only apply that diff as a Yjs item. This would drastically reduce the growth in YStore size per OOB change. This may be implemented as a boolean configurable to avoid a breaking change, though I personally do not see this as a breaking change.
    • It is also worth considering a more general configurable that allows for other ways of handling OOB changes. For example, we may want DocumentRoom either to a) completely ignore OOB changes, or b) to delete the YDoc from memory, drop the YStore table, and begin anew.
    • Credit to @ellisonbg for some of these suggestions. Thank you!
@dlqqq dlqqq added the bug Something isn't working label Feb 29, 2024
@davidbrochart
Copy link
Collaborator

Reduce the growth in YStore size on OOB changes

  • For example, for "small" changes, it is preferable to compute a diff and only apply that diff as a Yjs item. This would drastically reduce the growth in YStore size per OOB change. This may be implemented as a boolean configurable to avoid a breaking change, though I personally do not see this as a breaking change.

What you call "computing a diff" would require interpreting OOB changes and map them to e.g. a notebook schema, which is not an easy task. In this issue I proposed a bridge between different document structures, that is basically solving the same issue.
To give you an idea of the complexity, say an OOB inserts a character in a notebook cell. Now you need to infer the cell index where the change applies, the position of the added character in the cell source, and translate that into a Y operation (fake API here):

ycell = ynotebook["cells"][idx]
ytext = ycell["source"]
ytext.insert(pos, char)

For more complex changes, it will even be ambiguous to infer changes. See for instance how git diff cannot sometimes infer that something has moved, and it will just mark the change as a deletion and an addition (and git works with lines of text, which is a very basic document structure).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants