Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SharedMatrix undefined cell values #18082

Open
SampoSyrjanen opened this issue Oct 31, 2023 · 7 comments
Open

SharedMatrix undefined cell values #18082

SampoSyrjanen opened this issue Oct 31, 2023 · 7 comments
Assignees
Labels
bug Something isn't working community-contribution

Comments

@SampoSyrjanen
Copy link

SharedMatrix from "@fluidframework/matrix" sometimes has temporary undefined cell values. This happens in a highly parallel situation, when a client reads all values of a row of a shared matrix.

To reproduce this, you can create a bunch a clients in parallel processes, connect them to the same container, and start adding columns to a row from each client simultaneously. At the same time, clients can periodically read all column values of the row and check if they got an undefined value.

The expected behavior would be that when the values of the matrix row are read the values would never be undefined.

@SampoSyrjanen
Copy link
Author

A node project to reproduce this issue can be found here https://github.com/SampoSyrjanen/shared-matrix-undefined-cell-test

@Abe27342 Abe27342 self-assigned this Jan 8, 2024
@Abe27342
Copy link
Contributor

Abe27342 commented Jan 9, 2024

Hello!

Thank you for the reproduction. I believe there are two issues going on here:

  1. The 3p Azure API currently uses FlushMode.Immediate to work around issues with large ops hitting the 1MB websocket limit. Since azure 1.0, we've put a lot of effort into fixing those issues, lots more details are documented here. We're currently planning on restoring this to the default FlushMode.TurnBased in the Azure 2.0 release. This means that if one client submits a series of ops synchronously, the Fluid runtime will also process those ops synchronously on all other clients, which helps ensure that application logic doesn't see "partially applied" states due to remote edits. In terms of the reproduction test you've provided, FlushMode.Immediate means that containers might yield between the op which inserts a column and the op which sets that column, despite that not happening on the local client. With FlushMode.TurnBased, that possibility is removed.

  2. Even with TurnBased flushing enabled, the test reproduces some cases where SharedMatrix is not eventually consistent, albeit much more rarely. I'm investigating these and will update this issue when I have more details.

FYI mostly for awareness: I had to tweak the logic around matrixTestSequenceEndPromise in the test cases; it was possible for one worker client to apply its edits locally and receive server ops from all other clients, then exit before its in-flight ops made it to the ordering service. This would deadlock the other clients, since they'd be waiting for ops that never made it to the server. The saved event on the container (and container.isDirty) allows doing this robustly, and would commonly be used in production scenarios to inform the user if they're trying to exit the application with unsaved data.

@vladsud
Copy link
Contributor

vladsud commented Jan 9, 2024

@Abe27342, with turn-based flushing enabled, did you repro it using 2.0 bits or 1.0 bits?
I'm not sure if it's the problem, but 1.0 (I think, maybe it was removed earlier) had some implicit batching at the driver layer. The problem with this layer - it was unpredictable, i.e. could flush ops when there were too many ops in the queue.
I do not remember exactly how all these layers worked together and if it could cause such behavior.
I'd rather focus on 2.0, but if we learn that 2.0 works flawlessly, and 1.0 is not, that might be a explanation.

@Abe27342
Copy link
Contributor

Abe27342 commented Jan 9, 2024

It repros on both, I ported the repro to our latest main branch and the behavior is basically the same AFAICT; maybe likelihood of occurrence is a bit different and I didn't notice but the same sort of behavior shows up. Agreed with trying to focus on 2.0. It's pretty clear with latest main bits and FlushMode switched to TurnBased that the remaining issues here are in the DDS realm.

@vladsud
Copy link
Contributor

vladsud commented Jan 9, 2024

@DLehenbauer - any chance you can take a look? I have not look at details, but it feels like we should start with assumption that the bug is on SharedMatrix side. That said, it could be a bug in overall op processing pipeline (or even a service implementation).
If it repros for one service, but not another, that is likely an indication it's a service issue.

@Abe27342
Copy link
Contributor

Abe27342 commented Jan 9, 2024

I'm already investigating the SharedMatrix side. But is it not true that even with SharedMatrix issues entirely fixed, if one has a client with FlushMode.Immediate doing:

while (true) {
    matrix.insertRow(0, 1); // insert row at start
    matrix.setCell(0, 1, "value"); // populate that row with a value
    await sleep(1000);
}

while another client observes, there's nothing stopping the observing client's op processing from yielding to application logic while in between a row insert and a cell set? This is the crux of my point that we'll need turn-based flushing for the repro provided here to work as expected (i.e. not observe 'partially applied states').

Abe27342 added a commit that referenced this issue Jan 16, 2024
…#19211)

## Description

Fixes a matrix resubmission bug, where zamboni might clean up
permutation segments which in-flight ops might need to refer to.
In general, using the minimum sequence number of the most recently
processed message in isolation to determine the collab window for a DDS
is not correct, since if that DDS has any in-flight ops and they get
resubmitted, they will need to be rebased from their refSeq to the last
processed deltaManager seq at the time of rebasing.

This bug seems to generally apply to DDSes which clean up collab-window
focused state. AB#6866 tracks fixing other DDSes / unifying the
mechanism used here.

While investigating this bug, I also found a bug related to the
double-reconnect scenario after adding matrix fuzz tests. This bug was
relatively easy to fix: the localOpMetadata used for a resubmitted op
needs to update its refSeq to be the current seq rather than reuse the
refSeq of the original submission time.

This resolves SharedMatrix-side issues for #18082.

AB#6356

---------

Co-authored-by: Abram Sanderson <absander@microsoft.com>
Co-authored-by: Connor Skees <39542938+connorskees@users.noreply.github.com>
@Abe27342
Copy link
Contributor

Hi, quick update here: #19211 fixes the SharedMatrix-side issue for this bug. This fixes bugs where row values may appear undefined indefinitely (the bug was an eventual consistency issue). It's still true that our azure APIs have FlushMode.Immediate set, and as long as that's true it will be the case that bits of your matrix may be temporarily undefined in the test case. With the example above which looks similar to the test repro:

while (true) {
    matrix.insertRow(0, 1); // insert row at start
    matrix.setCell(0, 1, "value"); // populate that row with a value
    await sleep(1000);
}

a client may yield between the insertRow op and the setCell op, in which case that cell will be undefined until the setCell op is sequenced and processed. @andre4i could you follow-up here once we have more details on our plan for FlushMode and azure client?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working community-contribution
Projects
None yet
Development

No branches or pull requests

3 participants