SharedMatrix undefined cell values #18082

SampoSyrjanen · 2023-10-31T09:15:25Z

SharedMatrix from "@fluidframework/matrix" sometimes has temporary undefined cell values. This happens in a highly parallel situation, when a client reads all values of a row of a shared matrix.

To reproduce this, you can create a bunch a clients in parallel processes, connect them to the same container, and start adding columns to a row from each client simultaneously. At the same time, clients can periodically read all column values of the row and check if they got an undefined value.

The expected behavior would be that when the values of the matrix row are read the values would never be undefined.

SampoSyrjanen · 2023-10-31T09:20:00Z

A node project to reproduce this issue can be found here https://github.com/SampoSyrjanen/shared-matrix-undefined-cell-test

Abe27342 · 2024-01-09T18:18:17Z

Hello!

Thank you for the reproduction. I believe there are two issues going on here:

The 3p Azure API currently uses FlushMode.Immediate to work around issues with large ops hitting the 1MB websocket limit. Since azure 1.0, we've put a lot of effort into fixing those issues, lots more details are documented here. We're currently planning on restoring this to the default FlushMode.TurnBased in the Azure 2.0 release. This means that if one client submits a series of ops synchronously, the Fluid runtime will also process those ops synchronously on all other clients, which helps ensure that application logic doesn't see "partially applied" states due to remote edits. In terms of the reproduction test you've provided, FlushMode.Immediate means that containers might yield between the op which inserts a column and the op which sets that column, despite that not happening on the local client. With FlushMode.TurnBased, that possibility is removed.
Even with TurnBased flushing enabled, the test reproduces some cases where SharedMatrix is not eventually consistent, albeit much more rarely. I'm investigating these and will update this issue when I have more details.

FYI mostly for awareness: I had to tweak the logic around matrixTestSequenceEndPromise in the test cases; it was possible for one worker client to apply its edits locally and receive server ops from all other clients, then exit before its in-flight ops made it to the ordering service. This would deadlock the other clients, since they'd be waiting for ops that never made it to the server. The saved event on the container (and container.isDirty) allows doing this robustly, and would commonly be used in production scenarios to inform the user if they're trying to exit the application with unsaved data.

vladsud · 2024-01-09T18:52:11Z

@Abe27342, with turn-based flushing enabled, did you repro it using 2.0 bits or 1.0 bits?
I'm not sure if it's the problem, but 1.0 (I think, maybe it was removed earlier) had some implicit batching at the driver layer. The problem with this layer - it was unpredictable, i.e. could flush ops when there were too many ops in the queue.
I do not remember exactly how all these layers worked together and if it could cause such behavior.
I'd rather focus on 2.0, but if we learn that 2.0 works flawlessly, and 1.0 is not, that might be a explanation.

Abe27342 · 2024-01-09T18:56:37Z

It repros on both, I ported the repro to our latest main branch and the behavior is basically the same AFAICT; maybe likelihood of occurrence is a bit different and I didn't notice but the same sort of behavior shows up. Agreed with trying to focus on 2.0. It's pretty clear with latest main bits and FlushMode switched to TurnBased that the remaining issues here are in the DDS realm.

vladsud · 2024-01-09T22:24:15Z

@DLehenbauer - any chance you can take a look? I have not look at details, but it feels like we should start with assumption that the bug is on SharedMatrix side. That said, it could be a bug in overall op processing pipeline (or even a service implementation).
If it repros for one service, but not another, that is likely an indication it's a service issue.

Abe27342 · 2024-01-09T22:59:44Z

I'm already investigating the SharedMatrix side. But is it not true that even with SharedMatrix issues entirely fixed, if one has a client with FlushMode.Immediate doing:

while (true) {
    matrix.insertRow(0, 1); // insert row at start
    matrix.setCell(0, 1, "value"); // populate that row with a value
    await sleep(1000);
}

while another client observes, there's nothing stopping the observing client's op processing from yielding to application logic while in between a row insert and a cell set? This is the crux of my point that we'll need turn-based flushing for the repro provided here to work as expected (i.e. not observe 'partially applied states').

…#19211) ## Description Fixes a matrix resubmission bug, where zamboni might clean up permutation segments which in-flight ops might need to refer to. In general, using the minimum sequence number of the most recently processed message in isolation to determine the collab window for a DDS is not correct, since if that DDS has any in-flight ops and they get resubmitted, they will need to be rebased from their refSeq to the last processed deltaManager seq at the time of rebasing. This bug seems to generally apply to DDSes which clean up collab-window focused state. AB#6866 tracks fixing other DDSes / unifying the mechanism used here. While investigating this bug, I also found a bug related to the double-reconnect scenario after adding matrix fuzz tests. This bug was relatively easy to fix: the localOpMetadata used for a resubmitted op needs to update its refSeq to be the current seq rather than reuse the refSeq of the original submission time. This resolves SharedMatrix-side issues for #18082. AB#6356 --------- Co-authored-by: Abram Sanderson <absander@microsoft.com> Co-authored-by: Connor Skees <39542938+connorskees@users.noreply.github.com>

Abe27342 · 2024-01-18T18:55:05Z

Hi, quick update here: #19211 fixes the SharedMatrix-side issue for this bug. This fixes bugs where row values may appear undefined indefinitely (the bug was an eventual consistency issue). It's still true that our azure APIs have FlushMode.Immediate set, and as long as that's true it will be the case that bits of your matrix may be temporarily undefined in the test case. With the example above which looks similar to the test repro:

while (true) {
    matrix.insertRow(0, 1); // insert row at start
    matrix.setCell(0, 1, "value"); // populate that row with a value
    await sleep(1000);
}

a client may yield between the insertRow op and the setCell op, in which case that cell will be undefined until the setCell op is sequenced and processed. @andre4i could you follow-up here once we have more details on our plan for FlushMode and azure client?

SampoSyrjanen added the bug Something isn't working label Oct 31, 2023

microsoft-github-policy-service bot added the community-contribution label Oct 31, 2023

Abe27342 self-assigned this Jan 8, 2024

Abe27342 mentioned this issue Jan 11, 2024

fix(matrix): Expand collab window to include refSeqs of in-flight ops #19211

Merged

andre4i mentioned this issue Mar 12, 2024

Azure client must use batching. #19738

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SharedMatrix undefined cell values #18082

SharedMatrix undefined cell values #18082

SampoSyrjanen commented Oct 31, 2023

SampoSyrjanen commented Oct 31, 2023

Abe27342 commented Jan 9, 2024 •

edited

vladsud commented Jan 9, 2024

Abe27342 commented Jan 9, 2024

vladsud commented Jan 9, 2024

Abe27342 commented Jan 9, 2024

Abe27342 commented Jan 18, 2024

SharedMatrix undefined cell values #18082

SharedMatrix undefined cell values #18082

Comments

SampoSyrjanen commented Oct 31, 2023

SampoSyrjanen commented Oct 31, 2023

Abe27342 commented Jan 9, 2024 • edited

vladsud commented Jan 9, 2024

Abe27342 commented Jan 9, 2024

vladsud commented Jan 9, 2024

Abe27342 commented Jan 9, 2024

Abe27342 commented Jan 18, 2024

Abe27342 commented Jan 9, 2024 •

edited