Misbehaving driver can cause Fluid to hang on container open #18430

Open
zagriswo opened this issue Nov 21, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@zagriswo

Describe the bug

We found a bug in our driver that resulted in Fluid effectively busy-looping and causing an app hang. We can fix the driver bug, but it would be good to also have the container loading code be a bit more defensive too.

Our driver returned all the messages the service had via the IDocumentDeltaConnection.initialMessages property, but this set of messages erroneously had a gap in the middle. DeltaManager would go through its fetchMissingDeltas path to try to retrieve the messages in the gap, but our implementation of IDocumentDeltaStorageService.fetchMessages would successfully return an empty stream (that is, no messages and done: true) when asked about that gap. As a result, DeltaManager kept trying to fetch the gap over and over without making forward progress, because it got back an empty stream on every attempt.
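
To make the failure mode concrete, here is a minimal sketch of how a gap in an initial batch of messages can be detected. It is not DeltaManager's actual code: the `SequencedMessage` shape and the `findFirstGap` helper are hypothetical names introduced for illustration, and the sketch assumes the messages are sorted by ascending sequence number.

```typescript
// Hypothetical minimal shape; real Fluid sequenced messages carry many more fields.
interface SequencedMessage {
    sequenceNumber: number;
}

// Inclusive range of sequence numbers missing from a batch.
interface Gap {
    from: number;
    to: number;
}

// Returns the first missing range after `lastProcessed`, or undefined when the
// batch is contiguous. Assumes `messages` is sorted by ascending sequenceNumber.
function findFirstGap(lastProcessed: number, messages: SequencedMessage[]): Gap | undefined {
    let expected = lastProcessed + 1;
    for (const message of messages) {
        if (message.sequenceNumber > expected) {
            // Everything from `expected` up to the message before this one is missing.
            return { from: expected, to: message.sequenceNumber - 1 };
        }
        // Duplicates or already-processed messages do not move `expected` backwards.
        expected = Math.max(expected, message.sequenceNumber + 1);
    }
    return undefined;
}
```

With messages 1, 2, 5, 6 and nothing processed yet, this reports the missing range 3 to 4, which is the range the storage service would then be asked for.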

@zagriswo zagriswo added the bug Something isn't working label Nov 21, 2023
@scarlettjlee
Contributor

FYI, @rajatch-ff

@jatgarg
Contributor

jatgarg commented Nov 21, 2023

There is no way for the container to progress until it fills the gap with the ops because the state would not be consistent then. We cannot skip ops and just proceed.

@zagriswo
Author

Agreed on not "just proceed" but there should be some sort of additional check that results in load erroring out. If the driver returned success but no messages, should that be an immediate error? Should there be a limit on how many attempts are made for the same gap?
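
As a rough sketch of what such a check could look like (an assumption about one possible design, not existing Fluid code), the loop below caps the number of consecutive fetches that make no progress on the same gap and then throws. `FetchRange`, `fillGap`, and `maxEmptyAttempts` are hypothetical names, and the fetch callback is synchronous only to keep the example short; the real IDocumentDeltaStorageService.fetchMessages is asynchronous and stream-based.

```typescript
// Hypothetical minimal shape; real Fluid sequenced messages carry many more fields.
interface SequencedMessage {
    sequenceNumber: number;
}

// Hypothetical synchronous stand-in for a driver's storage fetch.
type FetchRange = (from: number, to: number) => SequencedMessage[];

// Fetches sequence numbers from..to (inclusive). Instead of retrying forever,
// it throws once `maxEmptyAttempts` consecutive fetches make no progress.
function fillGap(
    fetchRange: FetchRange,
    from: number,
    to: number,
    maxEmptyAttempts: number = 3,
): SequencedMessage[] {
    const collected: SequencedMessage[] = [];
    let next = from;
    let emptyAttempts = 0;
    while (next <= to) {
        const lowerBound = next;
        const batch = fetchRange(lowerBound, to)
            .filter((m) => m.sequenceNumber >= lowerBound && m.sequenceNumber <= to)
            .sort((a, b) => a.sequenceNumber - b.sequenceNumber);
        let progressed = false;
        for (const message of batch) {
            if (message.sequenceNumber !== next) {
                break; // Only a contiguous prefix counts as progress.
            }
            collected.push(message);
            next++;
            progressed = true;
        }
        if (progressed) {
            emptyAttempts = 0;
            continue;
        }
        emptyAttempts++;
        if (emptyAttempts >= maxEmptyAttempts) {
            throw new Error(
                `No progress fetching ops ${next}..${to} after ${emptyAttempts} attempts`,
            );
        }
    }
    return collected;
}
```

Setting `maxEmptyAttempts` to 1 gives the "immediate error" behavior; a larger value tolerates a service that is briefly behind. Either way the load fails with an error instead of hanging.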

@kashms
Contributor

kashms commented Mar 29, 2024

@zagriswo we'll improve the checks here; the work is on the backlog.

@jatgarg
Contributor

jatgarg commented Apr 3, 2024

> Agreed on not "just proceed" but there should be some sort of additional check that results in load erroring out. If the driver returned success but no messages, should that be an immediate error? Should there be a limit on how many attempts are made for the same gap?

Question: In your original description, why did the service keep returning 0 messages and leave the container stuck? Did it ever proceed, or did the service lose the messages somehow?

@zagriswo
Author

zagriswo commented Apr 3, 2024

@jatgarg it was a bug uncovered by fuzzing. Basically, a hole was made in the op stream, so our driver returned 0 messages in perpetuity because those messages just didn't exist anymore.

@jatgarg
Contributor

jatgarg commented Apr 13, 2024

In the ODSP driver, we already handle this case: if fetching ops through the delta storage service makes no progress, we give up after 30 seconds and the container closes.
We use this public utility API for that: requestOps()

You can see how it is used in the ODSP driver here:

Let me know if you have more questions. You should be able to use it with your driver easily.

In the future, we will consider whether to move this logic higher up the stack, into the loader/delta stream layer.
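
For readers unfamiliar with the "give up after no progress" idea, the sketch below illustrates it in isolation. It is not the implementation of requestOps() and does not reproduce its signature: `NoProgressWatchdog`, its parameters, and the injectable clock are hypothetical, and the 30-second default simply mirrors the figure mentioned above.

```typescript
// Hypothetical helper: tracks whether op fetching keeps advancing and throws
// once no new sequence number has been seen for `timeoutMs`.
class NoProgressWatchdog {
    private highestSeen: number;
    private lastProgressAt: number;

    constructor(
        startSequenceNumber: number,
        private readonly timeoutMs: number = 30000,
        // Injectable clock so the behavior can be tested without waiting.
        private readonly now: () => number = () => Date.now(),
    ) {
        this.highestSeen = startSequenceNumber;
        this.lastProgressAt = this.now();
    }

    // Call after every fetch attempt with the highest sequence number obtained so far.
    public report(highestSequenceNumber: number): void {
        const current = this.now();
        if (highestSequenceNumber > this.highestSeen) {
            // Forward progress: remember it and restart the timer.
            this.highestSeen = highestSequenceNumber;
            this.lastProgressAt = current;
            return;
        }
        if (current - this.lastProgressAt >= this.timeoutMs) {
            throw new Error(
                `No progress past op ${this.highestSeen} for ${current - this.lastProgressAt} ms`,
            );
        }
    }
}
```

A driver's fetch loop could call `report` after each attempt and let the thrown error propagate so that the load fails rather than spinning. A time-based limit, unlike a fixed attempt count, bounds the total wait no matter how quickly the empty responses come back.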
