Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite waits while resolving data stores / requests #4508

Open
Tracked by #7347
vladsud opened this issue Dec 2, 2020 · 12 comments
Open
Tracked by #7347

Infinite waits while resolving data stores / requests #4508

vladsud opened this issue Dec 2, 2020 · 12 comments
Assignees
Labels
area: runtime Runtime related issues focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone
Milestone

Comments

@vladsud
Copy link
Contributor

vladsud commented Dec 2, 2020

Problem Statement

ContainerRuntime.getDataStore() and ContainerRuntime.request() result in infinite wait when request hits non-existing data store.
It will be resolved if/when container eventually processes attach op for such data store. However if request is made for wrong (non-existing) data store, such request deadlocks.
This also shows up in end-to-end workflows where container is created paused, and URI resolution with waits will result in deadlock. Thus host may need to implement multi-stage process of doing initial request within waits, and on 404, resume container and retry request with infinite wait.

Proposed solutions

Use waitContainerToCatchUp() mechanism

Use it as a preparation step before resolving URI (or as a step in above mentioned algorithm after getting 404) as a wait to wait for container to catch up. This will fundamentally not change how resolution is made, but will allow us to remove waits from getDataStore() and request resolution.

Incorporate sequence number into URIs

When URI is generated, client can add (as an optional argument) current reference sequence number. This would allow (or rather - require) client resolving URI to get up to this sequence number before finishing resolving request. The nice pro of this design is that it has better user experience (or rather allows to provide better user experience) - we can decide to show progress UI or some other indication while we catch up, for user to better understand why it takes long to get to what this URI represents, and now show partial or previous state of document / component. The potential con of this approach is that URIs can be generated only for content that actually is in the file, i.e. was acked by server. While it looks like con, it will also not allow a scenario where user is offline, creates URI and sends it over Teams or email, where URI reaches consumers without referenced fluid content eve making it to a file (note that background sync of email will happen on some systems when laptop/phone looks off, so browser may be suspended and fluid content not synching, but URI may make it through just fine). As result user may never be able to resolve such URI. Pushing this problem to producer has certain pros and cons, but is in line with work in other areas, in particular sequential IDs to represent ranges of text in Scriptor and URIs being generated only for online content.

@vladsud vladsud added design-required This issue requires design thought area: runtime Runtime related issues labels Dec 2, 2020
@vladsud vladsud self-assigned this Dec 2, 2020
@vladsud
Copy link
Contributor Author

vladsud commented Dec 2, 2020

@leeviana , @montselozanod , @anthony-murphy , @curtisman

I think we should go with second approach. Would be great to add more people to discussion from app side to provide their feedback.

@vladsud vladsud added this to the December 2020 milestone Dec 2, 2020
@leeviana
Copy link
Contributor

leeviana commented Dec 2, 2020

I like the second approach. :) I think offline mode was always going to be a "con" for whatever solution we ended up here/with similar challenges and we'll need to work out our vision/support there in a separate larger effort/architecture design. Not sure who on the app side cares too much about this right now (mostly just affects future offline thinking), but lets start with @AndreiZe

@anthony-murphy
Copy link
Contributor

Personally, I like them both. I view sequence number as a hint, but for scenarios listed above, and because it's user modifiable we can never fully trust it, but if it's there and reasonable we can leverage it. However, we'll likely always want a fallback, and waiting for catch up seems appropriate

@curtisman
Copy link
Member

URL is the identity of the reference, so having the minimum sequence number that the URL Is valid make sense.
It is very likely that changing a URL will render it useless, so I don't think it necessary a hint.

@vladsud vladsud added the focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone label Dec 15, 2020
@danielroney danielroney removed the design-required This issue requires design thought label Jan 8, 2021
@vladsud vladsud modified the milestones: January 2021, February 2021 Feb 19, 2021
@vladsud vladsud modified the milestones: February 2021, April 2021 Feb 26, 2021
@vladsud
Copy link
Contributor Author

vladsud commented Feb 26, 2021

I'll use this uber task for all related issues (and dedup the rest).

Here is what we need to do:

  1. Remove waits from getRootDataStore / ContainerRuntime.request() APIs (former Remove waits from getRootDataStore / ContainerRuntime.request() APIs #3875)
  2. Add UT coverage for waitContainerToCatchUp() (former Add UT coverage for waitContainerToCatchUp() #3874)
  3. Requests against paused container may result in infinite wait if component requests exists in future (ops only), not snapshot (former Requests against paused container may result in infinite wait if component requests exists in future (ops only), not snapshot #3085)

Not really related, but occasionally mentioned in similar topics:

  1. request for a URL with no handler throws: request for a URL with no handler throws #2859
  2. IContainerRuntime*.get*DataStoreRuntime: Supply optional AbortSignal for cases where we wait: IContainerRuntime*.get*DataStoreRuntime: Supply optional AbortSignal for cases where we wait #1192

@vladsud vladsud added this to the October 2021 milestone Aug 20, 2021
@vladsud vladsud modified the milestones: October 2021, November 2021 Sep 8, 2021
@vladsud vladsud modified the milestones: November 2021, Future Oct 9, 2021
@vladsud vladsud modified the milestones: Future, Next Jun 3, 2022
@ghost ghost added the status: stale label Dec 1, 2022
@ghost
Copy link

ghost commented Dec 1, 2022

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@leeviana
Copy link
Contributor

leeviana commented Dec 1, 2022

not stale

@microsoft-github-policy-service
Copy link
Contributor

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@leeviana
Copy link
Contributor

leeviana commented Feb 1, 2023

not stale

@microsoft-github-policy-service
Copy link
Contributor

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@leeviana
Copy link
Contributor

leeviana commented Aug 1, 2023

not stale

@leeviana
Copy link
Contributor

not stale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone
Projects
None yet
Development

No branches or pull requests

5 participants