Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot, summary, Tree entries: Need standardization and reduction of concepts. #4683

Open
12 tasks done
vladsud opened this issue Dec 23, 2020 · 11 comments
Open
12 tasks done
Assignees
Labels
ado Issue is tracked in the internal Azure DevOps backlog api design-required This issue requires design thought epic An issue that is a parent to several other issues focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone kr refactor Code refactoring and cleanup
Milestone

Comments

@vladsud
Copy link
Contributor

vladsud commented Dec 23, 2020

Our File I/O types are very confusing - it's very hard to follow code on where we use certain types and why:

ISnapshotTree - used when loading from storage, could have name collisions between blobs & trees.
ITree / ITreeEntry - entries, could have name collisions; used when attaching DDS / data store, also used on legacy path
git.ITree - somewhat similar shape as above, mostly used in R11s, but leaks to various helper functions
ISummaryTree - summarization process, no collisions, supports symbolic reuse of content (through handle + path used as a reference)

We have a bunch of helper functions to convert from one type to another:
convertSnapshotTreeToSummaryTree: ISnapshotTree -> ISummaryTreeWithStats
convertToSummaryTree: ITree -> ISummarizeResult
convertSummaryTreeToITree : ISummaryTree -> ITree
buildSnapshotTree: ITree -> ISnapshotTree

While we can't completely remove all these types from runtime (as some are tied to file format), I believe we could reduce it to a minimum, by always converting any time to summary type and all the code working only with summaries.
This would substantially reduce mental workload when working with all these layers.

This work can be done incrementally, but adding more converter functions to support any to any conversions, and thus allowing us to address layer by layer, keeping system functional during this process.

Somewhat good example why not acting causes more types used even on same path:
We summarize using summaries. But because we load using snapshots, IRuntime.stop() method uses snapshot() path to get snapshot of container, forcing us to have both types of serialization. Ideally both save & load would use same format, but if they can't, we should serialize only in one format, and use helper functions (here - at Container level) to convert to write format.

Completed

Independent cleanup tasks

These can be done at any time.

Eliminate ITree usages

This gets us close to being able to remove ITree from runtime. There might still be some usages in tests and utils.

We could consider renaming since "snapshot" and "summary" indicate nothing about the directional distinction.

@vladsud vladsud added api design-required This issue requires design thought labels Dec 23, 2020
@vladsud
Copy link
Contributor Author

vladsud commented Dec 23, 2020

@arinwt - would love you taking a look and providing feedback,
@curtisman, @danielroney - I'd love to see some mechanism to get it scheduled and not sit in "Next" bucket forever.

@vladsud vladsud added the focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone label Dec 23, 2020
@vladsud
Copy link
Contributor Author

vladsud commented Jan 4, 2021

tagging @agarwal-navin

@agarwal-navin
Copy link
Contributor

Yes, I agree that we need to standardize summaries and snapshots. In my GC work, I came across these and got confused as well. More than that, I had to add the GC data across these formats and make it compatible and make sure there are no bugs. Standardizing this will help making changes across this space easier and less error prone.

IMO, there are 2 major chunks of work that need to be cleaned up -

  1. DDSs should start using generating ISummaryTree in summarizer.
  2. Container should stop exposing snapshot and expose summarize instead.

@ghost
Copy link

ghost commented Jul 12, 2021

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@vladsud vladsud added this to the January 2022 milestone Dec 6, 2021
@vladsud
Copy link
Contributor Author

vladsud commented Jan 10, 2022

@scarlettjlee, since you have been poking in this area (indirectly), I hope you do not mind if I assign this item to you.
I've just added an item on your plate (#8652) to this Epic.

We can chat more about details, priority, and what else needs to be included into this Epic.

@vladsud vladsud assigned scarlettjlee and unassigned vladsud Jan 10, 2022
@agarwal-navin
Copy link
Contributor

@scarlettjlee I added #4728 to this list. Please feel free to remove it from here if it seems not relevant to this epic.

@scarlettjlee
Copy link
Contributor

@vladsud & @agarwal-navin, could you review the updated [sub-]issues list in the original comment up top?

@vladsud
Copy link
Contributor Author

vladsud commented Feb 25, 2022

Looks reasonable to me (though it does feel similar to what it used to be when I opened it, maybe minus sub-items).
I'd say the most interesting question is the ordering, and what would we do if we have only 1 month (or whatever amount of time we have). I'd use simple rule to make these decisions - if we do not do some work, does it become more expensive over time (because more usage cripples in), or does it make other developers spend more time to understand more concepts and deal with them? If the answer is yes, that's candidate for higher priority.

The opposite of that would be something like commits on ISnapshotTree. It's not used, and will not be used in runtime, so it's the cleanup that we would love to do, but it's not becoming more expensive over time.

I'd also optimize towards consistency. We should consistently use "summary" anywhere we mean upload, and snapshot anywhere we mean download format. That should be consistent in comments, names of functions, interfaces.

@scarlettjlee
Copy link
Contributor

@vladsud, I reorganized the task list to try to categorize them. A bunch of the desired work has already happened since this item was first opened. I'd focus any concerted effort toward getting rid of ITree, especially #9227. And then unifying snapshot and summary (which I haven't thought much about yet). The cleanup tasks can be done whenever and aren't causing much trouble (maybe some "good first issue"s that we could do opportunistically). There's some other minor misusage of terms in the codebase that I don't think are serious but would be nice to fix, (like "snapshot" in matrix DDS to get DDS content that is not a snapshot type).

@ssimic2
Copy link

ssimic2 commented Feb 26, 2022

Related: #8979

@vladsud
Copy link
Contributor Author

vladsud commented Feb 28, 2022

Thanks for clarification!
There is no way to unify snapshot & summary interface, as they have different qualities (snapshot does not have blob content, but otherwise self-contained, i.e., can't refer to content from other snapshots. while summaries always include blob content, unless they refer to blobs in storage / previous snapshot).

If you feel like new tasks are very clear and Ok for new hires, please mark them as first issue.

@scarlettjlee scarlettjlee modified the milestones: March 2022, Future Mar 18, 2022
@agarwal-navin agarwal-navin mentioned this issue May 26, 2022
8 tasks
@curtisman curtisman added the ado Issue is tracked in the internal Azure DevOps backlog label Jun 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ado Issue is tracked in the internal Azure DevOps backlog api design-required This issue requires design thought epic An issue that is a parent to several other issues focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone kr refactor Code refactoring and cleanup
Projects
None yet
Development

No branches or pull requests

8 participants