-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WT-9529 Add doc reference guide and architecture guide entries for Tiered Storage #10558
Conversation
@sueloverso and @keitharnoldsmith, Attaching prebuilt pages for easier reading: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of minor and broad comments as is common with a large doc/text change.
src/docs/arch-index.dox
Outdated
@@ -289,6 +290,10 @@ The format of the data file is given by structures in \c block.h . | |||
|
|||
The cloud storage source extension which manage the flushing of the data files to and reading from different cloud object stores. | |||
|
|||
@subpage arch-tiered-storage | |||
|
|||
Tiered storage allows B-Trees to be split into multiple parts, some on local disk and some in cloud storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: I suggest expanding/detailing the two uses of some
at the end of the sentence. Maybe something like:
...into multiple parts, more recently updated parts on local disk and less recent parts in cloud storage
or if you don't want to talk about time just some parts of the tree on local disk and some parts in cloud storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/docs/arch-tiered-storage.dox
Outdated
Tiered storage provides a way to split Btrees into multiple parts, | ||
with some set of parts stored in cloud storage objects and another | ||
set of parts stored in local files. | ||
We often use the term \a object to refer one of these written |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: should be "..to refer to one..."
src/docs/arch-tiered-storage.dox
Outdated
|
||
@section ts_intro Introduction and Definitions | ||
|
||
Tiered storage provides a way to split Btrees into multiple parts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor consistency nit: In the arch-index.dox
above you used B-trees
and here you are (consistently) using Btrees
. Should it have the hyphen or not? (Possibly a much broader comment than just your changes.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a broader comment! It seems like a strong majority (70%) of uses in the regular reference doc are "btree". Maybe 20% are "Btree", the rest either b-tree or b+tree (with various capitalizations). It's closer in the arch guide, but "btree" is still winning. So I'll go with a consistent "btree" here and file a ticket to clean up the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WT-12956 for the broader question/solution.
src/docs/arch-tiered-storage.dox
Outdated
|
||
When in use, a tiered Btree, like a regular Btree, may have some | ||
recently used or modified data that resides in memory pages. | ||
This in memory representation is the same between a tiered Btree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should in memory
be in-memory
? (Also potentially a broader consistency issue.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch - I'd say "This in-memory" is correct, while "resides in memory pages." on the line before is correct. So not a global search/replace, but some other uses may need to be fixed up.
src/docs/arch-tiered-storage.dox
Outdated
with some set of parts stored in cloud storage objects and another | ||
set of parts stored in local files. | ||
We often use the term \a object to refer one of these written | ||
Btree parts, whether it resides on the local disk or in the cloud. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: I prefer to avoid vague it
type words generally and explicitly say the words. I suggest ...whether the object resides...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
src/docs/arch-tiered-storage.dox
Outdated
WT_STORAGE_SOURCE. The storage source can be thought of as a driver, | ||
or an abstraction of a cloud provider with operations. WiredTiger has a | ||
several instances of \c WT_STORAGE_SOURCE, these include the drivers | ||
for the AWS, GCP, and Azure clouds. A storage source can be asked to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is for an internal developer perhaps, it may be worth noting dir_store
as well. That storage source is often easier to understand.
ETA: I see you do mention it below. Maybe mention it here and say it will be discussed in more detail below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
-# queue a FLUSH_FINISH operation | ||
-# queue a REMOVE_LOCAL operation | ||
|
||
@section ts_future Future |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stopped here. I'm not sure how useful it is to have all the future stuff listed here. It is years old and would likely change if/when it ever gets resuscitated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree that the "future stuff" isn't needed -- I would approve a PR without it -- I don't see any harm in including it. So since it's written, I'm happy to leave it as is.
There are certainly many other parts of the system where I would have found it useful/helpful to understand a bit more of the thinking that motivated the decisions.
@@ -6,7 +6,9 @@ WiredTiger supports three underlying file formats: row-store and | |||
column-store, with an underlying special case of column-store for | |||
bit-field values. All three formats are B+tree implementations of | |||
key/value stores. WiredTiger also supports @ref lsm, implemented as a | |||
tree of B+trees. | |||
tree of B+trees. In addition, there is experimental support for | |||
@ref tiered_storage, allowing a B+tree to span multiple files and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here's a 3rd form. B-tree, Btree and B+tree!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use "btree" and "B-Tree" so ubiquitously that I would eliminate "B+Tree", even though there is a reasonable case that our trees are more "B+ like".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're going with btree
, that's been the dominant form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT this one is still there and not btree
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaving this one for WT-12956. Sometimes we want to specifically state that the implementation is B+Tree, I'm not sure of this is one of those cases.
src/docs/tiered-storage.dox
Outdated
each in its own file. | ||
|
||
When in use, a tiered btree, like a regular btree, may have some recently used | ||
or modified data that resides in memory pages. This in memory representation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment in the other document about whether it should be in-memory
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, fixed the second one to be in-memory
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work on this Don!
I've mostly left suggestions, except for a couple comments (which I think will be obvious) that should definitely be addressed. Let me know if you have specific questions.
src/docs/arch-index.dox
Outdated
@@ -289,6 +290,10 @@ The format of the data file is given by structures in \c block.h . | |||
|
|||
The cloud storage source extension which manage the flushing of the data files to and reading from different cloud object stores. | |||
|
|||
@subpage arch-tiered-storage | |||
|
|||
Tiered storage allows B-Trees to be split into multiple parts, more recently updated parts on local disk and less recent parts in cloud storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"split into multiple parts" suggests to me that we could be splitting the btree into sub-trees. I would be more explicit about the granularity of splitting here. Maybe:
Tiered storage allows B-Tree data to be stored in multiple places, more recently updated blocks on local disk and less recently updated blocks in cloud storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
src/docs/arch-tiered-storage.dox
Outdated
Tiered storage provides a way to split btrees into multiple parts, | ||
with some set of parts stored in cloud storage objects and another | ||
set of parts stored in local files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See earlier comment.
"split btrees into multiple parts" --> "store btree data in multiple containers"
"set of parts" --> blocks
(Aside: is there a convention in the doc about whether we say "btree" or "B-Tree"?)
src/docs/arch-tiered-storage.dox
Outdated
with some set of parts stored in cloud storage objects and another | ||
set of parts stored in local files. | ||
We often use the term \a object to refer to one of these written | ||
btree parts, whether the object resides on the local disk or in the cloud. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"parts" --> "containers"
src/docs/arch-tiered-storage.dox
Outdated
@section ts_checkpoints Checkpoints | ||
|
||
A normal, non-tiered table, although sometimes thought of as a | ||
"single" btree, can also be thought of as an active or \a live btree, as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just say "single btree" -- i.e., no scare quotes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
src/docs/arch-tiered-storage.dox
Outdated
new (\a N+1) current object, a checkpoint in the previous (\a N) file | ||
is guaranteed, because a WT_SESSION::checkpoint call with a \c | ||
flush_tier option is required to switch. A checkpoint in the \a N | ||
file refers to blocks in object \a N as well as previous (\a N-1, \a N-2, ...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be read as saying that a checkpoint will always include blocks in all previous objects. Maybe:
"as well as previous ... " --> "and may also refer to blocks in the previous..."
src/docs/arch-tiered-storage.dox
Outdated
file (and persist that information as well), and notice when all | ||
references to a file have reached zero. This may require enhancements | ||
to extent lists in the block manager. An asynchronous approach | ||
could work mostly separately from WiredTiger (in another process), and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"in another process" --> "in another process, possibly on a different node"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -6,7 +6,9 @@ WiredTiger supports three underlying file formats: row-store and | |||
column-store, with an underlying special case of column-store for | |||
bit-field values. All three formats are B+tree implementations of | |||
key/value stores. WiredTiger also supports @ref lsm, implemented as a | |||
tree of B+trees. | |||
tree of B+trees. In addition, there is experimental support for | |||
@ref tiered_storage, allowing a B+tree to span multiple files and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use "btree" and "B-Tree" so ubiquitously that I would eliminate "B+Tree", even though there is a reasonable case that our trees are more "B+ like".
src/docs/tiered-storage.dox
Outdated
or checkpoint) are written to a designated file, called the active file. | ||
All other files and objects that are part of the tiered table are read-only. | ||
Each object or file is given an object number, which appears as part of its | ||
name on disk, cloud and in metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stumbled on "cloud" on first reading. Maybe "in cloud storage"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/docs/tiered-storage.dox
Outdated
-# eviction is temporarily disabled on this btree, while waiting for any writes to the active file to drain | ||
-# a new empty file is created named with the next object number, this will become the table's active file | ||
-# switch the table's active file | ||
-# enable eviction on this btree | ||
-# queue the old active file to be written to object storage in the background |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous comment about this sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@keitharnoldsmith I think I missed the previous comment you're referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I thought I talked about this elsewhere, but maybe I didn't save that comment...
A flush_tier checkpoint has a slightly different sequence of operations. In particular, eviction is only disabled when we are checkpointing the tree.
Checkpoint prepare phase:
- Assemble list of all the dhandles that will be part of the checkpoint. For non-tiered tables this will be tables that have been modified since the last checkpoint. For tiered tables this will be tables modified since the last flush. See
checkpoint_flush_tier
called fromcheckpoint_prepare
. - For each tiered table, we create the new file, and queue a work item to flush the old one. The work item includes checkpoint generation information that is used by the tiered server thread to know when it is OK to flush the file to object storage (i.e., after the current checkpoint has completed). See
tiered_switch
, which is called fromcheckpoint_flush_tier
. - We do this work now so that it doesn't add time to the actual checkpoint when we have eviction disabled.
Data file checkpoint:
- When we do the actual checkpoint for each table we block eviction to the table. So checkpoint is the only thread updating the table. This happens in
wt_sync_file
. - At the end of the checkpoint, after writing the new checkpoint root but before re-enabling eviction, we switch the active file to use the new file created during the prepare phase. This ensures that everything that is part of the checkpoint is in the old file, and anything evicted after the checkpoint is in the next file.
- If the table is tiered, we fsync the old active file to ensure that all checkpoint updates are durable. (Side note: we don't have to do that here, we could do it after re-enabling eviction. But I'm just describing what the code does.)
- See
bm_checkpoint
for the active file switch and fsync. - After switching the file we allow eviction again, as we finish
wt_sync_file
, and we checkpoint the next dhandle.
Hope that makes sense. If not, feel free to ping me.
@sueloverso and @keitharnoldsmith Thank you for your extensive and very helpful comments. Sorry for the delay, this PR has suffered from skunk & back burner syndromes. I think I've finally resolved all outstanding comments, so please have at it again! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did all files except arch-tiered-storage.dox. In that one I only got through a little bit. I will complete that tomorrow but wanted to get you these other comments now.
src/docs/arch-block.dox
Outdated
- an address cookie for the extent list with available entries | ||
- an address cookie for the extent list with discarded entries | ||
- the file size for the checkpoint | ||
- the checkpoint size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't clear to the reader what the distinction is between these last two items. Please elaborate what they are and how they differ.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed - I had to figure out what the difference was myself.
src/docs/arch-block.dox
Outdated
|
||
As described in @ref block_address_cookie, an address cookie used with a tiered storage | ||
has an additional value (the \c object_id). As a result, the checkpoint cookie for a | ||
tiered btree will include additional \c object_id values that are not in other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it very clear there is one per address cookie and not just one for hte checkpoint cookie, I suggest "...for a tiered btree will include four addtional \c object_id values..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
src/docs/arch-index.dox
Outdated
@subpage arch-tiered-storage | ||
|
||
Tiered storage allows B-Trees to be stored into multiple places, more | ||
recently updated blocks on local disk and less recently updated blocks in cloud storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence feels like it is missing verbs. "...more recently updated blocks are on local disk... blocks are in cloud storage."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -6,7 +6,9 @@ WiredTiger supports three underlying file formats: row-store and | |||
column-store, with an underlying special case of column-store for | |||
bit-field values. All three formats are B+tree implementations of | |||
key/value stores. WiredTiger also supports @ref lsm, implemented as a | |||
tree of B+trees. | |||
tree of B+trees. In addition, there is experimental support for | |||
@ref tiered_storage, allowing a B+tree to span multiple files and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT this one is still there and not btree
.
@section example Example with timeline | ||
|
||
The figure below provides a high level overview of flush_tier, showing the state of a single tiered table over time. The green arrows at the top of the figure indicate write requests. These come from eviction, except during checkpoint processing. At T0, we start with two objects in object storage, Object 1, and Object 2, and File 3 as the writable active file. At T1, we start a flush checkpoint. At T2, that checkpoint completes, and WiredTiger switches the writable active file from File 3 to File 4. At this point there should be no outstanding writes to File 3 because the checkpoint has completed and eviction to the table is disabled. All future writes will go to File 4. So we can queue the copy of File 3 to object storage. The copy completes at T3. At any later time (T4) WiredTiger can remove File 3 from the local file system. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a strangely very long line. Much longer than any other paragraph on a single line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
indicate on the WT_SESSION::create calls for that table. | ||
- modify the application's checkpoint thread to include \c "flush_tier=(enable=true)" on calls to WT_SESSION::checkpoint. | ||
\c flush_tier need not be specified on every checkpoint call; typically the cadence of \c flush_tier is much less than | ||
the cadence of ordinary checkpoints. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading this, it implies (or requires) that using tiered storage also requires using an application checkpoint thread. I.e. I don't think we provide any mechanism to flush tier via the WT internal checkpoint server. Probably useful to state that for now even if we could (and should) add it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a couple sentences to this effect.
src/docs/arch-tiered-storage.dox
Outdated
<em>object file</em>, or in the cloud, as a <em>cloud object</em>. | ||
The mechanism to create new objects is a WT_SESSION::checkpoint API call | ||
with the \c flush_tier configuration set. | ||
For brevity, is often called a \c flush_tier call. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing word: "For brevity, it is called.." or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
(e.g. <code>tiered_storage=(name="s3_store"),bucket="..."</code> ). | ||
Once enabled on the connection, the configuration applies to all tables created. | ||
The configuration can be overridden on individual tables can be changed with | ||
configuration options for WT_SESSION::create . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence is awkward. I think you should remove "can be changed" so it reads "...can be overridden on individual tables with configuration..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/docs/arch-tiered-storage.dox
Outdated
Once enabled on the connection, the configuration applies to all tables created. | ||
The configuration can be overridden on individual tables can be changed with | ||
configuration options for WT_SESSION::create . | ||
This allows the caller to specify a different storage provider or bucket name or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"This "what" allows the caller..."? I think it should be "The related WT_SESSION::create configuration options allow...". That should be written out clearly. I am waffling about whether the create config options should be specified here. I don't think it is necessary, but it did cross my mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. As to the create config options - this is the arch guide, so we I don't think we need to go over detailed stuff already in the reference manual. The stuff that is duplicated here is purposeful just to give a little context for the following discussion.
src/docs/arch-tiered-storage.dox
Outdated
object is created, all writes to the previous (\a N) object are | ||
completed and that \a N object becomes read-only, like all objects | ||
before it. At this point, the \a N object is queued to be copied to | ||
cloud storage. After that copy takes place, the local copy of \a N is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest replace "takes place" to "successfully completes".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have lots of suggestions. I only need to rereview if there are large rewrites.
src/docs/arch-tiered-storage.dox
Outdated
|
||
A normal, non-tiered table, although sometimes thought of as a | ||
single btree, can also be thought of as an active or \a live btree, as well | ||
as zero or more checkpoints, that are fully represented in the single |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest replace as well as
with with
and remove the comma after checkpoints
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
as zero or more checkpoints, that are fully represented in the single | ||
disk object file. Each checkpoint has its own root page, and so can be | ||
considered its own btree. Each of these btrees is a set of | ||
pages referencing each other as a tree, some in memory, some on disk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this 3 sentence introduction kind of confusing. It took me 3 or 4 rereads to understand what you're trying to say here. It is correct but difficult. Maybe removing the first comma comment. What do you think of this?
A normal, non-tiered table can be thought of as a live btree with zero or more checkpoints that are fully represented in a single disk object file. Each checkpoint has its own root page and can be considered its own complete btree. Each of these btrees...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great. I like your re-wording.
src/docs/arch-tiered-storage.dox
Outdated
considered its own btree. Each of these btrees is a set of | ||
pages referencing each other as a tree, some in memory, some on disk. | ||
A tiered table is the same, | ||
having an \a live btree and a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an
should be a
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/docs/arch-tiered-storage.dox
Outdated
pages referencing each other as a tree, some in memory, some on disk. | ||
A tiered table is the same, | ||
having an \a live btree and a | ||
set of checkpoint btrees. However, both the active btree and the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest keep using live
instead of active
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/docs/arch-tiered-storage.dox
Outdated
A tiered table is the same, | ||
having an \a live btree and a | ||
set of checkpoint btrees. However, both the active btree and the | ||
checkpoints may span multiple object files and/or cloud objects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: ...may have pages that span multiple...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/docs/arch-tiered-storage.dox
Outdated
|
||
@subsection ts_block_local_file_removal Local File Removal | ||
|
||
When the tiered server determines that a local object file should be removed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comma can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/docs/arch-tiered-storage.dox
Outdated
to remove a local object file has several parts. | ||
|
||
When a cloud copy of an object is completed, the block manager | ||
is told, via \c BM::switch_object_end , what object id (and below) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found (and below)
confusing thinking it was referring to text below on the document. How about something like:
...is told, via ..., that object id N (and therefore all ids < N) can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/docs/arch-tiered-storage.dox
Outdated
@subsection ts_metadata_non Non-tiered Tables | ||
|
||
When non-tiered table \c A is created (without named column groups), | ||
there are two entries in the metadata file, having these keys: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two
should be three
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/docs/arch-tiered-storage.dox
Outdated
------------- | -------------- | ------------- | -------------------------------------- | ||
\c table: | WT_TABLE | yes | the dhandle is cast to (WT_TABLE *) | ||
\c file: | WT_BTREE | yes | dhandle->handle is a (WT_BTREE *) | ||
\c colgroup: | | no | stored in an array in the WT_TABLE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section is great and well done. It feels like it should go into the arch-metadata.dox
file because someone wanting to know about how this works won't think to look in the tiered storage document. I have no idea how much of this is covered in the metadata arch file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good point. I moved the first section to the end of arch-metadata.dox as an example, and refer to it from arch-tiered-storage.dox . It's not a perfect fit as it talks a little about in-memory data structures, but think it's okay for the arch guide.
src/docs/arch-tiered-storage.dox
Outdated
After all pages in an object are no longer needed, an object can be | ||
removed. The trick is in knowing when this can happen. | ||
|
||
It is expected that future solutions to work either synchronously or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest: "A future garbage collection solution could work well either synchronously or asynchonously."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@sueloverso, thanks so much for your detailed reviews! I know this can be tedious, but every comment was appreciated and made the doc better. The one thing I didn't changes was the B+tree reference. There's a ticket WT-12956 to resolve all the btree spelling -- and this one may fall into the category of describing the specific implementation of btree that we use. I'll let WT-12956 decide in a uniform manner. |
to zero, the handle is removed and freed, and the underlying file | ||
handle is closed. | ||
|
||
When a \c flush_tier is done, each table that had changes written as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done --> performed (executed?)
done is ambiguous. It could also mean "when a flush_tier is complete"
(associated with a non-tiered table) behave almost identically in | ||
the WiredTiger system. In fact, the \c WT_BTREE_PREFIX macro checks | ||
to see if a URI matches either one of these prefix strings. The macro basically | ||
means "does this thing walk and talk like a btree?". In both cases, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove period after question mark.
src/docs/tiered-storage.dox
Outdated
-# eviction is temporarily disabled on this btree, while waiting for any writes to the active file to drain | ||
-# a new empty file is created named with the next object number, this will become the table's active file | ||
-# switch the table's active file | ||
-# enable eviction on this btree | ||
-# queue the old active file to be written to object storage in the background |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I thought I talked about this elsewhere, but maybe I didn't save that comment...
A flush_tier checkpoint has a slightly different sequence of operations. In particular, eviction is only disabled when we are checkpointing the tree.
Checkpoint prepare phase:
- Assemble list of all the dhandles that will be part of the checkpoint. For non-tiered tables this will be tables that have been modified since the last checkpoint. For tiered tables this will be tables modified since the last flush. See
checkpoint_flush_tier
called fromcheckpoint_prepare
. - For each tiered table, we create the new file, and queue a work item to flush the old one. The work item includes checkpoint generation information that is used by the tiered server thread to know when it is OK to flush the file to object storage (i.e., after the current checkpoint has completed). See
tiered_switch
, which is called fromcheckpoint_flush_tier
. - We do this work now so that it doesn't add time to the actual checkpoint when we have eviction disabled.
Data file checkpoint:
- When we do the actual checkpoint for each table we block eviction to the table. So checkpoint is the only thread updating the table. This happens in
wt_sync_file
. - At the end of the checkpoint, after writing the new checkpoint root but before re-enabling eviction, we switch the active file to use the new file created during the prepare phase. This ensures that everything that is part of the checkpoint is in the old file, and anything evicted after the checkpoint is in the next file.
- If the table is tiered, we fsync the old active file to ensure that all checkpoint updates are durable. (Side note: we don't have to do that here, we could do it after re-enabling eviction. But I'm just describing what the code does.)
- See
bm_checkpoint
for the active file switch and fsync. - After switching the file we allow eviction again, as we finish
wt_sync_file
, and we checkpoint the next dhandle.
Hope that makes sense. If not, feel free to ping me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I've added a couple minor comments.
I also have a bigger comment in response to a thread from my previous review.
Thanks again for all the work on this, Don!
…iered ref section for checkpoints.
@keitharnoldsmith, thanks for your detailed comments on checkpoint - I lifted that into the arch guide, as it seemed appropriate. I made the ref guide more terse (as we don't really describe much about regular checkpoints in the ref guide). Could you please do a quick review of parts I just updated:
Here are pdfs of the two pages, in case that's helpful: No rush, other than wanted to get this put to bed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Don! LGTM
No description provided.