
WT-9529 Add doc reference guide and architecture guide entries for Tiered Storage #10558

Merged: 17 commits into develop from wt-9529-tiered-doc, Jun 6, 2024

Conversation

ddanderson (Member):

No description provided.

@ddanderson (Member Author):

@sueloverso and @keitharnoldsmith, attaching prebuilt pages for easier reading:

arch-tiered-storage-20240506.pdf

tiered_storage-20240506.pdf

@sueloverso (Member) left a comment:

Lots of minor and broad comments as is common with a large doc/text change.

@@ -289,6 +290,10 @@ The format of the data file is given by structures in \c block.h .

The cloud storage source extension which manage the flushing of the data files to and reading from different cloud object stores.

@subpage arch-tiered-storage

Tiered storage allows B-Trees to be split into multiple parts, some on local disk and some in cloud storage.
Member:

Minor nit: I suggest expanding/detailing the two uses of "some" at the end of the sentence. Maybe something like:
...into multiple parts, more recently updated parts on local disk and less recent parts in cloud storage
or, if you don't want to talk about time, just some parts of the tree on local disk and some parts in cloud storage.

Member Author:

Done

Tiered storage provides a way to split Btrees into multiple parts,
with some set of parts stored in cloud storage objects and another
set of parts stored in local files.
We often use the term \a object to refer one of these written
Member:

typo: should be "..to refer to one..."


@section ts_intro Introduction and Definitions

Tiered storage provides a way to split Btrees into multiple parts,
Member:

Minor consistency nit: In the arch-index.dox above you used B-trees and here you are (consistently) using Btrees. Should it have the hyphen or not? (Possibly a much broader comment than just your changes.)

Member Author:

It is a broader comment! It seems like a strong majority (70%) of uses in the regular reference doc are "btree". Maybe 20% are "Btree", the rest either b-tree or b+tree (with various capitalizations). It's closer in the arch guide, but "btree" is still winning. So I'll go with a consistent "btree" here and file a ticket to clean up the rest.

Member Author:

WT-12956 for the broader question/solution.


When in use, a tiered Btree, like a regular Btree, may have some
recently used or modified data that resides in memory pages.
This in memory representation is the same between a tiered Btree
Member:

Should "in memory" be "in-memory"? (Also potentially a broader consistency issue.)

Member Author:

Yes, good catch - I'd say "This in-memory" is correct, while "resides in memory pages." on the line before is correct. So not a global search/replace, but some other uses may need to be fixed up.

with some set of parts stored in cloud storage objects and another
set of parts stored in local files.
We often use the term \a object to refer one of these written
Btree parts, whether it resides on the local disk or in the cloud.
Member:

Minor nit: I generally prefer to avoid vague "it"-type words and to say the words explicitly. I suggest "...whether the object resides..."

Member Author:

Agreed.

WT_STORAGE_SOURCE. The storage source can be thought of as a driver,
or an abstraction of a cloud provider with operations. WiredTiger has a
several instances of \c WT_STORAGE_SOURCE, these include the drivers
for the AWS, GCP, and Azure clouds. A storage source can be asked to
Member:

Since this is for internal developers, it may be worth noting dir_store as well. That storage source is often easier to understand.
ETA: I see you do mention it below. Maybe mention it here and say it will be discussed in more detail below.

Member Author:

Fixed.
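
For readers unfamiliar with dir_store, here is a minimal sketch of enabling it, loosely modeled on WiredTiger's test setups; the extension path, bucket directory, and prefix are illustrative assumptions, not fixed values.

```c
#include <wiredtiger.h>

/*
 * Minimal sketch: load the dir_store storage source extension and enable
 * tiered storage on the connection. The extension path, bucket directory,
 * and bucket_prefix below are illustrative assumptions.
 */
static int
open_with_dir_store(const char *home, WT_CONNECTION **connp)
{
    return (wiredtiger_open(home, NULL,
      "create,"
      "extensions=[\"./ext/storage_sources/dir_store/libwiredtiger_dir_store.so\"],"
      "tiered_storage=(name=dir_store,bucket=\"./bucket\",bucket_prefix=\"pfx-\")",
      connp));
}
```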

-# queue a FLUSH_FINISH operation
-# queue a REMOVE_LOCAL operation

@section ts_future Future
Member:

I stopped here. I'm not sure how useful it is to have all the future stuff listed here. It is years old and would likely change if/when it ever gets resuscitated.

Contributor:

While I agree that the "future stuff" isn't needed -- I would approve a PR without it -- I don't see any harm in including it. So since it's written, I'm happy to leave it as is.

There are certainly many other parts of the system where I would have found it useful/helpful to understand a bit more of the thinking that motivated the decisions.
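
As background for the FLUSH_FINISH and REMOVE_LOCAL steps quoted above, a rough sketch of the kind of work queue involved; the names mirror the operations named in the doc, but this is illustrative, not WiredTiger's actual internal layout.

```c
#include <stdint.h>

/*
 * Illustrative only: the flavor of work items the quoted steps queue for the
 * tiered server thread. Not WiredTiger's actual internal structures.
 */
typedef enum {
    TIERED_WORK_FLUSH,        /* copy a local object file to cloud storage */
    TIERED_WORK_FLUSH_FINISH, /* bookkeeping once the cloud copy is durable */
    TIERED_WORK_REMOVE_LOCAL  /* drop the local copy once it is in the cloud */
} tiered_work_type;

typedef struct tiered_work_unit {
    tiered_work_type type;
    uint32_t object_id;            /* which object this work applies to */
    struct tiered_work_unit *next; /* simple FIFO link for the server queue */
} tiered_work_unit;
```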

@@ -6,7 +6,9 @@ WiredTiger supports three underlying file formats: row-store and
column-store, with an underlying special case of column-store for
bit-field values. All three formats are B+tree implementations of
key/value stores. WiredTiger also supports @ref lsm, implemented as a
tree of B+trees.
tree of B+trees. In addition, there is experimental support for
@ref tiered_storage, allowing a B+tree to span multiple files and
Member:

And here's a 3rd form. B-tree, Btree and B+tree!

Contributor:

We use "btree" and "B-Tree" so ubiquitously that I would eliminate "B+Tree", even though there is a reasonable case that our trees are more "B+ like".

Member Author:

We're going with btree, that's been the dominant form.

Member:

AFAICT this one is still there and not btree.

Member Author:

I'm leaving this one for WT-12956. Sometimes we want to specifically state that the implementation is B+Tree; I'm not sure if this is one of those cases.

src/docs/tiered-storage.dox (resolved thread)
each in its own file.

When in use, a tiered btree, like a regular btree, may have some recently used
or modified data that resides in memory pages. This in memory representation
Member:

Same comment in the other document about whether it should be in-memory.

Member Author:

Yes, fixed the second one to be in-memory.

@keitharnoldsmith (Contributor) left a comment:

Thanks for all the work on this, Don!

I've mostly left suggestions, except for a couple comments (which I think will be obvious) that should definitely be addressed. Let me know if you have specific questions.

@@ -289,6 +290,10 @@ The format of the data file is given by structures in \c block.h .

The cloud storage source extension which manage the flushing of the data files to and reading from different cloud object stores.

@subpage arch-tiered-storage

Tiered storage allows B-Trees to be split into multiple parts, more recently updated parts on local disk and less recent parts in cloud storage.
Contributor:

"split into multiple parts" suggests to me that we could be splitting the btree into sub-trees. I would be more explicit about the granularity of splitting here. Maybe:

Tiered storage allows B-Tree data to be stored in multiple places, more recently updated blocks on local disk and less recently updated blocks in cloud storage.

Member Author:

Fixed

Comment on lines 8 to 10
Tiered storage provides a way to split btrees into multiple parts,
with some set of parts stored in cloud storage objects and another
set of parts stored in local files.
Contributor:

See earlier comment.

"split btrees into multiple parts" --> "store btree data in multiple containers"

"set of parts" --> blocks

(Aside: is there a convention in the doc about whether we say "btree" or "B-Tree"?)

with some set of parts stored in cloud storage objects and another
set of parts stored in local files.
We often use the term \a object to refer to one of these written
btree parts, whether the object resides on the local disk or in the cloud.
Contributor:

"parts" --> "containers"

@section ts_checkpoints Checkpoints

A normal, non-tiered table, although sometimes thought of as a
"single" btree, can also be thought of as an active or \a live btree, as well
Contributor:

I'd just say "single btree" -- i.e., no scare quotes.

Member Author:

Fixed

new (\a N+1) current object, a checkpoint in the previous (\a N) file
is guaranteed, because a WT_SESSION::checkpoint call with a \c
flush_tier option is required to switch. A checkpoint in the \a N
file refers to blocks in object \a N as well as previous (\a N-1, \a N-2, ...)
Contributor:

This could be read as saying that a checkpoint will always include blocks in all previous objects. Maybe:

"as well as previous ... " --> "and may also refer to blocks in the previous..."

src/docs/arch-tiered-storage.dox (resolved thread)
file (and persist that information as well), and notice when all
references to a file have reached zero. This may require enhancements
to extent lists in the block manager. An asynchronous approach
could work mostly separately from WiredTiger (in another process), and
Contributor:

"in another process" --> "in another process, possibly on a different node"

Member Author:

Fixed


or checkpoint) are written to a designated file, called the active file.
All other files and objects that are part of the tiered table are read-only.
Each object or file is given an object number, which appears as part of its
name on disk, cloud and in metadata.
Contributor:

I stumbled on "cloud" on first reading. Maybe "in cloud storage"?

Member Author:

Fixed.

Comment on lines 50 to 54
-# eviction is temporarily disabled on this btree, while waiting for any writes to the active file to drain
-# a new empty file is created named with the next object number, this will become the table's active file
-# switch the table's active file
-# enable eviction on this btree
-# queue the old active file to be written to object storage in the background
Contributor:

See previous comment about this sequence.

Member Author:

@keitharnoldsmith I think I missed the previous comment you're referring to?

Contributor:

Hmm. I thought I talked about this elsewhere, but maybe I didn't save that comment...

A flush_tier checkpoint has a slightly different sequence of operations. In particular, eviction is only disabled when we are checkpointing the tree.

Checkpoint prepare phase:

  • Assemble list of all the dhandles that will be part of the checkpoint. For non-tiered tables this will be tables that have been modified since the last checkpoint. For tiered tables this will be tables modified since the last flush. See checkpoint_flush_tier called from checkpoint_prepare.
  • For each tiered table, we create the new file, and queue a work item to flush the old one. The work item includes checkpoint generation information that is used by the tiered server thread to know when it is OK to flush the file to object storage (i.e., after the current checkpoint has completed). See tiered_switch, which is called from checkpoint_flush_tier.
  • We do this work now so that it doesn't add time to the actual checkpoint when we have eviction disabled.

Data file checkpoint:

  • When we do the actual checkpoint for each table we block eviction to the table. So checkpoint is the only thread updating the table. This happens in wt_sync_file.
  • At the end of the checkpoint, after writing the new checkpoint root but before re-enabling eviction, we switch the active file to use the new file created during the prepare phase. This ensures that everything that is part of the checkpoint is in the old file, and anything evicted after the checkpoint is in the next file.
  • If the table is tiered, we fsync the old active file to ensure that all checkpoint updates are durable. (Side note: we don't have to do that here, we could do it after re-enabling eviction. But I'm just describing what the code does.)
  • See bm_checkpoint for the active file switch and fsync.
  • After switching the file we allow eviction again, as we finish wt_sync_file, and we checkpoint the next dhandle.

Hope that makes sense. If not, feel free to ping me.
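
To gather the sequence above in one place, a condensed sketch; every name here is a stand-in for the routines mentioned (checkpoint_flush_tier, tiered_switch, wt_sync_file, bm_checkpoint), on the assumption the description above is accurate.

```c
#include <stdint.h>

/* Stand-in declarations for the steps described above; not real WT symbols. */
void create_next_object_file(void);
void queue_flush_work(uint64_t ckpt_gen);
void block_eviction(void);
void write_checkpoint_root(void);
void switch_active_file(void);
void fsync_old_active_file(void);
void allow_eviction(void);

void
flush_tier_checkpoint_one_table(uint64_t ckpt_gen)
{
    /* Prepare phase: runs before eviction is blocked, so it adds no time
     * to the checkpoint itself (cf. checkpoint_flush_tier, tiered_switch). */
    create_next_object_file();  /* will become the active file below */
    queue_flush_work(ckpt_gen); /* flushed only after this checkpoint is done */

    /* Data file checkpoint (cf. wt_sync_file, bm_checkpoint). */
    block_eviction();           /* checkpoint is now the only writer */
    write_checkpoint_root();
    switch_active_file();       /* later evictions go to the new file */
    fsync_old_active_file();    /* checkpoint updates are durable */
    allow_eviction();
}
```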

@ddanderson (Member Author):

@sueloverso and @keitharnoldsmith, thank you for your extensive and very helpful comments. Sorry for the delay; this PR has suffered from skunk and back-burner syndromes. I think I've finally resolved all outstanding comments, so please have at it again!

@sueloverso (Member) left a comment:

I did all files except arch-tiered-storage.dox. In that one I only got through a little bit. I will complete that tomorrow but wanted to get you these other comments now.

- an address cookie for the extent list with available entries
- an address cookie for the extent list with discarded entries
- the file size for the checkpoint
- the checkpoint size
Member:

It isn't clear to the reader what the distinction is between these last two items. Please elaborate what they are and how they differ.

Member Author:

Fixed - I had to figure out what the difference was myself.


As described in @ref block_address_cookie, an address cookie used with a tiered storage
has an additional value (the \c object_id). As a result, the checkpoint cookie for a
tiered btree will include additional \c object_id values that are not in other
Member:

To make it very clear there is one per address cookie and not just one for the checkpoint cookie, I suggest "...for a tiered btree will include four additional \c object_id values..."

Member Author:

Fixed
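
For orientation, a conceptual view of the fields under discussion. Real address cookies are variable-length packed byte strings, so the struct below is purely illustrative, not the on-disk format.

```c
#include <stdint.h>

/*
 * Conceptual view only: real address cookies are variable-length packed byte
 * strings, not fixed structs. Tiered storage adds the object_id field, and a
 * checkpoint cookie carries one such cookie per extent list it references.
 */
struct addr_cookie_fields {
    uint64_t object_id; /* tiered only: which object holds the block */
    uint64_t offset;    /* byte offset of the block within that object */
    uint64_t size;      /* block size in bytes */
    uint64_t checksum;  /* block checksum */
};
```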

@subpage arch-tiered-storage

Tiered storage allows B-Trees to be stored into multiple places, more
recently updated blocks on local disk and less recently updated blocks in cloud storage.
Member:

This sentence feels like it is missing verbs. "...more recently updated blocks are on local disk... blocks are in cloud storage."

Member Author:

Fixed


@section example Example with timeline

The figure below provides a high level overview of flush_tier, showing the state of a single tiered table over time. The green arrows at the top of the figure indicate write requests. These come from eviction, except during checkpoint processing. At T0, we start with two objects in object storage, Object 1, and Object 2, and File 3 as the writable active file. At T1, we start a flush checkpoint. At T2, that checkpoint completes, and WiredTiger switches the writable active file from File 3 to File 4. At this point there should be no outstanding writes to File 3 because the checkpoint has completed and eviction to the table is disabled. All future writes will go to File 4. So we can queue the copy of File 3 to object storage. The copy completes at T3. At any later time (T4) WiredTiger can remove File 3 from the local file system.

Member:

This is a strangely very long line. Much longer than any other paragraph on a single line.

Member Author:

Fixed.

indicate on the WT_SESSION::create calls for that table.
- modify the application's checkpoint thread to include \c "flush_tier=(enable=true)" on calls to WT_SESSION::checkpoint.
\c flush_tier need not be specified on every checkpoint call; typically the cadence of \c flush_tier is much less than
the cadence of ordinary checkpoints.
Member:

Reading this, it implies (or requires) that using tiered storage also requires using an application checkpoint thread. I.e. I don't think we provide any mechanism to flush tier via the WT internal checkpoint server. Probably useful to state that for now even if we could (and should) add it later.

Member Author:

Added a couple of sentences to this effect.
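
For concreteness, a minimal sketch of such an application checkpoint thread, using the documented WT_SESSION::checkpoint API; the interval and cadence constants are arbitrary assumptions.

```c
#include <unistd.h>
#include <wiredtiger.h>

#define CHECKPOINT_INTERVAL_SECS 60 /* assumed checkpoint cadence */
#define FLUSH_TIER_EVERY 10         /* assumed: flush_tier on every 10th */

/* Sketch of a hypothetical application checkpoint thread. */
static void
checkpoint_loop(WT_SESSION *session)
{
    int i;

    for (i = 0;; ++i) {
        sleep(CHECKPOINT_INTERVAL_SECS);
        /* flush_tier need not be on every checkpoint; use a coarser cadence. */
        (void)session->checkpoint(session,
          i % FLUSH_TIER_EVERY == 0 ? "flush_tier=(enable=true)" : NULL);
    }
}
```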

<em>object file</em>, or in the cloud, as a <em>cloud object</em>.
The mechanism to create new objects is a WT_SESSION::checkpoint API call
with the \c flush_tier configuration set.
For brevity, is often called a \c flush_tier call.
Member:

Missing word: "For brevity, it is called.." or something.

Member Author:

Fixed

(e.g. <code>tiered_storage=(name="s3_store"),bucket="..."</code> ).
Once enabled on the connection, the configuration applies to all tables created.
The configuration can be overridden on individual tables can be changed with
configuration options for WT_SESSION::create .
Member:

This sentence is awkward. I think you should remove "can be changed" so it reads "...can be overridden on individual tables with configuration..."

Member Author:

fixed

Once enabled on the connection, the configuration applies to all tables created.
The configuration can be overridden on individual tables can be changed with
configuration options for WT_SESSION::create .
This allows the caller to specify a different storage provider or bucket name or
Member:

"This "what" allows the caller..."? I think it should be "The related WT_SESSION::create configuration options allow...". That should be written out clearly. I am waffling about whether the create config options should be specified here. I don't think it is necessary, but it did cross my mind.

Member Author:

Fixed. As to the create config options: this is the arch guide, so I don't think we need to go over detailed stuff already in the reference manual. The stuff that is duplicated here is purposeful, just to give a little context for the following discussion.
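
A minimal sketch of the connection-level configuration and a per-table override via WT_SESSION::create; the provider, bucket, and table names are illustrative, and tiered_storage=(name=none) is shown as the assumed way to opt a table out.

```c
#include <wiredtiger.h>

/*
 * Sketch: tiered storage configured on the connection, then overridden on an
 * individual table. Provider, bucket, and table names are assumptions.
 */
static int
create_tables(const char *home)
{
    WT_CONNECTION *conn;
    WT_SESSION *session;
    int ret;

    if ((ret = wiredtiger_open(home, NULL,
          "create,tiered_storage=(name=\"s3_store\",bucket=\"my-bucket\")",
          &conn)) != 0)
        return (ret);
    if ((ret = conn->open_session(conn, NULL, NULL, &session)) != 0)
        return (ret);

    /* Inherits the connection's tiered storage configuration. */
    if ((ret = session->create(
           session, "table:cloud_table", "key_format=S,value_format=S")) != 0)
        return (ret);
    /* Per-table override: assumed syntax to opt this table out. */
    if ((ret = session->create(session, "table:local_table",
           "key_format=S,value_format=S,tiered_storage=(name=none)")) != 0)
        return (ret);
    return (conn->close(conn, NULL));
}
```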

object is created, all writes to the previous (\a N) object are
completed and that \a N object becomes read-only, like all objects
before it. At this point, the \a N object is queued to be copied to
cloud storage. After that copy takes place, the local copy of \a N is
Member:

I suggest replacing "takes place" with "successfully completes".

Member Author:

Fixed.

@sueloverso (Member) left a comment:

I have lots of suggestions. I only need to rereview if there are large rewrites.


A normal, non-tiered table, although sometimes thought of as a
single btree, can also be thought of as an active or \a live btree, as well
as zero or more checkpoints, that are fully represented in the single
Member:

I suggest replacing "as well as" with "with" and removing the comma after "checkpoints".

Member Author:

fixed

as zero or more checkpoints, that are fully represented in the single
disk object file. Each checkpoint has its own root page, and so can be
considered its own btree. Each of these btrees is a set of
pages referencing each other as a tree, some in memory, some on disk.
Member:

I found this 3-sentence introduction kind of confusing. It took me 3 or 4 rereads to understand what you're trying to say here. It is correct but difficult. Maybe remove the first comma-delimited clause. What do you think of this?
A normal, non-tiered table can be thought of as a live btree with zero or more checkpoints that are fully represented in a single disk object file. Each checkpoint has its own root page and can be considered its own complete btree. Each of these btrees...

Member Author:

That's great. I like your re-wording.

considered its own btree. Each of these btrees is a set of
pages referencing each other as a tree, some in memory, some on disk.
A tiered table is the same,
having an \a live btree and a
Member:

"an" should be "a".

Member Author:

fixed

pages referencing each other as a tree, some in memory, some on disk.
A tiered table is the same,
having an \a live btree and a
set of checkpoint btrees. However, both the active btree and the
Member:

I suggest keeping "live" instead of "active" here.

Member Author:

fixed

A tiered table is the same,
having an \a live btree and a
set of checkpoint btrees. However, both the active btree and the
checkpoints may span multiple object files and/or cloud objects.
Member:

Suggestion: ...may have pages that span multiple...

Member Author:

fixed


@subsection ts_block_local_file_removal Local File Removal

When the tiered server determines that a local object file should be removed,
Member:

I think the comma can be removed.

Member Author:

fixed

to remove a local object file has several parts.

When a cloud copy of an object is completed, the block manager
is told, via \c BM::switch_object_end , what object id (and below)
Member:

I found "(and below)" confusing, thinking it was referring to text below in the document. How about something like:
...is told, via ..., that object id N (and therefore all ids < N) can be removed.

Member Author:

fixed
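
As a simple illustration of the "id N and all ids below" rule, a hypothetical sweep; the file-naming scheme is a stand-in, and the real code must also wait for open references to a file to drain before removing it.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: once the block manager learns (via BM::switch_object_end)
 * that object id N is safely in cloud storage, local object files with
 * id <= N become removal candidates. The naming scheme is a hypothetical
 * stand-in, and real removal also waits for open references to drain.
 */
static void
sweep_local_objects(const char *table_name, uint32_t flushed_id)
{
    char path[256];
    uint32_t id;

    for (id = 1; id <= flushed_id; ++id) {
        (void)snprintf(
          path, sizeof(path), "%s-%010" PRIu32 ".wtobj", table_name, id);
        (void)remove(path); /* ignore files removed by an earlier sweep */
    }
}
```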

@subsection ts_metadata_non Non-tiered Tables

When non-tiered table \c A is created (without named column groups),
there are two entries in the metadata file, having these keys:
Member:

"two" should be "three".

Member Author:

fixed

------------- | -------------- | ------------- | --------------------------------------
\c table: | WT_TABLE | yes | the dhandle is cast to (WT_TABLE *)
\c file: | WT_BTREE | yes | dhandle->handle is a (WT_BTREE *)
\c colgroup: | | no | stored in an array in the WT_TABLE
Member:

This section is great and well done. It feels like it should go into the arch-metadata.dox file because someone wanting to know about how this works won't think to look in the tiered storage document. I have no idea how much of this is covered in the metadata arch file.

Member Author:

I think this is a good point. I moved the first section to the end of arch-metadata.dox as an example, and refer to it from arch-tiered-storage.dox. It's not a perfect fit as it talks a little about in-memory data structures, but I think it's okay for the arch guide.
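
A quick way to see these metadata entries is WiredTiger's documented "metadata:" cursor; a minimal sketch, with error handling omitted for brevity:

```c
#include <stdio.h>
#include <wiredtiger.h>

/*
 * Sketch: dump all metadata entries via the "metadata:" cursor to inspect
 * the table:/colgroup:/file: keys discussed above (plus the extra
 * tiered-related keys for tiered tables). Error handling omitted.
 */
static void
dump_metadata(WT_SESSION *session)
{
    WT_CURSOR *cursor;
    const char *key, *value;

    (void)session->open_cursor(session, "metadata:", NULL, NULL, &cursor);
    while (cursor->next(cursor) == 0) {
        (void)cursor->get_key(cursor, &key);
        (void)cursor->get_value(cursor, &value);
        printf("%s -> %s\n", key, value);
    }
    (void)cursor->close(cursor);
}
```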

After all pages in an object are no longer needed, an object can be
removed. The trick is in knowing when this can happen.

It is expected that future solutions to work either synchronously or
Member:

I suggest: "A future garbage collection solution could work well either synchronously or asynchonously."

Member Author:

fixed.

@ddanderson (Member Author):

@sueloverso, thanks so much for your detailed reviews! I know this can be tedious, but every comment was appreciated and made the doc better.

The one thing I didn't change was the B+tree reference. There's a ticket, WT-12956, to resolve all the btree spellings -- and this one may fall into the category of describing the specific implementation of btree that we use. I'll let WT-12956 decide in a uniform manner.

to zero, the handle is removed and freed, and the underlying file
handle is closed.

When a \c flush_tier is done, each table that had changes written as
Contributor:

done --> performed (executed?)

"done" is ambiguous; it could also mean "when a flush_tier is complete".

(associated with a non-tiered table) behave almost identically in
the WiredTiger system. In fact, the \c WT_BTREE_PREFIX macro checks
to see if a URI matches either one of these prefix strings. The macro basically
means "does this thing walk and talk like a btree?". In both cases,
Contributor:

Remove period after question mark.
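
The gist of that macro, sketched in plain C; "file:" and "tiered:" are the btree-backed URI prefixes implied by the surrounding text, and the real macro builds on WiredTiger's internal prefix-matching helpers rather than strncmp.

```c
#include <string.h>

/* Illustrative version of the "walks and talks like a btree" check. */
#define URI_HAS_PREFIX(uri, pfx) (strncmp(uri, pfx, strlen(pfx)) == 0)
#define URI_IS_BTREE(uri) \
    (URI_HAS_PREFIX(uri, "file:") || URI_HAS_PREFIX(uri, "tiered:"))
```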


@keitharnoldsmith (Contributor) left a comment:

LGTM. I've added a couple minor comments.

I also have a bigger comment in response to a thread from my previous review.

Thanks again for all the work on this, Don!

@ddanderson (Member Author):

@keitharnoldsmith, thanks for your detailed comments on checkpoint - I lifted that into the arch guide, as it seemed appropriate. I made the ref guide more terse (as we don't really describe much about regular checkpoints in the ref guide).

Could you please do a quick review of parts I just updated:

  • the checkpoint sections of the ref guide ("The flush_tier operation" first three paragraphs)
  • the arch guide ("Flush Checkpoint Operations")

Here are pdfs of the two pages, in case that's helpful:
ref-tiered-storage-20240604.pdf
arch-tiered-storage-20240604.pdf

No rush, other than I wanted to get this put to bed.

@keitharnoldsmith (Contributor) left a comment:

Thanks, Don! LGTM

@ddanderson added this pull request to the merge queue Jun 6, 2024
Merged via the queue into develop with commit 20daa87 Jun 6, 2024
7 checks passed
@ddanderson deleted the wt-9529-tiered-doc branch June 6, 2024 19:11