
WT-9529 Add doc reference guide and architecture guide entries for Tiered Storage #10558

Merged: 17 commits into develop from wt-9529-tiered-doc, Jun 6, 2024

Conversation

ddanderson (Member):

No description provided.

@ddanderson (Member Author):

@sueloverso and @keitharnoldsmith, attaching prebuilt pages for easier reading:

arch-tiered-storage-20240506.pdf

tiered_storage-20240506.pdf

@sueloverso (Member) left a comment:

Lots of minor and broad comments as is common with a large doc/text change.

@@ -289,6 +290,10 @@ The format of the data file is given by structures in \c block.h .

The cloud storage source extension which manage the flushing of the data files to and reading from different cloud object stores.

@subpage arch-tiered-storage

Tiered storage allows B-Trees to be split into multiple parts, some on local disk and some in cloud storage.
Member:

Minor nit: I suggest expanding/detailing the two uses of "some" at the end of the sentence. Maybe something like:
...into multiple parts, more recently updated parts on local disk and less recent parts in cloud storage
or, if you don't want to talk about time, just some parts of the tree on local disk and some parts in cloud storage.

Member Author:

Done

Tiered storage provides a way to split Btrees into multiple parts,
with some set of parts stored in cloud storage objects and another
set of parts stored in local files.
We often use the term \a object to refer one of these written
Member:

typo: should be "..to refer to one..."


@section ts_intro Introduction and Definitions

Tiered storage provides a way to split Btrees into multiple parts,
Member:

Minor consistency nit: In the arch-index.dox above you used B-trees and here you are (consistently) using Btrees. Should it have the hyphen or not? (Possibly a much broader comment than just your changes.)

Member Author:

It is a broader comment! It seems like a strong majority (70%) of uses in the regular reference doc are "btree". Maybe 20% are "Btree", the rest either b-tree or b+tree (with various capitalizations). It's closer in the arch guide, but "btree" is still winning. So I'll go with a consistent "btree" here and file a ticket to clean up the rest.

Member Author:

WT-12956 for the broader question/solution.


When in use, a tiered Btree, like a regular Btree, may have some
recently used or modified data that resides in memory pages.
This in memory representation is the same between a tiered Btree
Member:

Should "in memory" be "in-memory"? (Also potentially a broader consistency issue.)

Member Author:

Yes, good catch - I'd say "This in-memory" is correct, while "resides in memory pages." on the line before is correct. So not a global search/replace, but some other uses may need to be fixed up.

with some set of parts stored in cloud storage objects and another
set of parts stored in local files.
We often use the term \a object to refer one of these written
Btree parts, whether it resides on the local disk or in the cloud.
Member:

Minor nit: I generally prefer to avoid vague "it"-type words and to say the words explicitly. I suggest "...whether the object resides..."

Member Author:

Agreed.

WT_STORAGE_SOURCE. The storage source can be thought of as a driver,
or an abstraction of a cloud provider with operations. WiredTiger has a
several instances of \c WT_STORAGE_SOURCE, these include the drivers
for the AWS, GCP, and Azure clouds. A storage source can be asked to
Member:

Since this is for internal developers, it may be worth noting dir_store as well. That storage source is often easier to understand.
ETA: I see you do mention it below. Maybe mention it here and say it will be discussed in more detail below.

Member Author:

Fixed.
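
For readers unfamiliar with dir_store, here is a minimal sketch of enabling it, loosely modeled on WiredTiger's test setups; the extension path, bucket directory, and prefix are illustrative assumptions, not fixed values.

```c
#include <wiredtiger.h>

/*
 * Minimal sketch: load the dir_store storage source extension and enable
 * tiered storage on the connection. The extension path, bucket directory,
 * and bucket_prefix below are illustrative assumptions.
 */
static int
open_with_dir_store(const char *home, WT_CONNECTION **connp)
{
    return (wiredtiger_open(home, NULL,
      "create,"
      "extensions=[\"./ext/storage_sources/dir_store/libwiredtiger_dir_store.so\"],"
      "tiered_storage=(name=dir_store,bucket=\"./bucket\",bucket_prefix=\"pfx-\")",
      connp));
}
```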

-# queue a FLUSH_FINISH operation
-# queue a REMOVE_LOCAL operation

@section ts_future Future
Member:

I stopped here. I'm not sure how useful it is to have all the future stuff listed here. It is years old and would likely change if/when it ever gets resuscitated.

Contributor:

While I agree that the "future stuff" isn't needed -- I would approve a PR without it -- I don't see any harm in including it. So since it's written, I'm happy to leave it as is.

There are certainly many other parts of the system where I would have found it useful/helpful to understand a bit more of the thinking that motivated the decisions.
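
As background for the FLUSH_FINISH and REMOVE_LOCAL steps quoted above, a rough sketch of the kind of work queue involved; the names mirror the operations named in the doc, but this is illustrative, not WiredTiger's actual internal layout.

```c
#include <stdint.h>

/*
 * Illustrative only: the flavor of work items the quoted steps queue for the
 * tiered server thread. Not WiredTiger's actual internal structures.
 */
typedef enum {
    TIERED_WORK_FLUSH,        /* copy a local object file to cloud storage */
    TIERED_WORK_FLUSH_FINISH, /* bookkeeping once the cloud copy is durable */
    TIERED_WORK_REMOVE_LOCAL  /* drop the local copy once it is in the cloud */
} tiered_work_type;

typedef struct tiered_work_unit {
    tiered_work_type type;
    uint32_t object_id;            /* which object this work applies to */
    struct tiered_work_unit *next; /* simple FIFO link for the server queue */
} tiered_work_unit;
```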

@@ -6,7 +6,9 @@ WiredTiger supports three underlying file formats: row-store and
column-store, with an underlying special case of column-store for
bit-field values. All three formats are B+tree implementations of
key/value stores. WiredTiger also supports @ref lsm, implemented as a
tree of B+trees.
tree of B+trees. In addition, there is experimental support for
@ref tiered_storage, allowing a B+tree to span multiple files and
Member:

And here's a 3rd form. B-tree, Btree and B+tree!

Contributor:

We use "btree" and "B-Tree" so ubiquitously that I would eliminate "B+Tree", even though there is a reasonable case that our trees are more "B+ like".

Member Author:

We're going with btree, that's been the dominant form.

Member:

AFAICT this one is still there and not btree.

Member Author:

I'm leaving this one for WT-12956. Sometimes we want to specifically state that the implementation is B+Tree; I'm not sure if this is one of those cases.

src/docs/tiered-storage.dox (resolved thread)
each in its own file.

When in use, a tiered btree, like a regular btree, may have some recently used
or modified data that resides in memory pages. This in memory representation
Member:

Same comment in the other document about whether it should be in-memory.

Member Author:

Yes, fixed the second one to be in-memory.

@keitharnoldsmith (Contributor) left a comment:

Thanks for all the work on this, Don!

I've mostly left suggestions, except for a couple comments (which I think will be obvious) that should definitely be addressed. Let me know if you have specific questions.

@@ -289,6 +290,10 @@ The format of the data file is given by structures in \c block.h .

The cloud storage source extension which manage the flushing of the data files to and reading from different cloud object stores.

@subpage arch-tiered-storage

Tiered storage allows B-Trees to be split into multiple parts, more recently updated parts on local disk and less recent parts in cloud storage.
Contributor:

"split into multiple parts" suggests to me that we could be splitting the btree into sub-trees. I would be more explicit about the granularity of splitting here. Maybe:

Tiered storage allows B-Tree data to be stored in multiple places, more recently updated blocks on local disk and less recently updated blocks in cloud storage.

Member Author:

Fixed

Comment on lines 8 to 10
Tiered storage provides a way to split btrees into multiple parts,
with some set of parts stored in cloud storage objects and another
set of parts stored in local files.
Contributor:

See earlier comment.

"split btrees into multiple parts" --> "store btree data in multiple containers"

"set of parts" --> blocks

(Aside: is there a convention in the doc about whether we say "btree" or "B-Tree"?)

with some set of parts stored in cloud storage objects and another
set of parts stored in local files.
We often use the term \a object to refer to one of these written
btree parts, whether the object resides on the local disk or in the cloud.
Contributor:

"parts" --> "containers"

@section ts_checkpoints Checkpoints

A normal, non-tiered table, although sometimes thought of as a
"single" btree, can also be thought of as an active or \a live btree, as well
Contributor:

I'd just say "single btree" -- i.e., no scare quotes.

Member Author:

Fixed

new (\a N+1) current object, a checkpoint in the previous (\a N) file
is guaranteed, because a WT_SESSION::checkpoint call with a \c
flush_tier option is required to switch. A checkpoint in the \a N
file refers to blocks in object \a N as well as previous (\a N-1, \a N-2, ...)
Contributor:

This could be read as saying that a checkpoint will always include blocks in all previous objects. Maybe:

"as well as previous ... " --> "and may also refer to blocks in the previous..."

src/docs/arch-tiered-storage.dox (resolved thread)
file (and persist that information as well), and notice when all
references to a file have reached zero. This may require enhancements
to extent lists in the block manager. An asynchronous approach
could work mostly separately from WiredTiger (in another process), and
Contributor:

"in another process" --> "in another process, possibly on a different node"

Member Author:

Fixed


or checkpoint) are written to a designated file, called the active file.
All other files and objects that are part of the tiered table are read-only.
Each object or file is given an object number, which appears as part of its
name on disk, cloud and in metadata.
Contributor:

I stumbled on "cloud" on first reading. Maybe "in cloud storage"?

Member Author:

Fixed.

Comment on lines 50 to 54
-# eviction is temporarily disabled on this btree, while waiting for any writes to the active file to drain
-# a new empty file is created named with the next object number, this will become the table's active file
-# switch the table's active file
-# enable eviction on this btree
-# queue the old active file to be written to object storage in the background
Contributor:

See previous comment about this sequence.

Member Author:

@keitharnoldsmith I think I missed the previous comment you're referring to?

Contributor:

Hmm. I thought I talked about this elsewhere, but maybe I didn't save that comment...

A flush_tier checkpoint has a slightly different sequence of operations. In particular, eviction is only disabled when we are checkpointing the tree.

Checkpoint prepare phase:

  • Assemble list of all the dhandles that will be part of the checkpoint. For non-tiered tables this will be tables that have been modified since the last checkpoint. For tiered tables this will be tables modified since the last flush. See checkpoint_flush_tier called from checkpoint_prepare.
  • For each tiered table, we create the new file, and queue a work item to flush the old one. The work item includes checkpoint generation information that is used by the tiered server thread to know when it is OK to flush the file to object storage (i.e., after the current checkpoint has completed). See tiered_switch, which is called from checkpoint_flush_tier.
  • We do this work now so that it doesn't add time to the actual checkpoint when we have eviction disabled.

Data file checkpoint:

  • When we do the actual checkpoint for each table we block eviction to the table. So checkpoint is the only thread updating the table. This happens in wt_sync_file.
  • At the end of the checkpoint, after writing the new checkpoint root but before re-enabling eviction, we switch the active file to use the new file created during the prepare phase. This ensures that everything that is part of the checkpoint is in the old file, and anything evicted after the checkpoint is in the next file.
  • If the table is tiered, we fsync the old active file to ensure that all checkpoint updates are durable. (Side note: we don't have to do that here, we could do it after re-enabling eviction. But I'm just describing what the code does.)
  • See bm_checkpoint for the active file switch and fsync.
  • After switching the file we allow eviction again, as we finish wt_sync_file, and we checkpoint the next dhandle.

Hope that makes sense. If not, feel free to ping me.
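
To gather the sequence above in one place, a condensed sketch; every name here is a stand-in for the routines mentioned (checkpoint_flush_tier, tiered_switch, wt_sync_file, bm_checkpoint), on the assumption the description above is accurate.

```c
#include <stdint.h>

/* Stand-in declarations for the steps described above; not real WT symbols. */
void create_next_object_file(void);
void queue_flush_work(uint64_t ckpt_gen);
void block_eviction(void);
void write_checkpoint_root(void);
void switch_active_file(void);
void fsync_old_active_file(void);
void allow_eviction(void);

void
flush_tier_checkpoint_one_table(uint64_t ckpt_gen)
{
    /* Prepare phase: runs before eviction is blocked, so it adds no time
     * to the checkpoint itself (cf. checkpoint_flush_tier, tiered_switch). */
    create_next_object_file();  /* will become the active file below */
    queue_flush_work(ckpt_gen); /* flushed only after this checkpoint is done */

    /* Data file checkpoint (cf. wt_sync_file, bm_checkpoint). */
    block_eviction();           /* checkpoint is now the only writer */
    write_checkpoint_root();
    switch_active_file();       /* later evictions go to the new file */
    fsync_old_active_file();    /* checkpoint updates are durable */
    allow_eviction();
}
```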

@ddanderson (Member Author):

@sueloverso and @keitharnoldsmith, thank you for your extensive and very helpful comments. Sorry for the delay; this PR has suffered from skunk and back-burner syndromes. I think I've finally resolved all outstanding comments, so please have at it again!

@sueloverso (Member) left a comment:

I did all files except arch-tiered-storage.dox. In that one I only got through a little bit. I will complete that tomorrow but wanted to get you these other comments now.

- an address cookie for the extent list with available entries
- an address cookie for the extent list with discarded entries
- the file size for the checkpoint
- the checkpoint size
Member:

It isn't clear to the reader what the distinction is between these last two items. Please elaborate what they are and how they differ.

Member Author:

Fixed - I had to figure out what the difference was myself.


As described in @ref block_address_cookie, an address cookie used with a tiered storage
has an additional value (the \c object_id). As a result, the checkpoint cookie for a
tiered btree will include additional \c object_id values that are not in other
Member:

To make it very clear there is one per address cookie and not just one for the checkpoint cookie, I suggest "...for a tiered btree will include four additional \c object_id values..."

Member Author:

Fixed
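
For orientation, a conceptual view of the fields under discussion. Real address cookies are variable-length packed byte strings, so the struct below is purely illustrative, not the on-disk format.

```c
#include <stdint.h>

/*
 * Conceptual view only: real address cookies are variable-length packed byte
 * strings, not fixed structs. Tiered storage adds the object_id field, and a
 * checkpoint cookie carries one such cookie per extent list it references.
 */
struct addr_cookie_fields {
    uint64_t object_id; /* tiered only: which object holds the block */
    uint64_t offset;    /* byte offset of the block within that object */
    uint64_t size;      /* block size in bytes */
    uint64_t checksum;  /* block checksum */
};
```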

@subpage arch-tiered-storage

Tiered storage allows B-Trees to be stored into multiple places, more
recently updated blocks on local disk and less recently updated blocks in cloud storage.
Member:

This sentence feels like it is missing verbs. "...more recently updated blocks are on local disk... blocks are in cloud storage."

Member Author:

Fixed


@section example Example with timeline

The figure below provides a high level overview of flush_tier, showing the state of a single tiered table over time. The green arrows at the top of the figure indicate write requests. These come from eviction, except during checkpoint processing. At T0, we start with two objects in object storage, Object 1, and Object 2, and File 3 as the writable active file. At T1, we start a flush checkpoint. At T2, that checkpoint completes, and WiredTiger switches the writable active file from File 3 to File 4. At this point there should be no outstanding writes to File 3 because the checkpoint has completed and eviction to the table is disabled. All future writes will go to File 4. So we can queue the copy of File 3 to object storage. The copy completes at T3. At any later time (T4) WiredTiger can remove File 3 from the local file system.

Member:

This is a strangely very long line. Much longer than any other paragraph on a single line.

Member Author:

Fixed.

indicate on the WT_SESSION::create calls for that table.
- modify the application's checkpoint thread to include \c "flush_tier=(enable=true)" on calls to WT_SESSION::checkpoint.
\c flush_tier need not be specified on every checkpoint call; typically the cadence of \c flush_tier is much less than
the cadence of ordinary checkpoints.
Member:

Reading this, it implies (or requires) that using tiered storage also requires using an application checkpoint thread. I.e. I don't think we provide any mechanism to flush tier via the WT internal checkpoint server. Probably useful to state that for now even if we could (and should) add it later.

Member Author:

Added a couple of sentences to this effect.
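
For concreteness, a minimal sketch of such an application checkpoint thread, using the documented WT_SESSION::checkpoint API; the interval and cadence constants are arbitrary assumptions.

```c
#include <unistd.h>
#include <wiredtiger.h>

#define CHECKPOINT_INTERVAL_SECS 60 /* assumed checkpoint cadence */
#define FLUSH_TIER_EVERY 10         /* assumed: flush_tier on every 10th */

/* Sketch of a hypothetical application checkpoint thread. */
static void
checkpoint_loop(WT_SESSION *session)
{
    int i;

    for (i = 0;; ++i) {
        sleep(CHECKPOINT_INTERVAL_SECS);
        /* flush_tier need not be on every checkpoint; use a coarser cadence. */
        (void)session->checkpoint(session,
          i % FLUSH_TIER_EVERY == 0 ? "flush_tier=(enable=true)" : NULL);
    }
}
```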

<em>object file</em>, or in the cloud, as a <em>cloud object</em>.
The mechanism to create new objects is a WT_SESSION::checkpoint API call
with the \c flush_tier configuration set.
For brevity, is often called a \c flush_tier call.
Member:

Missing word: "For brevity, it is called.." or something.

Member Author:

Fixed

(e.g. <code>tiered_storage=(name="s3_store"),bucket="..."</code> ).
Once enabled on the connection, the configuration applies to all tables created.
The configuration can be overridden on individual tables can be changed with
configuration options for WT_SESSION::create .
Member:

This sentence is awkward. I think you should remove "can be changed" so it reads "...can be overridden on individual tables with configuration..."

Member Author:

fixed

Once enabled on the connection, the configuration applies to all tables created.
The configuration can be overridden on individual tables can be changed with
configuration options for WT_SESSION::create .
This allows the caller to specify a different storage provider or bucket name or
Member:

"This "what" allows the caller..."? I think it should be "The related WT_SESSION::create configuration options allow...". That should be written out clearly. I am waffling about whether the create config options should be specified here. I don't think it is necessary, but it did cross my mind.

Member Author:

Fixed. As to the create config options: this is the arch guide, so I don't think we need to go over detailed stuff already in the reference manual. The stuff that is duplicated here is purposeful, just to give a little context for the following discussion.
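
A minimal sketch of the connection-level configuration and a per-table override via WT_SESSION::create; the provider, bucket, and table names are illustrative, and tiered_storage=(name=none) is shown as the assumed way to opt a table out.

```c
#include <wiredtiger.h>

/*
 * Sketch: tiered storage configured on the connection, then overridden on an
 * individual table. Provider, bucket, and table names are assumptions.
 */
static int
create_tables(const char *home)
{
    WT_CONNECTION *conn;
    WT_SESSION *session;
    int ret;

    if ((ret = wiredtiger_open(home, NULL,
          "create,tiered_storage=(name=\"s3_store\",bucket=\"my-bucket\")",
          &conn)) != 0)
        return (ret);
    if ((ret = conn->open_session(conn, NULL, NULL, &session)) != 0)
        return (ret);

    /* Inherits the connection's tiered storage configuration. */
    if ((ret = session->create(
           session, "table:cloud_table", "key_format=S,value_format=S")) != 0)
        return (ret);
    /* Per-table override: assumed syntax to opt this table out. */
    if ((ret = session->create(session, "table:local_table",
           "key_format=S,value_format=S,tiered_storage=(name=none)")) != 0)
        return (ret);
    return (conn->close(conn, NULL));
}
```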

object is created, all writes to the previous (\a N) object are
completed and that \a N object becomes read-only, like all objects
before it. At this point, the \a N object is queued to be copied to
cloud storage. After that copy takes place, the local copy of \a N is
Member:

I suggest replacing "takes place" with "successfully completes".

Member Author:

Fixed.

@sueloverso (Member) left a comment:

I have lots of suggestions. I only need to rereview if there are large rewrites.


A normal, non-tiered table, although sometimes thought of as a
single btree, can also be thought of as an active or \a live btree, as well
as zero or more checkpoints, that are fully represented in the single
Member:

I suggest replacing "as well as" with "with" and removing the comma after "checkpoints".

Member Author:

fixed

as zero or more checkpoints, that are fully represented in the single
disk object file. Each checkpoint has its own root page, and so can be
considered its own btree. Each of these btrees is a set of
pages referencing each other as a tree, some in memory, some on disk.
Member:

I found this 3-sentence introduction kind of confusing. It took me 3 or 4 rereads to understand what you're trying to say here. It is correct but difficult. Maybe remove the first comma-delimited clause. What do you think of this?
A normal, non-tiered table can be thought of as a live btree with zero or more checkpoints that are fully represented in a single disk object file. Each checkpoint has its own root page and can be considered its own complete btree. Each of these btrees...

Member Author:

That's great. I like your re-wording.

considered its own btree. Each of these btrees is a set of
pages referencing each other as a tree, some in memory, some on disk.
A tiered table is the same,
having an \a live btree and a
Member:

"an" should be "a".

Member Author:

fixed

pages referencing each other as a tree, some in memory, some on disk.
A tiered table is the same,
having an \a live btree and a
set of checkpoint btrees. However, both the active btree and the
Member:

I suggest keeping "live" instead of "active" here.

Member Author:

fixed

A tiered table is the same,
having an \a live btree and a
set of checkpoint btrees. However, both the active btree and the
checkpoints may span multiple object files and/or cloud objects.
Member:

Suggestion: ...may have pages that span multiple...

Member Author:

fixed


@subsection ts_block_local_file_removal Local File Removal

When the tiered server determines that a local object file should be removed,
Member:

I think the comma can be removed.

Member Author:

fixed

to remove a local object file has several parts.

When a cloud copy of an object is completed, the block manager
is told, via \c BM::switch_object_end , what object id (and below)
Member:

I found "(and below)" confusing, thinking it was referring to text below in the document. How about something like:
...is told, via ..., that object id N (and therefore all ids < N) can be removed.

Member Author:

fixed
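
As a simple illustration of the "id N and all ids below" rule, a hypothetical sweep; the file-naming scheme is a stand-in, and the real code must also wait for open references to a file to drain before removing it.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: once the block manager learns (via BM::switch_object_end)
 * that object id N is safely in cloud storage, local object files with
 * id <= N become removal candidates. The naming scheme is a hypothetical
 * stand-in, and real removal also waits for open references to drain.
 */
static void
sweep_local_objects(const char *table_name, uint32_t flushed_id)
{
    char path[256];
    uint32_t id;

    for (id = 1; id <= flushed_id; ++id) {
        (void)snprintf(
          path, sizeof(path), "%s-%010" PRIu32 ".wtobj", table_name, id);
        (void)remove(path); /* ignore files removed by an earlier sweep */
    }
}
```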

@subsection ts_metadata_non Non-tiered Tables

When non-tiered table \c A is created (without named column groups),
there are two entries in the metadata file, having these keys:
Member:

"two" should be "three".

Member Author:

fixed

------------- | -------------- | ------------- | --------------------------------------
\c table: | WT_TABLE | yes | the dhandle is cast to (WT_TABLE *)
\c file: | WT_BTREE | yes | dhandle->handle is a (WT_BTREE *)
\c colgroup: | | no | stored in an array in the WT_TABLE
Member:

This section is great and well done. It feels like it should go into the arch-metadata.dox file because someone wanting to know about how this works won't think to look in the tiered storage document. I have no idea how much of this is covered in the metadata arch file.

Member Author:

I think this is a good point. I moved the first section to the end of arch-metadata.dox as an example, and refer to it from arch-tiered-storage.dox. It's not a perfect fit as it talks a little about in-memory data structures, but I think it's okay for the arch guide.
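
A quick way to see these metadata entries is WiredTiger's documented "metadata:" cursor; a minimal sketch, with error handling omitted for brevity:

```c
#include <stdio.h>
#include <wiredtiger.h>

/*
 * Sketch: dump all metadata entries via the "metadata:" cursor to inspect
 * the table:/colgroup:/file: keys discussed above (plus the extra
 * tiered-related keys for tiered tables). Error handling omitted.
 */
static void
dump_metadata(WT_SESSION *session)
{
    WT_CURSOR *cursor;
    const char *key, *value;

    (void)session->open_cursor(session, "metadata:", NULL, NULL, &cursor);
    while (cursor->next(cursor) == 0) {
        (void)cursor->get_key(cursor, &key);
        (void)cursor->get_value(cursor, &value);
        printf("%s -> %s\n", key, value);
    }
    (void)cursor->close(cursor);
}
```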

After all pages in an object are no longer needed, an object can be
removed. The trick is in knowing when this can happen.

It is expected that future solutions to work either synchronously or
Member:

I suggest: "A future garbage collection solution could work well either synchronously or asynchonously."

Member Author:

fixed.

@ddanderson (Member Author):

@sueloverso, thanks so much for your detailed reviews! I know this can be tedious, but every comment was appreciated and made the doc better.

The one thing I didn't change was the B+tree reference. There's a ticket, WT-12956, to resolve all the btree spellings -- and this one may fall into the category of describing the specific implementation of btree that we use. I'll let WT-12956 decide in a uniform manner.

to zero, the handle is removed and freed, and the underlying file
handle is closed.

When a \c flush_tier is done, each table that had changes written as
Contributor:

done --> performed (executed?)

"done" is ambiguous; it could also mean "when a flush_tier is complete".

(associated with a non-tiered table) behave almost identically in
the WiredTiger system. In fact, the \c WT_BTREE_PREFIX macro checks
to see if a URI matches either one of these prefix strings. The macro basically
means "does this thing walk and talk like a btree?". In both cases,
Contributor:

Remove period after question mark.
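
The gist of that macro, sketched in plain C; "file:" and "tiered:" are the btree-backed URI prefixes implied by the surrounding text, and the real macro builds on WiredTiger's internal prefix-matching helpers rather than strncmp.

```c
#include <string.h>

/* Illustrative version of the "walks and talks like a btree" check. */
#define URI_HAS_PREFIX(uri, pfx) (strncmp(uri, pfx, strlen(pfx)) == 0)
#define URI_IS_BTREE(uri) \
    (URI_HAS_PREFIX(uri, "file:") || URI_HAS_PREFIX(uri, "tiered:"))
```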


@keitharnoldsmith (Contributor) left a comment:

LGTM. I've added a couple minor comments.

I also have a bigger comment in response to a thread from my previous review.

Thanks again for all the work on this, Don!

@ddanderson (Member Author):

@keitharnoldsmith, thanks for your detailed comments on checkpoint - I lifted that into the arch guide, as it seemed appropriate. I made the ref guide more terse (as we don't really describe much about regular checkpoints in the ref guide).

Could you please do a quick review of parts I just updated:

  • the checkpoint sections of the ref guide ("The flush_tier operation" first three paragraphs)
  • the arch guide ("Flush Checkpoint Operations")

Here are pdfs of the two pages, in case that's helpful:
ref-tiered-storage-20240604.pdf
arch-tiered-storage-20240604.pdf

No rush, other than I wanted to get this put to bed.

@keitharnoldsmith (Contributor) left a comment:

Thanks, Don! LGTM

@ddanderson added this pull request to the merge queue Jun 6, 2024
Merged via the queue into develop with commit 20daa87 Jun 6, 2024
7 checks passed
@ddanderson deleted the wt-9529-tiered-doc branch June 6, 2024 19:11