transactional fast truncate

The Plan

The session.truncate method will atomically change a parent page's on-disk state to "deleted" if the page isn't already in memory; if there are problems, or this won't work for any reason (for example, there are other reading threads), we'll fall back to the slow method of deleting rows, that is, we'll instantiate the page in memory, and delete rows individually.

WT_REF

Add a new page state and a new transaction ID to the WT_REF structure:

struct __wt_ref {
        WT_PAGE *page;
        void *addr;
        union {
                uint64_t recno;
                void *key;
        } u;
        wt_txnid_t txnid;

#define WT_REF_DELETED 0x00    /* Deleted on-disk page */
        volatile WT_PAGE_STATE state;
};

The new page state acts mostly like WT_REF_DISK, but means the page is "deleted on disk".

The transaction ID is the ID of the transaction that deleted the page, used to determine if the change is visible. I originally added the additional transaction ID to the WT_REF union. That doesn't work because the union's recno/key fields are set for all internal pages (we still need a search key) even when the page isn't in memory. The good news is this doesn't increase the size of the WT_REF on 64-bit machines, it stays at 32B because of padding.

WT_TXN

Change the WT_TXN structure to include an additional list of modifications, the modifications the transaction has made to WT_REF structures:

struct __wt_txn {
        ...
        WT_REF **modref;
        size_t modref_alloc;
        u_int modref_count;
};

Add a new function:

__wt_txn_modify_ref(WT_SESSION_IMPL *session, WT_REF *ref);

that behaves like __wt_txn_modify but stores a pointer to a WT_REF structure instead of a pointer to a wt_txnid_t;

Truncate

The session.truncate operation takes the following steps to delete an on-disk leaf page:

if (WT_ATOMIC_CAS(ref.state, WT_REF_DISK, WT_REF_READING)) {
        ret = __wt_txn_modify_ref(ref);
        ret = __wt_page_modify_init(session, page);
        __wt_page_modify_set(page);
        WT_PUBLISH(ref.state, WT_REF_DELETED);
}

We have to "modify" the page while we have it "locked", we're not in a serialized routine.

Michael: We probably want to rename WT_REF_READING to be WT_REF_TRANSITION or something else, we're not using it exclusively for reading pages any more.

I've reviewed the WT_REF_LOCKED state and the only problem I see is the code in __evict_page_request_walk() that blindly sets WT_REF_LOCKED regardless of the previous state. If we changed that code to only set WT_REF_LOCKED if the previous state was WT_REF_MEM, I think we can think of WT_REF_LOCKED as a general "lock this WT_REF structure state", but I want you to review and agree. It also seems to me we would replace WT_REF_READING with WT_REF_LOCKED?

Do you mean this code in __evict_page_request_walk?

		WT_ASSERT(session, ref->state == WT_REF_EVICT_FORCE);
		ref->state = WT_REF_LOCKED;

It doesn't look blind to me -- we're always switching from WT_REF_EVICT_FORCE to WT_REF_LOCKED.

I think it should be fine to use WT_REF_LOCKED everywhere, including the current case of WT_REF_READING.

Truncate rollback:

Change __wt_txn_rollback to additionally walk the list of WT_REF's the operation has modified, and revert the pages to WT_REF_DISK if they have not been instantiated by a read, or clean up the pages if they have been instantiated by a read.

        /*
         * If the page is still marked deleted, it's as we left it, reset the
         * state to on-disk and we're done.
         */
        if (WT_ATOMIC_CAS(ref->state, WT_REF_DELETED, WT_REF_READING)) {
                WT_PUBLISH(ref->state, WT_REF_DISK);
                return;
        }

        /*
         * The page is either instantiated or being instantiated -- wait for
         * the page to settle down, as needed, and then clean up the update
         * structures.
         */
        while (ref->state != WT_REF_MEM)
                __wt_yield();
        page = ref->page;
        WT_ROW_FOREACH(page, rip, i)
                for (upd =
                    WT_ROW_UPDATE(page, rip); upd != NULL; upd = upd->next)
                        if (upd->txnid == ref->txnid)
                                upd->txnid = WT_TXN_ABORTED;
    }

Note: in the first case, there's no "aborted" flag, the change just magically disappears.

Note: in the second case, there's a lot more error-checking we could do, and there's an infinite loop if we have the page state checks wrong. That said, the page had better be transitioning to in-memory (it can't be evicted because there are uncommitted changes on it), or there are other problems. It's possible there are other changes on behalf of this same transaction in this same page, other than the delete (imagine a transaction that first truncates a range (changing the page's state to WT_REF_DELETED and tracking the WT_REF structure in the WT_TXN structure), then inserts/deletes rows in the range (forcing the instantiation of the page and potentially chains of WT_UPDATEs for the transaction), then aborts (reviewing all of the WT_UPDATE structures and aborting them). This is all OK: those WT_UPDATE structures should already be marked aborted, but there's no reason to check before we mark them aborted again.

Truncate commit

Truncate commit requires no changes.

Truncation overlap:

It would be reasonable to optimize multiple, overlapping truncate calls. We could test, and the page is already deleted and the delete is visible to us (the previous delete has been committed), we could skip the page instead of instantiating it and figuring out there are no rows in the page. We don't do this in the current code, in the current code, overlapping truncate calls will fail the fast-delete test, that is, one of the calls will encounter a page with WT_REF_DELETED set rather than WT_REF_DISK set, and that will force instantiation of the page and a fall-back to the slower solution. While that's a huge amount of work to no purpose, it's unclear to me that optimizing for overlapping range deletes where the first of the range deletes has already committed, is worth the effort.

Read operations

If a reading thread encounters a WT_REF structure in a WT_REF_DELETED state, it has to read the page and make the page look like each row was individually deleted. This happens if there's a random read (searching the tree), or a cursor iteration where the "delete" operation isn't committed and is not yet visible to the cursor, or even an overlapping delete.

Here's what the reader thread does:

        /*
         * Attempt to set the state to WT_REF_READING; if successful, we've
         * won the race, read the page.
         */
        if (WT_ATOMIC_CAS(ref->state, WT_REF_DISK, WT_REF_READING))
                previous_state = WT_REF_DISK;
        else if (WT_ATOMIC_CAS(ref->state, WT_REF_DELETED, WT_REF_READING))
                previous_state = WT_REF_DELETED;

        Read the page from disk
        Instantiate the page in memory

        /*
         * If the page was deleted, make the page look like each row was
         * individually deleted.
         */
        if (previous_state == WT_REF_DELETED) {
                Mark the page dirty.

                /* Record the transaction ID for the first update to the page. */
                page->modify->first_id = ref->txnid;

                for each entry on the page create a WT_UPDATE structure
                        the WT_UPDATE structure is marked "deleted"
                        copy the saved WT_REF.txnid to the WT_UPDATE structure
        }

        WT_PUBLISH(ref.state, WT_REF_MEM);

Michael, I'm not confident about the setting of the transaction ID for the first update to the page; is that correct and/or necessary?

Yes, that is correct -- it should be the earliest txnid of a WT_UPDATE on the page.

Cursor reads

Cursor iterators have the information available to avoid reading a page if the deleted records are visible to them: if walking a page and we find a WT_REF_DELETED slot, test if the deleted records are visible to us, in which case we skip the slot instead of instantiating it, at the cost of an atomic swap:

        /*
         * Do a simple test first, avoid the atomic operation unless it's
         * demonstrably necessary.
         */
        if (ref->state != WT_REF_DELETED)
                return (0);

        /*
         * It's possible the state is changing underneath us, we could race
         * between checking for a deleted state and looking at the stored
         * transaction ID to see if the delete is visible to us.  Lock down
         * the structure.
         */
        if (!WT_ATOMIC_CAS(ref->state, WT_REF_READING, WT_REF_DELETED))
                return (0);

        skip-the-page = __wt_txn_visible(session, ref->txnid) ? 1 : 0;

        WT_PUBLISH(ref->state, WT_REF_DELETED);

Reconciliation

If reconciliation encounters a WT_REF structure in a WT_REF_DELETED state, use the WT_REF.txnid value to determine if the change is visible. If the change is visible, treat the page as empty, if the change is not visible, treat the page as on-disk.

There is a general "is a child page modified" function reconciliation uses I think it needs to look something like this:

static inline int
__page_modified(
    WT_SESSION_IMPL *session, WT_REF *ref, WT_PAGE **rpp, int *skipp)
{
        WT_RECONCILE *r;

        r = session->reconcile;
        *rpp = ref->page;
        
        for (;; __wt_yield())
                switch (ref->state) {
                case WT_REF_DISK:
                case WT_REF_READING:
                        /* On disk or being read, not modified by definition. */
                        return (0);
                case WT_REF_DELETED:
                        /*
                         * Lock down the WT_REF and decide if any deleted page
                         * is visible to this reconciliation.
                         */
                        if (!WT_ATOMIC_CAS(
                            ref->state, WT_REF_DELETED, WT_REF_READING))
                                break;
                        if (__wt_txn_visible(session, ref->txnid))  
                                *skipp = 0;
                        else {
                                r->upd_skipped = 
                                *skipp = 1;
                        }
                        WT_PUBLISH(ref->state, WT_REF_DELETED);
                        return (1);
                case WT_REF_EVICT_FORCE:
                case WT_REF_EVICT_WALK:
                case WT_REF_MEM:
                        /* In-memory states, check page's modify structure. */
                        return (ref->page->modify == NULL ? 0 : 1);
                case WT_REF_LOCKED:
                        /* In transition, wait for page's state to settle. */
                default:
                        break;
                }
}

I've expanded on this code to include the spin until a page is resolved. The problem here is if we've done a truncate, then subsequently a reading thread is forcing us to instantiate the page, and then a checkpoint happens, the checkpoint potentially waits on the page to be read. I think I'm OK with that because the checkpoint is waiting on lots of write I/O, what's a bit more read I/O going to matter?

That seems okay, but shouldn't the read of ref->page move inside the loop?

Notes on concurrent operations

Phase	cursor access	checkpoint	LRU eviction	truncate
truncate in progress	if truncate not visible, instantiate page creating WT_UPDATEs	if truncate visible treat as empty, else previous on-disk version	ignore	if truncate not visible, instantiate page creating WT_UPDATEs
truncate rolling back	if page not in-memory: reset to disk if page in-memory: reset WT_UPDATEs	no special case	ignore	no special case
truncate committed	create empty page	treat as empty	evict	no special case

The "create an empty page" section is an unsolved problem; the current code frees the underlying deleted leaf page blocks as soon as the delete operation becomes "visible". This means a subsequent checkpoint can update the file such that those blocks will be immediately reused. This isn't correct, because a snapshot cursor looking for a previous version of the page might still be in the system and so would attempt to read blocks that have been re-purposed. Right now, the only solution I can think of is to fallback to a "slow" delete if there's a snapshot cursor in the system that might potentially want a previous version of the page.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly