Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A compound entry slice type #69

Open
pav-kv opened this issue Jun 2, 2023 · 2 comments
Open

A compound entry slice type #69

pav-kv opened this issue Jun 2, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@pav-kv
Copy link
Contributor

pav-kv commented Jun 2, 2023

raft keeps entries in memory as a single slice:

entries []pb.Entry

In the happy path (when the leader is stable), it just keeps appending to this slice:

u.entries = append(u.entries, ents...)

In the unhappy path (when leader changes), it rewrites a suffix of the log by copying a prefix and appending the suffix:

raft/log_unstable.go

Lines 212 to 213 in 3e6cb62

keep := u.slice(u.offset, fromIndex) // NB: appending to this slice is safe,
u.entries = append(keep, ents...) // and will reallocate/copy it

As a result, we again have a contiguous slice of entries. We do this copying because we need entries[i] to be immutable (because it may have been sent up to Storage through the Ready struct, and is still in use / in flight).

Same reallocation/copying happens in MemoryStorage:

raft/storage.go

Lines 300 to 302 in 3e6cb62

// NB: full slice expression protects ms.ents at index >= offset from
// rewrites, as they may still be referenced from outside MemoryStorage.
ms.ents = append(ms.ents[:offset:offset], entries...)

Another place where we have to reallocate/copy the slice of entries is the slice method where we combine the entries returned from Storage with the entries returned from the unstable structure:

raft/log.go

Lines 494 to 497 in 3e6cb62

combined := make([]pb.Entry, len(ents)+len(unstable))
n := copy(combined, ents)
copy(combined[n:], unstable)
ents = combined

In all these cases, it is not strictly necessary to reallocate and copy []pb.Entry slices. Instead, it would be acceptable to operate on [][]pb.Entry, or a struct that has [][]pb.Entry inside, but provides an interface of a contiguous slice.

Most usages of []pb.Entry need to be able to iterate it, so it's fine if it's multiple slices instead of one. Some usages may need random access by entry.Index, so it's fine if there are a few slices, and it's good if access by Index is safely encapsulated.


Prototype:

// Entries represents a span of the log. Functionally it is
// equivalent to []pb.Entry.
//
// Internally, it consists of a few []pb.Entry slices, ordered
// by its 0-th entry Index. By properties of the Raft log, it is
// also ordered by the 0-th entry Term. Example:
//
//	Entries:  (-----------------------]
//	          .   .   .       .       .
//	t9: │     .   .   .       (-------]
//	t7: │     .   .   (---]-------]
//	t4: │     .   (---]
//	t3: │     (-----------------]
//	    └───────────────────────────────────► index
//	         10  20  30  40  50  60  70
//
// Most of the time, there will be one (when leader is stable) or
// two (when leader changes) slices in it. The struct is optimized
// for this case, and incurs zero allocations. Occasionally it may
// accumulate more slices (e.g. if the leader is unstable).
type Entries struct {
	buf [3][]pb.Entry // used to back `ent` if len(ent) <= 3
	ent [][]pb.Entry  // Invariant: len(ent[i]) != 0
}

// Bounds returns the [begin, end) range of log indices that this
// slice represents.
func (e Entries) Bounds() (uint64, uint64) {
	if len(e.ent) == 0 {
		return 0, 0
	}
	last := e.ent[len(e.ent)-1]
	return e.ent[0][0].Index, last[len(last)-1].Index+1
}

// Len returns the length of the log span.
func (e Entries) Len() int {
	begin, end := e.Bounds()
	return end - begin
}

// Iterate visits all entries in this log span.
func (e Entries) Iterate(visit func([]pb.Entry) bool) {
	l := len(e.ent)
	if l == 0 {
		return
	}
	for i := 0; i + 1 < l; i++ {
		// Compute the visible part of this slice, which is not
		// overlapped by the next slice.
		visible := e.ent[i+1].Index - e.ent[i].Index
		if !visit(e.ent[i][:visible:visible]) { // protect from appends
			return
		}
	}
	visit(e.ent[l-1])
}

// To converts Entries to the equivalent []pb.Entry.
// Just in case it's needed. For instance:
//
//	entries := e.To(make([]pb.Entry, 0, e.Len()))
func (e Entries) To(to []pb.Entry) []pb.Entry {
	e.Iterate(func(part []pb.Entry) bool {
		to = append(to, part...)
		return true
	})
	return to
}

// Need to also have mutators to append to and truncate this slice.

// Could also add a method to access a log entry by Index. This would
// be a binary search by e.ent[i][0].Index (or a linear search if
// len(e.ent) is low, which is normally true), followed by a direct
// access by index within the found sub-slice.

// Could also add a method to access a [begin, end) sub-slice of this
// slice. It would boil down to finding the begin and end-1 index, and
// then either returning a sub-slice with zero copy if it's in a single
// sub-slice, or returning a copy if it is compound.

// Squash can be used to "compact" the log range to a single slice.
// For example, it could be used in MemoryStorage occasionally, once
// len(e.ent) grows above a threshold and increases the cost of random
// access.
func (e Entries) Squash() Entries {
	if len(e.ent) <= 1 {
		return e
	}
	var res Entries
	res.buf[0] = e.To(make([]pb.Entry, 0, e.Len()))
	res.ent = res.buf[:1]
	return res
}

Why something like this struct is useful:

  • Avoids some unnecessary allocations. In the unstable struct, the overlapped entries on leader changes are short-lived, and will be released to the heap soon anyway. The overlapped entries are also likely still referenced because they were returned through Ready for being stored. So copying the slice (as of now) actually increases memory consumption.
  • It's possible to make this struct completely reallocation-free if we always start a new slice when the old slice is filled up.
  • It might be easier to track used memory this way, there would be no hidden refs to copies of slices.
  • Provides protected / type-safe access to a slice of log entries.
  • Can come with extra guarantees that the slices of log entries are verified, e.g. that Index is contiguous, and Term is non-decreasing.
@pav-kv pav-kv added the enhancement New feature or request label Jun 2, 2023
@pav-kv
Copy link
Contributor Author

pav-kv commented Jun 2, 2023

Alternatively, instead of [][]pb.Entry could just do all in one []pb.Entry, with a little “index” on a side. The "index" would be equivalent to the AppendTracker that would be needed for async storage appends.

Say, we have append [1 2 3 4 5], then an override [4 5 6 7], then append again [8, 9]. The “logical” evolution of the log is:

1 2 3 4  5
1 2 3 4' 5' 6 7
1 2 3 4' 5' 6 7 8 9

But physically it could be represented as:

1 2 3 4 5 4' 5' 6 7 8 9

And an “index”:

log index 1-3 is at array positions 0-2
log index 4-9 is at array positions 5-10

The unstable is quickly rotated away, so the backing array will be shrunken, and “index” will be usually 1-2 entries.

@Aditya-Sood
Copy link

hi @pavelkalinnikov, can I pick this up?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants