Feature: new API RaftStorage::append_to_log_cb() #735
@schreter:
Seems good, except that the default implementation should call the blocking `append_to_log` and then invoke the callback. The callback could look something like:

```rust
#[derive(Clone)]
struct Callback { /* ... */ }

impl Callback {
    pub fn log_flushed_up_to(id: LogId);
}
```

This would indicate that everything up to and including the given log id is flushed.

For code robustness, one could send an error to the engine if the drop of the last callback finds out that log I/O is still in progress (likely a panic somewhere in the log handling code => Raft must be shut down).

BTW, ideally it would be nice to get and clone the callback only once and/or guarantee that the object behind the callback is always the same, so it can be cloned only once and stored somewhere. But I don't know how we should express this in the API, so we can address that later (also based on measurements; probably it's not that relevant).
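The drop-based safety net could be sketched as follows; `LogId`, `Inner`, and the channel-based error path here are all hypothetical stand-ins for illustration, not the openraft API:

```rust
use std::sync::mpsc::Sender;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the real openraft log id type.
#[derive(Clone, Copy, Debug)]
struct LogId {
    term: u64,
    index: u64,
}

// Shared state behind all clones of the callback.
struct Inner {
    tx: Sender<Result<LogId, String>>,
    completed: Mutex<bool>,
}

/// A clonable flush callback: dropping the last clone without a
/// completion report sends an error to the engine, so a panic in the
/// log handling code cannot leave Raft waiting forever.
#[derive(Clone)]
struct Callback {
    inner: Arc<Inner>,
}

impl Callback {
    fn log_flushed_up_to(&self, id: LogId) {
        *self.inner.completed.lock().unwrap() = true;
        let _ = self.inner.tx.send(Ok(id));
    }
}

impl Drop for Inner {
    // Runs only when the last `Callback` clone is dropped.
    fn drop(&mut self) {
        if !*self.completed.lock().unwrap() {
            let _ = self.tx.send(Err("log I/O still in progress".to_string()));
        }
    }
}
```

Because the completion flag lives behind the `Arc`, any number of clones may be handed around; only the final drop decides whether an abort error must be reported.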
Below is the upgraded code:

```rust
/// Non-blocking version of `append_to_log`.
///
/// To ensure correctness:
///
/// - When this method returns, the entries are guaranteed to be readable, i.e., a `LogReader`
///   can read these entries.
///
/// - There must not be a **hole** in the logs, because Raft only examines the last log id
///   to ensure correctness.
///
/// It does **NOT** have to guarantee that the entries are persisted on disk before
/// returning. Instead, an implementation should call the `callback` when the entries are
/// persisted on disk. The `callback` can be called either before or after this method returns.
async fn append_to_log_cb(
    &mut self,
    entries: &[C::Entry],
    callback: LogFlushed<C::NodeId>,
) -> Result<(), StorageError<C::NodeId>> {
    let last = entries.last().map(|e| *e.get_log_id());

    // Default implementation: call the flush-before-return `append_to_log`,
    // then immediately report completion.
    self.append_to_log(entries).await?;
    callback.log_io_completed(Ok(last));
    Ok(())
}
```
I'm afraid it's not easy to do group commit at the storage API level, and not easy to let `RaftStorage` save a cloned callback for future use.
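For the simpler case, where no callback is stored for future use, a storage implementation might make entries readable synchronously and confirm durability from a background task. A minimal sketch, assuming a hypothetical in-memory `Log` type and a plain thread in place of real async WAL IO:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-ins for the real openraft types.
#[derive(Clone, Copy, Debug, PartialEq)]
struct LogId { term: u64, index: u64 }

#[derive(Clone)]
struct Entry { log_id: LogId }

/// Hypothetical in-memory log: entries become readable the moment they
/// are appended; durability is confirmed later via the callback.
struct Log {
    readable: Arc<Mutex<Vec<Entry>>>,
}

impl Log {
    /// Non-blocking append: when this returns, the entries are readable,
    /// but the flush callback fires from a background thread, possibly
    /// after this method has already returned.
    fn append_cb(&self, entries: Vec<Entry>, callback: impl FnOnce(Option<LogId>) + Send + 'static) {
        let last = entries.last().map(|e| e.log_id);
        // Step 1: make the entries readable synchronously.
        self.readable.lock().unwrap().extend(entries);
        // Step 2: persist and confirm asynchronously. A real store
        // would fsync its WAL here before invoking the callback.
        thread::spawn(move || {
            callback(last);
        });
    }
}
```

This is the split the proposal is after: network IO can proceed against the readable entries while disk IO completes in the background.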
Hm, maybe not for future use, but at least drop callbacks "in-between" and then do a single confirmation at the end.
Log index probably not, but log ID containing the term does. The truncation will necessarily require a new term (or term + leader). Therefore, it should be sufficient. If you get a confirmation for the old term, then confirm only up to the truncation location. A new append will be sent for a new term to append logs from the truncated location, ideally after the truncation (I think it's fine to await the truncation, since that should be an exceptional case).
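The "confirm only up to the truncation location" rule could be expressed as a small helper; `effective_flush`, its parameters, and the `LogId` type are illustrative assumptions, not openraft code:

```rust
// Hypothetical log id: term plus index.
#[derive(Clone, Copy, Debug)]
struct LogId { term: u64, index: u64 }

/// How far a flush confirmation may advance the durable point.
/// `confirmed` is the log id reported by the storage callback;
/// `truncated_at` is the index where the log was cut when the
/// term changed.
fn effective_flush(confirmed: LogId, current_term: u64, truncated_at: u64) -> u64 {
    if confirmed.term == current_term {
        // Confirmation from the current term: trust it fully.
        confirmed.index
    } else {
        // Confirmation from an old term: confirm only up to the
        // truncation location.
        confirmed.index.min(truncated_at)
    }
}
```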
Let's see what the implementation will look like. I don't see how a oneshot channel will help you, since it needs to be explicitly awaited somewhere. Today, we have a command. The callback would be effectively just a thin wrapper over it. And, since there is a single one.
In my opinion, having a sender inside a `RaftStorage` implementation is an alternative possible approach. But it still relies on the assumption that IO progress is a linear value:

Given 3 nodes, it is possible that the 3 nodes have their local stores as follows:

If N1 and N3 are alternately elected as leader, they will replicate their local log at index 2 to N2.

In this case, the IO operations that are submitted on N2 will be:

Consequently, a particular log id does not uniquely identify an IO operation.
You are right. Basically it is the big picture.
Uhm, the situation you are describing is IMHO impossible. How can N1 and N3 have different terms for log index 2 while N2 does not have this log entry persisted yet? That directly contradicts the quorum! N2 must already have either 2-2 or 3-2 persisted.

Say N2 has 3-2. If N1+N2 restart, then N2 "wins" with the higher term and replicates 3-2 to N1, overwriting its 2-2. If N2+N3 restart, then either N2 or N3 wins the election, but both already have 3-2.

If I understand it correctly, the log ID at a certain index can only go "up", but never "down", i.e., the term can only increase, never decrease (in the optimized version with term + leader ID, this still holds with the total ordering over term + leader ID). Or am I missing something?
Such an event sequence will build the situation mentioned above:
OK, but then it's only a question of orchestrating the truncation. The simplest solution on the Raft side which comes to my mind is a local transient truncation counter, which is passed around in the callback. A callback called with an old truncation counter is simply ignored.

In this case, the log implementation has to orchestrate writes to the log properly. Say there is a log write for some log ongoing before truncation, then the truncation is done, and a new write with the same log indexes is triggered. This needs to be handled properly on the log side, either by delaying the new log write until the old write is done, or by doing shadow paging, or something else; but that is not your concern.

An alternative would be to block the truncation until the log write up to the last log ID passed to the log is confirmed via the callback, and only then proceed with the truncation. This way, there is no need for a truncation counter and no need to implement anything special on the log side, but the code might be somewhat more complex to properly delay the truncation until the log write finishes.

In any case, this is orthogonal to the technology used for the callback, since the truncation still needs to be handled properly somehow.
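The transient truncation counter might look roughly like this (an illustrative sketch; the names and types are assumptions, not openraft code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Transient counter bumped on every log truncation. Writes are tagged
/// with the counter value current at submission time; a flush
/// confirmation carrying a stale tag is simply ignored.
struct TruncationEpoch {
    current: AtomicU64,
}

impl TruncationEpoch {
    fn new() -> Self {
        Self { current: AtomicU64::new(0) }
    }

    /// Called on truncation: invalidates all in-flight write tags.
    fn truncate(&self) {
        self.current.fetch_add(1, Ordering::SeqCst);
    }

    /// Tag a log write that is about to be submitted.
    fn tag(&self) -> u64 {
        self.current.load(Ordering::SeqCst)
    }

    /// Accept a flush confirmation only if its tag is still current.
    fn accept(&self, tag: u64) -> bool {
        tag == self.current.load(Ordering::SeqCst)
    }
}
```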
Yes. If it is a multi-shot callback handle, it should not send a log id to report IO progress. Instead, the IO progress should be measured by the WAL of the local store. An alternative solution is to assign a unique id to every IO operation.
So, either the `RaftStorage` implementation should provide an IO operation counter or, at the very least, a truncation counter to send back via the callback.
Updated:

Original:

To parallelize the local store IO and network IO, appending log entries should provide the capability to return at once after the entries are readable, rather than waiting for them to be persisted.

This task can be done with a new `RaftStorage` method `append_to_log_cb()`.

Considerations

It is not proper to use a separate `flush_log(upto_index)` method for flushing the log: there could be additional IO operations waiting to be processed before `flush_log()` is called after an appending-log operation, such as append-log `1,2,3`, then truncate-log `[2,+oo)`, then append-log `2,3,4`. In such a case, `flush_log(upto=3)` may lead to ambiguity, and Openraft could operate on an incorrect state.

Originally posted by @schreter in #702 (comment)
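The ambiguity in the Considerations example can be made concrete by modeling the IO queue; `IoOp` and `matching_appends` are hypothetical illustrations, not openraft types:

```rust
/// An entry in the pending IO queue (hypothetical, for illustration).
#[derive(Debug)]
enum IoOp {
    /// Append entries covering this inclusive index range.
    Append { from: u64, to: u64 },
    /// Truncate the log from this index onward.
    Truncate { from: u64 },
}

/// Count how many pending append operations a `flush_log(upto)` call
/// could refer to. More than one means the bare index is ambiguous.
fn matching_appends(ops: &[IoOp], upto: u64) -> usize {
    ops.iter()
        .filter(|op| match op {
            IoOp::Append { from, to } => *from <= upto && upto <= *to,
            IoOp::Truncate { .. } => false,
        })
        .count()
}
```

With the queue append `1,2,3`, truncate `[2,+oo)`, append `2,3,4`, index 3 is covered by both appends, so `flush_log(upto=3)` cannot tell which write it confirms; a callback bound to one specific append operation does not have this problem.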