Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add per-state hook for handling discarded events #25

Open
rkuhn opened this issue Apr 5, 2023 · 16 comments
Open

add per-state hook for handling discarded events #25

rkuhn opened this issue Apr 5, 2023 · 16 comments
Assignees

Comments

@rkuhn
Copy link
Member

rkuhn commented Apr 5, 2023

Any event type that is currently unexpected will be discarded. We add a hook that can be installed via the state designer DSL so that discarded events can be processed. This is important for user code in order to trigger compensating actions after the log merge has invalidated some previous states that were acted upon.

@Kelerchian
Copy link
Contributor

How would this be used and what is the difference between this and MachineRunner event audit.dropped?

@rkuhn
Copy link
Member Author

rkuhn commented Apr 6, 2023

The audit interface is just for auditing, it has no influence on the machine state. Unhandled events that are the result of a history merge (i.e. previously valid events) can often be useful for accumulating in the state a list of things to undo — a.k.a. decisions that were valid at the time but have become invalid in the meantime.

@Kelerchian
Copy link
Contributor

Ah, I think I understand. So, things like calling a command, state mutation, when unhandled happen?

@rkuhn
Copy link
Member Author

rkuhn commented Apr 11, 2023

The unhandled events hook still needs to be a pure function computing only the next state, so calling commands is out of the question. This is illustrated by the following event trace:

  1. reserveX
  2. XreservedBySomeoneElse
  3. doX
  4. abortedX

The doX would have been unhandled once the second event has been merged into the local log. The fourth event records that the compensation has been performed, so doing anything about doX at step 3 would now be a mistake.

@rkuhn
Copy link
Member Author

rkuhn commented May 25, 2023

Thinking more about this, we need to take a step back and evaluate the use-case:

  • assume state sequence A→B→C→D where in state B a machine knows whether it should do the work in the following states (as opposed to just observe or ignore)
  • assume an execution where the local machine is in state C or D in the active role (so in state B it was told to do the work)
  • now a new event comes in that changes the payload of state B to B' where this machine should be passive — i.e. there has been a mistake caused by a network partition that we now need to repair (a.k.a. compensate)

We need an API where the developer can easily formulate this constraint. Also note that the underlying events — let’s call them b and b' — are not enough to trigger compensation: b' might arrive before b, in which case there would be no confusion since the “you should do something” event will never take effect. So the problematic case is where c happened due to b but then b' invalidates b. The log will now be a,b',b,c plus later c',d,…

One solution is that the programmer writes code detecting that c — while handled — now needs to be compensated, but will also need to write code for the case that c' is ordered before c, so the compensation is triggered by c being discarded. Another solution is to track the immediately preceding event in every emission (e.g. in a tag) and inform the user code that event c is based on an event history that has been discarded. There may be more solutions, so what do you think @Kelerchian @jmg-duarte ?

@Kelerchian
Copy link
Contributor

My immediate reaction, there are 3 scenarios:

  • Takeover mechanism, stay at path B
  • Surrender mechanism, to convert B branch to B' branch
  • Fail-and-reset mechanism

Takeover and Surrender require taking note of the history of eventId-state. History-taking is costly, therefore it should be presented as a different type of machine-runner. Apart from that, I think we should be careful how to reconcile this and the for await API.

@Kelerchian Kelerchian self-assigned this May 26, 2023
@Kelerchian
Copy link
Contributor

Kelerchian commented May 26, 2023

Random related things to look at to build ideas on top of

  • Event-history: list of ordered ActyxEvents consumed by the machine-runner from the start, up to this point.
  • EventId-history: Similar to Event-history, but EventId (string) instead of ActyxEvent.
  • Discarded-Event-history: list of ordered ActyxEvents NOT consumed by the machine-runner from the start, up to this point.
  • All-Event-history: list of ordered ActyxEvents, both consumed by the machine-runner from the start, up to this point.
  • Last-EventId-before-time-travel: the last EventId consumed by machine-runner before a Time-travel happens
  • Divergence: When a Time-travel occurs, the EventId-history is not equal before and after the Time-travel
  • Divergence-Point: The point just before divergence happens. This can be in the form of ActyxEvent or ZERO Symbol (In a rare case where all events from the same subscription change before and after a Time-travel).
  • Divergence-detection-before-caught-up: When a Time-travel occurs, Last-EventId-before-time-travel is not listed in EventId-history after Time-travel
  • Async-iter-break-after-divergence: When a Divergence occur, The async-iterator-style machine-runner is destroyed.
  • Event-history-recording: The act of recording the Event-history

@Kelerchian
Copy link
Contributor

Kelerchian commented May 26, 2023

@rkuhn

I'm starting to reconsider:

  1. It might be better to separate machine-runner as an async iterator and machine-runner as an observable.
  2. Async-iter-break-after-divergence should be the default behavior.

------- Next topic -------

Idea for Event-history-recording

// normally is accessible as readonly array
// also comes in other flavors (e.g. readonly Set)
const eventHistory = MachineRunner.createEventStore();

const machine = createMachineRunner(..., {
  asyncIterBreakAfterDivergence: false, // the default, does not need to be written this way, just for demoing the API,
  eventHistory: {
    history: eventHistory,
    discarded: null,
    all: null
  }
});

for await (const state of machine){
  ...
}

const divergencePoint = machine.divergencePoint(); // null | ZERO | ActyxEvent
const rollbackedEvents = MachineRunner.utils.rollbackedEvents(eventHistory, divergencePoint) // ActyxEvent[]

while (true){
  const lastRollbackedEvent = rollbackedEvents.pop();
  if (!lastRollbackedEvent) return
  switch(lastRollbackedEvent.payload.type){
    case ....
  }
}

@rkuhn
Copy link
Member Author

rkuhn commented May 26, 2023

This opens up some interesting thoughts: we could also surface a time warp by returning it together with the state:

for await (const state of machine) {
  if (state.timeWarp) {
    // inspect states pre- and post-time-warp to see whether compensation is needed
  }
}

I still think that AsyncIterable is just a more resource-safe way of consuming an observable, their behaviour should be the same. Exiting the loop upon time warp would be confusing because then the state of the machine after the loop would be unclear (cleaned up or still running).

@rkuhn
Copy link
Member Author

rkuhn commented May 26, 2023

Another concern is noticing a compensation whose need arose while the local app was not running: seeing the event “I started this mission” after the event “someone else won the auction” should always trigger compensation — which needs to be recorded so that it isn’t done twice …

@Kelerchian
Copy link
Contributor

Kelerchian commented May 26, 2023

Exiting the loop upon time warp would be confusing because then the state of the machine after the loop would be unclear (cleaned up or still running).

That is correct. So ignore my first reconsideration

if (state.timeWarp) {
    // inspect states pre- and post-time-warp to see whether compensation is needed
  }

What do we want to expose with timeWarp?
Is it when timeTravel happens?
Is it only on divergence? (hence we ignore timeTravel)

Also, I suddenly remembered something:

  • Runner emits 'change' event to the outside world when caughtUp is detected instead of when there is a state.change. Is it possible that certain states are skipped because it happens between events before caughtUp?

@rkuhn
Copy link
Member Author

rkuhn commented May 26, 2023

Is it possible that certain states are skipped because it happens between events before caughtUp?

Ah, this is an interesting thought: we could allow the user to define that a certain state will always be emitted, also when not caughtUp. This would allow the consuming code to notice that a decision was reversed during a time warp. Handling the persistent compensation feels like a different issue that might need a different solution — or it might be covered by properly constructing the startup code of the application (i.e. while figuring out “what was I doing” it will automatically correct any mistakes).

A state emitted this way during a replay would have its .commands undefined.

@Kelerchian
Copy link
Contributor

What do you mean by persistent compensation?

@rkuhn
Copy link
Member Author

rkuhn commented May 26, 2023

What we’re mostly talking about here is compensation based on the ephemeral state of the application: we “saw” the wrong state and now we “see” the corrected state so we must compensate. We need to make this as easy and obvious as possible.

In the robot example, the ephemeral compensation consists of telling the robot controller to abort its current mission and then to start looking for a new mission.

The permanent state consists only of the event logs, nothing else. In there we might see events that are indicative of a mistake, which may need compensating events to rectify the situation — this is more relevant when considering other consumers of the event stream (later, or remote) so that they don’t get stuck with the mistake. An example here could be an analytics app counting robot missions and their duration. If only the ephemeral compensation happens then the erroneously started mission will be problematic for the analytics. Plus the programmer would likely want to count compensations as well.

What I’m saying is that we first only need to tackle the ephemeral part and make it work well. Regarding the permanent part we’ll need more product feedback as input.

@Kelerchian
Copy link
Contributor

I can't see a clear separation between the ephemeral part and the permanent part.
Let's say there are two different scenarios:

Scenario 1

[R1] think it is responsible for [mission]
[R2] think it is responsible for [mission]
[R1] fetches [item] from [source], [source] accepts the request because they agree [mission] is done by [R1]
[R2] fetches [item] from [source], [source] rejects the request because it thinks [mission] is done by [R1]
Reconciliation happens; [R1] and [R2] agree that [R1] works on [mission]
[R2]'s machine-runner state now states that [mission]'s worker is [R1], which does not match itself. Then it relinquishes the mission.

∴ Repairing the system's state needs [R2] to compare the machine state to its internal IAmWinner state, therefore requiring only in-memory recollection.

Scenario 2

[R1] think it is responsible for [mission]
[R2] think it is responsible for [mission]
[R1] fetches [item] from [source]
[R2] fetches [item] from [source]
Reconciliation happens; [R1] and [R2] agree that [R1] works on [mission]
[R2] needs to undo. It retrieves the history of its previous action
[R2] returns [item] to [source]
// While returning [item] to [source], [R2] may die and restart (forget what it does) so it might need to store its previous action, either using the same machine-runner or not.

∴ Repairing the system's state needs [R2] to know the history of its previous actions. In this scenario, it needs not only logs but also the history of the time-travel itself. (imagine git-reflog).

What I want to point out is: The length of what the actor needs to do to restore the real-world state to match its programmed model varies greatly, from simply "ignoring it" to "spending time to fix its mistake". In the latter case, there might be downtimes.

I'm having trouble finding the line where to cut it.

@Kelerchian
Copy link
Contributor

Kelerchian commented May 26, 2023

In the robot example, the ephemeral compensation consists of telling the robot controller to abort its current mission and then to start looking for a new mission.

In a simple case like this, "emit-on-divergence-detection" should suffice I think.

  • This should be doable by having the "last-consumed-event" before time travel.
  • After time travel, see if this "last-consumed-event" is consumed again.
  • If it is not consumed, then the runner "emits" with diverged flag

Or this could be resolved just by comparing payload.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants