-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor caller orchestration #906
base: main
Are you sure you want to change the base?
Conversation
This refactors how caller orchestrators (e.g. serial, parallel, loop) are run and structured. The primary point of this is to make them simpler and more intuitive. Currently, Call methods use two mechanisms to support cancellation: go context cancellation and subscription to the pubsub and listening for CallEnded events. (This lead to a bug in my `mini-opctl` branch where certain parallel goroutines weren't cleaned up properly.) Now, Call methods don't depend on a pubsub and always return once their children return (from context cancellation or normal completion). This makes handling outputs and testing orchestration simpler since there are fewer dependencies. The tests could be refactored to not check event emission at all, since the caller is fully in charge of that. This also simplifies call event emission - `caller` still depends on the pubsub, but `containerCaller` now depends on `pubsub.EventPublisher`. `caller` emits start and end events, `containerCaller` emits container stdout/err events. Other callers don't use pubsub (they rely on the wrapped `caller`).
Codecov Report
@@ Coverage Diff @@
## main #906 +/- ##
==========================================
+ Coverage 65.30% 65.47% +0.17%
==========================================
Files 168 168
Lines 5949 5906 -43
==========================================
- Hits 3885 3867 -18
+ Misses 1833 1813 -20
+ Partials 231 226 -5
Continue to review full report at Codecov.
|
childCallOutputsByIndex[childCallIndex] = event.CallEnded.Outputs | ||
if event.CallEnded.Error != nil { | ||
isChildErred = true | ||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to understand the goals behind this set of changes. With this change, child calls must now run on the current node right? Also making this not powered off the event channel it seems like we're also losing the ability to resume/replay because all state is in process as opposed to the stateful pubsub. What if the current node dies or is killed.. If we're relying on in process state rather than the event channel then we don't have an ability to gracefully restart or resume. None of these are hard blockers but I don't necessarily understand the goal; it seems like it moves us farther away multi-node and HA/Fault tolerance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, child calls must now run on the current node right?
This is unaffected. Child calls are run in the same place they currently are, in a goroutine in the current execution context - line 101 in the new code, line 103 in old.
Also making this not powered off the event channel it seems like we're also losing the ability to resume/replay because all state is in process as opposed to the stateful pubsub.
The events being emitted are the same. The change for me is that it's now more clear that the caller is in charge of actually managing the call behavior, whether it's a kill, replay, restart or whatever, not the orchestrators as well (parallelCaller, etc) (they all delegate to the caller). It's much easier to manage state if the node dies or is killed, since it's all based off a single context object which is explicitly controlled by the caller, instead of being managed in multiple places. If in future calls are started on remote nodes, there's a single place for that logic to go - caller.Call
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true that the current code calls caller directly, but the outputs are event driven which is where I'd propose we want to be. There's been an iterative ongoing effort to make the call graph supportive of multi-node. Some details still TBD, but In the case of recursive call types like parallel calls or serial calls, the idea was they wouldn't call caller in process, they'd fire some event like CallRequested and wait on some event like CallCompleted. This would allow those calls to be processed by some other node in entirety and async which has interesting characteristics like load balancing calls, the current recursive call being pausable and resumable from any node, etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think this prevents us from moving in that direction. This definitely makes the call graph simpler and better tested. (#906 (comment) is a good example of where it's easier to actually return panics as errors, instead of just logging)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing prevents anything ever; all changes are always just a PR away : ). All I'm saying is IMHO this undoes some work we've done to try to move us in that direction; specifically the getting child call results via events as opposed to expecting the child calls ran in our local golang process.
This refactors how caller orchestrators (e.g. serial, parallel, loop) are run and structured. The primary point of this is to make them simpler and more intuitive.
Currently,
Call
orchestration methods use two mechanisms to support cancellation: go context cancellation and subscription to the pubsub and listening for CallEnded events. (This lead to a bug in mymini-opctl
branch where certain parallel goroutines weren't cleaned up properly.) Now,Call
orchestration methods don't depend on a pubsub and always return once their children return (from context cancellation or normal completion). They also no longer rely on goto label expressions.This makes handling outputs and testing orchestration simpler since there are fewer dependencies. The tests could be refactored to not check event emission at all, since the caller is fully in charge of that.
It simplifies call event emission -
caller
still depends on the pubsub, butcontainerCaller
now depends onpubsub.EventPublisher
.caller
emits start and end events,containerCaller
emits container stdout/err events. Other callers don't use pubsub (they rely on the wrappedcaller
).Note that this is extracted from my mini-opctl branch, so there are a couple miscellaneous cleanup/fixes included.
Test coverage went down because of the deduplication and overall source code line reduction, so I've added explicit tests for output and scoping in these orchestrators and for
needs
in parallel calls. (Overall, the non-test source code has been reduced)