Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor caller orchestration #906

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Refactor caller orchestration #906

wants to merge 6 commits into from

Conversation

apexskier
Copy link
Contributor

@apexskier apexskier commented Mar 23, 2021

This refactors how caller orchestrators (e.g. serial, parallel, loop) are run and structured. The primary point of this is to make them simpler and more intuitive.

Currently, Call orchestration methods use two mechanisms to support cancellation: go context cancellation and subscription to the pubsub and listening for CallEnded events. (This lead to a bug in my mini-opctl branch where certain parallel goroutines weren't cleaned up properly.) Now, Call orchestration methods don't depend on a pubsub and always return once their children return (from context cancellation or normal completion). They also no longer rely on goto label expressions.

This makes handling outputs and testing orchestration simpler since there are fewer dependencies. The tests could be refactored to not check event emission at all, since the caller is fully in charge of that.

It simplifies call event emission - caller still depends on the pubsub, but containerCaller now depends on pubsub.EventPublisher. caller emits start and end events, containerCaller emits container stdout/err events. Other callers don't use pubsub (they rely on the wrapped caller).

Note that this is extracted from my mini-opctl branch, so there are a couple miscellaneous cleanup/fixes included.

Test coverage went down because of the deduplication and overall source code line reduction, so I've added explicit tests for output and scoping in these orchestrators and for needs in parallel calls. (Overall, the non-test source code has been reduced)

This refactors how caller orchestrators (e.g. serial, parallel, loop)
are run and structured. The primary point of this is to make them
simpler and more intuitive.

Currently, Call methods use two mechanisms to support cancellation: go
context cancellation and subscription to the pubsub and listening for
CallEnded events. (This lead to a bug in my `mini-opctl` branch where
certain parallel goroutines weren't cleaned up properly.) Now, Call
methods don't depend on a pubsub and always return once their children
return (from context cancellation or normal completion).

This makes handling outputs and testing orchestration simpler since
there are fewer dependencies. The tests could be refactored to not check
event emission at all, since the caller is fully in charge of that.

This also simplifies call event emission - `caller` still depends on the
pubsub, but `containerCaller` now depends on `pubsub.EventPublisher`.
`caller` emits start and end events, `containerCaller` emits container
stdout/err events. Other callers don't use pubsub (they rely on the
wrapped `caller`).
@codecov
Copy link

codecov bot commented Mar 23, 2021

Codecov Report

Merging #906 (3fad502) into main (1fd644b) will increase coverage by 0.17%.
The diff coverage is 86.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #906      +/-   ##
==========================================
+ Coverage   65.30%   65.47%   +0.17%     
==========================================
  Files         168      168              
  Lines        5949     5906      -43     
==========================================
- Hits         3885     3867      -18     
+ Misses       1833     1813      -20     
+ Partials      231      226       -5     
Impacted Files Coverage Δ
sdks/go/node/core/containerCaller.go 47.05% <ø> (-0.77%) ⬇️
sdks/go/node/core/core.go 76.05% <40.00%> (-2.52%) ⬇️
sdks/go/node/core/opCaller.go 80.28% <44.44%> (-5.06%) ⬇️
sdks/go/node/core/parallelLoopCaller.go 85.08% <82.75%> (+3.95%) ⬆️
sdks/go/node/core/serialCaller.go 80.00% <87.50%> (-12.46%) ⬇️
sdks/go/node/core/parallelCaller.go 90.67% <90.62%> (+25.54%) ⬆️
sdks/go/node/core/caller.go 87.50% <100.00%> (-0.80%) ⬇️
sdks/go/node/core/serialLoopCaller.go 78.57% <100.00%> (-2.49%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1fd644b...3fad502. Read the comment docs.

@apexskier apexskier marked this pull request as ready for review March 23, 2021 17:20
@apexskier apexskier requested a review from a team as a code owner March 23, 2021 17:20
childCallOutputsByIndex[childCallIndex] = event.CallEnded.Outputs
if event.CallEnded.Error != nil {
isChildErred = true
for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand the goals behind this set of changes. With this change, child calls must now run on the current node right? Also making this not powered off the event channel it seems like we're also losing the ability to resume/replay because all state is in process as opposed to the stateful pubsub. What if the current node dies or is killed.. If we're relying on in process state rather than the event channel then we don't have an ability to gracefully restart or resume. None of these are hard blockers but I don't necessarily understand the goal; it seems like it moves us farther away multi-node and HA/Fault tolerance.

Copy link
Contributor Author

@apexskier apexskier Mar 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, child calls must now run on the current node right?

This is unaffected. Child calls are run in the same place they currently are, in a goroutine in the current execution context - line 101 in the new code, line 103 in old.

Also making this not powered off the event channel it seems like we're also losing the ability to resume/replay because all state is in process as opposed to the stateful pubsub.

The events being emitted are the same. The change for me is that it's now more clear that the caller is in charge of actually managing the call behavior, whether it's a kill, replay, restart or whatever, not the orchestrators as well (parallelCaller, etc) (they all delegate to the caller). It's much easier to manage state if the node dies or is killed, since it's all based off a single context object which is explicitly controlled by the caller, instead of being managed in multiple places. If in future calls are started on remote nodes, there's a single place for that logic to go - caller.Call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that the current code calls caller directly, but the outputs are event driven which is where I'd propose we want to be. There's been an iterative ongoing effort to make the call graph supportive of multi-node. Some details still TBD, but In the case of recursive call types like parallel calls or serial calls, the idea was they wouldn't call caller in process, they'd fire some event like CallRequested and wait on some event like CallCompleted. This would allow those calls to be processed by some other node in entirety and async which has interesting characteristics like load balancing calls, the current recursive call being pausable and resumable from any node, etc...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think this prevents us from moving in that direction. This definitely makes the call graph simpler and better tested. (#906 (comment) is a good example of where it's easier to actually return panics as errors, instead of just logging)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing prevents anything ever; all changes are always just a PR away : ). All I'm saying is IMHO this undoes some work we've done to try to move us in that direction; specifically the getting child call results via events as opposed to expecting the child calls ran in our local golang process.

@apexskier apexskier requested a review from psamaan March 30, 2021 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants