Refactor caller orchestration #906

apexskier · 2021-03-23T13:31:50Z

This refactors how caller orchestrators (e.g. serial, parallel, loop) are run and structured. The primary goal is to make them simpler and more intuitive.

Currently, Call orchestration methods use two mechanisms to support cancellation: Go context cancellation, and subscribing to the pubsub to listen for CallEnded events. (This led to a bug in my mini-opctl branch where certain parallel goroutines weren't cleaned up properly.) Now, Call orchestration methods don't depend on a pubsub and always return once their children return (from context cancellation or normal completion). They also no longer rely on goto label expressions.

This makes handling outputs and testing orchestration simpler since there are fewer dependencies. The tests could be refactored to not check event emission at all, since the caller is fully in charge of that.

It also simplifies call event emission: caller still depends on the pubsub, but containerCaller now depends on pubsub.EventPublisher. caller emits start and end events, and containerCaller emits container stdout/stderr events. Other callers don't use the pubsub (they rely on the wrapped caller).
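
The division of responsibility described above could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Event` struct, the event kind strings, the `Publish` signature and the in-memory pubsub are simplified and are not opctl's real types or API.

```go
package main

import "fmt"

// Event is a simplified, hypothetical event; the real model is richer.
type Event struct {
	Kind string
	Data string
}

// EventPublisher is a publish-only capability (signature assumed for this sketch).
type EventPublisher interface {
	Publish(e Event)
}

// memoryPubSub is a toy in-memory pubsub that just records what was published.
type memoryPubSub struct{ events []Event }

func (p *memoryPubSub) Publish(e Event) { p.events = append(p.events, e) }

// containerCaller only needs to publish, so it takes the narrow interface.
type containerCaller struct{ pub EventPublisher }

func (c containerCaller) Call(id string) error {
	c.pub.Publish(Event{Kind: "ContainerStdOut", Data: id + ": hello"})
	return nil
}

// caller wraps another caller and is the only place start/end events are emitted.
type caller struct {
	pubSub *memoryPubSub
	inner  interface{ Call(id string) error }
}

func (c caller) Call(id string) error {
	c.pubSub.Publish(Event{Kind: "CallStarted", Data: id})
	err := c.inner.Call(id)
	c.pubSub.Publish(Event{Kind: "CallEnded", Data: id})
	return err
}

func main() {
	ps := &memoryPubSub{}
	c := caller{pubSub: ps, inner: containerCaller{pub: ps}}
	if err := c.Call("call-1"); err != nil {
		fmt.Println("error:", err)
	}
	for _, e := range ps.events {
		fmt.Println(e.Kind)
	}
}
```

With this shape, a test for an orchestrator can pass in a fake wrapped caller and never touch the pubsub at all.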

Note that this is extracted from my mini-opctl branch, so a couple of miscellaneous cleanups and fixes are included.

Test coverage went down because of the deduplication and overall source code line reduction, so I've added explicit tests for output and scoping in these orchestrators and for needs in parallel calls. (Overall, the non-test source code has been reduced.)

codecov · 2021-03-23T13:38:52Z

Codecov Report

Merging #906 (3fad502) into main (1fd644b) will increase coverage by 0.17%.
The diff coverage is 86.17%.

@@            Coverage Diff             @@
##             main     #906      +/-   ##
==========================================
+ Coverage   65.30%   65.47%   +0.17%     
==========================================
  Files         168      168              
  Lines        5949     5906      -43     
==========================================
- Hits         3885     3867      -18     
+ Misses       1833     1813      -20     
+ Partials      231      226       -5

Impacted Files	Coverage Δ
sdks/go/node/core/containerCaller.go	`47.05% <ø> (-0.77%)`	⬇️
sdks/go/node/core/core.go	`76.05% <40.00%> (-2.52%)`	⬇️
sdks/go/node/core/opCaller.go	`80.28% <44.44%> (-5.06%)`	⬇️
sdks/go/node/core/parallelLoopCaller.go	`85.08% <82.75%> (+3.95%)`	⬆️
sdks/go/node/core/serialCaller.go	`80.00% <87.50%> (-12.46%)`	⬇️
sdks/go/node/core/parallelCaller.go	`90.67% <90.62%> (+25.54%)`	⬆️
sdks/go/node/core/caller.go	`87.50% <100.00%> (-0.80%)`	⬇️
sdks/go/node/core/serialLoopCaller.go	`78.57% <100.00%> (-2.49%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1fd644b...3fad502.

chrisdostert · 2021-03-27T00:09:24Z

sdks/go/node/core/parallelCaller.go

-				childCallOutputsByIndex[childCallIndex] = event.CallEnded.Outputs
-				if event.CallEnded.Error != nil {
-					isChildErred = true
+	for {


I'm trying to understand the goals behind this set of changes. With this change, child calls must now run on the current node right? Also making this not powered off the event channel it seems like we're also losing the ability to resume/replay because all state is in process as opposed to the stateful pubsub. What if the current node dies or is killed? If we're relying on in-process state rather than the event channel, then we don't have the ability to gracefully restart or resume. None of these are hard blockers, but I don't necessarily understand the goal; it seems like it moves us farther away from multi-node and HA/fault tolerance.

apexskier · 2021-03-30

> With this change, child calls must now run on the current node right?

This is unaffected. Child calls run in the same place they currently do, in a goroutine in the current execution context: line 101 in the new code, line 103 in the old.

> Also making this not powered off the event channel it seems like we're also losing the ability to resume/replay because all state is in process as opposed to the stateful pubsub.

The events being emitted are the same. The change, for me, is that it's now clearer that the caller alone is in charge of managing call behavior, whether that's a kill, replay, restart, or whatever. The orchestrators (parallelCaller, etc.) no longer share that responsibility; they all delegate to the caller. It's much easier to manage state if the node dies or is killed, since it's all based on a single context object that is explicitly controlled by the caller, instead of being managed in multiple places. If, in the future, calls are started on remote nodes, there's a single place for that logic to go: caller.Call.

chrisdostert · 2021-04-05

It's true that the current code calls caller directly, but the outputs are event driven, which is where I'd propose we want to be. There's been an iterative, ongoing effort to make the call graph supportive of multi-node. Some details are still TBD, but in the case of recursive call types like parallel calls or serial calls, the idea was that they wouldn't call caller in process; they'd fire some event like CallRequested and wait on some event like CallCompleted. This would allow those calls to be processed in their entirety, and asynchronously, by some other node, which has interesting characteristics like load balancing calls, the current recursive call being pausable and resumable from any node, etc.
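
For comparison, the event-driven shape described in this comment could be sketched as below. The details are explicitly TBD in the discussion, so this is only one possible reading: the `CallRequested` and `CallCompleted` payloads, the `worker` function standing in for "some other node", and the use of in-process channels in place of a real event channel are all assumptions.

```go
package main

import "fmt"

// CallRequested and CallCompleted are hypothetical event payloads.
type CallRequested struct {
	CallID string
	Input  int
}

type CallCompleted struct {
	CallID string
	Output int
}

// worker stands in for any node able to process a requested call.
func worker(requests <-chan CallRequested, completions chan<- CallCompleted) {
	for req := range requests {
		completions <- CallCompleted{CallID: req.CallID, Output: req.Input * 2}
	}
}

// runViaEvents never invokes a child directly: it publishes requests and
// collects results from completion events, matched by call ID.
func runViaEvents(inputs map[string]int, workers int) map[string]int {
	// Buffered so publishing all requests never blocks on completions.
	requests := make(chan CallRequested, len(inputs))
	completions := make(chan CallCompleted, len(inputs))

	for w := 0; w < workers; w++ {
		go worker(requests, completions)
	}
	for id, in := range inputs {
		requests <- CallRequested{CallID: id, Input: in}
	}
	close(requests) // lets the workers exit once the queue drains

	outputs := make(map[string]int, len(inputs))
	for range inputs {
		done := <-completions
		outputs[done.CallID] = done.Output
	}
	return outputs
}

func main() {
	out := runViaEvents(map[string]int{"a": 1, "b": 2, "c": 3}, 2)
	fmt.Println(out["a"], out["b"], out["c"])
}
```

The orchestrator here holds no reference to whoever runs the child, which is what would make it possible for another node to pick the work up; the trade-off is that cancellation and cleanup have to be expressed as events as well.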

apexskier · 2021-04-06

I still don't think this prevents us from moving in that direction. This definitely makes the call graph simpler and better tested. (#906 (comment) is a good example of where it's easier to actually return panics as errors, instead of just logging them.)
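
As a rough illustration of returning panics as errors from parallel children, the sketch below recovers inside each goroutine and hands the panic back as an ordinary error value. `safeCall` and `runAll` are hypothetical helpers written for this example, not the functions changed in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// safeCall runs fn and converts a panic into a returned error.
func safeCall(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("child call panicked: %v", r)
		}
	}()
	return fn()
}

// runAll runs every function in its own goroutine and collects one error slot per child.
func runAll(fns []func() error) []error {
	errs := make([]error, len(fns))
	var wg sync.WaitGroup
	for i, fn := range fns {
		wg.Add(1)
		go func(i int, fn func() error) {
			defer wg.Done()
			errs[i] = safeCall(fn) // a panic here cannot crash the process
		}(i, fn)
	}
	wg.Wait()
	return errs
}

func main() {
	errs := runAll([]func() error{
		func() error { return nil },
		func() error { return errors.New("plain failure") },
		func() error { panic("kaboom") },
	})
	for i, err := range errs {
		fmt.Println(i, err)
	}
}
```

Since the orchestrator waits on its children directly, the recovered error can flow back through the normal return path instead of only being logged.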

chrisdostert · 2021-04-07

Nothing prevents anything ever; all changes are always just a PR away : ). All I'm saying is that, IMHO, this undoes some work we've done to try to move us in that direction; specifically, getting child call results via events as opposed to expecting that the child calls ran in our local golang process.

apexskier added 3 commits March 23, 2021 15:26

Add more direct caller behavior test

8a19058

Adding tests

2ec5d1f

Add parallel needs test

0c0750d

apexskier marked this pull request as ready for review March 23, 2021 17:20

apexskier requested a review from a team as a code owner March 23, 2021 17:20

chrisdostert reviewed Mar 27, 2021

apexskier requested a review from psamaan March 30, 2021 11:57

chrisdostert reviewed Apr 5, 2021

apexskier added 2 commits April 6, 2021 09:51

Handle panics in parallel callers

000c03b

Merge branch 'main' into callgraph-refactor

3fad502
