Stop machine monitoring on Session shutdown #90

Open
awiss wants to merge 17 commits into master

Conversation

@awiss awiss commented Jul 10, 2020

Creates a context for each invocation of Session.run and cancels it once the run completes. Also threads this context through to all calls down the chain.

Took another look and I think doing things this way will work out fine. There are still some bigmachine logs that linger after shutdown is called, which I can clean up as well; I'll follow up with another PR there.

Still WIP, adding tests.

@awiss awiss requested a review from jcharum July 10, 2020 03:39
exec/bigmachine.go (outdated, resolved)
exec/session.go (outdated)
@@ -272,6 +272,8 @@ var statusMu sync.Mutex
 
 func (s *Session) run(ctx context.Context, calldepth int, funcv *bigslice.FuncValue, args ...interface{}) (*Result, error) {
 	location := "<unknown>"
+	runContext, runContextCancel := context.WithCancel(ctx)
+	defer runContextCancel()

Contributor

It seems like this will cause machine management to stop when this cancellation executes on return from this method, since this context is now the context passed to (*sliceMachine).Go. This is undesirable, as Sessions can be used to run multiple invocations.

I think we want to continue to log the lost machine... message as long as the session has not been shut down, as task results can be reused by future invocations, and it's reasonably informative to know that they will need to be recomputed. I think we should look to only clean up the messages that folks see after session shutdown, as lost tasks have no observable effect at that point.

Contributor Author

Alright. This brings up another issue, however: do you think it is alright to store a context on the session struct? (I earlier found "Do not store Contexts inside a struct type" in the documentation.) If not, do you think there is another way to do this? The only other way I could think of is to require the user to create their own context and pass it to each session method.

Contributor

One possibility is to keep a channel that we close on shutdown.
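
For illustration, here's a minimal, self-contained sketch of that pattern (the session type and names below are hypothetical stand-ins, not the actual bigslice types): Shutdown closes a channel exactly once, and any long-running loop selects on it to learn when to exit.

package main

import (
	"fmt"
	"time"
)

// session is a hypothetical stand-in for exec.Session.
type session struct {
	shutdownc chan struct{} // closed exactly once by Shutdown
}

func newSession() *session {
	return &session{shutdownc: make(chan struct{})}
}

// Shutdown signals every loop watching shutdownc to exit.
func (s *session) Shutdown() {
	close(s.shutdownc)
}

// monitor stands in for a long-running loop like (*sliceMachine).Go.
func (s *session) monitor() {
	for {
		select {
		case <-s.shutdownc:
			fmt.Println("monitor: session shut down, exiting")
			return
		case <-time.After(100 * time.Millisecond):
			// Periodic machine-monitoring work would go here.
		}
	}
}

func main() {
	s := newSession()
	go s.monitor()
	time.Sleep(250 * time.Millisecond)
	s.Shutdown()
	time.Sleep(50 * time.Millisecond) // give monitor a moment to exit
}

Unlike a context stored on the struct, a close-once channel carries no deadline or values and has no API surface beyond closing it, which sidesteps the "do not store Contexts inside a struct type" guidance.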

Contributor Author

I've just put up a channel based version of this. Let me know what you think.

Contributor
@jcharum jcharum Jul 14, 2020

I think the problem is unresolved. My reading is as follows:

  1. (*Session).run now creates a Context, which I'll call ctx henceforth. It is cancelled when (*Session).run returns.
  2. (*Session).run passes ctx to Eval.
  3. Eval passes ctx to executor.Run. Suppose executor is a *bigmachineExecutor.
  4. (*bigmachineExecutor).Run passes ctx to (*bigmachineExecutor).manager.
  5. (*bigmachineExecutor).manager passes ctx to (*machineManager).Do.
  6. (*machineManager).Do passes ctx to startMachines.
  7. startMachines passes ctx to (*sliceMachine).Go.

When (*Session).run returns, both:

  • (*machineManager).Do will return, thereby no longer managing machines.
  • (*sliceMachine).Go will return (for all machines).

Neither of these things should happen, as the *Session needs to continue to be usable.

Contributor Author

As of the current diff, (*sliceMachine).Go returns based on a channel that is closed on session shutdown. Pushing up a diff now that will make (*machineManager).Do exit based on this channel as well.

Contributor Author

I've left the context plumbed through, but it won't deactivate those processes when it's cancelled.

Contributor

When the context is cancelled, the channel returned by ctx.Done() will be closed, so the select will unblock. When we check the for-loop condition, ctx.Err() will be non-nil and (*sliceMachine).Go will return. Could you help me understand how this reasoning is invalid?
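
For concreteness, the control flow being described looks roughly like this (a sketch of the loop shape only, not the actual (*sliceMachine).Go body):

package main

import (
	"context"
	"fmt"
	"time"
)

// loop sketches the shape under discussion: a for-loop condition on
// ctx.Err() wrapped around a select that includes ctx.Done().
func loop(ctx context.Context) {
	for ctx.Err() == nil { // non-nil after cancellation, so the loop exits
		select {
		case <-ctx.Done():
			// Cancellation closes ctx.Done(), unblocking this select;
			// the loop condition above then observes the error.
		case <-time.After(100 * time.Millisecond):
			// Periodic monitoring work would go here.
		}
	}
	fmt.Println("loop: context cancelled, returning")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go loop(ctx)
	time.Sleep(250 * time.Millisecond)
	cancel()
	time.Sleep(50 * time.Millisecond)
}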

Contributor Author

Ah, I forgot about the for-loop condition; I was just looking at the select at the bottom. I just updated it to use the background context and double-checked on adhoc.

Contributor
@jcharum jcharum left a comment

Is there still a reason for changing anything about context usage? If not, let's not change anything related to context. If so, could you explain?

@awiss awiss commented Jul 14, 2020

No, there is no reason anymore. The only change I've left in is moving the cancel from Eval to Session.run(), since it cleans up the call to maintainSliceGroup.

@jcharum jcharum commented Jul 14, 2020

> No, there is no reason anymore. The only change I've left in is moving the cancel from Eval to Session.run(), since it cleans up the call to maintainSliceGroup.

Why touch anything related to the contexts at all? Is it fixing some issue?

@awiss awiss commented Jul 14, 2020

> Why touch anything related to the contexts at all? Is it fixing some issue?

Just a refactor; I'm trying to leave the code a bit nicer than I found it (no contexts will be canceled at any time different than before). I can revert those changes as well if you aren't comfortable with them.

@jcharum jcharum commented Jul 14, 2020

> Why touch anything related to the contexts at all? Is it fixing some issue?
>
> Just a refactor; I'm trying to leave the code a bit nicer than I found it (no contexts will be canceled at any time different than before). I can revert those changes as well if you aren't comfortable with them.

In general, I would try to do any refactoring in a separate PR, as it makes it clear that there should be no behavior change and allows us to roll things back more easily.

In this specific case, I don't think it's actually the right thing to do. Eval makes a context that it cancels on return, because this context is used by all the goroutines that Eval spawns to monitor the tasks with which it is concerned. Eval internally wants to stop these goroutines on return, as the work they are doing is no longer necessary. With the proposed change, this control is moved out to the caller, which means that those goroutines will persist until the caller decides to cancel ctx. In the case of (*Session).run, that happens to be immediately after Eval returns, but there's nothing that requires that; e.g., a caller that makes further calls to Eval would leave the context uncancelled for meaningfully longer.
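
The idiom being defended here is the standard derive-and-cancel-on-return pattern. A minimal sketch, with a hypothetical worker goroutine standing in for Eval's internals:

package main

import (
	"context"
	"fmt"
	"time"
)

// eval derives a context and cancels it on return, so the helper
// goroutines it spawns cannot outlive the call itself.
func eval(ctx context.Context) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stops the monitor goroutine as soon as eval returns

	go func() { // stand-in for Eval's task-monitoring goroutines
		<-ctx.Done()
		fmt.Println("monitor: stopping with eval")
	}()

	time.Sleep(100 * time.Millisecond) // the evaluation itself
}

func main() {
	eval(context.Background())
	time.Sleep(50 * time.Millisecond) // give the monitor a moment to print
}

Moving the cancel out to the caller keeps such goroutines alive until the caller gets around to cancelling, which is the lifetime leak described above.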

@awiss awiss commented Jul 14, 2020

Thanks for the explanation. I've removed those changes.

@awiss awiss changed the title Use a unique context for each call of Session.Run Stop machine monitoring on Session shutdown Jul 14, 2020

@jcharum jcharum commented Jul 15, 2020

(Executor).Start already returns a function that is meant to be called to shut down the Executor. This exposes a second way to signal shutdown. Between the two, I prefer the existing one, as I think having a function to call is less awkward than having a channel to close. I think it's better to leave the shutdown mechanism entirely up to the Executor implementation.

Relatedly, the code is a bit difficult to reason about because we've now got functions that have both a context and a channel that affect control flow. I think we should instead look to make the interaction between these mechanisms as clear as possible.

I also think this implementation still has a race, albeit one that's probably not too harmful given current implementations. Specifically, on shutdown we both call the bigmachine shutdown and close the channel. I don't think anything guarantees that the close will be handled before the transition to bigmachine.Stopped is handled. I think we're just lucky right now that bigmachine takes its time before machines transition, for whatever reason.

Here's another possible approach. Instead of returning b.b.Shutdown directly from (*bigmachineExecutor).Start, do something like:

type bigmachineExecutor struct {
	...
	shutdownc  <-chan struct{}
	managersWG sync.WaitGroup // Manager loops we should wait for before bigmachine shutdown.
}

func (b *bigmachineExecutor) Start() func() {
	...
	shutdownc := make(chan struct{})
	b.shutdownc = shutdownc
	return func() {
		close(shutdownc)
		b.managersWG.Wait() // Address the race.
		b.b.Shutdown()
	}
}

func (b *bigmachineExecutor) manager(i int) *machineManager {
	...
	// Bridge the shutdown channel to contexts.
	ctx, cancel := context.WithCancel(backgroundcontext.Get())
	go func() {
		<-b.shutdownc
		cancel()
	}()
	b.managersWG.Add(1)
	go func() {
		defer b.managersWG.Done()
		b.managers[i].Do(ctx)
	}()

Then when we run the manager loops, we can continue to manage them with a context, e.g. something like:

type machineManager struct {
	...
	machinesWG sync.WaitGroup
}

func (m *machineManager) Do(ctx context.Context) {
	...
	machines := startMachines(ctx, ...) // Lift the call to Go out of startMachines; see below.
	for _, machine := range machines {
		machine := machine // Capture the loop variable for the goroutine below.
		m.machinesWG.Add(1)
		go func() {
			defer m.machinesWG.Done()
			machine.Go(ctx) // Clearly attach the machine lifetime to the manager loop.
			...
		}()
	}
	...
}

Note that I haven't attempted to compile any of this code, so caveat emptor. This seems quite a bit cleaner to me though:

  • No change to Executor interface. I think the change only involves two files.
  • Continue consistent pattern of control loop methods just taking a context for external cancellation control.
  • Very limited scope for the channel signaled on shutdown.
  • Correct handling of the race mentioned above, no matter what we do with the (*bigmachine.B).Shutdown implementation.

Let me know what you think, as perhaps I'm missing something.

@awiss awiss commented Jul 16, 2020

This is much cleaner, thank you. I've just pushed these changes.

exec/bigmachine.go (outdated, resolved)
exec/slicemachine.go (outdated, resolved)
exec/slicemachine.go (outdated, resolved)
exec/slicemachine.go (outdated, resolved)
exec/slicemachine.go (outdated, resolved)
awiss and others added 2 commits July 20, 2020 13:33
Co-authored-by: Jaran Charumilind <src@me.jcharum.com>
Co-authored-by: Jaran Charumilind <src@me.jcharum.com>

@awiss awiss commented Jul 21, 2020

With the code as-is, we will see the machine-lost logs immediately on shutdown, meaning users will see them more often now. Should we do something else to hide those before landing this?
