Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

windows: handle pending control actions #268

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

djdv
Copy link
Contributor

@djdv djdv commented Apr 24, 2021

tl;dr this should fix: #30


When running my own tests, I was encountering random failure/success related to sequences of service.Control functions.
Specifically the sequence ["install", "start", "restart"] failed more often than the rest.
I tracked it back to a call to Control within stopWait.

Before; if the service was in a StartPending state, that call to Control would fail.
Now; We query the status first, and wait for StartPending services to reach Running, before issuing a "stop" request for them.
And still as before; wait for the service to go from StopPending to Stopped after before returning.

In addition, there were 2 other bugs that may have been associated with this:

  1. The break within the timeout's select scope, breaks out of the select scope, but was likely meant to break out of the outer for's scope.
    As a result, I don't think these operations ever timed out.
  2. Run's signal channel was unbuffered, meaning it could potentially miss the signal from the OS.
    Leaving services running indefinitely.
    (The signal test was also calling test methods in a non-test routine - a goroutine within a test routine)

The wait loop was rewritten to use an upper and lower bound that matched the service example provided on MSDN.
And should use timeout hints if they're provided by the service.

The os signal was buffered, and the signal test was re-ordered so that the test failure happens inside the test routine scope.


In any case, my own tests of Control are passing consistently now. And the signal test passes too.

With the exception of any review remarks, this is probably good to go.
(I wrote this rather quickly, so things might be wrong or something could probably be expressed more elegantly - just let me know what should be changed and I'll push them through)

@coveralls
Copy link

coveralls commented Apr 24, 2021

Coverage Status

Coverage remained the same at 12.799% when pulling 7b85551 on djdv:fix/windows into ef35c56 on kardianos:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 12.799% when pulling 575d2ae on djdv:fix/windows into ef35c56 on kardianos:master.

@djdv djdv marked this pull request as draft May 10, 2021 14:43
@djdv
Copy link
Contributor Author

djdv commented May 10, 2021

Marked as a draft; there's a similar issue around Start that needs to be resolved.
service.Start() will return after the service control manager spawns the service process (ServiceMain), and sets the service state to SERVICE_START_PENDING, but that may be before the service interface's Start(Service) method is called, and before the service state is set to SERVICE_RUNNING.
(We can return from service.Start() between here and here)

This will cause problems for some of the other controls which don't check or wait for pending state.
And most important, if the caller expects the service's work routine to be running after Start() returns, that is not usually the case.
It just means the service executable is running and communicated with the service control manager, not that the service routine itself is ready yet.

On windows Status falls through for pending states, which makes it impossible for the caller to poll/distinguish themselves if just the service process has started, or if it's work routine has also started.

As such, I'm going to rework this patchset to make Start wait for the state to change before returning, like Stop does.
That is, make Start return after both the service process starts, and its work routine starts running too.

@djdv djdv force-pushed the fix/windows branch 3 times, most recently from ea2447d to a0e7b19 Compare May 12, 2021 17:17
@djdv
Copy link
Contributor Author

djdv commented May 12, 2021

This seems better now, and likely ready for review.

Originally I could run my own pkg's control tests once, sometimes passing and sometimes not.
After the previous patch, I could usually get go test -count 10 to succeed, but not always.
Currently, I can run go test -count 100 and only receive 1 failure but it may be related to my own pkg's service logic, not the control logic.

I'm going to investigate this more later.
Afterwards, I'll flip the PR back to "Ready", and post details about the changes made + some relevant notes that came up while testing.

@djdv djdv changed the title windows: wait for pending start before stop windows: handle pending control actions May 14, 2021
@djdv djdv force-pushed the fix/windows branch 2 times, most recently from 67cfdf2 to 7b85551 Compare May 15, 2021 14:01
@djdv
Copy link
Contributor Author

djdv commented May 15, 2021

I reworked the patchset so that communication with the service manager happens during Start and Stop. And just composed Restart from them.
Now the methods will issue a control request to the service control manager, but will also try to wait for the action to be completed before returning.

Some of the constants used are arbitrary, like the time value's and attempt count.
If there's recommended value's to use, we can use those instead, but as-is, these tolerances seem to be good enough on my machine within the tests for my own project.

Some of the tests from my pkg:

Each test was run 100 times without failure.
There are additional tests which further exercise control actions and Status, but I've omitted them here since they're mostly the same, just with pkg specific wrappings.

func TestServiceControl(t *testing.T) {
	t.Run("bad sequence", func(t *testing.T) {
		for _, test := range []struct {
			controlAction string
			shouldError   bool
		}{
			{"invalid control action", true},
			{"uninstall", true},
			{"install", false},
			{"start", false},
			{"start", true},
			{"stop", false},
			{"stop", true},
			{"uninstall", false},
		} {
			var (
				controlAction = test.controlAction
				shouldError   = test.shouldError
			)
			t.Run(controlAction, func(t *testing.T) {
				err := issueControlRequest(controlAction)
				if shouldError &&
					err == nil {
					t.Errorf("control \"%s\" was supposed to return an error, but did not",
						controlAction)
				}
				if !shouldError &&
					err != nil {
					t.Errorf("control \"%s\" returned error: %s",
						controlAction, err)
				}
			})
		}
		waitForUninstall(t)
	})
	t.Run("good sequence", func(t *testing.T) {
		for _, testControl := range []string{
			"install",
			"start",
			"restart",
			"stop",
			"uninstall",
		} {
			controlAction := testControl
			t.Run(controlAction, func(t *testing.T) {
				if err := issueControlRequest(controlAction); err != nil {
					t.Error(err)
				}
			})
		}
		waitForUninstall(t)
	})
}

Supplemental notes:

I further tested the Windows service checkpoint+waithint mechanisms.
As-is in this patch, both will be used if the service provides them in its status query results.

However, this package does not have a means for a service implementation to do so.
If anyone needs that, they'll need to modify Execute so that the service has some way to report its progress while it's in a pending state.

While testing this out, I had to modify Go's sys package because while it can provide this data, it currently (1.16) doesn't.

sys patch:
diff --git a/windows/svc/mgr/service.go b/windows/svc/mgr/service.go
index 7d735ca..cb7ff4f 100644
--- a/windows/svc/mgr/service.go
+++ b/windows/svc/mgr/service.go
@@ -54,8 +54,12 @@ func (s *Service) Control(c svc.Cmd) (svc.Status, error) {
 		return svc.Status{}, err
 	}
 	return svc.Status{
-		State:   svc.State(t.CurrentState),
-		Accepts: svc.Accepted(t.ControlsAccepted),
+		State:                   svc.State(t.CurrentState),
+		Accepts:                 svc.Accepted(t.ControlsAccepted),
+		CheckPoint:              t.CheckPoint,
+		WaitHint:                t.WaitHint,
+		Win32ExitCode:           t.Win32ExitCode,
+		ServiceSpecificExitCode: t.ServiceSpecificExitCode,
 	}, nil
 }
 
@@ -70,6 +74,8 @@ func (s *Service) Query() (svc.Status, error) {
 	return svc.Status{
 		State:                   svc.State(t.CurrentState),
 		Accepts:                 svc.Accepted(t.ControlsAccepted),
+		CheckPoint:              t.CheckPoint,
+		WaitHint:                t.WaitHint,
 		ProcessId:               t.ProcessId,
 		Win32ExitCode:           t.Win32ExitCode,
 		ServiceSpecificExitCode: t.ServiceSpecificExitCode,

After that, I just hardcoded sleeps and checkpoint responses within Execute to validate it actually works as expected.
The control actions seemed to respect the data sent from the service implementation, and waited accordingly.

@djdv djdv marked this pull request as ready for review May 15, 2021 14:44
@djdv
Copy link
Contributor Author

djdv commented May 15, 2021

Also, I previously made Uninstall try to wait until the service is actually uninstalled first, but ended up reverting that change because should likely be done caller side, since it can be.

Reverted diff:
a0e7b19?branch=a0e7b19284eace415a2a357ee11c7a5e90c9894c&diff=split#diff-c4a897e0bf69e5813fe60d120dae248e93b65dc1f27316ff333bac665006b0adR268

Partial client side implementation I'm using for the test posted above:
// The service uninstall control may return
// before the system service is fully stopped and deleted.
// As such, we want to explicitly wait between tests
// until the service control manager finishes removing the service,
// or we give up.
func waitForUninstall(t *testing.T) {
	t.Helper()

	const (
		checkInterval = 100 * time.Millisecond
		checkTimeout  = 10 * time.Second
	)
	var (
		ticker  = time.NewTicker(checkInterval)
		timeout = time.NewTimer(checkTimeout)
	)
	defer func() {
		ticker.Stop()
		timeout.Stop()
	}()
	for {
		t.Logf("waiting %s for uninstall...", checkInterval)
		select {
		case <-ticker.C:
			status, err := issueStatusRequest()
			if err != nil {
				t.Fatal(err)
			}
			if svcErr := status.ControllerError; svcErr != nil &&
				errors.Is(svcErr, hostservice.ErrNotInstalled) {
				return
			}
		case <-timeout.C:
			t.Fatalf("uninstall control did not finish in time (%s)", checkTimeout)
		}
	}
}

@djdv
Copy link
Contributor Author

djdv commented Jun 10, 2021

Small update, this still seems to be fine. However I think the original code and this patch still have an issue.
IIRC if your service interface implementation returns an error very early from myservice.Start, it does not always get returned from the Start control action from the service pkg.

I encountered this once due to an implementation error on my end, but didn't need to deal with it.
Planning to look into that more eventually, but just noting it here for now.
It's really only a problem if your Start method can error. The fix should just require changing how errors get returned from the Execute service/message loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test: TestInstallRunRestartStopRemove failed.
2 participants