Remove RESTART/PURGING states #3905

christoph-zededa · 2024-05-06T13:08:34Z

If an edge app upgrade is done while an EVE update is in progress,
the app status state stays in RESTARTING even though EVE
will restart and therefore halt the app instance.
The app status should go to HALT in this case.

OhmSpectator · 2024-05-06T14:15:58Z

Could you please explain why you decided to split the inactivation function?

Also, please test the snapshot functionality. This code path could be critical for the volume snap creation as we wait for the app to be deactivated during the purge. I can share some details in DM on how to perform it.

OhmSpectator · 2024-05-06T15:08:51Z

The places in the code I'm worried about:

eve/pkg/pillar/cmd/zedmanager/updatestatus.go

Line 131 in 5fa2f87

domainStatus := lookupDomainStatus(ctx, uuidStr)
eve/pkg/pillar/cmd/zedmanager/updatestatus.go

Line 61 in 5fa2f87

if !uninstall && domainStatus != nil && !domainStatus.Activated {

And here is why:
eve/pkg/pillar/docs/app-snapshot.md

Line 226 in a93b850

4. **Waiting for App Deactivation**: After the above steps, EVE is ready to
eve/pkg/pillar/docs/app-snapshot.md

Line 408 in a93b850

8. **Waiting for App Deactivation**: After marking the application for purge,
eve/pkg/pillar/docs/app-snapshot.md

Line 539 in a93b850

5. **Wait For Deferred Deactivation**

TL;DR: The code of the deactivation detection during the snap operations is pretty tricky, and I figured out the places where the wait should happen empirically. So it's pretty fragile.

I'll run the tests for snap manually.

OhmSpectator · 2024-05-06T15:30:07Z

We ran a simple Snap/Rollback/Delete test on an ext4-based app, and it appears to work.

rouming · 2024-05-07T08:31:41Z

Do you understand which new state we transition to from RESTARTING? I do not see how we can switch to HALTED from doUpdate(), other than through these lines:

eve/pkg/pillar/cmd/zedmanager/updatestatus.go

Line 216 in c2ef643

status.State = types.HALTED

	if !effectiveActivate {
		if status.Activated || status.ActivateInprogress {
			c := doInactivateHalt(ctx, config, status)
			changed = changed || c
		}
		// Activated and ActivateInprogress flags may be changed during doInactivateHalt call
		if !status.Activated && !status.ActivateInprogress {
			// Since we are not activating we set the state to
			// HALTED to indicate it is not running since it
			// might have been halted before the device was rebooted
			if status.State == types.INSTALLED || status.State == types.START_DELAYED {
				status.State = types.HALTED
				changed = true
			}
		}

so we should set INSTALLED or START_DELAYED in the doInactivateHalt().

The other branch where we can transit to HALTED is doCleanup() called from the doInactivate(), but this is not our case I assume.

rouming · 2024-05-07T08:47:32Z

Also, what is so special about PURGING? You still keep PURGING in your check.
The original commit:

commit 6bf39d2b4453f841cff799bd63292ee191a5c627
Author: eriknordmark <erik@zededa.com>
Date:   Tue Sep 25 00:01:38 2018 -0700

    do not clobber RESTARTING and PURGING states

introduced the following:

+       switch status.State {
+       case types.RESTARTING, types.PURGING:
+               // Leave unchanged
+       default:
+               status.State = minState
+       }

for these callbacks: doInstall(), doPrepare(), doActivate(), but not for the doInactivateHalt().

and then this commit followed:

commit b24e1c2cae3e8b96934b57f1527862ec00ae4d06
Author: eriknordmark <erik@zededa.com>
Date:   Mon Oct 15 16:53:43 2018 -0700

    update from DomainStatus even if Pending

which covers the doInactivateHalt() with the same hunk.
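
For readers following along, here is a minimal sketch of what the quoted hunk does, assuming a simplified stand-in for `types.SwState`. The enum values and the helper name `applyState` are hypothetical and exist only for illustration; the real code assigns `minState` in place.

```go
package main

import "fmt"

// SwState is an illustrative stand-in for the pillar state type;
// the numeric values here are NOT the real ones.
type SwState int

const (
	INSTALLED SwState = iota
	BOOTING
	RUNNING
	HALTING
	HALTED
	RESTARTING
	PURGING
)

// applyState mirrors the guard from the quoted hunk: RESTARTING and
// PURGING are left unchanged, every other state is overwritten.
// (Hypothetical helper, not a function in zedmanager.)
func applyState(current, newState SwState) SwState {
	switch current {
	case RESTARTING, PURGING:
		// Leave unchanged
		return current
	default:
		return newState
	}
}

func main() {
	// With the guard in place, a HALTED update is swallowed while RESTARTING.
	fmt.Println(applyState(RESTARTING, HALTED) == RESTARTING)
	fmt.Println(applyState(RUNNING, HALTED) == HALTED)
}
```

With this guard in place an app sitting in RESTARTING never picks up HALTED, which is the behaviour the PR description complains about.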

@eriknordmark do we have a special meaning for the RESTARTING,PURGING in the doInactivateHalt()?

christoph-zededa · 2024-05-07T09:43:03Z

Do you understand which new state we transition to from RESTARTING? I do not see how we can switch to HALTED from doUpdate(), other than through these lines:

eve/pkg/pillar/cmd/zedmanager/updatestatus.go

Line 216 in c2ef643

status.State = types.HALTED
	if !effectiveActivate {
		if status.Activated || status.ActivateInprogress {
			c := doInactivateHalt(ctx, config, status)
			changed = changed || c
		}
		// Activated and ActivateInprogress flags may be changed during doInactivateHalt call
		if !status.Activated && !status.ActivateInprogress {
			// Since we are not activating we set the state to
			// HALTED to indicate it is not running since it
			// might have been halted before the device was rebooted
			if status.State == types.INSTALLED || status.State == types.START_DELAYED {
				status.State = types.HALTED
				changed = true
			}
		}
so we should set INSTALLED or START_DELAYED in the doInactivateHalt().

The other branch where we can transit to HALTED is doCleanup() called from the doInactivate(), but this is not our case I assume.

doUpdate calls doInactivateHalt, which calls doInactivateHaltFromDomainStatus, and this sets status.(*types.AppInstanceStatus).State to what is in ds.(*types.DomainStatus).State.
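
A rough sketch of that propagation, under the assumption that only the `State` field matters here. The struct definitions are trimmed-down stand-ins and `updateFromDomainStatus` is a hypothetical name, not the real doInactivateHaltFromDomainStatus.

```go
package main

import "fmt"

// SwState and its values are illustrative stand-ins for the pillar types.
type SwState int

const (
	RUNNING SwState = iota
	HALTING
	HALTED
	RESTARTING
)

// Minimal stand-ins carrying only the field discussed in the thread.
type DomainStatus struct{ State SwState }
type AppInstanceStatus struct{ State SwState }

// updateFromDomainStatus copies the domain state into the app instance
// status and reports whether anything changed. Without a guard for
// RESTARTING/PURGING, any state coming from domainmgr is taken over.
func updateFromDomainStatus(status *AppInstanceStatus, ds *DomainStatus) bool {
	if ds == nil || status.State == ds.State {
		return false
	}
	status.State = ds.State
	return true
}

func main() {
	status := &AppInstanceStatus{State: RESTARTING}
	changed := updateFromDomainStatus(status, &DomainStatus{State: HALTED})
	fmt.Println(changed, status.State == HALTED) // prints "true true"
}
```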

Does this answer your question?

rouming · 2024-05-07T10:04:29Z

Does this answer your question?

Not quite. In order to see HALTED on the cloud you should set it somewhere, right? I see only two places in the code (specified in the comment below). But we should not take any of these paths. So I assume HALTED comes from the domainmgr and we update the applicationstatus with what comes from the domainmgr (now that you removed RESTARTING, it can be updated with any state).

But! Take a look here:

eve/pkg/pillar/cmd/zedmanager/updatestatus.go

Line 716 in c2ef643

status.State = types.START_DELAYED

		// Check that we delay a not yet active VM or a VM in the bring-up state after restarting/purging
		if !status.Activated || status.RestartInprogress == types.BringUp || status.PurgeInprogress == types.BringUp {
			// If we try to activate it for the first time - mark is with the corresponding state
			if status.State != types.START_DELAYED {
				status.State = types.START_DELAYED
				return true
			}
			// if the VM is already in the START_DELAYED state - just return from the doActivate now
			return changed
		}

This is the piece of code that should eventually delay the start if we are in a restart or purge.

Then eventually this code should do the final state transit to HALTED:

eve/pkg/pillar/cmd/zedmanager/updatestatus.go

Line 216 in c2ef643

status.State = types.HALTED

	if !effectiveActivate {
		if status.Activated || status.ActivateInprogress {
			c := doInactivateHalt(ctx, config, status)
			changed = changed || c
		}
		// Activated and ActivateInprogress flags may be changed during doInactivateHalt call
		if !status.Activated && !status.ActivateInprogress {
			// Since we are not activating we set the state to
			// HALTED to indicate it is not running since it
			// might have been halted before the device was rebooted
			if status.State == types.INSTALLED || status.State == types.START_DELAYED {
				status.State = types.HALTED
				changed = true
			}
		}

Either these two chunks are for something different, or they never worked as expected and you simply bypass them for the RESTARTING case only (it seems PURGING should also be covered). Hence my question: what exact state comes from domainmgr, which you then set in the applicationstatus?

christoph-zededa · 2024-05-07T15:30:27Z

Does this answer your question?

Not quite. In order to see HALTED on the cloud you should set it somewhere, right? I see only two places in the code (specified in the comment below). But we should not take any of these paths. So I assume HALTED comes from the domainmgr and we update the applicationstatus with what comes from the domainmgr (now that you removed RESTARTING, it can be updated with any state).

But! Take a look here:

eve/pkg/pillar/cmd/zedmanager/updatestatus.go

Line 716 in c2ef643

status.State = types.START_DELAYED
		// Check that we delay a not yet active VM or a VM in the bring-up state after restarting/purging
		if !status.Activated || status.RestartInprogress == types.BringUp || status.PurgeInprogress == types.BringUp {
			// If we try to activate it for the first time - mark is with the corresponding state
			if status.State != types.START_DELAYED {
				status.State = types.START_DELAYED
				return true
			}
			// if the VM is already in the START_DELAYED state - just return from the doActivate now
			return changed
		}
This is the piece of code that should eventually delay the start if we are in a restart or purge.

Then eventually this code should do the final state transit to HALTED:

eve/pkg/pillar/cmd/zedmanager/updatestatus.go

Line 216 in c2ef643

status.State = types.HALTED
	if !effectiveActivate {
		if status.Activated || status.ActivateInprogress {
			c := doInactivateHalt(ctx, config, status)
			changed = changed || c
		}
		// Activated and ActivateInprogress flags may be changed during doInactivateHalt call
		if !status.Activated && !status.ActivateInprogress {
			// Since we are not activating we set the state to
			// HALTED to indicate it is not running since it
			// might have been halted before the device was rebooted
			if status.State == types.INSTALLED || status.State == types.START_DELAYED {
				status.State = types.HALTED
				changed = true
			}
		}
Either these two chunks are for something different, or they never worked as expected and you simply bypass them for the RESTARTING case only (it seems PURGING should also be covered). Hence my question: what exact state comes from domainmgr, which you then set in the applicationstatus?

From direct discussion:
It only goes into START_DELAYED state if

eve/pkg/pillar/cmd/zedmanager/updatestatus.go

Line 711 in c2ef643

if time.Now().Before(status.StartTime) {

Therefore it cannot work like this and it is something different.

rouming · 2024-05-08T09:08:24Z

Yes, I thought the START_DELAYED is an intermediate state, thanks to @OhmSpectator for explaining.

rouming

This basically partially reverts the original commit:

commit b24e1c2cae3e8b96934b57f1527862ec00ae4d06
Author: eriknordmark <erik@zededa.com>
Date:   Mon Oct 15 16:53:43 2018 -0700

    update from DomainStatus even if Pending

Unfortunately not too much explanation there.

We spent quite some time understanding the state transitions. I do not see any bad side effects from removing the special check for restarting/purging and letting the transition happen. We have to be sure, though, that the final state of the app is HALTED and that we don't reboot the node somewhere in the middle, otherwise customer data can be corrupted; @christoph-zededa verified that thoroughly.

eriknordmark · 2024-05-13T21:54:05Z

@eriknordmark do we have a special meaning for the RESTARTING,PURGING in the doInactivateHalt()?

They are special in that they are pending operations and not really state like the others. But we need to track that those operations are pending somewhere. I think that can be refactored so that we set HALTING, HALTED, BOOTING in State (hence report those to the controller) and we track the pending restart/purge in some separate field in AppInstanceStatus.
(Sorry, I haven't looked at the diffs in this PR yet - just got back from vacation.)

rouming · 2024-05-14T10:25:36Z

@eriknordmark do we have a special meaning for the RESTARTING,PURGING in the doInactivateHalt()?

They are special in that they are pending operations and not really state like the others. But we need to track that those operations are pending somewhere. I think that can be refactored so that we set HALTING, HALTED, BOOTING in State (hence report those to the controller) and we track the pending restart/purge in some separate field in AppInstanceStatus. (Sorry, I haven't looked at the diffs in this PR yet - just got back from vacation.)

We already have the following fields:

status.RestartInprogress == (NotInprogress | BringUp | BringDown)
status.PurgeInprogress == (NotInprogress | BringUp | BringDown)

So this information should not be lost.
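
As a sketch of why the information survives, here is a reduced model of those two fields, assuming only the three values listed above. The type layout and the `RestartPending`/`PurgePending` helpers are hypothetical and only illustrate that the pending operation can be read without consulting `State`.

```go
package main

import "fmt"

// Inprogress models the three values named in the comment above;
// the real type may carry more values and different numbering.
type Inprogress int

const (
	NotInprogress Inprogress = iota
	BringDown
	BringUp
)

// AppInstanceStatus is a stand-in with only the fields discussed here.
type AppInstanceStatus struct {
	RestartInprogress Inprogress
	PurgeInprogress   Inprogress
}

// RestartPending reports whether a restart is underway (hypothetical helper).
func (s AppInstanceStatus) RestartPending() bool {
	return s.RestartInprogress != NotInprogress
}

// PurgePending reports whether a purge is underway (hypothetical helper).
func (s AppInstanceStatus) PurgePending() bool {
	return s.PurgeInprogress != NotInprogress
}

func main() {
	s := AppInstanceStatus{RestartInprogress: BringDown}
	fmt.Println(s.RestartPending(), s.PurgePending())
}
```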

eriknordmark · 2024-05-14T15:23:49Z

So this information should not be lost.

Can you all check whether there is any code which looks at the app instance status State field and does something different if it is RESTARTING or PURGING (apart from the lines which avoid overwriting those values)?
Or does all of the logic inside zedmanager look at RestartInprogress and PurgeInprogress? If so we can just stop using the RESTARTING and PURGING states - the reported state will then become RUNNING -> HALTING -> HALTED -> BOOTING -> RUNNING when someone does a restart or a purge.

"Ask not what you can add but what you can remove".

christoph-zededa · 2024-05-14T16:06:51Z

Can you all check whether there is any code which looks at the app instance status State field and does something different if it is RESTARTING or PURGING (apart from the lines which avoid overwriting those values)? Or does all of the logic inside zedmanager look at RestartInprogress and PurgeInprogress? If so we can just stop using the RESTARTING and PURGING states - the reported state will then become RUNNING -> HALTING -> HALTED -> BOOTING -> RUNNING when someone does a restart or a purge.

"Ask not what you can add but what you can remove".

Do you mean:

1. Only in this special case when app upgrade and eve node update are done at the same time? Then it should not use RESTARTING?
2. In general RESTARTING should not be used?
   In this case what should we report to the controller when we do a restart?

RESTARTING is used (including reads):

types.go:47
eve/pkg/pillar/cmd/zedagent/parseconfig.go:222
eve/pkg/pillar/cmd/zedmanager/updatestatus.go:806
eve/pkg/pillar/cmd/zedmanager/updatestatus.go:1178
eve/pkg/pillar/types/types.go:101
eve/pkg/pillar/types/types.go:173

IMO the only interesting check is the one in parseconfig.go:222, in countRunningApps.

The same holds for PURGING.

eriknordmark · 2024-05-14T20:37:55Z

2. In general RESTARTING should not be used?
In this case what should we report to the controller when we do a restart?

Yes, that is what I mean.

The user asked to see HALTED, which means the underlying behavior from domainmgr will report RUNNING -> HALTING -> HALTED -> BOOTING -> RUNNING as the restart proceeds. Same for purging.
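
One way a test could pin down that flow is sketched below. It assumes status updates may repeat a state, so consecutive duplicates are collapsed before comparing; the enum values and the helper names `dedupe` and `followsRestartFlow` are hypothetical.

```go
package main

import "fmt"

// SwState and its values are illustrative stand-ins for the pillar types.
type SwState int

const (
	RUNNING SwState = iota
	HALTING
	HALTED
	BOOTING
	RESTARTING
)

// restartFlow is the sequence described in the thread for a restart.
var restartFlow = []SwState{RUNNING, HALTING, HALTED, BOOTING, RUNNING}

// dedupe collapses consecutive repeats of the same state.
func dedupe(states []SwState) []SwState {
	var out []SwState
	for _, s := range states {
		if len(out) == 0 || out[len(out)-1] != s {
			out = append(out, s)
		}
	}
	return out
}

// followsRestartFlow reports whether the observed states, after
// collapsing repeats, match restartFlow exactly.
func followsRestartFlow(observed []SwState) bool {
	got := dedupe(observed)
	if len(got) != len(restartFlow) {
		return false
	}
	for i := range got {
		if got[i] != restartFlow[i] {
			return false
		}
	}
	return true
}

func main() {
	ok := []SwState{RUNNING, HALTING, HALTING, HALTED, BOOTING, RUNNING}
	old := []SwState{RUNNING, RESTARTING, RUNNING}
	fmt.Println(followsRestartFlow(ok), followsRestartFlow(old))
}
```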

christoph-zededa · 2024-05-16T13:06:13Z

Yes, that is what I mean.

The user asked to see HALTED, which means the underlying behavior from domainmgr will report RUNNING -> HALTING -> HALTED -> BOOTING -> RUNNING as the restart proceeds. Same for purging.

Hmm, I tried, but I don't think this is the way forward:

diff --git a/pkg/pillar/cmd/zedagent/handlemetrics.go b/pkg/pillar/cmd/zedagent/handlemetrics.go
index 2704d45582..d20c6c9b64 100644
--- a/pkg/pillar/cmd/zedagent/handlemetrics.go
+++ b/pkg/pillar/cmd/zedagent/handlemetrics.go
@@ -1034,10 +1034,50 @@ func encodeProxyStatus(proxyConfig *types.ProxyConfig) *info.ProxyStatus {
        return status
 }
 
+var ignoreSendingOutFollowingRestarts map[string]struct{}
+
+func init() {
+       ignoreSendingOutFollowingRestarts = make(map[string]struct{})
+}
+
+func PublishAppInfoToZedCloud(ctx *zedagentContext, uuid string,
+       aiStatus *types.AppInstanceStatus,
+       aa *types.AssignableAdapters, iteration int, dest destinationBitset) {
+       if aiStatus.State != types.RESTARTING {
+               publishAppInfoToZedCloud(ctx, uuid, aiStatus, aa, iteration, dest)
+               delete(ignoreSendingOutFollowingRestarts, uuid)
+               return
+       }
+
+       _, found := ignoreSendingOutFollowingRestarts[uuid]
+       if found {
+               return
+       }
+
+       for _, state := range []types.SwState{types.HALTING, types.HALTED, types.BOOTING} {
+               aiStatusCopy := *aiStatus
+               aiStatusCopy.State = state
+
+               publishAppInfoToZedCloud(ctx, uuid, &aiStatusCopy, aa, iteration, dest)
+
+               // do not overwrite states, but send them out first!
+               if dest&ControllerDest != 0 {
+                       deferredCtx := zedcloudCtx.DeferredEventCtx
+                       deferredCtx.HandleDeferredNow()
+               }
+               locConfig := ctx.getconfigCtx.sideController.locConfig
+               if dest&LOCDest != 0 && locConfig != nil {
+                       zedcloudCtx.DeferredPeriodicCtx.HandleDeferredNow()
+               }
+
+               ignoreSendingOutFollowingRestarts[uuid] = struct{}{}
+       }
+}
+
 // This function is called per change, hence needs to try over all management ports
 // When aiStatus is nil it means a delete and we send a message
 // containing only the UUID to inform zedcloud about the delete.
-func PublishAppInfoToZedCloud(ctx *zedagentContext, uuid string,
+func publishAppInfoToZedCloud(ctx *zedagentContext, uuid string,
        aiStatus *types.AppInstanceStatus,
        aa *types.AssignableAdapters, iteration int, dest destinationBitset) {
        log.Functionf("PublishAppInfoToZedCloud uuid %s", uuid)
diff --git a/pkg/pillar/zedcloud/deferred.go b/pkg/pillar/zedcloud/deferred.go
index 5e2610c0c0..34039a82d2 100644
--- a/pkg/pillar/zedcloud/deferred.go
+++ b/pkg/pillar/zedcloud/deferred.go
@@ -132,6 +132,10 @@ func (ctx *DeferredContext) processQueueTask(ps *pubsub.PubSub,
        }
 }
 
+func (ctx *DeferredContext) HandleDeferredNow() {
+       ctx.handleDeferred()
+}
+

It basically checks if the state is RESTARTING and then sends out (and immediately flushes) the states HALTING, HALTED, BOOTING.

The only way that RESTARTING can be set in domainStatus.State (afaik) is in zedmanager.go under

  1224      if config.RestartCmd.Counter != oldConfig.RestartCmd.Counter ||                                                                                                   
  1225          config.LocalRestartCmd.Counter != oldConfig.LocalRestartCmd.Counter {

christoph-zededa · 2024-05-16T13:50:29Z

The only way that RESTARTING can be set in domainStatus.State (afaik) is in zedmanager.go under

  1224      if config.RestartCmd.Counter != oldConfig.RestartCmd.Counter ||                                                                                                   
  1225          config.LocalRestartCmd.Counter != oldConfig.LocalRestartCmd.Counter {

@rouming pointed out that all this is not necessary and that I can just remove the

 1238              status.State = types.RESTARTING

eriknordmark · 2024-05-16T15:31:03Z

Hmm, I tried, but I don't think this is the way forward:

I'm not saying you should artificially generate those states, but instead use the existing logic in zedmanager to send those states. This implies that you'd never set the state to RESTARTING or PURGING. Thus the code can not read those but should rely on the restart/purge inprogress fields to have zedmanager go through exactly the same steps as it does today. Only visible change should be that zedagent never sees, hence doesn't send, RESTARTING or PURGING.

We can chat tomorrow if that is helpful - and note that I haven't looked at the existing code recently so I might be missing things.

state transition from RESTARTING to HALT(ING) if an edge app upgrade is done while an EVE update is done, the app status state stays in RESTARTING although EVE will restart and therefore halt the app instance; therefore the app status should go to HALT in this case. Signed-off-by: Christoph Ostarek <christoph@zededa.com>

OhmSpectator · 2024-05-17T16:18:07Z

I will retest the PR manually... I wanna be sure it does not break my stuff.

OhmSpectator · 2024-05-17T16:56:28Z

I've tested what I wanted, and it looks fine...

But we should also announce this change loud, I believe:

The test team may have some tests that expect the states as they are now.
As it's seen in the UI in the events flow, we should inform users about the change. Users may get used to the current events flow and expect this to remain the same. In the worst case, it can break some of their automation.

eriknordmark · 2024-05-20T15:52:58Z

I've tested what I wanted, and it looks fine...

But we should also announce this change loud, I believe:

The test team may have some tests that expect the states as they are now.

As it's seen in the UI in the events flow, we should inform users about the change. Users may get used to the current events flow and expect this to remain the same. In the worst case, it can break some of their automation.

In addition to testing and announcing, please check the docs we have in the eve repo and check with the ZEDEDA commercial controller documentation if there are things which need to be updated there. (We don't have documentation at this level of detail for Adam/Eden so no need to check there.)

eriknordmark

LGTM

christoph-zededa requested a review from rouming as a code owner May 6, 2024 13:08

christoph-zededa marked this pull request as draft May 6, 2024 13:08

github-actions bot requested review from eriknordmark and OhmSpectator May 6, 2024 13:08

christoph-zededa force-pushed the doInactivateHaltTest branch from ea9e22e to e75add8 Compare May 6, 2024 13:16

christoph-zededa marked this pull request as ready for review May 6, 2024 15:30

christoph-zededa force-pushed the doInactivateHaltTest branch from e75add8 to 51b96f5 Compare May 7, 2024 15:27

rouming approved these changes May 8, 2024

christoph-zededa force-pushed the doInactivateHaltTest branch from 51b96f5 to 2628cce Compare May 17, 2024 16:00

christoph-zededa requested a review from milan-zededa as a code owner May 17, 2024 16:00

github-actions bot requested review from rouming and uncleDecart May 17, 2024 16:00

christoph-zededa force-pushed the doInactivateHaltTest branch from 2628cce to 4c059a3 Compare May 17, 2024 16:10

christoph-zededa added 2 commits May 17, 2024 18:11

zedcloud/deferred: clarify variable name

31d0d4e

'lock' is used for locking deferredItems, so let's clarify this! Signed-off-by: Christoph Ostarek <christoph@zededa.com>

christoph-zededa force-pushed the doInactivateHaltTest branch from 4c059a3 to e8cc0a6 Compare May 17, 2024 16:11

christoph-zededa changed the title ~~Allow HALT status to be sent to controller when in RESTART state~~ Remove RESTART/PURGING states May 17, 2024

christoph-zededa added 2 commits May 17, 2024 18:28

zedmanager: do not report RESTARTING or PURGING

cfd5c60

include removing dead code Signed-off-by: Christoph Ostarek <christoph@zededa.com> Signed-off-by: Christoph Ostarek <christoph@zededa.com>

zedmanager: report HALTING state correctly when

888a81d

doing a purge; otherwise HALTED state would be reported directly Signed-off-by: Christoph Ostarek <christoph@zededa.com>

christoph-zededa force-pushed the doInactivateHaltTest branch from e8cc0a6 to 888a81d Compare May 17, 2024 16:28

eriknordmark approved these changes May 20, 2024

rouming approved these changes May 23, 2024
