Cleanup from staged deployments #2510

Open
dbnicholson opened this issue Jan 6, 2022 · 9 comments · May be fixed by #2511

Comments

@dbnicholson
Member

I finally got around to changing our updater to use staged deployments, and one thing we lose is pruning of the rollback deployment. Since the ref isn't removed until the new deployment is finalized, the objects are still on disk until some later process prunes the repo. Our updater runs a full cleanup after staging, so the old rollback deployment would get pruned when a new update comes in. However, that may not happen for a long time, and it effectively means that you always have 3 deployments on disk.

We could solve this downstream in Endless, but a fix is something that any user of ostree staged deployments could benefit from. My idea is to have a sysroot autocleanup mode that only runs the cleanup if a known file exists and then deletes it when the cleanup completes.

For example, /sysroot/.cleanup is written out when the deployment is finalized. Add an API (or a cmdprivate) that runs ostree_sysroot_cleanup only when /sysroot/.cleanup exists and deletes it when done. Call this from ostree admin cleanup --auto. Add a systemd unit like ostree-sysroot-auto-cleanup.service with ConditionPathExists=/sysroot/.cleanup and ExecStart=/usr/bin/ostree admin cleanup --auto.
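
A minimal sketch of what that unit could look like, using the names from the proposal above; note that the `--auto` flag and the unit itself are part of the proposal, not something ostree ships today:

```ini
# ostree-sysroot-auto-cleanup.service (proposed, not an existing unit)
[Unit]
Description=Automatic OSTree sysroot cleanup after a finalized staged deployment
# Only run when finalization left the proposed stamp file behind
ConditionPathExists=/sysroot/.cleanup

[Service]
Type=oneshot
# --auto is the proposed behavior: clean up only if the stamp file exists,
# then delete it when the cleanup completes
ExecStart=/usr/bin/ostree admin cleanup --auto

[Install]
WantedBy=multi-user.target
```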

WDYT?

@cgwalters
Member

Hmm. I am not opposed to this, but so far the vision for ostree has been more of a library. That said, this has come up a few times, and it would be really nice to try to have more shared daemon code. I've had this offhand thought that we could try to start that daemon code in Rust in ostree-rs-ext?

OTOH, we could also ship some services like this as a build-time (off by default?) option?

The problem I see here is that if we suddenly start shipping more on-by-default systemd services, we could be interfering with user code.

@dbnicholson
Member Author

Right, there are definitely some competing interests here.

  • Any updater daemon may want to handle this kind of cleanup itself, and then the systemd service is going to get in the way.

  • The logic to encode this state really belongs in the staging finalization. When we were discussing this for eos-updater, @wjt suggested the stamp file has to be in /etc so that you don't get into a situation where eos-updater creates the stamp file, you have an unclean shutdown, and then the cleanup mechanism triggers on the next boot even though the deployment was never actually written out. Having ostree create it after the call to write out the deployment means it can only exist when a staged deployment has been finalized but not pruned.

  • I had trouble figuring out where to actually put this in eos-updater. Startup? A random idle time? Having an independent service that handles this during boot seemed nice even though it might lock the repo and block eos-updater.

I'm just about done with a PR to implement this, but I think with just the mechanism in place any downstream can decide how to handle it. I.e., if /sysroot/.cleanup (or whatever) is written out during finalization and ostree_sysroot_auto_cleanup exists, then it's trivial to call it. That could be from a daemon or a script in a systemd service or whatever.

@cgwalters
Member

One thing that has come up in the past too is that in some cases in ostree core, we may want a more generic post-update service which could handle anything that we needed to defer to the next boot. The way I've been thinking of this is that it'd actually run in the initramfs, before or after the pivot root. This would allow us to perform "fixups" which could include cleanup.

Having this in the initramfs would avoid any race conditions with update services in the real root.

But OTOH, I think we want to get out of the initramfs as fast as possible, and this has the potential to block for a while.

I suspect in your case users would prefer to get a usable desktop as fast as possible after an update, with a GC operation running in the background, rather than blocking bootup. That's related to this concern:

Having an independent service that handles this during boot seemed nice even though it might lock the repo and block eos-updater.

I hadn't considered this issue much; for the most part the space leakage from the rollback deployment hasn't been a problem in our cases.

Short term, though, given the above issues it seems like it makes the most sense to have higher-level code (eos-updater in this case) own this problem?

@cgwalters
Member

I had trouble figuring out where to actually put this in eos-updater. Startup? A random idle time?

I think that's just it though - there's a clear need for configurability and control here. And, likely the ability to cancel it. That bit relates to e.g. coreos/rpm-ostree#2969 - rpm-ostree internally has this concept of a single "transaction" operation that can operate on the repo at a time, but it's cancellable. So for rpm-ostree it'd be a much better fit to do this kind of thing internally because then it's more cleanly cancellable. (That said, we could systemctl stop of course too)

It may also help for the desktop use case to make this an explicit "background" operation, i.e. ionice etc. (Though IME the ostree case can be heavy on filesystem metadata, i.e. the journal, which causes contention with other users even though we're niced. xref openshift/machine-config-operator#1897, where I did a ton of investigation into trying to do "background" updates on the OpenShift control plane nodes, which run etcd, which really wants all the I/O it can get.)
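
For illustration, a hypothetical drop-in showing how such a cleanup unit could be deprioritized with systemd's resource-control directives; as noted above, this does not eliminate contention from metadata/journal-heavy I/O:

```ini
# Hypothetical drop-in for a cleanup service; these are standard systemd
# directives, but whether they help enough here is an open question
[Service]
Nice=19
IOSchedulingClass=idle
CPUSchedulingPolicy=idle
```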

@dbnicholson
Member Author

Fair points. I'll post my PR as a proof of concept, but I'll carry on handling this downstream. One thing I realized is that I can entirely emulate this now by adding a drop-in for ostree-finalize-staged.service that just has:

[Service]
ExecStop=-/bin/touch /sysroot/.cleanup

That would run only after ostree admin finalize-staged succeeded. And then we can just add our own systemd unit that runs ostree admin cleanup with ConditionPathExists=/sysroot/.cleanup.
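
A sketch of what that downstream unit could look like; the unit name is omitted and the use of `ExecStartPost` to remove the stamp file is an illustrative choice, not something ostree provides:

```ini
# Hypothetical downstream unit pairing with the drop-in above
[Unit]
Description=Clean up the OSTree repo after a finalized staged deployment
# Only run when the drop-in left the stamp file behind
ConditionPathExists=/sysroot/.cleanup

[Service]
Type=oneshot
ExecStart=/usr/bin/ostree admin cleanup
# Remove the stamp so cleanup only runs once per finalized deployment
ExecStartPost=/bin/rm -f /sysroot/.cleanup

[Install]
WantedBy=multi-user.target
```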

@dbnicholson dbnicholson linked a pull request Jan 7, 2022 that will close this issue
@dbnicholson
Member Author

My POC is in #2511. Let me know what you think.

@lucab
Member

lucab commented Jan 10, 2022

I'm not intimately familiar with eos-updater flow, so please bear with me if my questions below are imprecise.

I finally got around to changing our updater to use staged deployments and one thing we lose is pruning of the rollback deployment. Since the ref isn't removed until the new deployment is finalized, the objects are still on disk until some later process prunes.
Our updater runs a full cleanup after staging, so the old rollback deployment would get pruned when a new update comes in. However, that may not happen for a long time and it effectively means that you always have 3 deployments on disk.

I'm not fully understanding what happens here and what your concerns are.

The 3 deployments should only be present between the time an update is received/staged and the corresponding reboot/finalization, correct? Before an update is staged, I'd only expect 1 (or 2, if a rollback already exists) to be present, right? After the reboot/finalization, the 3 deployments should rotate back to 2, I think?

If so, that sounds similar to how rpm-ostree works, which allows you to either 1) roll back to the previous deployment, or 2) finalize the pending one. Or did I misunderstand that?
What problems does this bring in your context? Which deployment would you want to see disappear, at which point and under which conditions?
Would you maybe prefer dropping the rollback one as soon as the new update is staged? Is that closer to your current updater flow (prior to the new staged logic)?

@dbnicholson
Member Author

Currently the old deployment is deleted and the repo is pruned right after the new deployment is written out with simple_write_deployment. With staged deployments, that all happens at shutdown, with the design decision not to prune the repo since that could block shutdown for a long time.

I agree with that decision, but it means the repo is still holding the objects from that old deployment. Effectively you have 3 commits on disk even though only 2 are actual deployments. You can try this now if you already have 2 deployments. Try pruning the repo immediately after booting into a new deployment and you'll find there are objects pruned even if you haven't pulled anything.

Many of our users are on much lower-spec hardware, so they might not have piles of disk space to waste. Furthermore, many of our users may not actually upgrade that often, so that old rollback commit may actually be quite different and have a significant number of objects that would be pruned. So for Endless I'd consider it a regression to leave dangling objects from an old OS commit indefinitely.

dbnicholson added a commit to endlessm/eos-updater that referenced this issue Jan 11, 2022
When OSTree staged deployments are used, the old rollback deployment is
deleted during system shutdown. To keep from slowing down shutdown, the
OSTree repo is not pruned at that time. That means that even though the
deployment was deleted, the objects are still on disk. Since that may be
a significant amount of wasted disk space, the full cleanup with repo
pruning needs to be run at some time after rebooting. See
ostreedev/ostree#2510 for details.

To detect when cleanup is necessary, a systemd drop-in is added to touch
the `/sysroot/.cleanup` file after `ostree-finalize-staged.service` has
finalized the new deployment. The reason to use a drop-in for
`ostree-finalize-staged.service` rather than creating the file from
`eos-updater` is to avoid the situation where an unclean shutdown occurs
and the new deployment is not finalized. In that case, cleanup would be
run unnecessarily on the next boot.

A new systemd service, `eos-updater-autocleanup.service`, is added to
run `ostree admin cleanup` when `/sysroot/.cleanup` exists and then
delete it afterwards. This adds a dependency on the `ostree` CLI but a
separate program could be provided calling the `ostree_sysroot_cleanup`
API and deleting the `/sysroot/.cleanup` file itself.

https://phabricator.endlessm.com/T5658
dbnicholson added a commit to endlessm/eos-updater that referenced this issue Jan 12, 2022
@jlebon
Member

jlebon commented Jan 17, 2022

Hmm, here's another half-baked idea: at staging time, we also perform a mock prune where we gather the list of files to be deleted and store it somewhere under /run (as part of the -staged object?). Then at finalization time, we use the list to know what to delete so that we don't have to incur another reachability crawl.

We would need to handle invalidation of the list correctly (e.g. include a state SHA of all refs or something) but it should make the operation much less I/O intensive (though obviously there's still some base cost in deleting files).
