
Restic repos constantly getting locked, and piling up jobs until manually unlocked. #1042

Open
erenfro opened this issue Dec 19, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@erenfro

erenfro commented Dec 19, 2023

Describe the bug
I'm constantly seeing restic repos stuck in a locked state: the logs show an exclusive lock on the repository, and yet no backups are running. I'm seeing upwards of 7 backup jobs in red (in k9s's view), all of whose logs show:

Starting container
VolSync restic container version: unknown
backup
restic 0.16.2 compiled with go1.21.3 on linux/amd64
Testing mandatory env variables
== Checking directory for content ===
== Initialize Dir =======
repo already locked, waiting up to 0s for the lock
unable to create lock in backend: repository is already locked exclusively by PID 48 on volsync-src-mealie-96mvk by (UID 0, GID 0)
lock was created at 2023-12-19 12:02:00 (5h24m42.719315996s ago)
storage ID b27ebdc3
the unlock command can be used to remove stale locks
ERROR: failure checking existence of repository
Stream closed EOF for selfhosted/volsync-src-mealie-vrl4b (restic)

Steps to reproduce
Set up volsync to back up using restic, and wait. That's all I did, and it started happening frequently. You can see my setup example here:
Template: https://github.com/erenfro/homelab-flux/tree/main/kubernetes/templates/volsync
Application: https://github.com/erenfro/homelab-flux/tree/main/kubernetes/apps/selfhosted/mealie

Expected behavior
Locks either don't happen at all, or volsync at least waits longer than 0s for a lock to be cleared; failing that, it does some sanity checking and/or attempts to unlock on its own to remove a stale lock and try again, rather than failing outright.
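If I'm reading it right, the "waiting up to 0s" line in the log above is restic's lock-retry wait; restic 0.16 can wait for a lock when given a duration, so a longer wait seems feasible in principle. A rough sketch of the equivalent manual commands (the path is a placeholder, and this assumes the usual RESTIC_REPOSITORY / RESTIC_PASSWORD env vars are already exported):

restic backup --retry-lock 10m /data   # wait up to 10 minutes for an existing lock to clear
restic unlock                          # what currently has to be run by hand to remove a stale lock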

Actual results
The repo becomes locked with a stale lock, backup jobs queue up and error out, and backups cease until the lock is manually cleared.

Additional context

Not sure how else to help, but I hope I've provided enough information to identify the problem itself.

erenfro added the bug label on Dec 19, 2023
@tesshuflower
Contributor

@erenfro Do you see any of the mover pods get killed or fail prior to the lock getting left behind?

FYI you can run an unlock from a replicationsource if you don't want to do it manually - see unlock under Backup options here: https://volsync.readthedocs.io/en/v0.8.0/usage/restic/index.html
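Roughly, that looks like the following in the ReplicationSource spec (a sketch only; names and values here are placeholders, field layout per the v0.8.0 docs linked above):

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: mealie
  namespace: selfhosted
spec:
  sourcePVC: mealie
  trigger:
    schedule: "0 * * * *"
  restic:
    repository: mealie-restic-secret   # Secret holding RESTIC_REPOSITORY, credentials, etc.
    copyMethod: Snapshot
    # Changing this string causes the next sync to run an unlock first;
    # re-using the same value does nothing further.
    unlock: unlock-2023-12-19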

@erenfro
Author

erenfro commented Dec 19, 2023

Per the documentation, that takes a string value, and if it's the same value as before it won't unlock again. So it seems like it would work once, but never again unless the value is updated again. Since this stale lock is happening multiple times a day, that wouldn't help.

@JohnStrunk
Member

That's correct, the unlock in the ReplicationSource will only unlock once per value. The reason we don't actively unlock the repo is that restic uses those locks to prevent corrupting the repository.

The exclusive lock happens during the prune operation, so there's something about the environment that is leading to restic failing during that operation. Instead of unlocking aggressively, let's figure out what's causing the failures...

Things that come to mind are:

  • Potential OOM errors during the prune operation (does the node have sufficient memory?)
  • Out of disk space for the Restic cache volume (you can check to see if the cache PVC is full)
  • Damaged repository (can you prune or check the repo manually? a rough sketch follows this list)
  • Does the repository have sufficient free space to grow (is your object store full?)
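For the manual check, roughly what that looks like against the same MinIO bucket (the repo URL, paths, and credentials below are placeholders; the restic commands themselves are standard):

export AWS_ACCESS_KEY_ID=<minio-access-key>
export AWS_SECRET_ACCESS_KEY=<minio-secret-key>
export RESTIC_REPOSITORY=s3:https://<minio-host>/<bucket>/<pvc-repo-path>
export RESTIC_PASSWORD=<repo-password>

restic list locks   # list lock IDs present in the repo
restic unlock       # remove stale locks (only safe while no backup/prune is running)
restic check        # verify repository integrity
restic prune        # the operation that takes the exclusive lock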

@erenfro
Author

erenfro commented Dec 20, 2023

That's correct, the unlock in the ReplicationSource will only unlock once per value. The reason we don't actively unlock the repo is that restic uses those locks to prevent corrupting the repository.

The exclusive lock happens during the prune operation, so there's something about the environment that is leading to restic failing during that operation. Instead of unlocking aggressively, let's figure out what's causing the failures...

This I very much do understand. I actually run resticprofile to back up all my Proxmox VE instances as well, on a nightly basis. My resticprofile backups use the same MinIO S3 storage server running directly on my Synology NAS, and never have issues with locking.

Things that come to mind are:

* Potential OOM errors during the prune operation (does the node have sufficient memory?)

I'm assuming you mean OOM kills by the kernel? No nodes show any OOM action taken. Furthermore, each of the K3s nodes is running at 70% or less memory capacity, which leaves roughly 6GB of available RAM, often more, per node.
In the past I had seen issues with descheduler; I ripped it out until I can understand and tune it better, but the volsync lock issues persisted beyond its removal.

* Out of disk space for the Restic cache volume (you can check to see if the cache PVC is full)

My restic cache volumes are generally set up with 8Gi, but they're also using local-path, and the local-path volumes have 100Gi or more available. The snapshot volumes are set up with 8Gi, and my average origin volume right now is only 2Gi. My Nextcloud instance was the only one using more than 2Gi, and I've since moved it to direct NFS instead.
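For reference, the cache sizing lives in the ReplicationSource, roughly like this (a sketch; field names per the VolSync restic docs, values as described above):

spec:
  restic:
    cacheCapacity: 8Gi
    cacheStorageClassName: local-path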

* Damaged repository (can you `prune` or `check` the repo manually?)

I've set up a script that uses s3cmd and generates a resticprofile configuration with all of the pvc- backups in it, so I can easily run resticprofile pvc-. on them.
I checked a few, and the locks are upwards of 4-5 hours old before I've noticed them... sometimes more! Restic itself has literally considered them stale, but it still requires a manual "unlock" command to be issued. I ran unlock, check, and prune, without any issues, each time.

* Does the repository have sufficient free space to grow (is your object store full?)

The NAS server has 3 TiB free space.

This is why I'm scratching my head at all this: I'm running restic without issue elsewhere. Granted, the main difference is that it's a nightly backup rather than hourly, while volsync backs up hourly, often with little change between each backup.

@JohnStrunk
Member

Sounds like you've checked the obvious stuff. My next thought is that you're going to have to catch it when it actually happens. Probably the easiest way to do that is to use the manual trigger and capture logs of the mover pod failure.
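Roughly, that means switching the ReplicationSource trigger to a manual one (a sketch; the trigger string is an arbitrary placeholder):

spec:
  trigger:
    manual: run-2023-12-20

and then watching the mover with something like kubectl -n selfhosted logs -f <volsync-src-mealie pod> while the sync runs, so the tail of the failing prune gets captured.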
