Backup stops working without showing any errors #243

Closed
mattiarainieri opened this issue Apr 23, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@mattiarainieri

Good Morning

Describe the bug
A while after the backup starts running, it stops working without showing any errors. Network usage drops to 0%, and to resume I need to delete the backup and force a new one. I'm using rclone to sync over WebDAV; I was not able to see or enable any logs 🥲
Thanks everyone for your help

To Reproduce
Steps to reproduce the behavior:

  1. Create a Hetzner Storage Box
  2. Initialize it with rclone config, using WebDAV as the protocol (a sample remote definition follows this list)
  3. Start a new backup
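
For reference, the WebDAV remote that rclone config writes to rclone.conf ends up looking roughly like this (the hostname, user, and password below are placeholders, not my real Storage Box credentials):

[storagebox]
type = webdav
# Hetzner Storage Boxes serve WebDAV at https://<username>.your-storagebox.de
url = https://uXXXXXX.your-storagebox.de
vendor = other
user = uXXXXXX
pass = <password as obscured by rclone>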

Screenshots
[Screenshot attached: 2024-04-23 at 10:59:12]

Platform Info

  • OS and Architecture: Docker on Ubuntu 22.04 x86
  • Backrest Version: [e.g. 0.17.2]
@mattiarainieri added the bug label on Apr 23, 2024
@mattiarainieri changed the title from "Add rclone logs" to "Backup stops working without showing any errors" on Apr 23, 2024
@garethgeorge
Owner

Hey, you should have logs available in systemd, or you can try running backrest manually (i.e. outside systemd) and capturing its output to a file.
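
Something along these lines works for the manual run (the binary path and log file name here are just examples):

# run backrest in the foreground, capturing stdout and stderr to a file
./backrest > /tmp/backrest.log 2>&1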

@mattiarainieri
Author

I'm running it using the official Docker container, but both systemd and journalctl are missing inside it. Would you suggest running it on the host operating system instead?

@garethgeorge
Owner

Ah, sorry, I missed that you're using Docker. docker logs <container name> should return the logs for the container.
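
For example, assuming the container is named backrest:

# show the last 100 lines and keep following new output
docker logs --tail 100 -f backrest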

@wirehiker

Platform:

  • docker-ce 5:26.1.0-1~debian.12~bookworm on OMV 7.0.5-1 (Sandworm)
  • Backrest 0.17.2

I experienced the same problem with slight variations: the backup repository was on an internal HDD, so I didn't use rclone to set up the repository.
The backup task stalled at ~8% after a few minutes and made no progress over the next 10 hours, so I considered it stuck and cancelled the process.

To Reproduce
Steps to reproduce the behavior:

  • Create a repository on an internal folder
  • Create a backup plan
  • Start a new Backup

The log file doesn't show any error or warning:
backrest | 2024-04-27T19:34:08.521+0200 DEBUG loading auth secret from file
backrest | 2024-04-27T19:34:08.530+0200 INFO scheduling default tasks, waiting for task queue reset.
backrest | 2024-04-27T19:34:08.530+0200 INFO reset task queue, scheduling new task set.
backrest | 2024-04-27T19:34:08.530+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-27T19:34:13+02:00"}
backrest | 2024-04-27T19:34:08.531+0200 INFO starting orchestrator loop
backrest | 2024-04-27T19:34:08.533+0200 INFO starting web server :9898
backrest | 2024-04-27T19:34:13.533+0200 INFO running task {"task": "collect garbage"}
backrest | 2024-04-27T19:34:13.539+0200 INFO collecting garbage {"forgotten_snapshots": 0, "operations_removed": 0}
backrest | 2024-04-27T19:34:13.539+0200 INFO task finished {"task": "collect garbage", "duration": "5.859073ms"}
backrest | 2024-04-27T19:34:13.539+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-28T19:34:13+02:00"}
backrest | 2024-04-27T19:35:14.649+0200 INFO scheduling default tasks, waiting for task queue reset.
backrest | 2024-04-27T19:35:14.649+0200 DEBUG queued task cancelled due to config change {"task": "collect garbage"}
backrest | 2024-04-27T19:35:14.649+0200 INFO reset task queue, scheduling new task set.
backrest | 2024-04-27T19:35:14.649+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-27T19:35:19+02:00"}
backrest | 2024-04-27T19:35:19.654+0200 INFO running task {"task": "collect garbage"}
backrest | 2024-04-27T19:35:19.662+0200 INFO collecting garbage {"forgotten_snapshots": 0, "operations_removed": 0}
backrest | 2024-04-27T19:35:19.662+0200 INFO task finished {"task": "collect garbage", "duration": "7.951121ms"}
backrest | 2024-04-27T19:35:19.662+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-28T19:35:19+02:00"}
backrest | 2024-04-27T19:36:16.960+0200 DEBUG updating config {"version": 1}
backrest | 2024-04-27T19:36:16.968+0200 DEBUG applying config {"version": 1}
backrest | 2024-04-27T19:36:16.968+0200 INFO scheduling default tasks, waiting for task queue reset.
backrest | 2024-04-27T19:36:16.968+0200 DEBUG queued task cancelled due to config change {"task": "collect garbage"}
backrest | 2024-04-27T19:36:16.968+0200 INFO reset task queue, scheduling new task set.
backrest | 2024-04-27T19:36:16.968+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-27T19:36:21+02:00"}
backrest | 2024-04-27T19:36:16.969+0200 DEBUG scheduling index snapshots task
backrest | 2024-04-27T19:36:16.969+0200 INFO scheduling task {"task": "index snapshots for plan \"matteo.local.repo\"", "runAt": "2024-04-27T19:36:16+02:00"}
backrest | 2024-04-27T19:36:16.969+0200 DEBUG done add repo
backrest | 2024-04-27T19:36:16.969+0200 INFO running task {"task": "index snapshots for plan \"matteo.local.repo\""}
backrest | 2024-04-27T19:36:17.743+0200 DEBUG indexed known snapshot IDs for repo matteo.local.repo in 56.001µs
backrest | 2024-04-27T19:36:17.748+0200 DEBUG indexed snapshots {"repo": "matteo.local.repo", "duration": "4.261685ms", "alreadyIndexed": 0, "newlyAdded": 0, "markedForgotten": 0}
backrest | 2024-04-27T19:36:17.748+0200 INFO task finished {"task": "index snapshots for plan \"matteo.local.repo\"", "duration": "778.73811ms"}
backrest | 2024-04-27T19:36:21.972+0200 INFO running task {"task": "collect garbage"}
backrest | 2024-04-27T19:36:21.978+0200 INFO collecting garbage {"forgotten_snapshots": 0, "operations_removed": 0}
backrest | 2024-04-27T19:36:21.978+0200 INFO task finished {"task": "collect garbage", "duration": "5.806488ms"}
backrest | 2024-04-27T19:36:21.978+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-28T19:36:21+02:00"}
backrest | 2024-04-27T19:40:45.238+0200 INFO scheduling default tasks, waiting for task queue reset.
backrest | 2024-04-27T19:40:45.238+0200 DEBUG queued task cancelled due to config change {"task": "collect garbage"}
backrest | 2024-04-27T19:40:45.238+0200 INFO reset task queue, scheduling new task set.
backrest | 2024-04-27T19:40:45.238+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-27T19:40:50+02:00"}
backrest | 2024-04-27T19:40:45.244+0200 INFO scheduling task {"task": "backup for plan \"matteo.partial.plan\"", "runAt": "2024-04-27T19:42:00+02:00"}
backrest | 2024-04-27T19:40:50.240+0200 INFO running task {"task": "collect garbage"}
backrest | 2024-04-27T19:40:50.250+0200 INFO collecting garbage {"forgotten_snapshots": 0, "operations_removed": 0}
backrest | 2024-04-27T19:40:50.250+0200 INFO task finished {"task": "collect garbage", "duration": "9.699785ms"}
backrest | 2024-04-27T19:40:50.250+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-28T19:40:50+02:00"}
backrest | 2024-04-27T19:42:00.001+0200 INFO running task {"task": "backup for plan \"matteo.partial.plan\""}
backrest | 2024-04-27T19:42:00.006+0200 INFO Starting backup {"plan": "matteo.partial.plan", "opId": 1797510550246129665}
backrest | 2024-04-27T19:42:00.006+0200 DEBUG repo orchestrator starting backup {"repo": "matteo.local.repo"}
backrest | 2024-04-27T19:42:06.871+0200 DEBUG got snapshots for plan {"repo": "matteo.local.repo", "repo": "matteo.local.repo", "count": 0, "plan": "matteo.partial.plan", "tag": "plan:matteo.partial.plan"}

After I redeployed the container, the same plan on the same repository completed the backup successfully, as expected:
backrest | 2024-04-28T15:10:02.889+0200 DEBUG loading auth secret from file
backrest | 2024-04-28T15:10:02.904+0200 DEBUG unlocking repo {"repo": "matteo.local.repo", "repo": "matteo.local.repo"}
backrest | 2024-04-28T15:10:03.521+0200 INFO scheduling default tasks, waiting for task queue reset.
backrest | 2024-04-28T15:10:03.521+0200 INFO reset task queue, scheduling new task set.
backrest | 2024-04-28T15:10:03.521+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-28T15:10:08+02:00"}
backrest | 2024-04-28T15:10:03.525+0200 INFO scheduling task {"task": "backup for plan \"matteo.partial.plan\"", "runAt": "2024-04-28T19:42:00+02:00"}
backrest | 2024-04-28T15:10:03.525+0200 INFO starting orchestrator loop
backrest | 2024-04-28T15:10:03.527+0200 INFO starting web server :9898
backrest | 2024-04-28T15:10:08.523+0200 INFO running task {"task": "collect garbage"}
backrest | 2024-04-28T15:10:08.532+0200 INFO collecting garbage {"forgotten_snapshots": 0, "operations_removed": 0}
backrest | 2024-04-28T15:10:08.532+0200 INFO task finished {"task": "collect garbage", "duration": "8.528664ms"}
backrest | 2024-04-28T15:10:08.532+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-29T15:10:08+02:00"}
backrest | 2024-04-28T15:10:39.226+0200 DEBUG unlocking repo {"repo": "matteo.local.repo", "repo": "matteo.local.repo"}
backrest | 2024-04-28T15:15:26.296+0200 INFO scheduling default tasks, waiting for task queue reset.
backrest | 2024-04-28T15:15:26.300+0200 DEBUG queued task cancelled due to config change {"task": "backup for plan \"matteo.partial.plan\""}
backrest | 2024-04-28T15:15:26.301+0200 DEBUG queued task cancelled due to config change {"task": "collect garbage"}
backrest | 2024-04-28T15:15:26.301+0200 INFO reset task queue, scheduling new task set.
backrest | 2024-04-28T15:15:26.301+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-28T15:15:31+02:00"}
backrest | 2024-04-28T15:15:26.305+0200 INFO scheduling task {"task": "backup for plan \"matteo.partial.plan\"", "runAt": "2024-04-28T15:16:00+02:00"}
backrest | 2024-04-28T15:15:31.303+0200 INFO running task {"task": "collect garbage"}
backrest | 2024-04-28T15:15:31.311+0200 INFO collecting garbage {"forgotten_snapshots": 0, "operations_removed": 0}
backrest | 2024-04-28T15:15:31.311+0200 INFO task finished {"task": "collect garbage", "duration": "7.697248ms"}
backrest | 2024-04-28T15:15:31.311+0200 INFO scheduling task {"task": "collect garbage", "runAt": "2024-04-29T15:15:31+02:00"}
backrest | 2024-04-28T15:16:00.028+0200 INFO running task {"task": "backup for plan \"matteo.partial.plan\""}
backrest | 2024-04-28T15:16:00.035+0200 INFO Starting backup {"plan": "matteo.partial.plan", "opId": 1797584454996197379}
backrest | 2024-04-28T15:16:00.035+0200 DEBUG repo orchestrator starting backup {"repo": "matteo.local.repo"}
backrest | 2024-04-28T15:16:06.904+0200 DEBUG got snapshots for plan {"repo": "matteo.local.repo", "repo": "matteo.local.repo", "count": 0, "plan": "matteo.partial.plan", "tag": "plan:matteo.partial.plan"}
backrest | 2024-04-28T18:12:29.537+0200 DEBUG Backup completed {"repo": "matteo.local.repo", "repo": "matteo.local.repo", "duration": "2h56m22.630435411s"}
backrest | 2024-04-28T18:12:29.543+0200 INFO Backup complete {"plan": "matteo.partial.plan", "duration": "2h56m29.506771428s", "summary": "summary:{files_new:93721 dirs_new:19695 data_blobs:251450 tree_blobs:18592 data_added:253023600353 total_files_processed:93721 total_bytes_processed:334081958305 total_duration:10582.065976532 snapshot_id:\"711374edf671e83223a2c0ce8d62820429fb8358a0c56b49dc027f3701da8115\"}"}
backrest | 2024-04-28T18:12:29.562+0200 INFO scheduling task {"task": "forget for plan \"matteo.partial.plan\"", "runAt": "2024-04-28T18:12:29+02:00"}
backrest | 2024-04-28T18:12:29.562+0200 INFO scheduling task {"task": "index snapshots for plan \"matteo.local.repo\"", "runAt": "2024-04-28T18:12:29+02:00"}
backrest | 2024-04-28T18:12:29.574+0200 DEBUG wrote operation log to 2024-04-28-logs.tar/0
backrest | 2024-04-28T18:12:29.579+0200 INFO task finished {"task": "backup for plan \"matteo.partial.plan\"", "duration": "2h56m29.550845644s"}
backrest | 2024-04-28T18:12:29.584+0200 INFO scheduling task {"task": "backup for plan \"matteo.partial.plan\"", "runAt": "2024-04-29T15:16:00+02:00"}
backrest | 2024-04-28T18:12:29.584+0200 INFO running task {"task": "forget for plan \"matteo.partial.plan\""}
backrest | 2024-04-28T18:12:30.494+0200 DEBUG forget snapshots {"plan": "matteo.partial.plan", "count": 0, "policy": "policy_time_bucketed:{hourly:24 daily:7 weekly:4 monthly:3}"}
backrest | 2024-04-28T18:12:30.497+0200 DEBUG wrote operation log to 2024-04-28-logs.tar/2560
backrest | 2024-04-28T18:12:30.502+0200 INFO task finished {"task": "forget for plan \"matteo.partial.plan\"", "duration": "917.45117ms"}
backrest | 2024-04-28T18:12:30.502+0200 INFO running task {"task": "index snapshots for plan \"matteo.local.repo\""}
backrest | 2024-04-28T18:12:31.284+0200 DEBUG indexed known snapshot IDs for repo matteo.local.repo in 776.485µs
backrest | 2024-04-28T18:12:31.289+0200 DEBUG indexed snapshots {"repo": "matteo.local.repo", "duration": "4.757505ms", "alreadyIndexed": 0, "newlyAdded": 1, "markedForgotten": 0}
backrest | 2024-04-28T18:12:31.289+0200 INFO task finished {"task": "index snapshots for plan \"matteo.local.repo\"", "duration": "787.750105ms"}

Hopefully this helps, and thank you for your time and the good work ;)

@mattiarainieri
Author

I am continuing to investigate... This time I used an S3 bucket, so instead of going through rclone I used the built-in restic support, but the operation got stuck anyway. I'd like to emphasize that the hang occurs only for long-running operations, for example an initial backup to new storage over a not-too-fast internet connection such as mine (~24h).
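
For context, the repository points at the bucket via restic's native S3 backend, configured roughly like this (endpoint, bucket, and credentials are placeholders):

# restic's S3 backend reads credentials from the environment
export AWS_ACCESS_KEY_ID=<access key>
export AWS_SECRET_ACCESS_KEY=<secret key>
restic -r s3:https://<endpoint>/<bucket> init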
Nothing relevant in the logs; here is the final snippet:

2024-05-01T09:12:16.024Z	INFO	task finished	{"task": "index snapshots for plan \"cubbit-test\"", "duration": "3.191874876s"}
2024-05-01T09:12:17.830Z	INFO	running task	{"task": "collect garbage"}
2024-05-01T09:12:17.832Z	INFO	collecting garbage	{"forgotten_snapshots": 0, "operations_removed": 0}
2024-05-01T09:12:17.832Z	INFO	task finished	{"task": "collect garbage", "duration": "2.047841ms"}
2024-05-01T09:12:17.832Z	INFO	scheduling task	{"task": "collect garbage", "runAt": "2024-05-02T09:12:17Z"}
2024-05-01T09:14:21.901Z	INFO	scheduling default tasks, waiting for task queue reset.
2024-05-01T09:14:21.901Z	DEBUG	queued task cancelled due to config change	{"task": "collect garbage"}
2024-05-01T09:14:21.902Z	DEBUG	queued task cancelled due to config change	{"task": "backup for plan \"raid-backup\""}
2024-05-01T09:14:21.902Z	INFO	reset task queue, scheduling new task set.
2024-05-01T09:14:21.902Z	INFO	scheduling task	{"task": "collect garbage", "runAt": "2024-05-01T09:14:26Z"}
2024-05-01T09:14:21.903Z	INFO	scheduling task	{"task": "backup for plan \"cubbit-backup\"", "runAt": "2024-05-06T00:00:00Z"}
2024-05-01T09:14:21.905Z	INFO	scheduling task	{"task": "backup for plan \"raid-backup\"", "runAt": "2024-05-06T01:00:00Z"}
2024-05-01T09:14:24.708Z	INFO	scheduling task	{"task": "onetime backup for plan \"cubbit-backup\"", "runAt": "2024-05-01T09:14:24Z"}
2024-05-01T09:14:24.708Z	INFO	running task	{"task": "onetime backup for plan \"cubbit-backup\""}
2024-05-01T09:14:24.709Z	INFO	Starting backup	{"plan": "cubbit-backup", "opId": 1797841081815007299}
2024-05-01T09:14:24.710Z	DEBUG	repo orchestrator starting backup	{"repo": "cubbit-test"}
2024-05-01T09:14:28.856Z	DEBUG	got snapshots for plan	{"repo": "cubbit-test", "repo": "cubbit-test", "count": 0, "plan": "cubbit-backup", "tag": "plan:cubbit-backup"}

@garethgeorge
Owner

Hey, mind trying to kill backrest (or possibly its process group, if restic is what's hung) with signal 7? The syntax should be something along the lines of

ps -aux | grep backrest

to find the PID,

kill -7 <pid>

or, to kill the process group, pass the negative PID, e.g.

kill -7 -<pid>

This should dump stack traces (which will appear in the systemd logs). I suspect restic itself is hanging; I'm curious to cross-post this issue on the restic forum and get some input from the devs over there on how this should be handled, e.g. how Backrest can tell that restic is hung.
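
If it's easier, the two steps can be combined into a single command; this is just a sketch, assuming pgrep is available and that backrest leads its own process group:

# pgrep -o returns the oldest matching PID (assumed here to be the
# process-group leader); "--" plus the negative value makes kill
# signal the entire process group.
kill -7 -- -"$(pgrep -o backrest)"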

@garethgeorge
Owner

Closing this for lack of activity; feel free to ping if you're still hitting issues.
