sftp backend does not reconnect #353
Sorry, forgot to add, I'm running latest git:
on a Debian wheezy amd64 system, using the following command line:
Thanks for your report. The backend infrastructure does not yet handle connection errors; for sftp, just one persistent connection is made to the backup server, and it does not reconnect. That's the underlying problem. I suspect that after the error is received, the connection is closed and restic does not store anything at all after that. However, you should be able to resume the backup by restarting restic. Do you know what server the Hetzner backup uses?
Ah that explains it. :) Thanks for your fast reply! Please let me know if I could be of any help with testing, would love to use restic also for the Hetzner use case. :) |
Side note: It seems unlikely to me that Hetzner's ssh backend is broken. However, the following observation might be valuable for you, too. A few times I observed stalling TCP-over-IPv6 connections that were related to faulty routers on the way to the target host. What happened: some router on the way had Sequence Number Randomization enabled. This should be transparent to both server and client, but they did not do it properly, as they did not rewrite the SACK (selective acknowledgement) TCP header option. Once the first packet loss occurred, there was a high chance for the SACK option to contain an illegal (out of expected range) value, causing netfilter/conntrack to drop the response packet. IPv4 could theoretically be affected, too, but I never observed this myself. You can log packets flagged invalid by the conntrack module. SACK is only used when both client and server support it; it can be disabled per host using a netfilter rule (requires netfilter + TCPOPTSTRIP kernel support), or you could (temporarily) disable SACK for all connections (IPv4+IPv6). See the sketch below.
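The exact commands from that comment are not preserved in this thread; a rough equivalent, assuming a Linux host with iptables and the conntrack and TCPOPTSTRIP extensions (the hostname is a placeholder), could look like this:

```sh
# Log packets that conntrack flags as invalid (e.g. out-of-range SACK values)
iptables  -A INPUT -m conntrack --ctstate INVALID -j LOG --log-prefix "conntrack invalid: "
ip6tables -A INPUT -m conntrack --ctstate INVALID -j LOG --log-prefix "conntrack invalid: "

# Strip the SACK-permitted option on outgoing SYNs to a single host so SACK
# is never negotiated (uXXX.your-backup.de is a placeholder)
iptables -t mangle -A OUTPUT -p tcp --syn -d uXXX.your-backup.de \
  -j TCPOPTSTRIP --strip-options sack-permitted

# Or disable SACK globally (the sysctl lives under ipv4 but applies to IPv6 TCP as well)
sysctl -w net.ipv4.tcp_sack=0
```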
Interesting hints, thanks. No matter whether I'm running with I got runtime panics though, full log available in http://michael-prokop.at/tmp/restic_panic.txt jftr. |
Thanks for the log, the panic was expected. Error handling and reporting back to the user deserves a lot more work ;) |
I'd like to note that I receive the same error message when backing up from my linux (debian jessie) systems to hetzner backup space. Is there a workaround yet? |
Not yet, but you can just run the backup again and it will resume. |
Hm, I find that rather unexpected. Restic starts, reads files, splits them into chunks, bundles the chunks together to pack files and uploads the pack files to the backend. The pack files are never the same (neither size nor content), so the server always sees different data. @debe Did you create a debug log yet? Maybe that contains another hint... |
ssh loglevel DEBUG3. https://0bin.link/paste/66DtwHnN#NEDc92vKhiQANv2ESS5WJTYOcWNGe0jQe+yHpWu1Buh |
FYI: It was just reported that Hetzner is still running ProFTPD 1.3.5b. The fix should be in 1.3.6, which was just released a few weeks ago on 9 April.
A user just reported that Hetzner seems to have installed a newer version of the ftp service, so it works much better now. Can anybody else confirm this? |
@fd0 - jepp, did some tests today (using v0.7.1) and it looks much better \o/ |
Well, maybe I was too fast with my reply, just had the following failure with a 103GB directory (hetzner storagebox as repository target via sftp, restic v0.7.1 locally):
It seems different from the problem I initially reported here back then though. |
Same problem again, this time after ~37 minutes:
Running |
Yes, |
FTR, retried with latest restic from git (v0.7.1-183-g0f25ef94) with debug enabled, last lines from debug log:
(Running it from a Debian jessie system towards Hetzner's your-storagebox.de, IP 88.99.48.239 in my case) |
And another update of mine :) With latest restic from git and running it under |
Is restic prune safe to use over unreliable connections? |
It is "safe" in terms of safety: It won't break the repo. But you'll have to start from the beginning and the repo will keep growing until you manage to finish one prune run. That's a safety feature (restic will only remove data once it is sure that the new copies of still referenced data are safely saved). |
It seems to me at least that it might be more reliable to use the rclone backend and let rclone handle the sftp until reconnect support is implemented, particularly if uploading over residential broadband. |
For your information: We've recently figured out that some sftp servers disconnect clients after a while without activity. You can prevent that with restic by adding keep-alive settings to the client's ~/.ssh/config file, as sketched below.
See: https://restic.readthedocs.io/en/latest/030_preparing_a_new_repo.html#sftp
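For reference, the relevant settings are the SSH keep-alive options; a minimal example for the client's ~/.ssh/config (the host is a placeholder, and the values are taken from the restic documentation rather than from this comment):

```
Host uXXX.your-storagebox.de
    ServerAliveInterval 60
    ServerAliveCountMax 240
```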
Hello. Is this still being worked on? I have implemented a repo check job, and from time to time it won't finish, seemingly because of this problem:

read group #20 of 7086 data packs (out of total 197033 packs in 28 groups)

It seems it will continue spilling those error messages forever; the mentioned blob references are repeatedly shown. The problem is

So I would at first only notice this when the next backup job is scheduled, which notifies me that the repo is still in use, and I have to manually terminate the job and remove the lock. It would already be a great improvement if restic at least terminated on those errors (so I get the error notice immediately).

Regards
This is not actively being worked on as far as I know. Somebody would need to step up and implement reconnection/restarting code in the sftp backend.
I assume this issue is still open? I (not infrequently) get failed prunes like this from my home network to hetzner. Previously I spun up an AWS instance to run prune and it finished without difficulties.
Hello there! Still having this problem with hetzner... my backups are failing after some minutes (13 mins, 16 mins... it varies).
No combination of ssh |
For me this sounds like the following SFTP issue discussed in the documentation (https://restic.readthedocs.io/en/stable/030_preparing_a_new_repo.html#sftp):
Thanks Michael - I actually have those values already configured in my .ssh/config file |
@johnflan Oh, that response was mainly intended for @SimoneLazzaris. In your case I wonder whether this might be a result of an internet reconnection? With a daily reconnect, as is common for DSL, it wouldn't be too unexpected if that interrupted a command running for 17 hours.
@MichaelEischer I've tried that, and also disabling/enabling TCPKeepAlive, but without success. I suspect that something is actually wrong on Hetzner's side, because plain ssh connections are also sometimes dropped. I managed to make it work using rclone as an intermediate interface: I configured rclone with a backend on Hetzner and then told restic to use it. It's been uploading for 24 hours now; sometimes I get errors, but the connection is immediately re-established and the backup can go on. Note that before, with direct sftp, my connection was dropped every 10-20 minutes. See the sketch below for the setup.
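A minimal sketch of that kind of setup, assuming an rclone remote named hetzner pointing at the storage box (remote name, host, and paths are made up for illustration):

```sh
# ~/.config/rclone/rclone.conf -- sftp remote for the storage box
# [hetzner]
# type = sftp
# host = uXXX.your-storagebox.de
# user = uXXX
# key_file = ~/.ssh/id_rsa

# let restic reach the repository through rclone instead of its own sftp code;
# rclone retries and reconnects on its own when the connection drops
restic -r rclone:hetzner:restic-repo backup /home
```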
No worries @MichaelEischer - I had considered that, but I don't ever have a problem performing the initial backup, just the prune. On a superficial glance it would seem that in the case of a network drop restic should attempt to re-establish the sftp stream to the remote target.
@johnflan I also had tons of troubles with restic and Hetzner, what was also relevant for me was https://docs.hetzner.com/robot/dedicated-server/troubleshooting/performance-intel-i218-nic - if you're running a dedicated server at Hetzner, consider giving this a try |
Took a first stab at this in a draft PR, approach is to separate out the ssh connection code from the backend and then establish a new connection when it exits. Not ready for maintainers to review yet, but if anyone wants to give feedback on the approach I'd welcome it. |
What is the state of this feature request? |
My pr kind of worked, but needed cleanup + more extensive manual testing before it’d be merge ready. Personally didn’t need this functionality that much because I made my wrapper script that calls restic just kill the process on disconnect and retry later. |
I'm having the same problem with Hetzner.
@ibash, could you share that script? |
Unfortunately my backups code is kind of messy / bespoke. I have a branch where I'm working on a cleanup; at some point later I'll publish that. But the relevant bit is in here: in the code below, the stderr of the restic child process is watched for the connection-lost message:

```js
child.stderr.on('data', (data) => {
  // unfortunately restic never recovers from this, so end early s.t. the
  // backup can be retried again sooner
  //
  // ref: https://github.com/restic/restic/issues/353
  if (data.includes('returned error, retrying after') && data.includes('connection lost')) {
    isConnectionLost = true
    reject(new Error('connection lost'))
  }
})
```
Similar problems here, apparently also a Hetzner problem (using a 1TB storage box over there) but it was working until 07Mar2024, then it broke. Some minutes into the backup (photos, ~350GB) I get the usual:
The command line is unspectacular:

Did someone already ask over in the Hetzner forum or their tech support? Can't be a restic issue if it was working without any changes before, right?

Edit: Changed the backend to rclone and it works now, even with the whole 380GB. No idea what broke it, and I only asked in Hetzner's forum; did anyone ask tech support yet?
I'm trying to back up data to Hetzner's backup space (http://wiki.hetzner.de/index.php/Backup/en). It starts nicely with ~15-20MiB/sec, but as soon as I'm receiving a

```
Write failed: Broken pipe
```

or

```
Connection to uXXX.your-backup.de closed by remote host.
```

from Hetzner's backup space, restic permanently decreases its performance to <1MiB/s and doesn't seem to be able to recover from it. While this clearly seems to be a problem with Hetzner's backend storage, I can't find an ssh configuration (TCPKeepAlive, ServerAliveInterval, ...) which would work around this problem. It would be nice if restic could recover from this situation itself.

Thanks for restic! :)