Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Master issue for "error 23" (rsync returns with exit code 23: Partial transfer due to errors) #1587

Open
aryoda opened this issue Dec 14, 2023 · 22 comments
Assignees

Comments

@aryoda
Copy link
Contributor

aryoda commented Dec 14, 2023

This issue is meant to centralize all "error 23" issues to discuss a common solution...

Background

BiT Version 1.4.0 (2023-09-14) introduced the evaluation of rsync exit codes for better error recognition:

backintime/CHANGES

Lines 32 to 42 in 726bffa

* Fix bug: takeSnapshot() recognizes errors now by also evaluating the rsync exit code (#489)
Fixes related problem: Killing `rsync` was not handled gracefully (by ignoring the rsync exit code)
* Fix bug: The error user-callback is now always called if an error happened while taking a snapshot (#1491)
* Fix bug: D-Bus serviceHelper error "LimitExceeded: Maximum length of command line reached (100)":
Max command length is now 120 instead of 100 (#1027)
* Feature: Introduce new error codes for the "error" user callback (as part of #1491):
5: Error while taking a snapshot.
6: New snapshot taken but with errors.
* Feature: The `rsync` exit code is now contained in the snapshot log (part of #489). Example:
[E] Error: 'rsync' ended with exit code -9 (negative values are signal numbers, see 'kill -l')
* Fix bug: Treat rsync exit code 24 as INFO instead of ERROR (#1506)

Before this change rsync errors reported only via exit code were hidden and left users with the feeling that everything worked well (but didn't).

Collected reasons for rsync exit code 23

The final error message in the snapshot log is always

[I] Take snapshot (rsync: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1338) [sender=3.2.7])
[E] 'rsync' ended with exit code 23: See 'man rsync' for more details

Here are typical snapshot log entries causing exit code 23 (shall be extended here to be extended step by step to get a holistic picture):

  • [I] Take snapshot (rsync: symlink has no referent: "/home/user/Documents/dead-link")
  • [E] Error: rsync: [sender] send_files failed to open "/home/user/Documents/root_only_file.txt": Permission denied (13)

Other known reasons: See #1587 (comment)

Current impact

Some of the reasons for exit code 23 may be out of control of the BiT user or are OK to be ignored since not considered as "severe" (eg. dead links).

Despite that BiT snapshots do always fail with an error now.

Even though rsync treats exit code 23 always as error the BiT user should have the freedom to treat some reasons for exit code 23 as INFO or WARNING only...

Possible solutions

  1. Add exit code 23 to the ignore list (as before) and let the user think everything went OK (until looking into the snapshot log)

    backintime/common/snapshots.py

    Lines 1247 to 1252 in 726bffa

    # dict of exit codes (as keys) that are treated as INFO only by BiT
    # (not as ERROR). The values are message strings for the snapshot log.
    rsync_non_error_exit_codes = {
    0: _("Success"),
    24: _("Partial transfer due to vanished source files (see 'man rsync')")
    }

  2. Find and use rsync options so that we can ignore minor exit code 23 reasons:
    https://unix.stackexchange.com/questions/598736/how-to-skip-copying-broken-links-with-rsync-copy-links

  3. Introduce a new snapshot result state "Warning" (currently we have only "OK" and "with errors)

    backintime/common/snapshots.py

    Lines 1110 to 1131 in 726bffa

    def takeSnapshot(self, sid, now, include_folders):
    """
    This is the main backup routine. It will take a new snapshot and store
    permissions of included files and folders into ``fileinfo.bz2``.
    Args:
    sid (SID): snapshot ID which the new snapshot
    should get
    now (datetime.datetime): date and time when this snapshot was
    started
    include_folders (list): folders to include. list of
    tuples (item, int) where ``int`` is 0
    if ``item`` is a folder or 1 if ``item``
    is a file
    Returns:
    list: list of two bool
    (``ret_val``, ``ret_error``)
    where ``ret_val`` is ``True`` if a new
    snapshot has been created and
    ``ret_error`` is ``True`` if there was
    an error during taking the snapshot

Challenges

rsync has several reasons for exit code 23 and always treats it as an error.

Related issues

@aryoda aryoda self-assigned this Dec 14, 2023
@buhtz
Copy link
Member

buhtz commented Dec 14, 2023

It might help or give us some more ideas if we would involve the rsync-folks and -maintainer via their mailing list.

I assume there is a reason why multiple problems are summarized with one single error code.

@aryoda
Copy link
Contributor Author

aryoda commented Dec 14, 2023

FYI: I have just posted my questions at the rsync mailing list
and also looked into the relevant (C) source code of rsync

https://github.com/WayneD/rsync/blob/2f9b963abaa52e44891180fe6c0d1c2219f6686d/flist.c#L1271C1-L1282C28

but without debugging rsync (which I won't do ;-) I see no way to find out how to suppress this error (one of many that leads to exit code 23):

symlink has no referent: %s

@emtiu
Copy link
Member

emtiu commented Dec 15, 2023

Ideally, I would prefer the solution where an rsync-23 marks a snapshot as "with warnings" (in addition to "with errors"). Users could then clear the warnings (e.g. by renaming the snapshot).

I don't think it's worth the trouble of second-guessing rsync by trying to figure out why exactly a 23 was returned. Users can do that by looking at the logs.

@buhtz
Copy link
Member

buhtz commented Dec 15, 2023

With my current but limited knowledge I also do prefer "with warnings".
But we need to add enough documentation about it. Because this will trigger user to ask questions in Issues and on the mailing list.

They have to investigate the log files etc pp

This is also related to our log-file-parsing issue. When we are able to parse the log file then we are able to give better feedback to the user which type of code 23 caused the warning. But in a first place I would say a link to a docu/faq would be enough in the first place.

@aryoda
Copy link
Contributor Author

aryoda commented Dec 15, 2023

Ideally, I would prefer the solution where an rsync-23 marks a snapshot as "with warnings" (in addition to "with errors").
[...]
Users can do that by looking at the logs.

This would be my preferred (and perfect) solution too.

Unfortunately

  • our snapshot logs do currently support only [I] and [E] (no warnings)

  • the snapshot log viewer does not yet allow filtering on warnings

  • the above mentioned 21 reasons for exit code 23 are contained as [I] only in the snapshot logs

    [I] Take snapshot (rsync: symlink has no referent: "/home/user/Documents/dead-link")
    
  • rsync does not provide a reliable way to allow our log parser to recognize the 21 reasons for exit code 23
    except parsing the message text
    so we are not able to easily and reliably mark them as [W]
    (without false positives or missed entries due to rsync msg text changes or reasons I have overseen in the code)

    => this means users cannot even filter and check the log entries causing exit code 23 easily

    User had to scroll through the complete snapshot log or search for keywords (which are determined by the msg strings of the 21 rsync reasons)

  • I am not sure if rsync's output may be translated into other languages (which would effectively render log parsing into nightmare ;-)

    Lukily in the source code I could find no indication for that but I will ask at the rsync mailing list.

All this is IMHO too risky and time consuming ATM (I prefer to prepare the next release for Jan, 2023 and the Qt6 release then).

I propose to

  • DONE: Add a FAQ entry for exit code 23 (as @buhtz suggested)
  • DONE: Add exit code 23 to the exclusion list for until we have introduced warnings.
    This effectively re-introduces the same behavior as before checking the rsync exit codes for errors

aryoda added a commit to aryoda/backintime that referenced this issue Dec 22, 2023
…it-team#1587

- Corrects some other typos in code comments
- Add global locking details to dev doc of control files usage
aryoda added a commit that referenced this issue Dec 27, 2023
FAQ and dev doc: Closes #1166, #1582. Contributes to #1555 and #1587

- Add FAQ entry: Snapshot "WITH ERRORS": [E] 'rsync' ended with exit code 23
- Add FAQ entry: Does BiT support full system backups?
- Add FAQ entry: Does BiT support backups on cloud storage like OneDrive or Google Drive?
- Corrects some other typos in code comments
- Add details about global locking to dev doc of control files usage
@aryoda aryoda added this to the 1.4.2 (upcoming release) milestone Jan 5, 2024
@aryoda aryoda pinned this issue Jan 23, 2024
aryoda added a commit to aryoda/backintime that referenced this issue Jan 26, 2024
* rsync exit code 23 is now in the ignore list
* Add experimental "rsync transfer failures" filter to snapshot log view
* Exclude 'SingletonLock' and 'SingletonCookie' (Discord) and 'lock' (Mozilla Firefox) files by default (part of bit-team#1555)
aryoda added a commit that referenced this issue Jan 29, 2024
…napshot log filter (#1621)

* Work around for rsync exit code 23: Is no longer an error by adding it to the rsync exit code ignore list (#1587)
* Add experimental "rsync transfer failures" filter to snapshot log view
* Add links to rsync source code for rsync transfer failure filter
* Exclude 'SingletonLock' and 'SingletonCookie' (Discord) and 'lock' (Mozilla Firefox) files by default (part of #1555)
* Travis CI config file adjustments since coverage does fail with absolute path to Python for unknown reason (only relative path works)
* CI: Remove sudo because travis says so ("no effect anymore")

---------

Co-authored-by: aryoda <11374410+aryoda@users.noreply.github.com>
Co-authored-by: Christian Buhtz <c.buhtz@posteo.jp>
@aryoda
Copy link
Contributor Author

aryoda commented Feb 1, 2024

FYI: The new BiT release v1.4.3 does

  • reintroduce handling rsync exit code 23 as info only (no longer an error when taking a snapshot)
  • has an experimental filter now to make finding resync transfer errors easier (even if the are "hidden" as [I]nfo in length snapshot logs:
    image

@jean-christophe-manciot
Copy link

jean-christophe-manciot commented Feb 6, 2024

@aryoda

  • reintroduce handling rsync exit code 23 as info only (no longer an error when taking a snapshot)

1.4.3 continues to make the snapshot fail:

[I] Take snapshot (rsync: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1338) [sender=3.2.7])
[I] 'rsync' ended with exit code 23: Partial transfer due to error
[E] Failed to take snapshot 2024-02-06 20:03:01.

What's troubling is that the lock file has not been deleted within 300 seconds after the error has been detected in my user-callback (SLEEP_THRESHOLD=300 and the script sent an email about that issue):

case "$rsync_return_code" in
        0|24)
                /usr/bin/nice -n 19 /usr/bin/ionice -c2 -n7 /usr/bin/backintime --profile "another_profile" backup
                ;;
        23)
                # Waiting for the current process to end and clear the lock file
                sleep_counter=0
                while [[ -e "$snapshot_lock_file" ]]
                do
                        sleep 1
                        ((sleep_counter++))
                        if [[ $sleep_counter -ge $SLEEP_THRESHOLD ]]; then
                                (
                                printf "%s\n" "Called from:             $script_parameters"
                                printf "%s\n" "On:                      $date"
                                printf "%s\n" "profile_id:              $profile_id"
                                printf "%s\n" "rsync_return_code:       $rsync_return_code"
                                printf "%s\n" "snapshot_lock_file:      $snapshot_lock_file"
                                ) | mail -s "Failed ${host_name}: backintime lock file has not been deleted within $SLEEP_THRESHOLD seconds for the $profile_name profile" admin@example.com
                                exit 1
                        fi
                done
                
                # Retaking the backup
                /usr/bin/nice -n 19 /usr/bin/ionice -c2 -n7 /usr/bin/backintime --profile "$profile_name" backup
                ;;
        *)
                # Saving snapshot log
                cp --force --preserve=all "${snapshot_log_file}" "${snapshot_log_file}.$(date --iso-8601=minutes)"
                (
                printf "%s\n" "Called from:             $script_parameters"
                printf "%s\n" "On:                      $date"
                printf "%s\n" "profile_id:              $profile_id"
                printf "%s\n" "rsync_return_code:       $rsync_return_code"
                printf "%s\n" "snapshot_log_file:       $snapshot_log_file"
                ) | mail -s "Failed ${host_name}: backintime rsync_return_code is unknown for the $profile_name profile" admin@example.com
                exit 1
                ;;
esac

@jean-christophe-manciot
Copy link

Here is the matching log in takesnapshot.log:

WARNING: user-callback returncode: 1
ERROR: Plugin usercallbackplugin processEnd failed: 
INFO: Unlock'
WARNING: user-callback returncode: 1
ERROR: Plugin usercallbackplugin processEnd failed: 
INFO: Unlock'
INFO: Unlock'
DEBUG: [common/tools.py:1134 keyringSupported] No keyring due to import error.
DEBUG: [common/mount.py:161 Mount.__init__] pw-cache is not running
DEBUG: [common/mount.py:169 Mount.__init__] Call command: /usr/bin/backintime pw-cache start
DEBUG: [plugins/usercallbackplugin.py:81 UserCallbackPlugin.callback] Call user-callback: /root/.config/backintime/user-callback 10 Data 8
INFO: [common/snapshots.py:851 Snapshots.backup] Unlock

Back In Time
Version: 1.4.3

@aryoda
Copy link
Contributor Author

aryoda commented Feb 8, 2024

@jean-christophe-manciot

  1. Could you please start your backup manually via command line incl. the --debug option and send me the anonymized output (it seems the above takesnapshot.log does not show the relevant critical execution path to diagnose this).

    I will try to find the execution path then that "forgets" to delete the lock file...

  2. In the code I can see that the error may also be caused by a different reason than exit code 23 (when the rsync log is parsed and an error is detected). Can you find any snapshot log entry that contains the string rsync: (which is BiT's way of recognizing and returning an rsync error)?

Internal note: The error recognition in the snapshot log is here:

if line.endswith(')'):
if line.startswith('rsync:'):
if not line.startswith('rsync: chgrp ') and not line.startswith('rsync: chown '):
# matches rsync error lines like:
# rsync: [generator] link [...] failed: Invalid cross-device link (18)
params[0] = True
self.setTakeSnapshotMessage(1, 'Error: ' + line)

@aryoda
Copy link
Contributor Author

aryoda commented Feb 8, 2024

What's troubling is that the lock file has not been deleted within 300 seconds after the error has been detected in my user-callback (SLEEP_THRESHOLD=300

  • Just to be sure which execution path I a have to debug:
    while [[ -e "$snapshot_lock_file" ]] references the worker*.lock file?
  • In which "reason" ($3 of the user callback) your handler code is executed?
    I need to check if the worker*.lock file is already deleted in this "reason" handler (or do you spawn a separate parallel process to execute your above script?)

@jean-christophe-manciot

@aryoda

In the code I can see that the error may also be caused by a different reason than exit code 23 (when the rsync log is parsed and an error is detected). Can you find any snapshot log entry that contains the string rsync: (which is BiT's way of recognizing and returning an rsync error)?

Actually, I do believe the error has been caused only by rsync error 23; here are the sole errors present in the same log:

[E] Error: rsync: [sender] read errors mapping "/some/file": No data available (61)
[E] Error: rsync: [sender] read errors mapping "/some/other/file": No data available (61)
[E] Error: rsync: [sender] read errors mapping "/some/different/file": No data available (61)

Just to be sure which execution path I a have to debug:
while [[ -e "$snapshot_lock_file" ]] references the worker*.lock file?

snapshot_lock_file="/root/.local/share/backintime/worker${profile_id}.lock"
with profile_id=8 here

In which "reason" ($3 of the user callback) your handler code is executed?
I need to check if the worker*.lock file is already deleted in this "reason" handler (or do you spawn a separate parallel process to execute your above script?)

$3=2 when the email was sent.

Could you please start your backup manually via command line incl. the --debug option and send me the anonymized output (it seems the above takesnapshot.log does not show the relevant critical execution path to diagnose this).
I will try to find the execution path then that "forgets" to delete the lock file...

I will try when I find the time.
This issue may be linked to the fact that each successful snapshot (rsync_return_code=0|24) calls another one in the user-callback script without waiting for the lock file to be deleted.
I can spot in takesnapshot.log that for each successful snapshot, there is a

ERROR: user-callback returned 'INFO: Lock
INFO: Take a new snapshot. Profile: <number> <name>

and at the end of all snapshots, I get:

INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'
INFO: Unlock'

matching precisely the number of previous ERROR: user-callback returned 'INFO: Lock +1 (for the last successful snapshot)

@jean-christophe-manciot
Copy link

jean-christophe-manciot commented Feb 10, 2024

Unfortunately, I've just realized that the lock file is not removed as long as the user-callback is running, so the latter can wait forever for the lock file to be deleted before moving on... so it's a catch-22 situation!

The only way out is to delete the lock file inside the user-callback script just before moving on.
Let's see what happens.

@jean-christophe-manciot
Copy link

jean-christophe-manciot commented Feb 10, 2024

No luck, it does not work, the whole process is stopped when the lock file is manually deleted.
This means that it seems impossible to retake the failed snapshot from within the user-callback after an rsync error 23.

@aryoda
Copy link
Contributor Author

aryoda commented Feb 12, 2024

Work around: Relax rsync exit code 23: Ignore instead of error now (part of #1587)
1.4.3 continues to make the snapshot fail:

I have "lost" the overall view, do you think there is an open bug in BiT (1.4.4-dev version, above change is not yet released!) that I should fix?

No luck, it does not work, the whole process is stopped when the lock file is manually deleted.
This means that it seems impossible to retake the failed snapshot from within the user-callback after an rsync error 23.

That is strange, I see now obvious indication in our source code why BiT should fail if an existing lock file is deleted.

I think you urgently need my proposed FR #1591 (on my list but currently I am focusing on our Qt5 to Qt6 migration of the GUI code)...

This issue may be linked to the fact that each successful snapshot (rsync_return_code=0|24) calls another one in the user-callback script without waiting for the lock file to be deleted.

It looks like you are using the usercallback to implement special requirements to BiT that are currently available .

If you want you could describe your requirements in a new issue as FR and we could think about if and how to implement this in BiT instead...

@ptilopteri
Copy link

ptilopteri commented Feb 12, 2024 via email

@ptilopteri
Copy link

ptilopteri commented Feb 12, 2024 via email

@ptilopteri
Copy link

ptilopteri commented Feb 12, 2024 via email

@jean-christophe-manciot

@aryoda

I have "lost" the overall view, do you think there is an open bug in BiT (1.4.4-dev version, above change is not yet released!) that I should fix?

So it seems that the rsync error 23 is no longer treated as an error in 1.4.4-dev instead of 1.4.3.

I think you urgently need my proposed FR #1591 (on my list but currently I am focusing on our Qt5 to Qt6 migration of the GUI code)...

Urgently is a strong word, I'd rather say ASAP, and definitely before very far away, someday, until "never"...

It looks like you are using the usercallback to implement special requirements to BiT that are currently available .

If you want you could describe your requirements in a new issue as FR and we could think about if and how to implement this in BiT instead...

I think you mean "special requirements to BiT that are currently NOT available" (or did I missed something?).
You seem to refer to the ability to chain multiple profiles. I'll try to open a new FR about it.

@aryoda
Copy link
Contributor Author

aryoda commented Feb 15, 2024

@aryoda

I have "lost" the overall view, do you think there is an open bug in BiT (1.4.4-dev version, above change is not yet released!) that I should fix?

So it seems that the rsync error 23 is no longer treated as an error in 1.4.4-dev instead of 1.4.3.

Yes. I hope this does not cause new problems in your workflow (mainly: user callback script), my goal was to relax this again until we introduce "warnings" as third "take snapshot" result type...

I think you urgently need my proposed FR #1591 (on my list but currently I am focusing on our Qt5 to Qt6 migration of the GUI code)...

Urgently is a strong word, I'd rather say ASAP, and definitely before very far away, someday, until "never"...

OK, I will check the source code next week to estimate how much work (and possibly risks) this implies...

It looks like you are using the usercallback to implement special requirements to BiT that are currently available .
If you want you could describe your requirements in a new issue as FR and we could think about if and how to implement this in BiT instead...

I think you mean "special requirements to BiT that are currently NOT available" (or did I missed something?). You seem to refer to the ability to chain multiple profiles. I'll try to open a new FR about it.

Ah yes, I forgot the word "not" and meant indeed "currently not available" :-)

If you think your requirements help to improve BiT and are also helpful for others a PR is really welcome!

@aryoda
Copy link
Contributor Author

aryoda commented Feb 15, 2024

@jean-christophe-manciot Would it be possible to help me finalizing the requirements for #1591 (esp. how to cope with user callback calls in case of errors). I continue the conversation related to "retries" in #1591 ...

@jean-christophe-manciot

Yes. I hope this does not cause new problems in your workflow (mainly: user callback script), my goal was to relax this again until we introduce "warnings" as third "take snapshot" result type...

No worries. I'll adapt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants