Request: analyze which files/directories are using the most storage #8074

Open
MatthewL246 opened this issue Feb 6, 2024 · 2 comments

@MatthewL246

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes.

Is this a BUG / ISSUE report or a QUESTION?

Feature request.

System information

N/A

Feature request

I think it would be useful if Borg could generate a list showing which files and directories have used the most storage space (after compression and deduplication) in a repo within a certain time period, such as the last month. This would help find directories that are wasting space in the repo because the user accidentally forgot to exclude them.

My inspiration for this is the git-filter-repo --analyze option, which creates a report of which files in a Git repo have used the most space throughout the repo's history. A borg analyze command could look something like that.

Example `git-filter-repo` analysis for the Borg repo
=== All directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
   744008017   26787700 <present>  <toplevel>
   517286397   12653532 <present>  src/borg
   517286397   12653532 <present>  src
   104691374   10788204 <present>  docs
    11934188    5279538 <present>  docs/internals
    16803244    2387797 <present>  src/borg/algorithms
   115776607    2116171 <present>  src/borg/testsuite
    13993693    1921949 <present>  src/borg/algorithms/zstd/lib
    13993693    1921949 <present>  src/borg/algorithms/zstd
    80398937    1552724 <present>  borg
     2621210    1193969 <present>  docs/misc
    11735486     893532 <present>  docs/man
     9106300     720133 <present>  docs/usage
     3752583     686089 <present>  src/borg/algorithms/zstd/lib/compress
     5965792     479344 <present>  src/borg/algorithms/zstd/lib/legacy
    15161225     435674 <present>  attic
     1631753     358329 <present>  docs/misc/asciinema
    17886036     333528 <present>  borg/testsuite
     7417804     322251 <present>  src/borg/helpers
     4493375     267651 2013-07-09 darc
     1256006     246006 <present>  src/borg/algorithms/zstd/lib/common
     1535426     240287 <present>  src/borg/algorithms/xxh64
     9980617     239345 <present>  src/borg/archiver
    10038085     212145 <present>  src/borg/testsuite/archiver
     1234524     204819 <present>  src/borg/algorithms/zstd/lib/decompress
     7281292     200862 <present>  src/borg/crypto
     2799285     163539 <present>  scripts
      747157     155625 <present>  src/borg/algorithms/lz4/lib
      747157     155625 <present>  src/borg/algorithms/lz4
     2624369     141717 <present>  scripts/shell_completions
      862668     125015 <present>  src/borg/algorithms/zstd/lib/dictBuilder
      967385     109191 2019-05-13 src/borg/_msgpack
      987519     103475 2010-10-27 dedupestore
     1502405      93505 <present>  src/borg/platform
     1524670      74299 <present>  scripts/shell_completions/zsh
     2475851      59442 <present>  attic/testsuite
      596814      46152 <present>  docs/deployment
      407120      39587 <present>  docs/usage/general
      874383      39206 <present>  scripts/shell_completions/fish
      225316      28212 <present>  scripts/shell_completions/bash
       92775      27211 <present>  docs/_static
      189831      26596 2010-03-01 dedupstore
      414837      25466 <present>  .github
      410788      24095 <present>  .github/workflows
      142992      23397 2017-05-02 src/borg/_crc32
       69599      19679 2021-01-28 src/borg/algorithms/blake2
       90056      19502 2016-01-24 borg/support
      354626      18382 <present>  src/borg/cache_sync
       77575      16404 <present>  src/borg/algorithms/zstd/lib/deprecated
       84075      15074 2020-12-21 .travis
      222060      12566 <present>  src/borg/algorithms/msgpack
       41389      11046 2021-01-28 src/borg/algorithms/blake2/ref
       28210       8642 <present>  src/borg/blake2
       17281       8199 <present>  requirements.d
       70103       8080 <present>  docs/borg_theme/css
       70103       8080 <present>  docs/borg_theme
       65232       6249 2015-10-12 docs/_themes
       40066       5064 <present>  deployment/windows
       40066       5064 <present>  deployment
      104163       4258 2013-07-09 darc/testsuite
       11638       3982 <present>  docs/3rd_party
       53338       3683 2015-10-12 docs/_themes/local
        7968       3133 2022-02-27 docs/3rd_party/blake2
       11894       2566 2015-05-13 docs/_themes/attic
       45939       2393 2015-10-12 docs/_themes/local/static
        9171       1553 2015-05-13 docs/_themes/attic/static
        2012        765 <present>  scripts/fuzz-cache-sync
        1608        735 <present>  scripts/make-testdata
        3235        661 2010-10-31 doc
        1032        608 <present>  docs/_templates
        1530        451 2022-02-26 docs/3rd_party/zstd
         328        269 2013-06-24 fake_pyrex
         231        177 2013-06-24 fake_pyrex/Pyrex
         204        142 2013-06-24 fake_pyrex/Pyrex/Distutils
         266        124 <present>  scripts/fuzz-cache-sync/testcase_dir
         614        117 <present>  docs/3rd_party/msgpack
        1311        110 2022-02-26 docs/3rd_party/lz4
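
A rough approximation of such a report can be sketched by aggregating item sizes per directory. The sketch below is only an illustration: it assumes JSON-lines input in which every item has a `path` key and an optional `size` key (the shape that `borg list --json-lines` is assumed to produce), and the function name `directory_sizes` is hypothetical. The sizes it sums are original file sizes, so it does not account for compression or deduplication.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def directory_sizes(json_lines: Iterable[str]) -> Dict[str, int]:
    """Sum item sizes for every parent directory (hypothetical helper)."""
    totals = defaultdict(int)
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        parts = item["path"].split("/")
        # Charge the item's size to each parent directory, not to the item itself.
        for i in range(1, len(parts)):
            totals["/".join(parts[:i])] += item.get("size", 0)
    return dict(totals)


# Made-up sample input; the "path" and "size" keys are assumptions.
sample = [
    '{"path": "src/borg/archive.py", "size": 300}',
    '{"path": "src/borg/cache.py", "size": 200}',
    '{"path": "docs/index.rst", "size": 50}',
]
report = directory_sizes(sample)
for name, size in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{size:12d} {name}")
```

Reading the lines from a file or from standard input instead of the `sample` list would work the same way, since the function accepts any iterable of strings.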

It would also be interesting to see a feature that does something similar for "time spent backing up" instead of storage used, although I don't know if that would be feasible.

@ThomasWaldmann
Member

ThomasWaldmann commented Feb 6, 2024

Borg does not yet have such a feature, but I guess it would be possible to implement the space-usage analysis.

It is not possible to analyse the time spent backing up an individual file or directory: we only have the overall backup time for an archive, with no finer-grained timing data.

Implementation notes:

  • "within a certain time period": Borg already has CLI options for selecting archives (like -a, --last N, etc.); these can be reused or extended.
  • This operation is relatively expensive:
    O(N_archives_considered * archive_size)
  • Because of deduplication, a meaningful space-usage analysis is not trivial. The implementation should make sure its results actually make sense and are useful.
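
To illustrate why deduplication complicates the analysis, here is a minimal sketch of one possible attribution rule. It is not Borg code: the input shape (pairs of a file path and a list of `(chunk_id, stored_size)` tuples, collected over all archives considered) and the function name `attribute_chunk_space` are assumptions made for illustration. Each chunk's stored size is split evenly among all references to it, so the totals add up to the real deduplicated size. Other rules, such as charging a chunk entirely to the first file that references it, would produce different rankings.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Chunk = Tuple[str, int]  # (chunk_id, stored_size_in_bytes), hypothetical shape


def attribute_chunk_space(items: Iterable[Tuple[str, List[Chunk]]]) -> Dict[str, float]:
    """Split each chunk's stored size evenly among the items referencing it."""
    items = list(items)
    # Pass 1: count how many times each chunk is referenced.
    refcount = Counter(chunk_id for _, chunks in items for chunk_id, _ in chunks)
    # Pass 2: charge each item (and every parent directory) its share.
    usage = defaultdict(float)
    for path, chunks in items:
        parts = path.split("/")
        for chunk_id, size in chunks:
            share = size / refcount[chunk_id]
            for i in range(1, len(parts) + 1):
                usage["/".join(parts[:i])] += share
    return dict(usage)


# Made-up example: two identical files share chunks c1 and c2.
items = [
    ("home/a/big.iso", [("c1", 100), ("c2", 50)]),
    ("home/b/copy.iso", [("c1", 100), ("c2", 50)]),
    ("home/b/notes.txt", [("c3", 10)]),
]
usage = attribute_chunk_space(items)
for path, size in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{size:10.1f}  {path}")
```

Both passes visit every chunk reference of every item in every archive considered, which matches the O(N_archives_considered * archive_size) cost noted above.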

@MatthewL246
Author

Since it sounds like individual file timing isn't implemented, I made a quick Python script that ranks directories by backup time, in case anyone else finds it useful. It requires a timestamped backup log, which can be generated with borg create --list ... | ts -s "%.s" | tee borg_log.txt.

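from collections import defaultdict

# Accumulated backup time in seconds for every path prefix (directory or file).
path_backup_times = defaultdict(float)

# Item flags to count; see
# https://borgbackup.readthedocs.io/en/latest/usage/create.html#item-flags
COUNTED_FLAGS = {"A", "M", "U", "C", "E"}

with open("borg_log.txt", "r") as file:
    previous_timestamp = 0.0
    for line in file:
        # Expected line format: "<elapsed seconds> <flag> <path>".
        # maxsplit=2 keeps whitespace inside the path intact.
        parts = line.rstrip("\n").split(maxsplit=2)
        if len(parts) >= 3:
            timestamp = float(parts[0])
            file_flag = parts[1]
            file_path = parts[2]

            if file_flag in COUNTED_FLAGS:
                # The time since the previous log line is charged to this
                # item and to each of its parent directories.
                backup_time = timestamp - previous_timestamp
                path_components = file_path.split("/")
                for i in range(1, len(path_components) + 1):
                    component = "/".join(path_components[:i])
                    path_backup_times[component] += backup_time

            previous_timestamp = timestamp

# Show the 20 paths with the largest accumulated backup time.
sorted_paths = sorted(path_backup_times.items(), key=lambda x: x[1], reverse=True)[:20]

for rank, (path, backup_time) in enumerate(sorted_paths, start=1):
    print(f"{rank}. {path} ({round(backup_time)}s)")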
