
Print out the backup size when listing snapshots (enhancement) #693

Closed · yhafri opened this issue Dec 10, 2016 · 40 comments · Fixed by #4705

@yhafri commented Dec 10, 2016

Output of restic version

Any.

Expected behavior

Adding an extra column that lists the size of the backup (in bytes) would be very useful.
It would help distinguish between different backups just by checking their size.

$ restic snapshots
ID        Date                 Host        Tags        Directory    Size
--------------------------------------------------------------------------
5b969a0e  2016-12-09 15:10:32  localhost               myfile       390865

Actual behavior

$ restic snapshots
ID        Date                 Host        Tags        Directory
----------------------------------------------------------------------
5b969a0e  2016-12-09 15:10:32  localhost               myfile
@fd0 (Member) commented Dec 10, 2016

Thanks for the suggestion. What would you expect the size to be? Since all data is deduplicated, a "size" for a particular snapshot is not that easy to determine. Would that be the size of all data referenced in that snapshot? Or the data that was not yet stored in the repo when the snapshot was taken (new data)?

fd0 added the feature label Dec 10, 2016

@zcalusic (Member)

This is a very good proposal. The number on the right should be the cumulative size of blobs added to the repo. It is the most interesting quantitative parameter of any backup run.

How much space did my incremental backup waste this night? Oops, it's 10x more than last night; I left some junk somewhere (or forgot to add some excludes), I'd better clean it up. ;)

@yhafri (Author) commented Dec 10, 2016

+1 for @zcalusic's suggestion

@fd0 (Member) commented Dec 11, 2016

The problem with the size of "new" blobs (added by that particular snapshot) is that it becomes less relevant over time, because those blobs will be referenced by later snapshots. In addition, when earlier snapshots are removed, the number of blobs referenced by a particular snapshot will grow.

I think it's valuable to print this information right after the backup is complete, and we can also record it in the snapshot data structure in the repo. I've planned to add some kind of 'detail' view for a particular snapshot, and I think it is a good idea to display the number and size of new blobs there, but in the overview (command snapshots) it's not relevant enough. There, I think restic should display the whole size of a particular snapshot (what you get if you were to restore it), because that doesn't change.
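
To illustrate the "total restore size" notion: it is just the sum of the sizes of all files the snapshot references, regardless of deduplication, as in this minimal Go sketch. The Tree, Node, and loadTree names are hypothetical placeholders for illustration, not restic's actual internals.

package main

import "fmt"

// Hypothetical, simplified types for illustration only; restic's real
// data structures live in its internal packages.
type Node struct {
	Type    string // "file" or "dir"
	Size    uint64 // file size in bytes
	Subtree string // tree ID, set for directories
}

type Tree struct {
	Nodes []Node
}

// loadTree is a stand-in for fetching and decoding a tree blob by ID.
func loadTree(id string) (*Tree, error) {
	// ... fetch from the repository, decrypt, decode ...
	return &Tree{}, nil
}

// restoreSize returns the number of bytes a restore of the given tree
// would write: the sum of all referenced file sizes. Deduplication is
// irrelevant here, because every file is written out in full.
func restoreSize(treeID string) (uint64, error) {
	tree, err := loadTree(treeID)
	if err != nil {
		return 0, err
	}
	var total uint64
	for _, node := range tree.Nodes {
		switch node.Type {
		case "file":
			total += node.Size
		case "dir":
			sub, err := restoreSize(node.Subtree)
			if err != nil {
				return 0, err
			}
			total += sub
		}
	}
	return total, nil
}

func main() {
	size, _ := restoreSize("root-tree-id")
	fmt.Println("restore size:", size, "bytes")
}

Because every tree blob has to be fetched and decoded, doing this for each snapshot on every restic snapshots invocation is exactly the cost discussed above.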

@mgumz commented May 9, 2017

I was instantly reminded of the statistics flag of rdiff-backup (see https://www.systutorials.com/docs/linux/man/1-rdiff-backup-statistics/ ). Sometimes it's nice to see some sort of delta between two snapshots.

@fd0 (Member) commented May 14, 2017

Indeed, but that's a different thing: It's computed live and compares two snapshots. We may add something like that, but doing that for the snapshots overview list is too expensive (at least with the information we have available in the data structures right now).

@bj0 commented Oct 20, 2017

It could be useful to know the size of the data 'unique' to the snapshot vs. the total size (including dedup'd data) of the snapshot.

@alexeymuranov

IMO it would be quite useful to have an idea of how much extra space was used for a new snapshot. This could even be just the physical storage space computed during backup and stored in the snapshot's metadata. If some snapshot is removed, this metadata should then be invalidated in all later snapshots.

I think I would appreciate such a feature even if nothing else is done in this direction. However, an option to recalculate this "extra size" after some previous backups were removed would also be nice. I think this is what BackupLoupe does for Time Machine on macOS. (The deduplication in Time Machine is very basic, but the problem of defining the "size of a snapshot" is the same.)

@rawtaz (Contributor) commented Feb 6, 2018

The most fundamental thing I'd like to know off the bat is how much disk space the contents of snapshot X would consume on the target disk if I restored it.

Preferably I would also be able to get this information for only a subset of the files, e.g. if there were a size command that took the same type of include/exclude options as the restore command, or if the restore command had an option that made it just report statistics like this instead of actually restoring.

@larsks commented Feb 6, 2018

Thanks @rawtaz for pointing me at this issue.

I'm storing backups in metered storage (Backblaze B2). I want to know how much new data I'm creating every time I run a backup. It seems like this ought to be easy to calculate during the backup process; I would be happy if restic would simply log that as part of concluding a backup...but it seems like it might also be useful to store this as an attribute of the snapshot (so it can be queried in the future).

I am not really interested in anything that requires extensive re-scanning of the repository, since that will simply incur additional charges.

@er1z commented Feb 13, 2018

Any news?

@simeydk commented Jul 4, 2018

Hello

I would like to second this suggestion. In addition to 'How big would this snapshot be if I restored it' for any existing snapshot and 'how much did this snapshot add' when a snapshot is created, I have a third suggestion:

It would also help to be able to answer the question: 'By how much would my repo size reduce if I remove the following snapshot(s)?' This would be useful in restic forget --prune --dry-run when deciding whether to drop snapshots. For example, I recently dropped 20 of the 40 snapshots in a repo, and it reduced the size from 1.1GB to 1.0GB. Had I known this would only have saved 100MB, I likely would have kept the older snapshots.

@dimejo (Contributor) commented Jul 4, 2018

@mholt made #1729 to show some stats. Maybe he can chime in to say something about the progress of this PR.

@mholt (Contributor) commented Jul 4, 2018

@dimejo It's done -- just waiting for it to be reviewed/merged. :)

@dev-rowbot commented Apr 18, 2019

Jumping on a really old issue here, but to me there are two important size fields when thinking of snapshots:

  • The snapshot size in storage
  • The restore size

e.g.

$ restic snapshots
ID        Date                 Host        Tags        Directory    Snapshot Size   Restore Size 
--------------------------------------------------------------------------------------------------
5b969a0e  2016-12-09 15:10:32  localhost               myfile       10 MB           57 GB

At least then I could tell how much space a single snapshot is using and how much space I need to perform a restore.

@dimejo (Contributor) commented Apr 18, 2019

As @fd0 already pointed out, printing the size on every invocation of restic snapshots would be pretty expensive. But you can use restic stats to print the size of individual snapshots or the whole repository.
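
restic stats also supports JSON output, which makes it easy to script such checks. Here is a minimal Go sketch that shells out to it; the --mode restore-size flag and the total_size/total_file_count field names reflect my understanding of current restic and should be verified against your version.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
)

// statsResult mirrors the fields restic stats --json is expected to print
// in restore-size mode; verify the exact names against your restic version.
type statsResult struct {
	TotalSize      uint64 `json:"total_size"`
	TotalFileCount uint64 `json:"total_file_count"`
}

func main() {
	// Equivalent to: restic stats --json --mode restore-size 5b969a0e
	// Repository location and password come from the usual RESTIC_* env vars.
	out, err := exec.Command("restic", "stats", "--json",
		"--mode", "restore-size", "5b969a0e").Output()
	if err != nil {
		log.Fatal(err)
	}

	var st statsResult
	if err := json.Unmarshal(out, &st); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("restore size: %d bytes in %d files\n",
		st.TotalSize, st.TotalFileCount)
}

Running the same command with --mode raw-data instead reports (roughly) the deduplicated on-disk size of the data the snapshot references, which is closer to the "snapshot size in storage" column requested above.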

@tomcolley1962 commented Apr 30, 2019

> I think it's valuable to print this information right after the backup is complete, and we can also record it in the snapshot data structure in the repo. I've planned to add some kind of 'detail' view for a particular snapshot, and I think it is a good idea to display the number and size of new blobs there, but in the overview (command snapshots) it's not relevant enough. There, I think restic should display the whole size of a particular snapshot (what you get if you were to restore it), because that doesn't change.

Great idea! Is this enhancement in the queue? The total size of the deduplicated data in the repository would also be helpful in such a synopsis.

@erfansahaf commented Oct 13, 2020

Any update on this feature? It's very useful to be able to see each snapshot's size and its restore size.

@vzool commented Oct 29, 2020

+1

@rawtaz (Contributor) commented Oct 29, 2020

Not at this point. If there are any updates, it'll show in this issue.

@hraban commented Dec 28, 2021

I'd love to see this as well, particularly as a "sanity check" to see if one particular backup perhaps accidentally added some huge files that I don't need backed up (e.g. because I made a mistake in file exclusion rules). And, if so, to figure out which snapshot that was.

Being able to then inspect a snapshot and see exactly which directory is causing the blowup is particularly useful. If you can only compare it against "all other backups, past and future", you can at least use it to find large files that change often and thrash the backup. If you can compare it against "only past snapshots", you can easily discover exactly which file is causing a particular snapshot to have grown so large.

For comparison, here is how macOS's Time Machine does it:

$ tmutil calculatedrift /Volumes/ex1806/Backups.backupdb/my-machine/

2018-06-16-155213 - 2018-06-25-205709
-------------------------------------
Added:         5.3G
Removed:       1.0G
Changed:       5.4G


2018-06-25-205709 - 2018-07-16-160709
-------------------------------------
Added:         3.5G
Removed:       1.6G
Changed:       2.0G

...

Every such block takes a few minutes to calculate on an external USB (spinning 3.5") disk. The rule of thumb on my setup is 1 min/GB changed.

You can drill down into directories with reasonable speed:

$ tmutil uniquesize /Volumes/ex1806/Backups.backupdb/my-machine/2021-12-14-195712/Macintosh\ HD\ -\ Data/Users/hraban/
133.2M /Volumes/ex1806/Backups.backupdb/my-machine/2021-12-14-195712/Macintosh HD - Data/Users/hraban
$ time tmutil uniquesize /Volumes/ex1806/Backups.backupdb/my-machine/2021-12-14-195712/Macintosh\ HD\ -\ Data/Users/hraban/Library/
66.9M /Volumes/ex1806/Backups.backupdb/my-machine/2021-12-14-195712/Macintosh HD - Data/Users/hraban/Library

real	0m5.991s
user	0m0.030s
sys	0m0.140s

This is not the same as "total file size" (i.e.: tmutil uniquesize takes deduplication into consideration):

$ time du -sh /Volumes/ex1806/Backups.backupdb/my-machine/2021-12-14-195712/Macintosh\ HD\ -\ Data/Users/hraban/
164G	/Volumes/ex1806/Backups.backupdb/my-machine/2021-12-14-195712/Macintosh HD - Data/Users/hraban/

real	4m0.598s
user	0m1.000s
sys	0m18.789s

Context, for those unfamiliar with Mac's Time Machine: Time Machine uses filenames as keys and does no content inspection at all. Renaming a file leads to an entirely new copy being stored in the backup. One bit changed in a file (and timestamp updated): same, a full new copy in the backup (the pathological case for Time Machine is a large sqlite3 file with frequent, small changes). It's got some similarities to rsync, if you squint right. On the plus side, the backup target is a regular(ish) directory, so you can open and inspect it with your regular tools.


It would be nice if you could use a (hypothetical) restic equivalent to figure out whether it's actually handling that pathological case well. In the case of frequent minor changes to a large sqlite file: is restic actually able to reuse parts of it from previous snapshots? How much? Or can you already answer this question using existing tools?

@EugenMayer

I'd really like to emphasize how important this feature is. Regular size checks are part of backup reviews, to ensure the backup does not suddenly back up nothing / too much, which can mostly be seen when the backup size goes up or down unreasonably.

Thank you for the effort!

@alphapapa

@EugenMayer FYI, if you happen to be running Restic locally, my restic-runner script optionally outputs the change in repository size after a backup run. It's helped me catch several times when new, large files were backed up that I didn't want backed up. https://github.com/alphapapa/restic-runner

@vic-t commented Mar 13, 2022

It's understood that calculating the size of a snapshot is expensive, so adding it to the snapshots command by default is going to make it extremely slow. Still, there may be situations, such as the ones described by other people here, where this information would be so important to me that I'd be willing to wait even 2 hours for a result. So maybe stats (or a slightly simpler version of it) could in fact be added as a flag to the snapshots command, properly documented as something that should be used only when absolutely necessary.

Having said that, what most people have asked for here is a lot simpler than that. Already today, restic calculates the size of the snapshot after each backup run. Why not simply add this information as a string to the snapshot in the repo? It could then easily be added to the output of the snapshots command as an additional column called "Reported snapshot size".

Sure, if you're going to implement this now, it will look a bit ugly since older snapshots won't yet have this information. Personally, I'd be fine with it.

And thanks for developing restic, it's a great tool.

@AndrewSav

Yes, I agree. Any metadata could be added (by restic developers) to a snapshot during creation, and reading this metadata should not slow anything down. The thing is, it's been 5 years, so I'm not really holding my breath.

@alexeymuranov

> Having said that, what most people have asked for here is a lot simpler than that. Already today, restic calculates the size of the snapshot after each backup run. Why not simply add this information as a string to the snapshot in the repo? It could then easily be added to the output of the snapshots command as an additional column called "Reported snapshot size".

Should it be recalculated each time older snapshots are removed?

@vic-t commented Mar 13, 2022

> Having said that, what most people have asked for here is a lot simpler than that. Already today, restic calculates the size of the snapshot after each backup run. Why not simply add this information as a string to the snapshot in the repo? It could then easily be added to the output of the snapshots command as an additional column called "Reported snapshot size".

> Should it be recalculated each time older snapshots are removed?

No. If it's a matter of wording, let's call it "upload size" or anything else. It's just logged information, and as with any other log, it should not change later down the road.

@alexeymuranov

> No. If it's a matter of wording, let's call it "upload size" or anything else. It's just logged information, and as with any other log, it should not change later down the road.

But after the previous snapshot is deleted, this information becomes meaningless. Worse, unless the previous snapshot's hash is recorded along with this information, there will be no indication that the recorded information is meaningless.

@thedaveCA

I don’t see what value there would be in knowing the upload size, in most cases. I can see a few exceptions.

I back up a database-driven app that uses block storage, similar-ish to restic itself in that blocks are added but a "purge" only happens periodically. It would be useful to spot when a big upload happened: there is basically no value in deleting intermediary snapshots, but when the app does a purge and rebuilds its storage blocks, there is suddenly tons of useless data, and spotting an upload-size spike would make it easy to delete everything older and get a decent amount of space back.

But this is an edge case at best; in most cases the upload size isn't actually going to provide useful information.

Because of restic's nature, the only thing I can see as useful would be a way to propose a deletion and get a value of how much space could be released. I'm unclear whether this can be calculated in a dry run, but there was a post here that seemed to suggest maybe?
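
Conceptually, "how much would forgetting these snapshots free" is a unique-size computation over blob references: sum the sizes of all blobs referenced only by the candidate snapshots. A minimal, hypothetical Go sketch of that idea (not restic's actual prune code) follows.

package main

import "fmt"

// blobRefs maps blob IDs to the set of snapshots referencing them.
// Building it requires walking every snapshot's trees once, which is
// roughly the work restic's prune/stats code already does internally.
type blobRefs map[string]map[string]bool // blobID -> snapshotID -> true

// reclaimable estimates how many bytes would become unreferenced (and
// thus prunable) if the candidate snapshots were forgotten: the sum of
// sizes of blobs referenced *only* by candidates.
func reclaimable(refs blobRefs, blobSize map[string]uint64, candidates map[string]bool) uint64 {
	var total uint64
	for blob, snaps := range refs {
		onlyCandidates := true
		for snap := range snaps {
			if !candidates[snap] {
				onlyCandidates = false
				break
			}
		}
		if onlyCandidates {
			total += blobSize[blob]
		}
	}
	return total
}

func main() {
	refs := blobRefs{
		"blobA": {"snap1": true, "snap2": true},
		"blobB": {"snap1": true},
	}
	sizes := map[string]uint64{"blobA": 4096, "blobB": 1 << 20}
	// Forgetting only snap1 frees blobB but not blobA (still used by snap2).
	fmt.Println(reclaimable(refs, sizes, map[string]bool{"snap1": true}))
}

Note that this number changes whenever other snapshots are added or removed, which is why it has to be computed for a concrete candidate set rather than stored per snapshot.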

@mirabilos

I’m also missing this. I would add two columns, though:

  • “net size”, i.e. the size of the restore area, were I to restore the full snapshot (modulo filesystem cluster size; I’d be fine with just adding the individual files’ sizes, or rounding them up to 512 bytes or 1/2/4/8 KiB)
  • “snapshot size”, i.e. amount of storage added when this snapshot was added, i.e. server-side size of the snapshot minus amount saved due to deduplication from prior snapshots (note that this can and will change when removing prior snapshots, and yes, that is expected)

@mgax commented Apr 21, 2022

> “net size”, i.e. the size of the restore area, were I to restore the full snapshot (modulo filesystem cluster size; I’d be fine with just adding the individual files’ sizes, or rounding them up to 512 bytes or 1/2/4/8 KiB)

I'm also interested in this metric. It's useful as a sanity check:

  • Did this snapshot capture what it was supposed to?
  • How is the dataset size trending over time?

IIUC, it's not a cheap computation to perform in the current repository format, but perhaps the number could be saved on a snapshot when it's created.

@mirabilos commented Apr 21, 2022 via email

@aawsome (Contributor) commented Apr 26, 2022

rustic is able to optionally read and display statistical information if it is stored in a snapshot. It also writes this information. E.g.:

alex-dev@thinkpad:~/rust/rustic$ rustic-rs -r /path/to/repo/ backup src/
enter repository password: 
password is correct
getting latest snapshot...
[00:00:00] ████████████████████████████████████████          6/6
using parent 42b2a5d6
reading index...
[00:00:00] ████████████████████████████████████████          2/2
determining size of backup source...
starting backup...
[00:00:00] ████████████████████████████████████████ 163.67 KiB/163.67 KiB 54.00 MiB/s  (ETA 0s)
Files:       0 new, 0 changed, 43 unchanged
Dirs:        0 new, 0 changed, 12 unchanged
Added to the repo: 0 B
processed 55 nodes, 183.0 kiB
snapshot 384cace0 successfully saved.

alex-dev@thinkpad:~/rust/rustic$ rustic-rs -r /path/to/repo/ snapshots 
enter repository password: 
password is correct
 ID       | Time                | Host     | Tags | Paths                          | Nodes |      Size 
----------+---------------------+----------+------+--------------------------------+-------+-----------
 af253680 | 2022-04-14 23:28:16 | thinkpad | tag2 | /home/alex-dev/rust/rustic/src |    53 | 136.7 kiB 
 867f1b1e | 2022-04-19 17:22:20 | thinkpad | tag2 | /home/alex-dev/rust/rustic/src |    54 | 143.0 kiB 
 6ad2ca6a | 2022-04-26 11:09:32 | thinkpad |      | /home/alex-dev/rust/rustic/src |    55 | 180.4 kiB 
 42b2a5d6 | 2022-04-26 14:38:35 | thinkpad |      | /home/alex-dev/rust/rustic/src |    55 | 183.0 kiB 
 384cace0 | 2022-04-26 14:38:45 | thinkpad |      | /home/alex-dev/rust/rustic/src |    55 | 183.0 kiB 
5 snapshot(s)

alex-dev@thinkpad:~/rust/rustic$ rustic-rs -r /path/to/repo/ snapshots --long 384cace0
enter repository password: 
password is correct
 Snapshot      | 384cace02ac480a1587ed8adaf89d5b25080c96f9d0a9c248d7b4d7000c07969 
 Time          | 2022-04-26 14:38:45 
 Host          | thinkpad 
 Tags          |  
 Paths         | /home/alex-dev/rust/rustic/src 
               |  
 Command       | rustic-rs -r /path/to/repo/ backup src/
 Source        | size: 183.0 kiB / nodes: 55 
               |  
 Files         | new:          0 / changed:          0 / unchanged:         43 
 Trees         | new:          0 / changed:          0 / unchanged:         12 
               |  
 Added to repo | total: 0 B / tree blobs: 0 / data blobs: 0 
 Duration      | Start: 2022-04-26 14:38:45 / End: 2022-04-26 14:38:45 / Duration: 2ms 682us 946ns 

1 snapshot(s)

The JSON format of the snapshot looks like this:

{
  "time": "2022-04-26T14:38:45.559439626+02:00",
  "tree": "64b513428cd244ef52f26b8f30a1758adaa95ef3ff54a9b5cdfd75078c43a6d8",
  "paths": [
    "/home/alex-dev/rust/rustic/src"
  ],
  "hostname": "thinkpad",
  "username": "",
  "uid": 0,
  "gid": 0,
  "tags": [],
  "command": "rustic-rs -r /path/to/repo/ backup src/",
  "backup_start": "2022-04-26T14:38:45.567287699+02:00",
  "backup_end": "2022-04-26T14:38:45.569970645+02:00",
  "files_new": 0,
  "files_changed": 0,
  "files_unchanged": 43,
  "trees_new": 0,
  "trees_changed": 0,
  "trees_unchanged": 12,
  "data_blobs_written": 0,
  "tree_blobs_written": 0,
  "data_added": 0,
  "node_count": 55,
  "size": 187409
}

So, if we decide to let restic save this or some of this information, please use the same JSON attributes!

@MichaelEischer (Member)

Is there a specific reason for not basing the snapshot statistics on the JSON summaryOutput of the backup command?

(See the summaryOutput struct in restic's backup command.) We could take that JSON struct type and just remove the message_type and snapshot_id. That would provide the benefit of having the same format everywhere. The executed command would still have to be stored separately.
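
A rough sketch of that idea, with field names modelled on the JSON shown earlier in this thread rather than on any final implementation:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// SnapshotSummary is a trimmed-down variant of the backup command's
// summaryOutput: message_type and snapshot_id are dropped, everything
// else is stored verbatim inside the snapshot. Field names follow the
// JSON shown earlier in this thread and are illustrative only.
type SnapshotSummary struct {
	FilesNew            uint      `json:"files_new"`
	FilesChanged        uint      `json:"files_changed"`
	FilesUnmodified     uint      `json:"files_unmodified"`
	DirsNew             uint      `json:"dirs_new"`
	DirsChanged         uint      `json:"dirs_changed"`
	DirsUnmodified      uint      `json:"dirs_unmodified"`
	DataBlobs           int       `json:"data_blobs"`
	TreeBlobs           int       `json:"tree_blobs"`
	DataAdded           uint64    `json:"data_added"`
	TotalFilesProcessed uint      `json:"total_files_processed"`
	TotalBytesProcessed uint64    `json:"total_bytes_processed"`
	TotalDuration       float64   `json:"total_duration"`
	BackupStart         time.Time `json:"backup_start"`
	BackupEnd           time.Time `json:"backup_end"`
}

// Snapshot shows the summary embedded as a sub-object, as suggested above.
type Snapshot struct {
	Time     time.Time        `json:"time"`
	Tree     string           `json:"tree"`
	Paths    []string         `json:"paths"`
	Hostname string           `json:"hostname"`
	Summary  *SnapshotSummary `json:"summary,omitempty"`
}

func main() {
	s := Snapshot{
		Time:     time.Now(),
		Tree:     "example-tree-id",
		Paths:    []string{"/home/alex-dev/rust/rustic/src"},
		Hostname: "thinkpad",
		Summary:  &SnapshotSummary{FilesUnmodified: 43, DirsUnmodified: 12},
	}
	out, _ := json.MarshalIndent(s, "", "  ")
	fmt.Println(string(out))
}

Using a pointer with omitempty keeps snapshots written by older restic versions valid: they simply decode with Summary set to nil.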

time and backup_end seem to be redundant?

@aawsome (Contributor) commented Apr 30, 2022

> Is there a specific reason for not basing the snapshot statistics on the JSON summaryOutput of the backup command?

The reason is, I never used that and wasn't aware of it.

> We could take that JSON struct type and just remove the message_type and snapshot_id

dry_run is also kind of unnecessary 😉

> time and backup_end seem to be redundant?

No, in fact we have three times here:

  • the time the command was called
  • the time the backup started, i.e. when the first node is processed (and after initializing like finding parents or reading the index)
  • the time the backup ended. This is of course kind of redundant if the total duration (including initialization) or the duration of the backup is also saved.

Thanks for the hint about the existing JSON structure, I'll change this in rustic!

Additionally, I really think that this comparatively small change would add a huge benefit to restic too!

@MichaelEischer (Member)

> dry_run is also kind of unnecessary 😉

Counting to three is hard ^^ .

> Thanks for the hint about the existing JSON structure, I'll change this in rustic!

My suggestion would be to keep the statistics information in a (sub-)object within the snapshot, if that isn't already the plan.

> the time the backup started, i.e. when the first node is processed (and after initializing like finding parents or reading the index)

What's the use case for separately reporting backup_start and time? To somehow determine how long it took to start the backup?

@mirabilos commented Apr 30, 2022 via email

@aawsome (Contributor) commented Jun 2, 2022

In rustic I now save the stats in a summary substructure, which is an extended version of the summaryOutput used in restic:

{
  "time": "2022-05-31T01:02:39.316727217+02:00",
  "tree": "cc6255675e242cbe8ac51f147acaa75da1c40f27bb9359212ac031c739dd13ff",
  "paths": [
    "/home/alex-dev/rust/rustic/src"
  ],
  "hostname": "thinkpad",
  "username": "",
  "uid": 0,
  "gid": 0,
  "tags": [],
  "summary": {
    "files_new": 1,
    "files_changed": 23,
    "files_unmodified": 21,
    "dirs_new": 0,
    "dirs_changed": 12,
    "dirs_unmodified": 0,
    "data_blobs": 5,
    "tree_blobs": 9,
    "data_added": 55059,
    "data_added_packed": 16541,
    "data_added_files": 39915,
    "data_added_files_packed": 12156,
    "data_added_trees": 15144,
    "data_added_trees_packed": 4385,
    "total_files_processed": 45,
    "total_dirs_processed": 12,
    "total_bytes_processed": 193974,
    "total_dirsize_processed": 20470,
    "total_duration": 0.397277247,
    "command": "rustic-rs -r /tmp/repo backup src/",
    "backup_start": "2022-05-31T01:02:39.323038480+02:00",
    "backup_end": "2022-05-31T01:02:39.714004464+02:00",
    "backup_duration": 0.390965984
  }
}

The added fields in comparison to the summaryOutput struct are:

  • "data_added_packed": like "data_added", but the size in the pack, i.e. the (maybe) compressed and encrypted size
  • "data_added_files": like "data_added", but only file contents
  • "data_added_files_packed": like "data_added_packed", but only file contents
  • "data_added_trees": like "data_added", but only trees
  • "data_added_trees_packed": like "data_added_packed", but only trees
  • "total_dirs_processed": like "total_files_processed", but the dirs
  • "total_dirsize_processed": the size of the processed trees
  • "command": the command called
  • "backup_start": start of the actual backup process
  • "backup_end": end of the actual backup process
  • "backup_duration": "backup_end" - "backup_start" (yes, is somewhat redundant, but so is total_duration which is "backup_end" - "time")

IMO those are all statistically relevant data points worth keeping in a snapshot.

The use case for having both "backup_start" and "time" is that you can determine the "warm-up time" (i.e. index reading, finding the parent, etc.). Moreover, you know that data modified before "backup_start" is always contained in the backup. From a backup point of view, "backup_start" is more interesting than the time when the command was started, but as that is already saved in "time", I decided to add it as another statistical data point.

@aawsome (Contributor) commented Nov 4, 2022

Actually, I think I forgot a quite important field: the version of the program called to do the backup. This allows, e.g., identifying which snapshots may be affected by a bug discovered a posteriori, or identifying snapshots (and tree metadata) that need migration if changes to these structs are made.

@MichaelEischer (Member)

I've implemented a first draft in #4705.
