More stats in snapshots list #874

Open
mnesrine opened this issue Mar 10, 2017 · 18 comments

@mnesrine

It would be useful to have more stats that are not currently in the snapshots list, such as:

  • Backup size
  • Backup duration or end date

Thanks :)

@middelink
Member

middelink commented Mar 10, 2017

Would that be the size read from the local system or the size written to the repository? Due to deduplication, those two numbers will not be remotely the same. And what does the latter number mean over time anyway? Say you make two backups in succession: the first backup backed up 100GB and stored 102GB in the repository. The second backup reads, say, 1GB locally but stores only ~50MB. Great numbers now show in your snapshot. But then you "forget" the first backup, and the numbers in the second snapshot still show 50MB... You will file a bug complaining restic is lying to you ^^

And in case you want to suggest that prune has to update the stats in the snapshot data, I need to point out that snapshot IDs are based on the SHA-256 of their data, so any change in the snapshot data will produce a new snapshot ID, adding to the confusion.
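
The point about snapshot IDs can be illustrated with a minimal Python sketch. It assumes a canonical JSON encoding as a stand-in for restic's real on-disk format, and the `added_bytes` field is hypothetical; only the principle (ID = hash of the data, so any edit changes the ID) is taken from the comment above.

```python
import hashlib
import json


def snapshot_id(snapshot: dict) -> str:
    """Hash the serialized snapshot; the hex digest serves as its ID."""
    # Assumption: sorted-key JSON stands in for the real serialization.
    data = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


# Hypothetical snapshot record carrying a stored statistic.
original = {
    "time": "2017-03-10T12:00:00Z",
    "paths": ["/home/user"],
    "added_bytes": 50_000_000,
}

# Simulate a prune that rewrites the stored statistic in place.
updated = dict(original, added_bytes=0)

id_before = snapshot_id(original)
id_after = snapshot_id(updated)

# The two IDs differ even though only one field changed.
print(id_before[:8], id_after[:8])
```

Under this model, updating a statistic after the fact is equivalent to creating a different snapshot.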

Nonetheless, if we do add more stats, I would like to see files/dirs added and maybe aggregated CPU cycles used, peak memory used, blobs "created" and blobs "dropped". The latter two are important to see how good a "fit" the parent snapshot was. With a bad fit, restic will go through all the motions to create a blob and AES-encrypt it, only to drop it in the end because its ID is already in the index.

We could hide all this extra information behind a -l flag ^^

@mnesrine
Author

Indeed, that's a good point. By size I mean the size of the files/dirs on the local system, not in the remote repository.
To get stats about a backup in the remote repository, we use another method, not the snapshots list.

@yhafri

yhafri commented Mar 11, 2017

@middelink interesting observations +1

@mnesrine's idea to add more metadata to the snapshots command can be very useful.
No need to have a complicated solution here, just the bare minimum for now.

We can think of it as something complementary to the --tag option.
Imagine being able to retrieve:

  • the original size of the backup (before deduplication): this info can easily be calculated when restic's comparing local blocks with the server's ones
  • the size of the backup after deduplication: can help compute how much disk saving we've got. Still optional for me, IMHO
  • the local date when the backup was successfully stored: see no challenge here
  • the type of the backup: was it a file or a directory?

And now that we've the --json option in place when listing the snapshots, these metadata can be retrieved and parsed in a nice way (thanks to jq).
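
As a rough illustration of consuming such metadata from JSON output, here is a small Python sketch. The sample is only loosely shaped like snapshot JSON, and the `stats` object with its `size_before_dedup` and `duration_seconds` fields is hypothetical: it shows what the proposal could look like, not an existing restic format.

```python
import json

# Sample data; the "stats" object and its field names are hypothetical.
sample_output = """
[
  {"short_id": "a1b2c3d4", "time": "2017-03-11T09:00:00Z",
   "paths": ["/srv/data"],
   "stats": {"size_before_dedup": 2097152, "duration_seconds": 6015}},
  {"short_id": "e5f6a7b8", "time": "2017-03-12T09:00:00Z",
   "paths": ["/srv/data"]}
]
"""


def summarize(raw: str) -> list:
    """Return (short_id, size_before_dedup, duration_seconds) per snapshot."""
    rows = []
    for snap in json.loads(raw):
        # Snapshots written before stats existed would carry no "stats" key.
        stats = snap.get("stats", {})
        rows.append(
            (
                snap["short_id"],
                stats.get("size_before_dedup"),
                stats.get("duration_seconds"),
            )
        )
    return rows


rows = summarize(sample_output)
for row in rows:
    print(row)
```

Treating the stats as optional keeps older snapshots readable by the same code.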

@middelink I see no conflict when a prune is performed, as nothing has to be updated. This is the archive's metadata. It always has to reflect what was true for a backup when it was first stored.

We could hide all this extra information behind a -l flag ^^

+1 for the -l flag. I just want to get the minimum list of metadata for now (only 3, see above).

@fd0 fd0 added the type: feature enhancement improving existing features label Mar 11, 2017
@fd0
Member

fd0 commented Mar 11, 2017

I like the idea of storing some statistics with the snapshot. Easy things:

  • Size of the data before deduplication
  • Backup duration
  • Number of new files/directories
  • Number of changed files

A bit more computationally expensive, but not too hard to do:

  • Size of the data after deduplication

@yhafri I have some questions:

  • I don't understand what you mean with "the local date when the backup was successfully stored". Each snapshot already has a time stamp, what's the difference here? Do you mean the finish time for the backup?
  • What's "the type of the backup"? We already store the list of things to be saved (what's passed to restic on the command line or via --files-from), so if I have two dirs and one file in there, what type of backup should that be? What's the type of backup for reading from stdin? Why is this information relevant, what do you plan to do with it? Restic can just look at the nodes in the repo to determine if it is a file or directory for each of the backup targets (just run restic ls on the snapshot), so what's the point in having the information again in the snapshot?

FWIW I think all this information shouldn't be part of the plain text snapshots output. It's okay to have it in the JSON output (as users can filter that with jq or whatever), and I've already thought of adding a new command that displays details for a particular snapshot, similar to what git show <commit> does for a commit.

@yhafri

yhafri commented Mar 11, 2017

I don't understand what you mean with "the local date when the backup was successfully stored". Each snapshot already has a time stamp, what's the difference here? Do you mean the finish time for the backup?

Yes, the finish time.

What's "the type of the backup"? We already store the list of things to be saved (what's passed to restic on the command line or via --files-from), so if I have two dirs and one file in there, what type of backup should that be? What's the type of backup for reading from stdin? Why is this information relevant, what do you plan to do with it? Restic can just look at the nodes in the repo to determine if it is a file or directory for each of the backup targets (just run restic ls on the snapshot), so what's the point in having the information again in the snapshot?

I didn't think about this use case.
In general, we don't mix things up when using restic. We only back up one file or dir at a time.
Thus, it's easy to know if it's a file or a directory.

FWIW I think all this information shouldn't be part of the plain text snapshots output. It's okay to have it in the JSON output (as users can filter that with jq or whatever), and I've already thought of adding a new command that displays details for a particular snapshot, similar to what git show does for a commit.

Agreed. It's fine to only have them when using the --json option with snapshots.

@middelink
Member

middelink commented Mar 11, 2017

@fd0 size after dedup?
Did you read my remarks on that? I don't think it is wise to add that to the snapshot information as it needs to be maintained.

Also, what about new blobs vs new-blobs-but-already-there? Or cpu/peak memory impact? I want to be able to observe how much resources taking a backup consumes. (Rationale: currently restic ooms on all my 512MB VMs and about 1/3 on 1GB VMs. Which is kinda ridiculous. So when we start driving memory usage down, I want to be able to tell... The more statistics we have for diagnosing issues, the easier it becomes.)

@yhafri

yhafri commented Mar 11, 2017

So fine to not add size after dedup then.

@fd0
Member

fd0 commented Mar 11, 2017

@middelink After re-reading your post, I think we're talking about different things: What I meant with "size before deduplication" and "size after deduplication" is the intra-snapshot deduplication. And this neither depends on what is already stored in the repo nor does it change over time.

Suppose you're saving two files which contain exactly the same 1 MiB of data, which is saved in one blob in the repo. So "size before deduplication" is 2MiB (sum of all file sizes) and "size after deduplication" is 1MiB. Both numbers do not depend on whether or not the blob is already stored in the repo and will be valid as long as the snapshot is there.
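
The two-file example above can be written down as a short Python sketch. The data model (each file as a list of blob ID and size pairs) and the blob IDs are assumptions made for illustration, not restic's internal structures.

```python
MIB = 1024 * 1024

# Each file is a list of (blob_id, size_in_bytes) pairs; IDs are made up.
snapshot_files = {
    "copy_a.bin": [("blob-1", MIB)],
    "copy_b.bin": [("blob-1", MIB)],  # identical content, so the same blob
}


def size_before_dedup(files: dict) -> int:
    """Sum of all file sizes, counting every blob reference."""
    return sum(size for blobs in files.values() for _, size in blobs)


def size_after_dedup(files: dict) -> int:
    """Sum of sizes of the distinct blobs referenced within this snapshot."""
    unique = {
        blob_id: size for blobs in files.values() for blob_id, size in blobs
    }
    return sum(unique.values())


print(size_before_dedup(snapshot_files) // MIB, "MiB before")
print(size_after_dedup(snapshot_files) // MIB, "MiB after")
```

Neither function looks at any other snapshot or at the repository index, which is why both numbers stay valid for as long as the snapshot exists.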

I think what you were writing about is inter-snapshot deduplication: how many new blobs have been added to the repo which were not there before. This number is valid only at the time the snapshot is made, and changes over time (e.g. when an older snapshot that shares some blobs with a newer snapshot is removed). I agree that it is not reasonable to store this number.

We could rather compute the "added size" of a snapshot on the fly (ok, maybe we should wait for the metadata cache, otherwise it gets really time consuming): Make a list of all blobs that are only referenced by a particular snapshot, and sum the sizes.
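
A minimal Python sketch of that on-the-fly computation follows. It assumes each snapshot can be reduced to the set of blob IDs it references and that blob sizes are known; the snapshot names, blob IDs and sizes are invented for the example.

```python
from collections import Counter


def added_size(target: str, snapshots: dict, blob_sizes: dict) -> int:
    """Sum the sizes of blobs referenced by `target` and no other snapshot."""
    # Count how many snapshots reference each blob (sets count once each).
    refs = Counter(blob for blobs in snapshots.values() for blob in blobs)
    return sum(blob_sizes[b] for b in snapshots[target] if refs[b] == 1)


blob_sizes = {"a": 100, "b": 40, "c": 10}
snapshots = {
    "snap1": {"a", "b"},
    "snap2": {"b", "c"},  # shares blob "b" with snap1
}

# While snap1 exists, only blob "c" is unique to snap2.
before = added_size("snap2", snapshots, blob_sizes)

# After snap1 is forgotten, blob "b" becomes unique to snap2 as well.
remaining = {name: blobs for name, blobs in snapshots.items() if name != "snap1"}
after = added_size("snap2", remaining, blob_sizes)

print(before, after)
```

The result for the same snapshot changes once another snapshot is removed, which is the reason to compute it on demand rather than store it.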

@tamalsaha

@fd0, we would like to see these stats also available with --json

@yhafri

yhafri commented Mar 12, 2017

Adding more stats is a very good idea indeed +1
But let's start with the minimum/basic stats as discussed above, guys.

  • the original size of the backup (before dedup) @mnesrine
  • finish time of the backup @mnesrine @fd0
  • the type of the backup: was it a file or a directory?
  • number of files/dirs in the backup (before dedup) @middelink

From there, we can decide which advanced stats we would like to add.

@tamalsaha

By "finish time of the backup", do you mean the duration of the backup?

@yhafri

yhafri commented Mar 14, 2017

Yes, the backup duration (e.g. 1h 40m 15s) or the finish time (e.g. 2017-03-14 07:34:09).
Given the snapshot's existing timestamp, we can deduce one from the other.
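
A quick Python sketch of that relationship, using the example values from this comment. It assumes the snapshot's existing timestamp marks the start of the backup, which is not confirmed in this thread.

```python
from datetime import datetime, timedelta

# Example values from the comment above.
finish = datetime(2017, 3, 14, 7, 34, 9)
duration = timedelta(hours=1, minutes=40, seconds=15)

# Assumption: the stored snapshot timestamp is the backup's start time.
start = finish - duration

# Storing either the duration or the finish time is enough, given the start.
derived_finish = start + duration
derived_duration = finish - start

print(start, derived_finish, derived_duration)
```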

@middelink
Member

I very much doubt that backup "type" is a useful item to add for the general population. It seems like a highly specific use-case.

Can you give a clearer definition of what you mean by "file"? Just a single file in a backup? Can there be multiple files?
Also, given restic backup -x /home/user/file1, this backup actually has directories, namely /, home and user. So how would you classify it?

(Not trying to be pedantic, but this use-case is so specific that I would like some clarification...)

@yhafri

yhafri commented Mar 15, 2017

@middelink as I've explained before, we never use restic with multiple target dirs at a time. Only one.

But you're right, this is a very specific use case and we can live without it.

@lathspell

Adding the dedup'ed size to the snapshot metadata would

  • help "checking" if the backup worked as intended
  • help estimate how much longer e.g. an external USB drive will last if the daily changes are roughly constant (and no old snapshots are removed)
  • be easy to calculate (if the hash is already present, just add "0", else the file size)

Of course, if the user removes an old snapshot, that size information loses its validity, but I'm sure your users will understand that.
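
The calculation and the drive estimate described above could look roughly like this in Python. It is a sketch under simplifying assumptions: the repository index is modeled as a plain set of IDs, sizes are counted per blob rather than per file, and all names and numbers are made up.

```python
def backup_added_bytes(blobs: list, index: set) -> int:
    """Return the bytes newly stored by a backup; `index` is updated in place."""
    added = 0
    for blob_id, size in blobs:
        if blob_id not in index:
            # Not in the repository yet: store it and count its size.
            index.add(blob_id)
            added += size
        # Already present: contributes 0 bytes.
    return added


def days_until_full(free_bytes: int, daily_added_bytes: int):
    """Estimate remaining days, assuming constant daily growth and no pruning."""
    if daily_added_bytes <= 0:
        return None  # no growth, so no meaningful estimate
    return free_bytes // daily_added_bytes


index = {"a"}  # blob "a" is already in the repository
todays_blobs = [("a", 100), ("b", 40), ("b", 40), ("c", 10)]

added_today = backup_added_bytes(todays_blobs, index)
days_left = days_until_full(1000, added_today)

print(added_today, days_left)
```

As noted, the stored figure describes the repository only at backup time and stops being accurate once older snapshots are removed.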

@mholt
Contributor

mholt commented Apr 20, 2018

I've got an initial implementation of a restic stats command up in #1729. I need people to test it and see if the counts are accurate.

It could probably be expanded to count more things, but I'm starting simple.

@darkdragon-001
Contributor

A lot of information is available by running restic diff against the parent listed in (restic snapshots ID --json). I suggested in #2757 improving restic stats introduced by @mholt. There should be an extensive list there, but some sort of overview in restic snapshots, like the number of changes, would be awesome in order to easily find snapshots where a lot of changes were introduced (and maybe something went wrong).

@aawsome
Contributor

aawsome commented Nov 4, 2022

See also #693.
Note that the solution posted there allows optionally storing the summary information. This allows adding this feature in a backwards-compatible way (as I did in rustic).
