Additional metric: last snapshot date/timestamp(per repository) #256

Open
Gaibhne opened this issue Oct 4, 2023 · 4 comments

Comments

@Gaibhne

Gaibhne commented Oct 4, 2023

Output of rest-server --version

Not relevant.

What should rest-server do differently?

Export the timestamp of the last successful snapshot (and ideally more, I added a few ideas at the end, but last snapshot timestamp is most critical) as part of the Prometheus metrics.

What are you trying to do? What is your use case?

The Prometheus metrics are a perfect basis for a monitoring system that alerts when backups stop running, because they let us monitor the actual result of the backup job. That is much better than, say, having the backup job itself send alerts on failure: if the job doesn't run at all, for example, it might never send out a notification. Watching the REST server's metrics, on the other hand, would always confirm that, everything else aside, the snapshot made it to the repository.
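
To illustrate the kind of check such a metric would enable, here is a minimal sketch of the staleness condition. The helper name `snapshot_is_stale` and the 26-hour threshold in the usage comment are assumptions made for this example; nothing here is part of rest-server.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the last snapshot is
# older than the allowed age.
#   $1  Unix timestamp of the last snapshot
#   $2  current Unix timestamp
#   $3  maximum allowed age in seconds
snapshot_is_stale() {
    last_snapshot=$1
    now=$2
    max_age=$3
    [ $(( now - last_snapshot )) -gt "${max_age}" ]
}

# Usage (assumed threshold: a daily backup that is more than 26 hours old):
# if snapshot_is_stale "${last_snapshot_unix}" "$(date +%s)" $(( 26 * 3600 )); then
#     echo "backup is stale" >&2
# fi
```

In a real Prometheus setup the same comparison would more likely live in an alerting rule than in a script; the shell form only shows the logic.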

Did rest-server help you today? Did it make you happy in any way?

It's fantastic, and I am currently working on switching a large part of my personal and professional life to back up to a Restic-REST server we run internally (as a rootless Podman service, which is ever so nice) and it's very exciting to have such a clean backup interface. Thank you guys!

Additional metrics that may be useful, some of which I suspect would need the repository's credentials. I am not sure if the REST server would have the capabilities to hook into that. Maybe it could generate metrics while the actual backup command is running and then store them for the metrics export later, since it can't very well open the repository for each metrics request?

  • HDD/SSD/Whatever storage device metrics (per repository, as we store our repos on separate volumes for better isolation) - total size, free size, used size in bytes, maybe optionally some health data like device errors if present? Useful for obvious reasons such as alerting on low disk space.
  • Last backup metrics such as duration, affected files/dirs, maybe things like delta sizes or total files/bytes represented by a snapshot, to monitor for suspicious changes in usage patterns such as encryption malware on the client system.
  • Date and results of the last forget/prune/check commands, such as runtime, deleted snapshots, recovered bytes, repacked bytes and so on.
@wojas
Contributor

wojas commented Oct 5, 2023

Exporting a metric with the last time a snapshot was written during the lifetime of a process would not be hard to add.

Exporting it for repositories that have not seen a backup since the last rest-server restart would be harder, as it would require scanning the disk for all repositories on startup. We have so far avoided that. This part is probably not necessary to make this useful for monitoring.

@wojas
Contributor

wojas commented Oct 5, 2023

> Additional metrics that may be useful; some of which I suspect would need the repositories credentials. I am not sure if the REST server would have the capabilities to hook into that. Maybe it could generate metrics during the running of the actual backup command and then store them for the metrics export later, since it can't very well open the repository for each metrics request ?

The only way to implement this would be for the restic client to store such a report in the repo, but then there is also the consideration of how much information is too much for a report that rest-server can read. Things like "affected files/dirs" would give information about the backup contents that restic is currently making sure to encrypt.

> HDD/SSD/Whatever storage device metrics (per Repository, as we store our Repos on separate volumes for better isolation) - total size, free size, used size in bytes, maybe optionally some health data like device errors if present ? Useful for obvious reasons such as alerting on low disk space.

This could be useful, but doing this per repository would require scanning all repositories, which rest-server currently does not do.
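
For a single, already known path, the volume figures themselves are cheap to read. The following is only a sketch: the function name and the `restic_volume_*` metric names are made up for this example, and it assumes that neither the device name nor the mount point contains whitespace.

```shell
#!/bin/sh
# Hypothetical helper: print total/used/free bytes of the filesystem that
# holds the given path, in the Prometheus text format.
emit_volume_metrics() {
    path=$1
    # `df -P -k` is the POSIX output format in 1024-byte units. Line 2 holds:
    # filesystem, total, used, available, capacity, mount point.
    set -- $(df -P -k "${path}" | awk 'NR == 2 { print $2, $3, $4 }')
    printf 'restic_volume_size_bytes{path="%s"} %s\n' "${path}" $(( $1 * 1024 ))
    printf 'restic_volume_used_bytes{path="%s"} %s\n' "${path}" $(( $2 * 1024 ))
    printf 'restic_volume_free_bytes{path="%s"} %s\n' "${path}" $(( $3 * 1024 ))
}

# Usage: emit_volume_metrics /mnt/restic
```

The expensive part mentioned above is not this call but discovering which repositories exist in the first place.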

> Last backup metrics such as duration, affected files/dirs, maybe things like delta sizes or total files/bytes represented by a snapshot to monitor for suspicious changes in usage patterns such as encryption malware on the client system.

The rest-server cannot read this data.

> Date and results of last forget/prune/check commands such as runtime, deleted snapshots, recovered bytes, repacked bytes and so on.

The rest-server cannot tell when specific commands were run. Creating a new snapshot has the side effect of creating a new file in the snapshots directory, which makes this easy, but this is not true for these commands.
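
As a sketch of how that side effect can be observed from outside the server, the newest file in a repository's `snapshots` directory can be looked up by modification time. The function name is hypothetical and the example relies on GNU find (`-printf`); it is not how rest-server itself does or would do it.

```shell
#!/bin/sh
# Hypothetical helper: print the modification time (Unix seconds) of the
# newest file in <repo>/snapshots, or nothing if there is no snapshot file.
latest_snapshot_time() {
    repo=$1
    # %T@ prints the mtime as seconds with a fractional part (GNU find);
    # sort numerically, keep the newest, and drop the fraction.
    find "${repo}/snapshots" -type f -printf '%T@\n' \
        | sort -n | tail -n 1 | cut -d . -f 1
}

# Usage: latest_snapshot_time /mnt/restic/myrepo
```

The file's modification time reflects when the snapshot file was written on the server side, not the time recorded inside the encrypted snapshot.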


This issue is related to #50.

@schoentoon

I actually have metrics for most of these things already using a simple bash script, a systemd timer and prometheus-node-exporter picking up the textfile produced by it.

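#!/bin/sh
# Prints per-repository metrics in the Prometheus text format.
# Relies on GNU du, ls and date.

set +e
set +x

NAMESPACE=restic

BACKUP_FOLDER=/mnt/restic

# Assumes every direct subdirectory is a repository and that paths contain no whitespace.
for dir in $(find "${BACKUP_FOLDER}" -maxdepth 1 -mindepth 1 -type d); do
   total_size=$(du -bs "${dir}" | cut -f 1)

   # Newest first; sed drops the "total" line that ls -l prints.
   snapshots_raw=$(ls -t -l --full-time "${dir}/snapshots" | sed 1d)
   # grep -c counts non-empty lines, so an empty directory yields 0 rather than 1.
   snapshots_count=$(echo "${snapshots_raw}" | grep -c .)
   lock_count=$(ls -1 "${dir}/locks" | wc -l)

   printf '%s_repository_size_bytes{repository="%s"} %s\n' "${NAMESPACE}" "${dir}" "${total_size}"
   printf '%s_snapshots_count{repository="%s"} %s\n' "${NAMESPACE}" "${dir}" "${snapshots_count}"
   printf '%s_lock_count{repository="%s"} %s\n' "${NAMESPACE}" "${dir}" "${lock_count}"

   # Only report a timestamp when there is a snapshot; GNU date would not fail on an empty string.
   if [ "${snapshots_count}" -gt 0 ]; then
      # Fields 6 and 7 of ls --full-time are the modification date and time.
      latest_snapshot=$(echo "${snapshots_raw}" | head -n 1 | awk '{ print $6 " " $7 }')
      latest_snapshot_unix=$(date -d "${latest_snapshot}" +"%s")
      printf '%s_latest_snapshot_time_seconds{repository="%s"} %s\n' "${NAMESPACE}" "${dir}" "${latest_snapshot_unix}"
   fi
done | sort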

This does make a fair number of assumptions, however; it won't work with restic repositories in subdirectories, for example. But this has served me very well so far.
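
One possible way to lift the subdirectory limitation, sketched under the assumption that a repository can be recognized by a `config` file sitting next to a `snapshots` directory (the function name is made up for this example):

```shell
#!/bin/sh
# Hypothetical helper: list every directory below the given root that looks
# like a restic repository, at any depth.
find_repositories() {
    root=$1
    find "${root}" -type f -name config | while IFS= read -r cfg; do
        repo=$(dirname "${cfg}")
        # Only accept directories that also contain a snapshots directory.
        if [ -d "${repo}/snapshots" ]; then
            printf '%s\n' "${repo}"
        fi
    done | sort
}

# Usage: find_repositories /mnt/restic
```

This walks the whole tree, including the data directories, so on large repositories it is noticeably more expensive than the fixed-depth loop.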

@Gaibhne
Author

Gaibhne commented Oct 16, 2023

> > Additional metrics that may be useful; some of which I suspect would need the repositories credentials. I am not sure if the REST server would have the capabilities to hook into that. Maybe it could generate metrics during the running of the actual backup command and then store them for the metrics export later, since it can't very well open the repository for each metrics request ?
>
> The only way to implement this would be for the restic client to store such a report in the repo, but then there is also the consideration of how much information is too much for a report that rest-server can read. Things like "affected files/dirs" would give information about the backup contents that restic is currently making sure to encrypt.

The REST server only implements the protocol and can't tap into the commands themselves, is that correct? I can see how that would be problematic and would probably put such statistics severely out of scope. Would there be any way for a client script or similar to (optionally) communicate such data to the server, or even interest in a solution like that? I was thinking of trying to bridge or include https://github.com/ngosang/restic-exporter with this project, but if there is no real way for the server to hold that data, each client would have to run its own exporter, which doesn't really seem desirable.

> > HDD/SSD/Whatever storage device metrics (per Repository, as we store our Repos on separate volumes for better isolation) - total size, free size, used size in bytes, maybe optionally some health data like device errors if present ? Useful for obvious reasons such as alerting on low disk space.
>
> This could be useful, but doing this per repository would require scanning all repositories, which rest-server currently does not do.

It would solve a large part of our metric/alerting needs, so I would be very much in favor of that. It could be optional, probably even opt-in, as I would think the majority of people don't use metrics.

> > Date and results of last forget/prune/check commands such as runtime, deleted snapshots, recovered bytes, repacked bytes and so on.
>
> The rest-server cannot tell when specific commands were run. Creating a new snapshot has the side effect of creating a new file in the snapshots directory, which makes this easy, but this is not true for these commands.

The existing metrics for that could still be improved - the latest snapshot timestamp, for example, would be very helpful. Currently, if I run automated forgetting, it would be hard to distinguish between no backup running and a backup plus one expired snapshot, since both would result in the same reported change in snapshot count (±0), right?
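
To make the ambiguity concrete, here is a small sketch that compares two observations of a repository. The function name and its arguments are hypothetical; the point is only that the timestamp separates the two cases when the count does not.

```shell
#!/bin/sh
# Hypothetical helper: describe what happened between two observations.
#   $1 previous snapshot count   $2 previous latest-snapshot timestamp
#   $3 current snapshot count    $4 current latest-snapshot timestamp
describe_change() {
    prev_count=$1
    prev_ts=$2
    cur_count=$3
    cur_ts=$4
    if [ "${cur_ts}" -gt "${prev_ts}" ]; then
        echo "new snapshot (count ${prev_count} -> ${cur_count})"
    else
        echo "no new snapshot (count ${prev_count} -> ${cur_count})"
    fi
}

# Both calls see an unchanged count of 7, but only the first saw a backup:
# describe_change 7 1700000000 7 1700086400
# describe_change 7 1700000000 7 1700000000
```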
