Additional metric: last snapshot date/timestamp (per repository) #256

Gaibhne · 2023-10-04T08:03:38Z

Output of `rest-server --version`

Not relevant.

What should rest-server do differently?

Export the timestamp of the last successful snapshot as part of the Prometheus metrics. Ideally more would be exported as well (I added a few ideas at the end), but the last snapshot timestamp is the most critical.

What are you trying to do? What is your use case?

The Prometheus metrics are a perfect basis for a monitoring system that alerts on backups not running, because they make it possible to monitor the actual result of the backup job. That is much better than, say, having the backup job itself send alerts on failure: if the job doesn't run at all, for example, it might never send out notifications. Watching the REST server's metrics, on the other hand, would always confirm that, everything else aside, the snapshot made it to the repository.

Did rest-server help you today? Did it make you happy in any way?

It's fantastic, and I am currently working on switching a large part of my personal and professional life to back up to a Restic-REST server we run internally (as a rootless Podman service, which is ever so nice) and it's very exciting to have such a clean backup interface. Thank you guys!

Additional metrics that may be useful, some of which I suspect would need the repository's credentials. I am not sure whether the REST server has the capabilities to hook into that. Maybe it could generate metrics while the actual backup command is running and then store them for the metrics export later, since it can't very well open the repository for each metrics request?

- HDD/SSD/whatever storage device metrics (per repository, as we store our repos on separate volumes for better isolation): total size, free size, and used size in bytes, maybe optionally some health data like device errors, if present. Useful for obvious reasons such as alerting on low disk space.
- Last backup metrics such as duration, affected files/dirs, maybe things like delta sizes or total files/bytes represented by a snapshot, to monitor for suspicious changes in usage patterns such as encryption malware on the client system.
- Date and results of the last forget/prune/check commands, such as runtime, deleted snapshots, recovered bytes, repacked bytes and so on.

wojas · 2023-10-05T13:26:24Z

Exporting a metric with the last time a snapshot was written during the lifetime of a process would not be hard to add.

Exporting it for repositories that have not seen a backup since the last rest-server restart would be harder, as it would require scanning the disk for all repositories on startup. We have so far avoided that. This part is probably not necessary to make this useful for monitoring.

wojas · 2023-10-05T13:35:37Z

> Additional metrics that may be useful, some of which I suspect would need the repository's credentials. I am not sure whether the REST server has the capabilities to hook into that. Maybe it could generate metrics while the actual backup command is running and then store them for the metrics export later, since it can't very well open the repository for each metrics request?

The only way to implement this would be for the restic client to store such a report in the repo, but then there is also the question of how much information is too much for a report that rest-server can read. Things like "affected files/dirs" would reveal information about the backup contents that restic currently makes sure to encrypt.

> HDD/SSD/whatever storage device metrics (per repository, as we store our repos on separate volumes for better isolation): total size, free size, and used size in bytes, maybe optionally some health data like device errors, if present. Useful for obvious reasons such as alerting on low disk space.

This could be useful, but doing this per repository would require scanning all repositories, which rest-server currently does not do.

> Last backup metrics such as duration, affected files/dirs, maybe things like delta sizes or total files/bytes represented by a snapshot, to monitor for suspicious changes in usage patterns such as encryption malware on the client system.

The rest-server cannot read this data.

> Date and results of the last forget/prune/check commands, such as runtime, deleted snapshots, recovered bytes, repacked bytes and so on.

The rest-server cannot tell when specific commands were run. Creating a new snapshot has the side effect of creating a new file in the snapshots directory, which makes this easy, but this is not true for these commands.
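Because every new snapshot shows up as a new file in the repository's snapshots directory, the time of the most recent backup can be read from the file system alone, without the repository password. A minimal sketch of that idea, assuming GNU find; the helper name `latest_snapshot_time` is made up for this example, and the file modification time is only an approximation of when the snapshot was written:

```shell
#!/bin/sh
# latest_snapshot_time REPO_DIR
# Hypothetical helper: prints the modification time (Unix seconds) of the
# newest file in REPO_DIR/snapshots, or nothing if there are no snapshots.
# Assumes GNU find, which provides -printf.
latest_snapshot_time() {
    # %T@ prints the mtime as seconds since the epoch with a fractional part;
    # cut keeps only the whole seconds of the newest entry.
    find "$1/snapshots" -type f -printf '%T@\n' 2>/dev/null |
        sort -n | tail -n 1 | cut -d . -f 1
}
```

Calling `latest_snapshot_time /mnt/restic/somerepo` (a placeholder path) would print a single Unix timestamp that can be exported as a gauge.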

This issue is related to #50.

schoentoon · 2023-10-06T23:42:16Z

I actually already have metrics for most of these things, using a simple shell script, a systemd timer, and prometheus-node-exporter picking up the text file the script produces.

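```shell
#!/bin/sh
# Prints Prometheus text-format metrics for every restic repository found
# directly under BACKUP_FOLDER. Relies on GNU find, du, ls and date.

# Keep going after errors and do not trace commands (the shell defaults).
set +e
set +x

NAMESPACE=restic

BACKUP_FOLDER=/mnt/restic

OUTPUT=""

# The unquoted $(find ...) is word-split, so paths must not contain whitespace.
for dir in $(find "${BACKUP_FOLDER}" -maxdepth 1 -mindepth 1 -type d); do
   # Apparent size of the repository in bytes.
   total_size=$(du -bs "${dir}" | cut -f 1)

   # Snapshot files, newest first; sed drops the "total" line of ls -l.
   snapshots_raw=$(ls -t -l --full-time "${dir}/snapshots" | sed 1d)
   # Count files directly so that an empty repository reports 0, not 1.
   snapshots_count=$(ls -1 "${dir}/snapshots" | wc -l)
   lock_count=$(ls -1 "${dir}/locks" | wc -l)
   # Fields 6 and 7 of the long listing are the modification date and time.
   latest_snapshot=$(echo "${snapshots_raw}" | head -n 1 | awk '{ print $6 " " $7 }')
   latest_snapshot_unix=$(date -d "${latest_snapshot}" +"%s")

   OUTPUT="${OUTPUT}${NAMESPACE}_repository_size_bytes{repository=\"${dir}\"} ${total_size}\n"
   OUTPUT="${OUTPUT}${NAMESPACE}_snapshots_count{repository=\"${dir}\"} ${snapshots_count}\n"
   OUTPUT="${OUTPUT}${NAMESPACE}_latest_snapshot_time_seconds{repository=\"${dir}\"} ${latest_snapshot_unix}\n"
   OUTPUT="${OUTPUT}${NAMESPACE}_lock_count{repository=\"${dir}\"} ${lock_count}\n"
done

# printf '%b' expands the \n sequences portably; plain echo only does so in some shells.
printf '%b' "${OUTPUT}" | sort
```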

This does make a fair number of assumptions, however; it won't work with restic repositories in subdirectories, for example. But it has served me very well so far.
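The subdirectory limitation could probably be lifted by searching for the repository layout instead of assuming one repository per top-level directory. A rough sketch, not tested against a real rest-server data directory: `find_repos` is a hypothetical helper that treats any directory containing a `config` file and a `snapshots` directory as a repository.

```shell
#!/bin/sh
# find_repos ROOT
# Hypothetical helper: prints every directory below ROOT that looks like a
# restic repository, i.e. contains a "config" file and a "snapshots" directory.
find_repos() {
    find "$1" -type f -name config | while IFS= read -r cfg; do
        repo=$(dirname "$cfg")
        if [ -d "$repo/snapshots" ]; then
            printf '%s\n' "$repo"
        fi
    done
}
```

The output could feed the metrics loop with `find_repos "${BACKUP_FOLDER}" | while IFS= read -r dir; do ...; done`, which also copes with spaces in paths. Note that this walks every file below ROOT, including the data directories, so it may be slow on large repositories.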

Gaibhne · 2023-10-16T15:04:50Z

> > Additional metrics that may be useful, some of which I suspect would need the repository's credentials. I am not sure whether the REST server has the capabilities to hook into that. Maybe it could generate metrics while the actual backup command is running and then store them for the metrics export later, since it can't very well open the repository for each metrics request?
>
> The only way to implement this would be for the restic client to store such a report in the repo, but then there is also the question of how much information is too much for a report that rest-server can read. Things like "affected files/dirs" would reveal information about the backup contents that restic currently makes sure to encrypt.

The REST server only supplies the 'protocol' and can't tap into the commands themselves, is that correct? I can see how that would be problematic and would probably put such statistics well out of scope. Would there be any way for a client script or similar to (optionally) communicate such data to the server, or even any interest in a solution like that? I was thinking of trying to bridge or include https://github.com/ngosang/restic-exporter with this project, but if there is no real way for the server to hold that data, each client would have to run its own exporter, which doesn't really seem desirable.

> > HDD/SSD/whatever storage device metrics (per repository, as we store our repos on separate volumes for better isolation): total size, free size, and used size in bytes, maybe optionally some health data like device errors, if present. Useful for obvious reasons such as alerting on low disk space.
>
> This could be useful, but doing this per repository would require scanning all repositories, which rest-server currently does not do.

It would solve a large part of our metric/alerting needs, so I would be very much in favor of that. It could be optional, probably even opt-in, as I would think the majority of people don't use metrics.
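Until something like this exists in rest-server, the volume figures can be collected outside of it. A minimal sketch using POSIX `df`; the helper name `volume_metrics` and the metric names are invented for this example, and it assumes the file system name reported by `df` contains no spaces:

```shell
#!/bin/sh
# volume_metrics REPO_DIR
# Hypothetical helper: prints total, used and available bytes of the file
# system that holds REPO_DIR in Prometheus text format. The metric names are
# made up for this example.
volume_metrics() {
    # -P selects the portable one-line-per-file-system format, -k reports KiB.
    # Line 2 of the output holds: name, total, used, available, capacity, mount.
    df -Pk "$1" | awk -v repo="$1" 'NR == 2 {
        printf "restic_volume_size_bytes{repository=\"%s\"} %.0f\n", repo, $2 * 1024
        printf "restic_volume_used_bytes{repository=\"%s\"} %.0f\n", repo, $3 * 1024
        printf "restic_volume_free_bytes{repository=\"%s\"} %.0f\n", repo, $4 * 1024
    }'
}
```

Here "free" means the space available to unprivileged users, which is usually the number worth alerting on. Device health data would need a separate tool and is not covered.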

> > Date and results of the last forget/prune/check commands, such as runtime, deleted snapshots, recovered bytes, repacked bytes and so on.
>
> The rest-server cannot tell when specific commands were run. Creating a new snapshot has the side effect of creating a new file in the snapshots directory, which makes this easy, but this is not true for these commands.

The existing metrics for that could still be improved: a latest snapshot timestamp, for example, would be very helpful. Currently, if I run automated forgetting, it would be hard to distinguish between no backup running and a backup plus one expired snapshot, since both result in the same reported change in snapshot count (±0), right?
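To illustrate why a timestamp resolves that ambiguity: an alert only has to compare the age of the newest snapshot with a threshold, regardless of how many snapshots were added or forgotten in between. A small sketch; `backup_is_stale` is a hypothetical helper, and the timestamps in the usage note are arbitrary example values:

```shell
#!/bin/sh
# backup_is_stale LATEST_EPOCH MAX_AGE_SECONDS [NOW_EPOCH]
# Hypothetical helper: succeeds (exit status 0) if the newest snapshot is
# older than MAX_AGE_SECONDS. NOW_EPOCH defaults to the current time.
backup_is_stale() {
    latest=$1
    max_age=$2
    now=${3:-$(date +%s)}
    # Age of the newest snapshot in seconds, compared with the allowed maximum.
    [ $((now - latest)) -gt "$max_age" ]
}
```

For example, `backup_is_stale 1700000000 86400 1700100000` succeeds because the snapshot is 100000 seconds old, which is more than one day. If a backup adds one snapshot while forget removes another, the count stays the same but the latest timestamp advances, so the check stays quiet.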
