Add script to send mail in case btrfs issues were detected #107

ximion · 2022-03-19T01:40:37Z

Hi!
This PR adds an extremely basic script that runs btrfs device stats --check on all btrfs filesystems every hour and sends an email to a user-defined address (root in most cases) if any issues are found.
This should work much like the mdadm daemon feature that also sends mail when one of the RAID members is about to fail.
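
As a rough illustration of the approach described above (not the script from this PR), the check could be sketched as follows. The function names `list_btrfs_mounts`, `check_mountpoint`, and `check_all` and the mail layout are assumptions made for this example; the only external commands relied on are `btrfs device stats --check`, `findmnt -t btrfs`, and a `sendmail`-compatible binary.

```sh
# Sketch only: function names and mail layout are illustrative assumptions,
# not the contents of the script in this PR.

# Print the mount point of every mounted btrfs filesystem, one per line.
list_btrfs_mounts() {
    findmnt -n -l -t btrfs -o TARGET
}

# Check one mount point ($1). Prints a mail-ready report and returns 1 if
# btrfs reports a problem; prints nothing and returns 0 otherwise.
check_mountpoint() {
    mnt=$1
    # --check makes btrfs exit non-zero when any error counter is non-zero
    if stats=$(btrfs device stats --check "$mnt" 2>&1); then
        return 0
    fi
    printf 'Subject: btrfs errors detected on %s (%s)\n\n%s\n' \
        "$mnt" "$(uname -n)" "$stats"
    return 1
}

# Check every btrfs mount point and mail a report for each failing one.
# $1 is the recipient address.
check_all() {
    rcpt=$1
    list_btrfs_mounts | while IFS= read -r mnt; do
        report=$(check_mountpoint "$mnt") ||
            printf 'To: %s\n%s\n' "$rcpt" "$report" | sendmail -t
    done
}
```

Calling `check_all root` from an hourly timer or cron job would mail one report per failing mount point. In this sketch, a filesystem that is mounted in several places is reported once per mount point.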

A feature like this can be very useful for smaller setups where the admin still would like to receive an email in case a disk in a btrfs RAID array fails.
This also is likely the billionth time someone has written such a script, so putting a version in one place where it can be shared and improved seemed like a good idea, and btrfsmaintenance seems to be the perfect place to add such a feature.

Thanks for considering this PR!

eku · 2022-03-19T18:13:50Z

I suggest a cron job, cause cron knows how to send mails.

MAILTO=admin@myserver.com
@hourly /sbin/btrfs device stats /data | grep -vE ' 0$'

ximion · 2022-03-19T19:45:46Z

> I suggest a cron job, cause cron knows how to send mails.
>
> MAILTO=admin@myserver.com
> @hourly /sbin/btrfs device stats /data | grep -vE ' 0$'

Doing that would result in:

An email without a clear subject on what the issue was
An email that possibly doesn't have enough information to get an overview of the issue
Spam every hour in case there was a failure event, instead of once every day / per reboot
The user having to manually add an entry like this for each btrfs mountpoint, instead of solving this once for all the btrfs filesystems on the machine
The user actually having to use cron and configure it (maybe systemd timers are actually preferred)

So, I still see good reasons to have the extra script for this :-)
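
The "once every day / per reboot" behaviour from the list above could be sketched with a marker file under /run, which is cleared on every reboot. The path is the one mentioned elsewhere in this thread; the names `should_notify` and `mark_notified` and the 24-hour window are assumptions made for this example.

```sh
# Sketch only: function names and the 24-hour window are assumptions.
MARKER=${MARKER:-/run/btrfs-issue-mail-sent}

# Return 0 if a notification should be sent now, 1 if one was already
# sent since the last boot and less than 24 hours ago.
should_notify() {
    # No marker: nothing has been sent since boot (/run does not persist).
    [ -e "$MARKER" ] || return 0
    # find prints the path only if the marker is at least 24 hours old.
    [ -n "$(find "$MARKER" -mtime +0 2>/dev/null)" ]
}

# Record that a notification went out just now.
mark_notified() {
    touch "$MARKER"
}

# Intended use (send_report stands for whatever delivers the mail):
#   if should_notify; then send_report && mark_notified; fi
```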

sten0 · 2022-03-19T20:04:22Z

> I suggest a cron job, cause cron knows how to send mails.

Which cron implementation can do this without an MTA? When I investigated this, I discovered that Fedora now appears to log cron output to syslog (and by now, maybe journald rather than syslog) rather than piping output to the MTA; using journald might also be problematic, because not all systems have adequate persistent journal retention policies. I like the idea of using a file (/run/btrfs-issue-mail-sent), and I wonder if this idea could be extended. @ximion, what do you think about the following approach (pros, cons, etc):

Poll btrfs stats on an hourly basis, and dump it to a file. Limit notification emails similarly to the logic you've proposed, but send a follow up email if the rate of errors rapidly increases.

The reason I wonder about this approach is the following case: one disk begins to fail rapidly, and the rate of failed reads (or failed writes) is increasing hour by hour. Meanwhile, the firmware lies about SMART data while claiming everything is fine.
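
A minimal sketch of the "follow up if errors increase" idea, assuming the plain-text output of `btrfs device stats` (one `[device].counter value` pair per line). The state file, the function names `sum_errors` and `error_delta`, and any threshold applied to the result are hypothetical.

```sh
# Sketch only: state file handling and function names are assumptions.

# Sum all counters from `btrfs device stats` output read on stdin.
# Each input line looks like "[/dev/sda1].read_io_errs   0".
sum_errors() {
    awk '{ total += $NF } END { print total + 0 }'
}

# Compare the current total with the one stored in a state file, store the
# new total, and print the increase since the previous poll.
# $1 = state file, $2 = current total
error_delta() {
    prev=0
    [ -r "$1" ] && read -r prev < "$1"
    prev=${prev:-0}
    printf '%s\n' "$2" > "$1"
    echo $(( $2 - prev ))
}

# Intended use from an hourly job (path and threshold are examples only):
#   total=$(btrfs device stats /data | sum_errors)
#   delta=$(error_delta /var/lib/btrfs-stats/data.total "$total")
#   [ "$delta" -ge 10 ] && echo "errors on /data rose by $delta in the last hour"
```

Storing only the total keeps the sketch short; keeping the full per-device output instead would allow the follow-up mail to name the disk whose counters are rising.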

It also seems like having a file with regularly updated stats could be used to enable desktop notifications, albeit in another project, since this seems out of scope for btrfsmaintenance. Btrfs dev stats are "updated during filesystem [mount] lifetime" in addition to "from a scrub run" (btrfs-device(8)), which is why I think this approach may have value :-)

sten0 · 2022-03-19T20:05:07Z

Oh, and here are the citations for the Fedora case:
https://fedoraproject.org/wiki/Changes/NoDefaultSendmail#Detailed_Description
https://fedoraproject.org/wiki/Changes/NoDefaultSendmail#Release_Notes

ximion · 2022-03-19T21:20:26Z

In general I think those are good ideas, and the case of errors rapidly increasing on a disk actually appears to be relatively common - on our systems once a disk is starting to fail, I can pretty much bet on this behavior.
This would need a script that's a lot more complex than the proposal here though, and I have to say that the idea of just writing a btrfs maintenance daemon that's lightweight and running all the time did cross my mind :-D The btrfs commands pretty much all have nice JSON output that such a daemon could parse to perform the appropriate actions, be it sending an email, writing a log message or sending a message to a desktop environment (but for that case, having a feature like that in udisks is likely the better spot).
Major drawback of this is that such a tool would have to be written and maintained in the first place ^^

karlmistelberger · 2022-03-20T02:29:13Z

Why not use mail instead of sendmail? See the following fragment from unit packagekit-background.service

# this is when something useful was done
if [ $PKCON_RETVAL -ne 5 ]; then
    # send email
    if [ -n "$MAILTO" ]; then
        mail -Ssendwait -s "System updates available: $SYSTEM_NAME" $MAILTO < $PKTMP
    else
        # default behavior is to use cron's internal mailing of output from cron-script
        cat $PKTMP
    fi
fi
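
One way to reconcile the `mail` and `sendmail` suggestions is to use whichever command is installed and to fail loudly when neither is. The sketch below is an assumption about how that could look, not code from the PR; `send_report` is a hypothetical name, and only the widely available `mail -s` and `sendmail -t` invocations are used.

```sh
# Sketch only: send_report is a hypothetical helper.
# $1 = recipient, $2 = subject, message body on stdin.
send_report() {
    if command -v mail >/dev/null 2>&1; then
        mail -s "$2" "$1"
    elif command -v sendmail >/dev/null 2>&1; then
        # sendmail -t takes the recipient from the To: header
        { printf 'To: %s\nSubject: %s\n\n' "$1" "$2"; cat; } | sendmail -t
    else
        echo "Neither mail nor sendmail found, cannot notify $1" >&2
        return 1
    fi
}
```

For example, `btrfs device stats /data | send_report root "btrfs errors on /data"` would deliver the counters through whichever of the two tools is present.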

AuHau · 2022-03-27T15:42:55Z

Small suggestion: it would be good to have a test path to validate that everything is set up correctly and that I will indeed get the email notification when something goes wrong, similar to how SMART has the -M test flag.
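
Such a test path could be as small as a function that pushes a clearly marked test message through the same delivery command the real notification uses. This is a sketch under that assumption; `send_test_mail` and its optional second argument are hypothetical and not part of the PR.

```sh
# Sketch only: send_test_mail is a hypothetical helper.
# $1 = recipient, $2 = delivery command (defaults to "sendmail -t")
send_test_mail() {
    rcpt=$1
    deliver=${2:-sendmail -t}
    # $deliver is left unquoted on purpose so "sendmail -t" splits into words
    printf 'To: %s\nSubject: [TEST] btrfs issue mail on %s\n\n%s\n' \
        "$rcpt" "$(uname -n)" \
        "This is a test message. No btrfs errors were detected." |
        $deliver
}
```

Passing `cat` as the delivery command, as in `send_test_mail root cat`, prints the message instead of sending it, which is handy for a dry run.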

But otherwise, this is very much needed for me so thanks a lot for this PR! Hopefully this will be merged 👍

sten0 · 2022-05-08T03:59:54Z

> In general I think those are good ideas, and the case of errors rapidly increasing on a disk actually appears to be relatively common - on our systems once a disk is starting to fail, I can pretty much bet on this behavior.

Thanks. I imagine it's stuff you've already thought of, of course ;) I'm encouraged to hear that this failure mode is common, because common problems of sufficient severity make work towards a solution pragmatically useful.

> This would need a script that's a lot more complex than the proposal here though, and I have to say that the idea of just writing a btrfs maintenance daemon that's lightweight and running all the time did cross my mind :-D The btrfs commands pretty much all have nice JSON output that such a daemon could parse to perform the appropriate actions, be it sending an email, writing a log message or sending a message to a desktop environment (but for that case, having a feature like that in udisks is likely the better spot). Major drawback of this is that such a tool would have to be written and maintained in the first place ^^

Yes, definitely, and there was an upstream thread that indicates a need for it:

Zygo Blaxell proposes an autodefrag daemon here: https://www.spinics.net/lists/linux-btrfs/msg122168.html
Qu Wenruo supports the idea here: https://www.spinics.net/lists/linux-btrfs/msg122170.html

And a user (Ghislain Adnet) requests what this PR solves here: https://www.spinics.net/lists/linux-btrfs/msg110798.html

I find Adnet's request interesting because this would be where a future btrfsd could initiate a replace from a hot spare, or rebalance to a higher raid1c$redundancy level, to defend against the rapidly-increasing-errors failure mode (i.e. it's probable that two disks in the volume are from the same batch, and if one is failing, another may soon begin to fail).

sten0 · 2022-05-08T04:00:26Z

/\ @ximion

rjlasko · 2022-07-25T00:37:47Z

Agree that an email-on-error service should be added. ZFS supports this behavior, for any preinstalled mail service, via zed configuration.

clickwir · 2022-10-11T09:10:28Z

FWIW, we've been using 'sendemail' for many years. It's still a dependency, but a much lighter one. The actual mail server runs elsewhere, so there is no need to have every system be its own mail server.

…

On Sat, Mar 19, 2022 at 1:40 PM Matthias Klumpp ***@***.***> wrote:

> ***@***.**** commented on this pull request.
>
> In btrfs-errmail.sh:
>
> > +then
> > +    # no email set, nothing to do for us
> > +    exit 0
> > +fi
> > +
> > +BTRFS_STATS_MOUNTPOINTS=$(expand_auto_mountpoint "auto")
> > +OIFS="$IFS"
> > +IFS=:
> > +for MM in $BTRFS_STATS_MOUNTPOINTS; do
> > +    if ! is_btrfs "$MM"; then
> > +        echo "Path $MM is not btrfs, skipping"
> > +        continue
> > +    fi
> > +    devstats=$(btrfs device stats --check $MM 2>&1)
> > +    if [ $? -ne 0 ]; then
> > +        mail_body="$(sendmail -t <<EOF
>
> Sendmail would obviously have to be a dependency of this. I changed the code so in case an email location was set but sendmail wasn't installed, the script will fail and print a warning to stderr.

ximion · 2023-02-20T10:43:41Z

> /\ @ximion

Do you know if any progress has been made on the "btrfsd" front?

sten0 · 2023-04-18T21:43:18Z

Matthias Klumpp ***@***.***> writes:

> /\ @ximion Do you know if any progress has been made on the "btrfsd" front?

I haven't heard anything further. If boot environment handling is within the ideal scope of "btrfsd", then maybe grub-btrfsd could be grown into a general-purpose maintenance btrfsd? But maybe that's too much of a stretch... https://github.com/Antynea/grub-btrfs If a future btrfsd does boot environment handling, then it will probably need to support systemd-boot. I wonder if this chicken/egg problem isn't going to be solved until someone from Fedora implements something, and then it becomes the de facto standard.

ximion · 2023-04-18T22:26:21Z

I'm working on a thing (called btrfsd for now because I don't have a better name...) which will basically be a small binary called by a systemd timer to perform actions like btrfsmaintenance does, but likely a bit more basic, and scratch my particular itch about mail sending and syslog-message-writing, because this patch apparently won't be merged anytime soon.
No ETA on this thing yet though, as I am drowning in work a bit and this will be a "when time permits" kind of project.
grub-btrfs looks super cool! Probably does make sense being its own project though (consolidating all tools would ease maintenance a bit, but would also require the maintainers to be familiar with every aspect of the software...)

sten0 · 2023-04-27T21:32:21Z

> I'm working on a thing (called btrfsd for now because I don't have a better name...) which will basically be a small binary called by a systemd timer to perform actions like btrfsmaintenance does, but likely a bit more basic, and scratch my particular itch about mail sending and syslog-message-writing, because this patch apparently won't be merged anytime soon. No ETA on this thing yet though, as I am drowning in work a bit and this will be a "when time permits" kind of project.

Thank you, much appreciated! Please CC me news.

> grub-btrfs looks super cool! Probably does make sense being its own project though (consolidating all tools would ease maintenance a bit, but would also require the maintainers to be familiar with every aspect of the software...)

🙂 and fair point; I guess that means there's still a need for distribution maintainers to do this work themselves!

ximion · 2023-08-24T01:16:34Z

> Thank you, much appreciated! Please CC me news.

I actually had some time to work on this, and tiny Btrfsd is born :-)
I am currently testing it on my computer and a server, and if things work out well, I will make the tool available in Debian as well. It is not as extensive as btrfsmaintenance and will probably only ever support stats/scrub/balance, but it has some nice features (like sending mail on errors, and more mails if errors increase, or only running scrub/balance if the system is not running on battery power).
Maybe you'll like it, and others find it useful too :-)
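
For readers curious what a "not running on battery" check might involve, here is a minimal sketch based on the Linux sysfs power-supply interface. It is not taken from Btrfsd; `on_battery` and the overridable `POWER_SUPPLY_DIR` are hypothetical names, and the sketch only looks at supplies whose type is "Mains", so it may misjudge machines that charge through other supply types.

```sh
# Sketch only: not Btrfsd's implementation. POWER_SUPPLY_DIR can be
# overridden, which is mainly useful for testing.
POWER_SUPPLY_DIR=${POWER_SUPPLY_DIR:-/sys/class/power_supply}

# Return 0 if the system appears to run on battery: at least one mains
# supply exists and none of them is online. Machines without any mains
# supply entry (typical desktops and servers) count as not on battery.
on_battery() {
    found=0
    for supply in "$POWER_SUPPLY_DIR"/*; do
        [ -r "$supply/type" ] && [ -r "$supply/online" ] || continue
        read -r type < "$supply/type"
        [ "$type" = "Mains" ] || continue
        found=1
        read -r online < "$supply/online"
        [ "$online" = "1" ] && return 1
    done
    [ "$found" -eq 1 ]
}

# Intended use:
#   on_battery || btrfs scrub start /data
```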

eku reviewed Mar 19, 2022

btrfs-errmail.sh (outdated, resolved)

ximion force-pushed the master branch from 9419ac2 to 118f9b3 Compare March 19, 2022 19:39

Conan-Kudo approved these changes Mar 19, 2022

Add script to send mail in case btrfs issues were detected

6a23666

This can be very useful for smaller setups where the admin still would like to receive an email in case a disk in a btrfs RAID array fails. Partially resolves kdave#88

ximion force-pushed the master branch from 118f9b3 to 6a23666 Compare March 21, 2022 10:07

AuHau reviewed Mar 31, 2022

btrfs-issuemail.sh (outdated, resolved)

Update btrfs-issuemail.sh

4b2f898

Co-authored-by: Adam Uhlíř <adam@uhlir.dev>
