check-health

check system health

Author: Marius Gedminas <marius@gedmin.as>

Date: 2020-10-31

Version: 0.13.2

Manual section: 8

SYNOPSIS

check-health [-c] [-v] [-f configfile]

check-health -g > configfile

check-health -h

DESCRIPTION

check-health is a "poor man's Nagios": a script that performs some basic system health checks. The checks are specified in the configuration file /etc/pov/check-health; if that file doesn't exist, check-health will exit silently without checking anything.

You can run check-health -g to generate a config file. You'll probably need to modify it to suit your needs.

Usually check-health is run automatically from cron. It emits no output when all checks pass. Any output indicates an error, and cron emails it to root. The exit code is 0 either way (see BUGS).

OPTIONS

-h
    Print brief usage message and exit.

-v
    Verbose output: show what checks are being performed.

-c
    Colorize error messages in red.

-g
    Generate a sample config file and print it to stdout.

-f FILENAME
    Use the specified config file instead of /etc/pov/check-health.

Note: -v also uses some colors, for informational messages, when standard output is a terminal that supports colors. -c, on the other hand, is unconditional and always uses colors, which is useful when you run check-health over ssh without an allocated terminal and want to see the errors stand out.

AVAILABLE CHECKS

All checks return a status code in addition to warning about problems.
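The status code is what makes constructs such as checkproc atop || service atop restart possible. The following is a minimal sketch of that convention, assuming the config file is evaluated as shell code (as the || usage under EXAMPLES suggests). check_file_exists is a hypothetical check written for illustration and is not part of check-health; the path is made up.

```shell
# Hypothetical check (not part of check-health): warn on stderr and
# return a non-zero status when the given path does not exist.
check_file_exists() {
    if [ ! -e "$1" ]; then
        echo "$1 is missing" >&2
        return 1
    fi
}

# The status code lets a failed check trigger a follow-up action.
# /nonexistent/example.flag is a made-up path used only for illustration.
check_file_exists /nonexistent/example.flag 2>/dev/null \
    || echo "check failed, a recovery command could run here"
```

The warning goes to stderr and the status code is returned separately, so a caller can act on the failure without parsing the message.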

checkuptime [<uptime>[s/m/h/sec/min/hour]]

Skip the rest of the checks if system uptime is less than <uptime> seconds/minutes/hours.

<uptime> defaults to 10 minutes.

Example: checkuptime 10m
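As a rough illustration of how a duration such as 10m could be interpreted, the sketch below converts the documented suffixes to seconds and compares the result with the system uptime. It is not taken from check-health itself: the function names to_seconds, uptime_seconds and booted_recently are hypothetical, reading /proc/uptime assumes Linux, and treating a bare number as seconds is an assumption.

```shell
# Convert a duration like 10m, 90sec or 2hour to seconds.
# Assumption: a bare number without a suffix means seconds.
to_seconds() {
    case "$1" in
        *hour) echo $(( ${1%hour} * 3600 )) ;;
        *min)  echo $(( ${1%min} * 60 )) ;;
        *sec)  echo $(( ${1%sec} )) ;;
        *h)    echo $(( ${1%h} * 3600 )) ;;
        *m)    echo $(( ${1%m} * 60 )) ;;
        *s)    echo $(( ${1%s} )) ;;
        *)     echo $(( $1 )) ;;
    esac
}

# Linux-specific: the first field of /proc/uptime is the uptime in
# seconds, with a fractional part that is cut off here.
uptime_seconds() {
    cut -d. -f1 /proc/uptime
}

# Succeeds if the system has been up for less than the given duration
# (10 minutes when no argument is given, matching the documented default).
booted_recently() {
    [ "$(uptime_seconds)" -lt "$(to_seconds "${1:-10m}")" ]
}
```

A caller could then skip the remaining checks with something like booted_recently 10m && exit 0.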

checkfs <mountpoint> [<amount>[K/M/G/T]]

Check that the filesystem mounted on <mountpoint> has at least <amount> of metric kilo/mega/giga/terabytes free.

<amount> defaults to 1M.

Example: checkfs / 100M
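To make the metric units concrete, here is a minimal sketch of a free-space comparison built on df. It is an illustration under stated assumptions, not the check-health implementation: to_bytes, free_bytes and has_free_space are hypothetical names, and treating a bare number as bytes is an assumption.

```shell
# Convert an amount like 100M to bytes using metric (power-of-1000) units.
# Assumption: a bare number without a suffix means bytes.
to_bytes() {
    case "$1" in
        *K) echo $(( ${1%K} * 1000 )) ;;
        *M) echo $(( ${1%M} * 1000000 )) ;;
        *G) echo $(( ${1%G} * 1000000000 )) ;;
        *T) echo $(( ${1%T} * 1000000000000 )) ;;
        *)  echo $(( $1 )) ;;
    esac
}

# Print the space available on the filesystem containing $1, in bytes.
# df -P forces the one-line-per-filesystem POSIX format and -k reports
# 1024-byte blocks; the fourth column is the available space.
free_bytes() {
    df -P -k "$1" | awk 'NR == 2 { printf "%.0f\n", $4 * 1024 }'
}

# Succeeds if at least $2 (default 1M, as documented) is free on $1.
has_free_space() {
    [ "$(free_bytes "$1")" -ge "$(to_bytes "${2:-1M}")" ]
}
```

For example, has_free_space / 100M || echo "low disk space on /" >&2 mirrors the documented example.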

checkinodes <mountpoint> [<inodes>]

Check that the filesystem mounted on <mountpoint> has at least <inodes> free inodes left.

<inodes> defaults to 5000.

Example: checkinodes /

checknfs <mountpoint>

Check that an NFS file system is mounted on <mountpoint>.

If not, try to mount all NFS filesystems.

Used as a workaround for an Ubuntu issue where NFS filesystems would fail to mount during boot, but would mount fine afterwards.

This hasn't been a problem lately.

Example: checknfs /home

checkpidfile <filename>

Check that the process listed in a given pidfile is running.

Example: checkpidfile /var/run/crond.pid

checkpidfiles <filename> ...

Check that the processes listed in given pidfiles are running.

Suppresses warnings for /var/run/sm-notify.pid because it feels like a false positive.

Suppresses warnings for failed glob expansion under /run or /var/run.

Example: checkpidfiles /var/run/*.pid /var/run/*/*.pid

checkproc <name>

Check that a process with a given name is running.

See also: checkproc_pgrep, checkproc_pgrep_full.

Example: checkproc crond

checkproc_pgrep <name>

Check that a process with a given name is running.

Uses pgrep instead of pidof.

Example: checkproc_pgrep tracd

checkproc_pgrep_full <cmdline>

Check that a process matching a given command line is running.

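Uses pgrep -f instead of pidof, so the pattern is matched against the full command line rather than just the process name. This makes it suitable for scripts run by an interpreter and for processes that differ only in their arguments.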

Example: checkproc_pgrep_full scriptname.py

Example: checkproc_pgrep_full '/usr/bin/java -jar /usr/share/jenkins/jenkins.war'

checktoomanyproc <name> <limit>

Check that fewer than <limit> instances of a given process are running.

See also: checktoomanyproc_pgrep, checktoomanyproc_pgrep_full.

Example: checktoomanyproc aspell 2

checktoomanyproc_pgrep <name> <limit>

Check that fewer than <limit> instances of a given process are running.

Uses pgrep instead of pidof.

Example: checktoomanyproc_pgrep tracd 2

checktoomanyproc_pgrep_full <limit> <cmdline>

Check that fewer than <limit> instances of a process matching a given command line are running.

Uses pgrep -f instead of pidof, so the pattern is matched against the full command line rather than just the process name.

Example: checktoomanyproc_pgrep_full 2 scriptname.py

Example: checktoomanyproc_pgrep_full 2 '/usr/bin/java -jar /usr/share/jenkins/jenkins.war'

checkthreads <min> <pgrep-args>

Check that a process has at least <min> threads.

Uses pgrep <pgrep-args> to find the process. Shows an error if pgrep finds nothing, or if pgrep finds more than one process.

Useful to detect dying threads due to missing/buggy exception handling.

Example: checkthreads 7 runzope -u ivija-staging
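On Linux, the number of threads in a process can be read from /proc. The sketch below shows one way to do that for a known PID; it leaves out the pgrep lookup described above and is not necessarily how check-health counts threads. thread_count and has_min_threads are hypothetical names.

```shell
# Print the number of threads of the process with PID $1.
# Linux-specific: /proc/<pid>/status contains a "Threads:" line.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Succeeds if the process with PID $2 has at least $1 threads.
has_min_threads() {
    [ "$(thread_count "$2")" -ge "$1" ]
}
```

For example, has_min_threads 7 "$pid" || echo "process $pid lost some threads" >&2, where $pid would come from a pgrep call that matched exactly one process.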

checklocale <locale> <pgrep-args>

Check that a process is running with the correct locale set.

Uses pgrep <pgrep-args> to find the process. Shows an error if pgrep finds nothing, or if pgrep finds more than one process.

Looks at LC_ALL/LC_CTYPE/LANG in the process environment. <locale> can be a glob pattern.

Background: this is useful to detect problems when a system daemon's locale differs depending on which sysadmin used their ssh session to launch it (or if the daemon was started at system startup).

Example: checklocale en_US.UTF-8 runzope -u ivija-staging

Example: checklocale '*.UTF-8' runzope -u ivija-staging
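The sketch below illustrates the two pieces involved: picking the effective locale out of a process environment and matching it against a glob pattern. It assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG); whether check-health resolves the variables in exactly this way is an assumption. effective_locale and locale_matches are hypothetical names.

```shell
# Read a NUL-separated environment (the format of /proc/<pid>/environ on
# Linux) from stdin and print the locale that applies to character
# handling: LC_ALL wins over LC_CTYPE, which wins over LANG.
effective_locale() {
    tr '\0' '\n' | awk -F= '
        $1 == "LC_ALL"   { all = $2 }
        $1 == "LC_CTYPE" { ctype = $2 }
        $1 == "LANG"     { lang = $2 }
        END {
            if (all != "") print all
            else if (ctype != "") print ctype
            else print lang
        }'
}

# Succeeds if locale $2 matches glob pattern $1.
locale_matches() {
    # $1 is deliberately unquoted so that it is treated as a pattern.
    case "$2" in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}
```

With a PID in hand, locale_matches '*.UTF-8' "$(effective_locale < /proc/$pid/environ)" performs the comparison. Reading another user's /proc/<pid>/environ normally requires root.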

checkram [<free>[M/G/T]]

Check that at least <free> metric mega/giga/terabytes of virtual memory are free.

<free> defaults to 100 megabytes.

Example: checkram 100M

checkswap [<limit>[M/G/T]]

Check if more than <limit> metric mega/giga/terabytes of swap are used.

<limit> defaults to 100 megabytes.

Example: checkswap 2G

checkmailq [<limit>]

Check if more than <limit> emails are waiting in the outgoing mail queue.

<limit> defaults to 20.

The check is silently skipped if you don't have any MTA (that provides a mailq command) installed. Otherwise it probably works only with Postfix.

Example: checkmailq 100

checkzopemailq <path> ...

Check if any messages older than one minute are present in the outgoing maildir used by zope.sendmail.

<path> needs to refer to the 'new' subdirectory of the mail queue.

Example: checkzopemailq /apps/zopes/*/var/mailqueue/new

checkcups <queuename>

Check if the printer is ready.

Try to enable it if it became disabled.

Background: I had this issue with CUPS randomly disabling a particular print queue after it couldn't talk to the printer for a while due to network issues or something. Manually reenabling the printer got old fast. This hasn't been a problem lately.

Example: checkcups cheese

cmpfiles <pathname1> <pathname2>

Check if the two files are identical.

Background: there were some init.d scripts that were writable by a non-root user. I wanted to inspect them manually before copying them into /etc/init.d/, and to be warned whenever the two copies diverged.

Example: cmpfiles /etc/init.d/someservice /home/someservice/initscript

check_no_matching_lines <regexp> <pathname>

Check that a file has no lines matching a regular expression.

Background: I had Jenkins jobs install random user crontabs.

Example: check_no_matching_lines '^[^#]' /var/spool/cron/crontabs/jenkins

checkaliases

Check if /etc/aliases.db is up to date.

Probably works only with Postfix, and only if you use the default database format.

Background: when you edit /etc/aliases it's so easy to forget to run newaliases.

Example: checkaliases

check_postmap_up_to_date <pathname>

Check if <pathname>.db is up to date with respect to <pathname>.

Background: when you edit /etc/postfix/* it's so easy to forget to run postmap.

Example: check_postmap_up_to_date /etc/postfix/virtual
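One plausible reading of "up to date" is a modification-time comparison between the source file and its .db companion. The sketch below shows that idea; it is an assumption about the approach, not the check-health source, and is_db_stale is a hypothetical name. The -nt test operator is available in common shells such as bash and dash.

```shell
# Succeeds (reports "stale") if $1.db is missing or older than $1.
is_db_stale() {
    [ ! -e "$1.db" ] || [ "$1" -nt "$1.db" ]
}
```

For example, is_db_stale /etc/postfix/virtual && echo "run postmap /etc/postfix/virtual" >&2.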

checklilo

Check if LILO was run after a kernel update.

Background: if you don't re-run LILO after you update your kernel, your machine will not boot. We had to use LILO on one server because GRUB completely refused to boot from the Software RAID-1 root partition.

Example: checklilo

checkweb

Check if a website is available over HTTP/HTTPS.

A thin wrapper around check_http from nagios-plugins-basic. See https://www.monitoring-plugins.org/doc/man/check_http.html for the available options.

Normally you would use this from /etc/pov/check-web-health, and not from /etc/pov/check-health.

Example: checkweb -H www.example.com

Example: checkweb --ssl -H www.example.com -u /prefix/ -f follow -s 'Expect this string' --timeout=30

Example: checkweb --ssl -H www.example.com -u /protected/ -e 'HTTP/1.1 401 Unauthorized' -s 'Login required'

Example: checkweb --ssl -H www.example.com --invert-regex -r "Database connection error"

This function is normally used from /etc/pov/check-web-health.

checkweb_auth

Check if a website is available over HTTP/HTTPS.

checkweb_auth user:pwd args is equivalent to checkweb -a user:pwd args but the username/password pair is not printed if the check fails or in verbose mode.

(It's still visible to any local system user who can run 'ps' while check-web-health is running.)

Example: checkweb_auth username:password -H www.example.com

This function is normally used from /etc/pov/check-web-health.

checkcert <hostname>[:<port>] [<days>]

Check if the SSL certificate of a website is close to expiration.

<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.

Example: checkcert www.example.com

Example: checkcert www.example.com:8443

This function is normally used from /etc/pov/check-ssl-certs.
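An expiry check of this kind can be approximated with the openssl command-line tool. The sketch below is an illustration only and makes no claim about how checkcert is implemented: fetch_cert and cert_expires_within are hypothetical names, and port 443 is assumed when none is given. The fallback to $CHECKCERT_WARN_BEFORE and then 21 days follows the defaults documented above.

```shell
# Print the PEM certificate presented by host $1 on port $2 (443 assumed).
# -servername sends SNI so that name-based virtual hosts answer correctly.
fetch_cert() {
    openssl s_client -connect "$1:${2:-443}" -servername "$1" \
        </dev/null 2>/dev/null | openssl x509
}

# Succeeds if the certificate in PEM file $1 expires within $2 days.
# openssl x509 -checkend N exits non-zero when the certificate expires
# within the next N seconds, so the result is negated here.
cert_expires_within() {
    ! openssl x509 -in "$1" -noout \
        -checkend $(( ${2:-${CHECKCERT_WARN_BEFORE:-21}} * 86400 )) >/dev/null
}
```

For example, fetch_cert www.example.com > /tmp/cert.pem followed by cert_expires_within /tmp/cert.pem 21 && echo "certificate expires soon" >&2. Because of the negation, an unreadable file is also reported as expiring, which errs on the side of a warning.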

checkcert_ssmtp <hostname> [<days>]

Check if the SSL certificate of an SSMTP server is close to expiration.

<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.

Example: checkcert_ssmtp mail.example.com

This function is normally used from /etc/pov/check-ssl-certs.

checkcert_smtp_starttls <hostname> [<days>]

Check if the SSL certificate of an SMTP server is close to expiration.

<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.

Example: checkcert_smtp_starttls mail.example.com

This function is normally used from /etc/pov/check-ssl-certs.

checkcert_imaps <hostname> [<days>]

Check if the SSL certificate of an IMAPS server is close to expiration.

<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.

Example: checkcert_imaps mail.example.com

This function is normally used from /etc/pov/check-ssl-certs.

EXAMPLES

Example /etc/pov/check-health:

# Check that processes are running
checkproc apache2
checkproc cron
checkproc sshd
checkproc_pgrep tracd
checkproc_pgrep_full '/usr/bin/java -jar /usr/share/jenkins/jenkins.war'

# Check for daemons with known bugs and restart them automatically
checkproc atop || service atop restart

# Check for stale aspell processes (more than 2)
checktoomanyproc aspell 2

# Check for stale pidfiles
checkpidfiles /var/run/*.pid /var/run/*/*.pid

# Check free disk space
checkfs /    200M
checkfs /var 200M

# Check free inodes
checkinodes /
checkinodes /var

# Check free memory
checkram 100M

# Check excessive swap usage
checkswap 2G

# Check mail queue
checkmailq 100

# Check if /etc/aliases is up to date
checkaliases

BUGS

check-health returns exit code 0 even if some checks failed. You need to watch stderr to notice problems.

Many checks don't check their arguments for correctness and may fail in unexpected ways if you supply a wrong value (or neglect to supply a value where one was expected).

DESIGN LIMITATIONS

If cron doesn't work, or email sending doesn't work, check-health won't be able to report problems. You can combine it with a service like https://healthchecks.io to catch these kinds of problems.

check-health is stateless and as such will keep reporting the same error once an hour (assuming default cron configuration) until you fix it.

SEE ALSO

check-web-health(8), check-ssl-certs(8)