check-health

check system health

Author: Marius Gedminas <marius@gedmin.as>
Date: 2020-10-31
Version: 0.13.2
Manual section: 8

SYNOPSIS

check-health [-c] [-v] [-f configfile]

check-health -g > configfile

check-health -h

DESCRIPTION

check-health is a "poor man's Nagios": a script that performs some basic system health checks. The checks are specified in the configuration file /etc/pov/check-health; if that file doesn't exist, check-health will exit silently without checking anything.

You can run check-health -g to generate a config file. You'll probably need to modify it to suit your needs.

Usually check-health is run automatically from cron. It doesn't emit any output and returns exit code 0 if all checks pass. Any output indicates an error, and cron emails it to root.

OPTIONS

-h  Print brief usage message and exit.

-v  Verbose output: show what checks are being performed.

-c  Colorize error messages in red.

-g  Generate a sample config file and print it to stdout.

-f FILENAME  Use the specified config file instead of /etc/pov/check-health.

Note: -v also uses some colors, for informational messages, when standard output is a terminal that supports colors. -c, on the other hand, is unconditional and always uses colors, which is useful when you run check-health over ssh without an allocated terminal and want to see the errors stand out.
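The difference between the two color modes can be pictured as a small decision function. This is only a sketch, not the script's actual code: the name should_colorize and its argument are made up for illustration, and the test for whether the terminal really supports colors is left out.

```shell
# Hypothetical helper: decide whether to emit color escape codes.
# $1 is 1 when -c was given, 0 otherwise.
should_colorize() {
    if [ "$1" = 1 ]; then
        return 0        # -c: always colorize, even without a terminal
    fi
    [ -t 1 ]            # otherwise only when stdout is a terminal
}
```

With output piped or redirected (as happens under ssh without a terminal), the second branch fails, which is why -c exists.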

AVAILABLE CHECKS

All checks return a status code in addition to warning about problems.

checkuptime [<uptime>[s/m/h/sec/min/hour]]

Skip the rest of the checks if system uptime is less than <uptime> seconds, minutes, or hours.

<uptime> defaults to 10 minutes.

Example: checkuptime 10m
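To illustrate how the suffixes map to a duration, here is a minimal sketch that converts an <uptime> argument to seconds. The function name to_seconds is hypothetical and this is not necessarily how check-health parses the value. Treating a bare number as seconds is an assumption of the sketch; the manual does not say what unit applies when the suffix is omitted.

```shell
# Hypothetical helper: convert values like 10m, 2hour or 45s to seconds.
to_seconds() {
    case $1 in
        *hour) echo $(( ${1%hour} * 3600 )) ;;
        *min)  echo $(( ${1%min} * 60 )) ;;
        *sec)  echo $(( ${1%sec} )) ;;
        *h)    echo $(( ${1%h} * 3600 )) ;;
        *m)    echo $(( ${1%m} * 60 )) ;;
        *s)    echo $(( ${1%s} )) ;;
        *)     echo $(( $1 )) ;;   # assumption: no suffix means seconds
    esac
}
```

For example, to_seconds 10m prints 600.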

checkfs <mountpoint> [<amount>[K/M/G/T]]

Check that the filesystem mounted on <mountpoint> has at least <amount> of metric kilo/mega/giga/terabytes free.

<amount> defaults to 1M.

Example: checkfs / 100M
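Because the suffixes are metric, 100M means 100,000,000 bytes rather than 100 MiB. The following sketch shows one way such a comparison could be made with df. The function names are hypothetical, the code is not taken from check-health, and treating a bare number as bytes is an assumption.

```shell
# Hypothetical helper: convert 5K, 100M, 2G, 1T to bytes (powers of 1000).
to_bytes() {
    case $1 in
        *K) echo $(( ${1%K} * 1000 )) ;;
        *M) echo $(( ${1%M} * 1000000 )) ;;
        *G) echo $(( ${1%G} * 1000000000 )) ;;
        *T) echo $(( ${1%T} * 1000000000000 )) ;;
        *)  echo $(( $1 )) ;;      # assumption: no suffix means bytes
    esac
}

# Hypothetical helper: available bytes on the filesystem holding path $1.
free_bytes() {
    # df -P -k prints a header line, then one line per filesystem;
    # column 4 is the available space in 1024-byte blocks.
    avail_kb=$(df -P -k "$1" | awk 'NR == 2 { print $4 }')
    echo $(( avail_kb * 1024 ))
}

# Succeeds if at least $2 (default 1M) is free on $1.
has_free_space() {
    [ "$(free_bytes "$1")" -ge "$(to_bytes "${2:-1M}")" ]
}
```

A call such as has_free_space / 100M then mirrors the example above.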

checkinodes <mountpoint> [<inodes>]

Check that the filesystem mounted on <mountpoint> has at least <inodes> free inodes left.

<inodes> defaults to 5000.

Example: checkinodes /

checknfs <mountpoint>

Check that an NFS file system is mounted on <mountpoint>.

If not, try to mount all NFS filesystems.

Used as a workaround for an Ubuntu issue where NFS filesystems would fail to mount during boot, but would mount fine afterwards.

This hasn't been a problem lately.

Example: checknfs /home

checkpidfile <filename>

Check that the process listed in a given pidfile is running.

Example: checkpidfile /var/run/crond.pid
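A pidfile check of this kind can be sketched with kill -0, which tests whether a process exists without sending it a signal. The function name pid_is_running is hypothetical and the real check may work differently. Note that kill -0 also fails for processes the caller is not allowed to signal, so the sketch assumes it runs as root.

```shell
# Hypothetical helper: succeed if the PID stored in file $1 is alive.
pid_is_running() {
    [ -r "$1" ] || return 1
    # Take the first line and drop surrounding whitespace.
    pid=$(head -n 1 "$1" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or not a number
    esac
    kill -0 "$pid" 2>/dev/null     # signal 0: existence test only
}
```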

checkpidfiles <filename> ...

Check that the processes listed in given pidfiles are running.

Suppresses warnings for /var/run/sm-notify.pid because it feels like a false positive.

Suppresses warnings for failed glob expansion under /run or /var/run.

Example: checkpidfiles /var/run/*.pid /var/run/*/*.pid

checkproc <name>

Check that a process with a given name is running.

See also: checkproc_pgrep, checkproc_pgrep_full.

Example: checkproc crond

checkproc_pgrep <name>

Check that a process with a given name is running.

Uses pgrep instead of pidof.

Example: checkproc_pgrep tracd

checkproc_pgrep_full <cmdline>

Check that a process matching a given command line is running.

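Uses pgrep -f instead of pidof, which matches against the full command line, so it can find scripts run by an interpreter and tell apart processes that differ only in their arguments.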

Example: checkproc_pgrep_full scriptname.py

Example: checkproc_pgrep_full '/usr/bin/java -jar /usr/share/jenkins/jenkins.war'

checktoomanyproc <name> <limit>

Check that fewer than <limit> instances of a given process are running.

See also: checktoomanyproc_pgrep, checktoomanyproc_pgrep_full.

Example: checktoomanyproc aspell 2

checktoomanyproc_pgrep <name> <limit>

Check that fewer than <limit> instances of a given process are running.

Uses pgrep instead of pidof.

Example: checktoomanyproc_pgrep tracd 2

checktoomanyproc_pgrep_full <limit> <cmdline>

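Check that fewer than <limit> instances of a given process are running.

Uses pgrep -f instead of pidof, which matches against the full command line, so it can find scripts run by an interpreter and tell apart processes that differ only in their arguments.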

Example: checktoomanyproc_pgrep_full 2 scriptname.py

Example: checktoomanyproc_pgrep_full 2 '/usr/bin/java -jar /usr/share/jenkins/jenkins.war'

checkthreads <min> <pgrep-args>

Check that a process has at least <min> threads.

Uses pgrep <pgrep-args> to find the process. Shows an error if pgrep finds nothing, or if pgrep finds more than one process.

Useful to detect dying threads due to missing/buggy exception handling.

Example: checkthreads 7 runzope -u ivija-staging

checklocale <locale> <pgrep-args>

Check that a process is running with the correct locale set.

Uses pgrep <pgrep-args> to find the process. Shows an error if pgrep finds nothing, or if pgrep finds more than one process.

Looks at LC_ALL/LC_CTYPE/LANG in the process environment. <locale> can be a glob pattern.

Background: this is useful to detect problems when a system daemon's locale differs depending on which sysadmin used their ssh session to launch it (or if the daemon was started at system startup).

Example: checklocale en_US.UTF-8 runzope -u ivija-staging

Example: checklocale '*.UTF-8' runzope -u ivija-staging
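The locale lookup and the glob match can be sketched as follows. Both function names are hypothetical and this is not the script's own code. The sketch assumes a Linux-style, NUL-separated environment dump such as /proc/PID/environ, and applies the usual precedence of LC_ALL over LC_CTYPE over LANG.

```shell
# Hypothetical helper: print the effective character-type locale found in
# the NUL-separated environment dump $1 (e.g. /proc/1234/environ).
effective_locale() {
    tr '\0' '\n' < "$1" | awk -F= '
        $1 == "LC_ALL"   { all   = substr($0, 8) }
        $1 == "LC_CTYPE" { ctype = substr($0, 10) }
        $1 == "LANG"     { lang  = substr($0, 6) }
        END {
            if (all != "")        print all
            else if (ctype != "") print ctype
            else                  print lang
        }'
}

# Hypothetical helper: succeed if locale $2 matches glob pattern $1.
locale_matches() {
    case $2 in
        $1) return 0 ;;   # unquoted on purpose so $1 acts as a pattern
        *)  return 1 ;;
    esac
}
```

Quoting the pattern in the config, as in the '*.UTF-8' example above, keeps the shell from expanding it against file names before the check sees it.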

checkram [<free>[M/G/T]]

Check that at least <free> metric mega/giga/terabytes of virtual memory are free.

<free> defaults to 100 megabytes.

Example: checkram 100M

checkswap [<limit>[M/G/T]]

Check if more than <limit> metric mega/giga/terabytes of swap are used.

<limit> defaults to 100 megabytes.

Example: checkswap 2G

checkmailq [<limit>]

Check if more than <limit> emails are waiting in the outgoing mail queue.

<limit> defaults to 20.

The check is silently skipped if you don't have any MTA (that provides a mailq command) installed. Otherwise it probably works only with Postfix.

Example: checkmailq 100
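The Postfix dependency comes from how the queue size has to be read out of the mailq output. As an illustration only, the sketch below extracts the message count from the summary line that Postfix prints at the end of the listing (for example "-- 23 Kbytes in 7 Requests."). The function name queue_size is hypothetical, and other MTAs format this output differently.

```shell
# Hypothetical helper: read mailq output on stdin, print the number of
# queued messages (0 if no Postfix-style summary line is found).
queue_size() {
    awk '/^-- .* in [0-9]+ Requests?\.$/ { n = $(NF - 1) }
         END { print n + 0 }'
}

# Usage, assuming a Postfix mailq command is installed:
#   [ "$(mailq | queue_size)" -le 100 ] || echo "mail queue too long" >&2
```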

checkzopemailq <path> ...

Check if any messages older than one minute are present in the outgoing maildir used by zope.sendmail.

<path> needs to refer to the 'new' subdirectory of the mail queue.

Example: checkzopemailq /apps/zopes/*/var/mailqueue/new

checkcups <queuename>

Check if the printer is ready.

Try to enable it if it became disabled.

Background: I had this issue with CUPS randomly disabling a particular print queue after it couldn't talk to the printer for a while due to network issues or something. Manually reenabling the printer got old fast. This hasn't been a problem lately.

Example: checkcups cheese

cmpfiles <pathname1> <pathname2>

Check if the two files are identical.

Background: there were some init.d scripts that were writable by a non-root user. I wanted to inspect them manually before copying them into /etc/init.d/.

Example: cmpfiles /etc/init.d/someservice /home/someservice/initscript

check_no_matching_lines <regexp> <pathname>

Check that a file has no lines matching a regular expression.

Background: I had Jenkins jobs install random user crontabs.

Example: check_no_matching_lines ^[^#] /var/spool/cron/crontabs/jenkins
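The idea behind this check can be sketched with grep. The function name no_matching_lines is hypothetical and the code is not taken from check-health. Treating a missing or unreadable file as "no matches" is an assumption of the sketch.

```shell
# Hypothetical helper: succeed if file $2 has no line matching regexp $1.
no_matching_lines() {
    if grep -q -e "$1" "$2" 2>/dev/null; then
        return 1    # at least one line matched
    fi
    return 0        # no match (or, by assumption, no readable file)
}
```

The regexp ^[^#] matches any non-empty line that does not start with #, so a crontab holding only comments and blank lines passes. Since the config examples use shell syntax, quoting the pattern ('^[^#]') avoids accidental glob expansion.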

checkaliases

Check if /etc/aliases.db is up to date.

Probably works only with Postfix, and only if you use the default database format.

Background: when you edit /etc/aliases it's so easy to forget to run newaliases.

Example: checkaliases

check_postmap_up_to_date <pathname>

Check if <pathname>.db is up to date with respect to <pathname>.

Background: when you edit /etc/postfix/* it's so easy to forget to run postmap.

Example: check_postmap_up_to_date /etc/postfix/virtual
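"Up to date" here is a question of modification times: the .db file should not be older than its source. A minimal sketch of that comparison, using a hypothetical function name and not necessarily the method check-health uses:

```shell
# Hypothetical helper: succeed if $1.db exists and $1 has not been
# modified more recently than $1.db.
db_up_to_date() {
    [ -e "$1" ] && [ -e "$1.db" ] || return 1
    # find prints $1 only when it is newer than $1.db
    [ -z "$(find "$1" -newer "$1.db")" ]
}
```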

checklilo

Check if LILO was run after a kernel update.

Background: if you don't re-run LILO after you update your kernel, your machine will not boot. We had to use LILO on one server because GRUB completely refused to boot from the Software RAID-1 root partition.

Example: checklilo

checkweb

Check if a website is available over HTTP/HTTPS.

A thin wrapper around check_http from nagios-plugins-basic. See https://www.monitoring-plugins.org/doc/man/check_http.html for the available options.

Normally you would use this from /etc/pov/check-web-health, and not from /etc/pov/check-health.

Example: checkweb -H www.example.com

Example: checkweb --ssl -H www.example.com -u /prefix/ -f follow -s 'Expect this string' --timeout=30

Example: checkweb --ssl -H www.example.com -u /protected/ -e 'HTTP/1.1 401 Unauthorized' -s 'Login required'

Example: checkweb --ssl -H www.example.com --invert-regex -r "Database connection error"

This function is normally used from /etc/pov/check-web-health.

checkweb_auth

Check if a website is available over HTTP/HTTPS.

checkweb_auth user:pwd args is equivalent to checkweb -a user:pwd args but the username/password pair is not printed if the check fails or in verbose mode.

(It's still visible to any local system user who can run 'ps' while check-web-health is running.)

Example: checkweb_auth username:password -H www.example.com

This function is normally used from /etc/pov/check-web-health.

checkcert <hostname>[:<port>] [<days>]

Check if the SSL certificate of a website is close to expiration.

<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.

Example: checkcert www.example.com

Example: checkcert www.example.com:8443

This function is normally used from /etc/pov/check-ssl-certs.
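An expiry check with a <days> threshold can be sketched with the openssl command-line tool, whose x509 -checkend option takes a number of seconds. The function names below are hypothetical, and this is an illustration under the assumption that openssl is installed, not a description of how checkcert is implemented.

```shell
# Hypothetical helper: succeed if the PEM certificate in file $1 stays
# valid for at least $2 days (default: $CHECKCERT_WARN_BEFORE, else 21).
cert_valid_for() {
    days=${2:-${CHECKCERT_WARN_BEFORE:-21}}
    openssl x509 -noout -checkend "$(( days * 86400 ))" -in "$1" >/dev/null 2>&1
}

# Hypothetical helper: print the certificate served by host $1 on port $2.
fetch_cert() {
    openssl s_client -connect "$1:$2" -servername "$1" </dev/null 2>/dev/null |
        openssl x509
}

# Usage (needs network access):
#   fetch_cert www.example.com 443 > /tmp/cert.pem
#   cert_valid_for /tmp/cert.pem || echo "certificate expires soon" >&2
```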

checkcert_ssmtp <hostname> [<days>]

Check if the SSL certificate of an SSMTP server is close to expiration.

<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.

Example: checkcert_ssmtp mail.example.com

This function is normally used from /etc/pov/check-ssl-certs.

checkcert_smtp_starttls <hostname> [<days>]

Check if the SSL certificate of an SMTP server is close to expiration.

<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.

Example: checkcert_smtp_starttls mail.example.com

This function is normally used from /etc/pov/check-ssl-certs.

checkcert_imaps <hostname> [<days>]

Check if the SSL certificate of an IMAPS server is close to expiration.

<days> defaults to $CHECKCERT_WARN_BEFORE, and if that's not specified, 21.

Example: checkcert_imaps mail.example.com

This function is normally used from /etc/pov/check-ssl-certs.

EXAMPLES

Example /etc/pov/check-health:

# Check that processes are running
checkproc apache2
checkproc cron
checkproc sshd
checkproc_pgrep tracd
checkproc_pgrep_full '/usr/bin/java -jar /usr/share/jenkins/jenkins.war'

# Check for daemons with known bugs and restart them automatically
checkproc atop || service atop restart

# Check for stale aspell processes (more than 2)
checktoomanyproc aspell 2

# Check for stale pidfiles
checkpidfiles /var/run/*.pid /var/run/*/*.pid

# Check free disk space
checkfs /    200M
checkfs /var 200M

# Check free inodes
checkinodes /
checkinodes /var

# Check free memory
checkram 100M

# Check excessive swap usage
checkswap 2G

# Check mail queue
checkmailq 100

# Check if /etc/aliases is up to date
checkaliases

BUGS

check-health returns exit code 0 even if some checks failed. You need to watch stderr to notice problems.
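If a caller needs a meaningful exit code, one possible workaround is a wrapper that treats any output as a failure. This is a sketch with a hypothetical function name, not part of check-health.

```shell
# Hypothetical wrapper: run a command, and fail if it printed anything
# on stdout or stderr.
fail_on_output() {
    # The command's own exit code is deliberately ignored.
    output=$("$@" 2>&1) || true
    if [ -n "$output" ]; then
        printf '%s\n' "$output" >&2
        return 1
    fi
    return 0
}

# Usage, assuming check-health is on PATH:
#   fail_on_output check-health -f /etc/pov/check-health
```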

Many checks don't check their arguments for correctness and may fail in unexpected ways if you supply a wrong value (or neglect to supply a value where one was expected).

DESIGN LIMITATIONS

If cron doesn't work, or email sending doesn't work, check-health won't be able to report problems. You can combine it with a service like https://healthchecks.io to catch these kinds of problems.

check-health is stateless and as such will keep reporting the same error once an hour (assuming default cron configuration) until you fix it.
