How to use

Syntax

**** check_health.sh ****
**** (c) KUDOS BV - Patrick Van der Veken ****

Execute/report simple health checks (HC) on UNIX hosts.

Syntax: /opt/hc/bin/check_health.sh [--help] | [--help-terse] | [--version] |
    [--list=<needle>] | [--list-core] | [--fix-symlinks] | [--show-stats] | (--archive-all | --disable-all | --enable-all) | [ --fix-logs [--with-history]] |
	(--check-host | ((--archive | --check | --enable | --disable | --run [--timeout=<secs>] | --show) --hc=<list_of_checks> [--config-file=<configuration_file>] [--hc-args="<arg1,arg2=val,arg3>"]))
	[--display=<method>] ([--debug] [--debug-level=<level>]) [--log-healthy] [--no-monitor] [--no-fix] [--no-log] [--no-lock] [[--flip-rc] [--with-rc=<count|max|sum>]]
	[--notify=<method_list>] [--mail-to=<address_list>] [--sms-to=<sms_rcpt> --sms-provider=<name>]
	[--report=<method> [--with-history] ( ([--last] | [--today]) | [(--older|--newer)=<date>] | [--reverse] [--id=<fail_id> [--detail]] )]

Parameters:

--archive       : move events from the HC log file into archive log files (one HC)
--archive-all   : move events for all HCs from the HC log file into archive log files
--check         : display HC state.
--check-host    : execute all configured HC(s) (see check_host.conf)
--config-file   : custom configuration file for a HC (may only be specified when executing a single HC plugin)
--debug         : run script in debug mode
--debug-level   : level of debugging information to show (0,1,2)
--detail        : show detailed info on failed HC event (will show STDOUT+STDERR logs)
--disable       : disable HC(s).
--disable-all   : disable all HCs.
--display       : display HC results in a formatted way. Default is STDOUT (see --list-core for available formats)
--enable        : enable HC(s).
--enable-all    : enable all HCs.
--fix-logs      : fix rogue log entries (can be used with --with-history)
--fix-symlinks  : update symbolic links for the KSH autoloader.
--flip-rc       : exit the health checker with the RC (return code) of the HC plugin instead of its own RC (will be discarded)
                  This option may only be specified when executing a single HC plugin
--hc            : list of health checks to be executed (comma-separated) (see also --list-hc)
--hc-args       : extra arguments to be passed to an individual HC. Arguments must be comma-separated and enclosed
                  in double quotes (example: --hc-args="arg1,arg2=value,arg3").
--id            : value of a FAIL ID (must be specified as uninterrupted sequence of numbers)
--last          : show the last (failed) events for each HC and their combined STC value
--list          : show the available health checks. Use <needle> to search with wildcards. Following details are shown:
                  - health check (plugin) name
                  - state of the HC plugin (disabled/enabled)
                  - version of the HC plugin
                  - whether the HC plugin requires a configuration file in /etc/opt/hc
                  - whether the HC plugin is scheduled by cron
--list-core     : show the available core plugins (mail,SMS,...)
--list-include  : show the available includes/libraries
--log-healthy   : log/show also passed health checks. By default this is off when the plugin supports this feature.
                  (can be overridden by --no-log to disable all logging)
--mail-to       : list of e-mail address(es) to which an e-mail alert will be sent [requires mail core plugin]
--newer         : show the (failed) events for each HC that are newer than the given date
--no-fix        : do not apply fix/healing logic for failed health checks (if available)
--no-lock       : disable locking to allow concurrent script executions
--no-log        : do not log any messages to the script log file or health check results.
--no-monitor    : do not stop the execution of a HC after $HC_TIME_OUT seconds
--notify        : notify upon HC failure(s). Multiple options may be specified if comma-separated (see --list-core for available formats)
--older         : show the (failed) events for each HC that are older than the given date
--report        : report on failed HC events (STDOUT is the default reporting method)
--reverse       : show the report in reverse date order (newest events first)
--run           : execute HC(s).
--show          : show information/documentation on a HC
--show-stats    : show statistics on HC events (current & archived)
--sms-provider  : name of a supported SMS provider (see $SMS_PROVIDERS) [requires SMS core plugin]
--sms-to        : name of person or group to which an SMS alert will be sent [requires SMS core plugin]
--timeout       : maximum runtime of a HC plugin in seconds (overrides $HC_TIME_OUT)
--today         : show today's (failed) events (HC and their combined STC value)
--version       : show the timestamp of the script.
--with-history  : also include events that have been archived already (reporting)
--with-rc       : define RC handling (plugin) when --flip-rc is used

Running a HC

Running a single HC

# /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan --run

INFO: *** start of check_health.sh [--hc=check_hpux_ioscan --run] ***
INFO: logging takes places in /var/opt/hc/check_health.sh.log
INFO: spawning child process with time-out of 60 secs for HC call [PID=2341]
INFO: collecting ioscan information (kernel_mode=yes), this is may take a while ...
INFO: check_hpux_ioscan [STC=0]: no problems detected by /usr/sbin/ioscan
INFO: executed HC: check_hpux_ioscan [RC=0]
INFO: performing cleanup ...
INFO: /var/tmp/.check_health.sh.lock lock directory removed
INFO: *** finish of check_health.sh [--hc=check_hpux_ioscan --run] ***

The check returns its result via STDOUT and also logs it to the HC log file. For a terse version of the output, use:

# /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan --run --terse
Health Check                    STC     Message
check_hpux_ioscan               0       no problems detected by /usr/sbin/ioscan

Running multiple HCs (at once)

# /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan,check_hpux_ovpa_status --run

INFO: *** start of check_health.sh [--hc=check_hpux_ioscan,check_hpux_ovpa_status --run] ***
INFO: logging takes places in /var/opt/hc/check_health.sh.log
INFO: spawning child process with time-out of 60 secs for HC call [PID=2623]
INFO: collecting ioscan information (kernel_mode=yes), this is may take a while ...
INFO: check_hpux_ioscan [STC=0]: no problems detected by /usr/sbin/ioscan
INFO: executed HC: check_hpux_ioscan [RC=0]
INFO: spawning child process with time-out of 60 secs for HC call [PID=2720]
INFO: check_hpux_ovpa_status [STC=0]: scopeux is running
INFO: check_hpux_ovpa_status [STC=0]: midaemon is running
INFO: check_hpux_ovpa_status [STC=0]: perfalarm is running
INFO: check_hpux_ovpa_status [STC=0]: ttd is running
INFO: check_hpux_ovpa_status [STC=0]: ovcd is running
INFO: check_hpux_ovpa_status [STC=0]: ovbbccb is running
INFO: check_hpux_ovpa_status [STC=0]: coda is running
INFO: executed HC: check_hpux_ovpa_status [RC=0]
INFO: performing cleanup ...
INFO: /var/tmp/.check_health.sh.lock lock directory removed
INFO: *** finish of check_health.sh [--hc=check_hpux_ioscan,check_hpux_ovpa_status --run] ***

The checks return their result(s) via STDOUT and also log them to the HC log file.
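
When the output of a run is processed by another script, the [STC=n] markers shown above are the obvious thing to look at. The following is a minimal sketch, assuming that a non-zero STC value marks a failed check and that the log line format stays as in the examples. The function name count_failed_stc is hypothetical and not part of the HC tool.

```shell
#!/bin/sh
# Hypothetical helper, not part of check_health.sh.
# Reads check_health.sh output on stdin and prints the number of
# lines that carry a non-zero [STC=n] marker.
# Assumption: STC=0 means "passed", anything else means "failed".
count_failed_stc() {
    awk '
        match($0, /\[STC=[0-9]+\]/) {
            # the number starts after "[STC=" (5 chars) and ends before "]"
            stc = substr($0, RSTART + 5, RLENGTH - 6)
            if (stc + 0 != 0) failed++
        }
        END { print failed + 0 }
    '
}

# Usage (only meaningful on a host where the HC tool is installed):
# /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan,check_hpux_ovpa_status --run | count_failed_stc
```

The [RC=n] lines are deliberately ignored, since they describe the execution of the plugin rather than the outcome of the individual checks.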

Running a single HC with a custom configuration file

# /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan --config-file=/etc/opt/hc/check_hpux_ioscan_new.conf --run

INFO: *** start of check_health.sh [--hc=check_hpux_ioscan --config-file=/etc/opt/hc/check_hpux_ioscan_new.conf --run] ***
INFO: logging takes places in /var/opt/hc/check_health.sh.log
INFO: spawning child process with time-out of 60 secs for HC call [PID=3140]
INFO: collecting ioscan information (kernel_mode=no), this is may take a while ...
INFO: check_hpux_ioscan [STC=0]: no problems detected by /usr/sbin/ioscan
INFO: executed HC: check_hpux_ioscan [RC=0]
INFO: performing cleanup ...
INFO: /var/tmp/.check_health.sh.lock lock directory removed
INFO: *** finish of check_health.sh [--hc=check_hpux_ioscan --config-file=/etc/opt/hc/check_hpux_ioscan_new.conf --run] ***

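In this case, the file pointed to by the --config-file parameter is used instead of the plugin's default configuration file and should contain the settings that the plugin expects (for check_hpux_ioscan: ioscan_classes, kernel_mode and agile_view). In the example above the custom file sets kernel_mode=no. A custom configuration file may only be specified when executing a single HC plugin. Again, the check returns its result via STDOUT and also logs it to the HC log file.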
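
A custom configuration file is usually a copy of the default one with a few values changed. The sketch below shows one way to override a single setting, assuming that the configuration file consists of plain key=value lines (as the --show output of the plugin suggests). The function name set_conf_value is hypothetical and not part of the HC tool.

```shell
#!/bin/sh
# Hypothetical helper, not part of check_health.sh.
# set_conf_value <file> <key> <value>
# Removes any existing "<key>=..." line from the file and appends
# "<key>=<value>". All other lines are kept as they are.
# Assumption: the configuration file uses plain key=value lines.
set_conf_value() {
    conf_file=$1
    key=$2
    value=$3
    tmp_file=$(mktemp) || return 1
    # keep every line whose first "="-separated field is not the key
    awk -F= -v k="$key" '$1 != k' "$conf_file" > "$tmp_file" || return 1
    printf '%s=%s\n' "$key" "$value" >> "$tmp_file"
    # write back in place so that owner and permissions are preserved
    cat "$tmp_file" > "$conf_file" && rm -f "$tmp_file"
}

# Usage sketch:
# cp /etc/opt/hc/check_hpux_ioscan.conf /etc/opt/hc/check_hpux_ioscan_new.conf
# set_conf_value /etc/opt/hc/check_hpux_ioscan_new.conf kernel_mode no
# /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan --config-file=/etc/opt/hc/check_hpux_ioscan_new.conf --run
```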

Checking available HCs

# /opt/hc/bin/check_health.sh --list

Health Check                    State           Version         Config? Sched?
--------------------------------------------------------------------------------
check_hpux_autopath             enabled         2013-08-29      No      No
check_hpux_file_age             enabled         2013-05-29      Yes     No
check_hpux_file_change          enabled         2017-05-18      Yes     No
check_hpux_fs_mounts            enabled         2016-07-04      No      Yes
check_hpux_fs_mounts_options    enabled         2016-12-02      Yes     No
check_hpux_guid_status          enabled         2017-05-18      No      No
check_hpux_httpd_status         enabled         2017-04-23      No      No
check_hpux_ignite_backup        enabled         2016-06-03      Yes     Yes
check_hpux_ioscan               enabled         2016-12-01      Yes     Yes
check_hpux_named_status         enabled         2017-01-07      No      Yes
check_hpux_ntp_status           enabled         2016-12-29      Yes     Yes
check_hpux_ovpa_status          enabled         2016-12-01      No      Yes
check_hpux_postfix_status       enabled         2016-12-01      No      Yes
check_hpux_root_crontab         enabled         2013-09-09      Yes     Yes
check_hpux_sg_cluster_config    enabled         2016-12-01      Yes     No
check_hpux_sg_cluster_status    enabled         2017-05-07      Yes     Yes
check_hpux_sg_package_config    enabled         2016-12-01      Yes     No
check_hpux_sg_package_status    enabled         2017-05-07      Yes     Yes
check_hpux_sg_qs_status         enabled         2017-05-01      No      No
check_hpux_sshd_status          enabled         2017-04-01      No      Yes
check_hpux_syslog               enabled         2017-05-18      Yes     Yes
check_hpux_vg_minor_number      enabled         2016-04-28      No      Yes

Dead links:

current FPATH: /opt/hc/lib/platform:/opt/hc/lib/platform/hp-ux

Config?: plugin has a default configuration file (Yes/No)
Sched? : plugin is scheduled through cron (Yes/No)
H+?    : plugin can choose whether to log/show passed health checks (Yes/No/Supported/Not supported)

The --list command will display various information about each health check:

Name: which can be used in the --hc parameter
State: whether the HC is currently active or disabled. The default is “enabled”.
Version: the HC version should always be equal to or greater than the check_health.sh version, which can be displayed with:

# /opt/hc/bin/check_health.sh --version

INFO: ./check_health.sh: 2017-05-17

Config?: whether the plugin requires a configuration file to be present and/or configured.
Sched?: whether the HC plugin is scheduled in any (root) crontab
H+?: whether the HC plugin:
- supports the --log-healthy option or the log_healthy plugin configuration parameter (see also the logging topic);
- has the logging of passed health checks enabled or not.

If a plugin is listed with unlinked as its State:

# /opt/hc/bin/check_health.sh --list

Health Check                    State           Version         Config? Sched?
--------------------------------------------------------------------------------
check_hpux_autopath             enabled         2013-08-29      No      No
check_hpux_file_age             enabled         2013-05-29      Yes     No
check_hpux_file_change          enabled         2017-05-18      Yes     No
check_hpux_fs_mounts            enabled         2016-07-04      No      Yes
check_hpux_fs_mounts_options    enabled         2016-12-02      Yes     No
check_hpux_guid_status          enabled         2017-05-18      No      No
check_hpux_httpd_status         enabled         2017-04-23      No      No
check_hpux_ignite_backup        enabled         2016-06-03      Yes     Yes
check_hpux_ioscan               enabled         2016-12-01      Yes     Yes
check_hpux_named_status         enabled         2017-01-07      No      Yes
check_hpux_ntp_status           enabled         2016-12-29      Yes     Yes
check_hpux_ovpa_status          enabled         2016-12-01      No      Yes
check_hpux_postfix_status       enabled         2016-12-01      No      Yes
check_hpux_root_crontab         enabled         2013-09-09      Yes     Yes
check_hpux_sg_cluster_config    enabled         2016-12-01      Yes     No
check_hpux_sg_cluster_status    unlinked        2017-05-07      Yes     Yes
check_hpux_sg_package_config    enabled         2016-12-01      Yes     No
check_hpux_sg_package_status    enabled         2017-05-07      Yes     Yes
check_hpux_sg_qs_status         enabled         2017-05-01      No      No
check_hpux_sshd_status          enabled         2017-04-01      No      Yes
check_hpux_syslog               enabled         2017-05-18      Yes     Yes
check_hpux_vg_minor_number      enabled         2016-04-28      No      Yes

Dead links:

current FPATH: /opt/hc/lib/platform:/opt/hc/lib/platform/hp-ux

Config?: plugin has a default configuration file (Yes/No)
Sched? : plugin is scheduled through cron (Yes/No)
H+?    : plugin can choose whether to log/show passed health checks (Yes/No/Supported/Not supported)

then this indicates that new plugins are available that have not been linked yet. This typically happens when you first install the HC tool. To make all (new) HC plugins available, run the following command:

# /opt/hc/bin/check_health.sh --fix-symlinks

INFO: created symbolic link /opt/hc/lib/platform/hp-ux/check_hpux_sg_cluster_status.sh -> /opt/hc/lib/platform/hp-ux/check_hpux_sg_cluster_status

This creates the necessary symbolic links with the KSH function names so that the plugins can be picked up via the FPATH setting. Afterwards the plugin is listed as enabled:

# /opt/hc/bin/check_health.sh --list

Health Check                    State           Version         Config? Sched?
--------------------------------------------------------------------------------
check_hpux_autopath             enabled         2013-08-29      No      No
check_hpux_file_age             enabled         2013-05-29      Yes     No
check_hpux_file_change          enabled         2017-05-18      Yes     No
check_hpux_fs_mounts            enabled         2016-07-04      No      Yes
check_hpux_fs_mounts_options    enabled         2016-12-02      Yes     No
check_hpux_guid_status          enabled         2017-05-18      No      No
check_hpux_httpd_status         enabled         2017-04-23      No      No
check_hpux_ignite_backup        enabled         2016-06-03      Yes     Yes
check_hpux_ioscan               enabled         2016-12-01      Yes     Yes
check_hpux_named_status         enabled         2017-01-07      No      Yes
check_hpux_ntp_status           enabled         2016-12-29      Yes     Yes
check_hpux_ovpa_status          enabled         2016-12-01      No      Yes
check_hpux_postfix_status       enabled         2016-12-01      No      Yes
check_hpux_root_crontab         enabled         2013-09-09      Yes     Yes
check_hpux_sg_cluster_config    enabled         2016-12-01      Yes     No
check_hpux_sg_cluster_status    enabled         2017-05-07      Yes     Yes
check_hpux_sg_package_config    enabled         2016-12-01      Yes     No
check_hpux_sg_package_status    enabled         2017-05-07      Yes     Yes
check_hpux_sg_qs_status         enabled         2017-05-01      No      No
check_hpux_sshd_status          enabled         2017-04-01      No      Yes
check_hpux_syslog               enabled         2017-05-18      Yes     Yes
check_hpux_vg_minor_number      enabled         2016-04-28      No      Yes

Dead links:

current FPATH: /opt/hc/lib/platform:/opt/hc/lib/platform/hp-ux

Config?: plugin has a default configuration file (Yes/No)
Sched? : plugin is scheduled through cron (Yes/No)
H+?    : plugin can choose whether to log/show passed health checks (Yes/No/Supported/Not supported)
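
The check for unlinked plugins can also be scripted. The following is a minimal sketch, assuming that the --list output keeps the column layout shown above (plugin name in the first column, state in the second). The function name hc_with_state is hypothetical and not part of the HC tool.

```shell
#!/bin/sh
# Hypothetical helper, not part of check_health.sh.
# Reads `check_health.sh --list` output on stdin and prints the names
# of all plugins whose State column equals the given value.
# Assumption: column 1 is the plugin name, column 2 the state.
hc_with_state() {
    awk -v state="$1" '$1 ~ /^check_/ && $2 == state { print $1 }'
}

# Usage sketch: relink only when something is reported as unlinked.
# if [ -n "$(/opt/hc/bin/check_health.sh --list | hc_with_state unlinked)" ]; then
#     /opt/hc/bin/check_health.sh --fix-symlinks
# fi
```

The same function can be called with disabled or enabled to get the corresponding plugin names.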

Showing information on a HC

Use the --show option to get a short help on an individual plug-in:

# /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan --show
NAME    : check_hpux_ioscan
VERSION : 2016-12-01
CONFIG  : /etc/opt/hc/check_hpux_ioscan.conf with:
            ioscan_classes=<list_of_device_classes_to_check>
            kernel_mode=<yes|no>
            agile_view=<yes|no>
PURPOSE : Checks whether 'ioscan' returns errors or not (NO_HW, ERROR)

Enabling/disabling a HC

Each HC can be (temporarily) disabled, for example to allow for a maintenance period:

Check current HC status:

# /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan --check                                                   

INFO: *** start of check_health.sh [--hc=check_hpux_ioscan --check] ***
INFO: logging takes places in /var/opt/hc/check_health.sh.log
INFO: HC check_hpux_ioscan is currently enabled
INFO: performing cleanup ...
INFO: /var/tmp/.check_health.sh.lock lock directory removed
INFO: *** finish of check_health.sh [--hc=check_hpux_ioscan --check] ***

Disable the HC plugin:

# /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan --disable

INFO: *** start of check_health.sh [--hc=check_hpux_ioscan --disable] ***
INFO: logging takes places in /var/opt/hc/check_health.sh.log
INFO: disabling HC: check_hpux_ioscan
INFO: succesfully disabled HC: check_hpux_ioscan
INFO: performing cleanup ...
INFO: /var/tmp/.check_health.sh.lock lock directory removed
INFO: *** finish of check_health.sh [--hc=check_hpux_ioscan --disable] ***

Trying to execute a disabled plugin will fail:

# /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan --run

INFO: *** start of check_health.sh [--hc=check_hpux_ioscan --run] ***
INFO: logging takes places in /var/opt/hc/check_health.sh.log
ERROR: may not run disabled HC: check_hpux_ioscan
INFO: performing cleanup ...
INFO: /var/tmp/.check_health.sh.lock lock directory removed
INFO: *** finish of check_health.sh [--hc=check_hpux_ioscan --run] ***
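
For a maintenance period, the --disable and --enable steps can be wrapped around the maintenance task so that the HC is never left disabled by accident. The following is a minimal sketch under stated assumptions: the function name with_hc_disabled and the HC_BIN variable are hypothetical, and it is assumed that check_health.sh exits with a non-zero code when disabling fails.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of check_health.sh.
# HC_BIN is an assumed variable so that the path can be overridden.
HC_BIN=${HC_BIN:-/opt/hc/bin/check_health.sh}

# with_hc_disabled <hc_name> <command> [args...]
# Disables the HC, runs the command, then enables the HC again and
# returns the exit code of the command.
# Assumption: check_health.sh exits non-zero if --disable fails.
with_hc_disabled() {
    hc_name=$1
    shift
    "$HC_BIN" --hc="$hc_name" --disable || return 1
    "$@"
    cmd_rc=$?
    # re-enable even when the command failed
    "$HC_BIN" --hc="$hc_name" --enable
    return "$cmd_rc"
}

# Usage sketch (the maintenance command is a placeholder):
# with_hc_disabled check_hpux_ioscan /path/to/maintenance_task
```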

Doing a host check

The Health Checker can run a number of checks in sequence to give an overall picture of the health of a system. You can configure which health checks (aka plugins) to execute in the /etc/opt/hc/check_host.conf configuration file.

Example:

hc:check_linux_root_crontab::
hc:check_linux_fs_mounts::
hc:check_linux_httpd_status::
hc:check_linux_named_status::
hc:check_linux_postfix_status::
hc:check_linux_samba_status::
hc:check_linux_sshd_status::

Then execute the host check:

# /opt/hc/bin/check_health.sh --check-host

check_linux_root_crontab                                         (default config)       [     FAIL ] (20170529111355)
check_linux_fs_mounts                                            (default config)       [       OK ]
check_linux_httpd_status                                         (default config)       [     FAIL ] (20170529111355)
check_linux_named_status                                         (default config)       [     FAIL ] (20170529111355)
check_linux_postfix_status                                       (default config)       [       OK ]
check_linux_samba_status                                         (default config)       [     FAIL ] (20170529111355)
check_linux_sshd_status                                          (default config)       [       OK ]
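
The number shown in parentheses after a FAIL is the FAIL ID, which can be passed to the --report --id=<fail_id> --detail options. The following is a minimal sketch that extracts the failed checks and their IDs, assuming the output format shown above. The function name failed_host_checks is hypothetical and not part of the HC tool.

```shell
#!/bin/sh
# Hypothetical helper, not part of check_health.sh.
# Reads `check_health.sh --check-host` output on stdin and prints
# "<hc_name> <fail_id>" for every line marked FAIL.
# Assumption: the FAIL ID is the last field, wrapped in parentheses.
failed_host_checks() {
    awk '
        /\[ *FAIL *\]/ {
            id = $NF
            gsub(/[()]/, "", id)
            print $1, id
        }
    '
}

# Usage sketch:
# /opt/hc/bin/check_health.sh --check-host | failed_host_checks
```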

Reporting on failed HCs

‘last’ report

This shows the last event of each HC together with its combined STC value. For example:

# /opt/hc/bin/check_health.sh --report --last

| HC                             | Timestamp            | FAIL ID        | STC (combined value)
----------------------------------------------------------------------------------------------------
| check_hpux_autopath            | -                    | -              | -
| check_hpux_file_age            | -                    | -              | -
| check_hpux_fs_mounts           | 2017-01-11 09:45:01  | -              | 0
| check_hpux_fs_mounts_options   | 2017-01-11 09:45:01  | -              | 0
| check_hpux_ignite_backup       | -                    | -              | -
| check_hpux_ioscan              | 2017-01-11 11:15:18  | -              | 0
| check_hpux_named_status        | -                    | -              | -
| check_hpux_ntp_status          | 2017-01-11 12:48:00  | -              | 0
| check_hpux_ovpa_status         | 2017-01-11 12:42:05  | -              | 0
| check_hpux_postfix_status      | -                    | -              | -
| check_hpux_root_crontab        | 2017-01-08 03:00:00  | -              | 0
| check_hpux_sg_cluster_config   | -                    | -              | -
| check_hpux_sg_cluster_status   | 2017-01-11 12:57:00  | -              | 0
| check_hpux_sg_package_config   | -                    | -              | -
| check_hpux_sg_package_status   | 2017-01-11 12:47:01  | -              | 0
| check_hpux_sshd_status         | 2017-01-11 12:38:00  | -              | 0
| check_hpux_syslog              | 2017-01-11 11:25:00  | -              | 0
| check_hpux_vg_minor_number     | 2017-01-08 04:00:00  | -              | 0

Today’s report

# /opt/hc/bin/check_health.sh --report --today

SUMMARY: 0 failed HC events found.

# /opt/hc/bin/check_health.sh --report --today

| Timestamp            | FAIL ID        | HC                             | Message
------------------------------------------------------------------------------------------------------------------------
| 2017-05-29 10:21:52  | 20170529102151 | check_linux_httpd_status       | httpd is not running
| 2017-05-29 10:21:52  | 20170529102151 | check_linux_named_status       | named is not running
| 2017-05-29 10:21:52  | 20170529102151 | check_linux_root_crontab       | '/usr/bin/cfg2html-linux' is not configured in cron
| 2017-05-29 10:21:52  | 20170529102151 | check_linux_samba_status       | NMB/SMB are not running

Global report

This shows all (failed) events from the current HC log file. For example:

# /opt/hc/bin/check_health.sh --report                             

| Timestamp            | ID             | HC                             | Message
------------------------------------------------------------------------------------------------------------------------
| 2016-05-04 09:58:42  | 20160504095842 | check_hpux_ovpa_status         | scopeux is not running
| 2016-05-04 09:58:42  | 20160504095842 | check_hpux_ovpa_status         | midaemon is not running
| 2016-05-04 09:58:42  | 20160504095842 | check_hpux_ovpa_status         | ttd is not running
| 2016-06-21 16:12:45  | 20160621161245 | check_hpux_syslog              | found 653 new SYSLOG messages
| 2016-06-22 14:53:37  | 20160622145337 | check_hpux_syslog              | found 653 new SYSLOG messages
| 2016-06-22 15:31:18  | 20160622153118 | check_hpux_syslog              | found 653 new SYSLOG messages
| 2016-06-28 11:25:00  | 20160628112500 | check_hpux_syslog              | found 1 new SYSLOG messages
| 2016-06-29 11:25:01  | 20160629112501 | check_hpux_syslog              | found 1 new SYSLOG messages
| 2016-06-30 11:25:01  | 20160630112501 | check_hpux_syslog              | found 1 new SYSLOG messages
| 2016-06-30 15:25:01  | 20160630152501 | check_hpux_syslog              | found 1 new SYSLOG messages
| 2016-07-04 15:25:01  | 20160704152501 | check_hpux_syslog              | found 1 new SYSLOG messages
| 2016-07-04 15:40:01  | 20160704154001 | check_hpux_fs_mounts           | /software is not mounted
| 2016-07-04 15:40:01  | 20160704154001 | check_hpux_fs_mounts           | /logs is not mounted
| 2016-07-05 09:40:00  | 20160705094000 | check_hpux_fs_mounts           | /software is not mounted
| 2016-07-05 09:40:00  | 20160705094000 | check_hpux_fs_mounts           | /logs is not mounted
| 2016-07-06 15:25:01  | 20160706152501 | check_hpux_syslog              | found 1 new SYSLOG messages
| 2016-07-08 15:25:01  | 20160708152501 | check_hpux_syslog              | found 1 new SYSLOG messages
| 2016-07-09 07:25:00  | 20160709072500 | check_hpux_syslog              | found 653 new SYSLOG messages
| 2016-07-11 08:42:02  | 20160711084202 | check_hpux_ovpa_status         | perfalarm is not running
| 2016-07-11 08:42:02  | 20160711084202 | check_hpux_ovpa_status         | ovcd is not running
| 2016-07-11 08:42:02  | 20160711084202 | check_hpux_ovpa_status         | ovbbccb is not running
| 2016-07-11 08:42:02  | 20160711084202 | check_hpux_ovpa_status         | coda is not running
| 2016-07-11 12:42:02  | 20160711124202 | check_hpux_ovpa_status         | perfalarm is not running
| 2016-07-11 12:42:02  | 20160711124202 | check_hpux_ovpa_status         | ovcd is not running
| 2016-07-11 12:42:02  | 20160711124202 | check_hpux_ovpa_status         | ovbbccb is not running
| 2016-07-11 12:42:02  | 20160711124202 | check_hpux_ovpa_status         | coda is not running
| 2016-07-11 16:42:02  | 20160711164202 | check_hpux_ovpa_status         | perfalarm is not running
| 2016-07-11 16:42:02  | 20160711164202 | check_hpux_ovpa_status         | ovcd is not running
| 2016-07-11 16:42:02  | 20160711164202 | check_hpux_ovpa_status         | ovbbccb is not running
| 2016-07-11 16:42:02  | 20160711164202 | check_hpux_ovpa_status         | coda is not running
| 2016-11-24 11:25:00  | 20161124112500 | check_hpux_syslog              | found 653 new SYSLOG messages
| 2016-11-24 12:42:04  | 20161124124204 | check_hpux_ovpa_status         | perfalarm is not running
| 2016-11-24 12:42:04  | 20161124124204 | check_hpux_ovpa_status         | ovcd is not running
| 2016-11-24 12:42:04  | 20161124124204 | check_hpux_ovpa_status         | ovbbccb is not running
| 2016-11-24 12:42:04  | 20161124124204 | check_hpux_ovpa_status         | coda is not running

SUMMARY: 19 failed HC event(s) found.
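
Note that the SUMMARY line counts events, not message lines: the listing above contains 19 distinct IDs, several of which have more than one message. The following is a minimal sketch that breaks this count down per HC, assuming the pipe-separated report layout shown above and assuming that one event corresponds to one distinct ID. The function name events_per_hc is hypothetical and not part of the HC tool.

```shell
#!/bin/sh
# Hypothetical helper, not part of check_health.sh.
# Reads `check_health.sh --report` output on stdin and prints
# "<hc_name> <number_of_events>" per HC, sorted by HC name.
# Assumption: fields are separated by "|" in the order
# Timestamp, ID, HC, Message, and one event equals one distinct ID.
events_per_hc() {
    awk -F'|' '
        $3 ~ /^ *[0-9]+ *$/ {
            id = $3
            hc = $4
            gsub(/ /, "", id)
            gsub(/ /, "", hc)
            # count every (ID, HC) combination only once
            if (!((id, hc) in seen)) {
                seen[id, hc] = 1
                count[hc]++
            }
        }
        END { for (hc in count) print hc, count[hc] }
    ' | sort
}

# Usage sketch:
# /opt/hc/bin/check_health.sh --report | events_per_hc
```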

Filtered report

It is possible to filter events by passing only the leading digits of a FAIL ID to --id, for example the year and month (YYYYMM) to filter by month:

# /opt/hc/bin/check_health.sh --report --id=201607

| Timestamp            | ID             | HC                             | Message
------------------------------------------------------------------------------------------------------------------------
| 2016-07-04 15:25:01  | 20160704152501 | check_hpux_syslog              | found 1 new SYSLOG messages
| 2016-07-04 15:40:01  | 20160704154001 | check_hpux_fs_mounts           | /software is not mounted
| 2016-07-04 15:40:01  | 20160704154001 | check_hpux_fs_mounts           | /logs is not mounted
| 2016-07-05 09:40:00  | 20160705094000 | check_hpux_fs_mounts           | /software is not mounted
| 2016-07-05 09:40:00  | 20160705094000 | check_hpux_fs_mounts           | /logs is not mounted
| 2016-07-06 15:25:01  | 20160706152501 | check_hpux_syslog              | found 1 new SYSLOG messages
| 2016-07-08 15:25:01  | 20160708152501 | check_hpux_syslog              | found 1 new SYSLOG messages
| 2016-07-09 07:25:00  | 20160709072500 | check_hpux_syslog              | found 653 new SYSLOG messages
| 2016-07-11 08:42:02  | 20160711084202 | check_hpux_ovpa_status         | perfalarm is not running
| 2016-07-11 08:42:02  | 20160711084202 | check_hpux_ovpa_status         | ovcd is not running
| 2016-07-11 08:42:02  | 20160711084202 | check_hpux_ovpa_status         | ovbbccb is not running
| 2016-07-11 08:42:02  | 20160711084202 | check_hpux_ovpa_status         | coda is not running
| 2016-07-11 12:42:02  | 20160711124202 | check_hpux_ovpa_status         | perfalarm is not running
| 2016-07-11 12:42:02  | 20160711124202 | check_hpux_ovpa_status         | ovcd is not running
| 2016-07-11 12:42:02  | 20160711124202 | check_hpux_ovpa_status         | ovbbccb is not running
| 2016-07-11 12:42:02  | 20160711124202 | check_hpux_ovpa_status         | coda is not running
| 2016-07-11 16:42:02  | 20160711164202 | check_hpux_ovpa_status         | perfalarm is not running
| 2016-07-11 16:42:02  | 20160711164202 | check_hpux_ovpa_status         | ovcd is not running
| 2016-07-11 16:42:02  | 20160711164202 | check_hpux_ovpa_status         | ovbbccb is not running
| 2016-07-11 16:42:02  | 20160711164202 | check_hpux_ovpa_status         | coda is not running

SUMMARY: 9 failed HC event(s) found.
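
Since --id expects an uninterrupted sequence of numbers, a date has to be stripped of its separators first. The following is a minimal sketch, assuming that --id matches FAIL IDs by their leading digits as the example above suggests. The function name fail_id_prefix is hypothetical and not part of the HC tool.

```shell
#!/bin/sh
# Hypothetical helper, not part of check_health.sh.
# Turns a date such as 2016-07 or 2016-07-04 into the digit-only
# form expected by --id (201607, 20160704).
# Assumption: --id matches FAIL IDs by their leading digits.
fail_id_prefix() {
    digits=$(printf '%s' "$1" | tr -cd '0-9')
    # refuse input that contains no digits at all
    [ -n "$digits" ] || return 1
    printf '%s\n' "$digits"
}

# Usage sketch: report on the failed events of the current month.
# /opt/hc/bin/check_health.sh --report --id="$(fail_id_prefix "$(date +%Y-%m)")"
```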

Detailed report

This will show the details of one HC event. The report will show the basic information on all related messages and additionally show the saved STDOUT & STDERR output. For example:

# /opt/hc/bin/check_health.sh --report --id=20160704154001 --detail
------------------------------------MSG-#001------------------------------------
Time    : 2016-07-04 15:40:01
HC      : check_hpux_fs_mounts
Detail  : /software is not mounted
------------------------------------MSG-#002------------------------------------
Time    : 2016-07-04 15:40:01
HC      : check_hpux_fs_mounts
Detail  : /logs is not mounted
-------------------------------------STDOUT-------------------------------------
/ on /dev/vg00/lvol3 ioerror=mwdisable,largefiles,delaylog,dev=40000003 on Wed Mar 16 12:54:16 2016
/stand on /dev/vg00/lvol1 ioerror=mwdisable,nolargefiles,log,tranflush,dev=40000001 on Wed Mar 16 12:54:22 2016
/var on /dev/vg00/lvol8 ioerror=mwdisable,largefiles,delaylog,dev=40000008 on Wed Mar 16 12:54:31 2016
/var/adm/crash on /dev/vg00/lvol9 ioerror=mwdisable,largefiles,delaylog,dev=40000009 on Wed Mar 16 12:54:31 2016
/usr on /dev/vg00/lvol7 ioerror=mwdisable,largefiles,delaylog,dev=40000007 on Wed Mar 16 12:54:31 2016
/usr/sap/tmp on /dev/vgusrsap/lvusrsaptmp ioerror=mwdisable,largefiles,delaylog,dev=80008003 on Wed Mar 16 12:54:31 2016
/usr/sap/saptbx on /dev/vgusrsap/lvusrsaptbx ioerror=mwdisable,largefiles,delaylog,dev=80008006 on Wed Mar 16 12:54:31 2016
/usr/sap/hostctrl on /dev/vgusrsap/lvsaphostctrl ioerror=mwdisable,largefiles,delaylog,dev=80008004 on Wed Mar 16 12:54:31 2016
/tmp on /dev/vg00/lvol6 ioerror=mwdisable,largefiles,delaylog,dev=40000006 on Wed Mar 16 12:54:31 2016
/oracle on /dev/vgoracle/lvoracle ioerror=mwdisable,largefiles,delaylog,dev=80000001 on Wed Mar 16 12:54:31 2016
/opt on /dev/vg00/lvol5 ioerror=mwdisable,largefiles,delaylog,dev=40000005 on Wed Mar 16 12:54:32 2016
/home on /dev/vg00/lvol4 ioerror=mwdisable,largefiles,delaylog,dev=40000004 on Wed Mar 16 12:54:32 2016
# System /etc/fstab file.  Static information about the file systems
# See fstab(4) and sam(1M) for further details on configuring devices.
/dev/vg00/lvol3 / vxfs delaylog 0 1
/dev/vg00/lvol1 /stand vxfs tranflush 0 1
/dev/vg00/lvol4 /home vxfs delaylog 0 2
/dev/vg00/lvol5 /opt vxfs delaylog 0 2
/dev/vg00/lvol6 /tmp vxfs delaylog 0 2
/dev/vg00/lvol7 /usr vxfs delaylog 0 2
/dev/vg00/lvol8 /var vxfs delaylog 0 2
/dev/vg00/lvol9 /var/adm/crash vxfs delaylog 0 2
/dev/vgswap/lvol1 ... swap pri=1 0 0
/dev/vgoracle/lvoracle /oracle vxfs rw,delaylog,datainlog,largefiles 0 2
/dev/vgusrsap/lvusrsaptmp /usr/sap/tmp vxfs rw,delaylog,datainlog,largefiles 0 2
/dev/vgusrsap/lvsaphostctrl /usr/sap/hostctrl vxfs rw,delaylog,datainlog,largefiles 0 2
/dev/vgusrsap/lvusrsaptbx /usr/sap/saptbx vxfs rw,delaylog,datainlog,largefiles 0 2
-------------------------------------STDERR-------------------------------------
No STDERR found
--------------------------------------------------------------------------------

Archival of HC messages

Maintaining the `hc.log`

Over time the event log file (hc.log) will keep growing and may cause the reporting feature to produce lengthy output. This problem can be solved by archiving event messages into separate archive log files in the /var/opt/hc/archive directory.

The archiving feature works for all event messages (failed and succeeded) of a particular health check which are currently still in the event log file, e.g.:

# /opt/hc/bin/check_health.sh --hc=check_hpux_kernel_params --archive

INFO: *** start of check_health.sh [--hc=check_hpux_kernel_params --archive] ***
INFO: logging takes places in /var/opt/hc/check_health.sh.log
INFO: archiving log entries for check_hpux_kernel_params...
INFO: # of new entries to archive: 66
INFO: # entries in /var/opt/hc/archive/hc.2017-12.log now: 72
INFO: # entries in /var/opt/hc/hc.log now: 14
INFO: successfully archived log entries for check_hpux_kernel_params
INFO: performing cleanup ...
INFO: *** finish of check_health.sh [--hc=check_hpux_kernel_params --archive] ***

Alternatively, use the --archive-all command-line option to archive the current event messages of all health checks at once.

Messages will be archived into archive log files organized by year/month (YYYY-MM) based on the timestamp of each message.
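
As an illustration of this naming scheme, the sketch below derives the archive log file for a single event timestamp. It assumes the file name pattern hc.<YYYY-MM>.log seen in the sample output above; the ARCHIVE_DIR variable and the hc_archive_file function are hypothetical names used for this example only and are not part of check_health.sh.

```shell
# Sketch only: map an event timestamp to its archive log file.
# Assumption: archive files are named hc.<YYYY-MM>.log (as in the sample output).
# ARCHIVE_DIR and hc_archive_file are hypothetical helper names.
ARCHIVE_DIR=${ARCHIVE_DIR:-/var/opt/hc/archive}

hc_archive_file() {
    # $1 = timestamp in "YYYY-MM-DD hh:mm:ss" form
    year_month=$(printf '%s\n' "$1" | cut -c1-7)
    printf '%s/hc.%s.log\n' "$ARCHIVE_DIR" "$year_month"
}

hc_archive_file "2017-12-22 13:56:55"
```

With the default directory, the call above prints /var/opt/hc/archive/hc.2017-12.log.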

It is not possible to archive event messages before a certain date or within a certain timeframe.

Archival does not touch the event information stored in the /var/opt/hc/events directory and its subdirectories.
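
Since archiving cannot be limited to a date range, one possible approach is to archive everything once the event log passes a certain size. The sketch below only decides whether that point has been reached. It assumes one event message per line in hc.log; the function name and the threshold of 5000 lines are hypothetical and are not settings of check_health.sh.

```shell
# Sketch only: decide whether the event log is large enough to archive.
# Assumption: hc.log holds one event message per line.
# hc_log_needs_archive and the threshold value are hypothetical.
hc_log_needs_archive() {
    # $1 = event log file, $2 = maximum number of lines to tolerate
    [ -f "$1" ] || return 1
    lines=$(wc -l < "$1" | tr -d ' ')
    [ "$lines" -gt "$2" ]
}

# Possible use, e.g. from a cron job of the root user:
# if hc_log_needs_archive /var/opt/hc/hc.log 5000; then
#     /opt/hc/bin/check_health.sh --archive-all
# fi
```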

Reporting on archived messages

(Failed) event messages that have been archived are no longer displayed in the standard report. To see archived events, use the --with-history option, e.g.:

# /opt/hc/bin/check_health.sh --report --with-history

| Timestamp            | FAIL ID        | HC                             | Message
------------------------------------------------------------------------------------------------------------------------
| 2017-12-22 13:56:55  | 20171222135655 | check_hpux_kernel_params       | ipl_suppress has a wrong value (0 != 1)
| 2017-12-22 13:56:55  | 20171222135655 | check_hpux_kernel_params       | nproc has a wrong expression (5000 != 8000)
| 2017-12-22 14:50:17  | 20171222145013 | check_hpux_ovpa_status         | oacore is not running
| 2017-12-22 14:50:17  | 20171222145013 | check_hpux_ovpa_status         | perfalarm is not running

SUMMARY: 2 failed HC event(s) found.
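
The report output can be post-processed with standard tools. As a sketch, the function below counts the failure messages (report rows) per health check. It assumes the pipe-separated column layout shown above and reads the report from standard input; count_failures_by_hc is a hypothetical helper name.

```shell
# Sketch only: count report rows per HC.
# Assumption: rows look like "| Timestamp | FAIL ID | HC | Message".
# count_failures_by_hc is a hypothetical helper, not part of check_health.sh.
count_failures_by_hc() {
    awk -F'|' '
        # data rows have at least 5 fields and a purely numeric FAIL ID
        NF >= 5 && $3 ~ /^[ ]*[0-9]+[ ]*$/ {
            hc = $4
            gsub(/^[ \t]+|[ \t]+$/, "", hc)   # trim padding around the HC name
            count[hc]++
        }
        END { for (hc in count) printf "%s %d\n", hc, count[hc] }
    ' | sort
}

# Possible use:
# /opt/hc/bin/check_health.sh --report --with-history | count_failures_by_hc
```

Note that this counts messages, not events: in the report above each HC has two messages that share one FAIL ID.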

Alerting on failed HCs

E-mail

The Health Checker can send out e-mail alerts upon health failures. For example:

# /opt/hc/bin/check_health.sh --hc=check_hpux_root_crontab --run --notify=mail --mail-to="alert@acme.com"
INFO: *** start of check_health.sh [--hc=check_hpux_root_crontab --run --notify=mail --mail-to=alert@acme.com] ***
INFO: logging takes places in /var/opt/hc/check_health.sh.log
INFO: spawning child process with time-out of 60 secs for HC call [PID=11797]
INFO: check_hpux_root_crontab [STC=0]: '/usr/sbin/dmesg - >>/var/adm/messages' is configured in cron
INFO: check_hpux_root_crontab [STC=0]: '/usr/lbin/sa/sa1 60 60' is configured in cron
INFO: check_hpux_root_crontab [STC=0]: '/usr/lbin/sa/sa2 -s 8:00 -e 22:01 -i 3600 -A' is configured in cron
INFO: check_hpux_root_crontab [STC=0]: '/opt/ignite/bin/make_net_recovery -u -v -s igniteB' is configured in cron
INFO: check_hpux_root_crontab [STC=1]: '/opt/ignite/bin/make_net_recovery -u -v -s igniteA' is not configured in cron [FAIL_ID=20160418142154]
INFO: sent mail alert to alert@acme.com
INFO: executed HC: check_hpux_root_crontab [RC=0]
INFO: performing cleanup ...
INFO: /var/tmp/.check_health.sh.lock lock directory removed
INFO: *** finish of check_health.sh [--hc=check_hpux_root_crontab --run --notify=mail --mail-to=alert@acme.com] ***

Upon failure, an e-mail will be dispatched to the configured recipients with the following contents:

HC     : check_hpux_root_crontab
FAIL_ID: 20160418142154
MESSAGE:

'/opt/ignite/bin/make_net_recovery -u -v -s igniteA' is not configured in cron

------------------------------------------------------------------------------

Please check the corresponding log file(s) for more details (see below or at at lum0307hp:/var/opt/hc)

********************** SUMMARY OF ATTACHED FILES *****************************

STDOUT      : /var/opt/hc/events/2016-04/20160418142154/check_hpux_root_crontab.stdout
STDERR      : no log file available

*** END OF MAIL. DO NOT REPLY TO THIS E-MAIL. NOBODY WILL SEE IT! ***
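
The STDOUT path in the mail above suggests that event logs are stored under /var/opt/hc/events/<YYYY-MM>/<FAIL_ID>/. Assuming that layout holds on your installation, the sketch below turns a FAIL_ID taken from an alert into the matching directory. EVENTS_DIR and hc_event_dir are hypothetical names chosen for this example.

```shell
# Sketch only: locate the event directory that belongs to a FAIL_ID.
# Assumption: events live in <events dir>/<YYYY-MM>/<FAIL_ID>, as in the mail above.
# EVENTS_DIR and hc_event_dir are hypothetical helper names.
EVENTS_DIR=${EVENTS_DIR:-/var/opt/hc/events}

hc_event_dir() {
    # $1 = FAIL_ID in YYYYMMDDhhmmss form
    year=$(printf '%s\n' "$1" | cut -c1-4)
    month=$(printf '%s\n' "$1" | cut -c5-6)
    printf '%s/%s-%s/%s\n' "$EVENTS_DIR" "$year" "$month" "$1"
}

hc_event_dir 20160418142154
```

With the default directory, the call above prints /var/opt/hc/events/2016-04/20160418142154.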

SMS

This requires a supported SMS provider.

ITM (Tivoli)

The Health Checker can send out ITM alerts upon health failures. Note: in order for this to work you must have the POSTEIFMSG tool installed and properly configured on the host. For example:

# /opt/hc/bin/check_health.sh --hc=check_hpux_root_crontab --run --notify=eif
INFO: *** start of check_health.sh [--hc=check_hpux_root_crontab --run --notify=eif] ***
INFO: logging takes places in /var/opt/hc/check_health.sh.log
INFO: spawning child process with time-out of 60 secs for HC call [PID=12142]
INFO: check_hpux_root_crontab [STC=0]: '/usr/sbin/dmesg - >>/var/adm/messages' is configured in cron
INFO: check_hpux_root_crontab [STC=0]: '/usr/lbin/sa/sa1 60 60' is configured in cron
INFO: check_hpux_root_crontab [STC=0]: '/usr/lbin/sa/sa2 -s 8:00 -e 22:01 -i 3600 -A' is configured in cron
INFO: check_hpux_root_crontab [STC=0]: '/opt/ignite/bin/make_net_recovery -u -v -s igniteB' is configured in cron
INFO: check_hpux_root_crontab [STC=1]: '/opt/ignite/bin/make_net_recovery -u -v -s igniteA' is not configured in cron [FAIL_ID=20160418142804]
INFO: spawning child process with time-out of 10 secs for EIF notify [PID=12184]
INFO: child process with PID 12184 ended correctly
INFO: executed HC: check_hpux_root_crontab [RC=0]
INFO: performing cleanup ...
INFO: /var/tmp/.check_health.sh.lock lock directory removed
INFO: *** finish of check_health.sh [--hc=check_hpux_root_crontab --run --notify=eif] ***

Check the TIP console for the incoming message.
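
Because EIF notification depends on the POSTEIFMSG tool, a wrapper script may want to fall back to e-mail on hosts where the tool is missing. The sketch below is one way to do that. It assumes the tool is installed as a command named posteifmsg and can be found via PATH, which may not match how your installation or the health checker locates it; hc_notify_method is a hypothetical helper.

```shell
# Sketch only: pick a notification method based on tool availability.
# Assumption: the EIF tool is a command named "posteifmsg" found via PATH.
# hc_notify_method is a hypothetical helper, not part of check_health.sh.
hc_notify_method() {
    if command -v posteifmsg >/dev/null 2>&1; then
        echo "eif"
    else
        echo "mail"
    fi
}

# Possible use (mail additionally needs --mail-to):
# /opt/hc/bin/check_health.sh --hc=check_hpux_root_crontab --run --notify="$(hc_notify_method)"
```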

Tips

Alias

Define the following alias in the profile of the root user (.kshrc, .bashrc):

alias hc='/opt/hc/bin/check_health.sh'

This will allow you to run shortened commands like:

# hc --list

Remote command

When running the health checker via an SSH remote command, use the --no-monitor switch to avoid problems with hanging background processes:

$ ssh ${USER}@${HOST} "sudo /opt/hc/bin/check_health.sh --hc=check_hpux_ioscan --run --no-monitor"
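
The same remote call can be repeated over several hosts. The following is a minimal sketch under the assumption that key-based SSH login and sudo rights are already in place on each host; run_hc_on_hosts and SSH_CMD are hypothetical names, and SSH_CMD exists only so the SSH client can be swapped out (for example during a dry run).

```shell
# Sketch only: run one HC on several hosts over SSH.
# run_hc_on_hosts and SSH_CMD are hypothetical names for this example.
SSH_CMD=${SSH_CMD:-ssh}

run_hc_on_hosts() {
    # $1 = HC name, remaining arguments = hosts (optionally user@host)
    hc=$1
    shift
    rc=0
    for host in "$@"; do
        # stdin is redirected so ssh cannot swallow input meant for the loop
        "$SSH_CMD" "$host" "sudo /opt/hc/bin/check_health.sh --hc=$hc --run --no-monitor" </dev/null || {
            echo "WARN: non-zero exit status for $hc on $host" >&2
            rc=1
        }
    done
    return $rc
}

# Possible use:
# run_hc_on_hosts check_hpux_ioscan hostA hostB
```

A non-zero status here means that ssh or the health checker itself failed; unless --flip-rc is used, it does not reflect the result of the HC plugin.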

Bad entries in the log files

If the /opt/hc/bin/check_health.sh --report command reports rogue entries in the HC log file(s), run the health checker with the --fix-logs option:

# /opt/hc/bin/check_health.sh --report
...
SUMMARY: 30 failed HC event(s) found.
NOTE: found 1 rogue entr(y|ies) in log file /var/opt/hc/hc.log
NOTE: fix log errors with check_health.sh --fix-logs [--with-history]

# /opt/hc/bin/check_health.sh --fix-logs
INFO: *** start of check_health.sh [--fix-logs] ***
INFO: logging takes places in /var/opt/hc/check_health.sh.log
INFO: fixing log file /var/opt/hc/hc.log ...
INFO: successfully fixed log entries
INFO: performing cleanup ...
INFO: /var/tmp/.check_health.sh.lock lock directory removed
INFO: *** finish of check_health.sh [--fix-logs] ***

# /opt/hc/bin/check_health.sh --report
...
SUMMARY: 29 failed HC event(s) found.

Note: as of release 19th May 2019 (20190519) rogue log entries no longer appear.
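
On older releases, the check for rogue entries can be scripted. The sketch below looks for the NOTE line shown in the report output above. It assumes that line is written to standard output in exactly that wording; hc_report_has_rogue_entries is a hypothetical helper.

```shell
# Sketch only: detect the rogue-entry NOTE in report output read from stdin.
# Assumption: the NOTE is written to STDOUT with the wording shown above.
# hc_report_has_rogue_entries is a hypothetical helper name.
hc_report_has_rogue_entries() {
    grep 'found [0-9][0-9]* rogue entr' >/dev/null
}

# Possible use:
# if /opt/hc/bin/check_health.sh --report | hc_report_has_rogue_entries; then
#     /opt/hc/bin/check_health.sh --fix-logs
# fi
```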

Caching reporting data

As of release 20200407, some data can be cached for reporting purposes, specifically for the today and last reports. To enable this caching, set the following configuration parameters in /etc/opt/hc/core/check_health.conf:

# cache "last" reporting entries. Set to 'Yes' to speed up reporting of the last
# registered HC events
# [values: Yes|No]
HC_REPORT_CACHE_LAST="Yes"

# cache "today" reporting entries.  Set to 'Yes' to speed up reporting of today's
# registered HC events
# [values: Yes|No]
HC_REPORT_CACHE_TODAY="Yes"

Caching is useful when the HC log file (/var/opt/hc/hc.log) has grown large and you do not wish to archive past events yet. It has the additional advantage that the last and today’s HC events are still displayed even after a full archive. Without caching, archiving empties the HC log and therefore also the last and today reports (unless you use the --with-history option to include archived events).
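
To verify which caching settings are active, the configuration file can be read with standard tools. The sketch below assumes plain KEY="value" lines as in the excerpt above, and that the last assignment wins if a key appears more than once; hc_conf_value is a hypothetical helper and not part of check_health.sh.

```shell
# Sketch only: print the value of one parameter from a KEY="value" config file.
# Assumptions: one assignment per line, comments start with '#',
# and the last assignment wins. hc_conf_value is a hypothetical helper.
hc_conf_value() {
    # $1 = configuration file, $2 = parameter name
    awk -F= -v key="$2" '
        $1 == key {
            val = $2
            gsub(/"/, "", val)   # drop the surrounding double quotes
            last = val
        }
        END { print last }
    ' "$1"
}

# Possible use:
# hc_conf_value /etc/opt/hc/core/check_health.conf HC_REPORT_CACHE_LAST
```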