Concepts

The HC tool keeps several types of log and state files:

  1. check_health.sh.log: log file of the main script itself. It provides a chronological record of all script runs.
  1. hc.log: log file containing formatted records of HC executions and their results. The standard format is as follows:
<timestamp>|<hc plugin>|<hc result or STC>|<hc details>|<fail id>

For example (these checks all passed, so the records carry no fail id):

2016-04-11 15:17:00|check_hpux_sg_package_status|0|'postfix:status=up' has correct value
2016-04-11 15:17:00|check_hpux_sg_package_status|0|'postfix:state=running' has correct value
2016-04-11 15:17:00|check_hpux_sg_package_status|0|'postfix:autorun=enabled' has correct value
2016-04-11 15:17:00|check_hpux_sg_package_status|0|'ovpm:status=up' has correct value
2016-04-11 15:17:00|check_hpux_sg_package_status|0|'ovpm:state=running' has correct value
2016-04-11 15:17:00|check_hpux_sg_package_status|0|'ovpm:autorun=enabled' has correct value
2016-04-11 15:20:01|check_hpux_ioscan|0|no problems detected by /usr/sbin/ioscan
2016-04-11 15:27:00|check_hpux_sg_cluster_status|0|'status=up' has correct value
2016-04-11 15:27:00|check_hpux_sg_cluster_status|0|'state=stable' has correct value
2016-04-11 15:42:01|check_hpux_ovpa_status|0|scopeux is running
2016-04-11 15:42:01|check_hpux_ovpa_status|0|midaemon is running
2016-04-11 15:42:01|check_hpux_ovpa_status|0|perfalarm is running

An HC result (or STC) of:

  • <>0 (non-zero): indicates that the corresponding HC has failed (problems were detected).
  • 0: indicates that the corresponding HC did not detect any issues.
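
The STC field makes it straightforward to filter hc.log for failures. The following is a minimal sketch, assuming the pipe-separated record layout shown above with the STC in the third field. The helper name hc_failures is hypothetical and not part of the HC tool.

```shell
# Hypothetical helper (not part of the HC tool): print only failed HC records.
# Assumes the hc.log layout <timestamp>|<hc plugin>|<STC>|<hc details>[|<fail id>].
hc_failures() {
    # $1: path to an hc.log-style file
    # Keep records that have at least 4 fields and a non-zero STC in field 3.
    awk -F'|' 'NF >= 4 && $3 != 0' "$1"
}
```

It could be called as `hc_failures /var/opt/hc/hc.log`, using the log location from the grep example in this section.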
  1. Event files: upon HC failure, a FAIL_ID is generated (a timestamp). This FAIL_ID is used to generate an event and the corresponding evidence of that event. Typically, the STDOUT/STDERR information gathered during the HC is saved into a separate event directory. For example:
# /var/opt/hc/events # cd 2016-04/20160417030000
# /var/opt/hc/events/2016-04/20160417030000 # ls -l
total 16
-rw-r--r--   1 root       sys              0 Apr 17 03:00 check_hpux_root_crontab.stderr.log
-rw-r--r--   1 root       sys           3247 Apr 17 03:00 check_hpux_root_crontab.stdout.log

In this example the FAIL_ID is 20160417030000. It can also be used to trace the event in hc.log:

# /var/opt/hc # grep "20160417030000" hc.log
2016-04-17 03:00:00|check_hpux_root_crontab|1|'/opt/ignite/bin/make_net_recovery -u -v -s igniteA' is not configured in cron|20160417030000

Events are organized in separate directories per year and month (e.g. 2016-04) to avoid cluttering a single directory.
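
Given this layout, the event directory can be derived from a FAIL_ID. The following is a sketch only: hc_event_dir is a hypothetical helper, and the default events root is taken from the example above rather than from the tool's configuration.

```shell
# Hypothetical helper (not part of the HC tool): map a FAIL_ID to its event directory.
# Assumes FAIL_ID has the form YYYYMMDDhhmmss and events live under <root>/YYYY-MM/<FAIL_ID>.
hc_event_dir() {
    fail_id=$1
    # $2: events root; defaults to the path seen in the example (an assumption)
    events_root=${2:-/var/opt/hc/events}
    year=$(printf '%s' "$fail_id" | cut -c1-4)
    month=$(printf '%s' "$fail_id" | cut -c5-6)
    printf '%s/%s-%s/%s\n' "$events_root" "$year" "$month" "$fail_id"
}
```

For instance, `ls -l "$(hc_event_dir 20160417030000)"` would list the evidence files for the event shown above.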

  1. State files: some plugins may require one or more state or intermediary files to retain information between checks. Such files are placed in the location pointed to by the STATE_DIR setting. This location is also used by the enable/disable feature for the HC plugins themselves. Never clean up these files unless you know what you are doing.

Logging control

--no-log

Logging can be switched off ad hoc by using the --no-log script option. This runs the health checker in a preview or dry-run mode: the actual health check(s) are executed, but no results are logged.

log_healthy

This option can be enabled in two ways:

  • command-line option --log-healthy
  • plugin configuration parameter log_healthy (note that not all plugins support this option, see --list)

The --log-healthy option controls the logging (and display) of passed health checks (also known as healthy health checks). Most plugins only log/show failed health checks, although this also depends on the plugin code.

You can combine the --log-healthy option with the --no-log command-line option to have messages displayed but not logged. --no-log always takes precedence.

:warning: Your HC log file may grow very quickly when passed health checks are also logged.
