Conventions

Keep the following conventions in mind when working on the HC tool source code:

1) HC plugins are implemented as KSH functions, each in a stand-alone file. These functions are loaded on demand (at run time), so the file must have the same name as the function it contains (see man ksh: FPATH). Since we prefer all scripts to carry the .sh extension, a symbolic link named after the function must point to the script file.

For example:

function check_linux_process_limits
{}

corresponds to the file(s):

lrwxrwxrwx 1 root root    29 Jul 10 12:58 check_linux_process_limits -> check_linux_process_limits.sh
-rwxr-xr-x 1 root root 14486 Jul 16 08:23 check_linux_process_limits.sh

See also: check_health.sh --fix-symlinks
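
As a rough illustration of the link layout above, the following sketch creates the extension-less symbolic link for each check_*.sh script in a directory. It is not the implementation behind the fix-symlinks option of check_health.sh; PLUGIN_DIR and the temporary directory are assumptions made for this demo only.

```shell
# Sketch only: create <name> -> <name>.sh links for every plugin script.
# PLUGIN_DIR is a hypothetical location used for this demo.
PLUGIN_DIR=$(mktemp -d)
touch "${PLUGIN_DIR}/check_linux_process_limits.sh"

for _SCRIPT in "${PLUGIN_DIR}"/check_*.sh
do
    [ -f "${_SCRIPT}" ] || continue         # glob did not match anything
    _LINK="${_SCRIPT%.sh}"                  # strip the .sh extension
    # use a relative target so the directory can be moved as a whole
    [ -h "${_LINK}" ] || ln -s "${_SCRIPT##*/}" "${_LINK}"
done

ls -l "${PLUGIN_DIR}"
```

With the link in place, a lookup via FPATH for the function name resolves to the script that carries the .sh extension.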

2) Naming conventions for scripts: HC plugins should be named using the following pattern: check_<platform>/<customer>_<description>

For example:

  • check_aix_errpt
  • check_linux_fs_mounts
  • check_hpux_ovpa_status

The customer tag may be considered optional. If a plugin is cross-platform, then use ‘all’ as platform indicator.
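
To make the pattern concrete, the sketch below pulls the platform tag out of a plugin name with plain parameter expansion. It assumes the platform tag is the first token after check_, as in the examples above. The helper name _get_hc_platform is hypothetical and is not part of the HC tool.

```shell
# Hypothetical helper (not part of the HC tool): print the platform tag of a
# plugin name, i.e. the token between 'check_' and the next underscore.
function _get_hc_platform
{
    typeset _NAME="$1"
    typeset _REST

    case "${_NAME}" in
        check_*_*)
            _REST="${_NAME#check_}"     # drop the 'check_' prefix
            echo "${_REST%%_*}"         # keep everything up to the next '_'
            return 0
            ;;
        *)
            return 1                    # does not follow the naming pattern
            ;;
    esac
}

_get_hc_platform "check_aix_errpt"          # prints: aix
_get_hc_platform "check_linux_fs_mounts"    # prints: linux
_get_hc_platform "my_script" || echo "my_script does not match the pattern"
```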

3) By default only one global namespace exists in the HC tool (main script & functions). However, variables that are limited to a function scope should be defined via typeset and be prefixed with an underscore (_). Though most KSH variants will not expose these variables outside the function, the underscore convention still allows them to be used safely alongside global variables of the same name (but without the underscore).
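
A minimal sketch of this convention, assuming a shell that scopes typeset variables to functions declared with the function keyword (the names COUNT and demo_scope are made up for the demo):

```shell
COUNT=5                             # global variable

function demo_scope
{
    typeset _COUNT                  # local to this function
    _COUNT=$(( COUNT + 1 ))         # reads the global, writes only the local
    echo "inside : _COUNT=${_COUNT} COUNT=${COUNT}"
}

demo_scope
# the local variable should not be visible here; the global is unchanged
echo "outside: _COUNT=${_COUNT:-<unset>} COUNT=${COUNT}"
```

Even in a shell that would leak the local variable, the underscore keeps it from overwriting the global COUNT.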

HC plugin code

Global variables

The following global variables may be handy to remember or use:

  • $SCRIPT_NAME : self-explanatory
  • $SCRIPT_DIR : self-explanatory
  • $HOST_NAME : self-explanatory
  • $OS_NAME : self-explanatory
  • $LOG_DIR : location of the log & state files
  • $STATE_DIR : location of the state files (these contain information that needs to be kept between consecutive runs of the same health checks)
  • $STATE_PERM_DIR : location of the permanent state files
  • $STATE_TEMP_DIR : location of the temporary state files
  • $HC_STDOUT_LOG : location of the event log file for STDOUT for the HC plugin
  • $HC_STDERR_LOG : location of the event log file for STDERR for the HC plugin
  • $ARG_DEBUG : debug flag (0 is off)
  • $ARG_DEBUG_LEVEL : debug level (0,1,2)
  • $IS_PDKSH : flag for pdksh/mksh
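
As an illustration of how a plugin might use these variables, the sketch below stores a value in a permanent state file so that a later run can pick it up. The stand-in assignments exist only so the snippet runs outside the framework, and the state file name is an assumption, not a framework convention.

```shell
# Stand-in values so the sketch runs on its own; inside a plugin these
# variables are already set by the main script.
HOST_NAME=$(uname -n)
STATE_PERM_DIR=$(mktemp -d)
ARG_DEBUG=0

# hypothetical state file name (not prescribed by the HC tool)
_STATE_FILE="${STATE_PERM_DIR}/check_demo.${HOST_NAME}.state"

# keep a value around for the next run of the same health check
echo "last_run=$(date '+%Y%m%d')" > "${_STATE_FILE}"

# only report the location when debugging was requested (0 is off)
if (( ARG_DEBUG != 0 ))
then
    echo "state written to ${_STATE_FILE}"
fi
```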

Command-line options/parameters

The following variables hold the command-line options/parameters of check_health.sh:

typeset ARG_ACTION=0            # HC action flag
typeset ARG_CHECK_HOST=0        # host check is off by default
typeset ARG_CONFIG_FILE=""      # custom configuration file for a HC, none by default
typeset ARG_DEBUG=0             # debug is off by default
typeset ARG_DEBUG_LEVEL=0       # debug() only by default
typeset ARG_DETAIL=0            # for --report
typeset ARG_DISPLAY=""          # display is STDOUT by default
typeset ARG_FAIL_ID=""
typeset ARG_FLIP_RC=0           # swapping EXIT RC is off by default
typeset ARG_HC=""
typeset ARG_HC_ARGS=""          # no extra arguments to HC plug-in by default
typeset ARG_HISTORY=0           # include historical events is off by default
typeset ARG_LAST=0              # report last events
typeset ARG_LIST=""             # list all by default
typeset ARG_LOCK=1              # lock for concurrent script executions is on by default
typeset ARG_LOG=1               # logging is on by default
typeset ARG_LOG_HEALTHY=0       # logging of healthy health checks is off by default
typeset ARG_MONITOR=1           # killing long running HC processes is on by default
typeset ARG_NEWER=""
typeset ARG_NOTIFY=""           # notification of problems is off by default
typeset ARG_NO_FIX=0            # fix/healing is not disabled by default
typeset ARG_OLDER=""
typeset ARG_REVERSE=0           # show report in reverse date order is off by default
typeset ARG_REPORT=""           # report of HC events is off by default
typeset ARG_TIME_OUT=0          # custom timeout is off by default
typeset ARG_TERSE=0             # show terse help is off by default
typeset ARG_TODAY=0             # report today's events
typeset ARG_VERBOSE=1           # STDOUT is on by default
typeset ARG_WITH_RC=""

Any HC plugin/function should start with lines like these, for example:

# -----------------------------------------------------------------------------
function check_hpux_drd_status
{
# ------------------------- CONFIGURATION starts here -------------------------
typeset _CONFIG_FILE="${CONFIG_DIR}/$0.conf"
typeset _VERSION="2018-05-18"                           # YYYY-MM-DD
typeset _SUPPORTED_PLATFORMS="HP-UX"                    # uname -s match
# ------------------------- CONFIGURATION ends here ---------------------------

# set defaults
(( ARG_DEBUG != 0 && ARG_DEBUG_LEVEL > 0 )) && set "${DEBUG_OPTS}"
init_hc "$0" "${_SUPPORTED_PLATFORMS}" "${_VERSION}"

# handle arguments (originally comma-separated)
for _ARG in ${_ARGS}
do
    case "${_ARG}" in
        help)
            _show_usage "$0" "${_VERSION}" "${_CONFIG_FILE}" && return 0
            ;;
    esac
done

# define local variables here

# handle configuration file
[[ -n "${ARG_CONFIG_FILE}" ]] && _CONFIG_FILE="${ARG_CONFIG_FILE}"
if [[ ! -r ${_CONFIG_FILE} ]]
then
    warn "unable to read configuration file at ${_CONFIG_FILE}"
    return 1
fi

# code for reading required configuration values
...

The _CONFIG_FILE directive should only be specified if the HC plugin requires a configuration file.

Reading values from the plugin configuration file

Use constructs with the data_get_lvalue_from_config() function if possible, e.g.:

_CLONE_MAX_AGE=$(_CONFIG_FILE="${_CONFIG_FILE}" data_get_lvalue_from_config 'clone_age')

Syntax of the data_get_lvalue_from_config() function:

# -----------------------------------------------------------------------------
# @(#) FUNCTION: data_get_lvalue_from_config()
# DOES: get an lvalue from the configuration file
# EXPECTS: parameter to look for [string]
# OUTPUTS: parameter value [string]
# RETURNS: 0=found; 1=not found
# REQUIRES: n/a
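
Because the function returns 1 when a parameter is not found, a plugin can fall back to a default value. The sketch below shows one way to do that. The data_get_lvalue_from_config defined here is only a simplified stand-in so the snippet runs on its own; the real function in the HC sources may parse the file differently. The parameter retries and both default values are made up.

```shell
# STAND-IN for the framework function (assumption: plain key=value lines).
function data_get_lvalue_from_config
{
    typeset _PARAM="$1"
    typeset _VALUE

    _VALUE=$(grep "^${_PARAM}=" "${_CONFIG_FILE}" 2>/dev/null | head -n 1 | cut -f2- -d'=')
    [[ -n "${_VALUE}" ]] || return 1    # 1=not found
    echo "${_VALUE}"
    return 0                            # 0=found
}

# demo configuration file
_CONFIG_FILE=$(mktemp)
echo "clone_age=30" > "${_CONFIG_FILE}"

# parameter present: value comes from the file
_CLONE_MAX_AGE=$(_CONFIG_FILE="${_CONFIG_FILE}" data_get_lvalue_from_config 'clone_age') ||
    _CLONE_MAX_AGE=14                   # hypothetical default

# parameter absent: fall back to the default
_RETRIES=$(_CONFIG_FILE="${_CONFIG_FILE}" data_get_lvalue_from_config 'retries') ||
    _RETRIES=3                          # hypothetical default

echo "clone_age=${_CLONE_MAX_AGE} retries=${_RETRIES}"
```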

Logging events

Feeding events back to the HC (main) script to allow logging, alerting or reporting should be done via the log_hc() function, e.g.:

log_hc "$0" 1 "clone has not yet been created"

log_hc "$0" ${_STC} "${_MSG}" "${_CHECK_VALUE}" "${_MAX_KCUSAGE}"

Syntax of the log_hc() function:

# -----------------------------------------------------------------------------
# @(#) FUNCTION: log_hc()
# DOES: log a HC plugin result
# EXPECTS: 1=HC name [string], 2=HC status code [integer], 3=HC message [string],
#          4=HC found value [string] (optional),
#          5=HC expected value [string] (optional)
# RETURNS: 0
# REQUIRES: n/a
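
The sketch below shows how a status code and message might be derived before calling log_hc(). The log_hc defined here is a stand-in that only prints its arguments, so the snippet runs outside the framework; the plugin name, the threshold and the messages are invented for the demo. Note that "$0" expands to the function name in KSH functions declared with the function keyword, while other shells may expand it to the script name.

```shell
# STAND-IN for log_hc(): the real function hands the event to the main
# script; this one just prints what it received.
function log_hc
{
    echo "HC=$1 STC=$2 MSG=$3 FOUND=${4:-} EXPECTED=${5:-}"
    return 0
}

function check_demo_usage
{
    typeset _MAX_USAGE=80               # hypothetical threshold (%)
    typeset _CHECK_VALUE="$1"           # measured value (%)
    typeset _STC=0                      # 0=healthy
    typeset _MSG="usage is within limits"

    if (( _CHECK_VALUE > _MAX_USAGE ))
    then
        _STC=1                          # >0=failed health check
        _MSG="usage exceeds threshold"
    fi

    log_hc "$0" ${_STC} "${_MSG}" "${_CHECK_VALUE}" "${_MAX_USAGE}"
    return 0                            # the check itself ran without error
}

check_demo_usage 42
check_demo_usage 95
```

The function returns 0 in both cases: a failed health check is reported through log_hc(), not through the return code.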

Log healthy

Insert this into your plugin code (or copy/paste from an existing plugin):

When using a configuration file for your plugin

...
# handle arguments (originally comma-separated)
for _ARG in ${_ARGS}
do
    case "${_ARG}" in
        help)
            _show_usage "$0" "${_VERSION}" "${_CONFIG_FILE}" && return 0
            ;;
    esac
done
...
# handle configuration file
[[ -n "${ARG_CONFIG_FILE}" ]] && _CONFIG_FILE="${ARG_CONFIG_FILE}"
if [[ ! -r ${_CONFIG_FILE} ]]
then
    warn "unable to read configuration file at ${_CONFIG_FILE}"
    return 1
fi
...
# read required configuration values
_CFG_HEALTHY=$(_CONFIG_FILE="${_CONFIG_FILE}" data_get_lvalue_from_config 'log_healthy')
case "${_CFG_HEALTHY}" in
    yes|YES|Yes)
        _LOG_HEALTHY=1
        ;;
    *)
        # do not override hc_arg
        (( _LOG_HEALTHY > 0 )) || _LOG_HEALTHY=0
        ;;
esac

# log_healthy
(( ARG_LOG_HEALTHY > 0 )) && _LOG_HEALTHY=1
if (( _LOG_HEALTHY > 0 ))
then
    if (( ARG_LOG > 0 ))
    then
        log "logging/showing passed health checks"
    else
        log "showing passed health checks (but not logging)"
    fi
else
    log "not logging/showing passed health checks"
fi
...

When NOT using a configuration file for your plugin

...
# handle arguments (originally comma-separated)
for _ARG in ${_ARGS}
do
    case "${_ARG}" in
        help)
            _show_usage "$0" "${_VERSION}" "${_CONFIG_FILE}" && return 0
            ;;
    esac
done

# log_healthy
(( ARG_LOG_HEALTHY > 0 )) && _LOG_HEALTHY=1
if (( _LOG_HEALTHY > 0 ))
then
    if (( ARG_LOG > 0 ))
    then
        log "logging/showing passed health checks"
    else
        log "showing passed health checks (but not logging)"
    fi
else
    log "not logging/showing passed health checks"
fi
...

Incorporating fix/healing logic

The core idea of the Health Checker framework is to always work in read-only mode, i.e. limiting its functionality to only checking or verifying application/system status. However, at times it may be appropriate to add logic to plugin code that can also fix (auto-heal) problems. For this reason the --no-fix command-line parameter, the global variable ARG_NO_FIX and the _HC_CAN_FIX plugin variable were added (as of release 20190629).

If your code contains auto-healing logic then set:

typeset _HC_CAN_FIX=1 # plugin has fix/healing logic?

so that the --list option will show that the plugin has this ability. The value of ARG_NO_FIX can be used to control whether the script should actually do the healing or not (dry-run mode). Running with the --no-log option also implies that ARG_NO_FIX is set.
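
A minimal sketch of how healing logic could be gated on ARG_NO_FIX. The plugin name, the "missing file" check and the fix itself are invented for the demo; only _HC_CAN_FIX and ARG_NO_FIX come from the framework.

```shell
ARG_NO_FIX=0                            # stand-in; set by the main script

function check_demo_file
{
    typeset _HC_CAN_FIX=1               # plugin has fix/healing logic
    typeset _TARGET="$1"                # hypothetical file that must exist

    [[ -e "${_TARGET}" ]] && return 0   # healthy, nothing to fix

    if (( ARG_NO_FIX > 0 ))
    then
        # dry-run: report what would be done, change nothing
        echo "not fixing: ${_TARGET} is missing"
    else
        : > "${_TARGET}" && echo "fixed: created ${_TARGET}"
    fi
    return 0
}

_DEMO_DIR=$(mktemp -d)
ARG_NO_FIX=1; check_demo_file "${_DEMO_DIR}/flag"   # reports only
ARG_NO_FIX=0; check_demo_file "${_DEMO_DIR}/flag"   # creates the file
```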

Show usage (and other internal plugin functions)

If you wish to write functions (or subroutines) that are specific to a certain plugin, you can add them as separate functions in the same plugin file. The names of these functions should start with an underscore to indicate that they are plugin-internal routines (for example: _show_usage).

Do add a _show_usage function to each HC plugin with a short description of what the plugin does. Example:

function _show_usage
{
cat <<- EOT
NAME        : $1
VERSION     : $2
CONFIG      : $3 with parameters:
                  max_kcusage=<threshold_%>
                  exclude_params=<list_of_excluded_parameters>
               and formatted stanzas:
                  param:<my_param1>:<my_param1_threshold_%>
                  exclude_params=<list_of_excluded_parameters>
PURPOSE     : Checks the current usage of kernel parameter resources
LOG HEALTHY : Supported

EOT

return 0
}

About exit/return codes

Some conventions:

  • die() : custom function in HC framework
  • exit() : refers to the standard shell built-in command
  • return() : refers to the standard shell built-in command

Following rules apply:

  • When the main script and the plugin code do not encounter a flagged (script) error OR failed health check: RC = 0

  • When the main script encounters a flagged (script) error and consequently ends via die() or exit(): RC = 1 or value of $EXIT_CODE

  • When the main script ends without (script) error AND the plugin code encounters a flagged (script) error and consequently ends via return(): RC = value of return()

  • When both main script and plugin code end without flagged (script) error AND the plugin code encounters a failed health check: RC = 0

  • When both main script and plugin code end without flagged (script) error AND the plugin code encounters a failed health check AND option --flip-rc is used:
    • when --with-rc == Count or not used: RC = count of STC>0 (max 255)
    • when --with-rc == Max or not used: RC = Max of all STC values (max 255)
    • when --with-rc == Sum or not used: RC = Sum of all STC values (max 255)

  • When the option --no-log is used: RC = 0
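
The three ways of combining STC values under --flip-rc can be sketched as follows. This is an illustration of the arithmetic only, not the code used by check_health.sh; the helper name _calc_rc and the lower-case mode names are assumptions.

```shell
# Hypothetical helper: combine a list of STC values into one exit code.
# Usage: _calc_rc <count|max|sum> <stc> [<stc> ...]
function _calc_rc
{
    typeset _MODE="$1"
    shift
    typeset _STC
    typeset _RC=0

    for _STC in "$@"
    do
        case "${_MODE}" in
            count) (( _STC > 0 )) && _RC=$(( _RC + 1 )) ;;
            max)   (( _STC > _RC )) && _RC=${_STC} ;;
            sum)   _RC=$(( _RC + _STC )) ;;
        esac
    done

    (( _RC > 255 )) && _RC=255          # an exit code cannot exceed 255
    echo "${_RC}"
}

_calc_rc count 0 1 3 0                  # prints: 2
_calc_rc max   0 1 3 0                  # prints: 3
_calc_rc sum   0 1 3 0                  # prints: 4
_calc_rc sum   200 100                  # prints: 255 (capped)
```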

Also note that:

  • plugins should never use die() or exit() but always use log_hc() or return()
  • die() should only be used in the main script and/or core plugins
  • exit() should only be used in the main script
  • do not rely on the RC when running the health checker with multiple plugins enabled. The RC will then only depend on the processing of the last called plugin. There is no good or graceful way to deal with RCs from multiple HC plugins.

Useful (global) auxiliary functions

You can look in the source code of the include_data.sh & include_os.sh scripts to find useful helper functions that can be used anywhere in the health checker code.
