Monitoring and restarting a Java program with nagios



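All posts about nagios are collected under /tag/nagios; see there for the rest.

This post looks at monitoring a Java program with nagios so that, when the program is down, an alert is raised and the program is restarted. The target is tomcat, which goes down fairly often. There appear to be other ways to restart tomcat, so nothing requires nagios to do it.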

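Configuration files and behavior

Installation and the overall behavior of nagios were covered in the post "Installing nagios and a minimal configuration", but here is a recap.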

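The configuration files live in the /etc/nagios directory:

  • nagios.cfg configures how nagios itself behaves, and it includes the other files.
  • contacts.cfg defines the mail addresses to notify when something goes wrong.
  • localhost.cfg holds the definitions for monitoring the local host. To monitor an external host, create a similar file and include it from nagios.cfg.
  • commands.cfg defines the commands used by localhost.cfg and the like. Most of the commands called from here are provided as plugins, which must be installed separately from nagios.

After changing a configuration file, check the configuration and then restart.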
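To see which files nagios.cfg actually includes on a given machine, something like the following can help. It is a minimal sketch: list_includes is a hypothetical helper name, and it assumes the includes are written with the cfg_file= and cfg_dir= directives of nagios.cfg. The paths it prints depend on how nagios was installed.

```shell
#!/bin/sh
# Print every object file and directory that a nagios main config pulls in.
# list_includes is a hypothetical helper, not part of nagios.
list_includes() {
    grep -E '^[[:space:]]*cfg_(file|dir)=' "$1" | sed 's/^[[:space:]]*//'
}

# Only run against the real file when it is readable on this machine.
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios/nagios.cfg}"
if [ -r "$NAGIOS_CFG" ]; then
    list_includes "$NAGIOS_CFG"
fi
```

Commented-out lines are skipped because the pattern requires the directive at the start of the line.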

nagios -v /etc/nagios/nagios.cfg
service nagios restart
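The two commands can be chained so that a broken configuration never reaches the restart. The following is a sketch under assumptions: check_and_restart is a hypothetical function name, and NAGIOS_BIN and NAGIOS_CFG are illustrative variables added only so the paths can be overridden.

```shell
#!/bin/sh
# Restart nagios only when the configuration check succeeds.
# NAGIOS_BIN and NAGIOS_CFG are illustrative, overridable defaults.
NAGIOS_BIN="${NAGIOS_BIN:-nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios/nagios.cfg}"

check_and_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null; then
        service nagios restart
    else
        echo "configuration check failed; nagios not restarted" >&2
        return 1
    fi
}
```

Calling check_and_restart as root then behaves like the two commands above, except that a failed check leaves the running nagios untouched.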

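Command arguments

localhost.cfg holds the command invocations, commands.cfg holds the command definitions, and the commands themselves are plugins. The question is how to invoke them.

The arguments a plugin accepts can be listed by running it with --help.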

# /usr/lib64/nagios/plugins/check_http --help
check_http v2.2.1 (nagios-plugins 2.2.1)
Copyright (c) 1999 Ethan Galstad <nagios@nagios.org>
Copyright (c) 1999-2014 Nagios Plugin Development Team
    <devel@nagios-plugins.org>

This plugin tests the HTTP service on the specified host. It can test
normal (http) and secure (https) servers, follow redirects, search for
strings and regular expressions, check connection times, and report on
certificate expiration times.


Usage:
 check_http -H <vhost> | -I <IP-address> [-u <uri>] [-p <port>]
       [-J <client certificate file>] [-K <private key>]
       [-w <warn time>] [-c <critical time>] [-t <timeout>] [-L] [-E] [-a auth]
       [-b proxy_auth] [-f <ok|warning|critcal|follow|sticky|stickyport>]
       [-e <expect>] [-d string] [-s string] [-l] [-r <regex> | -R <case-insensitive regex>]
       [-P string] [-m <min_pg_size>:<max_pg_size>] [-4|-6] [-N] [-M <age>]
       [-A string] [-k string] [-S <version>] [--sni] [-C <warn_age>[,<crit_age>]]
       [-T <content-type>] [-j method]
NOTE: One or both of -H and -I must be specified

Options:
 -h, --help
    Print detailed help screen
 -V, --version
    Print version information
 --extra-opts=[section][@file]
    Read options from an ini file. See
    https://www.nagios-plugins.org/doc/extra-opts.html
    for usage and examples.
 -H, --hostname=ADDRESS
    Host name argument for servers using host headers (virtual host)
    Append a port to include it in the header (eg: example.com:5000)
 -I, --IP-address=ADDRESS
    IP address or name (use numeric address if possible to bypass DNS lookup).
 -p, --port=INTEGER
    Port number (default: 80)
 -4, --use-ipv4
    Use IPv4 connection
 -6, --use-ipv6
    Use IPv6 connection
 -S, --ssl=VERSION[+]
    Connect via SSL. Port defaults to 443. VERSION is optional, and prevents
    auto-negotiation (2 = SSLv2, 3 = SSLv3, 1 = TLSv1, 1.1 = TLSv1.1,
    1.2 = TLSv1.2). With a '+' suffix, newer versions are also accepted.
 --sni
    Enable SSL/TLS hostname extension support (SNI)
 -C, --certificate=INTEGER[,INTEGER]
    Minimum number of days a certificate has to be valid. Port defaults to 443
    (when this option is used the URL is not checked.)
 -J, --client-cert=FILE
   Name of file that contains the client certificate (PEM format)
   to be used in establishing the SSL session
 -K, --private-key=FILE
   Name of file containing the private key (PEM format)
   matching the client certificate
 -e, --expect=STRING
    Comma-delimited list of strings, at least one of them is expected in
    the first (status) line of the server response (default: HTTP/1.)
    If specified skips all other status line logic (ex: 3xx, 4xx, 5xx processing)
 -d, --header-string=STRING
    String to expect in the response headers
 -s, --string=STRING
    String to expect in the content
 -u, --uri=PATH
    URI to GET or POST (default: /)
 --url=PATH
    (deprecated) URL to GET or POST (default: /)
 -P, --post=STRING
    URL encoded http POST data
 -j, --method=STRING  (for example: HEAD, OPTIONS, TRACE, PUT, DELETE, CONNECT)
    Set HTTP method.
 -N, --no-body
    Don't wait for document body: stop reading after headers.
    (Note that this still does an HTTP GET or POST, not a HEAD.)
 -M, --max-age=SECONDS
    Warn if document is more than SECONDS old. the number can also be of
    the form "10m" for minutes, "10h" for hours, or "10d" for days.
 -T, --content-type=STRING
    specify Content-Type header media type when POSTing

 -l, --linespan
    Allow regex to span newlines (must precede -r or -R)
 -r, --regex, --ereg=STRING
    Search page for regex STRING
 -R, --eregi=STRING
    Search page for case-insensitive regex STRING
 --invert-regex
    Return CRITICAL if found, OK if not

 -a, --authorization=AUTH_PAIR
    Username:password on sites with basic authentication
 -b, --proxy-authorization=AUTH_PAIR
    Username:password on proxy-servers with basic authentication
 -A, --useragent=STRING
    String to be sent in http header as "User Agent"
 -k, --header=STRING
    Any other tags to be sent in http header. Use multiple times for additional headers
 -E, --extended-perfdata
    Print additional performance data
 -L, --link
    Wrap output in HTML link (obsoleted by urlize)
 -f, --onredirect=<ok|warning|critical|follow|sticky|stickyport>
    How to handle redirected pages. sticky is like follow but stick to the
    specified IP address. stickyport also ensures port stays the same.
 -m, --pagesize=INTEGER<:INTEGER>
    Minimum page size required (bytes) : Maximum page size required (bytes)
 -w, --warning=DOUBLE
    Response time to result in warning status (seconds)
 -c, --critical=DOUBLE
    Response time to result in critical status (seconds)
 -t, --timeout=INTEGER:<timeout state>
    Seconds before connection times out (default: 10)
    Optional ":<timeout state>" can be a state integer (0,1,2,3) or a state STRING
 -v, --verbose
    Show details for command-line debugging (Nagios may truncate output)

Notes:
 This plugin will attempt to open an HTTP connection with the host.
 Successful connects return STATE_OK, refusals and timeouts return STATE_CRITICAL
 other errors return STATE_UNKNOWN.  Successful connects, but incorrect reponse
 messages from the host result in STATE_WARNING return values.  If you are
 checking a virtual server that uses 'host headers' you must supply the FQDN
 (fully qualified domain name) as the [host_name] argument.
 You may also need to give a FQDN or IP address using -I (or --IP-Address).

 This plugin can also check whether an SSL enabled web server is able to
 serve content (optionally within a specified time) or whether the X509 
 certificate is still valid for the specified number of days.

 Please note that this plugin does not check if the presented server
 certificate matches the hostname of the server, or if the certificate
 has a valid chain of trust to one of the locally installed CAs.

Examples:
 CHECK CONTENT: check_http -w 5 -c 10 --ssl -H www.verisign.com

 When the 'www.verisign.com' server returns its content within 5 seconds,
 a STATE_OK will be returned. When the server returns its content but exceeds
 the 5-second threshold, a STATE_WARNING will be returned. When an error occurs,
 a STATE_CRITICAL will be returned.

 CHECK CERTIFICATE: check_http -H www.verisign.com -C 14

 When the certificate of 'www.verisign.com' is valid for more than 14 days,
 a STATE_OK is returned. When the certificate is still valid, but for less than
 14 days, a STATE_WARNING is returned. A STATE_CRITICAL will be returned when
 the certificate is expired.


 CHECK CERTIFICATE: check_http -H www.verisign.com -C 30,14

 When the certificate of 'www.verisign.com' is valid for more than 30 days,
 a STATE_OK is returned. When the certificate is still valid, but for less than
 30 days, but more than 14 days, a STATE_WARNING is returned.
 A STATE_CRITICAL will be returned when certificate expires in less than 14 days
 CHECK SSL WEBSERVER CONTENT VIA PROXY USING HTTP 1.1 CONNECT: 

 check_http -I 192.168.100.35 -p 80 -u https://www.verisign.com/ -S -j CONNECT -H www.verisign.com 
 all these options are needed: -I <proxy> -p <proxy-port> -u <check-url> -S(sl) -j CONNECT -H <webserver>
 a STATE_OK will be returned. When the server returns its content but exceeds
 the 5-second threshold, a STATE_WARNING will be returned. When an error occurs,
 a STATE_CRITICAL will be returned.

Send email to help@nagios-plugins.org if you have questions regarding use
of this software. To submit patches or suggest improvements, send email to
devel@nagios-plugins.org
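Before wiring a plugin into the configuration, it can be run by hand. Plugins report their result through the exit status: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. The sketch below uses two hypothetical helpers, state_name and run_check, to show the state nagios would see.

```shell
#!/bin/sh
# Map a plugin exit status to the nagios state name.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Run a plugin command line by hand, then print the resulting state.
run_check() {
    "$@"
    state_name "$?"
}

# Example (plugin path as in the help output above, 8080 being the tomcat port):
# run_check /usr/lib64/nagios/plugins/check_http -I 127.0.0.1 -p 8080
```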

Monitoring the tomcat port

To monitor both apache and tomcat, change commands.cfg. The stock definition is:

# 'check_http' command definition
define command{
        command_name    check_http
        command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
        }

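At first glance $ARG1$ looks unused, but it receives whatever follows the "!" in a service's check_command. The stock localhost.cfg passes nothing there, so it expands to an empty string. Change the definition so that $ARG1$ carries the port number: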

# 'check_http' command definition
define command{
        command_name    check_http
        command_line    $USER1$/check_http -I $HOSTADDRESS$ -p $ARG1$
        }

In localhost.cfg, change the original HTTP service and add a TOMCAT service.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             HTTP
        check_command                   check_http!80
        notifications_enabled           1
        }

define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             TOMCAT
        check_command                   check_http!8080
        notifications_enabled           1
        }
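How check_http!8080 turns into a command line can be illustrated with a small simulation. This is only a sketch of the substitution for the single-argument case, not how nagios implements it: expand_check_command is a hypothetical function, the plugin directory is the one from the help output above (in nagios it comes from $USER1$, which is set in resource.cfg), and 127.0.0.1 stands in for $HOSTADDRESS$.

```shell
#!/bin/sh
# Simulate the macro substitution for a check_command with exactly one argument.
# $1 = check_command value, $2 = host address, $3 = plugin directory ($USER1$)
expand_check_command() {
    name=${1%%!*}    # part before the first "!": the command name
    arg1=${1#*!}     # part after the first "!": becomes $ARG1$
    echo "$3/$name -I $2 -p $arg1"
}

expand_check_command 'check_http!8080' 127.0.0.1 /usr/lib64/nagios/plugins
```

The printed line is what the modified check_http definition would run for the TOMCAT service, which makes it easy to try the same command by hand.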

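After nagios is restarted and some time has passed, both HTTP and TOMCAT appear in the service status display.

Event handlers

Event handlers are the mechanism for bringing a service back when it has gone down. More generally, they seem to be commands that run whenever some state change occurs.

The nagios documentation on event handlers explains the details.

The setup goes as follows.

In localhost.cfg, add an event_handler line to the TOMCAT service: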

define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             TOMCAT
        check_command                   check_http!8080
        notifications_enabled           1
        event_handler                   restart_tomcat
        }

Add the following to commands.cfg:

define command{
    command_name    restart_tomcat
    command_line    /foo/bar/nagios-restart-tomcat  $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
    }

Then write a shell script named /foo/bar/nagios-restart-tomcat:

#!/bin/sh
case "$1" in
CRITICAL)
    case "$2" in
    HARD)
        # restart tomcat here
        ;;
    esac
    ;;
esac
exit 0
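A slightly fuller version of the handler might look like the sketch below. It rests on assumptions: "service tomcat restart" is only a guess at the restart command (the service name differs between installs), RESTART_CMD and handle_event are hypothetical names, and the nagios user must be allowed to run the restart command, which usually needs extra setup such as sudo.

```shell
#!/bin/sh
# Event handler sketch: restart tomcat when the service is in a HARD CRITICAL state.
# $1 = $SERVICESTATE$, $2 = $SERVICESTATETYPE$, $3 = $SERVICEATTEMPT$
# RESTART_CMD is an assumption; adjust it to the local tomcat service name.
RESTART_CMD="${RESTART_CMD:-service tomcat restart}"

handle_event() {
    case "$1" in
    CRITICAL)
        case "$2" in
        HARD)
            # All retries failed, so the outage is confirmed: restart.
            echo "tomcat is CRITICAL (attempt $3), restarting"
            $RESTART_CMD
            ;;
        esac
        ;;
    esac
    # Always report success, as the original script does.
    return 0
}

handle_event "$@"
```

Acting only on HARD states means a single failed check does not trigger a restart; nagios first retries up to the service's max_check_attempts.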