check_multi feeds passive checks

Large environments are one of the major places to run check_multi in order to

On the other hand the check_multi has clear disadvantages (There can be only one) when

This was the basic motivation for the improved implementation with

Pros

Cons

Basic design

How does it work?

  1. check_multi acts as a normal active Nagios check and collects checks from a remote host.
  2. Each child check has a corresponding passive check in Nagios with the same name.
  3. check_multi takes each child check's output and return code (RC) and feeds them into the corresponding passive Nagios check.

That's all.
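
To picture step 3, here is a minimal sketch that formats one child result as PROCESS_SERVICE_CHECK_RESULT, the generic Nagios external command for passive service results. It only illustrates the idea: check_multi has its own feeding mechanism, which may work differently, and the helper name, host name and timestamp below are made up.

```shell
# Format one passive service result as a Nagios external command line.
# Arguments: timestamp host service return_code plugin_output
# (format_passive_result is a hypothetical helper, not part of check_multi)
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example: child check "system_load" on the made-up host "web01" returned RC 0
format_passive_result 1700000000 web01 system_load 0 'OK - load average: 0.10, 0.20, 0.30'
```

The service name in such a line has to match the passive service name exactly, which is why child check and passive check share the same name.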

Implementation details

There is one design problem when a single collector check executes multiple remote checks and then has to return their results to the passive side of Nagios: the transport.

The solution is to use check_multi twice in a command chain:

  1. check_multi on the remote hosts gathers data.
  2. check_multi on the Nagios server feeds passive services.

The first check_multi passes its results via XML to the second one.

Note: the whole chain is started on the Nagios server. When monitoring DMZ hosts, no inbound connections are needed.

Remote examples

  1. SSH:
    check_by_ssh -H <hostname> -c '/path/to/check_multi -f multi.cmd -r 256' | check_multi -r 8192+8+1
    
  2. NRPE:
    check_nrpe -H <hostname> -c check_multi -a '-f multi.cmd -r 256' | check_multi -r 8192+8+1
    
  3. NSCA:
    check_nrpe -H <hostname> -c check_multi -a '-f multi.cmd -r 4096+8+1'


    This method needs a running nsca daemon on the Nagios server. It uses inbound connections and is therefore not recommended for DMZ setups.
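
The -r values in the remote examples are sums of powers of two, one bit per report option. In this document, 256 appears on the sending side (XML), 8192 on the feeding side, 4096 in the NSCA variant and 2048 for service definitions. The following sketch only does the arithmetic and splits a value into its bits; what each bit means is defined by check_multi, and report_bits is a hypothetical helper name.

```shell
# Split a check_multi report value into its power-of-two components.
# (report_bits is a hypothetical helper for illustration.)
report_bits() {
    value=$(( $1 ))          # accepts "8201" as well as "8192+8+1"
    bit=1
    out=
    while [ "$bit" -le "$value" ]; do
        if [ $(( value & bit )) -ne 0 ]; then
            out="$bit${out:+ $out}"   # prepend, so the largest bit comes first
        fi
        bit=$(( bit * 2 ))
    done
    echo "$out"
}

report_bits 8192+8+1   # prints: 8192 8 1
report_bits 4105       # prints: 4096 8 1
```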

For the curious: example installation

This example installation is part of the sample-config directory in the check_multi package.
Note: it is a single-machine setup; no remote access is included in the configuration. For a basic understanding of the principle this does not matter anyway ;-)

Cooking list:

  1. Download check_multi (latest SVN version).
  2. ./configure; make all
  3. cd sample-config/feed_passive
  4. Install the feed_passive example files with
    make install-config

    This will add a directory /path/to/nagios/etc/check_multi/feed_passive.

  5. Add the feed_passive subdirectory as cfg_dir to nagios.cfg:
    cfg_dir=/usr/local/nagios/etc/check_multi/feed_passive
  6. Reload or restart Nagios: et voilà :-P

one of the example hosts

Installation

Prerequisites on the Nagios server

  1. mandatory - perl module XML::Simple
    Install XML::Simple on the Nagios server, either from your Linux distribution or directly from CPAN. It is only needed on the receiving side (the Nagios server); the senders (remote clients) do not need XML::Simple.
  2. optional - nagios.cfg settings
    I recommend setting some attributes for performance tuning and to avoid unnecessary logging:

    setting                           comment
    child_processes_fork_twice=0      speeds up Nagios, one fork is enough
    free_child_process_memory=0       Linux can free memory much faster than Nagios
    log_initial_states=0              otherwise each day's log contains one unnecessary line per service
    log_passive_checks=0              saves lots of space in nagios.log
    use_large_installation_tweaks=1   another performance boost (e.g. no summary macros)

None of these attributes is mandatory, but they will speed up your infrastructure in large setups.
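
To see quickly which of these settings are not yet in place, something like the following sketch may help. It assumes that each setting sits on a line of its own, written exactly as in the list above (no spaces around the equals sign); check_tuning is a hypothetical helper name.

```shell
# Report which of the recommended settings are missing from a nagios.cfg.
# (check_tuning is a hypothetical helper; pass the path to your nagios.cfg.)
check_tuning() {
    cfg=$1
    for setting in \
        child_processes_fork_twice=0 \
        free_child_process_memory=0 \
        log_initial_states=0 \
        log_passive_checks=0 \
        use_large_installation_tweaks=1
    do
        # -x: the whole line must match, -F: fixed string, -q: quiet
        grep -qxF "$setting" "$cfg" || echo "missing: $setting"
    done
}
```

Call it with the path to your main configuration file, for example check_tuning /usr/local/nagios/etc/nagios.cfg (the path is an assumption, adjust it to your installation). No output means that all five settings are present.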

check_multi command file

Just an example; your mileage may vary ;)

#--- multi.cmd
command [ system_disk  ] = check_disk -w 5% -c 2% -p /
command [ system_load  ] = check_load -w 10,8,6 -c 20,18,16
command [ system_swap  ] = check_swap -w 90 -c 80
command [ system_users ] = check_users -w 5 -c 10
command [ procs_num    ] = check_procs
command [ procs_cpu    ] = check_procs -w 10 -c 20 --metric=CPU -v
command [ procs_mem    ] = check_procs -w 100000 -c 2000000 --metric=RSS -v
command [ procs_zombie ] = check_procs -w 1 -c 2 -s Z
command [ proc_cron    ] = check_procs -c 1: -C cron
command [ proc_syslogd ] = check_procs -c 1: -C syslogd

#--- avoid redundant states
state   [ WARNING      ] = IGNORE
state   [ CRITICAL     ] = IGNORE
state   [ UNKNOWN      ] = IGNORE
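
Since every child check needs a passive service with the same name, it can be handy to list the names a command file defines. A minimal sketch, assuming the command [ name ] = ... layout shown above; list_child_checks is a hypothetical helper name.

```shell
# Print the child check names defined in a check_multi command file.
# Only "command [ name ] = ..." lines are considered; state lines and
# comments are skipped.
# (list_child_checks is a hypothetical helper for illustration.)
list_child_checks() {
    sed -n 's/^[[:space:]]*command[[:space:]]*\[[[:space:]]*\([^][:space:]]*\)[[:space:]]*].*/\1/p' "$1"
}
```

Running list_child_checks multi.cmd on the file above prints one name per line, from system_disk to proc_syslogd, ready to be compared with the passive services known to Nagios.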

check_multi active service definition

This service runs on the remote host and gathers data:

Passive "fed" service definition

You can easily generate these passive services via check_multi report mode 2048:

check_multi -f multi.cmd -r 2048 -s service_definition_template=/path/to/service_definition.tpl > services_passive.cfg


Hint: create a one-liner that loops over your hosts and generates the service definitions in bulk. Whenever a host is added, rerun the script and reload Nagios to put the new passive services into effect.
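
Such a loop could look roughly like this. It is a sketch under several assumptions: the host list is a plain file with one host name per line, the host name is handed over with -s HOSTNAME=... (whether your template uses it depends on the template), and CHECK_MULTI is a hypothetical variable to override the path to check_multi.

```shell
# Generate one passive service file per host listed in a host file.
# Assumptions: the host file holds one host name per line, and the host name
# is passed to check_multi via "-s HOSTNAME=..." for use in the template.
# CHECK_MULTI may point to the check_multi binary (default: found via PATH).
generate_passive_services() {
    hostfile=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    while IFS= read -r host; do
        # skip empty lines and comments
        case $host in ''|'#'*) continue ;; esac
        "${CHECK_MULTI:-check_multi}" -f multi.cmd -r 2048 \
            -s HOSTNAME="$host" \
            -s service_definition_template=/path/to/service_definition.tpl \
            > "$outdir/$host.cfg" < /dev/null || return 1
    done < "$hostfile"
}
```

A call like generate_passive_services hosts.txt /path/to/output followed by a Nagios reload puts the new services into effect. Standard input of check_multi is redirected from /dev/null so that it cannot swallow the rest of the host list.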

Troubleshooting

Performance benchmarking

Hardware

Nagios configuration

SAR stats

# sar -u
06:00:00 PM       CPU     %user     %nice   %system   %iowait     %idle
06:00:01 PM       all     32.28      0.00     28.37      0.71     38.63
06:10:01 PM       all     31.66      0.00     27.86      1.05     39.43
06:20:31 PM       all     31.60      0.00     28.06      1.25     39.09
06:30:01 PM       all     31.61      0.00     28.40      1.19     38.79
06:40:01 PM       all     31.55      0.00     28.39      1.16     38.90
06:50:01 PM       all     33.68      0.00     28.54      0.92     36.86
# sar -q
06:00:00 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
06:00:01 PM         3       166      1.99      2.52      3.75
06:10:01 PM         3       158      1.91      2.19      2.95
06:20:31 PM         2       155      1.53      1.93      2.44
06:30:01 PM         2       159      2.22      2.11      2.28
06:40:01 PM         2       155      1.76      1.96      2.10
06:50:01 PM         2       165      1.90      2.14      2.14

Nagiostats

Nagios Stats 3.1.2
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 06-23-2009
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /usr/local/nagios/var/status.dat
Status File Age:                        0d 0h 0m 5s
Status File Version:                    3.1.2

Program Running Time:                   0d 1h 40m 5s
Nagios PID:                             4112
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         26001
Services Checked:                       26001
Services Scheduled:                     1001
Services Actively Checked:              1001
Services Passively Checked:             25000
Total Service State Change:             0.000 / 7.760 / 0.327 %
Active Service Latency:                 0.000 / 1.054 / 0.160 sec
Active Service Execution Time:          0.300 / 3.266 / 0.917 sec
Active Service State Change:            3.750 / 7.760 / 4.246 %
Active Services Last 1/5/15/60 min:     187 / 988 / 1001 / 1001
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 7.760 / 0.170 %
Passive Services Last 1/5/15/60 min:    4694 / 24676 / 25000 / 25000
Services Ok/Warn/Unk/Crit:              26001 / 0 / 0 / 0
Services Flapping:                      186
Services In Downtime:                   0

Total Hosts:                            1002
Hosts Checked:                          1002
Hosts Scheduled:                        1002
Hosts Actively Checked:                 1002
Host Passively Checked:                 0
Total Host State Change:                0.000 / 0.000 / 0.000 %
Active Host Latency:                    0.971 / 2.042 / 1.366 sec
Active Host Execution Time:             0.024 / 1.150 / 0.065 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        185 / 935 / 1002 / 1002
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  1002 / 0 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     220 / 968 / 2906
   Scheduled:                           220 / 968 / 2906
   On-demand:                           0 / 0 / 0
   Parallel:                            220 / 968 / 2906
   Serial:                              0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  200 / 1001 / 3003
   Scheduled:                           200 / 1001 / 3003
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 4975 / 25000 / 75000

External Commands Last 1/5/15 min:      0 / 0 / 0