Large environments are one of the major places to run check_multi in order to
On the other hand the check_multi has clear disadvantages (There can be only one) when
This was the basic motivation for the improved implementation with
That's all.
There is a design problem when executing multiple remote checks within one collector check and then return the results into the passive side of Nagios: the transport.
The solution is: use check_multi twice in a command chain:
The first check_multi passes its results via XML to the second one.
Note: the whole chain is started on the Nagios server. In case of DMZ host monitoring no inbound connections are used.
check_by_ssh -H <hostname> -c '/path/to/check_multi -f multi.cmd -r 256' | check_multi -f - -r 8192+8+1
check_nrpe -H <hostname> -c check_multi -a '-f multi.cmd -r 256' | check_multi -f - -r 8192+8+1
check_nrpe -H <hostname> -c check_multi -a '-f multi.cmd -r 4096+8+1'
This method needs a running nsca daemon on Nagios server. Inbound connections are used, therefore this approach is not recommended for DMZ setups.
check_multi easily can be integrated into Sven Nierleins mod_gearman.
mod_gearman is a NEB module which runs checks over a gearman scheduling framework and passes results back to Nagios. This allows thousands of checks pretty efficiently to be run within a single Nagios instance.
A specific client 'send_multi' is part of the mod_gearman package and can be used to feed the particular child check results into the gearman queues.
This client is a small C binary and by far less resource consuming as if you call check_multi itself to pass checks into Nagios.
Call example:
$ check_multi -f multi.cmd -r 256 | \ send_multi --server=<job server> --encryption=no --host="<hostname>" --service="<service>"
Note:
broker_module=/usr/local/share/nagios/mod_gearman.o \ server=localhost \ encryption=no \ eventhandler=no \ hosts=no \ services=no \ hostgroups=does_not_exist
This example installation is part of the sample-config directory in the check_multi package.
Note: it's a setup for one machine, there is no remote access included in the configuration. For the basic understanding of the principle this does not matter anyway
Cooking list:
./configure; make all
cd sample-config/feed_passive
make install-config
this will add a directory /path/to/nagios/etc/check_multi/feed_passive.
cfg_dir=/usr/local/nagios/etc/check_multi/feed_passive
<nagiosdir>/etc/check_multi/feed_passive and run perl gencfg nhosts
then reload nagios.
| setting | comment |
|---|---|
| child_processes_fork_twice=0 | speeds up Nagios, one fork is enough |
| free_child_process_memory=0 | Linux can free memory much faster than Nagios |
| log_initial_states=0 | Otherwise each days log contains one unnecessary line per service |
| log_passive_checks=0 | saves lots of space in the nagios.log |
| use_large_installation_tweaks=1 | another performance boost (e.g. no summary macros |
None of these attributes is mandatory, but it will speed up your infrastructure in large setups.
Just as an example, your mileage may vary ;)
#--- multi.cmd command [ system_disk ] = check_disk -w 5% -c 2% -p / command [ system_load ] = check_load -w 10,8,6 -c 20,18,16 command [ system_swap ] = check_swap -w 90 -c 80 command [ system_users ] = check_users -w 5 -c 10 command [ procs_num ] = check_procs command [ procs_cpu ] = check_procs -w 10 -c 20 --metric=CPU -v command [ procs_mem ] = check_procs -w 100000 -c 2000000 --metric=RSS -v command [ procs_zombie ] = check_procs -w 1 -c 2 -s Z command [ proc_cron ] = check_procs -c 1: -C cron command [ proc_syslogd ] = check_procs -c 1: -C syslogd #--- avoid redundant states state [ WARNING ] = IGNORE state [ CRITICAL ] = IGNORE state [ UNKNOWN ] = IGNORE
This service runs on the remote host and gathers data:
-r 256+4+1:-r 256 as XML output option-r 4 for ERROR output -r 1 for detailed results in the status linedefine service {
service_description multi_feed
host_name host1
check_command check_multi!-f multi_small.cmd -r 256+4+1 -v
event_handler multi_feed_passive
check_interval 5
use local-service
}
passive_checks_enabled 1active_checks_enabled 0define service {
service_description $THIS_NAME$
host_name $HOSTNAME$
passive_checks_enabled 1
active_checks_enabled 0
check_command check_dummy!0 "passive check"
use local-service
}
You can easily generate these passive services via check_multi report mode 2048:
check_multi -f multi.cmd -r 2048 -s service_definition_template=/path/to/service_definition.tpl > services_passive.cfg
Hint: create a oneliner which loops over your hosts and generates bulk service check definitions. Whenever a host is added, you rerun your script and reload Nagios to put the new passive services into effect.
Passive check result was received for service '%s' on host '%s', but the service could not be found!, you have to double check the passive service definitions.cat test.xml | check_multi -f - -r 8192
.
check_multi … | check_multi -f - -f xyz.cmd.(Hint: check_multi -f - reads from STDIN)
Hardware
Nagios configuration
# sar -u 06:00:00 PM CPU %user %nice %system %iowait %idle 06:00:01 PM all 32.28 0.00 28.37 0.71 38.63 06:10:01 PM all 31.66 0.00 27.86 1.05 39.43 06:20:31 PM all 31.60 0.00 28.06 1.25 39.09 06:30:01 PM all 31.61 0.00 28.40 1.19 38.79 06:40:01 PM all 31.55 0.00 28.39 1.16 38.90 06:50:01 PM all 33.68 0.00 28.54 0.92 36.86
# sar -q 06:00:00 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 06:00:01 PM 3 166 1.99 2.52 3.75 06:10:01 PM 3 158 1.91 2.19 2.95 06:20:31 PM 2 155 1.53 1.93 2.44 06:30:01 PM 2 159 2.22 2.11 2.28 06:40:01 PM 2 155 1.76 1.96 2.10 06:50:01 PM 2 165 1.90 2.14 2.14
Nagios Stats 3.1.2 Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org) Last Modified: 06-23-2009 License: GPL CURRENT STATUS DATA ------------------------------------------------------ Status File: /usr/local/nagios/var/status.dat Status File Age: 0d 0h 0m 5s Status File Version: 3.1.2 Program Running Time: 0d 1h 40m 5s Nagios PID: 4112 Used/High/Total Command Buffers: 0 / 0 / 4096 Total Services: 26001 Services Checked: 26001 Services Scheduled: 1001 Services Actively Checked: 1001 Services Passively Checked: 25000 Total Service State Change: 0.000 / 7.760 / 0.327 % Active Service Latency: 0.000 / 1.054 / 0.160 sec Active Service Execution Time: 0.300 / 3.266 / 0.917 sec Active Service State Change: 3.750 / 7.760 / 4.246 % Active Services Last 1/5/15/60 min: 187 / 988 / 1001 / 1001 Passive Service Latency: 0.000 / 0.000 / 0.000 sec Passive Service State Change: 0.000 / 7.760 / 0.170 % Passive Services Last 1/5/15/60 min: 4694 / 24676 / 25000 / 25000 Services Ok/Warn/Unk/Crit: 26001 / 0 / 0 / 0 Services Flapping: 186 Services In Downtime: 0 Total Hosts: 1002 Hosts Checked: 1002 Hosts Scheduled: 1002 Hosts Actively Checked: 1002 Host Passively Checked: 0 Total Host State Change: 0.000 / 0.000 / 0.000 % Active Host Latency: 0.971 / 2.042 / 1.366 sec Active Host Execution Time: 0.024 / 1.150 / 0.065 sec Active Host State Change: 0.000 / 0.000 / 0.000 % Active Hosts Last 1/5/15/60 min: 185 / 935 / 1002 / 1002 Passive Host Latency: 0.000 / 0.000 / 0.000 sec Passive Host State Change: 0.000 / 0.000 / 0.000 % Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0 Hosts Up/Down/Unreach: 1002 / 0 / 0 Hosts Flapping: 0 Hosts In Downtime: 0 Active Host Checks Last 1/5/15 min: 220 / 968 / 2906 Scheduled: 220 / 968 / 2906 On-demand: 0 / 0 / 0 Parallel: 220 / 968 / 2906 Serial: 0 / 0 / 0 Cached: 0 / 0 / 0 Passive Host Checks Last 1/5/15 min: 0 / 0 / 0 Active Service Checks Last 1/5/15 min: 200 / 1001 / 3003 Scheduled: 200 / 1001 / 3003 On-demand: 0 / 0 / 0 Cached: 0 / 0 / 0 Passive Service Checks Last 1/5/15 min: 4975 / 25000 / 75000 External Commands Last 1/5/15 min: 0 / 0 / 0