One day a customer complained that he could not access a local intranet server. I was sure the server was running: Nagios showed green lights everywhere.
But when I connected remotely to the customer's PC, I had to admit that he was right: because of a routing problem there was no connection from his network to the intranet server. Whoops…
That afternoon I thought about ways to cover this situation in our Nagios monitoring. Nagios has no obvious or generic solution for this problem, other than setting up multiple checks from different hosts.
But wait a moment: there is a conceptual problem with this. All these checks are associated with hosts other than the one they really belong to. This confuses the whole process (and the administrator as well). Notifications and escalations are based on the wrong host, and the statistics and SLAs are affected too.
check_multi provides a simple but effective solution for this scenario: distributed monitoring that works as a service associated with the target host.

You need:
You will then:
That's it!
And more: you can do this with one generic check_multi command file. A few parameters control which hosts run the check and which check_command is used to examine the service.

    check_multi -f distributed.cmd \
        -s CHECK_COMMAND="check_tcp -p 80 -H hostx -t 5" \
        -s HOSTS="host1,host2,host3,host4,host5" \
        -s THRESHOLDS="-w 'COUNT(WARNING)>3' -c 'COUNT(CRITICAL)>3'"

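To make the effect of the THRESHOLDS parameter more tangible, here is a minimal shell sketch of the counting idea behind `COUNT(WARNING)>3` and `COUNT(CRITICAL)>3`. It is not check_multi's own code: `aggregate_states` is a hypothetical helper, the limit of 3 is copied from the example call, and the rule that CRITICAL takes precedence over WARNING is an assumption of this sketch.

```shell
#!/bin/sh
# Sketch only: illustrates the counting idea behind
#   -w 'COUNT(WARNING)>3' -c 'COUNT(CRITICAL)>3'
# aggregate_states is a hypothetical helper, not part of check_multi.
# Arguments: one plugin exit code per probing host
#   (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
aggregate_states() {
    warn=0
    crit=0
    for rc in "$@"; do
        case "$rc" in
            1) warn=$((warn + 1)) ;;
            2) crit=$((crit + 1)) ;;
        esac
    done
    # Assumption: the more severe state wins if both limits are exceeded.
    if [ "$crit" -gt 3 ]; then
        echo "CRITICAL"
    elif [ "$warn" -gt 3 ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# One of five probing hosts reports CRITICAL: below the limit.
aggregate_states 0 0 2 0 0   # prints OK
# Four of five probing hosts report CRITICAL: above the limit.
aggregate_states 2 2 2 2 0   # prints CRITICAL
```

With five probing hosts and a limit of 3, the overall service only turns CRITICAL when at least four hosts fail. Lower the numbers in THRESHOLDS if a single unreachable network should already raise an alarm.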
As you can see in the source below, you can also set other parameters from the command line.
But some reasonable defaults are already available:
- TIMEOUT: 2
- REMOTE_CHECK: check_by_ssh -H \$host -t $timeout$ -C
Most of the child checks are designed for parameter validation and visualization:

    #
    # distributed.cmd
    #
    # Matthias Flacke, 21.11.2008
    #
    # calls different remote hosts with parametrized check and returns
    # only critical if (nearly) all hosts return errors
    #
    # Call: check_multi -f distributed.cmd
    #       -s CHECK_COMMAND="check_tcp -p 80 -H hostx -t 5" \
    #       -s HOSTS="host1,host2,host3,host4,host5" \
    #       -s THRESHOLDS="-w 'COUNT(WARNING)>3' -c 'COUNT(CRITICAL)>3'"
    #
    # caveat: take care of the different timeout thresholds!
    #
    eeval [ check_command ] = \
        if ( "$CHECK_COMMAND$" ) { \
            return "$CHECK_COMMAND$"; \
        } else {\
            print "Error: CHECK_COMMAND not defined. Exit.\n"; \
            exit 3; \
        }
    #
    eeval [ timeout ] = ( "$TIMEOUT$" eq "") ? "2" : "$TIMEOUT$";
    #
    eeval [ hosts_to_check ] = \
        if ( "$HOSTS$") { \
            return "$HOSTS$"; \
        } else {\
            print "Error: no HOSTS defined. Exit.\n"; \
            exit 3; \
        } \
    #
    eeval [ remote_check ] = \
        if ( "$REMOTE_CHECK$") { \
            return "$REMOTE_CHECK$"; \
        } else { \
            return "check_by_ssh -H \$host -t $timeout$ -C "; \
        }
    #
    eeval [ host_checks ] = \
        my $count=0; \
        my $disttest=''; \
        foreach my $host (split(/,/,'$HOSTS$')) { \
            $disttest.="-x \'command [ $host ] = $remote_check$ \"$CHECK_COMMAND$\"\' "; \
            $count++; \
        } \
        parse_lines("command [ distributed_check ] = check_multi -r 15 $disttest $THRESHOLDS$"); \
        $count;

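The host_checks step is the least obvious part of the file: it loops over HOSTS, builds one `-x 'command [ host ] = ...'` argument per host and hands the resulting check_multi call to parse_lines. The following shell sketch re-creates that string building outside of check_multi so you can see what the generated command roughly looks like. `build_distributed_command` is a hypothetical helper, the default remote check and the timeout of 2 are taken from the command file above, and the quoting is simplified compared with the Perl original.

```shell
#!/bin/sh
# Sketch only: mimics in plain shell the string that the host_checks
# loop assembles. build_distributed_command is a hypothetical helper.
build_distributed_command() {
    hosts=$1          # comma-separated list, like HOSTS
    check_command=$2  # like CHECK_COMMAND
    thresholds=$3     # like THRESHOLDS
    timeout=2         # default of the timeout child check
    disttest=""
    old_ifs=$IFS
    IFS=,             # split the host list on commas
    for host in $hosts; do
        # One child check per probing host, run via the default remote check.
        disttest="${disttest}-x 'command [ $host ] = check_by_ssh -H $host -t $timeout -C \"$check_command\"' "
    done
    IFS=$old_ifs
    printf '%s\n' "check_multi -r 15 ${disttest}${thresholds}"
}

build_distributed_command "host1,host2" \
    "check_tcp -p 80 -H hostx -t 5" \
    "-w 'COUNT(WARNING)>1' -c 'COUNT(CRITICAL)>1'"
```

Each probing host ends up as its own child check inside a nested check_multi call, so the per-host results stay visible in the output while only the aggregated state is reported for the target host.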