The classical monitoring approach reduces everything that is monitored to small, independent units. The monitoring configuration is fed with these units: disks, processes, databases, log messages, etc.
The basic problem with this approach is that the small units lose their context. It is no longer obvious that the disk belongs to the application, that the DB malfunction is the root cause of the application failure, or that the dying daemon affects the application's communication.
At this point the Business Process Views enter the stage: they are a basic means of presenting a consolidated view of the different parameters and attributes that belong to a particular process.
This example describes the monitoring of a plain web application, the kind that almost every company runs. The application runs on a standard *nix server, on an Apache web server of course, uses a MySQL DB for data storage and, for simplicity, serves its content on the company's intranet. So far, so good.
If you miss something important: numerous monitoring items that may be significant for this web application are left out here to keep the demonstration simple.
#--- Web-Application myapp
command [ sys_ping ] = check_icmp -H myhost
command [ sys_load ] = check_load -w 3,4,5 -c 6,8,10
command [ sys_disk ] = check_disk -w 5% -c 2% -p / -p /var -p /opt
command [ web_apache ] = check_procs -c 1: -C httpd
command [ db_mysqld ] = check_mysql
command [ app_myapp ] = check_http -H myhost -u http://myhost/myapp
Everything seems to work fine, apart from a load warning:
nagios@thinkpad ~/libexec> ./check_multi -f myapp.cmd -r 0
WARNING - 6 plugins checked, 0 critical, 1 warning, 0 unknown, 5 ok
[ 1] sys_ping OK - myhost: rta 110.090ms, lost 0%
[ 2] sys_load WARNING - load average: 3.50, 3.40, 3.39
[ 3] sys_disk DISK OK - free space: / 2409 MB (20% inode=81%);
[ 4] web_apache PROCS OK: 11 processes with command name 'httpd2-prefork'
[ 5] db_mysql Uptime: 1134116 Threads: 1 Questions: 8888 Slow queries: 0 Opens: 12 Flush tables: 1 Open tables: 6 Queries per second avg: 0.008
[ 6] app_myapp HTTP OK - HTTP/1.0 302 Found - 0.290 second response time
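The summary line of this run can be reproduced from the individual child results. The following is a minimal Python sketch, not check_multi's own code: it assumes that the overall state is simply the most severe child state (which matches the output shown here), the severity order is an assumption, and the function name `summarize` is hypothetical.

```python
from collections import Counter

# Plugin states ordered by assumed severity (least to most severe).
SEVERITY = ["OK", "UNKNOWN", "WARNING", "CRITICAL"]


def summarize(results):
    """Build a check_multi-style summary line from {child name: state}.

    Assumption: the overall state is the most severe child state.
    """
    counts = Counter(results.values())
    overall = max(results.values(), key=SEVERITY.index)
    return (
        f"{overall} - {len(results)} plugins checked, "
        f"{counts['CRITICAL']} critical, {counts['WARNING']} warning, "
        f"{counts['UNKNOWN']} unknown, {counts['OK']} ok"
    )


# Child states as read from the example run.
results = {
    "sys_ping": "OK",
    "sys_load": "WARNING",
    "sys_disk": "OK",
    "web_apache": "OK",
    "db_mysql": "OK",
    "app_myapp": "OK",
}
print(summarize(results))
```

A single WARNING among the six children is enough to turn the whole result into WARNING, which is why the load average shows up in the first line of the output.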
Later changes only touch myapp.cmd. No further restart of Nagios is needed once the service itself has been successfully included: edit myapp.cmd and the new values go into effect after the next interval.
If the function of a business-critical application needs to be guaranteed, it will often be clustered. There are several strategies to achieve this. For web applications in particular, one of the favourite setups is a server farm where all incoming requests are spread across the farm members. For our setup this means that we have to duplicate the monitoring for all farm members. But there is a difference in how the result is evaluated: the overall result should only be critical if none of the farm members is reachable any more.
Each farm member uses myapp.cmd with one small change: instead of myhost we should use localhost, so the file is identical on each host and we do not have to maintain different files.
command [ myhost1 ] = check_nrpe -H myhost1 -c check_multi -a -f myapp.cmd
command [ myhost2 ] = check_nrpe -H myhost2 -c check_multi -a -f myapp.cmd
state [ CRITICAL ] = COUNT(CRITICAL) > 1
state [ WARNING ] = COUNT(WARNING) > 0 || COUNT(CRITICAL) > 0
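To see how these two rules behave for the two farm members, here is a small Python sketch that mimics the evaluation. It is not how check_multi is implemented: `farm_state` is a hypothetical helper, and it assumes that the CRITICAL rule takes precedence over the WARNING rule when both match and that the result is OK when neither matches.

```python
def farm_state(member_states):
    """Evaluate the two state rules for a list of farm member states."""
    critical = member_states.count("CRITICAL")
    warning = member_states.count("WARNING")

    # state [ CRITICAL ] = COUNT(CRITICAL) > 1
    if critical > 1:
        return "CRITICAL"
    # state [ WARNING ] = COUNT(WARNING) > 0 || COUNT(CRITICAL) > 0
    if warning > 0 or critical > 0:
        return "WARNING"
    # Assumption: no rule matched, so the farm counts as OK.
    return "OK"


# Walk through the relevant combinations for a two-member farm.
for states in (
    ["OK", "OK"],
    ["OK", "CRITICAL"],
    ["WARNING", "OK"],
    ["CRITICAL", "CRITICAL"],
):
    print(states, "->", farm_state(states))
```

With two members, a single failed host only degrades the farm to WARNING. The overall result turns CRITICAL once both members report CRITICAL, which is the behaviour asked for in the farm description.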