Transports, buffers and multiline

The incredibly simple plugin interface is one of the keys to the success of Nagios. A return code of 0, 1, 2 or 3 and a few explanatory words, that's all.
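
To make these two elements concrete, here is a minimal sketch of a plugin's core in Python. The check itself (`check_used_percent`), its thresholds and the `DEMO` prefix are made up for illustration; only the four return codes and the "text | performance data" layout follow the usual plugin conventions.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_used_percent(used_percent, warn=80.0, crit=90.0):
    """Return (return_code, output_line) for a hypothetical usage check."""
    if used_percent >= crit:
        code = CRITICAL
    elif used_percent >= warn:
        code = WARNING
    else:
        code = OK
    # Human readable part first, then the performance data after the pipe.
    text = "DEMO %s - %.0f%% used" % (STATUS_NAMES[code], used_percent)
    perfdata = "used=%.0f%%;%.0f;%.0f;0;100" % (used_percent, warn, crit)
    return code, text + " | " + perfdata


# The measured value is hard-coded here; a real plugin would measure it.
code, line = check_used_percent(92.0)
print(line)
# A real plugin would now finish with: sys.exit(code)
```

The output line is what Nagios displays, and the exit status is what decides the state of the service.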

Everybody could understand these two elements, and people started to write plugins in every language one can imagine and for every OS Nagios runs on. Nagios earned its reputation ('Yes we can plugin!'), and the only limitation was the skill of the plugin programmer.

As a side note: some plugins are not of the best quality, as a look at the exchange repositories shows. But together they cover the whole range of monitoring.


Plugin output limited

Let's talk about a small aspect of the plugin interface which is annoying and frustrates Nagios beginners in particular: the limited length of plugin output. It sounds pretty simple, but the devil is in the details.

If you take the output of the standard plugin check_disk, the length of output should not be a problem:

DISK OK - free space: / 947 MB (8% inode=74%);| /=10533MB;10884;9675;0;12094

But by now there are plugins like check_multi, which transport much more data:

OK - 3 plugins checked, 3 ok
[ 1] disk DISK OK - free space: / 947 MB (8% inode=74%);
[ 2] load OK - load average: 1.43, 1.23, 1.52
[ 3] swap SWAP OK - 95% free (1935 MB out of 2048 MB) |check_multi::check_multi::plugins=3 time=0.046274 disk::check_disk::/=10533MB;11489;11852;0;12094 load::check_load::load1=1.430;5.000;10.000;0; load5=1.230;4.000;8.000;0; load15=1.520;3.000;6.000;0; swap::check_swap::swap=1935MB;0;0;0;2048

428 bytes instead of 83 bytes: if you are still running Nagios 2, this exceeds the maximum length of plugin output.


Plugin buffer overflow? Nagios does not care

The plugin interface is simple, but that also means that Nagios does not care about the length of plugin output. If the output exceeds the internal buffer length, nobody is informed and often nobody notices. The content is simply truncated.

This is bad news for the performance data, which is appended to the output: if the buffer is too small to hold the whole output, the performance data is missing or, even worse, corrupted. And no warning light alerts the monitoring admin.
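
A small Python sketch shows why corrupted is worse than missing. It cuts the check_disk output from above with a deliberately tiny, made-up buffer of 60 bytes; the helper names are hypothetical and the slicing only imitates what a fixed-size buffer does, it is not the Nagios code.

```python
def truncate_like_fixed_buffer(output, max_length):
    """Silently cut the output to max_length bytes, as a fixed buffer would."""
    return output.encode("utf-8")[:max_length].decode("utf-8", errors="ignore")


def split_perfdata(output):
    """Split plugin output into (text, perfdata) at the first pipe character."""
    text, _, perfdata = output.partition("|")
    return text.strip(), perfdata.strip()


full = "DISK OK - free space: / 947 MB (8% inode=74%);| /=10533MB;10884;9675;0;12094"

# 60 bytes is an artificially small limit chosen for this demonstration.
cut = truncate_like_fixed_buffer(full, 60)

print(split_perfdata(full)[1])
print(split_perfdata(cut)[1])
```

The second line printed is `/=10533MB;10`. It still parses as performance data, but the warning threshold has silently changed from 10884 to 10 and the remaining fields are gone.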

Nagios 2 allowed 332 bytes of plugin output, but for Nagios 3 this was increased drastically. Have a look at how the maximum plugin output length evolved over the Nagios timeline. On the Nagios side it is the constant named MAX_PLUGIN_OUTPUT_LENGTH:

Maximum plugin output (bytes)   Nagios version   Include file
352                             1.0              common/objects.h
348                             2-0              include/objects.h
332                             2-1              include/objects.h
4096                            3-0a             include/nagios.h
8192                            3-0              include/nagios.h

Don't assume that 8K bytes are sufficient in all cases - check_multi's HTML mode can easily consume dozens of kilobytes.
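
Since Nagios will not tell you, you can check for yourself whether a plugin's output fits. The following Python sketch compares an output size with the limits from the table above. It assumes the constant is the usable length, which ignores details such as a terminating NUL byte, and the function name is invented for this example.

```python
# Limits in bytes, copied from the table above (keys are the version labels used there).
MAX_PLUGIN_OUTPUT_LENGTH = {
    "1.0": 352,
    "2-0": 348,
    "2-1": 332,
    "3-0a": 4096,
    "3-0": 8192,
}


def versions_that_truncate(output):
    """Return the Nagios versions whose buffer is smaller than the output."""
    size = len(output.encode("utf-8"))
    return [version for version, limit in MAX_PLUGIN_OUTPUT_LENGTH.items()
            if size > limit]


# Stand-ins with the sizes mentioned in the text: 83 and 428 bytes.
print(versions_that_truncate("x" * 83))
print(versions_that_truncate("x" * 428))
```

Feeding it the real output of a plugin, for example captured in a wrapper script, shows at a glance which installations would cut it.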


Increasing MAX_PLUGIN_OUTPUT_LENGTH - and some more

In principle the idea of increasing the constant MAX_PLUGIN_OUTPUT_LENGTH is correct - increase it, recompile and restart Nagios, done. Ethan himself gives a hint in nagios.h to also increase MAX_EXTERNAL_COMMAND_LENGTH for passive checks:

NOTE: Plugin length is artificially capped at 8k to prevent runaway plugins from returning MBs/GBs of data
back to Nagios.  If you increase the 8k cap by modifying this value, make sure you also increase the value
of MAX_EXTERNAL_COMMAND_LENGTH in common.h to allow for passive checks results received through the external
command file. EG 10/19/07

One remark on the buffer size: restricting it is generally a good idea. But increasing it to 32K or 64K should not be a problem for modern servers in the gigabit world, even if there are runaway plugins.

Transports and oddities

Enlarging the Nagios buffers is not the whole story, since many plugins run on remote machines and their output has to be transferred to Nagios. Here several transports enter the stage:

  • NRPE - Nagios remote plugin executor
  • NSCA - Nagios service check acceptor
  • SSH - check_by_ssh

There are more, but these are the most important in the Nagios world. Let's take a look at how they behave with large plugin output.

We will begin with SSH, since it is not part of Nagios and, in terms of transport, the simplest. I know that some people will not agree, but here are my 2 cents: once you have public key authentication under control, SSH is a simple, safe and robust transport. And whether you transfer 10K or 100K, who cares…

NRPE is a bit trickier, and this comes from its internal implementation. In the original version it is a single-buffer transport and will fail if you don't adjust the small buffer sizes in common.h:

#define MAX_INPUT_BUFFER           2048    /* max size of most buffers we use */
[...]
#define MAX_PACKETBUFFER_LENGTH    1024    /* max amount of data we'll send in one query/response */

Ton Voon has provided an improvement which removes this limitation. The best thing about Ton's patch is that you don't have to upgrade all machines at once. You can do it step by step, which is especially helpful in large installations.

Note: if you are running NRPE on Linux machines with a kernel older than 2.6.11, you will only be able to transport one buffer. This is an effect of the old single-buffered pipe implementation. In 2.6.11 Linus Torvalds himself implemented a ring buffer which allowed circular pipes. With the default kernel pipe buffer size of 4K and 16 buffers, NRPE can now transport 64K. So if you still have problems with truncated NRPE data, watch out for kernels 2.6.10 and below.

NSCA is the nasty end - and if you ask me, it needs a reimplementation. It has several implementation quirks which no longer fit into the current Nagios world:

  1. NSCA does not scale very well: it passes all messages to the Nagios CMD interface, which is well known for its traffic jams in large installations. And NSCA is often used in exactly such large installations to circumvent the Nagios scheduling bottleneck.
    There are numerous enhancements on both NSCA's sender and receiver side, but IMHO the only well-performing way to insert checks into Nagios is the checkresults interface.
  2. NSCA does not allow multiline: it reads the input up to the first newline and that's it.
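
The effect of the second point can be sketched in a few lines of Python. The sketch assumes the usual send_nsca input format (host, service, return code and output separated by tabs, one result per line); the host and service names are invented, and `first_record` only imitates a reader that stops at the first newline, it is not NSCA's code.

```python
def format_nsca_line(host, service, return_code, output):
    """Build one service check result in the tab-separated send_nsca format."""
    return "%s\t%s\t%d\t%s\n" % (host, service, return_code, output)


def first_record(data):
    """Return what a reader that stops at the first newline gets to see."""
    return data.split("\n", 1)[0]


# Shortened multiline output in the style of check_multi.
multiline_output = (
    "OK - 3 plugins checked, 3 ok\n"
    "[ 1] disk DISK OK\n"
    "[ 2] load OK\n"
    "[ 3] swap SWAP OK |swap=1935MB;0;0;0;2048"
)

record = format_nsca_line("host1", "multi", 0, multiline_output)
print(repr(first_record(record)))
```

Only the summary line survives. The per-plugin lines are lost, and so is the performance data, because it is attached to the last line.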


Recommendations for check_multi?
After our small walk through the puzzling world of Nagios transports, the conclusion for the use of check_multi is pretty clear: NRPE and SSH will work well, while NSCA is the black sheep of this family.
But this does not need to be a real disadvantage: in a check_multi driven Nagios infrastructure you don't need that many passive services with NSCA, because you can use active check_multi services instead.
In the end this means no more need for freshness checks and no more need for sophisticated distributed setups.

blog/2009/transports_buffers_and_multiline.txt · Last modified: 2010/03/09 09:23 by flackem