[Xymon] Xymon server post-migration blues pt. 1 - "brown-outs"

24 Apr 2019


      [Apologies in advance for the too-wordy message.  Part 1 of 2.]
Recently I violated one of the prime rules of being a SysAdmin - don't
ever change two things at once.
At my work we were forced to migrate our data center to a new facility,
so we bought a new monitoring PC to replace a RHEL 6.10 system that ran
Xymon 4.3.12.  The new monitoring PC runs RHEL 7.6 and Xymon 4.3.28
(using the Terabithia RPMs).
Ever since then, I have been running into two big problems that did not
exist before.  This is the first problem ("brown-outs"); I'll describe
the second in part 2.
Randomly, a group of systems (or some subset of the group) will report
in as CRITICAL/RED due to failed xymonnet tests.  Mostly SSH, but some
SMTP and FTP as well.  The hosts/services are all actually fine and the
red alerts are incorrect/false positives.
The problem is getting worse - I'm now seeing several hundred red alerts
a day from these "brown-outs".  The hosts involved are more-or-less
random - different buildings/OSes, etc.  Sometimes all of them provoke
alerts; most of the time it's just a subset of the list.
When they fail, the alert message is always:
--
Service <service> on <host> is not OK : Service listening but
unavailable (connect timeout)
To try to catch it in the act, I ran this test in a loop:
[root@mgmt xymon]# while true; do
    echo "[$(date)]"
    xymonnet --report --ping --checkresponse --timing --debug --no-update \
        > /tmp/xymonnet.out 2>&1
    grep 'err=[^0]' /tmp/xymonnet.out
done
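Since each pass overwrites /tmp/xymonnet.out, the full debug output of a failing pass is lost as soon as the next pass starts. A minimal sketch of a variant that keeps one timestamped copy per failing pass is below. The function name keep_failed_pass and the pass-<timestamp>.log naming are made up for illustration, and the only xymonnet behaviour it relies on is the err= field visible in the debug output.

```shell
# keep_failed_pass OUTDIR CMD [ARGS...]
# Runs CMD once, capturing stdout and stderr together. If the captured output
# contains a non-zero err= field, the whole output is kept as
# OUTDIR/pass-<timestamp>.log and that path is printed; otherwise the output
# is discarded. OUTDIR must already exist.
keep_failed_pass() {
    outdir=$1
    shift
    stamp=$(date +%Y%m%dT%H%M%S)
    tmp=$(mktemp) || return 2

    # Capture everything the command prints, debug output included.
    "$@" > "$tmp" 2>&1

    if grep -q 'err=[^0]' "$tmp"; then
        # Two failing passes within the same second would overwrite each
        # other; acceptable here because one pass takes several seconds.
        mv "$tmp" "$outdir/pass-$stamp.log"
        echo "$outdir/pass-$stamp.log"
    else
        rm -f "$tmp"
    fi
    return 0
}
```

It could be driven with the same flags as the loop above, for example `while true; do keep_failed_pass /var/tmp/brownouts xymonnet --report --ping --checkresponse --timing --debug --no-update; done` (the directory name is arbitrary). The timestamps in the saved file names can then be lined up against switch or router logs.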
A couple of times I think I did catch it; here's an example:
--
Address=192.168.1.26:22, open=1, res=0, err=1, connecttime=0.004546, totaltime=11.653128,
Address=192.168.1.25:22, open=1, res=0, err=1, connecttime=0.004510, totaltime=11.653092,
Address=192.168.1.219:22, open=1, res=0, err=1, connecttime=0.003163, totaltime=11.651745,
Address=192.168.1.151:22, open=1, res=0, err=1, connecttime=0.002923, totaltime=11.651505,
Address=192.168.1.50:22, open=1, res=0, err=1, connecttime=0.002906, totaltime=11.651488,
Address=137.78.80.38:22, open=1, res=0, err=1, connecttime=0.002819, totaltime=11.651401,
[... another 10 elided ...]
Address=192.168.1.184:22, open=1, res=0, err=1, connecttime=0.001098, totaltime=12.393879,
Address=192.168.1.174:25, open=1, res=0, err=1, connecttime=0.000426, totaltime=12.364234,
Address=192.168.1.182:25, open=1, res=0, err=1, connecttime=0.000418, totaltime=12.364226,
Address=192.168.1.25:25, open=1, res=0, err=1, connecttime=0.000411, totaltime=12.364219,
Address=192.168.1.25:21, open=1, res=0, err=1, connecttime=0.022773, totaltime=12.364044,
Notice the small but non-zero connecttime values (the TCP connection
itself completes within milliseconds, and open=1), while the totaltime
values all exceed the timeout.
The services always immediately recover in the next test pass.
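To compare connecttime against totaltime across many failing records without reading the raw debug output, something like the following could be used. It is only a sketch: summarize_failures is a hypothetical helper name, and it assumes the debug file holds one record per line in the field layout quoted above.

```shell
# summarize_failures FILE
# Prints "address connect=<connecttime> total=<totaltime>" for every record
# whose err field is non-zero. Assumes one "Address=..., ..." record per line.
summarize_failures() {
    awk -F', *' '
        /^Address=/ {
            addr = ""; err = ""; ct = ""; tt = ""
            # Each field looks like key=value; pick out the ones of interest.
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "Address")     addr = kv[2]
                if (kv[1] == "err")         err  = kv[2]
                if (kv[1] == "connecttime") ct   = kv[2]
                if (kv[1] == "totaltime")   tt   = kv[2]
            }
            # Report only records that carry a non-zero error code.
            if (err != "" && err + 0 != 0)
                printf "%s connect=%s total=%s\n", addr, ct, tt
        }
    ' "$1"
}
```

Running `summarize_failures /tmp/xymonnet.out` after a failing pass gives one short line per affected address, which makes it easier to see whether every failure in a pass shares the same fast-connect, slow-total pattern.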
Are there any knobs I can turn to help debug this problem?  I'm
assuming it's network/router/switch-related, but I need a smoking gun.
Failing that, is there a .cfg file setting that would turn these
particular "Service listening but unavailable" statuses into a Yellow
alert rather than Red?  (I'd rather not have to resort to this, but as
a stop-gap I would.)
	- Greg

earle＠isolar.DynDNS.ORG
