[Apologies in advance for the too-wordy message. Part 1 of 2.]
Recently I violated one of the prime rules of being a SysAdmin - don't ever change two things at once.
At my work we were forced to migrate our data center to a new facility, so we bought a new monitoring PC to replace a RHEL 6.10 system that ran Xymon 4.3.12. The new monitoring PC runs RHEL 7.6 and Xymon 4.3.28 (using the Terabithia RPMs).
Ever since then, I have been running into two big problems that did not exist before. This is the first problem ("brown-outs"); I'll describe the second in part 2.
Randomly, a group of systems (or some subset of the group) will report in as CRITICAL/RED due to failed xymonnet tests. Mostly SSH, but some SMTP and FTP as well. The hosts/services are all actually fine and the red alerts are incorrect/false positives.
The problem is getting worse - I'm now seeing several hundred red alerts a day from these "brown-outs". The hosts involved are more-or-less random - different buildings/OSes, etc. Sometimes all of them provoke alerts; most of the time it's just a subset of the list.
When they fail, the alert message is always
-- Service <service> on <host> is not OK : Service listening but unavailable (connect timeout)
To try and catch it in the act, I ran this test in a loop:
[root@mgmt xymon]# while true; do ( echo "[$(date)]" ; xymonnet --report --ping --checkresponse --timing --debug --no-update > /tmp/xymonnet.out 2>&1 ; grep 'err=[^0]' /tmp/xymonnet.out ); done
A couple of times I think I did catch it; here's an example:
-- Address=192.168.1.26:22, open=1, res=0, err=1, connecttime=0.004546, totaltime=11.653128,
Address=192.168.1.25:22, open=1, res=0, err=1, connecttime=0.004510, totaltime=11.653092,
Address=192.168.1.219:22, open=1, res=0, err=1, connecttime=0.003163, totaltime=11.651745,
Address=192.168.1.151:22, open=1, res=0, err=1, connecttime=0.002923, totaltime=11.651505,
Address=192.168.1.50:22, open=1, res=0, err=1, connecttime=0.002906, totaltime=11.651488,
Address=137.78.80.38:22, open=1, res=0, err=1, connecttime=0.002819, totaltime=11.651401,
[... another 10 elided ...]
Address=192.168.1.184:22, open=1, res=0, err=1, connecttime=0.001098, totaltime=12.393879,
Address=192.168.1.174:25, open=1, res=0, err=1, connecttime=0.000426, totaltime=12.364234,
Address=192.168.1.182:25, open=1, res=0, err=1, connecttime=0.000418, totaltime=12.364226,
Address=192.168.1.25:25, open=1, res=0, err=1, connecttime=0.000411, totaltime=12.364219,
Address=192.168.1.25:21, open=1, res=0, err=1, connecttime=0.022773, totaltime=12.364044,
Notice that the connecttime values are small but non-zero (open=1, so the TCP connect itself succeeded quickly), while the totaltime values all exceed the timeout.
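
For anyone who wants to pull just the failing records out of a capture like /tmp/xymonnet.out, here is a minimal sketch. The function name slow_records is made up for illustration, and it assumes the debug records look exactly like the ones quoted above (Address=, open=, res=, err=, connecttime=, totaltime=, separated by ", "). It works whether the records arrive one per line or run together.

```shell
#!/bin/sh
# Hypothetical helper (not part of Xymon): print address and timings for
# every record whose err field is non-zero.
# Assumes the record layout shown in the debug output quoted above.
slow_records() {
    # grep -o emits each matching "Address=..." record on its own line.
    grep -o 'Address=[^,]*, open=[0-9]*, res=[0-9]*, err=[0-9]*, connecttime=[0-9.]*, totaltime=[0-9.]*' "$1" |
    awk -F', ' '
        {
            split($1, a, "=")   # a[2] = host:port
            split($4, e, "=")   # e[2] = err value
            split($5, c, "=")   # c[2] = connecttime
            split($6, t, "=")   # t[2] = totaltime
            if (e[2] != 0)
                printf "%s connect=%s total=%s\n", a[2], c[2], t[2]
        }'
}
```

Running `slow_records /tmp/xymonnet.out` after each pass should list one line per failing address, which makes it easier to compare connect time against total time across brown-outs.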
The services always immediately recover in the next test pass.
Are there any knobs I can turn to help debug this problem? I'm assuming it's network/router/switch-related, but I need a smoking gun.
Failing that, is there a setting in one of the .cfg files that would turn these particular "Service listening but unavailable" statuses into a Yellow alert rather than Red? (I'd rather not resort to this, but as a stop-gap I would.)
- Greg