Xymon server post-migration blues pt. 1 - "brown-outs"
[Apologies in advance for the too-wordy message. Part 1 of 2.]
Recently I violated one of the prime rules of being a SysAdmin - don't ever change two things at once.
At my work we were forced to migrate our data center to a new facility, so we bought a new monitoring PC to replace a RHEL 6.10 system that ran Xymon 4.3.12. The new monitoring PC runs RHEL 7.6 and Xymon 4.3.28 (using the Terabithia RPMs).
Ever since then, I have been running into two big problems that did not exist before. This is the first problem ("brown-outs"); I'll describe the 2nd in part 2.
Randomly, a group of systems (or some subset of the group) will report in as CRITICAL/RED due to failed xymonnet tests. Mostly SSH, but some SMTP and FTP as well. The hosts/services are all actually fine and the red alerts are incorrect/false positives.
The problem is getting worse - I'm now seeing several hundred red alerts a day from these "brown-outs". The hosts involved are more-or-less random - different buildings/OSes, etc. Sometimes all of them provoke alerts; most of the time it's just a subset of the list.
When they fail, the alert message is always
-- Service <service> on <host> is not OK : Service listening but unavailable (connect timeout)
To try and catch it in the act, I ran this test in a loop:
[root@mgmt xymon]# while true; do ( echo "[$(date)]" ; xymonnet --report --ping --checkresponse --timing --debug --no-update > /tmp/xymonnet.out 2>&1 ; grep 'err=[^0]' /tmp/xymonnet.out ); done
A couple of times I think I did catch it; here's an example:
-- Address=192.168.1.26:22, open=1, res=0, err=1, connecttime=0.004546, totaltime=11.653128,
-- Address=192.168.1.25:22, open=1, res=0, err=1, connecttime=0.004510, totaltime=11.653092,
-- Address=192.168.1.219:22, open=1, res=0, err=1, connecttime=0.003163, totaltime=11.651745,
-- Address=192.168.1.151:22, open=1, res=0, err=1, connecttime=0.002923, totaltime=11.651505,
-- Address=192.168.1.50:22, open=1, res=0, err=1, connecttime=0.002906, totaltime=11.651488,
-- Address=137.78.80.38:22, open=1, res=0, err=1, connecttime=0.002819, totaltime=11.651401,
[... another 10 elided ...]
-- Address=192.168.1.184:22, open=1, res=0, err=1, connecttime=0.001098, totaltime=12.393879,
-- Address=192.168.1.174:25, open=1, res=0, err=1, connecttime=0.000426, totaltime=12.364234,
-- Address=192.168.1.182:25, open=1, res=0, err=1, connecttime=0.000418, totaltime=12.364226,
-- Address=192.168.1.25:25, open=1, res=0, err=1, connecttime=0.000411, totaltime=12.364219,
-- Address=192.168.1.25:21, open=1, res=0, err=1, connecttime=0.022773, totaltime=12.364044,
Notice that the connecttime values are only a few milliseconds (open=1, so the TCP connect itself succeeded), while the totaltime values all exceed the timeout.
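For anyone wanting to pull that pattern out of a saved debug log automatically, here is a rough sketch. The function name slow_after_connect and the default threshold of 10 seconds are placeholders of mine, not anything from Xymon; substitute whatever timeout your xymonnet is actually running with. It assumes the debug entries look like the ones shown above.

```shell
#!/bin/bash
# slow_after_connect [THRESHOLD_SECONDS] < xymonnet-debug-output
# Prints "address connecttime totaltime" for every test that reported an
# error (err != 0) and whose totaltime exceeded the threshold.
# The threshold default (10) is an assumed placeholder, not a Xymon constant.
slow_after_connect() {
    local threshold="${1:-10}"
    # Extract each entry on its own line, however the log happens to be wrapped.
    grep -oE 'Address=[^,]+, open=[0-9]+, res=[0-9]+, err=[0-9]+, connecttime=[0-9.]+, totaltime=[0-9.]+' |
    awk -v limit="$threshold" '
        {
            # Split every "key=value," token into the f[] array.
            for (i = 1; i <= NF; i++) {
                sub(/,$/, "", $i)
                split($i, kv, "=")
                f[kv[1]] = kv[2]
            }
            if (f["err"] != 0 && f["totaltime"] + 0 > limit + 0)
                print f["Address"], f["connecttime"], f["totaltime"]
        }'
}
```

Usage would be along the lines of `slow_after_connect 10 < /tmp/xymonnet.out`; sorting the result by address over several captures might show whether the same hosts keep turning up.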
The services always immediately recover in the next test pass.
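Because the failures clear on the next pass, it may help to keep the full debug output of every pass that contained an error rather than overwriting one file. A minimal sketch follows; save_if_failed and the xymonnet-<timestamp>.out naming are made up for illustration, and the err=[^0] pattern is the same one used in the loop above.

```shell
#!/bin/bash
# save_if_failed DIR COMMAND [ARGS...]
# Runs COMMAND, capturing stdout and stderr. If the output contains an entry
# with a non-zero err= field, the capture is kept in DIR under a timestamped
# name and that path is printed; otherwise the capture is discarded.
save_if_failed() {
    local dir="$1"; shift
    local tmp stamp dest
    tmp=$(mktemp) || return 1
    "$@" > "$tmp" 2>&1
    if grep -q 'err=[^0]' "$tmp"; then
        stamp=$(date +%Y%m%dT%H%M%S)
        dest="$dir/xymonnet-$stamp.out"
        mv "$tmp" "$dest"
        echo "$dest"
    else
        rm -f "$tmp"
    fi
}
```

In a loop this could be called as `save_if_failed /var/tmp/brownouts xymonnet --report --ping --checkresponse --timing --debug --no-update` (the directory is hypothetical and must already exist), leaving one file per failing pass to compare against switch or firewall logs from the same timestamp.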
Are there any knobs I can turn to help debug this problem? I'm assuming it's network/router/switch-related, but I need a smoking gun.
Failing that, is there any way in a .cfg file setting to turn these particular "Service listening but unavailable" statuses into a Yellow alert rather than Red? (I'd rather not have to resort to this but as a stop-gap, I would.)
- Greg
On Tue, 2019-04-23 at 21:50 -0700, Greg Earle wrote:
At my work we were forced to migrate our data center to a new facility, so we bought a new monitoring PC to replace a RHEL 6.10 system that ran Xymon 4.3.12. The new monitoring PC runs RHEL 7.6 and Xymon 4.3.28 (using the Terabithia RPMs).
Yup, done the same recently (same versions). We had very few problems though.
When they fail, the alert message is always
-- Service <service> on <host> is not OK : Service listening but unavailable (connect timeout)
-- Address=192.168.1.26:22, open=1, res=0, err=1, connecttime=0.004546, totaltime=11.653128,
We saw something similar when we had 4 clients that the Xymon server could not connect to. As a result, the overall xymonnet run time went past its timeout. I think the timeout is 30 seconds (see the xymonnet man page). I also think that xymonnet then skips any further tests that it would normally do (hence causing other red alerts). If you look at the 'xymonnet' column on the Xymon server display, it should show you the time it took to do various tests and the overall time taken.
As in our case, I would suggest looking at the clients though. Check each client's logs to see whether the new Xymon server is allowed to connect to it. Is something blocking it - local (client) firewall, blocked by IP address etc? (It was this in our case.)
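If you would rather measure the overall run time from the command line than read it off the status page, something like the following sketch could do it. The name timed_run is hypothetical, and the limit you pass is up to you; the 30 seconds mentioned above is only my recollection, so check the man page before relying on it.

```shell
#!/bin/bash
# timed_run LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND (output discarded), measures wall-clock time in whole seconds,
# and prints a timestamped line saying whether the run exceeded LIMIT_SECONDS.
# Returns 1 when the limit was exceeded, 0 otherwise.
timed_run() {
    local limit="$1"; shift
    local start end elapsed
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    elapsed=$(( end - start ))
    if [ "$elapsed" -gt "$limit" ]; then
        echo "$(date '+%F %T') OVER limit: ${elapsed}s > ${limit}s"
        return 1
    fi
    echo "$(date '+%F %T') ok: ${elapsed}s"
}
```

For example, `timed_run 30 xymonnet --report --ping --checkresponse --timing --no-update` run as the xymon user would log one line per pass. If the "OVER limit" lines coincide with the bursts of red alerts, that would support the overall-timeout explanation.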
John.
-- John Horne | Senior Operations Analyst | Technology and Information Services University of Plymouth | Drake Circus | Plymouth | Devon | PL4 8AA | UK