On Thu, April 14, 2016 7:36 am, Matt Pannucci wrote:
Hello,
For the past two days, our xymon environment has been falsely reporting red for SSH and HTTP. It does seem to be random (it happened one day at 10pm and two days later at 2:30am).
It happens to every server all at the same time. Then a couple minutes later everything goes back to green.
I've checked through some logs with no success. I'm not exactly sure if I'm looking in the correct places to find the answer.
Any help/suggestions would be great!
Thanks
Matt
Hi Matt,
You'll want to start by looking at the history ("histlog") snapshots from the red statuses and see what xymonnet (the process doing the http and ssh network testing) reported when it happened. Was it a timeout? DNS error? Premature TCP closure?
If you haven't received any "conn" test failures (which are ICMP pings, usually done by fping) at the same time, then it's probably not a general loss of connectivity, but there could still be an issue at the TCP layer (packet loss leading to closure timeouts, firewall port limits, etc.).
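To tell those TCP-level cases apart outside of Xymon, a small probe run from the Xymon server can help. The following is only a rough sketch, not what xymonnet itself does: the function name `probe` and the outcome labels are made up for illustration, and it makes a single plain TCP connection without speaking SSH or HTTP.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Try one TCP connection and classify the outcome.

    Returns (outcome, seconds_elapsed). The outcome labels are
    illustrative names chosen for this sketch, not Xymon terms.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "ok"
    except socket.gaierror:
        # name resolution failed before any packet reached the target
        outcome = "dns_error"
    except socket.timeout:
        # no answer within the timeout (packet loss, silent firewall drop)
        outcome = "timeout"
    except ConnectionRefusedError:
        # the host answered, but nothing is listening on that port
        outcome = "refused"
    except OSError:
        # anything else, e.g. network unreachable
        outcome = "other_error"
    return outcome, time.monotonic() - start
```

Calling `probe(host, 22)` and `probe(host, 80)` in a loop against a few of the affected servers (for example from cron, once a minute) and logging the results would show whether failures line up with the red statuses, and whether they are timeouts or DNS errors.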
If TCP seems fine, then you'd want to look more closely at the server xymon is running on. Check the xymonnet logs (/var/log/xymon/xymonnet.log) for any errors around the time of the failures, see if the server itself was having performance problems, etc. You can try increasing test concurrency, or changing how DNS lookups are done, but that should only be done once the issue has been narrowed down.
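If the log is large, something like the following can pull out the interesting lines around an incident. It's a minimal sketch under assumptions: the helper name `matching_lines` and the keyword list are invented for illustration, and it assumes each log line begins with a timestamp you can match by prefix, so check the actual format in your log first.

```python
def matching_lines(lines, time_prefix, keywords=("error", "timeout", "fail")):
    """Return lines that contain time_prefix and at least one keyword.

    time_prefix is matched as a plain substring, so it must be written
    in whatever timestamp format the log really uses (an assumption here).
    Keywords are matched case-insensitively.
    """
    hits = []
    for line in lines:
        lowered = line.lower()
        if time_prefix in line and any(k in lowered for k in keywords):
            hits.append(line.rstrip("\n"))
    return hits
```

For example, open /var/log/xymon/xymonnet.log and pass the file object as `lines`, with `time_prefix` set to the date and hour of one of the red events as it appears in the log.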
The overarching question should also be: When did the problem start, and did anything change around that time?
HTH, -jc