On 19-03-2012 19:15, Poppy, Ben wrote:
I have an interesting problem that happened last night. We are working on a DR test. Part of that test includes shutting down some DCs in our DR datacenter. When that happened, most tests that are initiated from the xymon servers (http, dns, ssh, ftp, etc.) to the monitored servers went purple. The servers that went purple were not only in our DR datacenter; they were at all of our sites, and even included some tests against the xymon server itself (we monitor the HTTP web page of xymon itself as well).
Both of our xymon servers point to two Windows DCs in our production datacenter in /etc/resolv.conf for DNS lookups.
Check the "xymonnet" status history. I expect it will show some yellow events during that period, caused by the network tests taking too long to run.
The status will tell you which part of the network tests is taking too long.
This should also show up in the xymonnet.log file.
One likely culprit would be "ntp" tests or custom DNS queries from Xymon against the DCs that are down. "ntp" tests use an external program (ntpdate) to perform the query, and it has a very long timeout when servers are not responding. DNS queries use the C-ARES library, and because I misunderstood how the timeout handling works in this library, it can take several minutes *per test* to time out.
Fixes for both of these issues are "in the pipeline" for the next major Xymon version.
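In the meantime, if you want to see from the Xymon server how long a single query against a dead DC actually hangs, you can time it by hand with a hard deadline. The following is only a rough Python sketch and is not part of Xymon or how xymonnet runs its tests; the function name `run_with_deadline` and the hostname in the usage comment are made up for illustration.

```python
import subprocess
import sys
import time


def run_with_deadline(cmd, deadline_s):
    """Run an external command, killing it if it exceeds deadline_s seconds.

    Returns (status, elapsed_seconds), where status is one of
    "ok" (exit code 0), "failed" (non-zero exit code) or "timeout".
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=deadline_s,  # the child is killed when this expires
        )
        status = "ok" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"
    return status, time.monotonic() - start


# Example usage (hypothetical hostname; assumes ntpdate is installed):
#   status, elapsed = run_with_deadline(["ntpdate", "-q", "dc1.example.com"], 5)
#   print(status, round(elapsed, 1))
```

Running this once against a DC that is up and once against one that is down should make it obvious whether the external query is what stalls the network tests.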
Regards, Henrik