I have an interesting problem that happened last night. We are working on a DR test. Part of that test includes shutting down some DCs in our DR datacenter. When that happened, most tests that are initiated from the Xymon servers (http, dns, ssh, ftp, etc.) to the monitored servers went purple. The servers that went purple were not only in our DR datacenter; they were at all of our sites, and even included some tests against the Xymon server itself (we also monitor Xymon's own HTTP web page).
Both of our Xymon servers point to two Windows DCs in our production datacenter in /etc/resolv.conf for DNS lookups.
Has anyone run into this before? Any ideas how it could be related? Or how to fix/prevent it?
We are running Xymon 4.3.4.
Thanks, -Ben
The contents of this message may contain private, protected and/or privileged information. If you received this message in error, you should destroy the e-mail message and any attachments or copies, and you are prohibited from retaining, distributing, disclosing or using any information contained within. Please contact the sender and advise of the erroneous delivery by return e-mail or telephone. Thank you for your cooperation.
On Tue, Mar 20, 2012 at 5:15 AM, Poppy, Ben <poppy.ben at marshfieldclinic.org> wrote:
I have an interesting problem that happened last night. We are working on a DR test. Part of that test includes shutting down some DC’s in our DR datacenter. When that happened, most tests that are initiated from the xymon servers (http, dns, ssh, ftp, etc) to the monitored server went purple.
For network tests, Xymon resolves the IP address from the servername (typically using DNS), and then uses that IP address to perform the test. The IP address in the hosts.cfg file is not normally used for network tests. So if your DNS fails, Xymon's network tests fail also.
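As a rough illustration of the dependency described above (a simplified model, not Xymon's actual code), the sketch below resolves each hostname before testing and deliberately ignores the address listed next to it. The function name `resolve_targets`, the sample hostnames, and the addresses are all made up for the example.

```python
import socket

def resolve_targets(hosts, resolve=socket.gethostbyname):
    """Map each hostname to the address a network test would target.

    `hosts` maps hostname -> address from the config file. That address is
    intentionally unused here, mirroring the behaviour described above.
    Returns (targets, unresolved).
    """
    targets = {}
    unresolved = []
    for name in hosts:
        try:
            targets[name] = resolve(name)
        except OSError:
            # socket.gaierror is a subclass of OSError: the lookup failed,
            # so this host has no address to test against.
            unresolved.append(name)
    return targets, unresolved

# Example with a stand-in resolver that simulates a DNS outage.
def dns_down(name):
    raise socket.gaierror("simulated DNS failure")

hosts = {"www.example.com": "192.0.2.10", "ns1.example.com": "192.0.2.53"}
print(resolve_targets(hosts, resolve=dns_down))
```

With the simulated outage, every host ends up in the unresolved list even though the configuration holds a usable address for each of them.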
You can prevent this, and use the IP address supplied in hosts.cfg, by adding "testip" to each hosts.cfg entry that requires it. You can add it to a ".default." entry so that it applies to all hosts.
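For illustration, a hosts.cfg fragment along these lines might look as follows. The hostnames, addresses, and test tags are made-up examples, and this assumes the ".default." entry is placed ahead of the hosts it should apply to and that each host line carries its real IP address.

```
# hosts.cfg fragment (example values only)
0.0.0.0      .default.          # testip
192.0.2.10   www.example.com    # http://www.example.com/ ssh
192.0.2.53   ns1.example.com    # dns
```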
J
So the Xymon servers are pointing to 2 DCs that stay up this entire time; we'll call them DC1 and DC2. Then we shut down DR-DC3 and DR-DC4. When those servers are down, we begin to have issues.
-----Original Message----- From: Jeremy Laidman [mailto:jlaidman at rebel-it.com.au] Sent: Monday, March 19, 2012 7:46 PM To: Poppy, Ben Cc: xymon at xymon.com Subject: Re: [Xymon] Purple storm
On 19-03-2012 19:15, Poppy, Ben wrote:
I have an interesting problem that happened last night. We are working on a DR test. Part of that test includes shutting down some DC’s in our DR datacenter. When that happened, most tests that are initiated from the xymon servers (http, dns, ssh, ftp, etc) to the monitored server went purple. The servers that went purple were not all in our DR datacenter, it was at all of our sites, and even included some tests to the xymon server itself (we monitor the HTTP web page of xymon itself as well).
Both of our xymon servers point to 2 windows DC’s in our production datacenter in /etc/resolv.conf for DNS lookups.
Check the "xymonnet" status history. I expect it will show some yellow events during the outage, caused by the network tests taking too long to run.
The status will tell you more about which part of the network tests is taking too long.
This should also show up in the xymonnet.log file.
One likely culprit would be if you are doing "ntp" tests or custom DNS queries from Xymon against the DCs that are down. "ntp" tests use an external program (ntpdate) to perform the query, and it has a very long timeout when servers are not responding. DNS queries use the C-ARES library, and because I misunderstood how the timeout handling works in this library, it can take several minutes *per test* to time out.
Fixes for both of these issues are "in the pipeline" for the next major Xymon version.
Regards, Henrik
The "DNS tests executed" figure jumps from the normal ~1 to somewhere in the 1500-2400 range when those 4 DCs are down. We run DNS tests against those DCs, but they are not the DNS servers set up in /etc/resolv.conf on the Xymon servers. We are not doing any NTP tests against any hosts, nor do we do any special DNS tests, just the standard test.
-----Original Message----- From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Henrik Størner Sent: Tuesday, March 20, 2012 2:11 AM To: xymon at xymon.com Subject: Re: [Xymon] Purple storm
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
Participants (3): henrik@hswn.dk, jlaidman@rebel-it.com.au, poppy.ben@marshfieldclinic.org