And a fiber cut to our DR datacenter caused another 5 hour purple storm.
The traces on our DCs showed that the DCs in our production datacenter were receiving and responding to all DNS lookups. None were being forwarded to our other datacenter.
While this was happening, we changed our secondary Xymon server to point to our Linux BIND DNS servers (so that xymon1 was pointing to dc1/2, and xymon2 was pointing to binddns1/2), and we still had purple storms on both Xymon servers.
At this point, I'm not sure what to do. Upgrade to the latest Xymon in the hope that some bug causing this has been fixed? Downgrade back to Xymon 4.2.3 or even Hobbit 4.2?
Another idea that they are suggesting is changing all the shortname entries (in hosts.cfg) to FQDN entries. The problem is that there are over 1700 entries, and I'd essentially have to find out what domain each one is in and then do a rename command as well. Not to mention that we don't have this issue unless our DR-DC goes down.
Any other ideas from the list by chance?
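To size the FQDN-rename idea before committing to it, a rough sketch like the one below could list the hosts.cfg entries whose hostname contains no dot. It assumes the usual "IP hostname # tags" line layout; the function name list_short_names and the example path are hypothetical, not part of Xymon.

```shell
# Print hostnames from a hosts.cfg-style file that contain no dot
# (i.e. short names). Assumes host lines start with an IPv4 address
# followed by the hostname; every other line is skipped.
list_short_names() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $2 != "" && $2 !~ /\./ { print $2 }' "$1"
}

# Example usage (the path is an assumption; adjust to your install):
# list_short_names /etc/xymon/hosts.cfg | wc -l
```

This only counts the entries that would need renaming; it does not look up which domain each host belongs to.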
-----Original Message----- From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Jamison Maxwell Sent: Tuesday, March 20, 2012 5:11 PM To: xymon at xymon.com Subject: Re: [Xymon] Purple storm
I think the interesting sniffer would be on the DCs that remain up. Just to make sure I've got this straight: you've got two DCs on the LAN with Xymon and two DCs in your DR site. You shut down the DCs in the DR site, and now queries to the DCs on the LAN are timing out (or something).
If that's the case, I would first look at DNS on the DCs. If you do a packet capture on the DCs, filtered to UDP port 53 and to only the packets from or to your Xymon server, it would show whether the queries are making it to your remaining DCs and whether there is any delay in the response. It wouldn't surprise me if Windows was so worried about the missing domain controllers that it forgot to respond to DNS queries. If there's no delay there, then refer to the tcpdump you have going on your Xymon server to make sure the packets are making it back with acceptable latency.
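For the Xymon-server side of that comparison, the capture filter could look roughly like the sketch below. It only prints the tcpdump command rather than running it, since a live capture needs root. The address 192.0.2.10 is a placeholder for the host whose DNS traffic you want to isolate, and dns_capture_cmd is a hypothetical helper name. The Windows DCs would need their own capture tool.

```shell
# Build (but do not run) a tcpdump command that captures only
# DNS-over-UDP traffic to or from one host. The address is a placeholder.
XYMON_IP="${XYMON_IP:-192.0.2.10}"

dns_capture_cmd() {
    # -n: do not resolve names (avoids generating extra DNS traffic)
    # -i any: listen on all interfaces (Linux)
    printf 'tcpdump -n -i any udp port 53 and host %s\n' "$1"
}

dns_capture_cmd "$XYMON_IP"
```

Comparing timestamps between the two captures would show whether replies leave the DC promptly but arrive late, or never leave at all.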
I'm assuming that your DNS configuration has the zone in question as a primary, Active Directory-integrated zone. Also, are you running a caching DNS server on your Xymon system? I've seen some odd results with DNS caching. What order are the name servers in /etc/resolv.conf? I was playing with Debian this weekend and, for some reason, no matter what I did, it would not use any name server other than the first one in the list for regular queries; only dig and nslookup would use the others.
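A quick way to answer the ordering question is to print the nameserver lines in file order. This is a minimal sketch; show_nameserver_order is a hypothetical helper name, and it reads /etc/resolv.conf unless another file is given.

```shell
# Print the nameserver entries of a resolv.conf-style file,
# numbered in the order they appear in the file.
show_nameserver_order() {
    awk '$1 == "nameserver" { n++; print n ": " $2 }' "${1:-/etc/resolv.conf}"
}

# Example usage:
# show_nameserver_order
```

If a DR-site server turns out to be listed first, that alone could explain long lookups while the DR site is unreachable.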
Of course, you could always just install your favorite DNS server on your Xymon system and transfer the zones as secondary zones from your Windows boxes. That'll definitely solve the problem and remove the prerequisite of your DCs staying up for Xymon to work. ...matter of fact, I think I'll do that myself....
Jamison Maxwell
-----Original Message----- From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Don Kuhlman Sent: Tuesday, March 20, 2012 3:05 PM To: xymon at xymon.com Subject: Re: [Xymon] Purple storm
Would you be able to run a tcpdump or use a network sniffer to see what the server is doing when you're getting the long response times?
Maybe that will help you see what it is trying to reach when that is happening.
On 3/20/12 1:51 PM, "Poppy, Ben" <poppy.ben at marshfieldclinic.org> wrote:
Yes, that's the strange part, we can still manually do digs and nslookups from the xymon server to other DNS servers.
-----Original Message----- From: Phil Crooker [mailto:Phil.Crooker at orix.com.au] Sent: Tuesday, March 20, 2012 12:41 AM To: Poppy, Ben Cc: xymon at xymon.com Subject: Re: [Xymon] Purple storm
So, can you do DNS queries from the xymon server when DC3 & 4 are down?
"Poppy, Ben" 03/20/12 11:50 AM >>> So they are pointing to 2 DC's that stay up this entire time, we'll call them DC1 and DC2. Then we shutdown DR-DC3 and DR-DC4. When those servers are down, we begin to have issues.
-----Original Message----- From: Jeremy Laidman [mailto:jlaidman at rebel-it.com.au] Sent: Monday, March 19, 2012 7:46 PM To: Poppy, Ben Cc: xymon at xymon.com Subject: Re: [Xymon] Purple storm
On Tue, Mar 20, 2012 at 5:15 AM, Poppy, Ben wrote:
I have an interesting problem that happened last night. We are working on a DR test. Part of that test includes shutting down some DC's in our DR datacenter. When that happened, most tests that are initiated from the xymon servers (http, dns, ssh, ftp, etc) to the monitored server went purple.
For network tests, Xymon resolves the IP address from the servername (typically using DNS), and then uses that IP address to perform the test. The IP address in the hosts.cfg file is not normally used for network tests. So if your DNS fails, Xymon's network tests fail also.
You can prevent this, and use the IP address supplied in hosts.cfg, by adding "testip" to each hosts.cfg entry that requires it. You can add it to a ".default." entry so that it applies to all hosts.
J
The contents of this message may contain private, protected and/or privileged information. If you received this message in error, you should destroy the e-mail message and any attachments or copies, and you are prohibited from retaining, distributing, disclosing or using any information contained within. Please contact the sender and advise of the erroneous delivery by return e-mail or telephone. Thank you for your cooperation.
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
--
This message from ORIX Australia might contain confidential and/or privileged information. If you are not the intended recipient, any use, disclosure or copying of this message (or of any attachments to it) is not authorised.
If you have received this message in error, please notify the sender immediately and delete the message and any attachments from your system. Please inform the sender if you do not wish to receive future communications by email.
ORIX handles personal information according to a Privacy Policy that is consistent with the National Privacy Principles. Please let us know if you would like a copy. It is also available at http://www.orix.com.au .
On Thu, Apr 12, 2012 at 8:23 AM, Poppy, Ben <poppy.ben at marshfieldclinic.org>wrote:
> Another idea that they are suggesting is changing all the shortname entries (in hosts.cfg) to FQDN entries. The problem is that there are over 1700 entries, and I'd essentially have to find out what domain each one is in and then do a rename command as well. Not to mention that we don't have this issue unless our DR-DC goes down.
> Any other ideas from the list by chance?
Add to hosts.cfg:
    0.0.0.0 .default. # testip
This turns off DNS lookups for all servers in hosts.cfg and uses the IP address instead.
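If you would rather script that change than edit the file by hand, something along these lines might do. It is a sketch only: add_default_testip is a hypothetical helper name. Placing the entry first, so that it applies to every host that follows, is my understanding of how ".default." works; check the hosts.cfg man page for your version.

```shell
# Prepend a ".default." entry carrying the testip tag to a hosts.cfg-style
# file, keeping a backup copy as <file>.bak. Refuses to act if a .default.
# entry already exists, since that one should be edited by hand instead.
add_default_testip() {
    cfg="$1"
    if grep -Eq '^[[:space:]]*[0-9.]+[[:space:]]+\.default\.' "$cfg"; then
        echo "a .default. entry already exists in $cfg; edit it by hand" >&2
        return 1
    fi
    cp "$cfg" "$cfg.bak" || return 1
    # Rewrite the file in place so its ownership and permissions are kept.
    { printf '0.0.0.0 .default. # testip\n'; cat "$cfg.bak"; } > "$cfg"
}

# Example usage (the path is an assumption; adjust to your install):
# add_default_testip /etc/xymon/hosts.cfg
```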
Cheers Jeremy
I'm not convinced that this is a bug in Xymon.
I don't understand how secondary name servers in a DR site that are configured to be used as backups would cause nothing to resolve when they are unavailable. I've run a similar configuration to what I believe you are describing without a problem.... To make sure I'm understanding what you're saying, when the DR DNS servers are unavailable, then Xymon fails to accept the DNS query results? Another, possibly clearer, way of saying that is that the DNS queries from Xymon to your production DNS servers fail despite it absolutely receiving correct replies from your production DNS servers just because your DR site is unavailable?
Jamison Maxwell Jamison at newasterisk.com
To be honest, I'm not sure what the cause is 100%.
The setup we have should not have any dependencies on our DR site. Our Xymon servers at our primary site use DNS servers in our primary site. They monitor a bunch of servers at our DR site, but the dependency ends there (and all that should mean is that those servers show RED when the DR site is down).
Another bit of information: during this 5 hour outage, both of our Xymon servers went from showing properly (DR servers showing RED conn as they weren't reachable, while the servers we monitor in our primary site were up) to everything going purple in conn (and other tests). It alternated back and forth over the course of the outage (I didn't detect a regular interval for when it switched from RED to PURPLE).
On 12-04-2012 05:28, Poppy, Ben wrote:
> To be honest, I'm not sure what the cause is 100%.
Me neither.
> The setup we have should not have any dependencies on our DR site. Our Xymon servers at our primary site use DNS servers in our primary site. They monitor a bunch of servers at our DR site, but the dependency ends there (and all that should mean is that those servers show RED when the DR site is down).
Aha - but you DO have tests in each setup that check systems on the other site? Would that happen to include any DNS or NTP checks?
I suspect that you have each of your Xymon servers set up to test availability of the DNS servers on both the primary and the DR site. That could be a problem.
> Another bit of information: during this 5 hour outage, both of our Xymon servers went from showing properly (DR servers showing RED conn as they weren't reachable, while the servers we monitor in our primary site were up) to everything going purple in conn (and other tests). It alternated back and forth over the course of the outage (I didn't detect a regular interval for when it switched from RED to PURPLE).
The interesting thing is that they switch to purple, indicating that something is stalled.
I have seen something like this happen when we had a number of DNS checks in the Xymon servers, and network access to these failed (broken switch to a customer network). This caused xymon to stall on these DNS checks, and all of the network tests went purple.
I know that this is difficult to test, because obviously you cannot just cut the connection between the two sites to try it out. But you could try applying this patch, which changes the DNS lookup code to use the same kind of timeout settings as the development version. The 4.3.x versions suffer from a common misunderstanding about how the C-ARES library handles timeouts, which makes DNS timeouts take much too long.
One possible way of testing it would be if you can firewall access from e.g. your DR site Xymon server to the primary site's DNS server. If you are running Xymon on a Linux server, then "iptables" can do that for you. If your primary site DNS server is 10.1.2.3, then
    iptables -I OUTPUT 1 -d 10.1.2.3 -j DROP
    iptables -I INPUT 1 -s 10.1.2.3 -j DROP
will cause all traffic to/from this server to be dropped.
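Since the block has to be lifted again after the test, a small wrapper that pairs each insert with its matching delete may be convenient. This is a sketch under assumptions: run, block_dns and unblock_dns are hypothetical helper names, and DRY_RUN defaults to 1 so the commands are only printed. Set DRY_RUN=0 and run as root to actually apply them.

```shell
# Block and later unblock all traffic to/from one DNS server.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
DNS_IP="${DNS_IP:-10.1.2.3}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

block_dns() {
    run iptables -I OUTPUT 1 -d "$1" -j DROP
    run iptables -I INPUT 1 -s "$1" -j DROP
}

unblock_dns() {
    # -D with the same rule specification deletes the rule added by -I
    run iptables -D OUTPUT -d "$1" -j DROP
    run iptables -D INPUT -s "$1" -j DROP
}

# Example usage:
# block_dns "$DNS_IP"     # start the simulated outage
# unblock_dns "$DNS_IP"   # end it again
```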
Regards, Henrik
On 12-04-2012 07:47, Henrik Størner wrote:
> I know that this is difficult to test, because obviously you cannot just cut the connection between the two sites to try it out. But you could try applying this patch
The first part of the patch - the one for xymonnet/contest.c - is completely unrelated. You can remove this before applying if you like, but unless you have sites explicitly tested with https and SSLv2 it is harmless.
Regards, Henrik
I may have missed this in a past post, how do I apply this patch?
I do test DNS for sure on servers at our DR site (many of them). The test you suggest below, is that to simulate the purple storm? Should it essentially turn purple if I begin dropping all packets to a few DNS servers I'm testing?
Would I be able to run this same iptables on my backup xymon server in our primary site to a few servers it checks DNS against in our DR site? Should that effectively cause the purple storm?
Thanks for your help.
On Thu, 12 Apr 2012 06:27:01 +0000, "Poppy, Ben" <poppy.ben at marshfieldclinic.org> wrote:
> I may have missed this in a past post, how do I apply this patch?
Ah - ok, my developer-mind assumed everyone knows how to do that :-)
Save the attachment to /tmp/dnstimeout.patch, then:
    cd xymon-4.3.7
    patch -p0 </tmp/dnstimeout.patch
    make clean
    make
You can run "make install" afterwards, but a safer option would be to just copy the "xymon-4.3.7/xymonnet/xymonnet" binary into your Xymon "bin" directory, replacing the one that is already there.
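For the copy-the-binary route, keeping the old binary around makes it easy to roll back. A minimal sketch follows; install_xymonnet is a hypothetical helper name, and the "bin" directory path depends on how Xymon was installed.

```shell
# Replace the installed xymonnet binary with a freshly built one,
# keeping the previous binary alongside it as xymonnet.orig.
install_xymonnet() {
    new_bin="$1"     # e.g. xymon-4.3.7/xymonnet/xymonnet
    bin_dir="$2"     # your Xymon "bin" directory (path varies by install)
    [ -f "$new_bin" ] || { echo "missing $new_bin" >&2; return 1; }
    if [ -f "$bin_dir/xymonnet" ]; then
        cp -p "$bin_dir/xymonnet" "$bin_dir/xymonnet.orig" || return 1
    fi
    cp "$new_bin" "$bin_dir/xymonnet"
}

# Example usage (the second path is an assumption):
# install_xymonnet xymon-4.3.7/xymonnet/xymonnet /usr/lib/xymon/server/bin
```

To roll back, copy xymonnet.orig over xymonnet again.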
> I do test DNS for sure on servers at our DR site (many of them). The test you suggest below, is that to simulate the purple storm?
It is to simulate that your Xymon server loses connectivity to the DNS server on the primary site.
> Should it essentially turn purple if I begin dropping all packets to a few DNS servers I'm testing?
That is what I suspect, yes.
> Would I be able to run this same iptables on my backup xymon server in our primary site to a few servers it checks DNS against in our DR site? Should that effectively cause the purple storm?
What I'm trying to do is to simulate the situation you had which caused the purple storm, without actually pulling the plug and disrupting the network between the two sites. If I understand you correctly, then the purple storm happened when you lost the connection between your two datacenters. Since I suspect that this is related to DNS lookups taking a very long time with the stock 4.3.7 Xymon version, you can use iptables to just block traffic from Xymon to the DNS server(s) in the other datacenter.
Regards, Henrik
I am not sure about how BBWIN is reporting its Windows client metrics; MrBig for Windows reports file system usage in the following format:
    Filesystem   1k-blocks     Used      Avail  Capacity  Mounted
    C             15727603  8740006    6987597     55.6%  /FIXED/C
    D            277225672  6502600  270723072      2.3%  /FIXED/D

Limits:

    Drive    Yellow  Red
    D        70.0    80.0
    C        70.0    80.0
    Default  90.0    95.0
We have some clustered Windows 2008 servers that are using shared resources that are only referenced by their UNC paths - they do not have logical drive assignments. I was wondering whether this would cause any issues with the processing in this section of do_disk.c:
/*
 * Some Unix filesystem reports contain the word "Filesystem".
 * So check if there's a slash in the NT filesystem letter - if yes,
 * then it's really a Unix system after all.
 */
if ( (dsystype == DT_NT) && (*(columns[5])) && (strchr(columns[0], '/')) )
    dsystype = DT_UNIX;
Where the columns[5] would be referring to the "Mounted" column in the MrBig output, and columns[0] would be referring to the Filesystem column. The above logic is not being performed as a block step against the overall client's disk output but rather on a line by line basis of the disk output. The first occurrence matched will flip the subsequent data processing flow from Windows based to Unix based even for subsequent disk lines from the same client.
My question is whether UNC-only mounts would be represented with backslashes (\) by the BBWIN/MrBig/whatever Windows reporting modules, or whether they may end up converted to forward slashes (/) and thus falsely trigger the dsystype switch. The only effect on the subsequent processing is whether the Filesystem column (Windows OS) is used for storing the diskname (and generating the rrd file name) or the "Mounted on" column is used for the diskname (Unix OS). If the UNC representation can cause this false flip, then initial Windows disk lines that use logical drive assignments would be stored and referenced using the one column convention, and once a UNC line is encountered, it and all subsequent lines (including logical drive assignment lines) would use the other column convention. Things could get busy if the line order varies from time to time (e.g. addition/removal of a UNC resource); then a logical drive's time-stamped data line may go in at times as a "Filesystem" rrd file, and at other times as a "Mounted on" rrd file.
setupfn2("%s%s.rrd", testname, diskname);
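To make the concern above concrete, here is a minimal C sketch of the per-line check. It is not the code from do_disk.c: the struct disk_line type, the pick_disknames function and the enum values are made up for illustration, and it only models the "sticky" flip described above (once one line trips the slash test, every later line is named from the other column).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names taken from the do_disk.c excerpt; the numeric values are arbitrary. */
enum { DT_NT, DT_UNIX };

/* Hypothetical, simplified view of one disk report line:
 * filesystem ~ columns[0], mounted ~ columns[5] in the excerpt. */
struct disk_line {
    const char *filesystem;
    const char *mounted;
};

/*
 * Walk the lines in order and record which column would be used as the
 * disk name for each one. The type check runs per line, and the result
 * persists for all following lines. Returns the final system type.
 */
static int pick_disknames(const struct disk_line *lines, int n,
                          const char **names_out)
{
    int dsystype = DT_NT;

    assert(lines != NULL && names_out != NULL);
    for (int i = 0; i < n; i++) {
        /* Same shape as the excerpt: a slash in the first column flips to Unix. */
        if (dsystype == DT_NT && *lines[i].mounted &&
            strchr(lines[i].filesystem, '/'))
            dsystype = DT_UNIX;

        names_out[i] = (dsystype == DT_NT) ? lines[i].filesystem
                                           : lines[i].mounted;
    }
    return dsystype;
}
```

Fed the rows C, a guessed UNC row written as //fileserver/share (mounted as /UNC/share), and then D, this sketch names the three disks C, /UNC/share and /FIXED/D, so D changes its name depending on whether the UNC row comes before it. Whether a real client ever emits forward slashes for a UNC resource is exactly the open question.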
As I don't yet have any clients installed on these clustered Windows boxes, or UNC-only mounts on other boxes, I don't know what the "Mounted on" column output would look like for a UNC resource. If BBWIN/MrBig/whatever are not reporting on UNC resources, then that's going to be a (separate) issue for us too.
Possible fix if this is an issue:
Was wondering if the Windows clients (BBWIN/MrBig/whatever) all reliably report the last column with the "Mounted" header tag, and whether all the Unix/Linux/BSD variants report their last column with the "Mounted on" header tag. If so, then maybe a better way to handle this is to check the overall client disk msg block for the string pattern "Mounted on" instead. Change do_disk.c section:
else if (strstr(msg, "Filesystem")) dsystype = DT_NT;
else dsystype = DT_UNIX;
to
else if (strstr(msg, "Filesystem")) dsystype = DT_NT; /* This will trigger for Windows and Unix/Linux flavors */
else if (strstr(msg, "Mounted on")) dsystype = DT_UNIX; /* Assuming all unix/linux/BSD clients report with "Mounted on" and Windows clients only report with "Mounted" in their header line */
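One detail worth checking in a header-based test like this is the order of the comparisons. If both header styles contain the word "Filesystem", an else-if chain that tests "Filesystem" first never reaches the "Mounted on" branch. The sketch below is one possible reading of the idea, with the more specific string tested first. The detect_disk_format function and the enum values are hypothetical, and the premise that Unix-style clients always say "Mounted on" while Windows clients say only "Mounted" is the unverified assumption from the message above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names taken from the do_disk.c excerpt; the numeric values are arbitrary. */
enum { DT_NT, DT_UNIX };

/*
 * Classify a whole client disk message by its header text.
 * "Mounted on" is tested first because "Filesystem" may appear in both
 * Windows-style and Unix-style headers.
 */
static int detect_disk_format(const char *msg)
{
    assert(msg != NULL);

    if (strstr(msg, "Mounted on"))
        return DT_UNIX;   /* assumed Unix/Linux/BSD header */
    if (strstr(msg, "Filesystem"))
        return DT_NT;     /* "Filesystem" without "Mounted on": assumed Windows */
    return DT_UNIX;       /* same fallback as the original else branch */
}
```

Because the decision is made once for the whole message, a slash in a single data line can no longer switch the naming convention halfway through a report. The sketch searches the entire message, so a data line that happened to contain the text "Mounted on" would still be misread; limiting the search to the header line would be stricter.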
Can you make the default to testip but specify a host to use DNS?
Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373
On Thu, Apr 12, 2012 at 4:43 AM, <henrik at hswn.dk> wrote:
On Thu, 12 Apr 2012 06:27:01 +0000, "Poppy, Ben" <poppy.ben at marshfieldclinic.org> wrote:
I may have missed this in a past post, how do I apply this patch?
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
On Fri, Apr 13, 2012 at 2:53 AM, Josh Luthman <josh at imaginenetworksllc.com> wrote:
Can you make the default to testip but specify a host to use DNS?
No, but you can define .default. more than once, and the defaults will change for subsequent hosts until the next .default. (if any). So you could do:
0.0.0.0 .default. # testip dialup

1.1.1.1 server1 # ssh smtp
1.1.1.2 server2 # ssh telnet

0.0.0.0 .default. # dialup
1.1.1.3 server3 # ssh http

0.0.0.0 .default. # testip dialup
1.1.1.4 server4 # ssh rdp
1.1.1.5 server5 # ssh telnet
All hosts except server3 will get the "testip" setting.
Please note that this is just how I think it should work, and I haven't tested it.
J
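The behaviour described in this reply can be modelled in a few lines of C. This is only a sketch of the rule as stated (each .default. line replaces the defaults for the hosts that follow it), not Xymon's actual hosts.cfg parser, and like the reply itself it has not been checked against a running Xymon. The host_gets_testip function is a hypothetical name, and tags are matched by plain substring search to keep the sketch short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if `host` would end up with "testip" under the model above,
 * 0 otherwise (including when the host is not listed at all).
 * Each entry in `lines` is one hosts.cfg-style line: "IP NAME # tags".
 */
static int host_gets_testip(const char *const *lines, int n, const char *host)
{
    int default_testip = 0;   /* defaults currently in force */

    assert(lines != NULL && host != NULL);
    for (int i = 0; i < n; i++) {
        char ip[64], name[128];
        const char *tags;

        if (sscanf(lines[i], "%63s %127s", ip, name) != 2)
            continue;         /* blank or malformed line */

        tags = strchr(lines[i], '#');
        if (strcmp(name, ".default.") == 0) {
            /* A new .default. line replaces the previous defaults. */
            default_testip = (tags != NULL && strstr(tags, "testip") != NULL);
        } else if (strcmp(name, host) == 0) {
            /* The host gets testip from the defaults or from its own tags. */
            return default_testip ||
                   (tags != NULL && strstr(tags, "testip") != NULL);
        }
    }
    return 0;
}
```

Run over the eight example lines above, this model gives testip to server1, server2, server4 and server5 and not to server3, which matches the outcome the reply expects.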
I put it at the top and included a host below pages and groups.
Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373 On Apr 12, 2012 11:46 PM, "Jeremy Laidman" <jlaidman at rebel-it.com.au> wrote:
So far so good, I have been unable to reproduce the purple storm. We'll find out in a couple weeks when we do our next DR isolation test. Thanks so much for your help!
The contents of this message may contain private, protected and/or privileged information. If you received this message in error, you should destroy the e-mail message and any attachments or copies, and you are prohibited from retaining, distributing, disclosing or using any information contained within. Please contact the sender and advise of the erroneous delivery by return e-mail or telephone. Thank you for your cooperation.
Hmm, interestingly enough, I have not been able to reproduce the purple storm by cutting off communication to all 10 DC/DNS servers at our DR location.
I think I'm still going to move forward with updating to the latest stable release, as well as with the patch you provided. And if worse comes to worse, I'll kill monitoring of DR servers if/when we lose connectivity again.
Thanks for your help, it is so greatly appreciated!
-----Original Message----- From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Henrik Størner Sent: Thursday, April 12, 2012 12:47 AM To: xymon at xymon.com Subject: Re: [Xymon] Purple storm
On 12-04-2012 05:28, Poppy, Ben wrote:
To be honest, I'm not sure what the cause is 100%.
Me neither.
The setup we have, should not have any dependencies on our DR site. Our xymon servers at our primary site use DNS servers in our primary site. They monitor a bunch of servers at our DR site, but the dependency ends there (and all that should mean is the servers show RED when DR site is down).
Aha - but you DO have tests in each setup that check systems on the other site? Would that happen to include any DNS or NTP checks?
I suspect that you have each of your Xymon servers set up to test availability of the DNS servers on both the primary and the DR site. That could be a problem.
Another bit of information: during this 5 hour outage, both of our xymon servers went from showing properly (where DR servers were showing RED conn as they weren't reachable, but the servers we monitor in our primary site were up), to everything going purple in conn (and other tests). It would alternate back and forth over the course of the outage (I didn't detect a regular timeframe for when it switched from RED to PURPLE).
The interesting thing is that they switch to purple, indicating that something is stalled.
I have seen something like this happen when we had a number of DNS checks in the Xymon servers, and network access to these failed (broken switch to a customer network). This caused xymon to stall on these DNS checks, and all of the network tests went purple.
I know that this is difficult to test, because obviously you cannot just cut the connection between the two sites to try it out. But you could try applying this patch, which changes the DNS lookup code to use the same kind of timeout settings as the development version. The 4.3.x versions suffer from a common misunderstanding about how the C-ARES library handles timeouts, which makes DNS timeouts take much too long.
One possible way of testing it would be if you can firewall access from e.g. your DR site Xymon server to the primary site's DNS server. If you are running Xymon on a Linux server, then "iptables" can do that for you. If your primary site DNS server is 10.1.2.3, then
    iptables -I OUTPUT 1 -d 10.1.2.3 -j DROP
    iptables -I INPUT 1 -s 10.1.2.3 -j DROP
will cause all traffic to/from this server to be dropped.
Regards, Henrik
participants (6): henrik@hswn.dk, jamison@newasterisk.com, jlaidman@rebel-it.com.au, josh@imaginenetworksllc.com, Mark.Deiss@acs-inc.com, poppy.ben@marshfieldclinic.org