Greetings -
This evening I had the luck to experience a "purple storm" with Xymon v4.3.7, *and* I had enough time to try a few items to combat the problem. Here is what I tried:
- applied patch from Henrik dated May 2 (http://lists.xymon.com/pipermail/xymon/2012-May/034525.html):
  patch -p0 <dns.patch ; make ; make install ; service xymon stop ; service xymon start
- checked my /etc/hosts file
I do have a few non-standard Xymon configuration settings:
- am NOT using FQDN
- tasks.cfg: CMD xymonnet --report --ping --ping-tasks=4 --dns-timeout=5 --checkresponse --source-ip=130.111.135.60 --dns=ip --shuffle
- hosts.cfg: 130.111.135.60 shepherd # testip ssh bbd smtp https://shepherd.uct.usm.maine.edu/
The xymonnet column for shepherd remained purple until the DNS servers came back. It almost seemed like Xymon could not "find itself"?
Any thoughts on ways to combat this issue?
Thanks (as always) for a great product, and for any assistance you can provide.
--
Jon Dustin - Network Specialist University of Southern Maine Portland, ME 207-780-4152
On 13-05-2012 04:41, Jon Dustin wrote:
This evening I had the luck to experience a "purple storm" with Xymon v4.3.7 [snip]
The xymonnet column for shepherd remained purple until the DNS servers came back. It almost seemed like Xymon could not "find itself"?
Any thoughts on ways to combat this issue?
What's logged in your xymonnet.log file ?
Are your network tests and the Xymon website on the same server, or different servers ?
Do you have a local caching DNS server, or does your resolv.conf point to remote DNS servers ?
Regards, Henrik
On 5/13/2012 at 10:56 AM, in message <4FAFCBAE.6010809 at hswn.dk>, Henrik Størner <henrik at hswn.dk> wrote:
On 13-05-2012 04:41, Jon Dustin wrote:
This evening I had the luck to experience a "purple storm" with Xymon v4.3.7 [snip]
The xymonnet column for shepherd remained purple until the DNS servers came back. It almost seemed like Xymon could not "find itself"?
Any thoughts on ways to combat this issue?
What's logged in your xymonnet.log file ?
All I found were the following two entries:
2012-05-12 20:58:14 WARNING: Runtime 481 longer than time limit (300)
2012-05-12 22:07:20 WARNING: Runtime 767 longer than time limit (300)
Are your network tests and the Xymon website on the same server, or different servers ?
Same server, physical SLES11SP1x64, 16 GiB RAM, not overloaded.
Do you have a local caching DNS server, or does your resolv.conf point to remote DNS servers ?
resolv.conf uses two Active Directory name servers, but the majority of tests go against domains provided by another entity in my University system. When their DNS servers go south, my Xymon server starts having troubles.
Also, the TTL for our DNS records is very low (5 minutes I believe). I'm going to see if we can increase this for our server names.
Thanks for reading.
--
Jon Dustin - Network Specialist University of Southern Maine Portland, ME 207-780-4152
On 14-05-2012 04:04, Jon Dustin wrote:
What's logged in your xymonnet.log file ?
All I found were the following two entries:
2012-05-12 20:58:14 WARNING: Runtime 481 longer than time limit (300)
2012-05-12 22:07:20 WARNING: Runtime 767 longer than time limit (300)
OK, if you look at the history of "xymonnet" status column, do you have a yellow status from around that time ? If you do, then check what line takes the longest time to complete.
How many systems are you testing, btw ?
There is one thing that I know of which can trigger this: xymonnet relies on two external tools (ntpdate and rpcinfo) for checking NTP-servers and RPC services. I know from personal experience that a failed NTP server can cause ntpdate to hang for a very long time, and this can block xymonnet from completing the test cycle.
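As an illustration of the failure mode described above, here is a minimal Python sketch of running an external tool under a hard time limit so that a hung process cannot stall the whole cycle. It is not how xymonnet itself invokes ntpdate; the helper name `run_with_timeout` and the timeout values are assumptions made for the example.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exit_code, output).

    Returns (None, "") if the command did not finish within timeout_s
    seconds; subprocess.run kills the child process in that case.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return None, ""


if __name__ == "__main__":
    # Stand-in for a hung external tool: a child process that sleeps 5 s,
    # run here with a 1 s limit.
    code, _ = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(5)"], 1.0
    )
    print("timed out" if code is None else f"exit code {code}")
```

The same wrapper could be pointed at a real command such as `["ntpdate", "-q", "ntp.example.com"]` (the hostname is a placeholder), treating a `None` exit code as a failed test rather than waiting indefinitely.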
Regards, Henrik
On 5/14/2012 at 1:32 AM, in message <4FB098DB.1000808 at hswn.dk>, Henrik Størner <henrik at hswn.dk> wrote:
On 14-05-2012 04:04, Jon Dustin wrote:
What's logged in your xymonnet.log file ?
All I found were the following two entries:
2012-05-12 20:58:14 WARNING: Runtime 481 longer than time limit (300)
2012-05-12 22:07:20 WARNING: Runtime 767 longer than time limit (300)
OK, if you look at the history of "xymonnet" status column, do you have a yellow status from around that time ? If you do, then check what line takes the longest time to complete.
Yes, I DO have a yellow test result (481 seconds), and it looks like LDAP was the culprit!
Event                                    Timestamp        Duration (s)
DNS lookups completed                    4791966.865311   17.502010
Test engine setup completed              4791966.870284   0.004972
TCP tests completed                      4791978.812050   11.941766
PING test completed (604 hosts)          4791979.652874   0.840824
PING test results sent                   4791979.656317   0.003442
Test result collection completed         4791979.656625   0.000307
LDAP test engine setup completed         4791979.656705   0.000080
LDAP tests executed                      4792364.927759   385.271054
LDAP tests result collection completed   4792364.927760   0.000000
DNS tests executed                       4792429.956221   65.028460
These test times were *before* I added your DNS patch to Xymon.
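For longer timing reports, picking out the slowest step by eye gets tedious. The following Python sketch parses timing text of the kind shown above and reports the step with the largest duration. The regular expression and the helper names (`parse_timings`, `slowest_step`) are assumptions for illustration, and the sample holds only four entries copied from this post.

```python
import re

# An event label, a timestamp, then a duration. The \s+ separators also match
# newlines, so the duration may sit on the same line or on the following one.
TIMING_RE = re.compile(r"([A-Za-z][^\n]*?)\s+(\d+\.\d+)\s+(\d+\.\d+)")

SAMPLE = """\
DNS lookups completed 4791966.865311
17.502010
TCP tests completed 4791978.812050
11.941766
LDAP tests executed 4792364.927759
385.271054
DNS tests executed 4792429.956221
65.028460
"""


def parse_timings(text):
    """Return a list of (label, timestamp, duration) tuples found in text."""
    return [
        (m.group(1).strip(), float(m.group(2)), float(m.group(3)))
        for m in TIMING_RE.finditer(text)
    ]


def slowest_step(text):
    """Return (label, duration) for the step with the largest duration."""
    label, _, duration = max(parse_timings(text), key=lambda row: row[2])
    return label, duration


if __name__ == "__main__":
    label, duration = slowest_step(SAMPLE)
    print(f"{label}: {duration:.6f} s")
```

On the sample entries this singles out "LDAP tests executed" at 385.271054 seconds, matching the conclusion reached by reading the report.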
How many systems are you testing, btw ?
726 hosts in the configuration report
There is one thing that I know of which can trigger this: xymonnet relies on two external tools (ntpdate and rpcinfo) for checking NTP-servers and RPC services. I know from personal experience that a failed NTP server can cause ntpdate to hang for a very long time, and this can block xymonnet from completing the test cycle.
I DO have a few NTP servers (and a couple of them were the failed DNS servers). No RPC tests however.
Thanks for reading.
--
Jon Dustin - Network Specialist University of Southern Maine Portland, ME 207-780-4152
On Mon, 14 May 2012 06:35:08 -0400, "Jon Dustin" <jdustin at usm.maine.edu> wrote:
Yes, I DO have a yellow test result (481 seconds), and it looks like LDAP was the culprit!
LDAP tests executed ... 385.271054
Makes sense, really. LDAP tests use whatever LDAP library your system has, and Xymon currently has very little control over timeout handling once it hands over control to the LDAP library. Some libraries implement a timeout setting - but only for queries, not when connecting to the server.
Newer OpenLDAP libraries have real timeout handling, but Xymon 4.x hasn't been modified to use it. Will do in 5.x.
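One way to picture the missing connect timeout: a quick TCP probe with an explicit time limit, done before any LDAP library gets control. The Python sketch below illustrates that idea only; it is not what Xymon 4.x or 5.x does, and the helper name `tcp_reachable` and the 5-second limit are assumptions.

```python
import socket


def tcp_reachable(host, port, timeout_s):
    """Return True if a TCP connection to (host, port) opens within timeout_s.

    The timeout bounds the connect attempt only. If host is a name rather
    than an IP address, the name lookup performed first is not bounded by it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers refused connections, unreachable hosts, timeouts
        return False


if __name__ == "__main__":
    # 389 is the standard LDAP port; localhost is only a placeholder target.
    print(tcp_reachable("127.0.0.1", 389, 5.0))
```

A probe like this would let a test be marked failed after a few seconds instead of blocking for minutes. Passing an IP address rather than a hostname keeps a DNS outage from adding its own delay.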
Regards, Henrik