Xymon flapping: network slowness reality or delusion?
Hi, All ...
The other day, our Xymon (4.3.3) started sending out notifications due to flapping on various hosts and various network-based tests, which lasted for a rather sharply-defined period. It caused a fair bit of angst and I was on the hot-seat to prove Xymon was functioning properly.
Here are some of the summary facts:
- The flapping is pretty well documented in Xymon as occurring due to connection times exceeding our 10-second threshold - most of the [...], as configured in tasks.cfg:

      CMD xymonnet --report --ping --checkresponse \
          --timeout=10 --dns-timeout=2 \
          --dnslog=/var/log/xymon-4.3.3/dns.log \
          --concurrency=5
      INTERVAL 3m

- Output from the "xymonnet" report (current, not captured during the "storm") shows:

      xymonnet version 4.3.3
      SSL library : OpenSSL 0.9.8l 5 Nov 2009
      LDAP library: OpenLDAP 20416

      Statistics:
       Hosts total           : 2081
       Hosts with no tests   : 2
       Total test count      : 2864
       Status messages       : 2856
       Alert status msgs     : 0
       Transmissions         : 30

      DNS statistics:
       # hostnames resolved  : 3337
       # succesful           : 921
       # failed              : 1266
       # calls to dnsresolve : 2850

      TCP test statistics:
       # TCP tests total     : 1769
       # HTTP tests          : 1244
       # Simple TCP tests    : 525
       # Connection attempts : 1767
       # bytes written       : 235845
       # bytes read          : 2514747

      TIME SPENT
      Event                                    Start time         Duration
      xymonnet startup                         1040654.310651            -
      Service definitions loaded               1040654.319152     0.008501
      Tests loaded                             1040655.696733     1.377581
      DNS lookups completed                    1040656.213268     0.516534
      Test engine setup completed              1040657.416739     1.203470
      TCP tests completed                      1040675.444183    18.027443
      PING test completed (923 hosts)          1040699.991467    24.547283
      PING test results sent                   1040700.080247     0.088780
      Test result collection completed         1040700.144033     0.063785
      LDAP test engine setup completed         1040700.152852     0.008819
      LDAP tests executed                      1040700.360821     0.207968
      LDAP tests result collection completed   1040700.360829     0.000007
      DNS tests executed                       1040700.441820     0.080991
      NTP tests executed                       1040722.413523    21.971702
      Test results transmitted                 1040723.295458     0.881935
      xymonnet completed                       1040723.313935     0.018476
      TIME TOTAL                                                 69.003284

- Rather sharply defined start-up / cut-off for the "storm": I can point to the 5-minute segment when it started / stopped.
- The Xymon server OS/NIC hardware checks out diagnostically.
- According to our network team's records, the network connection bandwidth utilization coming in / out of the Xymon server was < 1% capacity (i.e. we have lots of bandwidth).
- According to our network team there was no significant loss of packets or congestion at the switch level (there's only one hop between the Xymon server and the rest of the hosts).
- The types of services affected seemed pretty random: mostly HTTP tests, but LOTs of SSH/ping/NTP/LDAP, etc. as well.
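For anyone wanting to compare a "normal" run like the one above against a run captured during a storm, here is a minimal Python sketch that pulls the per-phase durations out of the TIME SPENT section and ranks them. The figures are copied from the report quoted above; the function names (parse_time_spent, slowest) and the regular expression are illustrative assumptions on my part, not part of Xymon.

```python
import re

# TIME SPENT rows copied from the xymonnet --report output quoted above.
REPORT = """\
xymonnet startup 1040654.310651 -
Service definitions loaded 1040654.319152 0.008501
Tests loaded 1040655.696733 1.377581
DNS lookups completed 1040656.213268 0.516534
Test engine setup completed 1040657.416739 1.203470
TCP tests completed 1040675.444183 18.027443
PING test completed (923 hosts) 1040699.991467 24.547283
PING test results sent 1040700.080247 0.088780
Test result collection completed 1040700.144033 0.063785
LDAP test engine setup completed 1040700.152852 0.008819
LDAP tests executed 1040700.360821 0.207968
LDAP tests result collection completed 1040700.360829 0.000007
DNS tests executed 1040700.441820 0.080991
NTP tests executed 1040722.413523 21.971702
Test results transmitted 1040723.295458 0.881935
xymonnet completed 1040723.313935 0.018476
"""

REPORTED_TOTAL = 69.003284  # "TIME TOTAL" line of the same report

# Event name, then a start timestamp, then a duration (or "-" for the first row).
ROW = re.compile(r"^(?P<event>.+?)\s+(?P<start>\d+\.\d+)\s+(?P<duration>\d+\.\d+|-)$")


def parse_time_spent(text):
    """Return [(event, duration_seconds), ...] for every row that parses."""
    phases = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # skip headers, blank lines, and the TIME TOTAL line
        duration = match.group("duration")
        phases.append((match.group("event"), 0.0 if duration == "-" else float(duration)))
    return phases


def slowest(phases, n=3):
    """Return the n phases with the longest durations, longest first."""
    return sorted(phases, key=lambda phase: phase[1], reverse=True)[:n]


if __name__ == "__main__":
    phases = parse_time_spent(REPORT)
    total = sum(duration for _, duration in phases)
    print(f"sum of phases: {total:.3f}s (report says {REPORTED_TOTAL:.3f}s)")
    for event, duration in slowest(phases):
        print(f"{event:<35} {duration:8.3f}s {100 * duration / total:5.1f}%")
```

In this particular run the ping, NTP and TCP phases together account for about 64.5 of the 69 seconds, and the whole run fits comfortably inside the 3m INTERVAL. Running the same parse over a report saved during the storm would show which phase ballooned.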
Any initial thoughts?
Thanks!
david
David Mills
Systems Administrator
Northrop Grumman
512-595-1238
david.mills at ngc.com
Assuming you're saving status results in history (the default), can you look at the status messages from the down periods? Were they DNS timeouts or timeout timeouts? I'd start with the ping checks, since that's pretty cut-and-dried...
- Has anything like this occurred before?
- Even if no threshold was crossed on the Xymon server itself, take a look at the 'trends' page for the polling host for that period and see if anything unusual happened around the same time?
HTH, -jc
-----Original Message-----
From: cleaver at terabithia.org [mailto:cleaver at terabithia.org]
Sent: Friday, March 15, 2013 1:31 PM
To: Mills, David (IS)
Cc: xymon at xymon.com
Subject: EXT :Re: [Xymon] Xymon flapping: network slowness reality or delusion?
<snip>
== Thanks! After poking around on the Xymonnet history dumps, I found some very interesting stuff I don't know what to make of:
- For the top 20 worst times in a 24 hour period, the three categories of networking that had significantly elevated levels were "TCP tests completed", "DNS tests executed" and "NTP tests executed".
- Oddly, after graphing the respective times for these categories in a spreadsheet, it became obvious that the DNS and TCP tests were roughly inversions of each other: when one was super-high, the other would go low.
- Even weirder, the PING tests were ... NORMAL!! While the rest of the Xymon network tests were jumping off a cliff, good old 'ping' was chugging along without (mostly) mishap. This last datum seems to blow a hole in the theory that this is truly a network problem (vs. a Xymon server/host problem).
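The "inversions of each other" observation can be checked numerically rather than by eye. Below is a minimal Python sketch that picks the worst runs by total time and computes a Pearson correlation between two phase-duration series. The sample numbers are made up purely for illustration (they are not the measurements from this thread), and the names pearson, worst_runs and SAMPLE_RUNS are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def worst_runs(runs, n=20):
    """Return the n runs with the largest total duration, worst first."""
    return sorted(runs, key=lambda run: run["total"], reverse=True)[:n]


# HYPOTHETICAL per-run phase durations in seconds, invented for illustration only.
SAMPLE_RUNS = [
    {"tcp": 18.0, "dns": 60.0, "ping": 24.5},
    {"tcp": 95.0, "dns": 0.1, "ping": 24.7},
    {"tcp": 20.0, "dns": 55.0, "ping": 24.4},
    {"tcp": 110.0, "dns": 0.2, "ping": 24.6},
    {"tcp": 19.0, "dns": 58.0, "ping": 24.5},
]
for run in SAMPLE_RUNS:
    run["total"] = run["tcp"] + run["dns"] + run["ping"]

if __name__ == "__main__":
    worst = worst_runs(SAMPLE_RUNS, n=3)
    print("worst totals:", [round(run["total"], 1) for run in worst])
    tcp = [run["tcp"] for run in SAMPLE_RUNS]
    dns = [run["dns"] for run in SAMPLE_RUNS]
    print("TCP vs DNS correlation:", round(pearson(tcp, dns), 3))
```

A coefficient near -1 would back up the "when one was super-high, the other would go low" reading; a value near 0 would suggest the pattern in the spreadsheet graph is less systematic than it looks.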
Any other thoughts?
david