On Tue, February 16, 2016 1:44 am, L-M-J wrote:
Hi,
I'm still running into troubles every night between ~0h30 and ~2h40 :-(
- I checked the backup on my physical Xymon server: it starts around 9pm and runs for 4:45 min.
- We cross-monitored the DNS server from another monitoring tool : no DNS outage detected.
- I monitored the Xymon server network link state with "mii-tool" every second: no trouble detected.
- I pinged my Xymon servers from 2 different network locations all night long: no trouble detected.
- No firewalls between my Xymon server and the monitored hosts
- Out of 500 hosts, only ~30 are in trouble every night, and mostly the same ones
- The affected hosts are VMs, physical servers, and public internet websites
Here is what I've found in the xymond.log today:

2016-02-16 02:02:57 Flapping detected for www.foo1.com:http - 5 changes in 1708 seconds
2016-02-16 02:02:57 Flapping detected for www.foo2.com:http - 5 changes in 1708 seconds
2016-02-16 02:02:57 Flapping detected for www.microsoft.com:http - 5 changes in 1708 seconds
2016-02-16 02:06:14 Flapping detected for server01:http - 5 changes in 1678 seconds
2016-02-16 02:06:14 Flapping detected for server02:http - 5 changes in 1678 seconds
2016-02-16 02:06:29 Flapping detected for server03:conn - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server04:ldap - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server06:ssh - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server05:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server07:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server08:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server09:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for foo.bar1.com:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for foo.bar2.com:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for foo.bar3.fr:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server10:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server11-t:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server12:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server13:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server14:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server15:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server16:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server17:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server18:http - 5 changes in 1745 seconds
2016-02-16 02:07:21 Flapping detected for server19:http - 5 changes in 1745 seconds
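
Flapping lines like these can be tallied to see which tests are affected and roughly when the trouble began. The following is only a minimal sketch: it assumes the exact line format shown in the excerpt above, and the names parse_flaps and summarize are made up for illustration (they are not part of Xymon). The start of each flapping window is estimated as the detection time minus the reported number of seconds.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed line format, taken from the xymond.log excerpt in this thread:
# 2016-02-16 02:07:21 Flapping detected for server04:ldap - 5 changes in 1745 seconds
FLAP_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Flapping detected for "
    r"(?P<host>\S+):(?P<test>\S+) - (?P<changes>\d+) changes in (?P<secs>\d+) seconds"
)

def parse_flaps(lines):
    """Return one dict per flapping line; lines that do not match are skipped."""
    events = []
    for line in lines:
        m = FLAP_RE.search(line)
        if not m:
            continue
        detected = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        events.append({
            "host": m.group("host"),
            "test": m.group("test"),
            "detected": detected,
            # Estimated time of the first status change in the flapping window.
            "window_start": detected - timedelta(seconds=int(m.group("secs"))),
        })
    return events

def summarize(events):
    """Count events per test name and find the earliest estimated window start."""
    by_test = Counter(e["test"] for e in events)
    earliest = min((e["window_start"] for e in events), default=None)
    return by_test, earliest

# Usage with three lines copied from the excerpt above.
sample = [
    "2016-02-16 02:02:57 Flapping detected for www.foo1.com:http - 5 changes in 1708 seconds",
    "2016-02-16 02:07:21 Flapping detected for server04:ldap - 5 changes in 1745 seconds",
    "2016-02-16 02:07:21 Flapping detected for server05:http - 5 changes in 1745 seconds",
]
by_test, earliest = summarize(parse_flaps(sample))
print(by_test, earliest)
```

To run it against a real log, pass an open file object to parse_flaps instead of the sample list.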
Here is a part of the configuration + errors displayed in the Xymon HTTP interface:

hosts.cfg:
  0.0.0.0 server03 # conn NAME:"server03" DESCR:"VM FOO BAR"
Error:
  conn NOT ok : DNS lookup failed / Unable to resolve hostname server03
  System unreachable for 2 poll periods (86 seconds)
Everything looks as if DNS resolution failed.
hosts.cfg:
  10.X.Y.188 server05 # conn tse NAME:"Server 05" DESCR:"My comment" http://server05/
Error:
  DNS error
  red http://server05/ - DNS error
- Why do I have a "DNS error" here? I set the IP for this host yesterday to work around the issue. The "conn" error has disappeared since yesterday evening, but the http error still remains.
All signs do point to an issue with DNS resolution here.
Was this a custom compile or are you using a package? If custom, what version of c-ares is on your system? That's the underlying resolution library that xymonnet is using by default to handle DNS lookups. The fact that the 'conn' test remained good after you added the local hosts entry matches that, since HTTP tests are performed using their own secondary DNS lookup (to deal with vhosts, etc) unless the IP is specified there as well.
Xymon otherwise does not cache DNS records or anything else when it comes to network polling like this, since xymonnet is a brand new execution for each run.
Try adding the '--dnslog=' option to xymonnet during this period to get a log of exactly what's happening with DNS resolution, and --debug as well (but just once or twice). You can also try testing with '--no-ares'; however, the system resolver is (normally) much slower and less predictable than c-ares.
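
As a cross-check that is independent of xymonnet, a small probe can resolve a few of the affected names through the system resolver (not c-ares) during the problem window and log how long each lookup takes. This is only a sketch under assumptions: probe and run are hypothetical helper names, and the host list, interval and number of rounds in the usage note are placeholders to adapt.

```python
import socket
import time

def probe(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname once and return (ok, elapsed_seconds, detail).

    resolver is injectable so the function can be exercised without a network;
    by default it goes through the system resolver via socket.getaddrinfo.
    """
    start = time.monotonic()
    try:
        results = resolver(hostname, None)
        # Each result is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string for both IPv4 and IPv6.
        addresses = sorted({r[4][0] for r in results})
        return True, time.monotonic() - start, ",".join(addresses)
    except socket.gaierror as exc:
        return False, time.monotonic() - start, str(exc)

def run(hostnames, interval=30, rounds=1, out=print):
    """Probe every hostname once per round, sleeping interval seconds between rounds."""
    for i in range(rounds):
        for name in hostnames:
            ok, elapsed, detail = probe(name)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            status = "OK" if ok else "FAIL"
            out(f"{stamp} {name} {status} {elapsed:.3f}s {detail}")
        if i + 1 < rounds:
            time.sleep(interval)
```

For example, run(["server03", "server05"], interval=30, rounds=300) started shortly after midnight would cover the whole window. If this log shows failures or slow lookups at the same times as the flapping, the problem is likely in the resolver path rather than in xymonnet itself.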
It may also help to lower your --concurrency=N setting to something below the system default (which will typically be 256).
There's clearly *something* going on that's specific to that period, but signs do point to something more on the host. This is especially true if you add a local DNS cache and you're still seeing the problem.
HTH, -jc