We recently had a major change in our network's design. The entire topology changed, including the IP addresses of servers and the "routes" through switches between things. Ever since then, Xymon has been reporting very brief (10-20 seconds) outages of one server or another every 30-60 minutes.
I tried this suggestion: http://xymon.sourceforge.net/docs/known-issues.html#netfail
No luck. A few minutes later, the server Xymon is running on allegedly failed the "conn" test for about 1 second.
Any other ideas? If you want to see the symptoms, look at my Xymon instance at http://cns.cairodurham.org/hobbit/bb2.html. This will show the brief outages that I'm talking about.
Thanks in advance, Jaime Kikpole
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
I had that same issue (with conn tests). The server was on a mini itx board with a VIA CPU. I moved it to our ESXi box and haven't seen it since.
Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373
To unsubscribe from the xymon list, send an e-mail to xymon-unsubscribe at xymon.com
I would do ping tests from your Xymon servers to your other servers first to make sure packets aren't being dropped. You might have some hardware issues you don't know about after the IP change.
Jason Chambers IT Help Desk Associate
GEOSOFT INC. freedom to explore T +1 416.369.0111 #344 F +1 416.369.9599
Visit our site at www.geosoft.com
-----Original Message----- From: Jaime Kikpole [mailto:jkikpole at cairodurham.org] Sent: November-18-10 11:59 AM To: xymon at xymon.com Subject: [xymon] Brief red alarms
We recently had a major change in our network's design. The entire topology had the change, including IP addresses of servers and the "routes" through switches between things. Ever since then, Xymon is reporting very brief (10-20 seconds) outages of one server or another every 30-60 minutes.
I tried this suggestion: http://xymon.sourceforge.net/docs/known-issues.html#netfail
No luck. A few minutes later, the server Xymon is running on allegedly failed the "conn" test for about 1 second.
Any other ideas? If you want to see the symptoms, look at my Xymon instance at http://cns.cairodurham.org/hobbit/bb2.html. This will show the brief outages that I'm talking about.
Thanks in advance, Jaime Kikpole
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
Hobbit may not be the issue at all here. You could test this by writing a small script to do a constant ping of host $foo and record the results to a file. You might see similar patterns outside of hobbit.
Looking at the current summary page it appears that these hosts are more problematic. Yet, looking at a report (short version attached) for Nov 1 to Nov 17 there are clearly quite a few more.
2 10.1.0.32
2 10.1.0.40
2 10.1.0.73
2 10.1.0.92
2 10.3.23.242
4 163.153.65.139
4 atlas.cairodurham.org
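The constant-ping idea above could be sketched roughly as follows. This is only an illustration, not anything shipped with Xymon: the function names log_ping and watch_host and the PING_CMD variable are made up for the example, and it assumes a ping that accepts "-c 1" (timeout flags differ between platforms, so none are used).

```shell
#!/bin/sh
# Hypothetical helper, not part of Xymon: write one timestamped line per probe.
# PING_CMD can be overridden, e.g. to try the loop without touching the network.
PING_CMD=${PING_CMD:-"ping -c 1"}

# log_ping HOST LOGFILE: probe HOST once and append "date time host ok|FAIL".
log_ping() {
    if $PING_CMD "$1" >/dev/null 2>&1; then
        status=ok
    else
        status=FAIL
    fi
    printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$status" >> "$2"
}

# watch_host HOST LOGFILE DELAY COUNT: run COUNT probes, DELAY seconds apart.
watch_host() {
    i=0
    while [ "$i" -lt "$4" ]; do
        log_ping "$1" "$2"
        i=$((i + 1))
        if [ "$i" -lt "$4" ]; then
            sleep "$3"
        fi
    done
    return 0
}
```

Something like `watch_host 10.1.0.73 /tmp/ping-10.1.0.73.log 1 3600` would then cover about an hour, and any FAIL lines could be lined up against the times of the red alarms in Xymon.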
On Thu, Nov 18, 2010 at 12:39 PM, Tim McCloskey <tm at freedom.com> wrote:
What makes these three hosts different?
Well, for one, they didn't exist until recently. :)
We just had a major topology change. Every server and switch is on a new IP address and nearly every switch was replaced with new hardware. New subnets exist, too. So this really has to be seen from a perspective of setting up Xymon for the first time in the last 2 days and ignoring older data all together.
I'm running an extended ping test to 10.1.0.73 now to see if we have intermittent issues with network traffic.
Thanks, Jaime
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
On Thu, Nov 18, 2010 at 12:53 PM, Jaime Kikpole <jkikpole at cairodurham.org> wrote:
I'm running an extended ping test to 10.1.0.73 now to see if we have intermittent issues with network traffic.
For what it's worth:
^C
--- 10.1.0.73 ping statistics ---
528 packets transmitted, 528 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.621/2.458/52.160/5.032 ms
Any thoughts?
Jaime
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
Check your DNS server(s). Chances are that one or more systems have an entry for the old IP addresses. A continuous ping won't show this, as address resolution occurs just once.
Tom
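Because a long-running ping resolves the name only once, a stale entry is easier to catch by resolving each name afresh and comparing it with the address it is supposed to have. A minimal sketch, assuming getent(1) is available on the system being checked (host(1) or dig(1) could be substituted); the function names and the "name expected-ip" file layout are invented for this example.

```shell
#!/bin/sh
# resolve NAME: print the first address the system resolver returns.
# Assumption: getent(1) exists here; swap in host(1) or dig(1) if it does not.
resolve() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# check_dns FILE: FILE holds "name expected-ip" pairs, one per line.
# Prints one line per mismatch and returns 1 if any mismatch was found.
check_dns() {
    bad=0
    while read -r name expected; do
        [ -n "$name" ] || continue
        actual=$(resolve "$name")
        if [ "$actual" != "$expected" ]; then
            echo "MISMATCH $name: expected $expected, got ${actual:-nothing}"
            bad=1
        fi
    done < "$1"
    return "$bad"
}
```

Running the same check on several machines (the Xymon server and a few clients) would show whether any of them still resolves a name to a pre-change address. Note that getent also consults local files such as /etc/hosts, which is another place an old address can linger.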
On Thu, Nov 18, 2010 at 1:17 PM, Tom Kauffman <tommyk66 at newsguy.com> wrote:
Check your DNS server(s). Chances are that one or more systems have an entry for the old IP addresses. A continuous ping won't show this, as address resolution occurs just once.
Just checked. Nothing out of place in DNS (forward or reverse.)
Jaime
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
You've probably already done the obvious.... First, I'd make sure that all the host and switch ports have proper media settings (speed/duplex/autoneg or not). While it may sound odd (for your new IP network), I would also make sure that the ARP caches across the environment get flushed.
What interval do you have in bbtest-net?
server/etc/hobbitlaunch.cfg
[bbnet]
        ENVFILE server/etc/hobbitserver.cfg
        NEEDS hobbitd
        CMD bbtest-net --report --ping --checkresponse
        LOGFILE $BBSERVERLOGS/bb-network.log
        INTERVAL 5m
Tim
On Thu, Nov 18, 2010 at 1:21 PM, Tim McCloskey <tm at freedom.com> wrote:
What interval do you have in bbtest-net?
I had the defaults. I changed it to --concurrency=100, but that didn't help. Neither did --concurrency=50.
Jaime
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
[bbnet]
        ENVFILE server/etc/hobbitserver.cfg
        NEEDS hobbitd
        CMD bbtest-net --report --ping --checkresponse
        LOGFILE $BBSERVERLOGS/bb-network.log
        INTERVAL 5m
I'm using 4.2.0, so maybe the setting above is different in 4.2.3...? Too short a test interval can result in something like what you are seeing: the server might not be able to ping all of the hosts in under nn time. If the network is fine, then the next question is whether this is a Xymon install that worked before the new IPs etc....? Or is this a brand new install? If it is new, and the network is fine, try reducing the INTERVAL to something like 5 minutes and let it cook for a day or so.
Regards,
Tim
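One rough way to judge whether a test pass can finish inside the configured interval is to time a pass by hand. The sketch below is hypothetical (sweep_seconds, fits_interval and PROBE_CMD are names made up here) and probes hosts one after another, whereas the real network test pings in parallel, so the figure it gives is a pessimistic upper bound rather than what bbtest-net actually takes. It also assumes a date that supports "+%s".

```shell
#!/bin/sh
# Hypothetical sketch: time one serial ping pass over a list of hosts.
# PROBE_CMD can be overridden; "ping -c 1" is assumed to exist.
PROBE_CMD=${PROBE_CMD:-"ping -c 1"}

# sweep_seconds FILE: probe every host in FILE (one per line) and print
# the elapsed wall-clock time in whole seconds.
sweep_seconds() {
    start=$(date +%s)
    while read -r host; do
        [ -n "$host" ] || continue
        # A failed probe is not an error here; only the elapsed time matters.
        $PROBE_CMD "$host" </dev/null >/dev/null 2>&1 || true
    done < "$1"
    end=$(date +%s)
    echo $((end - start))
}

# fits_interval SECONDS INTERVAL: succeed if the pass took less than INTERVAL.
fits_interval() {
    [ "$1" -lt "$2" ]
}
```

With "INTERVAL 5m", something like `fits_interval "$(sweep_seconds hosts.txt)" 300` says whether even a serial pass would fit; if it does, the interval is unlikely to be the cause.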
On Thu, Nov 18, 2010 at 1:45 PM, Tim McCloskey <tm at freedom.com> wrote:
If the network is fine then the next question is this a xymon install that worked before the new IP's etc....? Or is this a brand new install? If it is new, and the network is fine, try reducing the INTERVAL to something like 5 minutes and let it cook for a day or so.
It's an existing install. We changed the IPs and all the switches. In theory, the bandwidth for all links is the same or higher than before: 100Mbps to 10Gbps, depending on the link. This behavior started after the changes, though.
For what it's worth, we have been using "INTERVAL 5m" in the bbnet section of hobbitlaunch.cfg since I first installed Xymon.
Any thoughts?
Thanks, Jaime
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
I have seen this kind of behavior when a single bad route keeps getting reintroduced to the network routers (generally from a single router). When the bad route hits the router that the ping reply tries to return through, the ping fails.
Just a thought on something you might check into.
....Bruce
Bruce White Senior Enterprise Systems Engineer | Phone: 630-671-5169 | Fax: 630-893-1648 | bewhite at fellowes.com | http://www.fellowes.com/
Also, are you running fping? If you are running the default "hobbitping", all bets are off on the pings actually being accurate.
.....Bruce
Bruce White Senior Enterprise Systems Engineer | Phone: 630-671-5169 | Fax: 630-893-1648 | bewhite at fellowes.com | http://www.fellowes.com/
On Wed, Nov 24, 2010 at 11:06 AM, White, Bruce <bewhite at fellowes.com> wrote:
Also, are you running fping? If you are running the default "hobbitping", all bets are off on the pings actually being accurate.
Taking a quick look:
atlas:etc>grep fping *.cfg
hobbitserver.cfg:# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
atlas:etc>grep hobbitping *.cfg
hobbitserver.cfg:FPING="hobbitping" # Path and options for the ping program.
If I'm reading this right, I'm using hobbitping and not fping. The odd thing is that this symptom did not exist before the equipment and topology changes. So I'll be checking on my "route:..." statements and any routing protocols (like RIP) in the new equipment. Either of these ideas might help.
Thanks!
Jaime
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
A coincidence just led me to a log file with lines like these:
+pid 44909 (hobbitping), uid 280: exited on signal 11
+pid 45111 (hobbitping), uid 280: exited on signal 11
+pid 46426 (hobbitping), uid 280: exited on signal 11
I wonder if there is something else at play here.
The Unix box that I installed Xymon onto had its IP change at the same time as the network changes. Is it possible that there was a side effect from the IP change?
Thanks, Jaime
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
In <AANLkTik=m+iEg3MvV-k4cKXSit8hS91CTOr06GP9UP1s at mail.gmail.com> Jaime Kikpole <jkikpole at cairodurham.org> writes:
A coincidence just led me to a log file with lines like these:
+pid 44909 (hobbitping), uid 280: exited on signal 11
+pid 45111 (hobbitping), uid 280: exited on signal 11
+pid 46426 (hobbitping), uid 280: exited on signal 11
Signal 11 is SEGV, so this obviously should not be happening. Which version is this (sorry if this is in a previous message)? There were some fixes for hobbitping in beta-3.
It would be nice if you could try fping - then at least we could narrow down the problem.
The Unix box that I installed Xymon onto had its IP change at the same time as the network changes. Is it possible that there was a side effect from the IP change?
It shouldn't matter, unless you have another box with the same IP on your network. And then you would have much bigger problems, I think.
Regards, Henrik
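To see how often these crashes happen, and whether anything other than hobbitping is affected, the log can be tallied per program and signal. A small sketch, assuming only the line shape quoted in the thread; the function name count_signal_exits is invented for the example, and the location of the log file is not given in the thread, so it is passed in as an argument.

```shell
#!/bin/sh
# count_signal_exits FILE: tally "exited on signal N" lines per program.
# Assumes lines shaped like the ones quoted in the thread:
#   pid 44909 (hobbitping), uid 280: exited on signal 11
# Output: "<program> signal <N> <count>", one line per program/signal pair.
count_signal_exits() {
    sed -n 's/.*pid [0-9]* (\([^)]*\)), uid [0-9]*: exited on signal \([0-9]*\).*/\1 signal \2/p' "$1" |
        sort | uniq -c | awk '{ print $2, $3, $4, $1 }'
}
```

For the three lines quoted above this prints `hobbitping signal 11 3`. Comparing the timestamps of such entries with the brief red alarms would show whether the two line up.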
On Mon, Nov 29, 2010 at 11:55 AM, Henrik Størner <henrik at hswn.dk> wrote:
Signal 11 is SEGV, so this obviously should not be happening. Which version is this (sorry if this is in a previous message)? There were some fixes for hobbitping in beta-3.
I'm running 4.2.3. Sorry.
It would be nice if you could try fping - then at least we could narrow down the problem.
Is that just a matter of putting "fping" in the FPING variable of hobbitserver.cfg?
The Unix box that I installed Xymon onto had its IP change at the same time as the network changes. Is it possible that there was a side effect from the IP change?
It shouldn't matter, unless you have another box with the same IP on your network. And then you would have much bigger problems, I think.
Thanks.
Jaime
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
On Mon, 29 Nov 2010 12:04:26 -0500, Jaime Kikpole wrote:
On Mon, Nov 29, 2010 at 11:55 AM, Henrik Størner <henrik at hswn.dk> wrote:
Signal 11 is SEGV, so this obviously should not be happening. Which version is this (sorry if this is in a previous message)? There were some fixes for hobbitping in beta-3.
I'm running 4.2.3. Sorry.
Nothing to be sorry for. Like I said, there were some fixes to hobbitping done in the 4.3.0 beta-3 release, so hopefully it shouldn't segfault now.
It would be nice if you could try fping - then at least we could narrow down the problem.
Is that just a matter of putting "fping" in the FPING variable of hobbitserver.cfg?
Yes.
Regards, Henrik
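For reference, the change being confirmed here amounts to a one-line edit in hobbitserver.cfg. The fragment below is a sketch based on the FPING line shown earlier in the thread; it assumes fping is installed in a directory already listed in the PATH setting of the same file. fping normally has to be installed setuid root to send ICMP, so that is worth checking if the conn tests all turn red afterwards.

```shell
# hobbitserver.cfg (fragment)
# Previous setting, using the bundled ping tool:
#   FPING="hobbitping"
# Replacement, using the external fping binary found via PATH:
FPING="fping"    # Path and options for the ping program.
```

The new value presumably takes effect once the network tester is started again; restarting Xymon is the simple way to be sure.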
On Mon, Nov 29, 2010 at 4:38 PM, Henrik Størner <henrik at hswn.dk> wrote:
Nothing to be sorry for. Like I said, there were some fixes to hobbitping done in the 4.3.0 beta-3 release, so hopefully it shouldn't segfault now.
Good to know. When I can manage to schedule the next upgrade, this will be something to look forward to. Thanks.
Is that just a matter of putting "fping" in the FPING variable of hobbitserver.cfg?
Yes.
I did this and there have been no more false alarms. Just as a test, I unplugged one server's network cable and xymon reported it down within 1-2 minutes. I plugged it back in and it reported it back up about 1-2 minutes later.
I'd like to thank everyone for the help. Xymon has been so good for our needs that my coworker said that he feels naked whenever xymon is offline. (A major switch failed at one point, so he couldn't see the web server xymon is on. He had to diagnose things the old-fashioned way, i.e. lots of pings.) I recommend it to sysadmins all over the place now. :)
Jaime
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
Would changing out hobbitping for fping be as simple as changing the line: FPING="hobbitping" ...to the line... FPING="fping" ...in the file hobbitserver.cfg? I already have fping at /usr/local/sbin and that path is already in the PATH variable in hobbitserver.cfg.
Thanks, Jaime
-- Network Administrator Cairo-Durham Central School District http://cns.cairodurham.org
On 11/25/2010 10:54 PM, Jaime Kikpole wrote:
Would changing out hobbitping for fping be as simple as changing the line: FPING="hobbitping" ...to the line... FPING="fping" ...in the file hobbitserver.cfg? I already have fping at /usr/local/sbin and that path is already in the PATH variable in hobbitserver.cfg.
This sounds reasonable to me, though I'd check the documentation to be sure.
-- Ryan Novosielski - Sr. Systems Programmer | novosirj at umdnj.edu - 973/972.0922 (2-0922) | Univ. of Med. and Dent. | IST/CST-Academic Svcs. - ADMC 450, Newark
participants (8)
- bewhite@fellowes.com
- henrik@hswn.dk
- Jason.Chambers@geosoft.com
- jkikpole@cairodurham.org
- josh@imaginenetworksllc.com
- novosirj@umdnj.edu
- tm@freedom.com
- tommyk66@newsguy.com