Bug in ping tests?

josh＠imaginenetworksllc.com

18 Jul 2012

6:23 p.m.

I have a front page with about a dozen hosts and then sub pages. Every CONN test on the front page failed. Each and every host on the subpages (well over a dozen) was just fine. After 6 minutes I restarted the hobbitd processes. They all came right back.

I am running 4.2.3 and using fping for the ping checks:

hobbitserver.cfg:FPING="/usr/sbin/fping"

Has anyone seen this?

Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373


cleaver＠terabithia.org

18 Jul 2012

7:23 p.m.

Hmm. It's possible that hobbitnet (?) died or hung... Or that the pages weren't all from the same run (e.g., bbgen could have died partway through generating them).

Questions: Do you recall the page timestamps being the same? If you clicked through to the tests when it was happening, did the (dynamic) test page match the (static) color in the grid? Has the problem started recently, is it repeating, and was there anything interesting in the logs at the time?

-jc

josh＠imaginenetworksllc.com

7:27 p.m.

I haven't seen this in the last year or more on this server. I had sporadic issues on another service, but simply moving hardware (from a dedicated Atom to an ESXi platform) resolved them.

The page said it was red for 2-6 minutes. I knew the test runs every 5 minutes, so I would have expected a retest to clear it (the hosts were responding to ping from the shell).

What log are you referring to?


cleaver＠terabithia.org

7:44 p.m.

> What log are you referring to?

The output log of hobbitnet (or bbnet; I forget the process name in 4.2.3). Also hobbitlaunch.log from around that time, just to see whether something quit abnormally.

-jc


josh＠imaginenetworksllc.com

7:56 p.m.

I grepped the logs for 2012-07-18 and, excluding rrd-*, this is all I have.

These lines are because a DNS server I monitor is down. I disabled it when I discovered the campus lost power.

bb-network.log: 2012-06-17 21:16:24 WARNING: Runtime 581 longer than time limit (300)

This would be the right time:

clientdata.log
2012-07-18 14:19:39 Tried to down BOARDBUSY: Invalid argument

history.log
2012-07-18 14:19:39 Tried to down BOARDBUSY: Invalid argument
2012-07-18 14:27:36 Will not update /home/hobbituser/data/hist/foohostname,imaginenetworksllc,com.bbd - color unchanged (green)
#last line repeated for every host that experienced this problem

hobbitd.log
2012-07-18 14:19:49 Setup complete

page.log
2012-07-18 14:19:39 Tried to down BOARDBUSY: Invalid argument


henrik＠hswn.dk

8:18 p.m.

On 18-07-2012 21:56, Josh Luthman wrote:

> This would be the right time:
>
> clientdata.log 2012-07-18 14:19:39 Tried to down BOARDBUSY: Invalid argument
>
> history.log 2012-07-18 14:19:39 Tried to down BOARDBUSY: Invalid argument
> 2012-07-18 14:27:36 Will not update /home/hobbituser/data/hist/foohostname,imaginenetworksllc,com.bbd - color unchanged (green)
> #last line repeated for every host that experience this problem
>
> hobbitd.log 2012-07-18 14:19:49 Setup complete
>
> page.log 2012-07-18 14:19:39 Tried to down BOARDBUSY: Invalid argument

The "BOARDBUSY" and "Setup complete" point to Hobbit being restarted. I noticed you did this, so that would be around that time.

The "Will not update" is normal - hobbitd_history does a sanity check when it receives the first status update for each status, to make sure the history-file is in sync with the current status (updates may have been lost while Xymon was down).

In other words nothing here that would explain what you're seeing.
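For anyone who wants to separate restart noise from real errors in a grep like the one above, here is a minimal sketch. The two marker strings are copied from the log lines quoted in this thread; the function name `find_restart_times` is made up for illustration, and treating every such line as a restart is an assumption, not something the logs guarantee.

```python
import re

# Marker substrings copied from the log lines quoted in this thread.
# Assumption: lines containing them were written during a restart.
RESTART_MARKERS = ("Tried to down BOARDBUSY", "Setup complete")

# Timestamp layout as it appears in the quoted logs: YYYY-MM-DD HH:MM:SS
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def find_restart_times(lines):
    """Return sorted, de-duplicated timestamps of lines that contain a restart marker."""
    found = set()
    for line in lines:
        if any(marker in line for marker in RESTART_MARKERS):
            match = TIMESTAMP.search(line)  # first timestamp on the line
            if match:
                found.add(match.group(0))
    return sorted(found)
```

Feeding it the grep output and comparing the result with the time of the manual restart shows quickly whether these messages are just restart artifacts.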

What reason was given in the detailed status log for the ping test failing? DNS lookup failure, ping timeout, ...?

Regards, Henrik

josh＠imaginenetworksllc.com

8:35 p.m.

Do you mean the red status details? Going to the host and checking this entry via history it says:

Wed Jul 18 14:13:39 2012 conn NOT ok
Service conn on hostshere.imaginenetworksllc.com is not OK : Host does not respond to ping
System unreachable for 1 poll periods (0 seconds)

I'm not sure if this is working or not:

0.0.0.0 .default. # testip

Looking at the hosts, 4 of them do not have testip listed, so they may use DNS. The remainder do not use DNS and specifically state testip.

Now one of the hosts (compass) gives me an Internal Server Error for this one red entry. A couple of other, older red events work. Maybe a clue?


_______________________________________________
Xymon mailing list
Xymon at xymon.com
http://lists.xymon.com/mailman/listinfo/xymon

henrik＠hswn.dk

8:42 p.m.

On 18-07-2012 22:35, Josh Luthman wrote:

> Do you mean the red status details? Going to the host and checking this entry via history it says:
>
> Wed Jul 18 14:13:39 2012 conn NOT ok
> Service conn on hostshere.imaginenetworksllc.com is not OK : Host does not respond to ping
> System unreachable for 1 poll periods (0 seconds)

OK, so it did run the ping test and just didn't get an answer.

> I'm not sure if this is working or not - 0.0.0.0 .default. # testip
>
> Looking at the hosts 4 of them do not have testip listed, they may do DNS. The remainder do not use DNS and specifically state testip.

If you have that ".default." entry at the top of the file, all of your hosts will use the IP in the Xymon hosts-file.

You can see this on the "info" status page.

> Now one of the hosts (compass) give me an Internal Server Error for this one red entry. A couple other older red events work. Maybe a clue?

Not necessarily, it could just be that the detailed status log wasn't saved for some reason. Older Xymon versions could run into trouble when trying to display a status-page for which there was no logfile.

You can check in the ~hobbit/data/histlogs/HOSTNAME/conn/ directory; there should be a logfile with a timestamp matching the history-log entry you can see.
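A small sketch for doing that check from a script rather than by eye. It only lists whatever files are in the directory you pass in, oldest first; the function name `list_status_logs` is made up, and nothing is assumed about how the logfiles are named.

```python
from pathlib import Path


def list_status_logs(conn_dir):
    """Return the names of regular files in conn_dir, sorted by modification time (oldest first)."""
    # expanduser() resolves a leading ~ or ~user; it raises RuntimeError if that user is unknown.
    directory = Path(conn_dir).expanduser()
    files = [p for p in directory.iterdir() if p.is_file()]
    return [p.name for p in sorted(files, key=lambda p: p.stat().st_mtime)]
```

Call it with the conn directory of the affected host and compare the names against the timestamp shown in the history log; a missing file for the red event would fit the Internal Server Error.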

From what you've shown here, I cannot say why those 6 systems went red. My best guess is that your network was discarding pings for some reason (ICMP packets are low priority, and routers/switches are free to drop them if the load gets too high).

Regards, Henrik

josh＠imaginenetworksllc.com

8:52 p.m.

> You can see this on the "info" status page.

Neat!!! Network tests use:IP-address

The hosts that went red are in the same rack, across the street and 25 miles north. Other hosts farther and closer were just fine - but on different pages. I was pinging the host from the shell with 100% success while Xymon (hobbit) generated a new red page.

Not really critical at this point, since it's the first time I've had this problem on this hardware (and actually, looking in the */conn/ directories, it is the first issue in 2012 for most hosts). I'll just keep this in mind if I see it again. If there are any other ideas, though, I would certainly like to investigate more.

