[hobbit] Inexplicable purple on running services
On Mon, Oct 31, 2005 at 05:32:44PM -0500, Rob Munsch wrote:
Consider the below. Approx. 25 minutes ago, across all monitored systems, all net monitored services - ssh, ldaps and dns - went to purple. They are still up, running, and just fine in every respect.
The status message is even the same as when it was showing green. But now every ssh, ldaps and dns light is purple.
Purple is an indication that some part of your monitoring system has stopped.
All of the purple ones are network services? Then it sounds as if your network tests have stopped running. Check the ~hobbit/server/logs/bb-network.log file for any errors.
Regards, Henrik
There are no entries in the network log since 10/28. Hobbit is running on the server, and the clients are running on the various client machines.
CPU, Memory, Disk and Procs all remain green! SSH, ldaps, and dns on the clients are purple.
On the hobbit server itself, bbd is purple. Everything else is green. Network connectivity between all clients and the server is functional.
I don't get it...
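One quick way to confirm that the network tests have stopped is to check whether bb-network.log is still being written. The following is only a minimal sketch and not part of Hobbit: log_is_stale is a hypothetical helper, the 10-minute threshold in the example is an assumption, and it relies on the -mmin option of find, which GNU and BSD find provide but POSIX does not require.

```shell
#!/bin/sh
# log_is_stale FILE MINUTES
# Succeeds (exit 0) when FILE is missing or was last modified more than
# MINUTES minutes ago; fails (exit 1) when it was written more recently.
# Hypothetical helper for illustration only.
log_is_stale() {
    file=$1
    minutes=$2
    # A missing log counts as stale: nothing is writing it.
    [ -e "$file" ] || return 0
    # find prints the file name only if its mtime is older than MINUTES.
    [ -n "$(find "$file" -mmin +"$minutes" 2>/dev/null)" ]
}

# Example, using the log path mentioned in the thread:
#   log_is_stale ~hobbit/server/logs/bb-network.log 10 \
#       && echo "bb-network.log has not been updated recently"
```

A stale log only says that the network tester is not logging; it does not say why, so the next step is still to run the tester by hand with debugging on.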
To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk
-- Rob Munsch Systems Analyst, Solutions for Progress http://www.solutionsforprogress.com
Since ssh, ldap, and dns are tests run from the server side (cpu etc. remaining green indicates the clients are running and communicating OK, right?), I ran
./bbtest-net --concurrency=50 --checkresponse --no-update --timing --debug
Now, i can ping and ssh to all clients from server just fine. But i see this:
2005-11-01 14:14:20 Adding to combo msg: status brassai.conn red <!-- [flags:ordAstILe] --> Tue Nov 1 14:14:20 2005 conn NOT ok
status brassai.conn red <!-- [flags:ordAstILe] --> Tue Nov 1 14:14:20 2005 conn NOT ok
Service conn on brassai is not OK : Host does not respond to ping
System unreachable for 3 poll periods (56 seconds)
Aha. Since the ping test fails, why test other net services? So now it makes sense; the net tests are not being run, hence the purple.
Of course, I don't know why the net test is suddenly unable to ping anything. It is getting the right IPs internally:
2005-11-01 14:14:20 Got DNS result for host doisneau : 10.x.x.x
2005-11-01 14:14:20 Got DNS result for host brassai : 10.x.x.x
2005-11-01 14:14:20 Got DNS result for host moadib : 10.x.x.x
and I thought cranking the concurrency way down might help, but apparently it doesn't.
So, I'm glad I found the cause... now I just need to find out the cause's cause. o_O
Last email for a while, I promise; I'm chainsmoking packets at this point. But I found this:
2005-11-01 14:14:20 TCP tests completed normally
2005-11-01 14:14:20 Execution of 'fping -Ae' failed with error-code 99
2005-11-01 14:14:20 Sending results for service conn
Okay, it can't find fping. But...
hobbit at randomaccess ~/server/bin $ more ../etc/hobbitserver.cfg |grep fping
# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
FPING="/usr/sbin/fping" # Path and options for the 'fping' program.
hobbit at randomaccess ~/server/bin $ /usr/sbin/fping -Ae brassai
10.10.10.15 is alive (0.15 ms)
hobbit at randomaccess ~/server/bin $
So it should be finding fping just fine, and fping is working. The path is in hobbitserver.cfg:
# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
# as well as the BBHOME/bin directory where all of the Hobbit programs reside.
PATH="/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/home/hobbit/server/bin"
...
# For bbtest-net
...
FPING="/usr/sbin/fping" # Path and options for the 'fping' program.
and, in hobbitlaunch.cfg:
[bbnet]
    ENVFILE /home/hobbit/server/etc/hobbitserver.cfg
So, by all the above: fping is functional, it is accessible by the 'hobbit' user, it can reach the clients, it is in the PATH, it is defined in the ENVFILE bbnet is using.
So what's gone wrong??
On Tue, Nov 01, 2005 at 02:40:48PM -0500, Rob Munsch wrote:
Last email for a while, i promise; i'm chainsmoking packets at this point. but i found this-
2005-11-01 14:14:20 TCP tests completed normally
2005-11-01 14:14:20 Execution of 'fping -Ae' failed with error-code 99
2005-11-01 14:14:20 Sending results for service conn
Okay, it can't find fping. But...
hobbit at randomaccess ~/server/bin $ more ../etc/hobbitserver.cfg |grep fping
# Make sure the path includes the directories where you have fping, mail and (optionally) ntpdate installed,
FPING="/usr/sbin/fping" # Path and options for the 'fping' program.
This is pretty odd, because with that FPING setting you should also see /usr/sbin/fping in the logfile entry. It should read:
2005-11-01 14:14:20 Execution of '/usr/sbin/fping -Ae' failed with error-code 99
Could you check your hobbitlaunch.cfg file? The [bbnet] section should be
[bbnet]
    ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
    NEEDS hobbitd
    CMD bbtest-net --report --ping --checkresponse
    LOGFILE $BBSERVERLOGS/bb-network.log
    INTERVAL 5m
I suspect that maybe the ENVFILE setting is missing or points to the wrong file ...
[bbnet] ENVFILE /home/hobbit/server/etc/hobbitserver.cfg
Hrm, so you did that.
What happens if you run
bbcmd --env=/home/hobbit/server/etc/hobbitserver.cfg bbtest-net --ping
?
Henrik
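Henrik's reasoning above is that the fping command named in the log should match the FPING value in the environment file that the [bbnet] section loads. A rough sketch for cross-checking the two files might look like this. Both function names are hypothetical, and the parsing is deliberately naive: it assumes ENVFILE sits on its own line inside the section and that FPING is assigned with double quotes, as in the excerpts quoted in this thread.

```shell
#!/bin/sh
# bbnet_envfile LAUNCHCFG
# Print the ENVFILE path configured in the [bbnet] section of a
# hobbitlaunch.cfg-style file. Hypothetical helper for illustration.
bbnet_envfile() {
    awk '
        # A line starting with "[" opens a new section.
        /^\[/ { in_bbnet = ($1 == "[bbnet]"); next }
        # Inside [bbnet], the first ENVFILE line wins.
        in_bbnet && $1 == "ENVFILE" { print $2; exit }
    ' "$1"
}

# fping_setting ENVFILE
# Print the value of the last double-quoted FPING="..." assignment.
# Comment lines that merely mention fping are ignored.
fping_setting() {
    sed -n 's/^[[:space:]]*FPING="\([^"]*\)".*$/\1/p' "$1" | tail -n 1
}

# Example, using the paths from the thread:
#   envfile=$(bbnet_envfile /home/hobbit/server/etc/hobbitlaunch.cfg)
#   echo "bbnet loads: $envfile"
#   echo "FPING is:    $(fping_setting "$envfile")"
```

If the value printed here differs from the command shown in the "Execution of ... failed" log line, the tester was most likely started without that environment file.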
hobbit at randomaccess ~/server/etc $ ../bin/bbcmd --env=/home/hobbit/server/etc/hobbitserver.cfg bbtest-net --ping --debug
It seems to have worked, pingwise:
2005-11-01 16:27:24 Sending results for service conn
2005-11-01 16:27:24 Adding to combo msg: status randomaccess.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status orizo.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status moadib.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status brassai.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
2005-11-01 16:27:24 Adding to combo msg: status doisneau.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
but still doesn't test anything else... no checks for ssh, ldaps, dns... nothing it was checking (and showing as green) two days ago seems to be getting tested now. It completes the conn test, reports the results, and that's that.
I've checked the permissions on hobbitserver.cfg and they're correct. There's no reason the hobbit user shouldn't be able to read it.
I then ran bbcmd again without specifying the env, and got identical results.
I don't understand why ssh et al. are yielding nothing at all. If a test failed to connect in some way it'd be red, wouldn't it? They remain purple... for some reason the tests aren't being done at all. I haven't altered the services definitions in any way.
On Tue, Nov 01, 2005 at 04:38:00PM -0500, Rob Munsch wrote:
hobbit at randomaccess ~/server/etc $ ../bin/bbcmd --env=/home/hobbit/server/etc/hobbitserver.cfg bbtest-net --ping --debug
It seems to have worked, pingwise:
2005-11-01 16:27:24 Sending results for service conn
2005-11-01 16:27:24 Adding to combo msg: status randomaccess.conn green <!-- [flags:OrdAstILe] --> Tue Nov 1 16:27:24 2005 conn ok
but still doesn't test anything else... no checks for ssh, ldaps, dns... nothing it was checking (and showing as green) two days ago seems to be getting tested now. It completes the conn test, reports the results, and that's that.
Run "bbcmd bbtest-net --ping --report --debug >debug.log 2>&1" and send the full debug logfile directly to me (henrik at hswn.dk).
Henrik
Right then. Straightened out thanks to Henrik's generosity with his time. As a warning to my fellow knobs, here's the postmortem:
In bb-hosts, a group-only definition controls only which test columns are displayed. Its arguments do NOT request the tests themselves; network tests must still be listed after the hash mark on the host's line, as usual. (Client tests such as cpu are reported by the client and don't get specified after the hash.) So, this fails to provide the expected info:
group-only ldaps|ssh|cpu|memory|disk|procs hobbitses
10.10.10.15 brassai
and this succeeds:
group-only ldaps|ssh|cpu|memory|disk|procs hobbitses
10.10.10.15 brassai # ldaps ssh
In the former, the ldaps and ssh columns get no data. The cpu through procs columns display correctly, but the only net test run against brassai is ping, because --ping is specified by default in the [bbnet] section of hobbitlaunch.cfg.
I hope this helps anyone else new and blundering their way around the config files as i was.
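The mismatch described in this postmortem can be spotted mechanically. Below is a minimal sketch, assuming a simplified bb-hosts layout like the one shown above: a "group-only COL1|COL2 title" line followed by "IP hostname # tags" lines. The function name missing_net_tests is hypothetical, and the list of columns that count as network tests is supplied by the caller, since the script cannot know which columns come from clients. It ignores tag modifiers, includes, and other bb-hosts features.

```shell
#!/bin/sh
# missing_net_tests BBHOSTS "NETTEST NETTEST ..."
# For each host under a group-only line, report displayed columns that are
# listed as network tests but are missing from the tags after the "#".
# Hypothetical helper; handles only the simple layout shown in the thread.
missing_net_tests() {
    awk -v nettests="$2" '
        BEGIN {
            n = split(nettests, t, " ")
            for (i = 1; i <= n; i++) isnet[t[i]] = 1
        }
        # group-only line: remember which columns this group displays.
        $1 == "group-only" { ncols = split($2, cols, "|"); next }
        # Other page or group directives cancel the column filter.
        $1 ~ /^(group|group-compress|page|subpage)$/ { ncols = 0; next }
        # Host line: IP address, hostname, then optional "# tag tag ...".
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
            split("", tags)            # portable way to empty the array
            seen_hash = 0
            for (i = 3; i <= NF; i++) {
                tag = $i
                if (!seen_hash) {
                    if (tag !~ /^#/) continue
                    seen_hash = 1
                    sub(/^#/, "", tag) # also accepts "#ssh" with no space
                }
                if (tag != "") tags[tag] = 1
            }
            for (i = 1; i <= ncols; i++)
                if ((cols[i] in isnet) && !(cols[i] in tags))
                    print $2 ": column " cols[i] " is displayed but not tested"
        }
    ' "$1"
}

# Example:
#   missing_net_tests ~hobbit/server/etc/bb-hosts "ssh ldaps dns"
```

Run against the failing definition above with "ssh ldaps dns" as the test list, this would flag ldaps and ssh for brassai; against the corrected definition it prints nothing.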