Restarting failed processes on the client
Hi,
It is possible to monitor processes using PROC statements in hobbit-clients.cfg.
But is there also a proper way in Hobbit to take action on failed processes? Say, calling "sudo /etc/init.d/ssh start" when no sshd processes are found? Ideally this would be configurable on the server (e.g. PROC sshd ACTION=/etc/init.d/ssh start), so that which processes to monitor and which to restart does not need to be specified twice.
Ciao, Thomas
Thomas Kähn WESTEND GmbH | Internet-Business-Provider Technik CISCO Systems Partner - Authorized Reseller Im Süsterfeld 6 Tel 0241/701333-18 tk at westend.com D-52072 Aachen Fax 0241/911879
The company is registered in the Aachen commercial register under HRB 7608. Managing directors: Thomas Neugebauer, Thomas Heller, Michael Kolb
Not sure if this is the right place...
I have a particular error in my logs. It's not a real issue unless I see 10 of them in a 30 minute period, so I set up a rule in the msgs section...
<msgs>
    <setting name="summary" value="true" />
    <match logfile="Application" eventid="3317" count="10" delay="30m" />
</msgs>
Is this syntax correct?
Hello Stewart,
You can set a counter so that you only receive an alert if several events matching the same rule are seen. Default is 0, which means that the first event matched will generate an alert. count must be a positive number. count has no effect on ignore rules.
The syntax is correct, but the count option only helps to trigger on events that appear often within the last 30 minutes (your delay setting). Once count is reached, the msgs agent will still report all of the events because, depending on the rules, the events can differ from each other.
If you really don't want the event to be reported, maybe you should ignore it permanently.
Regards,
-- Etienne GRIGNON
Thanks. I've read the manual, but the syntax below does not seem to behave the way I expect.
The first error I get with that EventID triggers an alert. I thought with the syntax given, I would need to see 10 log entries within a 30 minute period before I get an alert.
Is this a bug in BBWin, or am I doing something incorrect here?
Stewart
In case you didn't show us your whole <msgs> section, make sure your match rule is before any other more general match rule (such as the default red/error and yellow/warning rules). I believe the first match wins.
Cheers. D
-----Original Message-----
From: Stewart Larsen [mailto:stl19847 at yahoo.com]
Sent: Thursday, July 12, 2007 11:42 AM
To: hobbit at hswn.dk
Cc: hobbit at hswn.dk
Subject: Re: [hobbit] BBWin and Message problems
On Wed, Jul 11, 2007 at 02:01:13PM +0200, Thomas Kaehn wrote:
But is there also a proper way in Hobbit to take action on failed processes?
No. Hobbit only monitors things, it doesn't act to recover from any failures.
If you really want this, then the easiest way is probably to have a script on the Hobbit server that handles the service restart, and trigger it from an alerting script. Here's how:
First, set up monitoring of the "sshd" process in hobbit-clients.cfg with
    PROC sshd GROUP=ssh
You need the "GROUP" setting to be able to distinguish between different types of "procs" alerts.
Next, create /usr/local/bin/sshRecover.sh with the commands needed to restart ssh. You can use $BBHOSTNAME to get the name of the host that has the problem.
Finally, in hobbit-alerts.cfg you should have
    HOST=hostA,hostB,hostC SERVICE=procs GROUP=ssh
        SCRIPT /usr/local/bin/sshRecover.sh 0
to trigger the sshRecover.sh script when the "procs" column goes red due to the "sshd" process missing. The "0" at the end is a mandatory parameter in hobbit-alerts.cfg (the "recipient" if you read the man-page), but here it's just a dummy parameter.
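As a rough illustration of the sshRecover.sh step, here is a minimal sketch, not a tested recipe. It assumes the alert script receives $BBHOSTNAME (mentioned above) and a $RECOVERED variable set to "1" for recovery notices; check the hobbit-alerts.cfg man-page for the variables your version exports. RESTART_CMD is a hypothetical, site-specific command that can reach the affected host; without it the script only prints what it would do.

```shell
#!/bin/sh
# Sketch of /usr/local/bin/sshRecover.sh (run on the Hobbit server by SCRIPT).
# Assumptions: BBHOSTNAME and RECOVERED come from the alert environment;
# RESTART_CMD is a hypothetical site-specific restart command.

# Print the action to take for one alert: "restart: HOST" or "skip: REASON".
recover_action() {
    host="$1"       # normally $BBHOSTNAME
    recovered="$2"  # normally $RECOVERED ("1" means the problem is gone)
    if [ -z "$host" ]; then
        echo "skip: no host"
    elif [ "$recovered" = "1" ]; then
        echo "skip: $host recovered"
    else
        echo "restart: $host"
    fi
}

action=$(recover_action "${BBHOSTNAME:-}" "${RECOVERED:-0}")
echo "$action"

case "$action" in
    restart:*)
        if [ -n "${RESTART_CMD:-}" ]; then
            # Run the site-specific restart command with the host as argument.
            $RESTART_CMD "$BBHOSTNAME"
        else
            echo "dry run: would restart ssh on $BBHOSTNAME"
        fi
        ;;
esac
```

Keeping the decision in a small function makes it easy to check by hand, for example by running the script with BBHOSTNAME=hostA set in the environment and no RESTART_CMD.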
Regards, Henrik
On Wed, Jul 11, 2007 at 04:13:56PM +0200, Henrik Stoerner wrote:
If You really want this, then the easiest way is probably to have a script on the Hobbit server that handles the service restart, and trigger it from an alerting script. Here's how:
[snipped]
Particularly for ssh, running the recovery script from the Hobbit server might not be easy, since ssh is usually the only way you can log in remotely to the server and get things (re-)started.
So to implement the same functionality on the client-side, you can write a client-side extension script that does:
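#!/bin/sh
# Ask the Hobbit server for this host's "procs" status; the first word is the color.
PROCSTATUS=$($BB $BBDISP "query $MACHINE.procs" | awk '{print $1}')
if test "$PROCSTATUS" = "red"
then
    # The status is red, so restart sshd.
    /etc/init.d/sshd restart
fi
exit 0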
This triggers the "sshd restart" whenever the "procs" status goes red. If you monitor several processes on each host, however, it cannot tell whether it was the sshd process that turned the status red. Alternatively, you could add network monitoring of "ssh" and then query the "ssh" column instead of the "procs" column.
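The alternative of querying the "ssh" column could look roughly like the sketch below. It assumes, as the script above does, that the first word of the query reply is the status color, and that $BB, $BBDISP and $MACHINE are set by the Hobbit client environment. The helper names color_of and needs_restart are made up for this example.

```shell
#!/bin/sh
# Sketch: restart sshd only when the network "ssh" test for this host is red.
# Assumption: the first word of the "query" reply is the status color.

# Print the first word of the first line of a query reply.
color_of() {
    printf '%s\n' "$1" | awk 'NR==1 {print $1}'
}

# Succeed (exit status 0) only when the reply reports a red status.
needs_restart() {
    case "$(color_of "$1")" in
        red) return 0 ;;
        *)   return 1 ;;
    esac
}

# Only talk to the server when the client environment is present.
if [ -n "${BB:-}" ] && [ -n "${BBDISP:-}" ] && [ -n "${MACHINE:-}" ]; then
    reply=$($BB $BBDISP "query $MACHINE.ssh")
    if needs_restart "$reply"; then
        /etc/init.d/sshd restart
    fi
fi
```

Because the "ssh" column only reflects the ssh network test, a red "procs" status caused by some other process no longer triggers a restart.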
Regards, Henrik
As a last resort, if you also have rsh running, you could
- set hosts.equiv to allow the hobbit user coming in from the hobbit server to login as user x without a password,
- then give user x sudo ( with NOPASSWD ) rights to restart sshd.
I have set up a number of automated fixes (restarting ntpd, killing processes, etc.) using the SCRIPT alert and ssh keys.
In your case you could do this to restart the local or remote ssh service:
< from hobbit-alerts.cfg>
...
PAGE=bla COLOR=red
    SCRIPT /opt/hobbit/server/bin/autofix_ssh autofix_ssh SERVICE=ssh DURATION<10m
    MAIL admin at sample.com DURATION>10m REPEAT=30m
<autofix_ssh>
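#!/bin/bash
# Restart sshd locally if the alert is for this host, otherwise via rsh.
if [ "$BBHOSTNAME" = "$(hostname)" ] ; then
    sudo /etc/init.d/sshd restart
else
    # userx is the account allowed in via hosts.equiv with NOPASSWD sudo rights.
    rsh "$BBHOSTNAME" -l userx sudo /etc/init.d/sshd restart
fi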
hope this helps
Daniel Bourque Systems/Network Administrator WeatherData Service Inc An Accuweather Company
Office (316) 266-8013 Office (316) 265-9127 ext. 3013 Mobile (316) 640-1024
Hi Henrik,
On Wed, Jul 11, 2007 at 04:13:56PM +0200, Henrik Stoerner wrote:
On Wed, Jul 11, 2007 at 02:01:13PM +0200, Thomas Kaehn wrote:
But is there also a proper way in Hobbit to take action on failed processes?
No. Hobbit only monitors things, it doesn't act to recover from any failures.
Thanks for your suggestions on how to solve the problem. However, if Hobbit is aimed at monitoring, it is probably better not to misuse the alert functionality for restarting processes.
Your second solution solves this problem and may also be used to act on further problems - not only "procs". So I think this would be the best solution.
Ciao, Thomas
participants (6)
- dbourque@weatherdata.com
- dddugan@iastate.edu
- etienne.grignon@gmail.com
- henrik@hswn.dk
- stl19847@yahoo.com
- tk@westend.com