Restarting failed processes on the client

tk＠westend.com

11 Jul 2007

12:01 p.m.

Hi,

it is possible to monitor processes using PROC statements in hobbit-clients.cfg.

But is there also a proper way in Hobbit to take action on failed processes? Let's say calling "sudo /etc/init.d/ssh start" in case no sshd processes are found? Ideally this would be configurable on the server (e.g. PROC sshd ACTION=/etc/init.d/ssh start), so that the processes to monitor and the processes to restart do not need to be specified twice.

Ciao, Thomas

Thomas Kähn
WESTEND GmbH | Internet-Business-Provider
Technik, CISCO Systems Partner - Authorized Reseller
Im Süsterfeld 6, D-52072 Aachen
Tel 0241/701333-18, Fax 0241/911879
tk at westend.com

The company is registered in the Aachen commercial register under HRB 7608.
Managing directors: Thomas Neugebauer, Thomas Heller, Michael Kolb

stl19847＠yahoo.com

11 Jul

2:13 p.m.

New subject: BBWin and Message problems

Not sure if this is the right place...

I have a particular error in my logs. It's not a real issue unless I see 10 of them in a 30 minute period, so I set up a rule in the msgs section...

Is this syntax correct?

etienne.grignon＠gmail.com

12 Jul

3:58 p.m.

New subject: [hobbit] BBWin and Message problems

Hello Stewart,

You can set a counter so that you only receive an alert if several events match the same rule. The default is 0, which means that the first matched event generates an alert. count must be a positive number. count has no effect on ignore rules.

2007/7/11, Stewart Larsen <stl19847 at yahoo.com>:

> Not sure if this is the right place...
>
> I have a particular error in my logs. It's not a real issue unless I see 10 of them in a 30 minute period, so I set up a rule in the msgs section...
>
> <msgs>
>   <setting name="summary" value="true" />
>   <match logfile="Application" eventid="3317" count="10" delay="30m" />
> </msgs>
>
> Is this syntax correct?

The syntax is fine, but the count option only helps to trigger on events that appear often within the last 30 minutes (your delay setting). Once count is reached, the msgs agent will still report all of the events because, depending on the rules, the events can differ from each other.

If you really don't want the event to be reported, maybe you should ignore it permanently.

Regards,

-- Etienne GRIGNON

stl19847＠yahoo.com

4:41 p.m.

New subject: [hobbit] BBWin and Message problems

Thanks. I've read the manual, but the syntax below does not seem to behave the way I expect.

The first error I get with that EventID triggers an alert. I thought with the syntax given, I would need to see 10 log entries within a 30 minute period before I get an alert.

Is this a bug in BBWin, or am I doing something incorrect here?

Stewart

To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk

-- Stewart Larsen

dddugan＠iastate.edu

6:22 p.m.

New subject: [hobbit] BBWin and Message problems

In case you didn't show us your whole <msgs> section, make sure your match rule is before any other more general match rule (such as the default red/error and yellow/warning rules). I believe the first match wins.

Cheers. D

henrik＠hswn.dk

11 Jul

2:13 p.m.

New subject: [hobbit] Restarting failed processes on the client

On Wed, Jul 11, 2007 at 02:01:13PM +0200, Thomas Kaehn wrote:

> But is there also a proper way in Hobbit to take action on failed processes?

No. Hobbit only monitors things, it doesn't act to recover from any failures.

If you really want this, then the easiest way is probably to have a script on the Hobbit server that handles the service restart, and trigger it from an alerting script. Here's how:

First, set up monitoring of the "sshd" process in hobbit-clients.cfg with

    PROC sshd GROUP=ssh

You need the "GROUP" setting to be able to distinguish between different types of "procs" alerts.

Next, create /usr/local/bin/sshRecover.sh with the commands needed to restart ssh - you can use $BBHOSTNAME to get the name of the host that has the problem.

Finally, in hobbit-alerts.cfg you should have

    HOST=hostA,hostB,hostC SERVICE=procs GROUP=ssh
        SCRIPT /usr/local/bin/sshRecover.sh 0

to trigger the sshRecover.sh script when the "procs" column goes red due to the "sshd" process missing. The "0" at the end is a mandatory parameter in hobbit-alerts.cfg (the "recipient" if you read the man-page) but here it's just a dummy parameter.
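For illustration, a minimal sketch of what sshRecover.sh could look like. The mail only says the script should contain "the commands needed to restart ssh", so everything beyond $BBHOSTNAME is an assumption: REMOTE_RUN is a hypothetical placeholder for whatever can still reach the host while sshd is down, and the restart command is the one from the original question. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch of /usr/local/bin/sshRecover.sh (path taken from the mail).
# BBHOSTNAME is set by Hobbit when it runs a SCRIPT recipient.
# REMOTE_RUN is a hypothetical placeholder for whatever reaches the host
# while sshd is down (rsh, a console server, ...). It defaults to "echo",
# so by default the script only prints what it would run.
REMOTE_RUN="${REMOTE_RUN:-echo}"

# Run the restart command on one host; fail if no host name was given.
recover_host() {
    if [ -z "$1" ]; then
        echo "sshRecover: no host name given" >&2
        return 1
    fi
    # Unquoted on purpose so REMOTE_RUN may carry options.
    $REMOTE_RUN "$1" /etc/init.d/ssh start
}

# Only act when Hobbit has named the affected host.
if [ -n "${BBHOSTNAME:-}" ]; then
    recover_host "$BBHOSTNAME"
fi
```

Setting REMOTE_RUN to a real remote-execution command turns the dry run into an actual restart; keeping the default is a cheap way to check the alert rule fires before letting it touch anything.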

Regards, Henrik

henrik＠hswn.dk

2:20 p.m.

New subject: [hobbit] Restarting failed processes on the client

On Wed, Jul 11, 2007 at 04:13:56PM +0200, Henrik Stoerner wrote:

> If you really want this, then the easiest way is probably to have a script on the Hobbit server that handles the service restart, and trigger it from an alerting script. Here's how:
>
> [snipped]

Particularly for ssh, running the recovery script from the Hobbit server might not be easy, since ssh is usually the only way you can log in remotely to the server and get things (re-)started.

So to implement the same functionality on the client-side, you can write a client-side extension script that does:

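#!/bin/sh

# Ask the Hobbit server for this host's "procs" status; the first word of the reply is the color.
PROCSTATUS=`$BB $BBDISP "query $MACHINE.procs" | awk '{print $1}'`
if test "$PROCSTATUS" = "red"
then
    /etc/init.d/sshd restart
fi

exit 0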

This triggers the "sshd restart" whenever the "procs" status is red, so it cannot tell whether it is the sshd process that caused the red status if you're monitoring multiple processes on each host. Alternatively, you could add network-monitoring of "ssh", and then query the "ssh" column instead of the "procs" column.
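A sketch of that alternative, assuming a network test named "ssh" is already configured for the host. It keeps the same BB, BBDISP and MACHINE variables as the script above; the helper names status_color and needs_restart are made up for this example. The restart only runs when the client environment is present.

```shell
#!/bin/sh
# Variant of the extension script that looks at the "ssh" network test
# instead of "procs". BB, BBDISP and MACHINE come from the Hobbit client
# environment, as in the script above.

# Extract the color (first word of the first line) from a reply on stdin.
status_color() {
    awk 'NR == 1 { print $1 }'
}

# Succeed only when the reply says the column is red.
needs_restart() {
    [ "$(printf '%s\n' "$1" | status_color)" = "red" ]
}

# Run only inside the client environment, where BB and BBDISP are defined.
if [ -n "${BB:-}" ] && [ -n "${BBDISP:-}" ]; then
    QUERY_REPLY=$($BB $BBDISP "query $MACHINE.ssh")
    if needs_restart "$QUERY_REPLY"; then
        /etc/init.d/sshd restart
    fi
fi
```

Because the "ssh" column only reflects the ssh service, a red status here is not confused with other monitored processes going missing.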

Regards, Henrik

dbourque＠weatherdata.com

12 Jul

2:50 p.m.

New subject: [hobbit] Restarting failed processes on the client

As a last resort, if you also have rsh running, you could

set hosts.equiv to allow the hobbit user coming in from the hobbit server to log in as user x without a password,
then give user x sudo (with NOPASSWD) rights to restart sshd.

I have a bunch of automated fixes I set up (restart ntpd, kill processes, etc.) using the SCRIPT alert and ssh keys.

In your case you could do this to restart the local or remote ssh service:

<from hobbit-alerts.cfg>
...
PAGE=bla COLOR=red
    SCRIPT /opt/hobbit/server/bin/autofix_ssh autofix_ssh SERVICE=ssh DURATION<10m
    MAIL admin at sample.com DURATION>10m REPEAT=30m

<autofix_ssh>

#!/bin/bash

# Restart sshd locally if the alert is for this host, otherwise via rsh as userx.
if [ "$BBHOSTNAME" = "$(hostname)" ]; then
    sudo /etc/init.d/sshd restart
else
    rsh "$BBHOSTNAME" -l userx "sudo /etc/init.d/sshd restart"
fi
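The other fixes mentioned above (ntpd and so on) use ssh keys rather than rsh. A rough sketch of how such a script might look, not the actual scripts in use: the SSH and FIX_USER variables, the function name autofix and the /etc/init.d/ntpd path are assumptions for illustration, and key-based login for userx is assumed to be set up already.

```shell
#!/bin/sh
# Sketch of an autofix script for a service other than sshd, using ssh keys.
# SSH, FIX_USER and the init script path are assumptions for illustration;
# BBHOSTNAME is supplied by Hobbit to SCRIPT recipients.
SSH="${SSH:-ssh}"
FIX_USER="${FIX_USER:-userx}"

# Restart a service locally, or on a remote host over ssh.
autofix() {
    if [ "$1" = "$(hostname)" ]; then
        sudo "/etc/init.d/$2" restart
    else
        # BatchMode makes ssh fail instead of prompting for a password.
        $SSH -o BatchMode=yes -l "$FIX_USER" "$1" "sudo /etc/init.d/$2 restart"
    fi
}

# Only act when Hobbit has named the affected host.
if [ -n "${BBHOSTNAME:-}" ]; then
    autofix "$BBHOSTNAME" ntpd
fi
```

BatchMode matters for an unattended script: without it, a missing or rejected key would leave ssh waiting at a password prompt.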

hope this helps

Daniel Bourque Systems/Network Administrator WeatherData Service Inc An Accuweather Company

Office (316) 266-8013 Office (316) 265-9127 ext. 3013 Mobile (316) 640-1024

tk＠westend.com

7:58 a.m.

New subject: [hobbit] Restarting failed processes on the client

Hi Henrik,

On Wed, Jul 11, 2007 at 04:13:56PM +0200, Henrik Stoerner wrote:
> On Wed, Jul 11, 2007 at 02:01:13PM +0200, Thomas Kaehn wrote:
> > But is there also a proper way in Hobbit to take action on failed processes?
>
> No. Hobbit only monitors things, it doesn't act to recover from any failures.

Thanks for your suggestions on how to solve the problem. However, if Hobbit is aimed at monitoring, it's probably better not to misuse the alert functionality for restarting processes.

Your second solution solves this problem and may also be used to act on further problems - not only "procs". So I think this would be the best solution.

Ciao, Thomas
