Hello Henrik,
Although it *has* worked before, our Hobbit 4.0.4 server no longer runs the external alert script. When I run hobbitd_alert in test mode, it reports that it would run the script, but in practice the script is not executed anymore.
Hobbit was installed from source with "rpmbuild --rebuild hobbit-4.0.4-1.src.rpm". It used to run the external scripts, so what has changed? We applied the SLES9 patches during our maintenance window, and now that I think of it, it has not reported since then. So, is there a library missing? Any suggestions?
Regards,
Peter
orwell # ldd /usr/lib/hobbit/server/bin/hobbitd_alert
    linux-gate.so.1 => (0xffffe000)
    libpcre.so.0 => /usr/lib/libpcre.so.0 (0x4001d000)
    libc.so.6 => /lib/tls/libc.so.6 (0x40029000)
    /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
hobbit at orwell:~> server/bin/hobbitd_alert --test orwell disk 30 red
00012428 2005-08-16 15:46:11 send_alert orwell:disk state Paging
00012428 2005-08-16 15:46:11 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 183
00012428 2005-08-16 15:46:11 Failed 'HOST=%(nagger) EXSERVICE=http' (hostname not in include list)
00012428 2005-08-16 15:46:11 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 189
00012428 2005-08-16 15:46:11 *** Match with 'HOST=%(orwell)' ***
00012428 2005-08-16 15:46:11 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 190
00012428 2005-08-16 15:46:11 *** Match with '$UNIXDAG' ***
00012428 2005-08-16 15:46:11 Mail alert with command 'mail -s "Hobbit [12345] orwell:disk CRITICAL (RED)" email at adress_removed.nl'
00012428 2005-08-16 15:46:11 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 192
00012428 2005-08-16 15:46:11 *** Match with '$UNIXNACHT' ***
00012428 2005-08-16 15:46:11 Mail alert with command 'mail -s "Hobbit [12345] orwell:disk CRITICAL (RED)" email at iaddress_removed.nl'
00012428 2005-08-16 15:46:11 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 193
00012428 2005-08-16 15:46:11 *** Match with '$UNIXSEMAFOON_BEHEER' ***
00012428 2005-08-16 15:46:11 Script alert with command '/usr/local/bb/consigne.ksh' and recipient 006XXXXXXXX
2005/5/31, Henrik Stoerner <henrik at hswn.dk>:
On Tue, May 31, 2005 at 09:43:40AM +1200, Andy France wrote:
Since updating to 4.0.4, I have had a couple of "reds" which have not generated all of my alerts.
Could you check the notifications.log file for any mention of these alerts being sent out by Hobbit, and the page.log file for any errors from your scripts ? Both should be in the /var/log/hobbit/ directory.
Regards, Henrik
On Tue, Aug 16, 2005 at 04:03:06PM +0200, Peter Welter wrote:
Although it *has* worked before, our Hobbit 4.0.4 server no longer runs the external alert script. When I run hobbitd_alert in test mode, it reports that it would run the script, but in practice the script is not executed anymore.
Anything unusual in the /var/log/hobbit/page.log file ?
Hobbit was installed from source with "rpmbuild --rebuild hobbit-4.0.4-1.src.rpm". It used to run the external scripts, so what has changed? We applied the SLES9 patches during our maintenance window, and now that I think of it, it has not reported since then. So, is there a library missing? Any suggestions?
'/usr/local/bb/consigne.ksh' and recipient 006XXXXXXXX
What happens if you login as the hobbit user and run the script like this:
BBCOLORLEVEL=red
BBALPHAMSG="Just a test"
ACKCODE="12345"
RCPT="006xxxxxx"
BBHOSTNAME="some.host.name"
MACHIP="10.0.0.1"
BBSVCNAME="conn"
BBSVCNUM="300"
BBHOSTSVC="some.host.name.conn"
BBHOSTSVCCOMMAS="some,host,name.conn"
BBNUMERIC="30001000000000112345"
/usr/local/bb/consigne.ksh
That's basically what Hobbit does when running the script.
Henrik
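The manual test Henrik describes can be wrapped in a small shell function. This is only a sketch: the name `run_alert_test` is made up for illustration, the values are the placeholder ones from the message above, and it assumes the variables have to be placed in the script's environment (done here with `env`) for the script to see them.

```shell
# Hypothetical helper: run an alert script by hand with the environment
# variables listed in the message above. All values are placeholders.
run_alert_test() {
    script="$1"    # path to the alert script, e.g. /usr/local/bb/consigne.ksh
    rcpt="$2"      # recipient, handed to the script through RCPT
    env BBCOLORLEVEL="red" \
        BBALPHAMSG="Just a test" \
        ACKCODE="12345" \
        RCPT="$rcpt" \
        BBHOSTNAME="some.host.name" \
        MACHIP="10.0.0.1" \
        BBSVCNAME="conn" \
        BBSVCNUM="300" \
        BBHOSTSVC="some.host.name.conn" \
        BBHOSTSVCCOMMAS="some,host,name.conn" \
        BBNUMERIC="30001000000000112345" \
        "$script"
}
```

Run as the hobbit user, for example `run_alert_test /usr/local/bb/consigne.ksh 006xxxxxx`. If the script works this way but not from Hobbit, the script itself is probably not the problem.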
Anything unusual in the /var/log/hobbit/page.log file ?
hobbit at orwell:~> more /var/log/hobbit/page.log
2005-08-16 14:34:43 Tried to down BOARDBUSY: Invalid argument
What happens if you login as the hobbit user and run the script like this:
That's basically what Hobbit does when running the script.
I ran this script, and, yes the buzzer rang.
Thanks, Peter
Hello Henrik,
I'm totally flabbergasted by Hobbit not running an external script anymore. There must be a simple explanation for it, and I'm sure I'll have a few laughs afterwards :-/
Since Hobbit is very important to us, and I don't want to rush into things (updating the entire server from 4.0.4 to the newest version), I will first try to make a hobbit-alerts.cfg that is as small and simple as possible, containing only what is needed: a test alert config with one external script that runs on a yellow condition and sends an email. This should be a simple dummy test that checks the entire Hobbit alert setup, just to be sure.
Will let you know the results asap!
Regards,
Peter
On Wed, Aug 17, 2005 at 04:56:59AM +0200, Peter Welter wrote:
Hello Henrik,
I'm totally flabbergasted by Hobbit not running an external script anymore. There must be a simple explanation for it, and I'm sure I'll have a few laughs afterwards :-/
Since Hobbit is very important to us, and I don't want to rush into things (updating the entire server from 4.0.4 to the newest version), I will first try to make a hobbit-alerts.cfg that is as small and simple as possible, containing only what is needed: a test alert config with one external script that runs on a yellow condition and sends an email. This should be a simple dummy test that checks the entire Hobbit alert setup, just to be sure.
You might want to add "--trace=/tmp/alerttrace.log" to the hobbitd_alert command in hobbitlaunch.cfg. That will give you a closer watch on how each alert is handled by the alert module.
Do the missing alerts show up in the notifications.log file ?
Regards, Henrik
2005/8/17, Henrik Stoerner <henrik at hswn.dk>:
On Wed, Aug 17, 2005 at 04:56:59AM +0200, Peter Welter wrote:
You might want to add "--trace=/tmp/alerttrace.log" to the hobbitd_alert command in hobbitlaunch.cfg. That will give you a closer watch on how each alert is handled by the alert module.
Thanks, I will do so now.
Do the missing alerts show up in the notifications.log file ?
No, unfortunately.
I'll keep you posted.
Status update:
After reducing hobbit-alerts.cfg to a minimum and enabling the trace facility, it became clear to me that after restarting Hobbit, the downtime for a service is completely recalculated. It checks the rules for a service which has been down for 1 hour and 17 minutes, and it says:
00003590 2005-08-17 09:46:33 Matching host:service:page 'burad12:raid:DNO/SAPEPROC' against rule line 196
00003590 2005-08-17 09:46:33 Failed '$UNIXDAG' (min. duration 0<360)
00003590 2005-08-17 09:46:33 Matching host:service:page 'burad12:raid:DNO/SAPEPROC' against rule line 197
00003590 2005-08-17 09:46:33 Failed '$UNIXTEST' (min. duration 0<1800)
Hmmm... I do restart Hobbit now and then, e.g. because 'hobbit.sh rotate' does not work on my installation, and the log rotation on Linux moves notifications.log to notifications.log.1, which then keeps being written to until Hobbit is restarted.
So, monitoring these logs clarifies things for me... Now let's narrow down the point where a script is being called. To be continued...
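The "Failed ... (min. duration ...)" entries are the interesting part of such a trace. A rough sketch for pulling them out per host and service follows; the function name `failed_rules` is hypothetical, and it assumes the trace file has one entry per line in the format shown in this thread.

```shell
# Hypothetical helper: list the rules that rejected an alert for one
# host:service pair in a hobbitd_alert trace file, with the reason given.
failed_rules() {
    trace_file="$1"   # e.g. /tmp/alerttrace.log
    target="$2"       # e.g. burad12:raid
    # Remember whether the last "Matching" line was for the target, and
    # print the "Failed" line that directly follows it, minus the prefix.
    awk -v target="$target" '
        /Matching host:service:page/ { hit = (index($0, "\047" target ":") > 0); next }
        / Failed / && hit { sub(/^[0-9]+ [0-9-]+ [0-9:]+ /, ""); print }
        { hit = 0 }
    ' "$trace_file"
}
```

For the trace above, `failed_rules /tmp/alerttrace.log burad12:raid` would list the two "min. duration" failures, which makes a reset duration easy to spot after a restart.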
On Wed, Aug 17, 2005 at 09:54:27AM +0200, Peter Welter wrote:
Status update:
After reducing hobbit-alerts.cfg to a minimum and enabling the trace facility, it became clear to me that after restarting Hobbit, the downtime for a service is completely recalculated. It checks the rules for a service which has been down for 1 hour and 17 minutes, and it says:
It should pick up the old duration from the checkpoint file. What's your hobbitd and hobbitd_alert commands in hobbitlaunch.cfg ?
Henrik
It should pick up the old duration from the checkpoint file. What's your hobbitd and hobbitd_alert commands in hobbitlaunch.cfg ?
# This is the main Hobbit daemon. You cannot live without this one.
[hobbitd]
HEARTBEAT
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log --admin-senders=127.0.0.1,$BBSERVERIP
[bbpage]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
NEEDS hobbitd
CMD hobbitd_channel --channel=page --log=$BBSERVERLOGS/page.log hobbitd_alert --trace=/tmp/alerttrace.log
The directory /usr/lib/hobbit/server/tmp/ contains:
-rw-r--r-- 1 hobbit hobbit 318947 2005-08-17 10:32 hobbitd.chk
Regards, Peter
Henrik,
Status update: I found out that setting the same values for sending an email and for executing the consigne script worked out fine; see alerttrace.txt:
00007902 2005-08-17 11:02:50 *** Match with '$UNIXDAG' ***
00007902 2005-08-17 11:02:50 Mail alert with command 'mail -s "Hobbit [329697] burad12:raid CRITICAL (RED)" someaddress at somedomain.somewhere'
00007902 2005-08-17 11:02:50 Matching host:service:page 'burad12:raid:DNO/SAPEPROC' against rule line 197
00007902 2005-08-17 11:02:50 *** Match with '$UNIXTEST' ***
00007902 2005-08-17 11:02:50 Script alert with command '/usr/local/bb/consigne.ksh' and recipient 00665022245
00007902 2005-08-17 11:03:13 send_alert lucifer:disk state Paging
00007902 2005-08-17 11:03:13 Matching host:service:page 'lucifer:disk:DNO/UB' against rule line 195
00007902 2005-08-17 11:03:13 Failed 'HOST=%(burad12|burad14|burad11|burad15)' (hostname not in include list)
00007597 2005-08-17 11:03:21 @@page igrsxc002:msgs:WINDOWS=red
00007597 2005-08-17 11:03:21 state 1->1
00007597 2005-08-17 11:04:14 @@page igrsdm001:disk:WINDOWS=yellow
00007597 2005-08-17 11:04:14 state 1->1
However, since I had bumped up the duration to 30m before the script is executed and (probably) restarted Hobbit several times (sooner than the 30-minute interval), the script apparently did not execute yesterday :-( It did work fine today, twice. For the moment I'll leave the trace option (/tmp/alerttrace.log) enabled in hobbitlaunch.cfg; it sure comes in handy!
Regards, Peter
Hi Henrik,
The alerttrace is still on today and, yes, yesterday the script was executed correctly in a tiny test config. The original config still gives me problems. I checked hobbit-alerts.cfg for control characters (vi -> set list) and found nothing weird.
Part of the hobbit-alerts.cfg
-some macros:
Enabled now and then for testing purposes.
###$UNIXTEST=MAIL me at somedomain.nl DURATION>6m TIME=W:0800:1730 REPEAT=1d RECOVERED COLOR=yellow,red,purple
$UNIXDAG=MAIL somewhere at somedomain.nl DURATION>6m TIME=W:0800:1730 REPEAT=1d RECOVERED
$UNIXNACHT=MAIL somewhere at somedomain.nl TIME=*:0000:2359 DURATION>30m REPEAT=1d SERVICE=!cpu,!msgs RECOVERED COLOR=!yellow
$UNIXSEMAFOON_BEHEER=SCRIPT /usr/local/bb/consigne.ksh 00765327285 FORMAT=SMS TIME=*:0000:2359 DURATION>30m REPEAT=60m SERVICE=!cpu,!msgs,!smtp,!bbgen,!bbtest,!hobbitd COLOR=!yellow
-A host for which $UNIXSEMAFOON_BEHEER does not fire, while the $UNIXDAG mail has been sent:
HOST=%(orwell)
    $UNIXDAG
    $UNIXTEST
    $UNIXNACHT
    $UNIXSEMAFOON_BEHEER
The host does give me an email for a threshold exceeded (disk>95%) and that can be seen in the trace (I only grepped the host specific entries):
00013241 2005-08-18 10:04:45 *** Match with 'HOST=%(orwell)' ***
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 191
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 193
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 194
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 196
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 203
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 209
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 216
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 223
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 229
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 236
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 242
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 254
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 261
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 268
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 275
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 282
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 287
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 294
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 300
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 304
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 311
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 322
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 332
00013241 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 340
00013241 2005-08-18 10:04:45 Failed 'HOST=%(orwell)' (hostname not in include list)
00015024 2005-08-18 10:04:45 send_alert orwell:disk state Paging
00015024 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 184
00015024 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 190
00015024 2005-08-18 10:04:45 *** Match with 'HOST=%(orwell)' ***
00015024 2005-08-18 10:04:45 Matching host:service:page 'orwell:disk:DNO/SBEHEER' against rule line 191
00015024 2005-08-18 10:04:45 Mail alert with command 'mail -s "Hobbit [25437] orwell:disk CRITICAL (RED)" central at somedomain.nl'
But the next (expected) step cannot be seen in the trace, and it does not occur.
All this could be just a configuration issue, so I restored another tiny config and restarted Hobbit, and that worked fine. So no problems with the mail or script etc :-]
So, now I did the following:
-I restored the hobbit-alerts.cfg we must use.
-I uncommented my $UNIXTEST macro to prevent empty lines in the HOST sections of hobbit-alerts.cfg, knowing that Hobbit can have problems with 2 or more spaces (perhaps newlines too?).
-I moved the $UNIXTEST macro to the end of each HOST section, for the times when I comment out the previous line ;-)
-I restarted Hobbit.
-Now the first alert is sent as it should be, but the one alert that should page after 30 minutes fails, and nothing about it shows up in the logfile.
Regards,
Peter
I've been digging through the Hobbit mailing list and found something that might be applicable to this problem. First, the email correspondence between you and Peter Murray:
[snip] "On Tue, Jul 26, 2005 at 08:16:56AM -0400, Peter Murray wrote:
HOST=testhost.syr.edu RECOVERED
    MAIL user1 at syr.edu FORMAT=TEXT DURATION>10 REPEAT=20
    MAIL user2 at syr.edu FORMAT=SMS DURATION>20 REPEAT=20
What happens is the first alert (FORMAT=TEXT) goes out at 10 minutes, nothing at 20 minutes, both at 30 minutes, nothing at 40 minutes, both at 50 minutes, and so on.
Confirmed that this is a bug in all current Hobbit versions. It will be fixed in 4.1.2 - you can pick up the latest snapshot for a working version." [snip]
Second, from the Changes-file from 4.1.1 -> 4.1.2 (I run 4.0.4):
[snip] "* When multiple recipients of an alert had different minimum duration and/or repeat-settings, they would mostly use only the settings for the first recipient." [snip]
Can you confirm this?
Regards, Peter
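To make the difference between the configured and the reported behaviour concrete, the schedule each recipient should get can be worked out with a little arithmetic. This is only an illustration: `alert_times` is a made-up helper, and it simplifies DURATION>N to "first alert at minute N", as in the report quoted above.

```shell
# Hypothetical helper: print the minutes at which one recipient should be
# alerted, given its minimum duration, its repeat interval and a horizon.
alert_times() {
    first="$1"; repeat="$2"; horizon="$3"
    t="$first"
    times=""
    while [ "$t" -le "$horizon" ]; do
        times="$times $t"
        t=$((t + repeat))
    done
    echo "${times# }"
}

alert_times 10 20 60   # user1: DURATION>10 REPEAT=20
alert_times 20 20 60   # user2: DURATION>20 REPEAT=20
```

With independent timers the second recipient would be alerted at 20, 40 and 60 minutes. The report instead has it going out only together with the first recipient, at 30 and 50 minutes, which fits the Changes entry about the settings of the first recipient being used.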
On Wed, Aug 17, 2005 at 10:37:14AM +0200, Peter Welter wrote:
It should pick up the old duration from the checkpoint file. What's your hobbitd and hobbitd_alert commands in hobbitlaunch.cfg ?
[bbpage]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
NEEDS hobbitd
CMD hobbitd_channel --channel=page --log=$BBSERVERLOGS/page.log hobbitd_alert --trace=/tmp/alerttrace.log
The directory /usr/lib/hobbit/server/tmp/ contains:
-rw-r--r-- 1 hobbit hobbit 318947 2005-08-17 10:32 hobbitd.chk
OK, you're running without the alert-module checkpoint file. There are two things we can do:
Add "--checkpoint-file=$BBTMP/alert.chk --checkpoint-interval=600" to the hobbitd_alert command in hobbitlaunch.cfg. That way it will remember all active alerts when you restart Hobbit.
When a new alert was first seen (also after a restart of Hobbit), the duration was reset to 0 - instead of using the information Hobbit already had about when the status change occurred. I've changed this in the code, so that it picks up the duration of the alert from the timestamp we keep for when the last status change happened.
Regards, Henrik
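A quick way to see whether an installation is affected is to check the hobbitd_alert command for the checkpoint option. The following is a minimal sketch, assuming the hobbitd_alert invocation and its options sit on one line of hobbitlaunch.cfg; the function name `alert_has_checkpoint` is hypothetical.

```shell
# Hypothetical helper: succeed if the hobbitd_alert command in a
# hobbitlaunch.cfg-style file carries a --checkpoint-file option.
alert_has_checkpoint() {
    cfg="$1"   # path to hobbitlaunch.cfg
    # Look only at lines that invoke hobbitd_alert, then check for the option.
    grep 'hobbitd_alert' "$cfg" | grep -q -- '--checkpoint-file='
}
```

For example, `alert_has_checkpoint /path/to/hobbitlaunch.cfg || echo "no alert checkpoint configured"`, where the path is whatever the local installation uses.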
Hello Henrik,
two things we can do:
- Add "--checkpoint-file=$BBTMP/alert.chk --checkpoint-interval=600" to the hobbitd_alert command in hobbitlaunch.cfg. That way it will remember all active alerts when you restart Hobbit.
I'll do that asap (coming Monday). That will certainly resolve this issue.
- When a new alert was first seen (also after a restart of Hobbit), the duration was reset to 0 - instead of using the information Hobbit already had about when the status change occurred. I've changed this in the code, so that it picks up the duration of the alert from the timestamp we keep for when the last status change happened.
OK, but that useful addition is for new/upcoming releases.
However, I think I found out why the entire problem showed up in the first place. I had an alert config that first mailed on an occurring event and, if that was not dealt with properly, ran a pager script 20 minutes later. After an evening of applying (OS) patches, a reboot etc., it did not work anymore. Eventually I thought that it had to do with an alert-config modification, which led to this email conversation.
As suggested, I checked the alerttrace.log, but could not find a reason why this problem happened (I changed the pager script to mail, but with no result). It *does* work fine when *all* the alerts are processed at the same time!
Exploring the mailing list and the Changes file for each version, I think it comes down to a known bug in Hobbit that is to be fixed in 4.1.2; see my mail from August 19th, 11:42.
Since we are running 4.0.4, I'm wondering what the wise thing to do is. The workaround works fine for now (we are a 24*7 university), so I'm thinking of waiting until 4.1.2 reaches stable status, since 4.1.1 does not fix this particular bug.
Regards, Peter
participants (2)
- henrik@hswn.dk
- peter.welter@gmail.com