[hobbit] Alert Rules - DURATION not working
As you can see from the output below, a DURATION of '15m' translates to
653760. Regardless, we cannot get DURATION to work under any
circumstances. I sent some earlier e-mails with some logging output.
Either we have something configured wrong or DURATION is broken?
All we would like is a rather simple rule: for any host that is red for longer than 15 minutes, send an e-mail/page and repeat every 8 hours (shift change). The last rule is just a hack to get around DURATION not working for us. Perhaps I do not understand the config rules?
Rules:
HOST=% COLOR=yellow
    MAIL somebody at somehost.com REPEAT=8h DURATION>15
    MAIL anybody at anyhost.com REPEAT=8h DURATION>15m
    COLOR=red EXSERVICE=cpu,mem,tl1am SCRIPT /export/home/hobbit/server/bin/delay_page PAGE REPEAT=8h RECOVERED
Debug:
HOST=% COLOR=yellow
    MAIL somebody at somehost.com REPEAT=480 COLOR=yellow DURATION>15
    MAIL anybody at anyhost.com REPEAT=480 COLOR=yellow DURATION>653760
    EXSERVICE=cpu,mem,tl1am COLOR=red SCRIPT /export/home/hobbit/server/bin/delay_page PAGE FORMAT=SCRIPT REPEAT=480 COLOR=red RECOVERED
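For reference, the debug output turns REPEAT=8h into REPEAT=480, which suggests the values are held in minutes. Under that reading, DURATION>15m would be expected to come out as 15, not 653760. The sketch below shows the conversion one would expect; `to_minutes` and `UNIT_MINUTES` are hypothetical names for illustration only and are not taken from the Hobbit source.

```python
# Hypothetical helper for illustration; this is not Hobbit's own parser.
UNIT_MINUTES = {"m": 1, "h": 60}  # only the suffixes that appear in this thread


def to_minutes(spec: str) -> int:
    """Convert a time spec such as '15', '15m' or '8h' to minutes."""
    spec = spec.strip()
    if spec and spec[-1] in UNIT_MINUTES:
        return int(spec[:-1]) * UNIT_MINUTES[spec[-1]]
    return int(spec)  # a bare number is taken as minutes


for spec in ("8h", "15", "15m"):
    print(spec, "->", to_minutes(spec), "minutes")
```

With this reading, "8h" gives 480 (matching the debug output), while both "15" and "15m" give 15.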
On Tue, Feb 01, 2005 at 01:02:58AM +0000, David Gore wrote:
As you can see from the out put below a DURATION of '15m' translates to 653760.
I'll look into that
Either we have something configured wrong or DURATION is broken?
HOST=% COLOR=yellow MAIL somebody at somehost.com REPEAT=8h DURATION>15 MAIL anybody at anyhost.com REPEAT=8h DURATION>15m
"HOST=%" is definitely wrong. "HOST=%.*" is what you want.
Henrik
Henrik,
Thank you so much for replying. I caused a yellow alarm for procs on host rsoimpm1, and I am expecting the rule to fire after 15 minutes. Here is what I see from the log file in more detail:
2005-02-01 15:17:29 hobbitd_alert: Got message 37 @@page#37|1107271049.602362|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107272849|yellow|green|1107271049|CAY/pmservers|947420
2005-02-01 15:17:29 Got page message from rsoimpm1:procs
2005-02-01 15:17:29 Alert status changed from 0 to 1
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs %.*:(NULL):(NULL)
2005-02-01 15:17:29 pcre_exec returned 1
2005-02-01 15:17:29 Checking explicit color setting 10000000020 against 4 gives 1
2005-02-01 15:17:29 Found a first matching rule
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<900
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<39225600
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 Checking explicit color setting 10000000040 against 4 gives 0
2005-02-01 15:17:29 No more secondary matching rule
2005-02-01 15:17:29 1 alerts to go
2005-02-01 15:17:29 Compiling regex .*
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs %.*:(NULL):(NULL)
2005-02-01 15:17:29 pcre_exec returned 1
2005-02-01 15:17:29 Checking explicit color setting 10000000020 against 4 gives 1
2005-02-01 15:17:29 Found a first matching rule
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<900
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<39225600
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 send_alert rsoimpm1:procs state 0
2005-02-01 15:17:29 Checking explicit color setting 10000000040 against 4 gives 0
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs %.*:(NULL):(NULL)
2005-02-01 15:17:29 No more secondary matching rule
2005-02-01 15:17:29 pcre_exec returned 1
2005-02-01 15:17:29 Checking explicit color setting 10000000020 against 4 gives 1
2005-02-01 15:17:29 Found a first matching rule
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<900
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 event start: 1107271049, failed minduration 0<39225600
2005-02-01 15:17:29 criteriamatch rsoimpm1:procs (NULL):(NULL):(NULL)
2005-02-01 15:17:29 Checking explicit color setting 10000000040 against 4 gives 0
2005-02-01 15:17:29 No more secondary matching rule
I caused a yellow alarm at 15:17, so far OK. Alert status changed, criteria match, regex match, color match, found rule, checking minduration, which fails because 15 minutes have not yet passed. Sorry, I did add to the debug print statement in the source code.
2005-02-01 15:22:29 hobbitd_alert: Got message 58 @@page#58|1107271349.301483|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273149|yellow|yellow|1107271049|CAY/pmservers|947420
2005-02-01 15:22:29 Got page message from rsoimpm1:procs
2005-02-01 15:22:29 0 alerts to go
2005-02-01 15:27:29 hobbitd_alert: Got message 79 @@page#79|1107271649.155212|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273449|yellow|yellow|1107271049|CAY/pmservers|947420
2005-02-01 15:27:29 Got page message from rsoimpm1:procs
2005-02-01 15:27:29 0 alerts to go
2005-02-01 15:32:28 hobbitd_alert: Got message 101 @@page#101|1107271948.980583|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273748|yellow|yellow|1107271049|CAY/pmservers|947420
2005-02-01 15:32:28 Got page message from rsoimpm1:procs
2005-02-01 15:32:28 0 alerts to go
2005-02-01 15:37:28 hobbitd_alert: Got message 123 @@page#123|1107272248.884069|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107274048|yellow|yellow|1107271049|CAY/pmservers|947420
2005-02-01 15:37:28 Got page message from rsoimpm1:procs
2005-02-01 15:37:28 0 alerts to go
So it's like nothing happens afterwards? Hopefully, I got all the relevant parts of the log file. I didn't want the posting to be too long. Any ideas?
~David Gore
To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk
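As a cross-check on the rsoimpm1 log excerpts in this thread, the elapsed time for each @@page message can be worked out from the message fields. This is only a sketch: the field positions (field 1 as the message timestamp, field 9 as the event start) are inferred from the excerpts, where field 9 equals the "event start:" value in the debug output. They are not taken from Hobbit documentation.

```python
# Field positions are inferred from the log excerpts in this thread.
MESSAGES = [
    "@@page#37|1107271049.602362|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107272849|yellow|green|1107271049|CAY/pmservers|947420",
    "@@page#58|1107271349.301483|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273149|yellow|yellow|1107271049|CAY/pmservers|947420",
    "@@page#79|1107271649.155212|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273449|yellow|yellow|1107271049|CAY/pmservers|947420",
    "@@page#101|1107271948.980583|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107273748|yellow|yellow|1107271049|CAY/pmservers|947420",
    "@@page#123|1107272248.884069|166.34.57.239|rsoimpm1|procs|166.34.57.239|1107274048|yellow|yellow|1107271049|CAY/pmservers|947420",
]


def elapsed_seconds(message: str) -> int:
    """Seconds between the event start and the time the message was sent."""
    fields = message.split("|")
    received = int(fields[1].split(".")[0])  # whole seconds of the timestamp
    event_start = int(fields[9])             # same value as "event start:" in the log
    return received - event_start


MINDURATION = 15 * 60  # DURATION>15 appears in the log as 900 seconds

for msg in MESSAGES:
    age = elapsed_seconds(msg)
    print(msg.split("|")[0], age, "reached" if age >= MINDURATION else "not reached")
```

The elapsed values come out as 0, 300, 600, 899 and 1199 seconds, so by message 123 the 900-second threshold has been passed even though the log still reports "0 alerts to go". The second threshold, 39225600, is 653760 * 60, which is the mis-parsed '15m' value expressed in seconds.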
David Gore wrote:
So it's like nothing happens afterwards? Hopefully, I got all the relevant parts of the log file. I didn't want the posting to be too long. Any ideas?
Have you made any progress on this? I can't get the DURATION variable to work either, and this time around I'm sure a typo is not the reason for not getting an alert email.
Here's what I've done and what I see:
I added the --debug switch to hobbitd_alert in hobbitlaunch.cfg:
CMD hobbitd_channel --channel=page --log=$BBSERVERLOGS/page.log hobbitd_alert --debug
My rule from hobbit-alerts.cfg:
HOST=$FOUND_SYS MAIL broken at nandomedia.com SERVICE=procs COLOR=red DURATION>5 REPEAT=5
After I add this rule, I restart hobbit. I read on the list that restarting isn't necessary, but it has been my experience that changes made to hobbit-alerts.cfg do not always take effect unless hobbit is restarted.
Excerpts from page.log:
(note: I replaced a valid IP address with 0s in the 3rd field of the @@page line of this excerpt)
2005-02-02 08:11:12 hobbitd_alert: Got message 4 @@page#4|1107349872.146928|0.0.0.0|foundry01.nandomedia.com|procs|0.0.0.0|1107351672|red|red|1107227163|web6|315344
2005-02-02 08:11:12 Got page message from foundry01.nandomedia.com:procs
2005-02-02 08:11:12 Alert status changed from 0 to 1
2005-02-02 08:11:12 criteriamatch foundry01.nandomedia.com:procs %(foundry.*).nandomedia.com:(NULL):(NULL)
2005-02-02 08:11:12 pcre_exec returned 2
2005-02-02 08:11:12 Checking default color setting 70 against 5 gives 1
2005-02-02 08:11:12 Found a first matching rule
2005-02-02 08:11:12 criteriamatch foundry01.nandomedia.com:procs (NULL):(NULL):procs
2005-02-02 08:11:12 failed minduration 0<300
So it looks like the duration variable was checked, which is good. The next time I see this server in the page.log, the min duration isn't checked.
2005-02-02 08:16:12 hobbitd_alert: Got message 16 @@page#16|1107350172.517352|0.0.0.0|foundry01.nandomedia.com|procs|0.0.0.0|1107351972|red|red|1107227163|web6|315344
2005-02-02 08:16:12 Got page message from foundry01.nandomedia.com:procs
2005-02-02 08:16:12 0 alerts to go
2005-02-02 08:17:12 0 alerts to go
This message will repeat from now on, varying only in the message count #, but alerts are not sent out:
-bash-2.05b$ grep foundry data/acks/notifications.log
-bash-2.05b$
I dunno what else to investigate at this point.
Tom
Tom,
No, I haven't found a solution. I was hoping Henrik might find something. Without a doubt we can launch rules every time WITHOUT a DURATION. One of my co-workers has put in a script that launches every time we get an alert and just waits 15 minutes before it sends out a page/email, unless of course the host has recovered.
Let me know if you find out anything yourself, Tom. It is always possible we have something configured wrong. I am running Hobbit on Solaris 9, if it matters. I also chose not to monitor any 's' (secure) services like https during setup.
David Gore (v965-3670)
Enhanced Technology Support (ETS)
Network Management Systems (NMS)
IMPACT Transport Team Lead - SCSA, SCNA
Page: 1-800-PAG-eMCI pin 1406090
Vnet: 965-3676
Henrik, Tom,
My 15 minute DURATION fired. I don't think it is a coincidence that it fired at 1 day and 5 hours. I think the possible bug mentioned earlier, where specifying 15m yields a particularly large number, is probably where the problem is.
On Wed, Feb 02, 2005 at 03:26:25PM +0000, David Gore wrote:
Henrik, Tom,
My 15 minute DURATION fired. I don't think it is a coincidence that it fired at 1 day and 5 hours. I think the earlier possible bug where when you specify 15m you get a particularly large number is probably where the problem is.
I tend to agree, but I've been too busy with "real" work these past days, so I haven't had time to investigate it.
And the server that crashed Sunday kept me busy over the weekend. But it definitely showed that the repeat thing works ... I got about 900 mails for different services that failed because my external gateway was down.
Henrik
On Wed, Feb 02, 2005 at 03:26:25PM +0000, David Gore wrote:
My 15 minute DURATION fired. I don't think it is a coincidence that it fired at 1 day and 5 hours. I think the earlier possible bug where when you specify 15m you get a particularly large number is probably where the problem is.
I've been testing with DURATION>10 since I last posted to the list, which showed up as 600s and was only tested once:
"failed minduration 0<600"
I would've expected to see something like this:
Start hobbit, it runs through all the alerts at time=0, when "duration=0"
page.log <snip> "failed minduration 0<600"
5 mins later, when it checks again with "duration=300"
<snip> "failed minduration 300<600"
5 mins later, duration=minduration and it doesn't fail the test, so it's time to send an alert.
Or, quite possibly, I don't know what I am talking about.
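The sequence Tom expects can be written out as a small sketch. It assumes the check is repeated on every 5-minute status message and that the test fails while the elapsed time is below the threshold, as the "failed minduration 0<600" wording suggests. `check_duration` is a hypothetical name, not a function from hobbitd_alert.

```python
def check_duration(elapsed: int, minduration: int) -> str:
    """Mimic the debug line seen in page.log for one duration test."""
    if elapsed < minduration:
        return f"failed minduration {elapsed}<{minduration}"
    return "duration reached, alert may be sent"


MINDURATION = 10 * 60  # DURATION>10 showed up as 600 seconds

# One test per 5-minute status update, starting when the event begins.
for elapsed in (0, 300, 600):
    print(f"t={elapsed}s: {check_duration(elapsed, MINDURATION)}")
```

The first two passes print the "failed minduration" lines quoted above, and the third no longer fails the test.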
Henrik Stoerner wrote:
I tend to agree, but I've been too busy with "real" work these past days, so I haven't had time to investigate it.
If I can be of any assistance in helping debug this by testing patches or alert conditions, just ask.
But it definitely showed that the repeat thing works ... I got about 900 mails for different services that failed because my external gateway was down.
:) Nothing better than a real-world event to stress test a monitoring system...
On Wed, Feb 02, 2005 at 08:56:22AM -0500, Tom Georgoulias wrote:
HOST=$FOUND_SYS MAIL broken at nandomedia.com SERVICE=procs COLOR=red DURATION>5 REPEAT=5
After I add this rule, I restart hobbit. I read on the list that restarting isn't necessary, but it has been my experience that changes made to hobbit-alerts.cfg do not always get put into effect unless hobbit is restarted.
It shouldn't be needed, but it doesn't harm.
2005-02-02 08:11:12 criteriamatch foundry01.nandomedia.com:procs (NULL):(NULL):procs
2005-02-02 08:11:12 failed minduration 0<300
OK
2005-02-02 08:16:12 Got page message from foundry01.nandomedia.com:procs
2005-02-02 08:16:12 0 alerts to go
And this looks suspicious.
What's supposed to happen is this: after the alert is first reported to the hobbitd_alert module, the module keeps track of when the next alert is due (the REPEAT interval comes into play here), and if no alerts are due then you get the "0 alerts to go" message.
So something messes up the timekeeping, and we never get around to testing whether the DURATION triggers after the first attempt.
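The intended timekeeping described here can be modelled roughly as follows. This is a hypothetical sketch of the behaviour as described in this message, not the hobbitd_alert code: `ActiveAlert`, `alerts_to_go` and `evaluate` are invented names, and the choice to leave an alert due until its DURATION has been reached is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ActiveAlert:
    event_start: int  # seconds; when the status left green
    next_due: int     # seconds; earliest time the rules are evaluated again


def alerts_to_go(alerts, now):
    """Count alerts whose next evaluation time has arrived."""
    return sum(1 for a in alerts if a.next_due <= now)


def evaluate(alert, now, minduration, repeat):
    """Return True if a notification is sent at time `now`."""
    if now < alert.next_due:
        return False                   # nothing due yet
    if now - alert.event_start < minduration:
        return False                   # stays due, so DURATION is re-tested
    alert.next_due = now + repeat      # sent; wait one REPEAT interval
    return True


alert = ActiveAlert(event_start=0, next_due=0)
for now in range(0, 1501, 300):
    due = alerts_to_go([alert], now)
    sent = evaluate(alert, now, minduration=900, repeat=8 * 3600)
    print(f"t={now:4d}s  {due} alerts to go  sent={sent}")
```

In this model the alert stays due at 0, 300 and 600 seconds, is sent at 900 seconds, and only then does "0 alerts to go" appear until the 8-hour REPEAT interval has passed.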
[after looking over the code for 10 minutes]
I think I've got it, but there have been quite a few changes to various bits, so I don't want to send one-line fixes now. I'll come up with a proper full package, which will also include fixes for many of the other bugs that have been reported for beta6.
Henrik
participants (3)
- David.Gore@mci.com
- henrik@hswn.dk
- tgeorgoulias@nandomedia.com