acknowledgment for a yellow alert doesn't seem to work
Figures, just as I'm about to host a session to show off Hobbit to the support team and roll it out into production, I hit a situation that has me scratching my head.
I've got an add-on script, bb-memory, that monitors the memory on my Linux clients, and on one system it is in the yellow state at the moment. A Hobbit alert was generated, it was sent to the appropriate email address, yada yada yada.
When I try to use the ack code from the subject line to acknowledge the alert, it doesn't work. It doesn't show up in ~/data/acks/acklog or on the web page, and using "!xxxxxx" doesn't have any effect either. Within the same hour, I had a red alert from a different host (disk usage) and I was able to acknowledge it successfully. Thought I'd be clever and clean up just enough to make it turn yellow, which it did, and was able to ack the yellow alert as well. So it didn't have anything to do with color states.
Using hobbit 4.0, with the following patches from the list:
bbnet-iponly.patch hobbit-4.0.1-diskhistlog.patch eventlog-crash.patch post-4.0-includes.patch
Any ideas on what to check?
Tom
Tom Georgoulias wrote:
When I try to use the ack code from the subject line to acknowledge the alert, it doesn't work. Doesn't show up in ~/data/acks/acklog or on the web page, and using "!xxxxxx" doesn't have any effect either.
Any ideas on what to check?
I've been doing more troubleshooting on this, but I still haven't resolved it.
I've created a large test file and filled up the disk partition to 95%, which generates a yellow alert.
Then I used this command to acknowledge the alert:
~/hobbit/server/bin/bb 127.0.0.1 "hobbitdack 158136 10 command line acknowledgment"
Got this entry in my data/acks/acklog file:
1112878495 158136 10 158136 np_filename_not_used radm200p.nandomedia.com.disk yellow command line acknowledgment
So that worked.
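For anyone who wants to check acklog entries from a script, here is a minimal Python sketch that splits a line like the one above. The field meanings are an assumption inferred from this single entry and the command that produced it (epoch timestamp, ack code, duration in minutes, ack code again, a placeholder, host.test, color, free text); they are not taken from the Hobbit documentation, and `parse_acklog_line` is a hypothetical helper name.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AckLogEntry:
    logged_at: datetime      # first field, read as a Unix epoch (UTC)
    cookie: int              # ack code used in the hobbitdack message
    duration_minutes: int    # assumed to be minutes, as in the command
    target: str              # "host.test" exactly as written in the log
    color: str
    message: str


def parse_acklog_line(line: str) -> AckLogEntry:
    # Assumed layout (inferred, not documented here):
    # epoch cookie duration cookie placeholder host.test color message...
    parts = line.split(None, 7)
    if len(parts) < 7:
        raise ValueError("unexpected acklog line: %r" % line)
    epoch, cookie, duration, _cookie_again, _placeholder, target, color = parts[:7]
    message = parts[7].strip() if len(parts) > 7 else ""
    return AckLogEntry(
        logged_at=datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        cookie=int(cookie),
        duration_minutes=int(duration),
        target=target,
        color=color,
        message=message,
    )
```

Feeding it the entry above gives cookie 158136, a duration of 10 and the color yellow, which matches the command that was sent.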
Then I waited 10 mins, got my next page and tried to acknowledge it via the web, with maint.pl. Didn't work.
Do the cookies have a lifespan or a one-time use policy?
Tom
I'm seeing a similar problem, however my issue is with a red alert. In my case, acks are hit or miss. Occasionally, it will work, but it often fails. The apache logs show that the ack was received, but nothing shows up in the acklog.
So far I don't see a pattern. Any ideas to test this further?
-Dan
To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk
-- Daniel Deighton <dan at deightime.net>
Daniel Deighton wrote:
I'm seeing a similar problem, however my issue is with a red alert. In my case, acks are hit or miss. Occasionally, it will work, but it often fails. The apache logs show that the ack was received, but nothing shows up in the acklog.
So far I don't see a pattern. Any ideas to test this further?
I'm still trying to narrow it down myself. I know that you have to use the most recent cookie/ack code, so if you get multiple pages, use the last one.
Try using this command if it isn't working using the CGI form from the hobbit webpage:
~/hobbit/server/bin/bb 127.0.0.1 "hobbitdack ACKCODE TIME EXPLANATION MSG"
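If the command-line ack ends up being wrapped in a script, a small helper can assemble the arguments. This is only a sketch: the message format is the one shown above, while `build_ack_command` and its default path and server address are hypothetical names and values chosen to mirror the commands in this thread.

```python
import os
from typing import List


def build_ack_command(cookie: int, minutes: int, explanation: str,
                      bb_path: str = "~/hobbit/server/bin/bb",
                      server: str = "127.0.0.1") -> List[str]:
    """Return the argument list for: bb SERVER "hobbitdack ACKCODE TIME MSG"."""
    if minutes <= 0:
        raise ValueError("ack duration must be positive")
    # The whole hobbitdack message is passed to bb as one argument.
    message = "hobbitdack %d %d %s" % (cookie, minutes, explanation)
    return [os.path.expanduser(bb_path), server, message]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell quoting problems with the explanation text.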
Tom Georgoulias wrote:
Daniel Deighton wrote:
I'm seeing a similar problem, however my issue is with a red alert. In my case, acks are hit or miss. Occasionally, it will work, but it often fails. The apache logs show that the ack was received, but nothing shows up in the acklog.
So far I don't see a pattern. Any ideas to test this further?
I'm still trying to narrow it down myself. I know that you have to use the most recent cookie/ack code, so if you get multiple pages, use the last one.
Seems strange, but it appears that once a previously acknowledged alert expires, an email is sent out again that has the old ack code in the subject.
If you wait until the next alert email, it'll have a new ack code.
If I use it via the webpage, it works again.
Just an observation.
Tom
On Thu, Apr 07, 2005 at 10:09:09AM -0400, Tom Georgoulias wrote:
Do the cookies have a lifespan or a one-time use policy?
Yes, they are only valid for 30 minutes after they've been generated.
Could you try the attached patch? It causes hobbitd to log if it receives an ack-message that is discarded because the cookie was not valid.
Also, if you want to check what the current cookie value is, you can run
bb 127.0.0.1 "hobbitdboard host=HOSTNAME test=TESTNAME fields=hostname,testname,cookie"
It will respond with
HOSTNAME|TESTNAME|1029348
The cookie is the third ('|'-separated) field.
Regards, Henrik
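As a small illustration of reading that response from a script, the sketch below pulls the cookie out of the '|'-separated output. It assumes exactly the three fields requested above (hostname, testname, cookie) and a numeric cookie; `parse_cookie_response` is a hypothetical helper, not part of Hobbit.

```python
from typing import Dict, Tuple


def parse_cookie_response(response: str) -> Dict[Tuple[str, str], int]:
    """Map (hostname, testname) to the current cookie value."""
    cookies = {}
    for line in response.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the response
        fields = line.split("|")
        if len(fields) != 3:
            raise ValueError("expected 3 '|'-separated fields: %r" % line)
        host, test, cookie = fields
        # Assumes the cookie field is numeric, as in the example output.
        cookies[(host, test)] = int(cookie)
    return cookies
```

For the sample response shown above, the result maps ("HOSTNAME", "TESTNAME") to 1029348.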
Henrik Stoerner wrote:
On Thu, Apr 07, 2005 at 10:09:09AM -0400, Tom Georgoulias wrote:
Do the cookies have a lifespan or a one-time use policy?
Yes, they are only valid for 30 minutes after they've been generated.
Thanks for clarifying that. I was under the impression that a cookie was valid as long as the alert remained in that state or the ack period was still valid, no matter how long that was.
This seems to explain why I couldn't ack the yellow alert I mentioned at the beginning of this thread. I'm sure I didn't try to acknowledge the alert until a couple of hours after it first came through, and the alerts aren't sent unless the condition persists for more than 45 mins, and resends are every hour. So the cookie must've gone stale by then.
Could you try the attached patch? It causes hobbitd to log if it receives an ack-message that is discarded because the cookie was not valid.
Done. I'll report back with my findings.
Also, if you want to check what the current cookie value is, you can run
bb 127.0.0.1 "hobbitdboard host=HOSTNAME test=TESTNAME fields=hostname,testname,cookie"
Very useful command. I've added it to my notes. ;)
Tom
Tom Georgoulias wrote:
Henrik Stoerner wrote:
Could you try the attached patch? It causes hobbitd to log if it receives an ack-message that is discarded because the cookie was not valid.
Patch seems to work.
I have a yellow alert on a system.
Check the cookie:
-bash-2.05b$ ~/hobbit/server/bin/bb 127.0.0.1 "hobbitdboard host=radm200p.nandomedia.com test=disk fields=hostname,testname,cookie"
radm200p.nandomedia.com|disk|406429
Wait a while, then check again:
-bash-2.05b$ ~/hobbit/server/bin/bb 127.0.0.1 "hobbitdboard host=radm200p.nandomedia.com test=disk fields=hostname,testname,cookie"
radm200p.nandomedia.com|disk|712535
Use the old cookie to try and ack the alert, then check hobbitd.log:
bash-2.05b$ tail hobbitd.log
2005-04-06 14:55:48 Setup complete
2005-04-06 15:01:33 Setup complete
2005-04-07 13:21:55 Setup complete
2005-04-07 13:36:53 Cookie 406429 not found, dropping ack
Stale cookie didn't work, event was logged.
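With the patch applied, the dropped acks can be pulled out of hobbitd.log by a script. The sketch below matches only the "Cookie N not found, dropping ack" wording seen in the log above; the regular expression and the `dropped_acks` helper are assumptions for illustration, not part of Hobbit.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Matches lines such as:
# 2005-04-07 13:36:53 Cookie 406429 not found, dropping ack
DROPPED_ACK = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Cookie (\d+) not found, dropping ack$"
)


def dropped_acks(log_lines: Iterable[str]) -> List[Tuple[datetime, int]]:
    """Return (timestamp, cookie) for every ack dropped for a stale cookie."""
    hits = []
    for line in log_lines:
        match = DROPPED_ACK.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            hits.append((when, int(match.group(2))))
    return hits
```

Calling `dropped_acks(open("hobbitd.log"))` would list each rejected ack code along with the time hobbitd discarded it.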
So now the real issue for me is how to use this piece of info about cookie lifespans when I put Hobbit into production. I don't want the support folks to have to log into my hobbit server and check for the latest cookie value before acknowledging an alert. I've also got a range of time & repeat delays for my alerts, depending on what system parameter is being measured, and I'd hate to have to use <30 mins across the board.
Something strange happened on my server. It seems that a cookie expired after only 9 minutes (or less). I've included the pertinent info below. What would cause this behavior?
-Dan
Email Notifications Headers
Subject: Hobbit [753059] sundeigh.deightime.net:meta CRITICAL (RED)
Date: Thu, 7 Apr 2005 16:29:54 -0400 (EDT)
notifications.log
Thu Apr 7 15:59:54 2005 sundeigh.deightime.net.meta (1.1.1.1) dan- hobbit at deightime.net 1112903993 999
Thu Apr 7 16:29:54 2005 sundeigh.deightime.net.meta (1.1.1.1) dan- hobbit at deightime.net 1112905794 999
hobbitd.log
2005-04-07 16:38:06 Cookie 753059 not found, dropping ack
After the ack failed, I ran the following (thanks for the patch, Henrik):
./bb 127.0.0.1 "hobbitdboard host=sundeigh.deightime.net test=meta fields=hostname,testname,cookie"
sundeigh.deightime.net|meta|615614
date (run right after the above bb command)
Thu Apr 7 16:42:05 EDT 2005
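The "9 minutes (or less)" figure can be checked from the logs. This sketch assumes the number before 999 in notifications.log is the Unix time the notification was sent, and that hobbitd.log timestamps are server local time (EDT, UTC-4, per the Date header above); `seconds_from_notification_to_ack` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server clock is on EDT (UTC-4), as the mail header suggests.
EDT = timezone(timedelta(hours=-4))


def seconds_from_notification_to_ack(notification_epoch: int,
                                     ack_logged_local: str,
                                     tz: timezone = EDT) -> float:
    """Seconds between a notification and the log line that dropped the ack."""
    sent = datetime.fromtimestamp(notification_epoch, tz=tz)
    dropped = datetime.strptime(
        ack_logged_local, "%Y-%m-%d %H:%M:%S"
    ).replace(tzinfo=tz)
    return (dropped - sent).total_seconds()


# Second notification (1112905794) versus the dropped ack at 16:38:06.
elapsed = seconds_from_notification_to_ack(1112905794, "2005-04-07 16:38:06")
print(elapsed / 60)  # minutes between the alert mail and the rejected ack
```

Under those assumptions the gap is 492 seconds, a little over 8 minutes, so the cookie from the 16:29:54 mail was already gone well inside the 30-minute lifetime.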
-- Daniel Deighton <dan at deightime.net>
On Thu, Apr 07, 2005 at 05:22:26PM -0400, Daniel Deighton wrote:
Something strange happened on my server. It seems that a cookie expired after only 9 minutes (or less). I've included the pertinent info below. What would cause this behavior?
OK, I think I've found the root cause of this issue, and it is fundamentally a design flaw in how the cookies are generated.
Currently, a cookie is generated the moment a status changes from green to yellow/red/purple, and gets a lifetime of 30 minutes. But the cookie may not be delivered in an alert until some time after, depending on any DURATION>x settings in the alert config - and by then the cookie may be close to expiring. Combined with alerts only being repeated every 30 minutes (by default), you can end up in a situation where the cookie you get in the alert message will only be valid for a minute or so.
The *real* solution is to change the cookie-generation so it happens when the alert is sent out. That requires some serious changes to the code - so I'll postpone that a bit and make that together with the escalation-alert handling that is planned for 4.1.
So for now, the attached patch just changes the lifetime of a cookie to 24 hours. That should make it work.
Regards, Henrik
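The timing problem described above comes down to simple arithmetic. The following Python sketch is a deliberately simplified model, assuming a cookie is created once at the status change and ignoring any later regeneration; `remaining_validity_minutes` is a hypothetical helper and not Hobbit code.

```python
def remaining_validity_minutes(cookie_lifetime: int,
                               minutes_since_generated: int) -> int:
    """Minutes a cookie is still usable when the alert mail arrives.

    Simplified model: the cookie is generated when the status changes and
    is valid for cookie_lifetime minutes; the alert that carries it is
    delivered minutes_since_generated minutes later.
    """
    if cookie_lifetime < 0 or minutes_since_generated < 0:
        raise ValueError("times must be non-negative")
    return max(0, cookie_lifetime - minutes_since_generated)


# 30-minute lifetime, alert delivered 29 minutes after the status change:
print(remaining_validity_minutes(30, 29))        # 1
# Same lifetime with a 45-minute delay before the first alert:
print(remaining_validity_minutes(30, 45))        # 0
# Lifetime raised to 24 hours, same 45-minute delay:
print(remaining_validity_minutes(24 * 60, 45))   # 1395
```

In this model the 24-hour lifetime leaves plenty of margin for the delay and repeat intervals mentioned in the thread, which is why the interim patch should make acks work until cookie generation is moved to alert time.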
participants (3)
- dan@deightime.net
- henrik@hswn.dk
- tgeorgoulias@mcclatchy.com