[hobbit] No pages when going from yellow to red
On Tue, Nov 01, 2005 at 05:25:33PM -0500, Larry.Barber at usda.gov wrote:
If you use separate rules for yellow and red alerts, say a hobbit-alerts.cfg that looked something like:
HOST a_host:
    MAIL xxxx at yyy.com COLOR=red DELAY=0 REPEAT=30m RECOVERED
    MAIL xxxx at yyy.com COLOR=yellow DELAY=0 REPEAT=30m RECOVERED
would you then get a separate email when the condition turned red at 22:45 below?
No, and that might be something that could change. The repeat-checking code currently identifies an alert by the combination of hostname, servicename and recipient; I could easily change that so a separate line in the config-file would result in a new set of repeat-checks.
22:05 Test goes yellow - alert (yellow) is sent
22:35 Test still yellow - repeat alert (yellow) is sent
22:45 Test goes red. No alert is sent because it is only 10 minutes since the last alert went out.
23:05 Test still red - now an alert (red) is sent.
Regards, Henrik
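The repeat check described in the reply above can be illustrated with a minimal Python sketch. It is not Hobbit's actual code: the `RepeatTracker` class, the `at` helper and the "disk" service name are hypothetical, and the only behaviour taken from the thread is that an alert is identified by hostname, servicename and recipient, with the color left out of the key.

```python
from datetime import datetime, timedelta


class RepeatTracker:
    """Hypothetical model: remembers the last alert per (host, service, recipient)."""

    def __init__(self, repeat):
        self.repeat = repeat
        self.last_sent = {}

    def should_send(self, host, service, recipient, now):
        key = (host, service, recipient)  # the color is NOT part of the key
        last = self.last_sent.get(key)
        if last is not None and now - last < self.repeat:
            return False  # still inside the REPEAT interval
        self.last_sent[key] = now
        return True


def at(hhmm):
    """Helper for the example: a timestamp on the day of the original post."""
    return datetime.strptime("2005-11-01 " + hhmm, "%Y-%m-%d %H:%M")


tracker = RepeatTracker(timedelta(minutes=30))  # REPEAT=30m
timeline = [("22:05", "yellow"), ("22:35", "yellow"),
            ("22:45", "red"), ("23:05", "red")]

results = []
for hhmm, color in timeline:
    sent = tracker.should_send("a_host", "disk", "xxxx at yyy.com", at(hhmm))
    results.append((hhmm, color, sent))
    print(hhmm, color, "alert sent" if sent else "suppressed")
```

Because both MAIL rules share the same recipient, the red status at 22:45 falls inside the interval started by the yellow repeat at 22:35 and is suppressed, which matches the timeline quoted above.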
Okay, my recipients are different, but I'm using scripts instead of MAIL recipients:
SCRIPT /usr/local/bin/scripts/hobbit-mail UNIX_ADMIN SERVICE=%(cpu|disk|entstat|procs|ssh|telnet|vmio) COLOR=yellow,purple REPEAT=30d RECOVERED
SCRIPT /usr/local/bin/scripts/hobbit-mailpage $PATVAUGHAN_PAGERMAIL SERVICE=%(cpu|disk|entstat|memory|procs|ssh|telnet|vmio) COLOR=red TIME=12345:0800:1700 REPEAT=30d
No, and that might be something that could change. The repeat-checking code currently identifies an alert by the combination of hostname, servicename and recipient; I could easily change that so a separate line in the config-file would result in a new set of repeat-checks.
Is this something that might make it into the next version? I'm almost ready to take a snapshot if I have to. This bit me again today. It seems to me that the most intelligent change would be to generate a new repeat-check for every line in the hobbit-alerts file or, and I haven't looked at the code at all, to reset the repeat timer every time a test changes color (possibly using a different keyword to keep current setups working as anticipated).
On Mon, Nov 07, 2005 at 03:56:37PM -0500, Pat Vaughan wrote:
No, and that might be something that could change. The repeat-checking code currently identifies an alert by the combination of hostname, servicename and recipient; I could easily change that so a separate line in the config-file would result in a new set of repeat-checks.
Is this something that might make it into the next version? I'm almost ready to take a snapshot if I have to. This bit me again today.
I did some work on this yesterday - while working on it, I found out that there is something buggy in the current version. From my Changes file (http://www.hswn.dk/beta/Changes):
- The handling of alerts was counting the duration of an event based on when the color last changed. This meant that each time the color changed, any DURATION counters were reset. This would cause alerts to not go out if a status was changing between yellow and red faster than any DURATION setting. Changed this to count the event start as the *first* time the status went into an alert state (yellow or red, usually).
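To make the difference concrete, here is a small Python sketch of the two ways of finding the event start. It only illustrates the Changes entry above and is not the Hobbit implementation; the function names, the `ALERT_COLORS` set and the sample history (times in minutes) are assumptions made for the example.

```python
ALERT_COLORS = {"yellow", "red"}  # assumed set of alert states


def start_by_last_change(history):
    """Old behaviour: the event starts at the most recent color change."""
    start = history[0][0]
    for (_, prev_color), (t, color) in zip(history, history[1:]):
        if color != prev_color:
            start = t
    return start


def start_by_first_alert(history):
    """New behaviour: the event starts when the status first entered an alert state."""
    start = None
    for t, color in history:
        if color in ALERT_COLORS:
            if start is None:
                start = t
        else:
            start = None  # back to a non-alert color: the event is over
    return start


# A status flapping between yellow and red; times are minutes.
history = [(0, "green"), (5, "yellow"), (12, "red"), (18, "yellow"), (25, "red")]
now = 30
duration = 20  # an alert rule with a 20-minute DURATION

old_elapsed = now - start_by_last_change(history)
new_elapsed = now - start_by_first_alert(history)
print("old:", old_elapsed, "minutes, alert due:", old_elapsed >= duration)
print("new:", new_elapsed, "minutes, alert due:", new_elapsed >= duration)
```

With the old counting, the flapping status never stays in one color for 20 minutes, so the alert never goes out; counting from the first alert state, it does.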
I then also implemented the following change:
- When a status goes yellow->red, the repeat-interval is now cleared for any alerts. This makes sure you get an alert immediately for the most severe state seen. This only affects the first such transition; if the status later changes between yellow/red, the normal REPEAT interval applies.
So you'll now get an alert when it goes yellow, and another when it goes red (if your configuration includes alerts for these colors, obviously).
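A rough Python sketch of this rule follows. It is an illustration under stated assumptions, not the code in the snapshot: the `EscalatingRepeatTracker` class is hypothetical, times are minutes after 22:00, recovery messages are ignored, and resetting the state when the status goes green is an assumption of the sketch.

```python
class EscalatingRepeatTracker:
    """Hypothetical model of REPEAT handling with one early alert on yellow->red."""

    def __init__(self, repeat_minutes):
        self.repeat = repeat_minutes
        self.last_sent = {}     # key -> minute of the last alert
        self.last_color = {}    # key -> color seen on the previous check
        self.escalated = set()  # keys that already used their one early red alert

    def should_send(self, key, color, now):
        prev = self.last_color.get(key)
        self.last_color[key] = color
        if color == "green":
            # Assumption: recovery ends the event and resets everything.
            self.last_sent.pop(key, None)
            self.escalated.discard(key)
            return False
        if prev == "yellow" and color == "red" and key not in self.escalated:
            # First yellow->red transition: clear the repeat interval.
            self.escalated.add(key)
            self.last_sent.pop(key, None)
        last = self.last_sent.get(key)
        if last is not None and now - last < self.repeat:
            return False
        self.last_sent[key] = now
        return True


tracker = EscalatingRepeatTracker(30)  # REPEAT=30m
key = ("a_host", "disk", "xxxx at yyy.com")
timeline = [(5, "yellow"), (35, "yellow"), (45, "red"),
            (50, "yellow"), (55, "red"), (75, "red")]
sent = [tracker.should_send(key, color, minute) for minute, color in timeline]
for (minute, color), s in zip(timeline, sent):
    print(minute, color, "alert sent" if s else "suppressed")
```

In this run the red status at minute 45 is alerted immediately, while the second yellow->red flip at minute 55 is held back until the normal 30-minute interval has passed.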
This is in the current snapshot, and will also be in the next release. I am tempted to do a 4.1.3 release fairly soon - this problem is fairly serious. And the disk graph problem that is also fixed in the current snapshot annoys quite a few people.
It seems to me that the most intelligent change would be to generate a new repeat-check for every line in the hobbit-alerts file or, and I haven't looked at the code at all, to reset the repeat timer every time a test changes color (possibly using a different keyword to keep current setups working as anticipated).
I'd rather not have the REPEAT handling tied to the physical layout of the configuration file - it makes it a lot harder to handle when the file is changed while alerts are active. I know I wrote something different in the message you've quoted, but after looking some more at the problem I've changed my mind.
I think the new code strikes a sensible balance between getting the necessary alerts and not being flooded with them. The current version works the way it does because I did not want to be flooded with alerts by a state that kept on changing between yellow and red - eg. a disk that is filled just about the limit between the warning and panic levels. The new code will give you that one extra alert telling you that the situation is critical, but once it has done that it will obey the REPEAT setting and only send you an alert every 30 minutes (or whatever your REPEAT interval is).
Regards, Henrik
I upgraded a Hobbit instance from 4.0beta4 (yes, terribly old, but it has been working fine) to 4.1.2 (actually the latest snapshot), and after the upgrade the main page still said it was version 4.0beta4, even after a new host that I added showed up on the display.
I wasn't sure if it was a caching problem or what, so what I ended up doing was saving a copy of bb-hosts.cfg, hobbitalerts.cfg, and to /tmp, then I totally deleted the "server" directory and reinstalled, put the config files back, fired up hobbit and all seemed well.
My question is, how does hobbit handle doing an upgrade? It didn't replace my bb-hosts file, so apparently it is aware of the existence of one different from the default and doesn't overwrite it... does it do the same checks for the various files in the other subdirectories?
Maybe there should be a makefile option ("make upgrade-install") to upgrade, that saves your config files but totally replaces all preexisting hobbit components, to make sure nothing old is hanging around, like I ended up having both a hobbit.sh and a starthobbit.sh (the old one) in my server directory.
-Charles
On Mon, Nov 07, 2005 at 02:48:34PM -0700, Charles Jones wrote:
I upgraded a Hobbit instance from 4.0beta4 (yes, terribly old, but it has been working fine) to 4.1.2 (actually the latest snapshot), and after the upgrade the main page still said it was version 4.0beta4, even after a new host that I added showed up on the display.
I wasn't sure if it was a caching problem or what, so what I ended up doing was saving a copy of bb-hosts.cfg, hobbitalerts.cfg, and to /tmp, then I totally deleted the "server" directory and reinstalled, put the config files back, fired up hobbit and all seemed well.
Not sure why that happened .. 4.0-beta4 is pretty old (in fact, it is the oldest release I have lying around - older than that, I'd have to fetch it from my RCS archive).
My question is, how does hobbit handle doing an upgrade? It didn't replace my bb-hosts file, so apparently it is aware of the existence of one different from the default and doesn't overwrite it... does it do the same checks for the various files in the other subdirectories?
With betas and snapshots, all bets are off - but when upgrading from one official release to another, "make install" tries fairly hard to handle configuration files the right way:
It won't touch an existing bb-hosts file.
It will try to add new entries to the hobbitlaunch.cfg, hobbitserver.cfg, hobbitgraph.cfg, hobbitcgi.cfg and columndoc.csv files. It doesn't delete anything, since it could be entries that you've added yourself for custom scripts and extensions.
For the GIFs, webpage header/footer templates and the like, Hobbit uses an MD5 checksum to see if the version on your box is one of the default ones shipped with an older version of Hobbit. If it is, then it will replace it with the current version.
For files that have been renamed (currently that is only hobbitd_larrd, which was renamed to hobbitd_rrd), it will delete the old file and set up a symlink, so any references to the old filename still work but hit the new file.
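The checksum test described above for GIFs and templates could look roughly like the following Python sketch. This is not what "make install" actually runs: the `install_file` function, the file names and the file contents are made up for the example, and the list of known default checksums is assumed to be supplied by the installer.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def install_file(new_file, installed_file, known_default_md5s):
    """Replace installed_file only if it is absent or still a shipped default.

    Returns 'installed', 'replaced' or 'kept'.
    """
    installed = Path(installed_file)
    if not installed.exists():
        shutil.copyfile(new_file, installed)
        return "installed"
    if md5_of(installed) in known_default_md5s:
        shutil.copyfile(new_file, installed)  # unmodified default: safe to update
        return "replaced"
    return "kept"  # locally modified: leave it alone


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    old_default = b"old default header\n"
    new = tmp / "new_header"
    new.write_bytes(b"new default header\n")
    untouched = tmp / "untouched_header"
    untouched.write_bytes(old_default)
    customized = tmp / "custom_header"
    customized.write_bytes(b"my own header\n")

    known = {hashlib.md5(old_default).hexdigest()}
    outcome_untouched = install_file(new, untouched, known)
    outcome_custom = install_file(new, customized, known)
    untouched_after = untouched.read_bytes()
    custom_after = customized.read_bytes()
    print(outcome_untouched, outcome_custom)
```

The point of comparing against checksums of older defaults, rather than simply checking whether the file exists, is that an untouched old default can be upgraded while a locally edited template survives.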
Maybe there should be a makefile option ("make upgrade-install") to upgrade, that saves your config files but totally replaces all preexisting hobbit components, to make sure nothing old is hanging around, like I ended up having both a hobbit.sh and a starthobbit.sh (the old one) in my server directory.
I think "make install" should just do what it does now. "starthobbit.sh" went away between RC4 and RC5.
Regards, Henrik
So you'll now get an alert when it goes yellow, and another when it goes red (if your configuration includes alerts for these colors, obviously).
That sounds like it should fix my problem perfectly.
I think the new code strikes a sensible balance between getting the necessary alerts and not being flooded with them. The current version works the way it does because I did not want to be flooded with alerts by a state that kept on changing between yellow and red - eg. a disk that is filled just about the limit between the warning and panic levels. The new code will give you that one extra alert telling you that the situation is critical, but once it has done that it will obey the REPEAT setting and only send you an alert every 30 minutes (or whatever your REPEAT interval is).
I like it, if we get a page and don't fix whatever is wrong so it goes back to a "green" state we should be flogged anyway. I'll be sure to grab the next version as soon as it's available.
participants (3)
- henrik@hswn.dk
- hobbit@pvaughan.us
- jonescr@cisco.com