On Mon, Nov 07, 2005 at 03:56:37PM -0500, Pat Vaughan wrote:
No, and that might be something that could change. The repeat-checking code currently identifies an alert by the combination of hostname, servicename and recipient; I could easily change that so a separate line in the config-file would result in a new set of repeat-checks.
Is this something that might make it into the next version? I'm almost ready to take a snapshot if I have to. This bit me again today.
I did some work on this yesterday - while working on it, I found out that there is something buggy in the current version. From my Changes file (http://www.hswn.dk/beta/Changes):
- The handling of alerts was counting the duration of an event based on when the color last changed. This meant that each time the color changed, any DURATION counters were reset. This would cause alerts to not go out if a status was changing between yellow and red faster than any DURATION setting. Changed this to count the event start as the *first* time the status went into an alert state (yellow or red, usually).
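To illustrate the difference, here is a minimal Python sketch of the two ways of counting. It is not the actual Hobbit code (which is C); the class name, method names and the use of plain seconds for timestamps are all made up for the example.

```python
ALERT_COLORS = {"yellow", "red"}  # assumption: the usual alert states


class EventTracker:
    """Hypothetical tracker for one status; not Hobbit's real data structure."""

    def __init__(self):
        self.color = "green"
        self.last_change = None   # time of the most recent color change
        self.event_start = None   # first time the status entered an alert state

    def update(self, color, now):
        if color == self.color:
            return
        self.last_change = now
        if color in ALERT_COLORS and self.color not in ALERT_COLORS:
            # Entering an alert state from a non-alert state starts the event.
            self.event_start = now
        elif color not in ALERT_COLORS:
            # Recovery ends the event.
            self.event_start = None
        self.color = color

    def old_duration(self, now):
        # Old behaviour: counted from the last color change.
        return now - self.last_change if self.color in ALERT_COLORS else 0

    def new_duration(self, now):
        # New behaviour: counted from the first entry into an alert state.
        return now - self.event_start if self.event_start is not None else 0


tracker = EventTracker()
# A status flapping between yellow and red every 5 minutes.
for step, color in enumerate(["yellow", "red", "yellow", "red"]):
    tracker.update(color, step * 300)

print(tracker.old_duration(1200))  # seconds since the last flap
print(tracker.new_duration(1200))  # seconds since the event began
```

With a DURATION of, say, 10 minutes, the old count never gets past 5 minutes for this flapping status, so no alert would go out; the new count keeps growing until the status recovers.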
I then also implemented the following change:
- When a status goes yellow->red, the repeat-interval is now cleared for any alerts. This makes sure you get an alert immediately for the most severe state seen. This only affects the first such transition; if the status later changes between yellow/red, the normal REPEAT interval applies.
So you'll now get an alert when it goes yellow, and another when it goes red (if your configuration includes alerts for these colors, obviously).
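Roughly, the repeat handling described above could be sketched like this in Python. Again, this is only an illustration and not the real implementation: the class and method names are invented, and clearing all state on recovery is an assumption on my part. The only things taken from the description are the key (hostname, servicename, recipient) and the one-time clearing of the repeat interval on the first yellow->red transition.

```python
ALERT_COLORS = {"yellow", "red"}  # assumption: the usual alert states


class RepeatState:
    """Hypothetical repeat bookkeeping; names are illustrative only."""

    def __init__(self, repeat_seconds):
        self.repeat = repeat_seconds
        self.next_alert = {}    # (host, service, recipient) -> earliest next alert
        self.escalated = set()  # (host, service) that already used the bypass

    def _clear(self, host, service):
        for key in [k for k in self.next_alert if k[:2] == (host, service)]:
            del self.next_alert[key]

    def color_changed(self, host, service, old, new):
        status = (host, service)
        if new not in ALERT_COLORS:
            # Assumption: recovery ends the event and forgets all repeat state.
            self.escalated.discard(status)
            self._clear(host, service)
        elif old == "yellow" and new == "red" and status not in self.escalated:
            # First yellow->red transition: clear the repeat interval once.
            self.escalated.add(status)
            self._clear(host, service)

    def should_alert(self, host, service, recipient, now):
        key = (host, service, recipient)
        if now < self.next_alert.get(key, 0):
            return False
        self.next_alert[key] = now + self.repeat
        return True


state = RepeatState(repeat_seconds=1800)  # REPEAT of 30 minutes
state.color_changed("www", "disk", "green", "yellow")
print(state.should_alert("www", "disk", "pat", now=0))    # first yellow alert
state.color_changed("www", "disk", "yellow", "red")
print(state.should_alert("www", "disk", "pat", now=120))  # the one extra red alert
state.color_changed("www", "disk", "red", "yellow")
state.color_changed("www", "disk", "yellow", "red")
print(state.should_alert("www", "disk", "pat", now=240))  # held back by REPEAT
```

Because the key does not include the line number of the rule in the configuration file, editing the file while an alert is active does not disturb the repeat bookkeeping in this sketch.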
This is in the current snapshot, and will also be in the next release. I am tempted to do a 4.1.3 release fairly soon - this problem is fairly serious. And the disk graph problem that is also fixed in the current snapshot annoys quite a few people.
It seems to me that the most intelligent change would be to generate a new repeat-check for every line in the hobbit-alerts file or, and I haven't looked at the code at all, to reset the repeat timer every time a test changes color (possibly using a different keyword to keep current setups working as anticipated).
I'd rather not have the REPEAT handling tied to the physical layout of the configuration file - it makes it a lot harder to handle when the file is changed while alerts are active. I know I wrote something different in the message you've quoted, but after looking some more at the problem I've changed my mind.
I think the new code strikes a sensible balance between getting the necessary alerts and not being flooded with them. The current version works the way it does because I did not want to be flooded with alerts by a state that kept on changing between yellow and red - e.g. a disk that is filled to just about the limit between the warning and panic levels. The new code will give you that one extra alert telling you that the situation is critical, but once it has done that it will obey the REPEAT setting and only send you an alert every 30 minutes (or whatever your REPEAT interval is).
Regards, Henrik