[hobbit] No pages when going from yellow to red

7 Nov 2005 · *first*

On Mon, Nov 07, 2005 at 03:56:37PM -0500, Pat Vaughan wrote:
...
...
No, and that might be something that could change. The
repeat-checking code currently identifies an alert by the combination
of hostname, servicename and recipient; I could easily change
that so a separate line in the config file would result in a new
set of repeat-checks.
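
To illustrate the keying described above, here is a minimal sketch in Python (not Hobbit's actual C code). The names `repeat_until` and `should_alert` are hypothetical; the only thing taken from the message is that a repeat-check is identified by hostname, servicename and recipient, so two config lines naming the same recipient share one repeat timer.

```python
# Hypothetical sketch: one repeat timer per (hostname, servicename, recipient).
# Maps that key to the earliest time (in seconds) the next alert may be sent.
repeat_until = {}

def should_alert(hostname, servicename, recipient, now, repeat):
    """Return True if an alert may be sent now, and start a new repeat interval."""
    key = (hostname, servicename, recipient)
    if now < repeat_until.get(key, 0):
        # Still inside the repeat interval for this key, whichever
        # config line asked for the alert.
        return False
    repeat_until[key] = now + repeat
    return True
```

With this keying, a second rule for the same host, service and recipient is suppressed by the timer the first rule started, which is the behaviour being discussed.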
Is this something that might make it into the next version?  I'm almost
ready to take a snapshot if I have to.  This bit me again today.
I did some work on this yesterday - while working on it, I found
out that there is something buggy in the current version. From my
Changes file (http://www.hswn.dk/beta/Changes):

The handling of alerts was counting the duration of an event
based on when the color last changed. This meant that each
time the color changed, any DURATION counters were reset.
This would cause alerts to not go out if a status was changing
between yellow and red faster than any DURATION setting.
Changed this to count the event start as the *first* time the
status went into an alert state (yellow or red, usually).
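
The difference between the two ways of counting can be sketched as follows. This is an illustration in Python under stated assumptions, not the Hobbit source: `event_durations` and `ALERT_COLORS` are hypothetical names, and the input is assumed to be a time-ordered list of color changes.

```python
# Assumption: yellow and red are the configured alert colors.
ALERT_COLORS = {"yellow", "red"}

def event_durations(changes, now):
    """changes: time-ordered list of (timestamp, color) color changes.

    Returns (old, new): the event duration counted from the last color
    change (old behaviour) and from the first entry into an alert
    state (new behaviour). Both are 0 if the status is not alerting.
    """
    last_change = None
    event_start = None
    for ts, color in changes:
        last_change = ts
        if color in ALERT_COLORS:
            if event_start is None:
                event_start = ts   # first time the status went yellow/red
        else:
            event_start = None     # status recovered, the event is over
    if event_start is None:
        return (0, 0)
    return (now - last_change, now - event_start)

# A status flapping every 4 minutes, checked against DURATION of 5 minutes:
flapping = [(0, "yellow"), (240, "red"), (480, "yellow")]
old, new = event_durations(flapping, now=600)
# old is 120 seconds, so a 300-second DURATION is never reached;
# new is 600 seconds, so the alert goes out.
```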

I then also implemented the following change:

When a status goes yellow->red, the repeat-interval is
now cleared for any alerts. This makes sure you get an
alert immediately for the most severe state seen. This
only affects the first such transition; if the status
later changes between yellow/red, the normal REPEAT
interval applies.
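
A rough sketch of that rule in Python, for illustration only (the real code is C and handles much more, such as recovery messages). `AlertState` and its fields are hypothetical names; times are in seconds.

```python
class AlertState:
    """Hypothetical per-alert state with a REPEAT interval."""

    def __init__(self, repeat):
        self.repeat = repeat
        self.next_alert = 0      # earliest time the next alert may go out
        self.last_color = None   # last alert color seen in this event
        self.seen_red = False    # has this event already been red?

    def update(self, color, now):
        """Record a status update; return True if an alert should be sent."""
        if color not in ("yellow", "red"):
            # Recovered: forget the event entirely.
            self.next_alert = 0
            self.last_color = None
            self.seen_red = False
            return False
        if color == "red" and self.last_color == "yellow" and not self.seen_red:
            # First yellow->red transition: clear the repeat interval.
            self.next_alert = now
        if color == "red":
            self.seen_red = True
        self.last_color = color
        if now >= self.next_alert:
            self.next_alert = now + self.repeat
            return True
        return False
```

With a 30-minute REPEAT, a status going yellow at 0, red at 600, yellow at 900 and red again at 1200 produces alerts at 0 and 600 only; the next one is due at 2400.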

So you'll now get an alert when it goes yellow, and another
when it goes red (if your configuration includes alerts for
these colors, obviously).
This is in the current snapshot, and will also be in the next
release. I am tempted to do a 4.1.3 release fairly soon - this
problem is fairly serious. And the disk graph problem that is
also fixed in the current snapshot annoys quite a few people.
...
It seems
to me that the most intelligent change would be to generate a new
repeat-check for every line in the hobbit-alerts file or, and I haven't
looked at the code at all, to reset the repeat timer every time a test
changes color (possibly using a different keyword to keep current setups
working as anticipated).
I'd rather not have the REPEAT handling tied to the physical layout
of the configuration file - it makes it a lot harder to handle when
the file is changed while alerts are active. I know I wrote something
different in the message you've quoted, but after looking some more
at the problem I've changed my mind.
I think the new code strikes a sensible balance between getting
the necessary alerts and not being flooded with them. The current
version works the way it does because I did not want to be
flooded with alerts by a state that kept on changing between
yellow and red - e.g. a disk that is filled to just about the
limit between the warning and panic levels. The new code will
give you that one extra alert telling you that the situation
is critical, but once it has done that it will obey the
REPEAT setting and only send you an alert every 30 minutes
(or whatever your REPEAT interval is).
Regards,
Henrik


henrik＠hswn.dk