[xymon] xymon_4.3.0-RC1: possible lost alerts

14 Feb 2011


      On 02/14/11 11:00 AM, Henrik Størner wrote:
...
In <4D556C14.5060207 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
...
I think I found a bug in xymond_alert.c.
...
Let's say there is a page msg for hostA.serviceA and this alert will not
be processed immediately because of this part of the code:
...
        /*
         * When a burst of alerts happen, we get lots of alert messages
         * coming in quickly. So lets handle them in bunches and only
         * do the full alert handling once every 10 secs - that lets us
         * combine a bunch of alerts into one transmission process.
         */
        if (nowtimer < (lastxmit+10)) continue;
        lastxmit = nowtimer;
...
The main loop will then wait for a new msg from xymond (Want msg<num>,
startpos... etc).
...
Now if the next msg is a page recovery from the same hostA.serviceA,
the next processing of the active alerts (for loop) will then clean up
the alert for hostA.serviceA without sending any alert.

I haven't tested your diagnosis, but it is probably correct
(from what I remember of how this code works).
But is it a problem?
If you get an alert that clears a few seconds later (that is why there
is a recovery message), then what is the point of sending an alert?
The notification would be for data that is no longer valid, and
personally I would rather NOT be alerted at 3 AM if the problem no
longer exists.
So I am tempted to invoke the old "this is not a bug, it's a feature!"
meme :-)

I think the problem is rather that the behaviour is not deterministic.
Some alert/recovered transitions will get through (if the alert goes
into the alerts loop processing without waiting), while others can get
lost (if alert and recovery are processed in the same loop).
Dominique
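
The timing dependence discussed in this thread can be sketched with a small stand-alone model. This is not the xymond_alert.c code: only the `nowtimer < (lastxmit+10)` check is taken from the excerpt quoted above, and every other name (`msg_kind`, `struct msg`, `pages_sent`, `lost_case`, `delivered_case`) is hypothetical. The model assumes a single hostA.serviceA alert and assumes that alert handling only runs when a message arrives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message kinds; the real xymond_alert.c uses its own types. */
enum msg_kind { MSG_PAGE, MSG_RECOVERED };

struct msg {
    long when;            /* arrival time in seconds */
    enum msg_kind kind;
};

/* The same page/recovery pair, 7 and 4 seconds apart, starting from lastxmit = 100. */
static const struct msg lost_case[]      = { {105, MSG_PAGE}, {112, MSG_RECOVERED} };
static const struct msg delivered_case[] = { {111, MSG_PAGE}, {115, MSG_RECOVERED} };

/*
 * Replays messages for one host.service and returns how many page
 * notifications would go out under the 10-second throttle.
 */
static int pages_sent(const struct msg *m, size_t n, long lastxmit)
{
    int paging = 0, recovered = 0, notified = 0, sent = 0;

    assert(m != NULL);
    for (size_t i = 0; i < n; i++) {
        long nowtimer = m[i].when;

        /* Record the incoming message in the (single) tracked alert. */
        if (m[i].kind == MSG_PAGE) { paging = 1; recovered = 0; notified = 0; }
        else if (paging)           { recovered = 1; }

        /* Throttle from the quoted excerpt: skip the full handling. */
        if (nowtimer < (lastxmit + 10)) continue;
        lastxmit = nowtimer;

        /* Full handling: a recovered alert is cleaned up, a paging one is sent. */
        if (recovered)                { paging = recovered = 0; }
        else if (paging && !notified) { sent++; notified = 1; }
    }
    return sent;
}
```

With these inputs, `pages_sent(lost_case, 2, 100)` returns 0 because the page arrives inside the throttle window and the recovery is seen before the next full pass, while `pages_sent(delivered_case, 2, 100)` returns 1 because the page arrives just after the window has expired. The only difference between the two cases is where the messages fall relative to `lastxmit`.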
