On 02/14/11 11:00 AM, Henrik Størner wrote:
In <4D556C14.5060207 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
I think I found a bug in xymond_alert.c.
Let's say there is a page msg for hostA.serviceA, and this alert is not processed immediately because of this part of the code:
    /*
     * When a burst of alerts happen, we get lots of alert messages
     * coming in quickly. So lets handle them in bunches and only
     * do the full alert handling once every 10 secs - that lets us
     * combine a bunch of alerts into one transmission process.
     */
    if (nowtimer < (lastxmit+10)) continue;
    lastxmit = nowtimer;

The main loop will then wait for a new msg from xymond (Want msg<num>, startpos... etc).
Now if the next msg is a page recovery for the same hostA.serviceA, the next pass over the active alerts (the for loop) will clean up the alert for hostA.serviceA without sending any alert.
I haven't tested your diagnosis, but it is probably correct (from what I remember of how this code works).
But is it a problem?
If you get an alert that clears a few seconds later (that is why there is a recovery message), then what is the point of sending an alert? The notification would be for data that is no longer valid, and personally I would rather NOT be alerted at 3 AM if the problem no longer exists.
So I am tempted to invoke the old "this is not a bug, it's a feature!" meme :-)
I think the problem is rather that the behaviour is not deterministic. Some alert/recovery transitions will get through (if the alert goes into the alert-processing loop without waiting), while others get lost (if the alert and its recovery are processed in the same pass of the loop).
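
To illustrate the point, here is a minimal sketch in C of the behaviour described above. It is not the real xymond_alert.c: the names msg_t, simulate, pending and XMIT_INTERVAL are made up for this example, and it models a single alert for hostA.serviceA under the assumption that a recovery simply discards an alert that has not been notified yet. Only the throttle check mirrors the quoted code.

```c
#include <assert.h>

#define XMIT_INTERVAL 10   /* seconds, same value as in the quoted check */

/* Hypothetical message type: arrival time and whether it is a recovery. */
typedef struct {
    long t;            /* arrival time in seconds */
    int  is_recovery;  /* 0 = page (alert), 1 = recovery */
} msg_t;

/*
 * Simplified model of the main loop for one host.service.
 * Returns how many page notifications would be sent.
 * "lastxmit" is the time of the last full alert-handling pass.
 */
static int simulate(const msg_t *msgs, int n, long lastxmit)
{
    int pending = 0;   /* alert received but not yet notified */
    int sent = 0;
    int i;

    assert(n >= 0);

    for (i = 0; i < n; i++) {
        long nowtimer = msgs[i].t;

        /* Assumption: a recovery drops a not-yet-notified alert. */
        pending = msgs[i].is_recovery ? 0 : 1;

        /* Same throttle as the quoted code: skip the full pass. */
        if (nowtimer < (lastxmit + XMIT_INTERVAL)) continue;
        lastxmit = nowtimer;

        /* Full alert handling: notify whatever is still pending. */
        if (pending) {
            sent++;
            pending = 0;
        }
    }
    return sent;
}
```

With a page at t=100 followed by a recovery at t=103, simulate() returns 1 when lastxmit is 0 (the page is handled at once), but 0 when lastxmit is 95 (both messages fall inside the same 10-second window). The same alert/recovery pair gives two different outcomes, depending only on when the previous pass happened to run.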
Dominique