Hi,
I think I found a bug in xymond_alert.c.
Let's say there is a page message for hostA.serviceA, and this alert is not processed immediately because of this part of the code:

    /*
     * When a burst of alerts happen, we get lots of alert messages
     * coming in quickly. So lets handle them in bunches and only
     * do the full alert handling once every 10 secs - that lets us
     * combine a bunch of alerts into one transmission process.
     */
    if (nowtimer < (lastxmit+10)) continue;
    lastxmit = nowtimer;

The main loop will then wait for a new message from xymond (Want msg <num>, startpos... etc).
Now if the next message is a page recovery for the same hostA.serviceA, the next pass over the active alerts (the for loop) will clean up the alert for hostA.serviceA without ever sending an alert.
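The sequence above can be replayed with a small stand-alone simulation. This is only a sketch under simplifying assumptions, not the actual xymond_alert.c code: `event_t`, `msgkind_t`, `alerts_sent` and the `MSG_TICK` wake-up are hypothetical names, and only the 10-second gate is taken from the excerpt.

```c
#include <stddef.h>

/* Hypothetical message kinds; MSG_TICK stands for the loop waking up with no new message. */
typedef enum { MSG_PAGE, MSG_RECOVERY, MSG_TICK } msgkind_t;

typedef struct {
    long when;        /* arrival time, in seconds */
    msgkind_t kind;
} event_t;

/*
 * Replay the messages for one host.service and return how many alert
 * notifications would go out. Everything except the gate is a
 * simplified stand-in for the real alert handling.
 */
int alerts_sent(const event_t *events, size_t count, long lastxmit)
{
    int pending = 0;   /* page received, not yet handled by the alert pass */
    int sent = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        long nowtimer = events[i].when;

        /* Message intake happens every time, even inside the 10-second window. */
        if (events[i].kind == MSG_PAGE) pending = 1;
        else if (events[i].kind == MSG_RECOVERY) pending = 0;  /* cleaned up, nothing sent */

        /* The gate from the excerpt: skip the alert pass if it ran less than 10 s ago. */
        if (nowtimer < (lastxmit + 10)) continue;
        lastxmit = nowtimer;

        if (pending) { sent++; pending = 0; }
    }
    return sent;
}
```

With `lastxmit` at 0, a page at t=3 followed by a recovery at t=6 produces no notification in this model, while a page at t=15 is sent at once.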
Dominique
In <4D556C14.5060207 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
I think I found a bug in xymond_alert.c.
Let's say there is a page message for hostA.serviceA, and this alert is not processed immediately because of this part of the code:

    /*
     * When a burst of alerts happen, we get lots of alert messages
     * coming in quickly. So lets handle them in bunches and only
     * do the full alert handling once every 10 secs - that lets us
     * combine a bunch of alerts into one transmission process.
     */
    if (nowtimer < (lastxmit+10)) continue;
    lastxmit = nowtimer;

The main loop will then wait for a new message from xymond (Want msg <num>, startpos... etc).
Now if the next message is a page recovery for the same hostA.serviceA, the next pass over the active alerts (the for loop) will clean up the alert for hostA.serviceA without ever sending an alert.
I haven't tested your diagnosis, but it is probably correct (from what I remember of how this code works).
But is it a problem?
If you get an alert that clears a few seconds later (that is why there is a recovery message), then what is the point of sending an alert? The notification would be for data that is no longer valid, and personally I would rather NOT be alerted at 3 AM if the problem no longer exists.
So I am tempted to invoke the old "this is not a bug, it's a feature!" meme :-)
Regards, Henrik
On 02/14/11 11:00 AM, Henrik Størner wrote:
I haven't tested your diagnosis, but it is probably correct (from what I remember of how this code works).
But is it a problem?
If you get an alert that clears a few seconds later (that is why there is a recovery message), then what is the point of sending an alert? The notification would be for data that is no longer valid, and personally I would rather NOT be alerted at 3 AM if the problem no longer exists.
So I am tempted to invoke the old "this is not a bug, it's a feature!" meme :-)
I think the problem is rather that the behaviour is not deterministic. Some alert/recovery transitions get through (if the alert reaches the alert-processing loop without waiting), while others get lost (if the alert and its recovery are handled in the same pass).
Dominique
In <4D59102A.2000507 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
I think the problem is rather that the behaviour is not deterministic. Some alert/recovery transitions get through (if the alert reaches the alert-processing loop without waiting), while others get lost (if the alert and its recovery are handled in the same pass).
But it is "deterministic enough" that you will either get both of them (alert + recovery), or neither. You will not get an alert and then lose the recovery-message, or get a recovery-message without the alert having been sent.
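The "both or neither" property can be pictured as a small state machine. The sketch below is an illustration under assumptions, not the real alert bookkeeping: the `S_*` states, `ev_t`, `counts_t` and `replay` are hypothetical names, and the only rule it encodes is that a recovery is owed only for an alert that was actually announced.

```c
#include <stddef.h>

typedef enum { EV_PAGE, EV_RECOVERY, EV_TICK } evkind_t;   /* EV_TICK: wake-up without a message */
typedef struct { long when; evkind_t kind; } ev_t;
typedef struct { int alerts; int recoveries; } counts_t;

/* Hypothetical per-alert states, loosely modelled on the discussion. */
typedef enum { S_NONE, S_PENDING, S_NOTIFIED, S_RECOVERED } state_t;

/* Replay events for one host.service and count the notifications of each kind. */
counts_t replay(const ev_t *evs, size_t n, long lastxmit)
{
    counts_t c = { 0, 0 };
    state_t st = S_NONE;
    size_t i;

    for (i = 0; i < n; i++) {
        long nowtimer = evs[i].when;

        if (evs[i].kind == EV_PAGE && st == S_NONE) {
            st = S_PENDING;
        } else if (evs[i].kind == EV_RECOVERY) {
            if (st == S_PENDING) st = S_NONE;             /* never announced: drop silently */
            else if (st == S_NOTIFIED) st = S_RECOVERED;  /* announced: a recovery is owed */
        }

        /* Same 10-second gate as in the quoted excerpt. */
        if (nowtimer < (lastxmit + 10)) continue;
        lastxmit = nowtimer;

        if (st == S_PENDING) { c.alerts++; st = S_NOTIFIED; }
        else if (st == S_RECOVERED) { c.recoveries++; st = S_NONE; }
    }
    return c;
}
```

In this model the alert and recovery counts for a completed transition always match: a page at t=3 with a recovery at t=6 yields neither message, and a page at t=15 with a recovery at t=18 yields both once the next pass runs.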
Regards, Henrik
On 02/14/11 01:46 PM, Henrik Størner wrote:
But it is "deterministic enough" that you will either get both of them (alert + recovery), or neither. You will not get an alert and then lose the recovery-message, or get a recovery-message without the alert having been sent.
This leads me to another question that never got answered: what is supposed to happen if you remove the "clear" color from OKCOLORS in xymonserver.cfg? We would expect that no recovery message is sent when a status goes from yellow/red to clear; only the repeat interval should be reset. Does this make sense?
Dominique
In <4D593040.6090808 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
what is supposed to happen if you remove the "clear" color from OKCOLORS in xymonserver.cfg?
Then a "clear" status would trigger alerts, i.e. the xymond_alert module would begin to see alert-messages for a clear status (same as for yellow, red, purple).
I don't think you would actually see any alerts being sent, unless you also change ALERTCOLORS to include the "clear" status.
But that would be a bad idea, since "clear" is also used for e.g. "noping" hosts, or for client-side statuses (cpu, disk, ...) when the server is down (when the "conn" status is red, client-side tests do not go purple; they go clear).
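To make the distinction between the two settings concrete, here is a rough sketch of the two checks as described above. It follows the description in this thread and has not been checked against the Xymon source: `parse_colors` is a hypothetical stand-in (not Xymon's `colorset()`), the `C_*` bit positions are arbitrary, and `forwarded_as_alert` and `would_notify` are made-up names.

```c
#include <string.h>

typedef enum { C_GREEN, C_CLEAR, C_BLUE, C_PURPLE, C_YELLOW, C_RED, C_COUNT } color_t;

static const char *color_names[C_COUNT] = { "green", "clear", "blue", "purple", "yellow", "red" };

/* Turn a list such as "red,yellow,purple" into a bitmask; unknown names are ignored. */
unsigned parse_colors(const char *list)
{
    unsigned mask = 0;
    const char *p = list;

    while (*p) {
        size_t len = strcspn(p, ", ");   /* length of the next name */
        int c;

        for (c = 0; c < C_COUNT; c++) {
            if (len > 0 && strlen(color_names[c]) == len &&
                strncmp(p, color_names[c], len) == 0)
                mask |= 1u << c;
        }
        p += len;
        p += strspn(p, ", ");            /* skip separators */
    }
    return mask;
}

/* Assumption: a status reaches the alert module as an alert if its color is not an OK color. */
int forwarded_as_alert(unsigned okcolors, color_t c)
{
    return (okcolors & (1u << c)) == 0;
}

/* Assumption: a notification goes out only if the color is also an alert color. */
int would_notify(unsigned okcolors, unsigned alertcolors, color_t c)
{
    return forwarded_as_alert(okcolors, c) && (alertcolors & (1u << c)) != 0;
}
```

With OK colors "green,blue" and alert colors "red,yellow,purple", a clear status is forwarded in this model but produces no notification.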
We would expect that no recovery message is sent when a status goes from yellow/red to clear; only the repeat interval should be reset. Does this make sense?
Kind of, yes. I don't recall if it was actually tested.
Regards, Henrik
On 02/14/11 02:51 PM, Henrik Størner wrote:
Kind of, yes. I don't recall if it was actually tested.
(Sorry, same reply was sent before with garbage as top post.)
I don't think it was ;-)
Below are the small changes we made to xymond_alert.c (the version before your last changes) to achieve this:

[super at iris xymond]# diff -u xymond_alert.c.dist xymond_alert.c
--- xymond_alert.c.dist Sun Nov 14 18:21:19 2010
+++ xymond_alert.c Mon Feb 14 15:02:24 2011
@@ -355,7 +355,7 @@
     char *msg;
     int seq;
     int argi;
-    int alertcolors, alertinterval;
+    int alertcolors, alertinterval, okcolors;
     char *configfn = NULL;
     char *checkfn = NULL;
     int checkpointinterval = 900;
@@ -377,6 +377,7 @@
     /* Load alert config */
     alertcolors = colorset(xgetenv("ALERTCOLORS"), ((1 << COL_GREEN) | (1 << COL_BLUE)));
     alertinterval = 60*atoi(xgetenv("ALERTREPEAT"));
+    okcolors = colorset(xgetenv("OKCOLORS"), (1 << COL_RED));
 
     /* Create our loookup-trees */
     hostnames = rbtNew(name_compare);
@@ -656,7 +657,7 @@
                     awalk->maxcolor = newcolor;
                 }
             }
-            else {
+            else if ((okcolors & (1 << newcolor)) != 0) {
                 /*
                  * Send one "recovered" message out now, then go to A_DEAD.
                  * Dont update the color here - we want recoveries to go out
@@ -663,6 +664,11 @@
                  * only if the alert color triggered an alert
                  */
                 awalk->state = A_RECOVERED;
+            } else {
+                /*
+                 * This color should not trigger "recovered" messages.
+                 */
+                awalk->state = A_NORECIP;
             }

With this in place we can better support alerting for SNMP traps (see previous discussion with Buchan http://www.xymon.com/archive/2011/02/msg00062.html), but then we want all short transitions from an alert state to a clear status to be processed by Xymon (not ignored).
Dominique
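Reduced to its core, the patch turns a two-way decision into a three-way one. The following stand-alone sketch shows that decision in isolation. It is not the patched code: `next_state`, the `NEXT_*` values and the `K_*` color bits are hypothetical, and it assumes the branch preceding the changed `else` tests the alert colors.

```c
/* Hypothetical color bits; the real code uses its own COL_* constants. */
typedef enum { K_GREEN, K_CLEAR, K_BLUE, K_PURPLE, K_YELLOW, K_RED } kcolor_t;

/* Hypothetical outcomes, standing in for the alert states named in the diff. */
typedef enum { NEXT_PAGING, NEXT_RECOVERED, NEXT_NORECIP } next_state_t;

/*
 * Decide what a status change to newcolor means for an active alert:
 * - an alert color keeps the alert going,
 * - an OK color triggers one "recovered" message,
 * - any other color ends the alert quietly, with no recovery message.
 */
next_state_t next_state(unsigned alertcolors, unsigned okcolors, kcolor_t newcolor)
{
    unsigned bit = 1u << newcolor;

    if ((alertcolors & bit) != 0) return NEXT_PAGING;
    if ((okcolors & bit) != 0) return NEXT_RECOVERED;
    return NEXT_NORECIP;
}
```

With "clear" left out of both masks, a change from red to clear lands in the third branch, so no recovery message would be produced for it.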
participants (2)
- dominique.frise@unil.ch
- henrik@hswn.dk