xymon_4.3.0-RC1: possible lost alerts

dominique.frise＠unil.ch

11 Feb 2011

5:04 p.m.

Hi,

I think I found a bug in xymond_alert.c.

Let's say a page message arrives for hostA.serviceA, and this alert is not processed immediately because of this part of the code:

        /*
         * When a burst of alerts happen, we get lots of alert messages
         * coming in quickly. So lets handle them in bunches and only
         * do the full alert handling once every 10 secs - that lets us
         * combine a bunch of alerts into one transmission process.
         */
        if (nowtimer < (lastxmit+10)) continue;
        lastxmit = nowtimer;

The main loop then waits for a new message from xymond (Want msg <num>, startpos... etc).

Now if the next message is a page recovery for the same hostA.serviceA, the next pass over the active alerts (the for loop) will clean up the alert for hostA.serviceA without sending any alert.
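
The scenario can be reproduced with a small stand-alone model. This is only a sketch under stated assumptions, not the code from xymond_alert.c: the event_t and result_t types, the simulate() helper, the MSG_TICK wake-up and the single-alert state are all invented for illustration. Only the 10-second lastxmit throttle mirrors the snippet quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message types. MSG_TICK stands for a loop wake-up that
 * carries no status change (an assumption of this sketch). */
typedef enum { MSG_PAGE, MSG_RECOVERY, MSG_TICK } msgtype_t;

typedef struct {
    long when;        /* arrival time in seconds */
    msgtype_t type;   /* what the message says about the one host.service modelled */
} event_t;

typedef struct {
    int alerts_sent;
    int recoveries_sent;
} result_t;

/*
 * Model of the main loop for ONE host.service. Every message updates the
 * in-memory state, but notifications are only sent when at least 10
 * seconds have passed since the previous transmission.
 */
result_t simulate(const event_t *events, size_t n, long lastxmit)
{
    result_t res = { 0, 0 };
    int active = 0;     /* an alert is currently held in memory */
    int notified = 0;   /* an alert notification has actually been sent */
    size_t i;

    assert(events != NULL || n == 0);

    for (i = 0; i < n; i++) {
        long nowtimer = events[i].when;

        /* Step 1: record the state change carried by the message. */
        if (events[i].type == MSG_PAGE) active = 1;
        else if (events[i].type == MSG_RECOVERY) active = 0;

        /* Step 2: the throttle from the quoted snippet. */
        if (nowtimer < (lastxmit + 10)) continue;
        lastxmit = nowtimer;

        /* Step 3: full alert handling. */
        if (active && !notified) { res.alerts_sent++; notified = 1; }
        else if (!active && notified) { res.recoveries_sent++; notified = 0; }
    }
    return res;
}
```

In this model, with lastxmit = 100, a page at t=103 followed by a recovery at t=115 produces no notification at all, while a page at t=112 followed by a recovery at t=125 produces one alert and one recovery.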

Dominique

henrik＠hswn.dk

14 Feb 2011

10 a.m.

New subject: [xymon] xymon_4.3.0-RC1: possible lost alerts

I haven't tested your diagnosis, but it is probably correct (from what I remember of how this code works).

But is it a problem?

If you get an alert that clears a few seconds later (that is why there is a recovery message), then what is the point of sending an alert? The notification would be for data that is no longer valid, and personally I would rather NOT be alerted at 3 AM if the problem no longer exists.

So I am tempted to invoke the old "this is not a bug, it's a feature!" meme :-)

Regards, Henrik

dominique.frise＠unil.ch

11:21 a.m.

I think the problem is rather that the behaviour is not deterministic. Some alert/recovery transitions get through (if the alert reaches the alert-processing loop without waiting), while others are lost (if the alert and the recovery are handled in the same pass of the loop).

Dominique

henrik＠hswn.dk

12:46 p.m.

But it is "deterministic enough" that you will either get both of them (alert + recovery), or neither. You will not get an alert and then lose the recovery-message, or get a recovery-message without the alert having been sent.

Regards, Henrik

dominique.frise＠unil.ch

1:38 p.m.

This leads me to another question that never got answered: what is supposed to happen if you remove the "clear" color from OKCOLORS in xymonserver.cfg? We would expect that no recovery message is sent when a status goes from yellow/red to clear; only the repeat interval should be reset. Does this make sense?

Dominique

henrik＠hswn.dk

1:51 p.m.

In <4D593040.6090808 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:

...

what is supposed to happen if you remove the "clear" color from OKCOLORS in xymonserver.cfg?

Then a "clear" status would trigger alerts, i.e. the xymond_alert module would begin to see alert-messages for a clear status (same as for yellow, red, purple).

I don't think you would actually see any alerts being sent, unless you also change ALERTCOLORS to include the "clear" status.

But that would be a bad idea, since "clear" is also used for e.g. "noping" hosts, or for client-side statuses (cpu, disk, ...) when the server is down (when the "conn" status is red, client-side tests do not go purple; they go clear).
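
The distinction between the two color sets can be illustrated with bitmasks. This is a minimal sketch of the behaviour as described in this reply, not code from Xymon: the color numbering, the COLORBIT macro and the function names are assumptions made for the example, and the real colorset() helper builds its masks from the environment strings.

```c
#include <assert.h>

/* Hypothetical color numbering; Xymon defines its own COL_* values. */
enum { COL_GREEN, COL_CLEAR, COL_BLUE, COL_PURPLE, COL_YELLOW, COL_RED, COL_COUNT };

#define COLORBIT(c) (1 << (c))

/* Return non-zero when the color is a member of the bitmask set. */
static int in_set(int colorset, int color)
{
    assert(color >= 0 && color < COL_COUNT);
    return (colorset & COLORBIT(color)) != 0;
}

/* A status reaches the alert module when its color is not an OK color. */
int reaches_alert_module(int okcolors, int color)
{
    return !in_set(okcolors, color);
}

/* A notification goes out only if the color is also an alert color. */
int sends_notification(int okcolors, int alertcolors, int color)
{
    return reaches_alert_module(okcolors, color) && in_set(alertcolors, color);
}
```

With an OK set of green and blue (clear removed) and an alert set of red, yellow and purple, a clear status reaches the alert module in this model but does not produce a notification; it would only do so if clear were added to the alert set as well.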

...

We would expect that no recovery message is sent when a status goes from yellow/red to clear; only the repeat interval should be reset. Does this make sense?

Kind of, yes. I don't recall if it was actually tested.

Regards, Henrik

dominique.frise＠unil.ch

4:08 p.m.

(Sorry, same reply was sent before with garbage as top post.)

I don't think it was ;-)

Here are the small changes we made in xymond_alert.c (the version before your last changes) to achieve this:

[super@iris xymond]# diff -u xymond_alert.c.dist xymond_alert.c
--- xymond_alert.c.dist Sun Nov 14 18:21:19 2010
+++ xymond_alert.c      Mon Feb 14 15:02:24 2011
@@ -355,7 +355,7 @@
        char *msg;
        int seq;
        int argi;
-       int alertcolors, alertinterval;
+       int alertcolors, alertinterval, okcolors;
        char *configfn = NULL;
        char *checkfn = NULL;
        int checkpointinterval = 900;
@@ -377,6 +377,7 @@
        /* Load alert config */
        alertcolors = colorset(xgetenv("ALERTCOLORS"), ((1 << COL_GREEN) | (1 << COL_BLUE)));
        alertinterval = 60*atoi(xgetenv("ALERTREPEAT"));
+       okcolors = colorset(xgetenv("OKCOLORS"), (1 << COL_RED));

        /* Create our loookup-trees */
        hostnames = rbtNew(name_compare);
@@ -656,7 +657,7 @@
                                awalk->maxcolor = newcolor;
                        }
                }
-               else {
+               else if ((okcolors & (1 << newcolor)) != 0) {
                        /*
                         * Send one "recovered" message out now, then go to A_DEAD.
                         * Dont update the color here - we want recoveries to go out
@@ -663,6 +664,11 @@
                         * only if the alert color triggered an alert
                         */
                        awalk->state = A_RECOVERED;
+               } else {
+                       /*
+                        * This color should not trigger "recovered" messages.
+                        */
+                       awalk->state = A_NORECIP;
                }

With this in place we can better support alerting for SNMP traps (see the previous discussion with Buchan, http://www.xymon.com/archive/2011/02/msg00062.html), but then we want all short transitions from an alert state to a clear status to be processed by Xymon (not ignored).

Dominique
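
The effect of the patched branch can be checked in isolation. The following is a sketch under assumptions, not an extract from xymond_alert.c: the color numbering is invented, A_PAGING is a placeholder name for "the alert stays active", and the real alert-color branch does more work (for example tracking maxcolor). Only A_RECOVERED, A_NORECIP and the okcolors test come from the diff.

```c
#include <assert.h>

/* Hypothetical numbering; Xymon defines its own COL_* values. */
enum { COL_GREEN, COL_CLEAR, COL_BLUE, COL_PURPLE, COL_YELLOW, COL_RED, COL_COUNT };

/* A_PAGING is a placeholder; the other two names appear in the diff. */
enum { A_PAGING, A_RECOVERED, A_NORECIP };

/*
 * Decide the state of an active alert when a new color arrives.
 * Assumption: a color in alertcolors keeps the alert active.
 */
int state_after_change(int alertcolors, int okcolors, int newcolor)
{
    assert(newcolor >= 0 && newcolor < COL_COUNT);

    if ((alertcolors & (1 << newcolor)) != 0) {
        return A_PAGING;
    }
    else if ((okcolors & (1 << newcolor)) != 0) {
        /* An OK color: send one "recovered" message. */
        return A_RECOVERED;
    }
    else {
        /* Patched branch: this color should not trigger "recovered". */
        return A_NORECIP;
    }
}
```

With clear removed from the OK set, a change to clear yields A_NORECIP in this model, while a change to green still yields A_RECOVERED.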
