Yellow->red escalation, bug or feature?

betsy.schwartz＠gmail.com

9 Jan 2012

3:07 p.m.

I think this is a bug, but maybe it's a feature I haven't figured out yet:

Many of our alerts are set to email on yellow and page with escalation on red: alert1 after 10 minutes (repeat every 10), alert2 after 20 minutes, alert3 after 40 minutes, alert4 after an hour. When an alert is yellow, it sometimes sits around for a while. When an alert goes red, the alert1 person acks or fixes the alert and the alert4 person should never be woken up.

However, when an alert has been yellow for over an hour and *then* turns red, we are seeing that the entire escalation group is paged, as though the alert has been red for over an hour.

I think this is a bug - when the alert first goes red it should be treated as a NEW alert and not go waking up everyone.

Thoughts? Am I missing something? Our tier4 person is getting rather annoyed at being woken up for things that the tier1 person can handle.

thanks Betsy

josh＠imaginenetworksllc.com

9 Jan

3:11 p.m.

You're saying yellow for an hour and red for a few seconds triggers like it was red for an hour?

Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373

Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon

betsy.schwartz＠gmail.com

4:15 p.m.

On Mon, Jan 9, 2012 at 10:11 AM, Josh Luthman <josh at imaginenetworksllc.com> wrote:

...

You're saying yellow for an hour and red for a few seconds triggers like it was red for an hour?

Exactly. Red for five minutes, anyway :-) At least some of the time, I think there's a counter that isn't reset.

josh＠imaginenetworksllc.com

4:20 p.m.

What version is this? I don't think I've got that bug.

Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373

betsy.schwartz＠gmail.com

6:56 p.m.

I am on 4.3.7 now; saw the behavior on earlier 4.3.x versions.

On Mon, Jan 9, 2012 at 11:20 AM, Josh Luthman <josh at imaginenetworksllc.com> wrote:

...

What version is this? I don't think I've got that bug.

Here's the most recent test that got everyone annoyed

Sat Jan 07 09:02:06 2012    green   2 days 4:48:10
Sat Jan 07 08:16:59 2012    red     0:45:07
Sat Jan 07 07:16:50 2012    yellow  1:00:09
Sat Jan 07 03:16:16 2012    green   4:00:34
Sat Jan 07 02:56:13 2012    red     0:20:03
Sat Jan 07 01:56:04 2012    yellow  1:00:09

notifications sent:

Sat Jan 7 01:56:04 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) xymon at example.com[139] 1325919364 0
Sat Jan 7 02:57:17 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) xymon at example.com[139] 1325923037 0
Sat Jan 7 02:57:17 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert1[149] 1325923037 0
Sat Jan 7 02:57:17 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert2[152] 1325923037 0
Sat Jan 7 02:57:17 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert3[153] 1325923037 0
Sat Jan 7 02:57:17 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert4[154] 1325923037 0
Sat Jan 7 03:07:21 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert1[149] 1325923641 0
Sat Jan 7 03:07:21 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert2[152] 1325923641 0
Sat Jan 7 03:07:21 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert3[153] 1325923641 0
Sat Jan 7 03:07:21 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert4[154] 1325923641 0
Sat Jan 7 07:16:53 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) xymon at example.com[139] 1325938613 0
Sat Jan 7 08:17:13 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) xymon at example.com[139] 1325942233 0
Sat Jan 7 08:17:13 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert1[149] 1325942233 0
Sat Jan 7 08:17:13 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert2[152] 1325942233 0
Sat Jan 7 08:17:13 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert3[153] 1325942233 0
Sat Jan 7 08:17:13 2012 edprocs3.example.com.watch_oelogs (10.100.4.57) alert4[154] 1325942233 0

You can see that on Saturday it went yellow at 1:56, emailing "xymon at example.com" (which is our email alert), and then an hour later it went red and started emailing the world. Then shortly after 7:00 am the same thing happened again.

I note that all of these servers are on EDT, and this test went red exactly an hour after going yellow because it's a custom test that goes yellow after so many seconds and red an hour later.
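The timing in the log above can be checked with a little arithmetic. This is a minimal sketch, assuming the second-to-last field of each notification line is the epoch time at which the notification was sent (it matches the printed timestamps); the function name `elapsed_since` is only illustrative.

```shell
#!/bin/sh
# elapsed_since START END: print the number of seconds between two epoch timestamps.
elapsed_since() {
    echo $(( $2 - $1 ))
}

yellow_mail=1325919364   # 01:56:04, first (yellow) email in the log
first_page=1325923037    # 02:57:17, alert1 through alert4 all paged

secs=$(elapsed_since "$yellow_mail" "$first_page")
echo "paged $secs seconds ($(( secs / 60 )) minutes) after going yellow"
```

The status history shows the test turned red at 02:56:13, so the pages went out about a minute after the red but 61 minutes after the yellow, just past the one-hour alert4 threshold.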

betsy.schwartz＠gmail.com

7:12 p.m.

...

You're saying yellow for an hour and red for a few seconds triggers like it was red for an hour?

I note that the previous example was for a custom test but I also have seen this for the disk test: (set to email every 8 hours when yellow)

Sat Dec 24 10:53:27 2011    red     0:49:09
Sun Dec 18 03:01:51 2011    yellow  6 days 7:51:36

Thu Dec 22 17:54:39 2011 jumpstart.example.com.disk (10.100.4.33) xymon at example.com[139] 1324594479 100
Fri Dec 23 01:54:40 2011 jumpstart.example.com.disk (10.100.4.33) xymon at example.com[139] 1324623280 100
Fri Dec 23 09:54:47 2011 jumpstart.example.com.disk (10.100.4.33) xymon at example.com[139] 1324652087 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33) xymon at example.com[139] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33) alert1[149] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33) alert2[152] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33) alert3[153] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33) alert4[154] 1324742067 100

skadz＠skadz.com

8:23 p.m.

I've seen this exact same issue going all the way back to hobbit, so this is not a new issue with 4.3. I would love to see it fixed though, as it's very annoying to get paged when you are second or third on call and everyone gets notified on the first red.

Skadz

spah＠syntec.co.uk

10 Jan

11:24 a.m.

I agree that this is not a new issue. I have discussed this before (http://lists.xymon.com/archive/2009-January/023201.html (Henrik's reply: http://lists.xymon.com/oldarchive/2009/02/msg00133.html) and http://lists.xymon.com/archive/2008-September/020998.html).

But now that we have flap detection, I'm not sure that Henrik's listed problem with changing it is really an issue. So I hope it can be changed!

BTW, The oldarchive is better for following threads (provided they don't cross month boundaries): http://lists.xymon.com/oldarchive/2008/09/msg00057.html Compare with the previous link. However, the new archive keeps attachments. It would be nice if the functionality of both archives were merged...

Kind regards,

SebA

betsy.schwartz＠gmail.com

7:15 p.m.

Me too. It's a very serious problem for us.

We need to avoid waking up the tier4 people!

It attracts a ton of attention when we have a small issue waking up everyone in the house.

Carl.Melgaard＠STAB.RM.DK

11 Jan

9:56 a.m.

Hi,

It would be interesting to see if this bug could be squashed, now that flap-detection is in the game. But I haven't seen Henrik on this list for a good time now - he's active on the developer-list, tho - so I'm crossposting it there.

Regards,

Carl Melgaard

novosirj＠umdnj.edu

3:30 p.m.

I've seen him post in the last week, so I know he does read this list periodically at least.

Ryan Novosielski - Sr. Systems Programmer
novosirj at umdnj.edu - 973/972.0922 (2-0922)
Univ. of Med. and Dent. | IST/EI-Academic Svcs. - ADMC 450, Newark

david.gore＠verizon.com

7:53 p.m.

Since it has been argued that it is not exactly a bug, I would only humbly request that the current behavior is not changed but enhanced for those who want it to work differently. If an alert has been alarming for x time and then goes red, do you want to wait even longer to be alerted? Yellow time + red time, or yellow time and now it's red so alert, provided the yellow time exceeds the red threshold.

~David

josh＠imaginenetworksllc.com

7:55 p.m.

I think we need a new argument for this new condition, something like DURATIONWHILERED

Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373

betsy.schwartz＠gmail.com

8:03 p.m.

If an alert's been yellow for a while and then goes red I do want to be alerted - but I only want the tier1 person to be alerted.

The current behavior of immediately paging all the way up the food chain to the tier4 people, the minute it goes red, seems wrong to me, and it is REALLY upsetting our tier4 people, who are getting woken up at 3am for stuff the tier1 person can handle.

(This is happening for us most often with disk space. People are not super-fast at cleaning up disk space. But I'm waking up managers for disks that have hit 90% full and that's just not cool)

If other people like the behavior, making it a knob we can turn is fine. Just something I can do to keep from waking the whole crew up.

henrik＠hswn.dk

9:39 p.m.

On 11-01-2012 20:53, Gore, David W (David) wrote:

...

Since it has been argued that it is not exactly a bug, I would only humbly request that the current behavior is not changed but enhanced for those who want it to work differently. If an alert has been alarming for x time and then goes red, do you want to wait even longer to be alerted? Yellow time + red time, or yellow time and now it's red so alert, provided the yellow time exceeds the red threshold.

If I understand it correctly, then the unhappiness with the current setup is that the DURATION setting in alerts.cfg counts both yellow and red time. So when a status goes yellow, stays there for a few hours before going red - then a rule such as

MAIL cio at example.com COLOR=RED DURATION>3h

will trigger immediately.

Some would argue that if you haven't fixed a problem before it goes critical, then your CIO *should* be notified.

The other school of thought argues that this rule means the CIO only wants to be informed when something has been really hosed for at least three hours. So the yellow warning-time shouldn't count when evaluating the DURATION setting for that rule - only the critical time counts.

Is that a correct understanding of the arguments here?

Let's say I implement the 3-hour delay before sending an escalation notice. What should happen if the status is yellow for two hours, then goes red for 2h50m, dips back into yellow for 10 minutes and then goes back to red? Should the 2h50m count after the status was yellow for a while? Or does a 10-minute yellow status completely reset the duration counter for the almost-3-hours red status?

I'm not trying to be too pedantic here, but it is the sort of things that do happen. So let's discuss how it can best be handled.
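To make the two possible answers concrete, here is a hypothetical sketch. It is not how xymond_alert works; the function `red_time`, its grace-period argument and the "color seconds" input format are all assumptions made for illustration. It adds up red time over a status history and lets a non-red dip reset the counter only if the dip lasts longer than a chosen grace period.

```shell
#!/bin/sh
# red_time GRACE: read "color seconds" segments (oldest first) on stdin and
# print the accumulated red time at the end of the history.
# A non-red segment longer than GRACE seconds resets the counter; a shorter
# one is ignored (and not counted as red). GRACE=0 means any dip resets it.
red_time() {
    grace=$1
    total=0
    while read -r color secs; do
        if [ "$color" = "red" ]; then
            total=$(( total + secs ))
        elif [ "$secs" -gt "$grace" ]; then
            total=0
        fi
    done
    echo "$total"
}

# The scenario above: yellow 2h, red 2h50m, yellow 10m, then red again (1 minute so far).
history="yellow 7200
red 10200
yellow 600
red 60"

echo "$history" | red_time 0     # strict reading: the yellow dip resets the counter
echo "$history" | red_time 900   # tolerant reading: dips of up to 15 minutes are ignored
```

With a strict reset, only the latest minute of red counts, so a 3-hour rule would be nowhere near firing. With the 15-minute grace period the earlier 2h50m still counts, and the rule would fire ten minutes into the second red stretch.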

I think Josh is right that changing this will require some sort of additional configuration setting to indicate that "this duration value applies to the time it's been red only". It's for curbing escalation notices. And therefore it is obviously only an issue for those statuses that can be yellow - not those that can only be red or green.

It's been quite some time since I last dug into the alert-module code, so I cannot say how much effort it will take to add this. Right now I am not sure if the alert module has enough information about an alert to be able to implement it.

Meanwhile, may I draw your attention to the "SCRIPT" way of sending alerts. It's not an ideal solution, but I think it's a usable work-around for this problem:

The alert script gets triggered just the same as your MAIL alerts do. But your script can query xymond to see when the status last changed (to red, presumably) - it's the "lastchange" field stored for a status. So you could put something like this in your alert script:

#!/bin/sh

# This script only handles red
if test "$BBCOLORLEVEL" != "red"
then
    exit 0
fi

# Ask xymond when the status last changed (epoch seconds, first line of the reply)
REDSTART=`xymon 127.0.0.1 "xymondlog $BBHOSTNAME.$BBSVCNAME fields=lastchange" | head -n 1`
NOW=`date +%s`
REDDURATION=`expr $NOW - $REDSTART`
if test $REDDURATION -lt 10800   # 3-hour (10800 secs) delay
then
    exit 0
fi

# ... send the alert ...

(the "head -n 1" is needed, because xymondlog also sends you the full status message. On the other hand, that might be useful when generating the alert message).
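The same idea could be extended to the four-tier escalation described at the start of the thread. This is a rough sketch under stated assumptions: the function name `tiers_for_red_duration` is made up, the tier names alert1 to alert4 and the 10/20/40/60-minute delays are taken from the original post, and the red duration is assumed to have been computed from "lastchange" as in the script above.

```shell
#!/bin/sh
# tiers_for_red_duration SECONDS: print the tiers whose delay has elapsed,
# counting red time only. Thresholds are 10, 20, 40 and 60 minutes.
tiers_for_red_duration() {
    red_secs=$1
    tiers=""
    [ "$red_secs" -ge 600 ]  && tiers="$tiers alert1"
    [ "$red_secs" -ge 1200 ] && tiers="$tiers alert2"
    [ "$red_secs" -ge 2400 ] && tiers="$tiers alert3"
    [ "$red_secs" -ge 3600 ] && tiers="$tiers alert4"
    # Strip the leading space before printing.
    echo "${tiers# }"
}

# Example: 25 minutes of red reaches alert1 and alert2 only.
tiers_for_red_duration 1500
```

An alert script could loop over the printed tiers and page each one. It would still have to be triggered repeatedly while the status is red (for instance via a REPEAT setting on the rule), and this sketch does not remember who has already been paged.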

Regards, Henrik

spah＠syntec.co.uk

12 Jan

12:07 p.m.

xymon-bounces at xymon.com wrote:

...

On 11-01-2012 20:53, Gore, David W (David) wrote:

...
Since it has been argued that it is not exactly a bug, I would only humbly request that the current behavior is not changed but enhanced for those who want it to work differently. If an alert has been alarming for x time and then goes red, do you want to wait even longer to be alerted? Yellow time + red time, or yellow time and now it's red so alert, provided the yellow time exceeds the red threshold.

Yes, I do want to wait even longer. I want to wait for the duration that was specified in the alert rule, for the colour that was specified in the alert rule. And I think this is how one would expect xymond_alert to behave given the syntax of the rule, with no prior knowledge of Xymon (and not having read the documentation).

...

If I understand it correctly, then the unhappiness with the current setup is that the DURATION setting in alerts.cfg counts both yellow and red time. So when a status goes yellow and stays there for a few hours before going red, a rule such as
MAIL cio@example.com COLOR=RED DURATION>3h
will trigger immediately.

Some would argue that if you haven't fixed a problem before it goes critical, then your CIO *should* be notified.

Sounds like, for people who want that behaviour, they need a (yet to be implemented) WARNINGDURATION> rule. This implies that tier1 support probably get alerts on yellows, which I expect could result in a lot of false positive alerts for them! But if that's how they want it, that's their affair.

...

The other school of thought argues that this rule means the CIO only wants to be informed when something has been really hosed for at least three hours. So the yellow warning-time shouldn't count when evaluating the DURATION setting for that rule - only the critical time counts.

Is that a correct understanding of the arguments here ?

Yes.
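To make the two readings concrete, here is a minimal sketch that evaluates the same DURATION>3h threshold against both clocks. The function names and the sample timestamps are hypothetical and are not part of Xymon; times are plain epoch seconds.

```sh
#!/bin/sh
# Hypothetical helpers comparing the two readings of DURATION>3h on a
# COLOR=red rule. None of these names exist in Xymon.

# Seconds since the status left green (the behaviour described above).
duration_since_nongreen() {   # args: now yellow_start red_start
    echo $(( $1 - $2 ))
}

# Seconds since the status last turned red (the behaviour being requested).
duration_since_red() {        # args: now yellow_start red_start
    echo $(( $1 - $3 ))
}

# Succeeds (exit status 0) when the duration exceeds the threshold.
rule_fires() {                # args: duration threshold
    [ "$1" -gt "$2" ]
}

THRESHOLD=10800               # 3 hours in seconds
YELLOW_START=0                # status went yellow at t=0
RED_START=14400               # and turned red four hours later
NOW=14460                     # one minute after turning red

if rule_fires "$(duration_since_nongreen $NOW $YELLOW_START $RED_START)" $THRESHOLD; then
    echo "combined clock: alert fires"
else
    echo "combined clock: alert held"
fi

if rule_fires "$(duration_since_red $NOW $YELLOW_START $RED_START)" $THRESHOLD; then
    echo "red-only clock: alert fires"
else
    echo "red-only clock: alert held"
fi
```

With these sample values the combined clock fires one minute after the status turns red, while the red-only clock holds the alert for three more hours.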

...

Let's say I implement the 3-hour delay before sending an escalation notice. What should happen if the status is yellow for two hours, then goes red for 2h50m, dips back into yellow for 10 minutes and then goes back to red ? Should the 2h50m count after the status was yellow for a while? Or does a 10 minute yellow status completely reset the duration counter for the almost-3-hours red status?

I already responded to this issue in my old post here: http://lists.xymon.com/oldarchive/2009/02/msg00145.html, but I'll quote the relevant part:

"...since this test can flap between yellow and red and I consider yellow to be a sufficient degree of recovery that I don't want another alert as soon as it goes red again. If we look at disk in particular though, surely if it is flapping between yellow and red the problem isn't too serious. If one does want an alert for this, one can eliminate the DURATION rule. If one does not, the DURATION rule should be a way of preventing getting alerts for the flapping behaviour. This is what I've always considered the use of the DURATION rule (although I was wrong given the way it is currently working)."

...

I'm not trying to be too pedantic here, but it is the sort of things that do happen. So let's discuss how it can best be handled.

I think Josh is right that changing this will require some sort of additional configuration setting to indicate that "this duration value applies to the time it's been red only". It's for curbing escalation notices. And therefore it is obviously only an issue for those statuses that can be yellow - not those that can only be red or green.

Continuing my quote from my old post: "Perhaps a more flexible and useful solution, while still remaining easy to use, is to incorporate the change you suggest [which was (quote Henrik): "What would probably be best was for Xymon to calculate the duration based on the COLOR-settings defined for the alert"] with a RECOVERY= rule in the alerts. So each rule can specify what colour constitutes a recovery. This means that some tests can have yellow while others have green, allowing for different alerting behaviour for flapping depending on the test, and it also allows those who get notified of recoveries to have this information when they want. :)"
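A rough sketch of how a per-rule recovery colour could change the outcome of the flapping scenario (yellow for 2h, red for 2h50m, yellow for 10 minutes, then red again). This is only one possible reading of the RECOVERY= idea: the clock starts when the status first turns red and keeps running until a recovery colour is seen. The function name and the colour:minutes argument format are assumptions made for illustration.

```sh
#!/bin/sh
# Hypothetical sketch, not Xymon code.
# usage: red_clock RECOVERY_COLOURS colour:minutes [colour:minutes ...]
# RECOVERY_COLOURS is a comma-separated list, e.g. "green" or "green,yellow".
# Prints the minutes accumulated by the current red event at the end of the history.
# Note: plain sh has no local variables, so these names are global.
red_clock() {
    recovery=",$1,"
    shift
    clock=0          # minutes since the current red event began
    in_event=0       # 1 once the status has been red and has not yet recovered
    for seg in "$@"; do
        colour=${seg%%:*}
        minutes=${seg##*:}
        case "$recovery" in
            *",$colour,"*)
                # A recovery colour ends the event and resets the clock.
                in_event=0
                clock=0
                continue
                ;;
        esac
        if [ "$colour" = "red" ]; then
            in_event=1
        fi
        if [ "$in_event" -eq 1 ]; then
            clock=$(( clock + minutes ))
        fi
    done
    echo "$clock"
}

# Yellow counts as recovery: the 10-minute dip resets the clock.
red_clock "green,yellow" yellow:120 red:170 yellow:10 red:0   # prints 0

# Only green counts as recovery: the clock keeps running through the dip.
red_clock "green" yellow:120 red:170 yellow:10 red:0          # prints 180
```

Under this reading, the initial two hours of yellow never count towards the red clock in either case, because the clock only starts at the first red.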

<snip>

...

Regards, Henrik

And, at the risk of dirtying this thread, a closely related issue is my original post in the same thread: http://lists.xymon.com/oldarchive/2009/01/msg00364.html Quote: "It seems the combination of TIME=W:0845:2355 and DURATION>15 in hobbit-alerts.cfg means the earliest an alert can be sent out is 9 am. Is this what you would expect? I would have expected these two rules to mean the test should be in an alarm colour for more than 15 minutes and be between the times of 08:45 and 23:55, weekdays. Instead it seems to be relating the DURATION with the time such that the DURATION only applies _during_ the TIME."

So, if the CIO has a DURATION > 3 hours for a particular alert and a global TIME=W:0845:2355 (to retain their beauty sleep) he (or she) will only get the alert after 11:45 am. Might not be what they want.
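The arithmetic behind those times can be checked with a small sketch. It assumes the behaviour described above, where DURATION only starts counting once the TIME window opens; the helper name is hypothetical and it does not handle windows that wrap past midnight.

```sh
#!/bin/sh
# Hypothetical helper: earliest alert time if DURATION is only counted
# inside the TIME window.
# usage: earliest_alert WINDOW_START(HHMM) DURATION_MINUTES  -> prints HH:MM
earliest_alert() {
    start=$1
    hh=${start%??}      # leading two digits: the hour
    mm=${start#??}      # trailing two digits: the minutes
    # Strip one leading zero so the shell does not treat "08" as invalid octal.
    hh=${hh#0}
    mm=${mm#0}
    total=$(( hh * 60 + mm + $2 ))
    printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

earliest_alert 0845 15     # prints 09:00
earliest_alert 0845 180    # prints 11:45
```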

Kind regards,

SebA

betsy.schwartz＠gmail.com

14 Jan

10:49 p.m.

Exactly. If something is yellow, by definition, we've said it's NOT critical.

Our most frequent example is disk space. A disk which fills up 100% will cause a critical disruption to production. On many disks we go yellow at 80%, to give ourselves plenty of warning, and red at 95%. Now when a disk goes red, I do want someone to look at it immediately, but it doesn't really matter that it's been yellow for a long time. In fact, the LONGER it's been yellow the LESS urgent it is, because it's not filling up very quickly. Our senior team does NOT want to be paged for this!

If I wanted something to page when it's been yellow for three hours, I've already got the capability of paging after it's been yellow for three hours.

When something turns red, I want to follow the rules and timing for reds.

...

Let's say I implement the 3-hour delay before sending an escalation notice. What should happen if the status is yellow for two hours, then goes red for 2h50m, dips back into yellow for 10 minutes and then goes back to red ? Should the 2h50m count after the status was yellow for a while? Or does a 10 minute yellow status completely reset the duration counter for the almost-3-hours red status?

This case doesn't make a lot of sense to me. If something's been red for 2h50, I've probably already escalated it up to the hilt. The above scenario is only a problem in the case where a red alert is set to be ignored for the first three hours. I don't think that's a common scenario. Anything we could ignore for 3 hours is probably a yellow.

Having to write a custom test for every single red in our environment doesn't seem like a good alternative, especially for the built-in tests.

henrik＠hswn.dk

11 Jan

8:28 p.m.

New subject: Please don't cross-post to the developer list (was: Re: Yellow->red escalation, bug or feature?)

On 11-01-2012 10:56, Carl Melgaard wrote:

...

It would be interesting to see if this bug could be squashed, now that flap-detection is in the game. But I haven’t seen Henrik on this list for a good time now – he’s active on the developer-list, tho – so I’m crossposting it there.

As I wrote to a couple of others, I would appreciate it if you do not crosspost to the developer list - it really isn't on-topic. Mail me directly if you feel there is some discussion on the mailing list that I may have missed.

Regards, Henrik

hinkman＠hinkman.com

8:16 p.m.

...

I think there's a counter that isn't reset.

Just guessing, but I would say you are close. Seems more like there is a counter missing. As mentioned in the old discussion included in a previous email, there is a single alert duration clock when there really needs to be both yellow and red clocks. Alert state issue again, maybe? See my comments at the bottom about another long-standing "lack of alert state" issue.

One possible non-pretty, non-scalable work-around for your issue would be to create a "red" test, i.e. diskred, that only has red-level thresholds and alerts config, and take the red alerts config off of the non-red test (but leave the red threshold). This would give you the correct red duration for your red-level paging alerts. You could use bb-hosts tricks like NOPROPRED, etc. to not show this "red" test on the web pages if you didn't want to. The non-red test would still go yellow and red so you would see it on the web, it just wouldn't be doing the red paging. Like I said, not pretty, but possibly better than the false positives you are getting. Possibly.

If the powers-that-be are willing to open the question of "alert state", then please, please also look into the long standing recovery message issue. Specifically, if you are emailing on yellow and paging on red, a test that goes green->yellow->red->yellow->green will result in a red page but only an email recovery. See http://lists.xymon.com/archive/2008-July/020107.html and http://lists.xymon.com/archive/2008-July/020152.html. Apologies if this seems like a thread hijack, that is not the intent at all, but rather these issues seem very closely related with respect to maintaining alert state and to what degree.
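One way to picture the extra alert state this would need: remember every channel that was alerted while the status was non-green, and send the recovery to all of them once it returns to green. The sketch below is purely illustrative. The email-on-yellow and page-on-red mapping is taken from the example above, and the function name is hypothetical; it says nothing about how xymond_alert is actually structured.

```sh
#!/bin/sh
# Hypothetical sketch, not Xymon code.
# usage: recovery_targets colour [colour ...]
# Prints the channels owed a recovery message at the last green in the sequence.
recovery_targets() {
    notified=""      # channels alerted since the last green
    result=""        # channels owed a recovery at the most recent green
    for colour in "$@"; do
        case "$colour" in
            yellow) channel="email" ;;
            red)    channel="page" ;;
            *)      channel="" ;;
        esac
        if [ "$colour" = "green" ]; then
            # Event over: everyone alerted during it gets the recovery.
            result=$notified
            notified=""
        else
            case " $notified " in
                *" $channel "*) ;;    # channel already recorded
                *) notified="${notified:+$notified }$channel" ;;
            esac
        fi
    done
    echo "$result"
}

recovery_targets green yellow red yellow green    # prints: email page
```

With per-event state like this, the pager would receive the recovery as well as the mailbox, even though the status passed back through yellow on its way to green.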

-- Mark L. Hinkle hinkman at hinkman.com

henrik＠hswn.dk

9:47 p.m.

On 11-01-2012 21:16, Mark Hinkle wrote:

...

...
I think there's a counter that isn't reset.

Just guessing, but I would say you are close. Seems more like there is a counter missing. As mentioned in the old discussion included in a previous email, there is a single alert duration clock when there really needs to be both yellow and red clocks. Alert state issue again, maybe? [snip] If the powers-that-be are willing to open the question of "alert state", then please, please also look into the long standing recovery message issue. Specifically, if you are emailing on yellow and paging on red, a test that goes green->yellow->red->yellow->green will result in a red page but only an email recovery. See http://lists.xymon.com/archive/2008-July/020107.html and http://lists.xymon.com/archive/2008-July/020152.html. Apologies if this seems like a thread hijack, that is not the intent at all, but rather these issues seem very closely related with respect to maintaining alert state and to what degree.

You're probably quite correct that there is some state information in the alert module that does not keep track of "enough" state to handle both of these feature requests. So it would make sense to look at them at the same time.

Regards, Henrik


participants (9)

betsy.schwartz＠gmail.com
Carl.Melgaard＠STAB.RM.DK
david.gore＠verizon.com
henrik＠hswn.dk
hinkman＠hinkman.com
josh＠imaginenetworksllc.com
novosirj＠umdnj.edu
skadz＠skadz.com
spah＠syntec.co.uk
