Serious trouble, red after yellow didn't page at all tonight
Yesterday we had a red after yellow page all the way up the hierarchy immediately. Today we had a red after yellow not page at ALL. It did page in BB (the test is going to both servers during this test period). Running Xymon 4.3.0 and really hoping to go live ASAP.
Here are the hist log entries, see it go red at 22:07 for five minutes
Tue Apr 5 17:32:35 2011 red 1302039155 300
Tue Apr 5 17:37:35 2011 green 1302039455 900
Tue Apr 5 17:52:35 2011 yellow 1302040355 599
Tue Apr 5 18:02:34 2011 red 1302040954 601
Tue Apr 5 18:12:35 2011 yellow 1302041555 1199
Tue Apr 5 18:32:34 2011 red 1302042754 900
Tue Apr 5 18:47:34 2011 yellow 1302043654 12002
Tue Apr 5 22:07:36 2011 red 1302055656 300
Tue Apr 5 22:12:36 2011 yellow 1302055956
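To make the red periods easier to pick out, here is a minimal Python sketch that parses the history entries above. It assumes each entry has the layout shown (a five-field timestamp, the color, the epoch start time, and an optional duration in seconds); the function names `parse_hist` and `red_periods` are made up for illustration and are not part of Xymon.

```python
# History entries copied from the log above, one per line.
HIST = """\
Tue Apr 5 17:32:35 2011 red 1302039155 300
Tue Apr 5 17:37:35 2011 green 1302039455 900
Tue Apr 5 17:52:35 2011 yellow 1302040355 599
Tue Apr 5 18:02:34 2011 red 1302040954 601
Tue Apr 5 18:12:35 2011 yellow 1302041555 1199
Tue Apr 5 18:32:34 2011 red 1302042754 900
Tue Apr 5 18:47:34 2011 yellow 1302043654 12002
Tue Apr 5 22:07:36 2011 red 1302055656 300
Tue Apr 5 22:12:36 2011 yellow 1302055956
"""


def parse_hist(text):
    """Return a list of (color, start_epoch, duration_seconds_or_None)."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        color = fields[5]
        start = int(fields[6])
        # The still-open last entry has no duration field yet.
        duration = int(fields[7]) if len(fields) > 7 else None
        entries.append((color, start, duration))
    return entries


def red_periods(entries):
    """Keep only the red entries as (start_epoch, duration_seconds)."""
    return [(start, dur) for color, start, dur in entries if color == "red"]


if __name__ == "__main__":
    for start, dur in red_periods(parse_hist(HIST)):
        print(start, dur)
```

The last red entry it reports starts at epoch 1302055656 with a duration of 300 seconds, which is the 22:07 event in question.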
History shows critical status: Tue Apr 5 22:07:36 EDT 2011 OTHER Applications ( "mysqle1" ): CRITICAL
And it paged and emailed earlier in the evening: (domain name elided). It paged correctly at 6:34 and 6:45 but nothing at 10:07:
Tue Apr 5 17:34:28 2011 db0.other (10.100.4.51) techops[160] 1302039268 0
Tue Apr 5 17:34:28 2011 db0.com.other (10.100.4.51) alert1[162] 1302039268 0
Tue Apr 5 17:37:35 2011 db0.other (10.100.4.51) techops[160] 1302039455 0 300
Tue Apr 5 17:52:35 2011 db0.other (10.100.4.51) techops[160] 1302040355 0
Tue Apr 5 17:52:35 2011 db0.other (10.100.4.51) ticket[161] 1302040355 0
Tue Apr 5 18:04:18 2011 db0.other (10.100.4.51) techops[160] 1302041058 0
Tue Apr 5 18:04:18 2011 db0.other (10.100.4.51) alert1[162] 1302041058 0
Tue Apr 5 18:34:18 2011 db0.other (10.100.4.51) alert1[162] 1302042858 0
Tue Apr 5 18:34:18 2011 db0.other (10.100.4.51) alert2[163] 1302042858 0
Tue Apr 5 18:34:18 2011 db0.other (10.100.4.51) alert3[164] 1302042858 0
Tue Apr 5 18:45:02 2011 db0.other (10.100.4.51) alert1[162] 1302043502 0
Tue Apr 5 18:45:02 2011 db0.other (10.100.4.51) alert2[163] 1302043502 0
Tue Apr 5 18:45:02 2011 db0.other (10.100.4.51) alert3[164] 1302043502 0
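The gap can be checked mechanically by comparing the notification epochs against the red periods from the history log. The sketch below is only an illustration: the helper names are hypothetical, the field positions are assumed from the entries above, and treating recipients whose names start with "alert" as the pagers is an assumption based on the comments in the alert rules.

```python
# Red periods from the history log: (start_epoch, duration_seconds).
RED_WINDOWS = [
    (1302039155, 300),
    (1302040954, 601),
    (1302042754, 900),
    (1302055656, 300),
]

# Notification entries copied from the log above.
NOTIFICATIONS = """\
Tue Apr 5 17:34:28 2011 db0.other (10.100.4.51) techops[160] 1302039268 0
Tue Apr 5 17:34:28 2011 db0.com.other (10.100.4.51) alert1[162] 1302039268 0
Tue Apr 5 17:37:35 2011 db0.other (10.100.4.51) techops[160] 1302039455 0 300
Tue Apr 5 17:52:35 2011 db0.other (10.100.4.51) techops[160] 1302040355 0
Tue Apr 5 17:52:35 2011 db0.other (10.100.4.51) ticket[161] 1302040355 0
Tue Apr 5 18:04:18 2011 db0.other (10.100.4.51) techops[160] 1302041058 0
Tue Apr 5 18:04:18 2011 db0.other (10.100.4.51) alert1[162] 1302041058 0
Tue Apr 5 18:34:18 2011 db0.other (10.100.4.51) alert1[162] 1302042858 0
Tue Apr 5 18:34:18 2011 db0.other (10.100.4.51) alert2[163] 1302042858 0
Tue Apr 5 18:34:18 2011 db0.other (10.100.4.51) alert3[164] 1302042858 0
Tue Apr 5 18:45:02 2011 db0.other (10.100.4.51) alert1[162] 1302043502 0
Tue Apr 5 18:45:02 2011 db0.other (10.100.4.51) alert2[163] 1302043502 0
Tue Apr 5 18:45:02 2011 db0.other (10.100.4.51) alert3[164] 1302043502 0
"""


def parse_notifications(text):
    """Return a list of (recipient, epoch) from the notification entries."""
    notes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or truncated lines
        recipient = fields[7].split("[")[0]  # drop the "[rule line]" suffix
        notes.append((recipient, int(fields[8])))
    return notes


def unpaged_reds(windows, notes):
    """Return start epochs of red windows with no page sent during them."""
    # Assumption: only recipients named alert* are pagers.
    page_times = [t for name, t in notes if name.startswith("alert")]
    return [
        start
        for start, duration in windows
        if not any(start <= t <= start + duration for t in page_times)
    ]


if __name__ == "__main__":
    print(unpaged_reds(RED_WINDOWS, parse_notifications(NOTIFICATIONS)))
```

With the data above, the only red window without a page is the one starting at 1302055656, the 22:07 event.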
And here are lines 159-165 in the hobbit-alerts.cfg:
HOST=%^db EXHOST=%.*dl2.example* SERVICE=other
    MAIL techops REPEAT=1d RECOVERED
    MAIL ticket REPEAT=1d COLOR=yellow # open ticket email
    MAIL alert1 REPEAT=10 COLOR=red,purple FORMAT=SMS# page onshift or oncall at start RED, rep every 10 minutes
    MAIL alert2 DURATION>20 REPEAT=10 COLOR=red,purple FORMAT=SMS# page secondary after 20 mins RED . Repevery 10 minutes
    MAIL alert3 DURATION>40 REPEAT=10 COLOR=red,purple FORMAT=SMS# page tertiary after 40 mins RED. Rep every 10mins
    MAIL alert4 DURATION>60 REPEAT=10 COLOR=red,purple FORMAT=SMS# page team after 60 mins RED. Rpt every 10mins
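The intended escalation can be written down as a small Python model. This is a deliberately simplified sketch, not how Xymon evaluates rules: it looks only at COLOR and DURATION, ignores REPEAT, RECOVERED and FORMAT, assumes a recipient with no COLOR setting matches any alert color, and leaves open from which moment Xymon measures the duration. The `RULES` table and `matching_recipients` are hypothetical names.

```python
# (recipient, DURATION threshold in minutes or None, allowed colors or None)
# Transcribed from the recipient lines of the rule above.
RULES = [
    ("techops", None, None),
    ("ticket", None, {"yellow"}),
    ("alert1", None, {"red", "purple"}),
    ("alert2", 20, {"red", "purple"}),
    ("alert3", 40, {"red", "purple"}),
    ("alert4", 60, {"red", "purple"}),
]


def matching_recipients(color, minutes):
    """Return recipients whose COLOR and DURATION filters match.

    Simplified model: REPEAT, RECOVERED and FORMAT are not considered.
    """
    matched = []
    for name, threshold, colors in RULES:
        if colors is not None and color not in colors:
            continue  # COLOR filter excludes this status
        if threshold is not None and not minutes > threshold:
            continue  # DURATION>N not yet satisfied
        matched.append(name)
    return matched


if __name__ == "__main__":
    print(matching_recipients("red", 5))
    print(matching_recipients("red", 45))
```

Under this model a red status lasting five minutes should reach at least alert1, whatever starting point is used for the duration.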
I don't believe it was acked or signed out. It's a complex custom test.
Did Xymon create the alert? I believe there is a log specifically for this.
On Apr 5, 2011 11:45 PM, "Elizabeth Schwartz" <betsy.schwartz at gmail.com> wrote:
[quoted original message trimmed]