On 2014-02-10 8:18, Johan Sjöberg wrote:
A while ago, we upgraded to 4.3.15. It seems like the alert repeat setting isn't working: only the first alert is sent. We have an on-call person who receives the first alert via SMS after 7 minutes. It should then repeat every 15 minutes. The rest of the team gets their first alert after 22 minutes.
[snip config]
From the notification log:
Mon Feb 10 05:43:15 2014 web01.apache2 (123.123.123.123) alarms at domain.tld 1392007395 0
Mon Feb 10 05:51:15 2014 web01.apache2 (123.123.123.123) 111111 1392007875 0
Mon Feb 10 06:05:17 2014 web01.apache2 (123.123.123.123) 222222 1392008717 0
Mon Feb 10 06:05:17 2014 web01.apache2 (123.123.123.123) 333333 1392008717 0
Mon Feb 10 06:05:17 2014 web01.apache2 (123.123.123.123) 444444 1392008717 0
Strangely though, it seems like it was working on Feb 5, which was also after the upgrade. The only change made since then is the patch for xymonnet, and I don't see how this could affect the alerts.
There are no changes to how alerts work in either 4.3.15 or 4.3.16.
I copied your configuration into a 4.3.16 system, and REPEAT is working fine here:
$ tail -f notifications.log
Mon Feb 10 09:39:58 2014 webmail.hswn.dk.conn (0.0.0.0) root[3] 1392021598 500
Mon Feb 10 09:46:16 2014 webmail.hswn.dk.conn (0.0.0.0) root-1[4] 1392021976 500
Mon Feb 10 10:01:57 2014 webmail.hswn.dk.conn (0.0.0.0) root-1[4] 1392022917 500
Mon Feb 10 10:01:57 2014 webmail.hswn.dk.conn (0.0.0.0) root-2[5] 1392022917 500
Mon Feb 10 10:01:57 2014 webmail.hswn.dk.conn (0.0.0.0) root-3[6] 1392022917 500
Mon Feb 10 10:01:57 2014 webmail.hswn.dk.conn (0.0.0.0) root-4[7] 1392022917 500
Mon Feb 10 10:17:06 2014 webmail.hswn.dk.conn (0.0.0.0) root-1[4] 1392023826 500
Mon Feb 10 10:17:06 2014 webmail.hswn.dk.conn (0.0.0.0) root-2[5] 1392023826 500
Mon Feb 10 10:17:06 2014 webmail.hswn.dk.conn (0.0.0.0) root-3[6] 1392023826 500
Mon Feb 10 10:17:06 2014 webmail.hswn.dk.conn (0.0.0.0) root-4[7] 1392023826 500
(my "root" recipient is your first recipient, the "root-X" are your "111111", "222222" etc. recipients).
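To check whether repeats are actually arriving, it can help to compute the time between consecutive notifications for each recipient. The following is a minimal sketch in Python, assuming the notification log lines have the layout seen in this thread (human-readable date, host.service, IP in parentheses, recipient, Unix timestamp, a trailing number). The names LINE_RE and repeat_gaps are made up for this illustration and are not part of Xymon.

```python
import re
from collections import defaultdict

# Layout inferred from the log lines quoted in this thread (an assumption,
# not an official format description).
LINE_RE = re.compile(
    r"^\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4} "  # e.g. "Mon Feb 10 09:46:16 2014"
    r"(?P<target>\S+) "                                # host.service
    r"\((?P<ip>[^)]*)\) "                              # IP address in parentheses
    r"(?P<recipient>.+) "                              # recipient (may contain spaces)
    r"(?P<epoch>\d+) "                                 # Unix timestamp of the notification
    r"(?P<code>\d+)$"                                  # trailing numeric field
)


def repeat_gaps(lines):
    """Return {(target, recipient): [seconds between consecutive notifications]}."""
    times = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that does not look like a notification line
        key = (match["target"], match["recipient"])
        times[key].append(int(match["epoch"]))

    gaps = {}
    for key, stamps in times.items():
        stamps.sort()
        gaps[key] = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return gaps


if __name__ == "__main__":
    sample = [
        "Mon Feb 10 09:46:16 2014 webmail.hswn.dk.conn (0.0.0.0) root-1[4] 1392021976 500",
        "Mon Feb 10 10:01:57 2014 webmail.hswn.dk.conn (0.0.0.0) root-1[4] 1392022917 500",
        "Mon Feb 10 10:17:06 2014 webmail.hswn.dk.conn (0.0.0.0) root-1[4] 1392023826 500",
    ]
    for (target, recipient), seconds in repeat_gaps(sample).items():
        minutes = [round(s / 60, 1) for s in seconds]
        print(target, recipient, minutes)
```

A recipient with a 15-minute REPEAT should show gaps of roughly 15 minutes; a recipient with an empty list received only a single notification in the log that was examined.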
You didn't list the history log for the web01.apache2 service. Are you sure that it was red all of the time? Any green status will reset the REPEAT interval, which could explain why you don't see the repeats.
Running xymond_alert with the "--debug" option will log a lot of data about how alert messages are handled. It would be nice to have this if the problem re-occurs.
Regards, Henrik