On 11-01-2012 20:53, Gore, David W (David) wrote:
Since it has been argued that it is not exactly a bug, I would only humbly request that the current behavior not be changed, but enhanced for those who want it to work differently. If an alert has been alarming for x time and then goes red, do you want to wait even longer to be alerted? Yellow time + red time, or yellow time and now it's red so alert, provided the yellow time exceeds the red threshold.
If I understand it correctly, the unhappiness with the current setup is that the DURATION setting in alerts.cfg counts both yellow and red time. So when a status goes yellow and stays there for a few hours before going red, a rule such as
MAIL cio@example.com COLOR=RED DURATION>3h
will trigger immediately.
Some would argue that if you haven't fixed a problem before it goes critical, then your CIO *should* be notified.
The other school of thought argues that this rule means the CIO only wants to be informed when something has been really hosed for at least three hours. So the yellow warning-time shouldn't count when evaluating the DURATION setting for that rule - only the critical time counts.
Is that a correct understanding of the arguments here ?
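To make the two readings concrete, here is a minimal sketch of the arithmetic. The function name "fires" and the timestamps are made up for illustration; this is not how the alert module is implemented, only a way to compare the two ways of counting.

```shell
#!/bin/sh
# Illustration only: "fires" is a hypothetical helper, not part of Xymon.
# fires START NOW THRESHOLD
# Prints "yes" if more than THRESHOLD seconds have passed since START.
fires() {
    if [ $(( $2 - $1 )) -gt "$3" ]; then
        echo yes
    else
        echo no
    fi
}

YELLOW_START=0      # status went yellow at t=0
RED_START=14400     # ... and went red four hours later
NOW=14460           # one minute after going red
THRESHOLD=10800     # DURATION>3h, in seconds

# Reading 1: the clock starts when the status first went yellow.
echo "counting yellow+red time: $(fires "$YELLOW_START" "$NOW" "$THRESHOLD")"
# Reading 2: the clock starts only when the status went red.
echo "counting red time only:   $(fires "$RED_START" "$NOW" "$THRESHOLD")"
```

With these numbers the first reading alerts one minute after the status goes red, while the second stays quiet until it has been red for three hours.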
Let's say I implement the 3-hour delay before sending an escalation notice. What should happen if the status is yellow for two hours, then goes red for 2h50m, dips back into yellow for 10 minutes and then goes back to red ? Should the earlier 2h50m of red still count once the status has been yellow in between? Or does a 10 minute yellow status completely reset the duration counter for the almost-3-hours red status?
I'm not trying to be too pedantic here, but this is the sort of thing that does happen. So let's discuss how it can best be handled.
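The two possible answers can be sketched as two counting policies. Everything here is hypothetical: the function "red_time", the policy names "reset" and "cumulative", and the color:seconds notation are invented for the example and say nothing about what the alert module does or will do.

```shell
#!/bin/sh
# Illustration only: red_time and its policy names are hypothetical.
# red_time POLICY SEGMENT...
# Each SEGMENT is color:seconds, oldest first.
#   reset       any non-red segment restarts the red counter
#   cumulative  only a green segment restarts the red counter
red_time() {
    policy=$1
    shift
    total=0
    for seg in "$@"; do
        color=${seg%%:*}
        secs=${seg##*:}
        case $color in
            red)
                total=$(( total + secs ))
                ;;
            green)
                total=0
                ;;
            *)
                if [ "$policy" = "reset" ]; then
                    total=0
                fi
                ;;
        esac
    done
    echo "$total"
}

# Yellow for 2h, red for 2h50m, yellow for 10 minutes, then 5 minutes into red again.
red_time reset      yellow:7200 red:10200 yellow:600 red:300
red_time cumulative yellow:7200 red:10200 yellow:600 red:300
```

The first call prints 300 and the second prints 10500, so under "reset" the escalation is almost three hours away, while under "cumulative" it is five minutes away.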
I think Josh is right that changing this will require some sort of additional configuration setting to indicate that "this duration value applies to the time it's been red only". It's for curbing escalation notices. And therefore it is obviously only an issue for those statuses that can be yellow - not those that can only be red or green.
It's been quite some time since I last dug into the alert-module code, so I cannot say how much effort it will take to add this. Right now I am not sure if the alert module has enough information about an alert to be able to implement it.
Meanwhile, may I draw your attention to the "SCRIPT" way of sending alerts. It's not an ideal solution, but I think it's a usable work-around for this problem:
The alert script gets triggered just the same as your MAIL alerts do. But your script can query xymond to see when the status last changed (to red, presumably) - it's the "lastchange" field stored for a status. So you could put something like this in your alert script:
#!/bin/sh

# This script only handles red
if test "$BBCOLORLEVEL" != "red"
then
    exit 0
fi

# Ask xymond when the status last changed; the first line of the reply is the timestamp
REDSTART=$(xymon 127.0.0.1 "xymondlog $BBHOSTNAME.$BBSVCNAME fields=lastchange" | head -n 1)
NOW=$(date +%s)
REDDURATION=$(expr "$NOW" - "$REDSTART")
if test "$REDDURATION" -lt 10800   # 3-hour (10800 secs) delay
then
    exit 0
fi

# ... send the alert ...
(The "head -n 1" is needed because xymondlog also sends you the full status message. On the other hand, that might be useful when generating the alert message.)
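If you do want the status text for the alert message, one option is to save the reply to a file and split it. The following is a rough sketch, assuming (as described above) that the first line of the reply is the lastchange timestamp and the rest is the status message. The function names are my own invention, and the check for a non-numeric first line is an extra precaution I added.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical, not part of Xymon.

# red_for_at_least REPLYFILE NOW THRESHOLD
# Returns 0 if the timestamp on line 1 of REPLYFILE is at least
# THRESHOLD seconds before NOW, 1 if it is more recent, and 2 if
# line 1 is not a plain number.
red_for_at_least() {
    start=$(head -n 1 "$1")
    case $start in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ $(( $2 - start )) -ge "$3" ]
}

# status_text REPLYFILE
# Prints everything after the first line, i.e. the status message.
status_text() {
    tail -n +2 "$1"
}
```

In the alert script you would redirect the output of the xymon command into a temporary file, call red_for_at_least with that file, the output of "date +%s" and 10800, and feed the output of status_text into whatever sends the alert.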
Regards, Henrik