Hello Henrik,
Two things we can do:
- Add "--checkpoint-file=$BBTMP/alert.chk --checkpoint-interval=600" to the hobbitd_alert command in hobbitlaunch.cfg. That way it will remember all active alerts when you restart Hobbit. I'll do that ASAP (this coming Monday). That will certainly resolve this issue.
- When a new alert was first seen (also after a restart of Hobbit), the duration was reset to 0 instead of using the information Hobbit already had about when the status change occurred. I've changed this in the code, so that it picks up the duration of the alert from the timestamp we keep for when the last status change happened. OK, but that useful addition is for upcoming releases.
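For reference, the first change might look roughly like this in hobbitlaunch.cfg. This is only a sketch: the ENVFILE path and the rest of the CMD line are assumed from a typical default install and may differ on your system; only the two --checkpoint options come from the suggestion above.

```
# Sketch of the [hobbitd_alert] section; ENVFILE path and log option are assumptions
[hobbitd_alert]
	ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
	NEEDS hobbitd
	CMD hobbitd_channel --channel=page --log=$BBSERVERLOGS/page.log hobbitd_alert --checkpoint-file=$BBTMP/alert.chk --checkpoint-interval=600
```

With an interval of 600, the alert state would be saved to $BBTMP/alert.chk every 10 minutes, so a restart should lose at most that much alert history.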
However, I think I found out why the entire problem showed up in the first place. I had an alert config that first sent a mail when an event occurred and, if that was not dealt with properly, ran a pager script 20 minutes later. After an evening of applying (OS) patches, a reboot, etc., it did not work anymore. I came to think that it had to do with an alert config modification, which resulted in this email conversation.
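For illustration, a setup like the one described could be written in hobbit-alerts.cfg roughly as follows. The host name, mail address, script path and pager ID are hypothetical placeholders, not the actual configuration.

```
# Hypothetical rule: mail immediately, run the pager script once the alert is older than 20 minutes
HOST=www.example.com
	MAIL admin@example.com
	SCRIPT /usr/local/bin/pager.sh 1234 DURATION>20
```

The second recipient depends on the alert duration being tracked correctly, which is why a duration reset to 0 after a restart would delay or suppress the pager step.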
As suggested, I checked the alerttrace.log, but could not find a reason why this problem happened (I changed the pager script to mail, but that made no difference). It *does* work fine when *all* the alerts are processed at the same time!
Going through the mailing list and the Changes file for each version, I think it comes down to a known bug in Hobbit that is to be fixed in 4.1.2; see my mail from August 19th, 11:42.
Since we are running 4.0.4, I'm wondering what the wise thing to do is. The workaround works fine for now (we are a 24*7 university), so I'm thinking of waiting until 4.1.2 reaches stable status, since 4.1.1 does not fix this particular bug.
Regards, Peter