Acknowledge; If an alert is acknowledged by the operators in the critical systems view, the acknowledgement is fixed for the given time, even when the status changes. When a problem is fixed and then goes red/yellow again, it will not show up in the critical view until the ack time has expired. There should be an option to ack an alert until a status change (like in "disable until OK").
I decided against the "ack-until-ok" method, because in my experience systems often go briefly OK while being fixed, and then they crash again. (E.g. you'd reboot a server and all the processes start up, but one process that is being monitored dies after a few minutes). So the monitoring reports OK for a few minutes, and then goes red - if you had used an "ack-until-ok", it would show up on the critical systems view again, triggering a new ticket.
What happens now is that when the status goes green, a timer kicks off in Hobbit which lasts 12 minutes (i.e. 2 normal test cycles, plus a bit for good measure). If the test has been OK throughout those 12 minutes then the ack is cleared; if it goes non-green during that time the timer is reset and the ack persists (at least until it eventually expires).
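To make the timer behaviour concrete, here is a rough sketch in Python. It is not Hobbit's actual code; the names `Ack`, `update_ack` and `CLEAR_DELAY` are made up for illustration, and it assumes the function is called once per incoming status update with the current color and time.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed constant: 12 minutes, i.e. two normal test cycles plus a margin.
CLEAR_DELAY = 12 * 60  # seconds


@dataclass
class Ack:
    expires_at: float                    # when the ack runs out on its own
    green_since: Optional[float] = None  # start of the current green period


def update_ack(ack: Optional[Ack], color: str, now: float) -> Optional[Ack]:
    """Return the ack if it should persist, or None if it is cleared."""
    if ack is None:
        return None
    if now >= ack.expires_at:
        return None                      # normal expiry of the acked time
    if color == "green":
        if ack.green_since is None:
            ack.green_since = now        # status went green: start the timer
        elif now - ack.green_since >= CLEAR_DELAY:
            return None                  # green throughout the delay: clear
    else:
        ack.green_since = None           # non-green again: reset the timer
    return ack
```

With this logic, a reboot where the status is green for five minutes and then red again keeps the ack in place, so no new ticket is triggered; only a full 12 minutes of green clears it.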
I agree with you that ack-until-ok could end in a lot more unneeded alerts, so it is unnecessary. The clear time of 12 minutes is a good choice; it might be made an option in hobbitserver.cfg.
Definition (Edit Critical Systems); The easiest way for us: we made standard definitions and add hosts to these templates. Works fine. But I miss a connection between alerts and the critical view definition. Something like an option in hobbit-alerts.cfg to define that a rule is also valid for the critical view. Or send an email when an alert shows up in the critical view, with all the possibilities from hobbit-alerts.cfg.
Wouldn't these two do the same thing?
Actually, in the daytime the recovery group gets alerts on the in-house pager. These are the identical systems as in the operator view, but the definition is in hobbit-alerts.
By the way, in page.log I get this message from my custom pager script:
2007-11-09 09:05:00 hobbitd_alert: Got message 52634, expected 52615
Maybe the reason is the long script runtime needed to send the message through a slow analog modem connection on another server; this takes 30 seconds to finish. But I don't know what this message really means; it seems to work as expected.
Using the alert definitions to control the critical view is an interesting idea, I hadn't thought of that.
Special case: messages missed or noticed late by the Operation Center; Currently some applications/scripts send alerts to the Console View, and the Operation Center makes an alert call for each event. A problem in Hobbit/BB is that when changes happen within a red message, the Operation Center doesn't notice until the acknowledge time runs out, and then they make the alert call again. This can happen, for example, in the disk status test (a second filesystem goes red) or with nested tests/logfiles. With the Event Console they get two messages (one for each filesystem).
This is a problem with all of the tests that have multiple ways of going red: disk, procs, msgs and http are the common ones. I don't have a solution to that right now. The way Hobbit currently works assumes that when you get an alert about the "disk" status, you keep on fixing it until the status goes green - and then the Operations Center won't need to raise a ticket for the second event.
It's as you say: when it's red, I have to keep fixing it until the test is green again. Maybe we will split up some tests (for example, make a separate test for important procs, or split custom tests).
Roland