AW: [hobbit] Use hobbit in operation center with critical systems view

12 Nov 2007


      ...
...
Acknowledge;
If an alert is acknowledged by the operators in the critical systems
view, the acknowledgement stays fixed for the given time, even when
there is a status change.
When a problem is fixed and goes red/yellow again, it will not show
up in the critical view until the acked time has expired.
There should be an option to ack an alert until a status
change (like in
...
disable until ok).
I decided against the "ack-until-ok" method, because in my experience
systems often go briefly ok while being fixed, and then they crash
again. (E.g. you'd reboot a server and all the processes start up, but
one process that is being monitored dies after a few minutes). So the
monitoring reports OK for a few minutes, and then goes red - if you did
use an "ack-until-ok" it would show up on the critical systems view
again, triggering a new ticket.
What happens now is that when the status goes green, a timer kicks off
in Hobbit which lasts 12 minutes (i.e. 2 normal test cycles, plus a bit
for good measure). If the test has been OK throughout those 12 minutes
then the ack is cleared; if it goes non-green during that time the timer
is reset and the ack persists (at least until it eventually expires).
I agree with you that ack-until-ok could result in a lot more unneeded alerts, so it is unnecessary.
The clear time of 12 minutes is a good choice; it might be made an option in hobbitserver.cfg.
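To make the ack-clearing behaviour described above concrete, here is a minimal sketch in Python. It is not Hobbit's actual code: the names Ack, update_ack and CLEAR_DELAY are hypothetical, and the 12-minute value is simply the figure quoted in this thread.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: 12 minutes, the figure mentioned in the thread.
CLEAR_DELAY = 12 * 60  # seconds


@dataclass
class Ack:
    """Hypothetical record of one acknowledgement (not a Hobbit structure)."""
    expires_at: float                    # time at which the ack expires anyway
    green_since: Optional[float] = None  # start of the current green period


def update_ack(ack: Optional[Ack], color: str, now: float,
               clear_delay: float = CLEAR_DELAY) -> Optional[Ack]:
    """Return the ack if it is still in force, or None once it is cleared."""
    if ack is None or now >= ack.expires_at:
        return None                      # no ack, or the acked time has run out
    if color == "green":
        if ack.green_since is None:
            ack.green_since = now        # status just went green: start the timer
        elif now - ack.green_since >= clear_delay:
            return None                  # green for the whole delay: clear the ack
    else:
        ack.green_since = None           # non-green again: reset the timer, keep the ack
    return ack
```

Called once per status update, this keeps the ack through a brief green period (for example right after a reboot) and drops it only after the status has stayed green for the full delay, or when the acked time runs out.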
...
...
Definition (Edit Critical Systems);
The easiest way for us: we made standard definitions and add hosts
to these templates. Works fine.
But I miss a connection between alerts and the critical view
definition. Something like an option in hobbit-alerts.cfg to
define that this rule is also valid for the critical view.
Send an email when an alert shows up in the critical view, with
all the possibilities from hobbit-alerts.cfg.
Wouldn't these two do the same thing?
Currently, during the daytime, the recovery group gets alerts on the in-house pager.
These are the same systems as in the operator view, but the definition is in hobbit-alerts.cfg.
By the way, in page.log I get this message from my custom pager script:
2007-11-09 09:05:00 hobbitd_alert: Got message 52634, expected 52615
Maybe the reason is the long runtime of the script, which sends the message through a slow analog modem connection on another server; this takes 30 seconds to finish.
But I don't know what this message really means; it seems to work as expected.
...
Using the alert definitions to control the critical view is an
interesting idea, I hadn't thought of that.
...
Special case: messages missed or noticed late by the Operation Center;
Currently some applications/scripts send alerts to the Console
View, and the Operation Center makes an alert call for each event.
A problem in Hobbit/BB is that when a red message changes,
the Operation Center doesn't notice until the
acknowledge time runs out and they make the alert call again.
This can happen, for example, in the disk status test (a
second filesystem goes red) or with nested tests/logfiles.
With the Event Console they get two messages (one for each
filesystem).
This is a problem with all of the tests that have multiple ways of going
red: disk, procs, msgs and http are the common ones. I don't have a
solution to that right now. The way Hobbit works right now assumes that
when you get an alert about the "disk" status, you keep on fixing it
until the status goes green - and then the Operations Center won't need
to raise a ticket for the second event.
It's as you say: when it's red, I have to fix it until the test is green again.
Maybe we will split up some tests (for example, create a separate test for important procs, or split custom tests).
Roland

roland.graeub＠rtc.ch