[hobbit] Use hobbit in operation center with critical systems view
On Mon, Oct 15, 2007 at 09:29:53AM +0200, Gräub Roland wrote:
In our environment, the Operation Center always makes a call when an alert shows up on their Event Console and then acknowledges the alert. After this action the alert is no longer visible to the operators.
The following questions/thoughts came up when we looked closer:
Acknowledge: If an alert is acknowledged by the operators in the critical systems view, the acknowledgement is fixed for the given time, even when there is a status change. When a problem is fixed and then goes red/yellow again, it will not show up in the critical view until the ack time has expired. There should be an option to ack an alert until a status change (as with disable-until-OK).
I decided against the "ack-until-ok" method, because in my experience systems often go briefly OK while being fixed, and then they crash again. (E.g. you'd reboot a server and all the processes start up, but one process that is being monitored dies after a few minutes.) So the monitoring reports OK for a few minutes, and then goes red. If you did use an "ack-until-ok", it would show up on the critical systems view again, triggering a new ticket.
What happens now is that when the status goes green, a timer kicks off in Hobbit which lasts 12 minutes (i.e. 2 normal test cycles, plus a bit for good measure). If the test has been OK throughout those 12 minutes then the ack is cleared; if it goes non-green during that time the timer is reset and the ack persists (at least until it eventually expires).
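The clearing rule described above can be sketched roughly as follows. This is only an illustration in Python, not Hobbit's actual code: the names Ack, update_ack and CLEAR_DELAY are hypothetical, and the sketch assumes each status update arrives with a color and a timestamp in seconds.

```python
from dataclasses import dataclass
from typing import Optional

CLEAR_DELAY = 12 * 60  # seconds; the 12 minutes mentioned above


@dataclass
class Ack:
    expires_at: float                    # when the operator's ack runs out
    green_since: Optional[float] = None  # start of the current green streak


def update_ack(ack: Optional[Ack], color: str, now: float) -> Optional[Ack]:
    """Return the ack that survives this status update, or None if it is gone."""
    if ack is None or now >= ack.expires_at:
        return None                      # no ack, or the ack has expired
    if color != "green":
        ack.green_since = None           # non-green: reset the timer, keep the ack
        return ack
    if ack.green_since is None:
        ack.green_since = now            # first green report starts the timer
    if now - ack.green_since >= CLEAR_DELAY:
        return None                      # green throughout the delay: clear the ack
    return ack
```

With this rule, a server that reports green for a few minutes during a reboot and then goes red again keeps its ack, so it does not reappear in the critical view as a new event.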
The Host-ack option seems to be broken: on my system only one test is acknowledged although the Host-ack checkbox is selected.
A quick test says you're right. Will have to look into that.
Log: A log/report for the critical view is missing. A report with information about the alerts and the acknowledgements made in the critical systems view would be helpful.
Right now it isn't even being logged, except inside the Hobbit daemon. A reporting tool is needed, I agree.
Definition (Edit Critical Systems): The easiest way for us was this: we made standard definitions and added hosts to these templates. This works fine. But I miss a connection between the alerts and the critical view definition. Something like an option in hobbit-alerts.cfg to define that a rule is also valid for the critical view. Send an email when an alert shows up in the critical view, with all the possibilities from hobbit-alerts.cfg.
Wouldn't these two do the same thing? Using the alert definitions to control the critical view is an interesting idea, I hadn't thought of that.
Special case, messages missed or noticed late by the Operation Center: Currently some applications/scripts send alerts to the Console View, and the Operation Center makes an alert call for each event. A problem in Hobbit/BB is that when something changes within a red status, the Operation Center does not notice until the acknowledge time runs out, and then they make the alert call again. This can happen, for example, in the disk status test (a second filesystem goes red) or with nested tests/logfiles. With the Event Console they get two messages (one for each filesystem).
This is a problem with all of the tests that have multiple ways of going red: disk, procs, msgs and http are the common ones. I don't have a solution to that right now. The way Hobbit works right now assumes that when you get an alert about the "disk" status, you keep on fixing it until the status goes green - and then the Operations Center won't need to raise a ticket for the second event.
Regards, Henrik
-----Original Message-----
From: Henrik Stoerner [mailto:henrik at hswn.dk]
Sent: Thursday, November 8, 2007 21:27
To: hobbit at hswn.dk
Subject: Re: [hobbit] Use hobbit in operation center with critical systems view
As a solution to this problem, I count the alerts within each test; if the number of alerts has changed, a new alert is generated with the status of the test.
To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk
Sounds promising; how exactly does it work?
If the alert is already red, how can you send a new alert?
Hi
-----Original Message-----
From: Gräub Roland [mailto:roland.graeub at rtc.ch]
Sent: Monday, November 12, 2007 07:13
To: hobbit at hswn.dk
Subject: AW: [hobbit] Use hobbit in operation center with critical systems view
I create a new red-to-red event, so I get a new event every time the number of alerts changes. I have this procedure in a piece of software that was a modification of BB. I am now working on adapting it to Hobbit.
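The counting idea could look roughly like the sketch below. It is an assumption-laden illustration in Python, not the BB modification itself: the function red_to_red_event, the table _last_count and the list of failing items per test are all hypothetical names introduced here.

```python
from typing import Dict, List, Optional, Tuple

# (host, test) -> number of failing items seen in the last red report
_last_count: Dict[Tuple[str, str], int] = {}


def red_to_red_event(host: str, test: str, color: str,
                     failing: List[str]) -> Optional[str]:
    """Return a message for a red-to-red event, or None if nothing new happened."""
    key = (host, test)
    if color != "red":
        _last_count.pop(key, None)       # forget the count once the test leaves red
        return None
    previous = _last_count.get(key)
    _last_count[key] = len(failing)
    if previous is None:
        return None                      # first red report: the normal alert covers it
    if previous != len(failing):
        return (f"{host}.{test}: still red, failing items changed "
                f"from {previous} to {len(failing)}")
    return None                          # same count as before: no new event
```

Comparing only the count is the simplest form of the idea. It would miss the case where one filesystem recovers and a different one goes red within the same test cycle; comparing the set of failing items instead of their number would catch that.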
As has been mentioned before, it seems the "Info" column doesn't properly display GROUP alert definitions...
Anyway, what about doing something with the way GROUP alerts are defined to take care of such tests with multiple ways of going red? For starters, I wouldn't think it would be too hard to modify the Critical Systems page to handle group-based alerts. You could then expand on that idea to take care of each individual triggering event. Migrating this functionality to the non-green page etc. might take a little more work, but I know that at least where I work, getting this taken care of so our Operations Center doesn't needlessly call people is the first thing I would want to get working.
I agree with you that ack-until-ok could result in a lot more unneeded alerts, so it is unnecessary. The clear time of 12 minutes is a good choice; it might be made an option in hobbitserver.cfg.
Definition (Edit Critical Systems): The easiest way for us was this: we made standard definitions and added hosts to these templates. This works fine. But I miss a connection between the alerts and the critical view definition. Something like an option in hobbit-alerts.cfg to define that a rule is also valid for the critical view. Send an email when an alert shows up in the critical view, with all the possibilities from hobbit-alerts.cfg.
Wouldn't these two do the same thing?
Actually, in the daytime the recovery group gets alerts on the in-house pager. These are the same systems as in the operator view, but the definition is in hobbit-alerts.
By the way, in page.log I get this message from my custom pager script: 2007-11-09 09:05:00 hobbitd_alert: Got message 52634, expected 52615. Maybe the reason is the long runtime of the script, which sends the message through a slow analog modem connection on another server; this takes 30 seconds to finish. But I don't know what this message really means; it seems to work as expected.
It's as you say: when it's red, I have to fix it until the test is green again. Maybe we will split up some tests (for example, make a separate test for important procs, or split custom tests).
Roland
On Fri, 2007-11-09 at 00:26 +0100, Henrik Stoerner wrote:
On Mon, Oct 15, 2007 at 09:29:53AM +0200, Gräub Roland wrote:
Definition (Edit Critical Systems): The easiest way for us was this: we made standard definitions and added hosts to these templates. This works fine. But I miss a connection between the alerts and the critical view definition. Something like an option in hobbit-alerts.cfg to define that a rule is also valid for the critical view. Send an email when an alert shows up in the critical view, with all the possibilities from hobbit-alerts.cfg.
Wouldn't these two do the same thing? Using the alert definitions to control the critical view is an interesting idea, I hadn't thought of that.
I remind you that I previously asked for a method to filter in hobbit-alerts.cfg based on whether the test is a critical test (in its timeframe etc.).
The other problem I have with the critical view is that (in 4.2.0 + the patch set from about one year ago) the "Config Report (critical)" does not work: it displays nothing, even though "Config Report" for the same page/host lists the details in the NK column.
We have just enabled passing events via a SCRIPT rule in hobbit-alerts.cfg through to CA Unicenter (unfortunately with a proprietary middleware).
Regards, Buchan
participants (5)
- bgmilne@staff.telkomsa.net
- emichels@quicksoft.com.br
- gumby3203@gmail.com
- henrik@hswn.dk
- roland.graeub@rtc.ch