Thoughts

2 May 2007


I will soon be tasked with integrating Hobbit with an event aggregator
such as IBM Netcool or EMC Smarts.  Most items, such as the ping, port,
and filesystem tests, will be cake.  Certain items, however, look
problematic from what I've seen.  I'll describe the problems to see if
anyone has ideas about how they can be addressed.

The event aggregator does not want all status information for each item
monitored, only triggered events that cause a state change.  This is
easy to configure using a script in the alerts file that matches all
hosts for all colors.  When a test recovers, a recovery is sent and the
event disappears from our view.
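
A minimal sketch of such a forwarding script, in Python.  It assumes the
script is run as a SCRIPT recipient from the alerts file and reads the
environment variables Hobbit documents for alert scripts (BBHOSTNAME,
BBSVCNAME, BBCOLORLEVEL, RECOVERED, BBALPHAMSG).  The event layout and the
build_event helper are hypothetical, and the actual hand-off to Netcool or
Smarts is left out because it depends on the probe in use.

```python
import json
import os
import sys


def build_event(env):
    """Turn Hobbit alert-script environment variables into one event dict.

    The dict layout is an illustration only, not a Netcool or Smarts format.
    """
    # Assumption: RECOVERED is "1" when the alert is a recovery notice.
    recovered = env.get("RECOVERED", "0") == "1"
    message = env.get("BBALPHAMSG", "")
    return {
        "host": env.get("BBHOSTNAME", ""),
        "test": env.get("BBSVCNAME", ""),
        # A recovery is forwarded as a clear; anything else keeps its color.
        "state": "clear" if recovered else env.get("BBCOLORLEVEL", "unknown"),
        "recovered": recovered,
        # The first line of the status message serves as a short summary.
        "summary": message.splitlines()[0] if message.strip() else "",
    }


if __name__ == "__main__":
    # Print the event as JSON; a real script would hand it to the aggregator.
    json.dump(build_event(os.environ), sys.stdout)
    sys.stdout.write("\n")
```

Because the script only runs when an alert or recovery fires, the
aggregator sees state changes rather than every status update.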

Grouped items, such as the process check and log monitors, are the
problem.  A single process down causes the whole check to go red.  A
process configured to alert only the operators can then mask another
process on the same system from notifying the DBAs.  Setting the alert
repeat interval to 0 shows the other problem: a recovery message is not
generated for each process that recovers, only when the whole group of
processes recovers.
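
The masking can be illustrated with a few lines of Python.  This is only
a model of the behaviour described above, assuming the grouped check
reports the worst color of its members; the process names and the
group_color and group_changed helpers are made up for the example.

```python
# Severity order used to pick the worst color in a group (assumed model).
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def group_color(items):
    """Return the worst color among the items of a grouped check."""
    return max(items.values(), key=SEVERITY.__getitem__)


def group_changed(before, after):
    """True if the grouped check as a whole changed color."""
    return group_color(before) != group_color(after)


# The operators' process is already down when the DBAs' process fails.
before = {"operator_proc": "red", "dba_proc": "green"}
after = {"operator_proc": "red", "dba_proc": "red"}

# The group was red and stays red, so no new event is triggered.
masked = not group_changed(before, after)
```

The same model explains the missing recoveries: if operator_proc goes
back to green while dba_proc is still red, the group stays red and
nothing is sent.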

The only way I've been able to wrap my head around this is to use a
database that holds the current state of all monitored processes and log
files.  I tried separate alert groups with rules to distinguish each
target of a process check separately, but the rules match multiple times
and lead to confusion.  I also tried using a channel on the hobbitd
alert, but the GROUP= items from the configuration file are not passed
along as associated with the process, only on the page line.
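
A rough sketch of the database idea, using Python's sqlite3 module.  It
assumes something upstream has already split a grouped status into a
mapping of item name to color; that parsing step, the table layout, and
the open_store and update helpers are all hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the per-item state table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS item_state ("
        " host TEXT, test TEXT, item TEXT, color TEXT,"
        " PRIMARY KEY (host, test, item))"
    )
    return db


def update(db, host, test, items):
    """Store the latest color of each item and return per-item events."""
    events = []
    for item, color in sorted(items.items()):
        row = db.execute(
            "SELECT color FROM item_state WHERE host=? AND test=? AND item=?",
            (host, test, item),
        ).fetchone()
        # Items never seen before are treated as previously green.
        previous = row[0] if row else "green"
        if color != previous:
            kind = "recovery" if color == "green" else "alert"
            events.append((kind, host, test, item, color))
        db.execute(
            "INSERT OR REPLACE INTO item_state VALUES (?, ?, ?, ?)",
            (host, test, item, color),
        )
    db.commit()
    return events
```

Calling update() with each new set of item colors returns one event per
item whose color changed, so a second process going red, or a single
process recovering while the group is still red, each produce their own
event for the aggregator.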

Any ideas or feedback are appreciated.

Jason

jason.kruse＠teldta.com