I'm going to be tasked with integrating Hobbit with an event aggregator such as IBM Netcool or EMC Smarts sometime in the near future. Most items will be a piece of cake, such as the ping, port, and filesystem tests. Certain items, however, will be problematic from what I've seen. I'll describe the problems to see if anyone has ideas about how they can be addressed.
The event aggregator does not want all status information for each item monitored, only the triggered events that cause a state change. This is easy to configure using a script in the alerts file that matches all hosts for all colors. When a test recovers, a recovery is sent and the event disappears from our view.
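A minimal sketch of what such an alert script might do, in Python. It assumes the script is run from a SCRIPT recipient and that Hobbit passes the alert details in the environment variables BBHOSTNAME, BBSVCNAME, BBCOLORLEVEL, RECOVERED and BBALPHAMSG; those names are from memory of the alerts file documentation and should be checked against your version. The build_event function and the event field names are made up for illustration, and the actual hand-off to the aggregator is left out.

```python
import os


def build_event(env):
    """Turn the alert script's environment into a flat event record.

    The variable names below are assumptions about what Hobbit exports
    to a SCRIPT recipient; verify them before relying on this.
    """
    recovered = env.get("RECOVERED", "0") == "1"
    message = env.get("BBALPHAMSG", "")
    # Use the first line of the status text as a short summary.
    summary = (message.splitlines() or [""])[0]
    return {
        "host": env.get("BBHOSTNAME", ""),
        "test": env.get("BBSVCNAME", ""),
        # A recovery is reported as "clear" so the aggregator can close the event.
        "severity": "clear" if recovered else env.get("BBCOLORLEVEL", "unknown"),
        "summary": summary,
    }


if __name__ == "__main__":
    # Hypothetical hand-off: print the record; a real script would
    # forward it to the aggregator's probe or trap receiver instead.
    print(build_event(os.environ))
```

Reporting a recovery as a "clear" severity is only one possible convention; the aggregator may expect something else.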
Grouped items, such as the process check and log monitors, are the issue. A single process being down causes the whole check to go red. A process configured to alert only the operators can then mask another process on the same system and keep it from notifying the DBAs. Setting the alert repeat interval to 0 exposes the other problem: a recovery message is not generated for each process that recovers, only when the whole group of processes recovers.
The only way I've been able to wrap my head around this is to use a database holding the current state of all monitored processes and log files. I tried separate alert groups with rules to handle each target of a process check separately, but the rules match multiple times and lead to confusion. I tried using a channel on the hobbitd alert, but the GROUP= items from the configuration file do not get passed as associated with the individual process, only on the page line.
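To make the database idea concrete, here is a rough sketch in Python using SQLite. It assumes each monitored item shows up in the status text as its own line of the form "&color name ...", which is how I understand the process check to report but is not verified here. The table layout and the function names (open_state, parse_items, update_state) are invented for illustration. It returns one change per item, so a recovery can be sent for each process on its own rather than for the whole group.

```python
import re
import sqlite3

# Assumed per-item line format in the status text: "&red oracle (found 0 ...)".
ITEM_LINE = re.compile(r"^&(green|yellow|red|clear)\s+(\S+)")


def parse_items(message):
    """Return {item_name: color} for every per-item line in a status message."""
    items = {}
    for line in message.splitlines():
        match = ITEM_LINE.match(line.strip())
        if match:
            items[match.group(2)] = match.group(1)
    return items


def open_state(path=":memory:"):
    """Open (or create) the state database; the schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS item_state ("
        "host TEXT, test TEXT, item TEXT, color TEXT, "
        "PRIMARY KEY (host, test, item))"
    )
    return db


def update_state(db, host, test, message):
    """Store the latest color per item and return (item, old, new) changes."""
    changes = []
    for item, color in parse_items(message).items():
        row = db.execute(
            "SELECT color FROM item_state WHERE host=? AND test=? AND item=?",
            (host, test, item),
        ).fetchone()
        old = row[0] if row else None
        if old == color:
            continue
        # An item seen for the first time in green is recorded but is not an event.
        if not (old is None and color == "green"):
            changes.append((item, old, color))
        db.execute(
            "INSERT OR REPLACE INTO item_state VALUES (?, ?, ?, ?)",
            (host, test, item, color),
        )
    db.commit()
    return changes
```

For example, feeding it "&green sshd" and "&red oracle" for the procs test on one host, followed by a message with both in green, would yield a red event for oracle and then a separate recovery for it. Routing each change to operators or DBAs would still need some per-item mapping, which this sketch does not attempt.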
Any ideas or feedback are appreciated.
Jason