On Wed, May 02, 2007 at 02:06:34PM -0500, Kruse, Jason K. wrote:
Grouped items, such as the process check and log monitors, are issues. A single process down causes the whole check to go red. A process listed as alerting only operators can then mask another process on the same system from notifying the DBAs. Setting the alert repeat interval to 0 shows the other problem: a recovery message is not generated for each process that recovers, only when the whole group of processes recovers.
This will be difficult to handle - it's a very basic thing in the Hobbit design that it only tracks the color of each status, not the details of which rule (out of many) causes e.g. the "procs" column to go red.
To do that, you would need to associate some "event ID" with each of the settings that can cause a red/yellow status; e.g. you'd have
    HOST=myhost
        PROC tnslistener 1 ID=100
        PROC httpd 4 ID=200
The "procs" status would then store the set of IDs that had been triggered for a status, and whenever the set of triggered rules changed, it would pass this information to some process.
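To make the idea concrete, here is a rough sketch in Python of what "tracking the set of triggered IDs" could look like. This is not Hobbit code: the function name diff_triggered, the event tuples and the sequence of updates are all made up for illustration, and only the IDs 100 and 200 come from the example above.

```python
def diff_triggered(previous, current):
    """Compare two sets of triggered rule IDs.

    Returns (newly_failed, recovered), each as a sorted list of IDs.
    """
    previous, current = set(previous), set(current)
    newly_failed = sorted(current - previous)
    recovered = sorted(previous - current)
    return newly_failed, recovered


# Hypothetical sequence of "procs" status updates for one host.
# Each entry is the set of rule IDs that were triggered in that update.
updates = [set(), {100}, {100, 200}, {200}, set()]

events = []
previous = set()
for current in updates:
    failed, recovered = diff_triggered(previous, current)
    # One event per rule, rather than one event per status color change.
    events += [("down", rule_id) for rule_id in failed]
    events += [("recovered", rule_id) for rule_id in recovered]
    previous = current

for event in events:
    print(event)
```

In this sketch the rule with ID 100 produces its own recovery event while ID 200 is still failing, which is the per-process behaviour asked for; the status color alone would have stayed red throughout.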
It can be done, but I am not particularly happy with it; it seems a bit too complex for my taste. If anyone has a better idea, please speak up.
(And just in case you wonder why I've used a new "event ID" instead of re-using the existing "group" definition: I can easily imagine a scenario where you have e.g. multiple processes monitored with alerts going to one group of people (i.e. several PROC rules have the same GROUP setting), but you still want to track exactly which processes are up or down - and then you need a unique ID for each PROC rule.)
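A small Python sketch of that distinction, under made-up assumptions: the RULES table, the group names "dba" and "operators", the "oracle" process and the function alerts_by_group are hypothetical and do not correspond to any existing Hobbit configuration or API. Two rules share a group, yet each keeps its own ID, so the failing processes can still be told apart.

```python
# Hypothetical rule table: IDs are unique per PROC rule,
# while several rules may share the same GROUP.
RULES = {
    100: {"proc": "tnslistener", "group": "dba"},
    200: {"proc": "oracle", "group": "dba"},
    300: {"proc": "httpd", "group": "operators"},
}


def alerts_by_group(triggered_ids):
    """Map each group to the processes currently failing for it.

    Processes are listed in ascending order of their rule ID.
    """
    result = {}
    for rule_id in sorted(triggered_ids):
        rule = RULES[rule_id]
        result.setdefault(rule["group"], []).append(rule["proc"])
    return result


print(alerts_by_group({100, 200}))
print(alerts_by_group({200, 300}))
```

With only the GROUP setting to go by, the first call could say no more than "something is wrong for the dba group"; the per-rule ID is what lets it name both processes.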
Regards, Henrik