I think that whatever solution is decided on should work for all other tests (in some form or another). My need for this would be the Disk report. I have a good number of Database Servers whose disks fill up regularly. These disks are located on SAN, and we either clean up the disk or put in a request for the SAN storage to be expanded. A storage expansion request could take 2-6 weeks to be fulfilled, so Ack'ing disk for that long 'blinds' you to any other disk issue that may crop up. So a way to ack just one volume would be very desirable. I have actually written an ext test module to do this. I am still in the process of bringing it over from BigBrother.
Another disk scenario I have is similar to the point raised with ports. It is when a server is shared between 2 groups (or more). Being able to have multiple disk reports would be very welcome. So groupA has a dedicated report for their volumes and so does groupB. I realize alerting can already be split up this way, but a way to split the reports would be nice to have as well. We do use Alternative Pagesets a lot, so that could be the reason that I like the idea of being able to split up a report into multiple reports. We create a Pageset for GroupA, with just the devices & reports that GroupA cares about. But when you have reports like disk, well, disk could be red due to a GroupB volume. And this sometimes confuses GroupA :(
I think the simplest solution would be to have a parameter in the hobbit-clients.cfg:
    DISK %(/mnt/Vol1|/mnt/Vol3|/mnt/Vol4) 90 95 REPORTALIAS=disk_a
    DISK %(/mnt/Vol2|/mnt/Vol5|/mnt/Vol6) 80 95 REPORTALIAS=disk_b
    DISK %!(disk_a|disk_b) 96 98
The last disk rule sets the alert values for all other volumes, except those defined by disk_a & disk_b. The same REPORTALIAS feature could be used for MSGS, PORTS, PROCS, FILES, etc. And these alias names could be used in the alert rules, instead of GROUP=.
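A rough sketch of how the REPORTALIAS idea might split one disk report into several, written in Python purely for illustration. Nothing here is existing Hobbit code: the rule table, the `split_reports` name, and the assumption that a volume goes yellow or red once usage reaches the threshold are all hypothetical.

```python
import re

# Hypothetical rule table mirroring the proposed config lines:
# (volume regex, warn %, panic %, report alias).
RULES = [
    (r"/mnt/Vol1|/mnt/Vol3|/mnt/Vol4", 90, 95, "disk_a"),
    (r"/mnt/Vol2|/mnt/Vol5|/mnt/Vol6", 80, 95, "disk_b"),
]
# Stands in for the catch-all "%!(disk_a|disk_b) 96 98" rule.
DEFAULT_RULE = (96, 98, "disk")

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def color_for(pct, warn, panic):
    """Assumed threshold semantics: reaching a level triggers it."""
    if pct >= panic:
        return "red"
    if pct >= warn:
        return "yellow"
    return "green"


def split_reports(usage):
    """Map {volume: percent used} to {report: (worst color, {volume: color})}."""
    reports = {}
    for volume, pct in usage.items():
        warn, panic, alias = DEFAULT_RULE
        for pattern, rule_warn, rule_panic, rule_alias in RULES:
            if re.fullmatch(pattern, volume):
                warn, panic, alias = rule_warn, rule_panic, rule_alias
                break
        color = color_for(pct, warn, panic)
        worst, details = reports.get(alias, ("green", {}))
        details[volume] = color
        if SEVERITY[color] > SEVERITY[worst]:
            worst = color
        reports[alias] = (worst, details)
    return reports


if __name__ == "__main__":
    sample = {"/mnt/Vol1": 96, "/mnt/Vol2": 85, "/mnt/Vol9": 50}
    for name, (color, details) in sorted(split_reports(sample).items()):
        print(name, color, details)
```

With this split, a full GroupB volume would only turn disk_b red, so a Pageset showing disk_a stays green for GroupA.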
Now the above suggestion still does not help when a report has an alert status (red|yellow) and more alert items are added/subtracted. I would love the feature of being alerted when a report has more/fewer items in it than it did previously. The simplest way I see to do that is by including an alertstate field when the status is sent in to hobbit. I would imagine that this could be added to the first line of the report status, i.e.

    bin/bb 127.0.0.1 "status server1.disk red (red:/mnt/Vol1:/mnt/Vol2 yellow:/mnt/vol3) <rest of disk report>"
So in the above example there are 2 volumes with a red status & one with a yellow. When the next status report comes in with (red:/mnt/Vol1 yellow:/mnt/vol3), hobbit would be able to determine that the report had a state change, even though the disk report would still have a red status. If reports do not provide this extra 'alertstate' field, it really shouldn't break anything. Hobbit would just behave as it does presently.

Also a new alert parameter could be added, UPDATES. So people that want to receive emails whenever a report's alertstate changes can. And people that just want alerts when reports have an alert status or recover still can. The update alert emails can be as simple as "server1's disk alert status has changed.", or can be complicated/informative: "server1's disk /mnt/Vol2 alert status has cleared, but there are still disks that have met alert thresholds."

Something else to consider is how this would affect acknowledgments. When acknowledging reports, I think a new option would be needed: Ack for the alert status, or Ack for the present alertstate. It all depends on how you want to implement it.
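To make the alertstate comparison concrete, here is a minimal Python sketch that parses the proposed field from a status line and reports what changed between two reports. The field format is only the one suggested above, and `parse_alertstate` and `diff_alertstate` are made-up names, not part of Hobbit. It also assumes item names contain no ':' or whitespace, which would not hold for something like a Windows drive letter.

```python
import re

# Matches "status <host.test> <color> (<alertstate>)" at the start of a message.
ALERTSTATE_RE = re.compile(r"status\s+\S+\s+\S+\s+\(([^)]*)\)")


def parse_alertstate(status_line):
    """Return {item: color} from the proposed alertstate field, or {} if absent."""
    match = ALERTSTATE_RE.match(status_line)
    if not match:
        return {}  # no alertstate supplied: fall back to color-only behaviour
    state = {}
    for token in match.group(1).split():
        color, *items = token.split(":")
        for item in items:
            state[item] = color
    return state


def diff_alertstate(before, after):
    """Return (added, cleared, changed) between two {item: color} mappings."""
    added = {i: after[i] for i in after.keys() - before.keys()}
    cleared = {i: before[i] for i in before.keys() - after.keys()}
    changed = {
        i: (before[i], after[i])
        for i in before.keys() & after.keys()
        if before[i] != after[i]
    }
    return added, cleared, changed


if __name__ == "__main__":
    old = parse_alertstate(
        "status server1.disk red (red:/mnt/Vol1:/mnt/Vol2 yellow:/mnt/vol3) ..."
    )
    new = parse_alertstate("status server1.disk red (red:/mnt/Vol1 yellow:/mnt/vol3) ...")
    print(diff_alertstate(old, new))
```

An UPDATES alert would fire whenever any of the three results is non-empty, even though the overall color is still red.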
Sorry for the very long-winded email, just trying to do a braindump of my thoughts.
~Steve
On Wednesday 02 May 2007 17:24, Kruse, Jason K. wrote:
Actually, you just indirectly mentioned something that feels like a fairly elegant solution. What would be nice in this particular case would be to be able to attach a service label to the PROCS tests for groups of processes. The service could then be monitored without custom tests being created for each one. New columns can be created from the service tag without really cluttering the lines.
I'll have to think about how the log files are processed to see if something like that works or not.
Jason
From: Dan Vande More [mailto:bigdan at gmail.com]
Sent: Wed 5/2/2007 4:09 PM
To: hobbit at hswn.dk
Subject: Re: [hobbit] Thoughts
Indeed, it seems to me that the whole group concept is a good way to work with us humans but breaks down wildly when dealing with computers. This is fine because most of us use the groups to save space on the screens, and configuration in the conf files.
If you want tests for each process and ultimately different behaviours for each process, you need to be prepared to do the work and make the tests for each process.
Please don't overcomplicate hobbit for this - it's a corner case and will ultimately make the program more unwieldy.
On 5/2/07, Henrik Stoerner <henrik at hswn.dk> wrote:
On Wed, May 02, 2007 at 02:06:34PM -0500, Kruse, Jason K. wrote:
Grouped items, such as the process check and log monitors, are issues. A single process down causes the whole check to go red. A process listed as alerting only operators can then mask another process on the same system from notifying the DBAs. Setting the alert repeat interval to 0 shows the other problem: a recovery message is not generated for each process that recovers, only when the whole group of processes recovers.
This will be difficult to handle - it's a very basic thing in the Hobbit design that it only tracks the color of each status, not the details of which rule (out of many) causes e.g. the "procs" column to go red.
To do that, you would need to associate some "event ID" with each of the settings that can cause a red/yellow status; e.g. you'd have

    HOST=myhost
        PROC tnslistener 1 ID=100
        PROC httpd 4 ID=200

The "procs" status would then store the set of ID's that had been triggered for a status, and whenever there was a change in the set of triggered rules it would pass this information to some process.
It can be done, but I am not particularly happy with it; it seems a bit too complex for my taste. If anyone has a better idea, please speak up.
(And just in case you wonder why I've used a new "event ID" instead of re-using the existing "group" definition: I can easily imagine a scenario where you have e.g. multiple processes monitored with alerts going to one group of people (i.e. several PROC rules have the same GROUP setting), but you still want to track exactly which processes are up or down - and then you need a unique ID for each PROC rule).
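A minimal sketch of the event-ID bookkeeping described above, in Python and only as an illustration: the `StatusTracker` class and the event tuples are hypothetical, and the real Hobbit daemon stores nothing like this today.

```python
class StatusTracker:
    """Remembers which rule IDs were triggered in the last "procs" status."""

    def __init__(self):
        self.triggered = set()

    def update(self, failing_ids):
        """Record the new set of failing rule IDs and return per-rule events."""
        failing = set(failing_ids)
        events = [("triggered", rule_id) for rule_id in sorted(failing - self.triggered)]
        events += [("recovered", rule_id) for rule_id in sorted(self.triggered - failing)]
        self.triggered = failing
        return events


if __name__ == "__main__":
    procs = StatusTracker()
    # Both tnslistener (ID=100) and httpd (ID=200) are below their required counts.
    print(procs.update({100, 200}))
    # httpd comes back; the column stays red because tnslistener is still down,
    # but a per-rule recovery event is now available for ID=200.
    print(procs.update({100}))
```

Because the events are keyed by ID rather than by GROUP, two PROC rules that alert the same people can still be tracked separately.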
Regards, Henrik