On Mon, Feb 28, 2005 at 01:28:18PM -0500, Tom Georgoulias wrote:
While we're on the topic of purple status messages... Hobbit is configured to turn a host purple if it hasn't heard from it in 30 mins. I want mine to go purple after 15, so I changed the PURPLEDELAY from "30" to "15" in hobbitserver.cfg, but that doesn't seem to make a difference. What else needs to be changed?
The program that generates the status message also determines how long that status is valid. So this is something you set on each BB client or extension script. You actually cannot set it anywhere for the network tests performed by bbtest-net (I just checked and was a bit surprised that I had not provided some way of changing this).
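To illustrate what "set on each client or extension script" could look like, here is a minimal sketch in Python of an extension script composing its own status message with a 15-minute validity. It assumes your server version accepts the "status+LIFETIME" form of the status command, with the lifetime given in minutes; check the bb man page shipped with your version before relying on it. The function name build_status_message and the host www.example.com are made up for the example.

```python
def build_status_message(hostname, testname, color, text, lifetime_minutes=None):
    """Compose a BB/Hobbit-style status message (illustrative helper, not part of Hobbit)."""
    # BB convention: dots in the hostname are written as commas, because
    # the dot separates the hostname from the test name.
    host = hostname.replace(".", ",")
    # Assumption: an optional "+LIFETIME" suffix (in minutes) tells the server
    # how long this status stays valid before it goes purple.
    command = "status" if lifetime_minutes is None else f"status+{lifetime_minutes}"
    return f"{command} {host}.{testname} {color} {text}"


message = build_status_message(
    "www.example.com", "disk", "green", "All filesystems OK", lifetime_minutes=15
)
print(message)  # status+15 www,example,com.disk green All filesystems OK
```

The resulting string would then be handed to whatever the script normally uses to send its report to the server.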
I think I found a loophole that may cause problems in certain circumstances: Say I get a red alert for something, give an estimate of 120 mins to fix it, and the host goes purple 45 mins later (i.e. it crashes), before the ack clears. That ack stays in the red state and I won't get a page for the red -> purple transition until after the 120 mins have passed and paging resumes (presumably because the ack wasn't cleared, since the status never went green before going purple). This could be bad news if I have a system that crashes when the support tech is busy with other things, or if a system is brought back online after a purple status and returns to something non-green (e.g. disk is the only thing that is monitored on the system, and it immediately goes to red after boot up and stays that way for a while).
There are lots of ways you can outsmart the system. And you needn't have a purple status in-between:
- Disk fills up and goes red
- Clueless admin acks the disk alert for 60 minutes, then reboots the server because that "usually fixes things"
- Disk stays red and no alerts go out until an hour has passed
In such cases there is little Hobbit can do. When you ack an alert, you take over the responsibility for that status for the time the ack is valid. If you "fix" something without checking that it actually did solve the problem, you're asking for trouble.
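The behaviour described above can be sketched as a toy model in Python. This is not Hobbit's actual alert code; the class StatusState, the set of alerting colors, and the rule that only a return to green clears an ack are assumptions made to mirror the two scenarios in this thread. Times are minutes since the first alert.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: these are the colors that would normally trigger a page.
ALERT_COLORS = {"red", "yellow", "purple"}


@dataclass
class StatusState:
    """Toy model of one status column and its acknowledgement (hypothetical)."""
    color: str = "green"
    ack_until: Optional[int] = None  # minute at which the ack expires

    def update(self, color: str) -> None:
        self.color = color
        # Assumption: only a recovery to green clears a pending ack.
        if color == "green":
            self.ack_until = None

    def acknowledge(self, now: int, minutes: int) -> None:
        self.ack_until = now + minutes

    def should_page(self, now: int) -> bool:
        if self.color not in ALERT_COLORS:
            return False
        # Pages are suppressed for as long as the ack is still valid.
        return self.ack_until is None or now >= self.ack_until


# Scenario 1: disk goes red, is acked for 60 minutes, server is rebooted.
disk = StatusState()
disk.update("red")
paged_before_ack = disk.should_page(now=0)
disk.acknowledge(now=5, minutes=60)
disk.update("red")  # still red after the reboot
paged_during_ack = disk.should_page(now=30)
paged_after_ack = disk.should_page(now=65)

# Scenario 2: red is acked for 120 minutes, host goes purple after 45.
host = StatusState()
host.update("red")
host.acknowledge(now=0, minutes=120)
host.update("purple")
purple_paged = host.should_page(now=45)

print(paged_before_ack, paged_during_ack, paged_after_ack, purple_paged)
```

In this model neither the reboot nor the red -> purple transition touches the ack, so nothing is paged until the acknowledged period runs out.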
If you really want it, it's not a big problem to implement a "de-acknowledge" function. It might even be worthwhile for reporting purposes, to keep track of how much time your admins spend on troubleshooting. I'm open to suggestions.
Regards, Henrik