Is there a [non-messy] way to set a DURATION rule for a specific host alert? Basically, what I'm thinking of is something like this:
In hobbit-clients.cfg:
    HOST=myhost LOAD 20 30 DURATION>5m
The effect being, the status of the "myhost" cpu alert will only change to yellow/red if the load is above the appropriate threshold for more than 5 minutes.
There are a few hosts that occasionally spike above the cpu load thresholds, but only for a few minutes (usually around 5 min at most), and then recover on their own. However, I don't want to raise the thresholds, because a sustained load (more than 10 minutes) at this level _is_ actually a critical event. It's just not critical if it is only a momentary spike.
My specific example is with cpu load, but it could be for other things too, such as process counts, memory, or even in some situations, disk space.
Why would you not want the status to change? Such a history log is great for troubleshooting.
If you don't want to be notified about it, just use this in hobbit-alerts.cfg:
Page=x IGNORE HOST=foo SERVICE=cpu COLOR=red DURATION<5m
If you don't want it to change the status color on the parent pages, then use NOPROPYELLOW:cpu in the bb-hosts file.
If you REALLY don't want it to change status, increase the LOAD numbers in the hobbit-clients.cfg file.
-Dan
Gary Baluha wrote:
> Is there a [non-messy] way to set a DURATION rule for a specific host alert? [...]
On 6/22/07, Daniel Bourque <dbourque at weatherdata.com> wrote:
> Why would you not want the status to change? Such a history log is great for troubleshooting.
I wouldn't want the status to change, because I'm essentially making it a two-part threshold; one part based on the hard-and-true numeric value, and another threshold based on the length of time.
> If you don't want to be notified about it, just use this in hobbit-alerts.cfg:
> Page=x IGNORE HOST=foo SERVICE=cpu COLOR=red DURATION<5m
Ahh, that's the sort of hobbit-alerts rule that would work for me, at least until (if?) there is a way to do what I'm looking for in hobbit-clients.cfg.
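For illustration only, here is a rough Python sketch of what such an alert-side rule amounts to: the status still turns red right away, but the notification is held back until the red state has lasted at least the DURATION value. This is not Hobbit code; the function name `should_notify` and its arguments are made up for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_notify(color: str,
                  red_since: Optional[datetime],
                  now: datetime,
                  min_duration: timedelta = timedelta(minutes=5)) -> bool:
    """Return True only when a red status has lasted at least min_duration.

    Hypothetical helper: mirrors the idea of ignoring red events
    shorter than 5 minutes, nothing more.
    """
    if color != "red" or red_since is None:
        return False  # nothing to alert on
    return now - red_since >= min_duration


# Example: a red status that began at 12:00
start = datetime(2007, 6, 22, 12, 0)
early = should_notify("red", start, start + timedelta(minutes=3))  # still inside the ignore window
late = should_notify("red", start, start + timedelta(minutes=6))   # past the ignore window
```

The status history is untouched in this approach; only the page is delayed.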
> If you don't want it to change the status color on the parent pages, then use NOPROPYELLOW:cpu in the bb-hosts file.
> If you REALLY don't want it to change status, increase the LOAD numbers in the hobbit-clients.cfg file.
The problem is that it is only a problem if the load is _sustained_ for more than 10 minutes or so. If I set the red threshold to Y, and the load momentarily spikes to Y+1, it isn't a problem. But if I raise the threshold to Y+2 and now I get a sustained load of Y+1, it would be a problem since I wouldn't get alerted.
Essentially, I'm looking for a sort of time-based hysteretic monitoring.
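To make the idea concrete, here is a minimal Python sketch of the behaviour I have in mind. It is not anything Hobbit provides; the function `gated_colors` and its parameters are invented for the example, and it assumes one load sample per minute. A color is only reported once the load has stayed above that color's threshold for more than `hold` minutes.

```python
from typing import Iterable, List, Optional, Tuple


def gated_colors(samples: Iterable[Tuple[float, float]],
                 warn: float, crit: float, hold: float) -> List[str]:
    """samples are (minute, load) pairs in time order; returns one color per sample.

    Hypothetical sketch: a threshold only counts once the load has been
    above it continuously for more than `hold` minutes.
    """
    colors: List[str] = []
    warn_since: Optional[float] = None  # minute the load first exceeded warn
    crit_since: Optional[float] = None  # minute the load first exceeded crit
    for minute, load in samples:
        # Start or reset the timers for each threshold.
        if load > warn:
            warn_since = minute if warn_since is None else warn_since
        else:
            warn_since = None
        if load > crit:
            crit_since = minute if crit_since is None else crit_since
        else:
            crit_since = None
        # Report the most severe color whose timer has run long enough.
        if crit_since is not None and minute - crit_since > hold:
            colors.append("red")
        elif warn_since is not None and minute - warn_since > hold:
            colors.append("yellow")
        else:
            colors.append("green")
    return colors


# Thresholds from the LOAD 20 30 example, with a 5-minute hold.
spike = [(m, 31.0 if m < 4 else 10.0) for m in range(13)]  # 4-minute spike
sustained = [(m, 31.0) for m in range(13)]                 # load stays at 31

spike_colors = gated_colors(spike, 20, 30, 5)
sustained_colors = gated_colors(sustained, 20, 30, 5)
```

With these inputs the short spike never leaves green, while the sustained load turns red once it has been above 30 for more than 5 minutes. Raising the red threshold to 32 instead would keep the sustained load of 31 from ever going red, which is exactly the case I want to avoid.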
participants (2)
- dbourque@weatherdata.com
- gumby3203@gmail.com