[hobbit] only alert if X number of hosts are already in error
> My best suggestion would be to use the bbcombotest tool to define a pseudo "host" with the combined status of your host "pool".
> E.g. if you're monitoring http on 5 hosts, you could define a combination test like this:
> Pool1.http=(hostA.http+hostB.http+hostC.http+hostD.http+hostE.http)>3
> That would give you a red alert if 3 or fewer hosts in the pool were green. And you could then trigger an alert based on that test result.
Pretty unwieldy when you have large pools of servers, however.
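For large pools, the definition line could be generated rather than typed by hand. The following is a minimal Python sketch, not part of Hobbit: the function names are invented for illustration, and pool_color assumes that only a green status counts as 1 in the sum, as described in the suggestion above (the real bbcombotest may treat other colors differently).

```python
def combo_line(pool, test, hosts, threshold):
    """Build a combination-test definition in the style shown above.

    The pseudo host `pool` is green when MORE than `threshold`
    members are green.
    """
    terms = "+".join(f"{host}.{test}" for host in hosts)
    return f"{pool}.{test}=({terms})>{threshold}"


def pool_color(statuses, threshold):
    """Evaluate the same rule against a mapping of host -> color.

    Assumption: only 'green' contributes 1 to the sum.
    """
    green = sum(1 for color in statuses.values() if color == "green")
    return "green" if green > threshold else "red"


if __name__ == "__main__":
    hosts = ["hostA", "hostB", "hostC", "hostD", "hostE"]
    print(combo_line("Pool1", "http", hosts, 3))
```

With five hosts and a threshold of 3, the generated line matches the hand-written example, and the pool turns red as soon as two hosts are no longer green.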
I just started writing a smart paging script which will keep track of downed hosts and decide whether or not to page.
One question I have so far is: Does hobbit wait for an alerting script to return before continuing to evaluate other rules?
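A paging script of this kind might look roughly like the Python sketch below. It is only an illustration under stated assumptions: the state file location, the function name and the "page once when the count crosses the threshold" policy are all invented here, and how the script learns the host name and up/down state from Hobbit is left out. No file locking is attempted, so it is only safe if invocations do not overlap.

```python
import json
import os


def update_and_decide(state_path, host, is_down, threshold):
    """Record the host's state and return True if a page should be sent.

    The set of downed hosts is kept in a small JSON file (hypothetical
    location chosen by the caller). A page is sent only at the moment
    the number of downed hosts rises to `threshold`.
    """
    try:
        with open(state_path) as fh:
            down = set(json.load(fh))
    except FileNotFoundError:
        down = set()

    before = len(down)
    if is_down:
        down.add(host)
    else:
        down.discard(host)

    # Write to a temporary file first so a crash cannot leave a
    # half-written state file behind.
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(sorted(down), fh)
    os.replace(tmp_path, state_path)

    # Page only on the upward crossing, not on every later alert.
    return before < threshold <= len(down)
```

With a threshold of 2, the first downed host does not page, the second does, and further failures stay quiet until the count has dropped below the threshold and risen again.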
-- Bruce Z. Lysik <blysik at shutterfly.com> Operations Engineer
On Thu, Jun 16, 2005 at 02:28:53PM -0700, Bruce Lysik wrote:
> My best suggestion would be to use the bbcombotest tool to define a pseudo "host" with the combined status of your host "pool".
> E.g. if you're monitoring http on 5 hosts, you could define a combination test like this:
> Pool1.http=(hostA.http+hostB.http+hostC.http+hostD.http+hostE.http)>3
> That would give you a red alert if 3 or fewer hosts in the pool were green. And you could then trigger an alert based on that test result.
> Pretty unwieldy when you have large pools of servers, however.
Could be, yes.
> I just started writing a smart paging script which will keep track of downed hosts and decide whether or not to page.
I'm interested to know if this kind of alerting is generally useful. I suspect it might be ... if so, then we should devise a way of defining such alerts directly in Hobbit instead of forcing you to come up with scripts that work around this.
Perhaps one solution could be to implement a new kind of rule for the hobbit-alerts file. Currently all of the rules are matched against a specific host+test combination; we could define a type of rule that could be matched against all of the host+test statuses that are in an alerting stage, and then have the rule trigger based on some criteria for how many matches we get.
Something like
HOST=%(www.*).foo.com TEST=http COLOR=red COUNT>=5 MAIL someone@foo.com
The "COUNT>=5" would then cause this rule to trigger only if there were 5 or more hosts named www.*.foo.com, whose http tests are red. You could even combine this with other criteria, say have a threshold of 5 during the daytime, and 10 during off-hours.
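To make the proposed semantics concrete, here is a rough Python model of how such a counting rule could be evaluated. It is a sketch of the idea only, not Hobbit code: the function names, the (host, test) -> color status table and the daytime hours are assumptions, and the host pattern is written as an ordinary Python regular expression with the dots escaped.

```python
import re


def count_rule_fires(statuses, host_pattern, test, color, min_count):
    """Return True if at least `min_count` hosts match the rule.

    `statuses` maps (host, test) tuples to a color string.
    """
    rx = re.compile(host_pattern)
    matching = [
        host
        for (host, t), c in statuses.items()
        if t == test and c == color and rx.fullmatch(host)
    ]
    return len(matching) >= min_count


def threshold_for(hour, day=5, night=10, day_start=8, day_end=18):
    """Pick the threshold by hour of day (the hours are assumed values)."""
    return day if day_start <= hour < day_end else night


if __name__ == "__main__":
    statuses = {(f"www{i}.foo.com", "http"): "red" for i in range(5)}
    pattern = r"(www.*)\.foo\.com"
    print(count_rule_fires(statuses, pattern, "http", "red", threshold_for(12)))
    print(count_rule_fires(statuses, pattern, "http", "red", threshold_for(2)))
```

In this example five red web servers trigger the rule at noon (threshold 5) but not at 02:00 (threshold 10).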
I can foresee a problem in handling recovery notifications for this kind of alert, but that's something I'll have to think about.
Would that be useful?
> One question I have so far is: Does hobbit wait for an alerting script to return before continuing to evaluate other rules?
Paging scripts are serialized, yes - Hobbit will wait for a paging script to complete before continuing down the list of alert rules.
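As a toy model of what serialized means here (this is not how Hobbit is implemented, and run_serialized is a made-up name), each script is run to completion before the next one starts, so a slow script delays everything after it:

```python
import subprocess
import sys


def run_serialized(commands):
    """Run each command to completion, in order, and collect exit codes."""
    exit_codes = []
    for cmd in commands:
        # subprocess.run blocks until the child process has exited.
        exit_codes.append(subprocess.run(cmd).returncode)
    return exit_codes


if __name__ == "__main__":
    scripts = [
        [sys.executable, "-c", "import time; time.sleep(1)"],
        [sys.executable, "-c", "print('second script runs only now')"],
    ]
    print(run_serialized(scripts))
```

The practical consequence is that a paging script should return quickly, since later rules wait for it.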
Regards, Henrik
On Fri, 2005-06-17 at 08:01 +0200, Henrik Stoerner wrote:
> Something like
> HOST=%(www.*).foo.com TEST=http COLOR=red COUNT>=5 MAIL someone@foo.com
> The "COUNT>=5" would then cause this rule to trigger only if there were 5 or more hosts named www.*.foo.com, whose http tests are red. You could even combine this with other criteria, say have a threshold of 5 during the daytime, and 10 during off-hours.
> I can foresee a problem in handling recovery notifications for this kind of alert, but that's something I'll have to think about.
> Would that be useful?
The main place I would use it would be NTP alerts. If one router loses NTP, I'm not terribly worried. If 10-20 of them all fail at once then I know there is something really bad happening... Maybe both GPS clocks lost sync and all 4 cesium backups failed, or ntp locked up on a core router and I need to make fewer down-stream nodes dependent on that one.
I would also consider using it for purple alerts. I don't want individual purples for most of my stuff, but if there are a lot of them (>100) then I know I killed mrtg and I should page on that.
Daniel J McDonald, CCIE # 2495, CNX Austin Energy
dan.mcdonald at austinenergy.com
participants (3)
- blysik@shutterfly.com
- dan.mcdonald@austinenergy.com
- henrik@hswn.dk