the all or nothing nature of hobbit
I love hobbit and have been using it (and BB) for many years, so take this as constructive criticism.
One of my biggest headaches with BB (and now hobbit) has been the all-or-nothing nature of alerts. By this I mean that if your main network link is down, everything goes red for network status.
Something happened on my monitoring box (probably DNS) that caused a cascade of http errors. http was not truly down on all N of these hosts on various networks; it was the network test that was failing on the monitoring box.
I'm unaware of a solution to this issue, and I'm considering moving to another product because of it. Are there any solutions, either existing or planned?
Lastly, who is maintaining the debian package for hobbit? Both the server and client packages still have the same bugs I reported months ago.
Thanks.
Dan
Try leveraging the "depends" functionality given in bb-hosts. Correctly implemented, it should account for most cases of multiple errors with the tests:
http://www.hswn.dk/hobbit/help/manpages/man5/bb-hosts.5.html
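As a rough illustration of the idea behind "depends" (not the bb-hosts syntax itself, which the manpage above documents), here is a minimal Python sketch. It assumes that a failed test whose prerequisite has also failed should be reported as "clear" rather than red. The function name, the host names and the status dictionary are made up for the example.

```python
def apply_depends(status, depends):
    """Return a copy of status where a red test with a red prerequisite becomes 'clear'.

    status:  {(host, test): color}
    depends: {(host, test): [(host, test), ...]} prerequisites per test
    """
    # Decide from the original colors so evaluation order does not matter.
    result = dict(status)
    for key, prereqs in depends.items():
        if status.get(key) == "red" and any(status.get(p) == "red" for p in prereqs):
            result[key] = "clear"
    return result


# Hypothetical hosts: one router and two web servers behind it.
status = {
    ("router1", "conn"): "red",
    ("web1", "http"): "red",
    ("web2", "http"): "green",
}
depends = {
    ("web1", "http"): [("router1", "conn")],
    ("web2", "http"): [("router1", "conn")],
}
adjusted = apply_depends(status, depends)
print(adjusted)
```

Only the router stays red here, so a single alert points at the likely root cause.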
I am unsure as to who maintains the debian pkg. I wish it could make it mainline though...
Dan
On Thu, Dec 07, 2006 at 11:59:30AM -0800, Dan Simoes wrote:
I love hobbit and have been using it (and BB) for many years, so take this as constructive criticism.
One of my biggest headaches with BB (and now hobbit) has been the all-or-nothing nature of alerts. By this I mean that if your main network link is down, everything goes red for network status.
Something happened on my monitoring box (probably DNS) that caused a cascade of http errors. http was not truly down on all N of these hosts on various networks; it was the network test that was failing on the monitoring box.
It's a valid point - but it is also very, very difficult to handle. Not so much because it is difficult to suppress alerts; the $1bn question is how to decide when to suppress an alert, and which issue is the root cause of all the problems we're seeing.
Heck, sometimes it can be difficult even for intelligent humans to figure out what is really going on ...
I think what this really boils down to is some form of event correlation mechanism, on top of which you then apply some heuristics (that's a fancy word for "guessing") to decide what is the core issue. E.g. if we have 200 tests reporting a failure because of a DNS lookup that timed out, then we probably have an issue with the DNS server we used. But it could also be a firewall mis-configuration that blocks our outbound DNS queries, or an IP address conflict that causes our DNS lookups to go to a server which doesn't handle DNS - it is really hard for any machine to figure that out by itself.
The current implementation is not ideal, I'll be the first to admit that. Any ideas for improving it are welcome, but please consider the possibilities for the system making wrong decisions. I'd rather send out one alert too many than one too few.
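To make the "200 tests failing on a DNS timeout" example concrete, here is a minimal Python sketch of that kind of heuristic. The failure records, the "cause" field, the thresholds and the function name are all assumptions made for illustration, not anything Hobbit provides. In keeping with the preference for one alert too many, it only annotates failures with a suspected common cause and never drops any of them.

```python
from collections import Counter


def annotate_common_cause(failures, threshold=0.8, min_failures=5):
    """Tag failures with a suspected shared cause; never suppress any of them.

    failures: list of dicts like {"host": ..., "test": ..., "cause": ...}
    """
    if len(failures) < min_failures:
        return [dict(f, suspected_root=None) for f in failures]
    cause, count = Counter(f["cause"] for f in failures).most_common(1)[0]
    # Only guess when one cause clearly dominates the current failures.
    suspected = cause if count / len(failures) >= threshold else None
    return [
        dict(f, suspected_root=suspected if f["cause"] == suspected else None)
        for f in failures
    ]


# Hypothetical data: nine DNS timeouts and one unrelated failure.
failures = [{"host": f"web{i}", "test": "http", "cause": "dns-timeout"} for i in range(9)]
failures.append({"host": "db1", "test": "conn", "cause": "no-reply"})
annotated = annotate_common_cause(failures)
print(sum(1 for f in annotated if f["suspected_root"] == "dns-timeout"), "of", len(annotated))
```

The guess is still only a guess: a dominant "dns-timeout" cause says nothing about whether the DNS server, a firewall or an address conflict is to blame.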
I'm unaware of a solution to this issue, and I'm considering moving to another product because of it.
If you know of any products that are really good at handling this, I'd be interested to hear about them.
Lastly, who is maintaining the debian package for hobbit? Both the server and client packages still have the same bugs I reported months ago.
Since there haven't been any Hobbit releases since August, that really shouldn't come as a surprise.
Regards, Henrik
On Thursday 07 December 2006 23:05, Henrik Stoerner wrote:
I think what this really boils down to is some form of event correlation mechanism,
Event correlation seems to be the current buzzword from all the monitoring tool vendors whose presentations I have seen recently ...
on top of which you then apply some heuristics (that's a fancy word for "guessing") to decide what is the core issue. E.g. if we have 200 tests reporting a failure because of a DNS lookup that timed out, then we probably have an issue with the DNS server we used. But it could also be a firewall mis-configuration that blocks our outbound DNS queries, or an IP address conflict that causes our DNS lookups to go to a server which doesn't handle DNS - it is really hard for any machine to figure that out by itself.
The current implementation is not ideal, I'll be the first to admit that. Any ideas for improving it are welcome, but please consider the possibilities for the system making wrong decisions. I'd rather send out one alert too many than one too few.
I'm unaware of a solution to this issue, and I'm considering moving to another product because of it.
If you know of any products that are really good at handling this, I'd be interested to hear about them.
I can list some (proprietary ones) that are punting this, but I've never seen them in action.
Regards, Buchan
-- Buchan Milne ISP Systems Specialist - Monitoring/Authentication Team Leader B.Eng,RHCE(803004789010797),LPIC-2(LPI000074592)
On 12/7/06, Henrik Stoerner <henrik at hswn.dk> wrote:
I think what this really boils down to is some form of event correlation mechanism, on top of which you then apply some heuristics (that's a fancy word for "guessing") to decide what is the core issue. E.g. if we have 200 tests reporting a failure because of a DNS lookup that timed out, then we probably have an issue with the DNS server we used. But it could also be a firewall mis-configuration that blocks our outbound DNS queries, or an IP address conflict that causes our DNS lookups to go to a server which doesn't handle DNS - it is really hard for any machine to figure that out by itself.
What I had in mind was more of a baseline check, before proceeding to the other tests. Can't resolve DNS? All other tests which depend on DNS are skipped. Can't ping your default router? Don't bother with any extra network tests.
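A minimal Python sketch of that baseline idea follows. The check names, the list of required baselines per test and the stand-in check functions are hypothetical; real checks would do an actual DNS lookup or ping.

```python
def run_with_baseline(baseline_checks, tests):
    """Run baseline checks first, then skip any test whose prerequisite failed.

    baseline_checks: {name: callable returning True when healthy}
    tests: list of (test_name, required_baseline_names, callable)
    """
    healthy = {name: check() for name, check in baseline_checks.items()}
    results = {}
    for name, requires, test in tests:
        broken = [r for r in requires if not healthy[r]]
        if broken:
            results[name] = "skipped: " + ", ".join(broken) + " failed"
        else:
            results[name] = "ok" if test() else "failed"
    return results


# Stand-in checks for illustration only: DNS is "down", the gateway is "up".
baseline = {"dns": lambda: False, "gateway": lambda: True}
tests = [
    ("http www.example.com", ["dns", "gateway"], lambda: True),
    ("ping 192.0.2.10", ["gateway"], lambda: True),
]
results = run_with_baseline(baseline, tests)
for name, outcome in results.items():
    print(name, "->", outcome)
```

With DNS failing, the http test is reported as skipped rather than red, while the ping by IP address still runs.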
I'm unaware of a solution to this issue, and I'm considering moving to another product because of it.
If you know of any products that are really good at handling this, I'd be interested to hear about them.
I can't think of any in particular. I used Unicenter way back when and don't recall this issue, but it's been a while. And I've only taken a cursory look at Nagios.
Lastly, who is maintaining the debian package for hobbit? Both the server and client packages still have the same bugs I reported months ago.
Since there haven't been any Hobbit releases since August, that really shouldn't come as a surprise.
True, but these are bugs I pointed out before the release candidate. In particular, the client postconf script munges the /etc/default/hobbit-client file, which then needs to be edited by hand before hobbit will run. I'd be happy to provide feedback to whoever is maintaining the package (is that you, Henrik?)
Thanks.
I think what this really boils down to is some form of event correlation mechanism, on top of which you then apply some heuristics (that's a fancy word for "guessing") to decide what is the core issue.
If you know of any products that are really good at handling this, I'd be interested to hear about them.
Heuristics are poppycock in the datacenter. Humans are so ridiculously good at correlating events that trying to train a computer to guess is wasted effort. Now, from an intellectual or research point of view that may not be the case, but I am pragmatic in the datacenter: useful, not interesting.
My thought to "solve" this problem is the idea of "scenario fingerprinting." As I mentioned, trying to teach a computer to learn is futile, but instructing a computer to look for *known* conditions works perfectly. Criminals and problems have a tendency to repeat themselves.
So, rather than deal with "event correlation", I think a better approach would be an engine that could do state analysis with many rules for a single scenario. Perhaps it's semantics, but "event correlation" to me implies events over time, and I don't think you need the time parameter, only the view of the environment at an instant, the fingerprint. If the scenario is recognized, then "react" by disabling and alerting appropriately.
For example, say you lose a router in Europe and all the pings die across the pond (I am in North America). Generate a "scenario alert" that describes the scenario and disable all the routers/hosts over there. Odds are if that router went down once, it will go down again.
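A minimal Python sketch of such a fingerprint match, using the router example, might look like this. The rule format, the host names and the match_scenarios helper are hypothetical; the point is only that each rule looks at a single snapshot of current state, with no time dimension.

```python
def match_scenarios(snapshot, scenarios):
    """Return the scenarios whose fingerprint matches the current state snapshot.

    snapshot:  {(host, test): color}
    scenarios: list of dicts with "name", "fingerprint" (required colors)
               and "disable" (hosts to silence when the scenario is recognized)
    """
    return [
        s for s in scenarios
        if all(snapshot.get(key) == color for key, color in s["fingerprint"].items())
    ]


# One hand-written rule describing a known, previously seen failure.
scenarios = [{
    "name": "European router down",
    "fingerprint": {("eu-router", "conn"): "red", ("eu-web1", "conn"): "red"},
    "disable": ["eu-web1", "eu-web2"],
}]
snapshot = {
    ("eu-router", "conn"): "red",
    ("eu-web1", "conn"): "red",
    ("eu-web2", "conn"): "red",
    ("us-web1", "conn"): "green",
}
matched = match_scenarios(snapshot, scenarios)
for s in matched:
    print("scenario alert:", s["name"], "- disabling", ", ".join(s["disable"]))
```

The human writes the rule once after diagnosing the outage; the machine only has to recognize the same picture the next time it appears.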
You leverage the ability of a human to correlate with the computer's ability to "keep on the look-out" for "known offenders." I think this methodology could also be applied to the RRD system stats.
Let the machines do what they are good at, following instructions, and let the humans do what they are good at, thinking.
Scott
participants (5)
- bgmilne@staff.telkomsa.net
- bigdan@gmail.com
- dan.simoes@gmail.com
- henrik@hswn.dk
- scott@PacketPushers.com