I think what this really boils down to is some form of event correlation mechanism, on top of which you then apply some heuristics (that's a fancy word for "guessing") to decide what is the core issue.
| If you know of any products that are really good at handling this, I'd
| be interested to hear about them.
Heuristics are poppycock in the datacenter. Humans are so ridiculously good at correlating events that trying to train a computer to guess is wasted effort. From an intellectual or research point of view that may not be the case, but I am pragmatic in the datacenter: useful, not interesting.
My idea for "solving" this problem is "scenario fingerprinting." As I mentioned, trying to teach a computer to learn is futile, but instructing a computer to look for *known* conditions works perfectly. Criminals and problems have a tendency to repeat themselves.
So, rather than deal with "event correlation", I think a better approach would be an engine that could do state analysis with many rules for a single scenario. Perhaps it's semantics, but "event correlation" to me implies events over time, and I don't think you need the time parameter, only the view of the environment at an instant, the fingerprint. If the scenario is recognized, then "react" by disabling and alerting appropriately.
For example, say you lose a router in Europe and all the pings across the pond die (I am in North America). Generate a "scenario alert" that describes the scenario, and disable the checks for all the routers/hosts over there. Odds are if that router went down once, it will go down again.
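To make the router example concrete, here is a rough sketch of what I mean, in Python. Everything in it is made up for illustration: the host names, the `Scenario` class, and the `is_down` and `evaluate` helpers are not from any existing product. A scenario is just a named list of rules checked against one snapshot of the environment, with no time parameter. If every rule holds, you get one scenario alert and the per-host alerts are suppressed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A snapshot is the state of the environment at one instant:
# host name -> "up" or "down". (Hypothetical representation.)
Snapshot = Dict[str, str]
Rule = Callable[[Snapshot], bool]


@dataclass
class Scenario:
    name: str
    rules: List[Rule]    # every rule must hold for the fingerprint to match
    suppress: List[str]  # hosts whose individual alerts get disabled

    def matches(self, snapshot: Snapshot) -> bool:
        return all(rule(snapshot) for rule in self.rules)


def is_down(host: str) -> Rule:
    """Build a rule that holds when the given host is reported down."""
    return lambda snapshot: snapshot.get(host) == "down"


def evaluate(scenarios: List[Scenario], snapshot: Snapshot) -> List[str]:
    """Return one alert per recognized scenario, plus per-host alerts
    for any down host that no recognized scenario accounts for."""
    alerts: List[str] = []
    suppressed = set()
    for scenario in scenarios:
        if scenario.matches(snapshot):
            alerts.append(f"SCENARIO: {scenario.name}")
            suppressed.update(scenario.suppress)
    for host, state in sorted(snapshot.items()):
        if state == "down" and host not in suppressed:
            alerts.append(f"DOWN: {host}")
    return alerts


# Hypothetical hosts on the far side of the pond.
EU_HOSTS = ["eu-router", "eu-web1", "eu-db1"]

eu_outage = Scenario(
    name="European router lost",
    rules=[is_down(host) for host in EU_HOSTS],
    suppress=EU_HOSTS,
)

snapshot = {
    "eu-router": "down",
    "eu-web1": "down",
    "eu-db1": "down",
    "na-web1": "up",
}
print(evaluate([eu_outage], snapshot))
```

The human writes the fingerprint once, after having correlated the failure by hand; the machine only checks whether the current snapshot matches it. If just one European host is down, the fingerprint does not match and you get the ordinary per-host alert.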
You combine a human's ability to correlate with the computer's ability to "keep on the look-out" for "known offenders." I think this methodology could also be applied to the RRD system stats.
Let the machines do what they are good at, following instructions, and let the humans do what they are good at, thinking.
Scott