In message <F098822C-19A2-42CC-B6BA-2AB4E71D18BA at PacketPushers.com>, Scott Walters writes:
}>
}> We run pretty much all of our big brother tests every minute. On
}> our new hobbit servers, we're running them at the default intervals.
}>
}> BB shows us that our primary name server is going out for less than
}> a minute, about every 62 minutes.
}> Hobbit is missing most of those
}> outages, although the longer "xxxx events received in the last xxx
}> minutes" is what helped us spot the problem, as a whole bunch of
}> machines' services don't respond well when our primary name server
}> is out, and having a mass of servers go yellow then green, in
}> unison, is sort of eye catching.
}
}So hobbit with the xxx events (running every 5m) did provide enough
}information to indicate an intermittent problem with DNS?
Hobbit's non-green page, with the last xxx events, gave us a large enough view that we could see all the machines' services going yellow at the same time DNS went red. We're monitoring a bit over 260 machines with a whole lot of different services, so there's often something going red or yellow. With BB's older default of the last 25 events, there was never enough on screen to notice a group of swings to yellow, then back to green.
}Things running every 5m will collide with a problem that happens for
}a minute frequently enough to 'show up on the radar'
Sure, but with Hobbit we'd see up to 13 hours between DNS 'red' events, while BB would catch several in that same period.
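As a rough illustration of that sampling effect, here is a minimal Python sketch that counts how many short, periodic outages a fixed-interval poller lands on in one day. The function name, the 45 second outage length, the 10 second poll offset, and the perfectly regular 62 minute cycle are all assumptions made for the illustration, not measured values.

```python
def outages_seen(poll_interval, poll_offset=10, outage_period=62 * 60,
                 outage_length=45, horizon=24 * 60 * 60):
    """Count distinct outages hit by at least one poll (all times in seconds).

    Assumed model: an outage starts every `outage_period` seconds and
    lasts `outage_length` seconds; polls run at poll_offset,
    poll_offset + poll_interval, and so on, up to `horizon`.
    """
    seen = set()
    t = poll_offset
    while t < horizon:
        # Which outage cycle are we in, and how far into that cycle?
        cycle, position = divmod(t, outage_period)
        if position < outage_length:
            seen.add(cycle)
        t += poll_interval
    return len(seen)


print("1 minute polls:", outages_seen(60))
print("5 minute polls:", outages_seen(300))
```

In this idealized model the 1 minute poll sees all 24 outages in a day, while the 5 minute poll sees only 5, roughly one every 5 hours 10 minutes. Real outages are less regular than this, so the actual gaps between detections can be longer or shorter.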
I haven't changed Hobbit to 1 minute checks yet. I even stated explicitly that I wasn't planning to shorten the checks to 1 minute when we officially switched over, and that was agreed to. However, since the 1 minute checks did actually make a difference in tracking down and solving the DNS problem, I may yet have to work on that change. We'll see what kind of feedback I get after today. Even then, the only checks I'd really be willing to run that frequently are the remote checks, over the network.
}But every site has different requirements. It's just been my
}experience that sampling more frequently than 5m hits the knee-bend
}of diminishing returns. It also increases the potential for state
}changes, which chews up the filesystem with the history info.
I thought it was unnecessary when I originally brought BB into production years ago, but it was one of the requirements I ended up with in order to sell the switch to BB. Some things can't be checked every minute; I have RAID checks that can take more than a minute to run.
Tracy J. Di Marco White
Information Technology Services
Iowa State University