[hobbit] Status change and history snapshots - how to force one?

4 Mar 2009

      "Everett, Vernon" <Vernon.Everett at woodside.com.au> writes:
...
> We are monitoring a system, and the alert went red.
> We acknowledged the alert, and logged a call with our vendor for a replacement component.
> So far, all good.
> However, while waiting for the replacement component, we had another failure, which also should have triggered a red alert.
> (It's a disk array, with many many disks, and many hot spares, so it wasn't a tragic failure)
> The problem, of course, is that since the first alert was acknowledged, nobody spotted the second one until the first disk was replaced and the system remained red.
> (Somebody did the old "hmm, that's odd" and had a look.)
I think the real problem here is that your test flagged a red alarm, even
though the new disk had been ordered and the original problem had therefore
been detected and handled. In my world, "acknowledged" means that the problem
has not been resolved, but someone is working on it (so I don't have to). Once
resolved, the status should go green. So you had a false alarm, and this did
hide your second alarm, as false alarms can do.
So I would suggest changing the test to flag only parts that are broken and
for which no replacement has been ordered yet.
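To make that concrete, here is a rough sketch of the decision such a test
could make. It is only an illustration: the function name, the disk names and
the idea of keeping a list of parts already on order are my assumptions, not
anything Hobbit provides.

```python
def disk_status(failed, replacement_ordered):
    """Return (color, unhandled) for a hypothetical disk-array test.

    failed              -- names of disks currently reported as broken
    replacement_ordered -- names of disks for which a replacement is on order
    Both inputs are assumptions for this sketch; the real test would have to
    obtain them from the array and from wherever orders are tracked.
    """
    # A broken disk only counts if nobody has ordered a replacement yet.
    unhandled = sorted(set(failed) - set(replacement_ordered))
    color = "red" if unhandled else "green"
    return color, unhandled
```

With this, a second failure while the first replacement is still on order
turns the status red again, because the new disk is not in the ordered list.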
It seems what you are asking for is some kind of count for red alarms. Like
the first red alarm is "1 component broken", and the next alarm would then be
"2 components broken", and the increase from count=1 to count=2 should trigger
a new alarm. However, adding something like this fundamentally changes the
monitoring model, and it seems to me this would just complicate matters both
for testing and reporting without much gain.
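For comparison, the counting idea would amount to something like the sketch
below. The class and method names are made up for illustration; nothing like
this exists in Hobbit as far as I know.

```python
class FailureCounter:
    """Hypothetical tracker that asks for a new alarm when the count rises."""

    def __init__(self):
        # Number of broken components seen at the previous check.
        self.last_count = 0

    def update(self, current_count):
        """Record the new count; return True if it is higher than before."""
        raised = current_count > self.last_count
        self.last_count = current_count
        return raised
```

Going from one broken component to two makes update() return True, while a
repeated count of one does not, so every test would need to carry this extra
state around.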
Alternatively, you could just set up separate alarms for each component; that
is not much different from counting the failing components, which you would
have to do in any case.
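A sketch of the per-component variant, assuming one status column per disk.
The host name "array1", the disk names and the message text are invented for
the example, and the "status HOST.TEST COLOR text" line only follows the
general shape of a Hobbit status message as I understand it; check the
documentation before relying on it.

```python
def per_disk_statuses(host, disks):
    """Build one status line per disk.

    host  -- hypothetical host name (no dots, to keep the sketch simple)
    disks -- dict mapping a disk name to True (healthy) or False (failed)
    """
    messages = []
    for name, healthy in sorted(disks.items()):
        color = "green" if healthy else "red"
        state = "OK" if healthy else "FAILED"
        # Each disk gets its own test name, so each can be acknowledged
        # on its own without hiding a failure on another disk.
        messages.append(f"status {host}.{name} {color} {name} is {state}")
    return messages
```

Acknowledging disk3 then leaves disk7 free to go red and alert separately.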
Hope this helps,

Kristian.

knielsen＠knielsen-hq.org