On Wed, Mar 04, 2009 at 11:36:22AM +0900, Everett, Vernon wrote:
We are monitoring a system, and the alert went red. We acknowledged the alert, and logged a call with our vendor for a replacement component. So far, all good.
However, while waiting for the replacement component, we had another failure, which also should have triggered a red alert. (It's a disk array, with many many disks, and many hot spares, so it wasn't a tragic failure)
The problem, of course, is that since the first alert was acknowledged, nobody spotted the second one until the first disk was replaced and the system remained red.
Xymon did what it was supposed to do. There's no way - short of human intelligence - to determine that the two red statuses were different.
Is there a way to force a status update/change, even though there is no real colour change?
No, not with the current logic.
Is there a way for Xymon to detect that we are looking at a second failure?
No.
If the answer to the above is no, then can we add this to the feature wish-list?
:-) I suppose so, but we would have to figure out just what it is that you want Xymon to do.
I gave it a little thought, and came up with what I think is a simple implementation. We could add this as an option to bb, similar to the [+lifetime] option, which will force the server to do a colour change from any colour to the new colour, even if they are the same. This should also clear any acks, and take a snapshot for the history. Implementing it as an option to bb will probably require relatively minor changes at server level, and put the onus on the client scripts to decide what should force a change.
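As a rough illustration of the proposal above, here is a toy Python model of a server that accepts a "force" flag with a status update. It is only a sketch: `StatusRecord`, `apply_status` and the `force` parameter are hypothetical names made up for this example, and none of it reflects how the Xymon server or the bb command actually work.

```python
from dataclasses import dataclass, field


@dataclass
class StatusRecord:
    """Toy server-side state for one host/test pair (not Xymon's data model)."""
    color: str = "green"
    acked: bool = False
    history: list = field(default_factory=list)


def apply_status(record, new_color, message, force=False):
    """Apply an incoming status and return True if it counts as a change.

    `force` stands in for the hypothetical client-supplied option: when set,
    the update is treated as a change even if the colour is the same.
    """
    changed = force or new_color != record.color
    if changed:
        record.acked = False                         # a change clears any ack
        record.history.append((new_color, message))  # snapshot for the history
    record.color = new_color
    return changed


rec = StatusRecord()
apply_status(rec, "red", "disk 3 failed")            # green -> red: a real change
rec.acked = True                                     # operator acknowledges it

# Second failure, same colour: without the flag nothing happens.
plain = apply_status(rec, "red", "disk 3 and disk 7 failed")
still_acked = rec.acked

# Same update with the hypothetical flag: ack cleared, history snapshot taken.
forced = apply_status(rec, "red", "disk 3 and disk 7 failed", force=True)
```

In this model the client script alone decides when to set `force`, which is exactly where the burden of the proposal ends up.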
That's one possibility. I'm not terribly thrilled with it, because for a lot of tests - all the standard ones, and particularly the ones like "msgs" or "procs" that collect lots of different data - it will be somewhat of a headache to provide a framework for rules that determine when the situation has changed 'enough' to warrant such an override. And I think that your average admin would not be pleased when his ACK was auto-cleared at 3 AM.
Maybe we could do it based on the acks? When an ack expires or is cleared, this triggers the "next status is different" situation. (You cannot clear an ack right now, but that can be done and would probably be meaningful anyway.) If we do that, then you would clear the ack after you had repaired the first disk; the status would remain red, but Xymon would now know that it was a "different" red from the first one, because you had cleared the ack for the first one. So you would immediately get a new alert, and the history log would update with the new status snapshot.
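To make the idea concrete, here is a small Python sketch of that logic. It is a toy model only: the class `AckAwareStatus`, its `next_is_new` flag and the method names are invented for illustration, clearing an ack is not something Xymon supports today, and the alerting is reduced to "record a message whenever a new red shows up".

```python
from dataclasses import dataclass, field


@dataclass
class AckAwareStatus:
    """Toy state for one host/test pair (hypothetical, not Xymon's internals)."""
    color: str = "green"
    acked: bool = False
    next_is_new: bool = False            # set when an ack expires or is cleared
    alerts: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def ack(self):
        self.acked = True

    def clear_ack(self):
        """Ack expired or was cleared: treat the next status as different."""
        if self.acked:
            self.acked = False
            self.next_is_new = True

    def update(self, new_color, message):
        """Apply a status; return True if it is handled as a new event."""
        color_changed = new_color != self.color
        is_new = color_changed or self.next_is_new
        if color_changed:
            self.acked = False           # an ack belongs to the old status
        self.color = new_color
        self.next_is_new = False
        if is_new:
            self.history.append((new_color, message))  # new history snapshot
            if new_color == "red":       # simplification: alert on red only
                self.alerts.append(message)
        return is_new


s = AckAwareStatus()
first = s.update("red", "disk 3 failed")              # first failure: alert
s.ack()                                               # call logged with vendor
second = s.update("red", "disk 3 and disk 7 failed")  # same red, acked: silent
s.clear_ack()                                         # disk 3 replaced
third = s.update("red", "disk 7 failed")              # "different" red: alert
fourth = s.update("red", "disk 7 failed")             # unchanged again: silent
```

Note that in this sketch the second failure still goes unnoticed while the first ack is in place; it only surfaces, with a fresh alert and history entry, once the ack is cleared or expires.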
Regards, Henrik