On 7/30/2014 8:50 AM, oliver wrote:
Ideally, I'd like to see the name of the server group ("prod" in the example) change to blue from white on the main view to remind me there's an ignored test. But I don't want the "main view" colour to change from green.
Don't disable the test. Acknowledge the alert.
Let me explain the situation a little more clearly.
We have tons of servers deployed in pairs. Each pair consists of an active box and a standby box and it doesn't technically matter which one of the two is active. For consistency reasons, we like to keep it so the "first" box is active whenever possible.
If the first box fails over, for whatever reason, it generates a red alarm on Xymon saying it's no longer active and (after checking everything out) we ask someone on the night-shift to fail back over during off-hours. At this point, we don't want the main Xymon view to be red so we "ignore" the test. However, since the main view is now green, the techs sometimes forget that there's anything to do and it remains failed over until someone drills down and sees it.
This comes back around to something I regularly tell our staff: "Xymon (and Big Brother before that) is not a task list. It is an alerting system. Using it as a task list is an abuse of the tool and reduces its ability to meet its fundamental business goal."
We have task-list and problem-tracking processes in place, so we don't need to use Xymon to meet this need. Your business needs and available tools may be different, but I urge you to consider finding a better tool than Xymon for managing task lists.
I was trying to get to a state where they would know from the front page that there's a disabled/ignored/ack'd box, to eliminate the "I missed the email" excuses.
You could define a 'combo' test which alarmed when fewer than two of the underlying tests were green. This 'combo' test could be rigged to propagate to the non-green screen while suppressing the propagation of the underlying tests.
You could then rig the underlying tests to send automated email alerts to the folks who should fix the broken half of the pair. Look at combo.cfg and alerts.cfg for options to aggregate test results and time/escalate automated email alerts.
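To make the "fewer than two green" rule concrete, here is a rough Python sketch of the logic only. It is not Xymon code and not combo.cfg syntax; the function name and the status strings are hypothetical, chosen just to illustrate how the combined status would follow from the two underlying tests.

```python
# Illustrative model of the combo rule described above.
# NOT Xymon's implementation; combo_color and its arguments are hypothetical.

def combo_color(first_box: str, second_box: str) -> str:
    """Return 'green' only when both underlying tests are green, else 'red'."""
    # Count how many of the two underlying tests currently report green.
    green_count = sum(1 for status in (first_box, second_box) if status == "green")
    # Alarm when fewer than two of the underlying tests are green.
    return "green" if green_count >= 2 else "red"


if __name__ == "__main__":
    for pair in [("green", "green"), ("red", "green"), ("red", "red")]:
        print(pair, "->", combo_color(*pair))
```

In a real setup the equivalent expression would live in combo.cfg, with the underlying tests kept off the non-green screen and only the combined test allowed to propagate.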
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska