First, let me say that this is very nifty. Flap detection makes folks look at things that they might have missed.
It's driving the NOC folks **nuts** though. Acking the reds should stop them from paging, but the main page then stays red for a full half hour, even though the problem is completely fixed. IMHO it would be very useful to have a "release" or "ALL CLEAR" button of some sort for flapping situations that have been dealt with. The NOC folks hate red screens... (it would be even MORE useful to have a release button that only some folks could push...is there any seekrit workaround?)
In our case we have a bit of a cascade situation, where one server can trigger a lot of secondary reds, so we end up looking at a whole lot of red...
thanks Betsy
On Tue, Mar 29, 2011 at 10:54 PM, Elizabeth Schwartz < betsy.schwartz at gmail.com> wrote:
> IMHO it would be very useful to have a "release" or "ALL CLEAR" button of some sort for flapping situations that have been dealt with. [...] is there any seekrit workaround?
I don't know much about flapping, but what happens if you manually send a 'green' status for the flapping service?
On the Xymon server:
bb localhost "status server,domain,com.column green `date`
Flapped so hard we took off..."
If that does the trick it could be turned into your "seekrit" webpage for certain select folks to be able to clear the status.
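The manual green status suggested above could be wrapped in a small shell helper, sketched below. Only the `status host,with,commas.column green ...` message format comes from the thread; the function names and the `XYMON_CMD` and `XYMON_SERVER` variables are hypothetical, and whether a manual green actually clears a flapping status is untested here. The thread uses both `bb` and `xymon` as the client command; the default below is `xymon`, so set `XYMON_CMD=bb` if that is what your install has.

```shell
#!/bin/sh
# Sketch only: helper names and the XYMON_CMD / XYMON_SERVER overrides are
# hypothetical. The "status" message layout is the one quoted in the thread.

# build_green_status HOST COLUMN TEXT...
# Prints a green status message. Dots in the hostname become commas,
# because the message uses a dot to separate host and column.
build_green_status() {
    host=$1; column=$2; shift 2
    printf 'status %s.%s green %s %s\n' \
        "$(printf '%s' "$host" | tr '.' ',')" "$column" "$(date)" "$*"
}

# send_green_status HOST COLUMN TEXT...
# Hands the message to the client command (XYMON_CMD=echo gives a dry run).
send_green_status() {
    "${XYMON_CMD:-xymon}" "${XYMON_SERVER:-127.0.0.1}" "$(build_green_status "$@")"
}
```

For a dry run, `XYMON_CMD=echo send_green_status server.domain.com http "All clear"` prints the server address and message instead of sending anything.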
Did you try going to the Enable/Disable page and disabling the red things with "Until OK" selected?? That would make the red dot go away until after the next green report.
Ralph Mitchell
On Tue, Mar 29, 2011 at 11:43 PM, Ralph Mitchell <ralphmitchell at gmail.com> wrote:
> I don't know much about flapping, but what happens if you manually send a
> 'green' status for the flapping service?
> On the Xymon server:
> bb localhost "status server,domain,com.column green `date`
I will try this the next time we get a flap, thanks! That would be a good seekrit
> Did you try going to the Enable/Disable page and disabling the red things
> with "Until OK" selected?? That would make the red dot go away until after
We don't want to disable them, because what if they go down again? Not unlikely with a previously flapping service. Can't have a half-hour window with no monitoring of vital services.
thanks Betsy
Excuse top-posting, it's my hardware.
Disabling "Until OK" would disable only until the next green state.
-- Sent from my Palm Pre
On Mar 30, 2011 12:53, Elizabeth Schwartz <betsy.schwartz at gmail.com> wrote:
> We don't want to disable them, because what if they go down again? Not
> unlikely with a previously flapping service.
> Can't have a half-hour window with no monitoring of vital services
> thanks Betsy
Xymon mailing list
Xymon at xymon.com
On 03/29/2011 10:54 PM, Elizabeth Schwartz wrote:
> In our case we have a bit of a cascade situation, where one server can trigger a lot of secondary reds, so we end up looking at a whole lot of red...
Have you explored the depends tag at all? It might help reduce or eliminate the cascade effect. You can read about it in the hosts.cfg manpage.
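As a rough illustration, a `depends` entry in hosts.cfg might look like the fragment below. The hostnames, addresses and choice of tests are made up, and the syntax is reproduced from memory of the hosts.cfg manpage, so check it against the manpage for your version. The tag may apply only to tests run by the network tester, in which case it would not help with client-reported tests such as disk.

```
# hosts.cfg fragment (hypothetical hosts and addresses)
10.0.0.10   db1.example.com    #
10.0.0.20   app1.example.com   # http://app1.example.com/ depends=(http:db1.example.com/conn)
```

With an entry like this, the intent is that a failing http test on app1 is shown as "clear" rather than red while the conn test on db1 is also failing, which should cut down on secondary reds.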
Tom
On Tue, 29 Mar 2011 22:54:25 -0400, Elizabeth Schwartz <betsy.schwartz at gmail.com> wrote:
> First, let me say that this is very nifty. Flap detection makes folks look at things that they might have missed.
Glad you like it :-)
> It's driving the NOC folks **nuts** though. Acking the reds should stop them from paging, but the main page then stays red for a full half hour, even though the problem is completely fixed. IMHO it would be very useful to have a "release" or "ALL CLEAR" button of some sort for flapping situations that have been dealt with. The NOC folks hate red screens...
Well ... yes, I see your point but I am not sure I agree with it.
If your NOC folks are using the "critical view", then they can ack the alert, and it's gone from their view. That is how I think it/they should work :-)
I know a lot of sites use the "All non-green" view or even the full overview pages for monitoring, and the ack won't change the color there. If you must have a green display in that case, then you can disable the status (make it "blue") for 30 minutes, and then it will return to the real status after that half hour has passed. But of course, any errors during that period will not show up until the disable-period expires.
There may be a third possibility that does what you're asking for. I think (haven't tested it) that the new "modify" command would override a flapping status. If you have a "disk" status on the "server1" host, then a command like this
xymon 127.0.0.1 "modify server1.disk green manual Disk cleanup completed"
will override the normal status-color and force the status green with the comment "Disk cleanup completed". The "manual" keyword is just a token to identify this modification. However, a modification is only valid for 2 status-updates, so it won't handle the full 30-minute period. It wouldn't be terribly difficult to modify xymond to allow modifiers to be valid for a longer period of time.
This could easily be wrapped into the status display when a flapping status is shown.
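One untested way to stretch the modifier across the flap window without patching xymond would be to re-send the same `modify` message at intervals. The sketch below assumes that each re-send refreshes the modifier, which the thread does not confirm. The `hold_green` name and the `XYMON_CMD` and `XYMON_SERVER` variables are hypothetical; only the `modify` message layout is taken from the example above.

```shell
#!/bin/sh
# Sketch only: hold_green, XYMON_CMD and XYMON_SERVER are hypothetical names.
# Assumption: re-sending a modify message renews it; this is not verified.

# hold_green HOST COLUMN COUNT INTERVAL_SECONDS COMMENT...
# Sends the same "modify ... green manual" message COUNT times,
# waiting INTERVAL_SECONDS between sends.
hold_green() {
    host=$1; column=$2; count=$3; interval=$4; shift 4
    i=0
    while [ "$i" -lt "$count" ]; do
        "${XYMON_CMD:-xymon}" "${XYMON_SERVER:-127.0.0.1}" \
            "modify ${host}.${column} green manual $*"
        i=$((i + 1))
        # No need to wait after the final send.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `hold_green server1 disk 6 300 Disk cleanup completed` would send the message six times, five minutes apart. Setting `XYMON_CMD=echo` prints the messages instead of sending them, which is a convenient way to check the output first.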
Regards, Henrik
Interesting, thanks!
We haven't explored the critical systems view because there's a perception that *all* our monitored systems are critical. With Big Brother (which I'm hoping to turn off next week!) we've been going on the model of trying to make all the alerts that have to wake someone up be red, and making ones that can wait not go over yellow. But it's true that as we get bigger, having all those ack'ed yellows around muddies up the display. And now with the flapping feature, I see what you're saying.
I'm finding the critical hosts setup to be rather intimidating, though. We've got roughly 250 hosts and 71 distinct *types* of host. Some of them can be cloned as generic unix or generic linux or whatever, but most have at least one test specific to their business function. There's an average of maybe ten tests per host, and some hosts have tests that run on only one or two servers in a cluster. Am I understanding correctly that when you edit the critical systems view, you're editing a group that applies to only one particular test? That is, I have to create "production databases-disk" and "production databases-ntp" as separate entities (or maybe it should be "hosts with sev1 disk" and "hosts with sev1 ntp")? Or is there a way to set the rules for all tests on a production database?
Are people with hundreds of hosts using this feature? If so, any tips? I suspect I'm misunderstanding how to set it up.
thanks Newbie
participants (5)
- betsy.schwartz@gmail.com
- henrik@hswn.dk
- novosirj@umdnj.edu
- ralphmitchell@gmail.com
- tomg@mcclatchyinteractive.com