autofixing


lebarber＠gmail.com

6 Apr 2012

9:31 p.m.

Resending to the list, Gmail seems to be hiding the "reply to all".

Thanks, Larry Barber

On Fri, Apr 6, 2012 at 4:28 PM, Larry Barber <lebarber at gmail.com> wrote:

...

The kind of things that you can automate should be handled routinely, not be triggered by an alert from your monitoring tool. If you have logs growing so fast that they are filling up your file system, you should find out what is filling them up and why, and then fix that. Automatic log rotation and compression should be done by a tool like logrotate, not Xymon or any other monitoring tool. You shouldn't be using a monitoring tool to trigger routine maintenance; it simply causes unnecessary alerts that cause problems in other areas.

Thanks, Larry Barber

On Fri, Apr 6, 2012 at 4:06 PM, KING, KEVIN <KK1051 at att.com> wrote:

...
Larry,

Some auto-correcting is not bad. Back in the Big Brother days I had a datacenter and a team of folks. We managed to the “yellow” alerts. I had folks correct and build scripts to address the things that brought on the yellow, so very little red was ever seen.

Now, the things you can automate are the disk-full kind of things. If that happens you can run a script to clean logs, compress them, and that sort of stuff. This was usually handled by managing the yellow: there would be a script in place to keep the space below the yellow trigger. So if you got a red, it was usually a big temp file or something that would get cleaned up shortly. On the red alert, say, you could have it run the cleanup script rather than waiting for your cron to do the normal cleanup.

Now, on other issues it really depends on what the alert is about. You cannot automate everything economically. At some point it is cheaper and faster to put a human in the loop. I did have a script that would take the e-mail response to the alert, parse the message, and do the work. This was back in the day with the RIM pagers. When you got an alert, you replied to it with “run clean script on host”. The reply e-mail was parsed by the same script we were using to acknowledge the alert; it would parse the reply and run a clean script. This let my admins work issues while away from a PC or network connection.

I do hear and agree with your concerns. A blanket statement from managers who do not have a full understanding of all the elements is a rough thing to swallow. But their heart is in the right spot :)

I guess in a rather long, rambling way I am saying that you learn and tune your systems. Address recurring issues so they do not recur. Then watch for the next thing to be addressed.

-Kevin
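
The reply-by-e-mail workflow described above could be sketched roughly as follows. This is only an illustration in Python, not the script Kevin's team used: the command wording is taken from his example, while the function name, the whitelist, and the script path are assumptions made up for the sketch. It only parses and validates the command; actually running anything on the host is left out.

```python
import re

# Hypothetical whitelist: action keyword -> script to run (assumed path).
ALLOWED_ACTIONS = {"clean": "/usr/local/bin/clean-logs.sh"}

# Matches a whole line such as "run clean script on web01.example.com".
_COMMAND_RE = re.compile(
    r"^[ \t]*run[ \t]+(?P<action>[a-z]+)[ \t]+script[ \t]+on[ \t]+"
    r"(?P<host>[A-Za-z0-9][A-Za-z0-9.-]*)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_reply_command(body):
    """Return (action, host) for the first whitelisted command line, else None."""
    for match in _COMMAND_RE.finditer(body):
        action = match.group("action").lower()
        if action in ALLOWED_ACTIONS:
            return action, match.group("host").lower()
    return None


if __name__ == "__main__":
    reply = "run clean script on web01.example.com\n> original alert text"
    print(parse_reply_command(reply))
```

Restricting replies to a fixed set of actions and a strict host pattern means that quoted alert text or a mistyped reply is ignored rather than executed.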

From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Larry Barber Sent: Friday, April 06, 2012 1:43 PM To: xymon at xymon.com Subject: [Xymon] autofixing

My management has gotten the idea that we should be automating the repair processes on our servers. They want things set up so that when a fault is detected a script is run that attempts to repair it. I've tried to convince them that this is a profoundly wrong-headed idea, but I'm not having much luck. Do any of you know of any articles or resources that might help convince them?

Thanks, Larry Barber


josh＠imaginenetworksllc.com

6 Apr 2012

9:32 p.m.

For Gmail, you can click the down arrow by the reply button (just to the right of it) or change the default action to reply to all in your settings.

Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373


Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon

lebarber＠gmail.com

9:33 p.m.

I've tried looking at Google, but can't seem to come up with a good search phrase. What I mainly get is articles about various tools that will auto repair various Microsoft products. This is the kind of thing that happens once you start on the auto repair bandwagon: buy some software, then buy some more software to keep the first program running, then buy some more to keep the second running, then ....

Thanks, Larry Barber


waa-hobbitml＠revpol.com

10 Apr 2012

12:29 a.m.

On 04/06/12 17:33, Larry Barber wrote:

...

What I mainly get is articles about various tools that will auto repair various Microsoft products. This is the kind of thing that happens once you start on the auto repair bandwagon: buy some software, then buy some more software to keep the first program running, then buy some more to keep the second running, then ....

Reminds me of back in 1997/98 when I worked for a small ISP. We had to have a second "What's Up Gold" server whose only purpose was to ... wait for it...

Monitor our What's Up Gold server.

True (but sad) story.

-- Bill Arlofski Reverse Polarity, LLC

bewhite＠fellowes.com

11 Apr 2012

5:52 p.m.

Actually, I have found some cases where an "auto fix" script is helpful (tool licenses going down, Oracle Listeners that die, etc.); however, they are the exception, not the rule. Also, they need to be coded very carefully, to make sure they don't keep repeating the fix while the problem remains unsolved. Want to bring down a server? Have 2000 non-functional Oracle Listener processes running, doing nothing!

.....Bruce
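
One way to keep a fix script from repeating itself indefinitely, as cautioned above, is to cap the number of attempts per problem within a time window and hand over to a human once the cap is reached. The sketch below is a minimal Python illustration under that assumption; the class name, the key format, and the default limits are hypothetical and not taken from any script mentioned in this thread.

```python
import time
from collections import deque


class FixLimiter:
    """Allow at most max_attempts fix runs per window_seconds for each problem key."""

    def __init__(self, max_attempts=3, window_seconds=3600.0):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = {}  # problem key -> deque of attempt timestamps

    def allow(self, key, now=None):
        """Record an attempt and return True, or return False if the cap is reached."""
        now = time.monotonic() if now is None else now
        attempts = self._attempts.setdefault(key, deque())
        # Forget attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False
        attempts.append(now)
        return True


if __name__ == "__main__":
    limiter = FixLimiter(max_attempts=2, window_seconds=600)
    for _ in range(3):
        if limiter.allow("db01:listener"):
            print("would restart the listener here")
        else:
            print("cap reached: stop fixing and page a human")
```

The calling script would run the restart only when `allow()` returns True, so a listener that refuses to stay up produces a page instead of a growing pile of dead processes.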


Bruce White Senior Enterprise Systems Engineer | Phone: 1-630-671-5169 | Fax: 630-893-1648 | bewhite at fellowes.com | http://www.fellowes.com/

asparks＠doublesparks.net

6 Apr 2012

9:44 p.m.

I'd generally agree that fixing the root cause whenever possible, so the problem doesn't occur, is preferable. In a past life, we did do some of this. Of course, we did whatever we could to prevent the problem in the first place... but web server instances crash, and sometimes traffic irregularities cause logs to fill faster than usual.

I had a hack going that involved cfengine, with cfrun callable from a paging script. The premise was to have cfengine invoked on the remote node before pages actually went out (e.g., a DURATION delay on real pages), to see if cfengine could fix the simpler problems (like a process dying or whatnot). If it could, we could sleep. If not, the second-level page went out for human intervention.

We didn't do much autofixing... there wasn't a lot in the environment that lent itself to such. Either we engineered an HA environment (clustered) where a dead machine didn't affect the service... or the problem was probably not simple to fix, and we needed real eyes/brains on it. -Alan
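
The "try a fix first, page only if it didn't help" flow described above can be sketched as follows. This is a generic Python illustration, not the cfengine/cfrun hack itself and not a Xymon API: the three callables, the settle delay, and the return values are all assumptions for the sake of the example.

```python
import time


def handle_alert(is_healthy, attempt_fix, page_human,
                 settle_seconds=60.0, sleep=time.sleep):
    """Try one automated fix, then page a human only if the check still fails.

    Returns "healthy", "fixed", or "paged".
    """
    if is_healthy():
        return "healthy"
    try:
        attempt_fix()
    except Exception:
        # The fix itself broke: do not hide that, escalate straight away.
        page_human()
        return "paged"
    sleep(settle_seconds)  # give the service time to come back before re-checking
    if is_healthy():
        return "fixed"
    page_human()
    return "paged"


if __name__ == "__main__":
    state = {"up": False}
    outcome = handle_alert(
        is_healthy=lambda: state["up"],
        attempt_fix=lambda: state.update(up=True),  # stand-in for a real repair
        page_human=lambda: print("paging on-call"),
        settle_seconds=0,
    )
    print(outcome)
```

The fix is attempted exactly once per alert; anything it cannot resolve still reaches a person, which matches the second-level page described in the message.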


jamison＠newasterisk.com

11:46 p.m.

...

From my experience, you almost always want humans in the loop, even if all they do is run a quick script over an SSH session on their phone. What inevitably happens when they're not is that the band-aid clean-all script, for example, will compress and delete stuff all day long, but the actual problem is that there is a database, or something, that has grown far past acceptable levels. So, when the day comes that everything is compressed and deleted and cleaned out and emptied, and there is no space left to reclaim, it's always three o'clock in the morning, and by the time the admins get their laptops fired up and are awake enough to work, the database has dismounted and all hell is breaking loose.

Something else I've seen: an alert is generated that memory is at, say, 99%, and that calls a script that bounces your CIFS mounts, because mount.cifs has always seemed a little buggy to me, but the actual problem is that the internal app running on that system has a small memory leak. What will always happen, given time, is that the application leaks all available memory and now the whole system is hosed, instead of somebody actually looking at which process is taking all your memory.

Even if the alert still goes out as a page, email, or something, I know that I will forget about it unless I've got to log on and do something. Management's heart is in the right place, just not their head. It sure sounds good: "Yeah, let's get the whole datacenter running automagically, without admin intervention." However, what I really hear is, "Yeah, let's make sure we're not documenting and acting upon any small issue that arises until it becomes a bona fide conflagration."

Jamison Maxwell Jamison at newasterisk.com
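
The failure mode described above, where a cleanup script keeps "succeeding" while the underlying growth continues, can at least be made visible if the script records disk usage right after each run. The sketch below is a hypothetical Python illustration of that idea; the function name, the percentage readings, and the three-run threshold are assumptions, not something used by anyone in this thread.

```python
def cleanup_losing_ground(post_cleanup_usage, min_runs=3):
    """Return True if usage measured right after each of the last
    min_runs cleanups was higher than after the run before it."""
    recent = post_cleanup_usage[-min_runs:]
    if len(recent) < min_runs:
        return False  # not enough history to judge a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Percent of the file system still used immediately after each cleanup run.
    history = [41.0, 40.5, 52.0, 63.5, 71.0]
    if cleanup_losing_ground(history):
        print("cleanup is reclaiming less each time: have a human look for the root cause")
```

A rising post-cleanup floor is a cue to look for the database or leak behind it well before the night when there is nothing left to reclaim.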

