[hobbit] Highlights of the 4.3.0 version

3 Aug 2007


Sometimes the real world runs interference for Utopia.  While in Utopia
you want to analyse, find the root cause, and fix everything before
proceeding, you can't always do that.  When an outage of one hour costs
your company tens of thousands of dollars, you can't justify withholding
a simple bandaid (so long as you don't then ignore the long term fix).
Most everything I do in Hobbit is a custom script.  Restarting crashed
processes is one of the least of my worries, although in some rare
cases I do just that (short term), with appropriate logging and email to
the app development team.  The corporate expense of having the app down
is too great to let Utopian ideas prevail.
Most of the automated Hobbit stuff I do is not restarting dead apps
(luckily, that is very infrequent around here).  It's more mundane.  One
example is disk space.  A full filesystem would shut many things down.
Apps should not fill a filesystem, but sometimes they do.  So my custom
Hobbit scripts first scream and scream about low disk space, even
analysing things down to specific subdirectories and fast growing files
and doing trend analysis.  But if their call is not answered, they start
freeing up space from a "private reserve" I have set aside to deal with
emergencies.  So if we experience a sudden unexpected blowup in a
filesystem at 3am, Hobbit keeps things running in production until the
appropriate people can look into and diagnose the problem.  This may not
be Utopian behavior, but it sure is practical at 3am!
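The escalation described above (alert first, free the reserve only when
nobody answers) can be sketched roughly as follows.  This is not the
poster's actual script: the thresholds, the `DiskPolicy` name, the
count of unanswered alerts, and the idea that the "private reserve" is a
single preallocated file are all assumptions made for illustration.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class DiskPolicy:
    # All three values are hypothetical; tune them per filesystem.
    warn_pct: float = 90.0    # start alerting at this usage
    panic_pct: float = 98.0   # usage at which unattended action is allowed
    grace_checks: int = 3     # unanswered alerts required before acting


def used_percent(path: str) -> float:
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def decide(pct: float, unanswered_alerts: int, policy: DiskPolicy) -> str:
    """Pick one of 'ok', 'alert' or 'release_reserve'."""
    if pct < policy.warn_pct:
        return "ok"
    if pct >= policy.panic_pct and unanswered_alerts >= policy.grace_checks:
        return "release_reserve"
    return "alert"


def release_reserve(reserve_file: str) -> int:
    """Delete the preallocated reserve file; return the bytes it held."""
    size = os.path.getsize(reserve_file)
    os.remove(reserve_file)
    return size
```

Keeping `decide` free of side effects means the policy can be checked
without touching a real filesystem; only `release_reserve` changes
anything, and it should log and notify whenever it runs.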
But my vote would be for Hobbit out-of-the-box to NOT attempt automated
repair actions.  That should be left to the Hobbit administrator.  We
can write custom monitor scripts or custom alert scripts to add this
functionality if it's appropriate for our environments.  It's trivial to
integrate your own scripting into Hobbit.
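As a rough illustration of that integration, a custom test usually
boils down to building one status message and handing it to the Hobbit
client's `bb` program.  The sketch below assumes the usual client
environment, where `BB` points at that program and `BBDISP` names the
display server, and the message form `status HOST.TEST COLOR text` with
dots in the host name written as commas.  The host, test name, and text
are made up, and the color list is deliberately limited to the ones a
script would normally report.

```python
import os
import subprocess

# Colors a custom script would normally send (conservative choice).
ALLOWED_COLORS = ("green", "yellow", "red", "clear")


def build_status(host: str, test: str, color: str, text: str) -> str:
    """Build a Hobbit status message for one host/test pair."""
    if color not in ALLOWED_COLORS:
        raise ValueError(f"unsupported color: {color}")
    machine = host.replace(".", ",")  # host name dots become commas
    return f"status {machine}.{test} {color} {text}"


def send_status(message: str) -> None:
    """Send the message via the client's bb program (assumes BB/BBDISP are set)."""
    subprocess.run([os.environ["BB"], os.environ["BBDISP"], message], check=True)
```

For example, `build_status("www.example.com", "reserve", "yellow", "disk 91% full")`
produces the string that `send_status` would then pass along.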
I sure wish I worked in Utopia though.  The job would be a helluva lot
less stressful!  :-)
-----Original Message-----
From: scottrwalters at gmail.com [mailto:scottrwalters at gmail.com] On Behalf
Of Scott Walters
Sent: Friday, August 03, 2007 11:15 AM
To: hobbit at hswn.dk
Subject: Re: [hobbit] Highlights of the 4.3.0 version
I am definitely in the "monitor only" camp.  As appealing as
"self-healing" may seem, I've seen attempts go horribly wrong too many
times.  For example, Oracle being shut down for an upgrade and then
restarted in the middle of that upgrade.  Not good.
I also agree that "self-healing" lends itself to band-aids that avoid
root-cause determination.  I don't think this requires "baby-sitting,"
but a commitment to fixing things once.  I have also had the displeasure
of making permanent band-aids, but I cannot condone it.
All of those "operational" aspects aside, I've convinced myself that,
from a security point of view, corrective action from monitoring is
bad: a clear violation of the separation of duties.  You don't want your
auditors "cleaning up" the numbers as they go over your books.
You know what's better than your webserver being automatically restarted
when it crashes?  Your webserver not crashing.
I completely support the absence of corrective actions from monitor
triggers.  The question I have yet to answer satisfactorily is, "Should
the monitoring system perform additional data collection after specific
errors?"  For example, running a particular "find" command when disk
usage increases to try and identify which files are causing the
partition to fill.
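A read-only collection step of that kind might look like the sketch
below, which lists the largest files under a directory so the list can
be attached to the alert.  The function name and the default of ten
files are illustrative assumptions.  Unlike `find -xdev`, this walk
does not stop at filesystem boundaries.

```python
import heapq
import os


def largest_files(root: str, count: int = 10) -> list[tuple[int, str]]:
    """Return up to `count` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                # lstat: measure symlinks themselves, do not follow them
                st = os.lstat(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            sizes.append((st.st_size, full))
    return heapq.nlargest(count, sizes)
```

Because it only reads metadata, running this after a disk alert stays
on the "monitor only" side of the line drawn above.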
Scott Walters
-PacketPusher
On 8/3/07, Hubbard, Greg L <greg.hubbard at eds.com> wrote:
...
Well, I use Netcool which has the opposite philosophy -- there is a
"process automation" system that watches processes and restarts them
if they fail, while also logging restarts.  You can configure a "restart"
parameter to be anything from 0 (forever) to any number of times.  I
like to set a reasonable number so persistent errors eventually kill
the process, but occasional errors do not.  Log files are not
overwritten, but are appended and rotated.
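The restart budget described here can be sketched generically as below.
This is not Netcool's implementation or configuration syntax; it only
mirrors the description, with 0 meaning "restart forever" and any other
value capping the number of restarts.  The `supervise` name and its
parameters are assumptions.

```python
import subprocess
import time


def supervise(cmd, max_restarts, run=subprocess.call, log=print, delay=0.0):
    """Run cmd, restarting on failure; return how many restarts happened.

    max_restarts == 0 means no limit.  `run` and `log` are injectable so
    the policy can be exercised without starting real processes.
    """
    restarts = 0
    while True:
        rc = run(cmd)
        if rc == 0:
            return restarts  # clean exit, nothing more to do
        if max_restarts != 0 and restarts >= max_restarts:
            log(f"giving up on {cmd!r} after {restarts} restarts (rc={rc})")
            return restarts
        restarts += 1
        log(f"restart {restarts} of {cmd!r} after exit code {rc}")
        time.sleep(delay)
```

A finite budget gives the behavior described: an occasional crash is
absorbed, while a persistent failure eventually stops the restarts and
leaves a log trail for whoever investigates.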
But whatever.  My view seems to be in the minority -- guess the rest
of you don't mind 24x7x365 babysitting.
GLH
-----Original Message-----
From: Galen Johnson [mailto:Galen.Johnson at sas.com]
Sent: Friday, August 03, 2007 10:18 AM
To: hobbit at hswn.dk
Subject: RE: [hobbit] Highlights of the 4.3.0 version
Don't forget...this is the model that Tivoli and HP Openview, and many
other commercial monitoring solutions provide and sell as a feature.
From my experience as a sys admin, I've always found automatically
restarting a service if it goes down to be "a bad thing"(TM).
In many solutions, logs that would be integral to the real resolution
and prevention get overwritten upon a restart.
=G=
-----Original Message-----
From: Tod Hansmann [mailto:thansmann at directpointe.com]
Sent: Friday, August 03, 2007 10:40 AM
To: hobbit at hswn.dk
Subject: RE: [hobbit] Highlights of the 4.3.0 version
In my experience, I have to agree.  Hobbit is for monitoring so the
information that x is down gets to people who can properly diagnose
what is going on, not take generic actions.  If generic actions were
required for X to function properly, they should be a feature of that
software.
Hobbit CAN do some scripting based on alerts, but even that might be a
bit more than a systems administrator wants to hinder himself with.
Tod Hansmann
Network Engineer
-----Original Message-----
From: Buchan Milne [mailto:bgmilne at staff.telkomsa.net]
Sent: Friday, August 03, 2007 12:31 AM
To: hobbit at hswn.dk
Cc: Hubbard, Greg L
Subject: Re: [hobbit] Highlights of the 4.3.0 version
On Tuesday 24 July 2007 22:55:02 Hubbard, Greg L wrote:
...
Wonder if there is any way to tell a client what its status is so
it can be autonomous?  What I mean is this:  suppose there was a way
for the Hobbit client to tell the server that service X was now in
state Y, and a client-side module could then activate response Z on
its own?
I don't like band-aids like this.
"restart because it's down" prevents the real impact of problems being
seen, and provides less motivation for fixing things properly.
Instead, you sit with frequent short outages (which may avoid the
attention of managers, production managers) which have end-user
impact.
I like even less using a monitoring system to do this ...
Regards,
Buchan
To unsubscribe from the hobbit list, send an e-mail to
hobbit-unsubscribe at hswn.dk

haertig＠avaya.com