On Friday 03 August 2007 19:15:27 Scott Walters wrote:
I am definitely in the "monitor only" camp. As appealing as "self-healing" may seem, I've seen attempts go horribly wrong too many times. For example, Oracle being shut down for an upgrade and then restarted by the monitor in the middle of the upgrade. Not good.
How about the easy example of a web server not responding? Do you restart it? In the case I am thinking of, no. The reason it is not responding is that the database server it (and another 4 web servers) is waiting on is having problems. Restarting the web server would drop the >1000 existing (working) sessions, causing a full-blown outage, and migrate the problem to the other 4 web servers that sit behind the same load balancer.
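To make the web server case concrete, here is a minimal sketch of a check that only reports and never restarts anything. It assumes the monitor already has an up/down result for the web server and for each backend it depends on; the names CheckResult and classify are made up for illustration and do not belong to any particular monitoring system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CheckResult:
    """Outcome of one probe (hypothetical type, for illustration only)."""
    name: str
    ok: bool


def classify(web: CheckResult, dependencies: Sequence[CheckResult]) -> str:
    """Describe the state of a web server; takes no corrective action."""
    if web.ok:
        return f"{web.name}: OK"
    # Names of the backends that are failing their own checks.
    failed = [dep.name for dep in dependencies if not dep.ok]
    if failed:
        # The web server is most likely just waiting on a broken backend,
        # so a restart would only drop sessions and move the load elsewhere.
        return f"{web.name}: DEGRADED, waiting on {', '.join(failed)}"
    return f"{web.name}: CRITICAL, dependencies look healthy"


if __name__ == "__main__":
    print(classify(CheckResult("web1", False), [CheckResult("db1", False)]))
```

The point of the sketch is that the alert names the database as the likely cause, so whoever gets paged looks there first instead of at the web server.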
I also agree that "self-healing" lends itself to band-aids that avoid root-cause determination.
Or *prevent* the root-cause determination. For example, I had a problem on an LDAP server that appeared once every 2 or 3 weeks. I started it under a debugger, and when we next experienced the problem, some online debugging (after taking it out of the pool) with a developer found and fixed the bug within one hour (and allowed me to understand the cause so I could work around it). An automatic restart here would have meant waiting some more and another few outages.
I don't think this requires "baby-sitting," but rather a commitment to fixing things once. I have also had the displeasure of making permanent band-aids, but I cannot condone it.
We do have some applications that require supervision ... but for them we use daemontools or supervise-scripts (a re-implementation of daemontools), as these are *much* better at supervision than a monitoring system. If you really need a baby-sitter, the monitoring system isn't the best one ...
All of those "operational" aspects aside, I've convinced myself that, from a security point of view, corrective action from monitoring is bad: a clear violation of the separation of duties. You don't want your auditors "cleaning up" the numbers as they go over your books.
You know what's better than your webserver being automatically restarted when it crashes? Your webserver not crashing.
I completely support the absence of corrective actions from monitor triggers. The question I have yet to answer satisfactorily is, "Should the monitoring system perform additional data collection after specific errors?" For example, running a particular "find" command when disk usage increases, to try to identify which files are causing the partition to fill.
Or attach a debugger to the hung process and get a backtrace?
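As a rough illustration of what such read-only data collection could look like, here is a sketch in Python. It assumes the monitor can run a handler when a threshold is crossed; the function names and the 90% threshold are invented for the example. The backtrace helper only builds a gdb command line (using gdb's -p, -batch and -ex options) and does not run it, since attaching to a live process deserves a human decision.

```python
import os
import shutil


def usage_fraction(path):
    """Fraction of the filesystem holding `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return used / total


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs, largest first.

    Roughly what a "find"-based hunt for big files would report. This
    sketch does not stop at filesystem boundaries.
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.lstat(full).st_size, full))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    sizes.sort(reverse=True)
    return sizes[:limit]


def collect_on_alert(path, threshold=0.90):
    """Gather evidence only when usage is at or above `threshold`."""
    if usage_fraction(path) < threshold:
        return []
    return largest_files(path)


def gdb_backtrace_command(pid):
    """Build (but do not run) a gdb command that dumps all thread backtraces."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]
```

Nothing here changes the state of the monitored system; it only attaches extra evidence to the alert, which keeps the monitor on the "auditor" side of the separation of duties.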
Regards, Buchan