On Mon, Aug 31, 2015, at 16:24, J.C. Cleaver wrote:
On Mon, August 31, 2015 10:19 am, John Thurston wrote:
On Fri, August 28, 2015 3:16 pm, John Thurston wrote:
On 8/28/2015 12:45 PM, John Thurston wrote:
On 6/10/2015 9:01 AM, Scot Kreienkamp wrote: . . .
hobbit 28452 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
It seemed related to drop messages . . .
Hey, I think I'm seeing the same thing on Solaris with 4.3.21
I've ended up here after a customer let me know that email alerts were not working as expected. After a few hours of digging around, I decided that the alert daemon was failing to retrieve hostnames and failing miserably.
Have other people seen this behavior?
I have duplicated this behavior on another Xymon server on Solaris. It certainly looks like this behavior breaks the alert daemon. Fortunately, I "drop" hosts in batches, so I can restart Xymon at that time, but this is still pretty icky.
On 8/28/2015 3:12 PM, J.C. Cleaver wrote:
The patch from http://lists.xymon.com/pipermail/xymon/2015-June/041833.html was checked in as https://sourceforge.net/p/xymon/code/7669/ ; however, it's not in the most recent Terabithia RPM.
If you could test the direct patch (for hostdata, at http://lists.xymon.com/pipermail/xymon/attachments/20150610/8b425efb/attachm... ) on your OS, that would be very helpful. Signal handling is always a bit tricky to ensure is correct across the board.
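For anyone following along who wants to see the general shape of the problem: a `<defunct>` entry in `ps` is a child that has exited but has not yet been waited for by its parent. The sketch below is NOT the patch linked above and is not taken from the Xymon source; it is only a minimal, generic illustration of reaping children from a SIGCHLD handler on a POSIX system. The names `reap_children`, `install_sigchld_reaper`, `spawn_and_check_reaped`, and `reaped_children` are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical counter, for illustration only: children reaped so far. */
static volatile sig_atomic_t reaped_children = 0;

/* SIGCHLD handler: collect every exited child without blocking.
 * Several children may exit before the handler runs once, hence the loop. */
static void reap_children(int signo)
{
    int saved_errno = errno;    /* waitpid() may overwrite errno */
    (void)signo;
    while (waitpid(-1, NULL, WNOHANG) > 0) {
        reaped_children++;
    }
    errno = saved_errno;
}

/* Install the handler. Returns 0 on success, -1 on failure. */
static int install_sigchld_reaper(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    /* SA_RESTART: restart interrupted system calls where possible.
     * SA_NOCLDSTOP: only notify on child exit, not on stop/continue. */
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Fork a child that exits immediately, give the handler up to about
 * two seconds to run, then check that no zombie is left behind.
 * Returns 1 if the child was already reaped, 0 if not, -1 if fork failed. */
static int spawn_and_check_reaped(void)
{
    struct timespec pause = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    pid_t pid;
    int i;

    reaped_children = 0;
    pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        _exit(0);               /* child: exit at once */
    }
    for (i = 0; i < 200 && reaped_children == 0; i++) {
        nanosleep(&pause, NULL);    /* may return early on SIGCHLD */
    }
    /* If the handler already reaped the child, waiting for it again
     * fails with ECHILD, so there is no <defunct> entry left. */
    if (waitpid(pid, NULL, WNOHANG) == -1 && errno == ECHILD) {
        return 1;
    }
    return 0;
}
```

In a sketch like this, `install_sigchld_reaper()` would be called once at startup, before any child is forked. Details such as `SA_RESTART` behavior are the sort of thing that can differ between platforms, which is why testing on each OS matters.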
I have patched one of my servers and it behaves much better under my contrived tests :) This is under Solaris 10 (Update 11) on SPARC. The original report was under Red Hat Enterprise Linux 5.
If my understanding of this is correct, it is a pretty nasty defect :(
My failure scenario was non-delivery of some email alerts for hosts in dire straits. I have several customers who do not monitor the web interface, but rely on email notifications to warn them of impending problems. These folks had been without any alerting capability since early in July when I "dropped" a host and unknowingly clobbered the child of xymond_hostdata.
Thanks for the confirmation... Yes, I believe it's probably time to start another release cycle, for this and a few other of the recent bug fixes still pending.
For the record, I can't reproduce this on FreeBSD either.