On Tue, December 1, 2015 1:41 pm, John Thurston wrote:
On 12/1/2015 11:51 AM, J.C. Cleaver wrote:
On Tue, December 1, 2015 9:32 am, John Thurston wrote: *snip*
In this occurrence, it does not appear to be related to a "drop" message. My last recorded "drop" was at 20151103-0846, and the alert process didn't start logging "which is not defined" until 20151120-0007.
Hmm. Okay, that does change things slightly. Fortunately, that means it's probably specifically caused by drops per se. Were there any other errors that occurred with other components around this time?
I have several instances of "Oversize status msg from " in the xymond.log, but those are appearing six hours before the bad behavior appeared in xymon_alert. I have difficulty believing they are related.
Ack. Yeah, that should have been 'NOT specifically' :)
Perhaps the system was low enough on memory that some re-allocations might have failed?
I think this is unlikely. The system has 256GB of RAM, and there are no memory caps placed on the non-global zone in which xymon is running. I don't have information about its size on Nov 20, but today it is using about 400MB of RAM. All of the zones on the system together are consuming less than 10GB of the 256GB, and it wouldn't have been significantly different a few weeks ago.
I've been doing some 'drops' today to try to break it, but haven't succeeded. I'll continue to beat on it and see if I can find a repeatable failure scenario.
fwiw, this is under 4.3.22
Hmm. This is an area where it's possible that glibc/NULL issues might be causing subtle problems too. I could easily see the btree getting hosed by re-insertion of a key we weren't really expecting.
-jc