On Wed, April 22, 2015 11:28 am, John Thurston wrote:
On 4/21/2015 4:40 PM, J.C. Cleaver wrote:
- snip -
:: Hypothesis ::
The message handling code is accepting messages from clients stating 0MB total physical memory, and that value is making its way into the RRD handler and causing a divide by zero.
Can anyone else test this hypothesis?
Can someone with more C-skills look at do_la_rrd and see if a zero really can find its way into its division statements?
On Tue, April 21, 2015 2:04 pm, John Thurston wrote:
It has been a long road, but I may have uncovered a defect in the rrd handler. I'm currently running xymon 4.3.17 (somewhat patched) on Solaris 10 on SPARC.
:: Symptom ::
The xymond_rrd process crashes. It leaves footprints in the log like:

2015-04-20 19:09:18 Child process 23929 died: Signal 8
2015-04-20 19:09:18 Peer at 0.0.0.0:0 failed: Broken pipe
2015-04-20 19:09:18 Peer not up, flushing message queue

It also leaves a pid file behind. It also leaves gaps in the rrd data.
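
For readers wondering why "Signal 8" points at arithmetic: on Solaris (and most other Unix systems) signal 8 is SIGFPE, which is what a process typically receives on an integer divide by zero. The following is only a minimal sketch of how a parent process can observe that, not code from Xymon. The function name child_fpe_signal is made up for illustration, and the child uses raise(SIGFPE) as a stand-in for the real division, because dividing by zero is undefined behavior in C and not every CPU traps on it.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that dies from SIGFPE and return the
 * signal number the parent sees, or -1 on error or a non-signal exit.
 */
int child_fpe_signal(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: make sure the default action (terminate) is in place. */
        signal(SIGFPE, SIG_DFL);
        raise(SIGFPE);   /* stand-in for a trapping divide by zero */
        _exit(0);        /* not reached when the signal is delivered */
    }

    /* Parent: collect the child's status, as a supervising process would. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

A supervising process that logs WTERMSIG(status) in this situation would print the number of SIGFPE, which matches the "died: Signal 8" line above.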
Yep, seems exactly like that's the case! I believe the following patch should fix it for you. Can you try it out?
Thank you!
I created a script with which I could semi-reliably induce a crash by feeding a message claiming 0MB of physical memory. It isn't 100% reliable because I think there is some magic timing I haven't deciphered. But if I wait five or ten minutes between attempts, I can crash the unpatched process with my message.
After applying your patch, I am _unable_ to crash the process with my message. I also found the "report had 0 total physical/pagefile memory listed" text in my rrd-status log.
Great to hear! Unfortunately, it means a search for other un-validated zero divs is probably warranted.
Now I want to try to grasp the possible consequences of using this patch. Am I correct that by responding to this condition with "return 0", there will not be a call made to do_memory_rrd_update for this host/message combination? And that the worst consequence of this will be a possible gap in the data stored in the rrd for this host?
This is correct. Since no update is written for that report, RRD will eventually record 'NaN' for the interval instead of zeros.
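
To make the "return 0" behaviour concrete, here is a minimal sketch of the guard pattern being discussed. It is not the actual patch and not the real handler: handle_memory_report_sketch, send_update_sketch, updates_sent and last_pct are all hypothetical names, and the real code passes different arguments. The only things taken from the thread are the idea of returning 0 before any division happens and the wording of the log message.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real RRD update call. */
static int  updates_sent = 0;
static long last_pct     = -1;

static void send_update_sketch(const char *host, long pct)
{
    (void)host;
    updates_sent++;
    last_pct = pct;
}

/*
 * Hypothetical handler: validates the total before dividing by it.
 * A zero (or negative) total is logged and skipped, so no update is
 * made for this host/message combination.
 */
int handle_memory_report_sketch(const char *host, long phystotal_mb,
                                long physused_mb)
{
    if (phystotal_mb <= 0) {
        fprintf(stderr, "%s: report had 0 total physical memory listed\n",
                host);
        return 0;   /* bail out before the division below */
    }

    long pct = (100 * physused_mb) / phystotal_mb;
    send_update_sketch(host, pct);
    return 0;
}
```

With a guard like this, a report claiming 0MB total produces a log line and no update, so the only visible effect is a gap in that host's graph.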
The actual memory *status*, FBoFW, is done via a different calculation (for Solaris, unix_memory_report() in xymond_client.c). It appears that a phystotal of 0 there will cause the Physical memory usage to be listed as '0', which would probably not trigger anything on MEMPHYS alerts in analysis.cfg. (Not sure if that's the safest approach, but if there are clients that regularly report 0 total RAM, the alternative might be worse.)
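
As a rough illustration of that status-side behaviour, the sketch below treats a zero total as 0% used and then compares the result against warning and panic thresholds. This is an assumption-laden sketch, not the code in unix_memory_report(): the function names, the enum, and the use of ">=" for the comparison are all hypothetical, and the thresholds are passed in as parameters rather than read from analysis.cfg.

```c
/* Hypothetical status colors for this sketch only. */
enum color_sketch { GREEN_SKETCH, YELLOW_SKETCH, RED_SKETCH };

/* Percentage used; a zero total is reported as 0 instead of dividing. */
long phys_pct_for_status_sketch(long physused, long phystotal)
{
    return (phystotal > 0) ? (100 * physused) / phystotal : 0;
}

/* Compare a percentage against caller-supplied warn/panic thresholds. */
enum color_sketch memphys_color_sketch(long pct, long warn, long panic)
{
    if (pct >= panic)
        return RED_SKETCH;
    if (pct >= warn)
        return YELLOW_SKETCH;
    return GREEN_SKETCH;
}
```

Under these assumptions a client reporting 0 total RAM always lands on 0% and therefore stays green, regardless of how much memory it claims to be using.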
Regards,
-jc