On Tue, April 21, 2015 2:04 pm, John Thurston wrote:
It has been a long road, but I may have uncovered a defect in the rrd handler. I'm currently running xymon 4.3.17 (somewhat patched) on Solaris 10 on SPARC.
:: Symptom :: The xymond_rrd process crashes. It leaves footprints in the log like:
2015-04-20 19:09:18 Child process 23929 died: Signal 8 2015-04-20 19:09:18 Peer at 0.0.0.0:0 failed: Broken pipe 2015-04-20 19:09:18 Peer not up, flushing message queue It also leaves a pid file behind. It also leaves gaps in the rrd data.
:: Tracing :: By collecting core dumps and using pstack, I was able to see:
__divdi3+0x164(0, 0, 0, 0, 0, 0) do_la_rrd+0x2d4(1017c8, 1017e4, 10181d, 101829, 1, 55369956) update_rrd+0x76c(1017c8, 1017e4, 10182c, 55369956, 1017c0, 69000) main+0xa40(10182c, ffbfdc3c, 101829, 68800, 3, 49538) _start+0x5c(0, 0, 0, 0, 0, 0)
By enabling --debug on the "xymond_channel --channel=status" task (after applying jc's patch of March 13 so that the debug code didn't crash xymond - Thank you!), I was able to see that the last messages received prior to crashes was always a sequence of status messages containing "mem" followed by "cpu".
A "Signal 8" is commonly associated with a "divide by zero" event. Not being very C-aware, I had to consult google to learn that __divdi3 is a library used for unsigned long-int division. So if I'm reading the stack trace correctly, update_rrd called do_la_rrd which tied to do a division which failed.
:: Smoking gun? :: One of the hosts (running a BBPE 4.3 client), whose status message triggered a crash, has the following in its last CPU message:
Memory Statistics Total Physical memory: 0 bytes (1.00MB) Available Physical memory: 0 bytes (1.00MB) Total PageFile size: 17397915648 bytes (16.20GB) Available PageFile size: 12473839616 bytes (11.62GB) Total Virtual memory size: 17397915648 bytes (16.20GB) Available Virtual memory size: 12487081984 bytes (11.63GB)
And has the following in its last MEM message:
Physical Memory: 0.00MB/1.00MB (-1.#J%) Commit Charge: 4.59GB/16.20GB (28.30%)
Memory Statistics (Used/Available/Total in bytes) Physical Memory: 0 (0.00MB) / 0 (1.00MB) / 0 (1.00MB) Commit Charge: 4924112896 (4.59GB) / 12473839616 (11.62GB) / 17397915648 (16.20GB) Virtual memory: 4910481408 (4.57GB) / 12487081984 (11.63GB) / 17397915648 (16.20GB)
It isn't reasonable for a system to actually have 0MB of memory, but that is what the client has reported.
:: Hypothesis ::
The message handling code is accepting messages from clients stating 0MB total physical memory, but such information is making its way into the RRD handler and causing a divide by zero.
Can anyone else test this hypothesis?
Can someone with more C-skills look at do_la_rrd and see if a zero really can find its way into its division statements?
Yikes.
Yep, seems exactly like that's the case! I believe the following patch should fix it for you. Can you try it out?
HTH,
-jc