Possible defect in rrd handler causing divide-by-zero crashes
It has been a long road, but I may have uncovered a defect in the rrd handler. I'm currently running xymon 4.3.17 (somewhat patched) on Solaris 10 on SPARC.
:: Symptom :: The xymond_rrd process crashes. It leaves footprints in the log like:
2015-04-20 19:09:18 Child process 23929 died: Signal 8 2015-04-20 19:09:18 Peer at 0.0.0.0:0 failed: Broken pipe 2015-04-20 19:09:18 Peer not up, flushing message queue It also leaves a pid file behind. It also leaves gaps in the rrd data.
:: Tracing :: By collecting core dumps and using pstack, I was able to see:
__divdi3+0x164(0, 0, 0, 0, 0, 0) do_la_rrd+0x2d4(1017c8, 1017e4, 10181d, 101829, 1, 55369956) update_rrd+0x76c(1017c8, 1017e4, 10182c, 55369956, 1017c0, 69000) main+0xa40(10182c, ffbfdc3c, 101829, 68800, 3, 49538) _start+0x5c(0, 0, 0, 0, 0, 0)
By enabling --debug on the "xymond_channel --channel=status" task (after applying jc's patch of March 13 so that the debug code didn't crash xymond - Thank you!), I was able to see that the last messages received prior to crashes was always a sequence of status messages containing "mem" followed by "cpu".
A "Signal 8" is commonly associated with a "divide by zero" event. Not being very C-aware, I had to consult google to learn that __divdi3 is a library used for unsigned long-int division. So if I'm reading the stack trace correctly, update_rrd called do_la_rrd which tied to do a division which failed.
:: Smoking gun? :: One of the hosts (running a BBPE 4.3 client), whose status message triggered a crash, has the following in its last CPU message:
Memory Statistics Total Physical memory: 0 bytes (1.00MB) Available Physical memory: 0 bytes (1.00MB) Total PageFile size: 17397915648 bytes (16.20GB) Available PageFile size: 12473839616 bytes (11.62GB) Total Virtual memory size: 17397915648 bytes (16.20GB) Available Virtual memory size: 12487081984 bytes (11.63GB)
And has the following in its last MEM message:
Physical Memory: 0.00MB/1.00MB (-1.#J%) Commit Charge: 4.59GB/16.20GB (28.30%)
Memory Statistics (Used/Available/Total in bytes) Physical Memory: 0 (0.00MB) / 0 (1.00MB) / 0 (1.00MB) Commit Charge: 4924112896 (4.59GB) / 12473839616 (11.62GB) / 17397915648 (16.20GB) Virtual memory: 4910481408 (4.57GB) / 12487081984 (11.63GB) / 17397915648 (16.20GB)
It isn't reasonable for a system to actually have 0MB of memory, but that is what the client has reported.
:: Hypothesis ::
The message handling code is accepting messages from clients stating 0MB total physical memory, but such information is making its way into the RRD handler and causing a divide by zero.
Can anyone else test this hypothesis?
Can someone with more C-skills look at do_la_rrd and see if a zero really can find its way into its division statements?
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
On Tue, April 21, 2015 2:04 pm, John Thurston wrote:
It has been a long road, but I may have uncovered a defect in the rrd handler. I'm currently running xymon 4.3.17 (somewhat patched) on Solaris 10 on SPARC.
:: Symptom :: The xymond_rrd process crashes. It leaves footprints in the log like:
2015-04-20 19:09:18 Child process 23929 died: Signal 8 2015-04-20 19:09:18 Peer at 0.0.0.0:0 failed: Broken pipe 2015-04-20 19:09:18 Peer not up, flushing message queue It also leaves a pid file behind. It also leaves gaps in the rrd data.
:: Tracing :: By collecting core dumps and using pstack, I was able to see:
__divdi3+0x164(0, 0, 0, 0, 0, 0) do_la_rrd+0x2d4(1017c8, 1017e4, 10181d, 101829, 1, 55369956) update_rrd+0x76c(1017c8, 1017e4, 10182c, 55369956, 1017c0, 69000) main+0xa40(10182c, ffbfdc3c, 101829, 68800, 3, 49538) _start+0x5c(0, 0, 0, 0, 0, 0)
By enabling --debug on the "xymond_channel --channel=status" task (after applying jc's patch of March 13 so that the debug code didn't crash xymond - Thank you!), I was able to see that the last messages received prior to crashes was always a sequence of status messages containing "mem" followed by "cpu".
A "Signal 8" is commonly associated with a "divide by zero" event. Not being very C-aware, I had to consult google to learn that __divdi3 is a library used for unsigned long-int division. So if I'm reading the stack trace correctly, update_rrd called do_la_rrd which tied to do a division which failed.
:: Smoking gun? :: One of the hosts (running a BBPE 4.3 client), whose status message triggered a crash, has the following in its last CPU message:
Memory Statistics Total Physical memory: 0 bytes (1.00MB) Available Physical memory: 0 bytes (1.00MB) Total PageFile size: 17397915648 bytes (16.20GB) Available PageFile size: 12473839616 bytes (11.62GB) Total Virtual memory size: 17397915648 bytes (16.20GB) Available Virtual memory size: 12487081984 bytes (11.63GB)
And has the following in its last MEM message:
Physical Memory: 0.00MB/1.00MB (-1.#J%) Commit Charge: 4.59GB/16.20GB (28.30%)
Memory Statistics (Used/Available/Total in bytes) Physical Memory: 0 (0.00MB) / 0 (1.00MB) / 0 (1.00MB) Commit Charge: 4924112896 (4.59GB) / 12473839616 (11.62GB) / 17397915648 (16.20GB) Virtual memory: 4910481408 (4.57GB) / 12487081984 (11.63GB) / 17397915648 (16.20GB)
It isn't reasonable for a system to actually have 0MB of memory, but that is what the client has reported.
:: Hypothesis ::
The message handling code is accepting messages from clients stating 0MB total physical memory, but such information is making its way into the RRD handler and causing a divide by zero.
Can anyone else test this hypothesis?
Can someone with more C-skills look at do_la_rrd and see if a zero really can find its way into its division statements?
Yikes.
Yep, seems exactly like that's the case! I believe the following patch should fix it for you. Can you try it out?
HTH,
-jc
On 4/21/2015 4:40 PM, J.C. Cleaver wrote:
- snip -
:: Hypothesis ::
The message handling code is accepting messages from clients stating 0MB total physical memory, but such information is making its way into the RRD handler and causing a divide by zero.
Can anyone else test this hypothesis?
Can someone with more C-skills look at do_la_rrd and see if a zero really can find its way into its division statements?
On Tue, April 21, 2015 2:04 pm, John Thurston wrote:
It has been a long road, but I may have uncovered a defect in the rrd handler. I'm currently running xymon 4.3.17 (somewhat patched) on Solaris 10 on SPARC.
:: Symptom :: The xymond_rrd process crashes. It leaves footprints in the log like:
2015-04-20 19:09:18 Child process 23929 died: Signal 8 2015-04-20 19:09:18 Peer at 0.0.0.0:0 failed: Broken pipe 2015-04-20 19:09:18 Peer not up, flushing message queue It also leaves a pid file behind. It also leaves gaps in the rrd data.
Yep, seems exactly like that's the case! I believe the following patch should fix it for you. Can you try it out?
Thank you!
I created a script with which I could semi-reliably induce a crash by feeding a message claiming 0MB of physical memory. It isn't 100% reliable because I think there is some magic timing I haven't deciphered. But if I wait five or ten minutes between attempts, I can crash the unpatched process with my message.
After applying your patch, I am _unable_ to crash the process with my message. I also found the "report had 0 total physical/pagefile memory listed" text in my rrd-status log.
Now I want to try to grasp the possible consequences of using this patch. Am I correct that by responding to this condition with "return 0", there will not be a call made to do_memory_rrd_update for this host/message combination? And that the worst consequence of this will be a possible gap in the data stored in the rrd for this host?
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
On Wed, April 22, 2015 11:28 am, John Thurston wrote:
On 4/21/2015 4:40 PM, J.C. Cleaver wrote:
- snip -
:: Hypothesis ::
The message handling code is accepting messages from clients stating 0MB total physical memory, but such information is making its way into the RRD handler and causing a divide by zero.
Can anyone else test this hypothesis?
Can someone with more C-skills look at do_la_rrd and see if a zero really can find its way into its division statements?
On Tue, April 21, 2015 2:04 pm, John Thurston wrote:
It has been a long road, but I may have uncovered a defect in the rrd handler. I'm currently running xymon 4.3.17 (somewhat patched) on Solaris 10 on SPARC.
:: Symptom :: The xymond_rrd process crashes. It leaves footprints in the log like:
2015-04-20 19:09:18 Child process 23929 died: Signal 8 2015-04-20 19:09:18 Peer at 0.0.0.0:0 failed: Broken pipe 2015-04-20 19:09:18 Peer not up, flushing message queue It also leaves a pid file behind. It also leaves gaps in the rrd data.
Yep, seems exactly like that's the case! I believe the following patch should fix it for you. Can you try it out?
Thank you!
I created a script with which I could semi-reliably induce a crash by feeding a message claiming 0MB of physical memory. It isn't 100% reliable because I think there is some magic timing I haven't deciphered. But if I wait five or ten minutes between attempts, I can crash the unpatched process with my message.
After applying your patch, I am _unable_ to crash the process with my message. I also found the "report had 0 total physical/pagefile memory listed" text in my rrd-status log.
Great to hear! Unfortunately, it means a search for other un-validated zero divs is probably warranted.
Now I want to try to grasp the possible consequences of using this patch. Am I correct that by responding to this condition with "return 0", there will not be a call made to do_memory_rrd_update for this host/message combination? And that the worst consequence of this will be a possible gap in the data stored in the rrd for this host?
This is correct. Since no report is coming in, RRD will eventually see 'NaN' instead of "0's".
The actual memory *status*, FBoFW, is done via a different calculation (for Solaris, unix_memory_report() in xymond_client.c). It appears that phystotal of 0 there will cause the Physical memory usage to be listed as '0', which would probably not trigger anything on MEMPHYS alerts in analysis.cfg. (Not sure if that's the safest approach, but if there are clients that regularly report in 0 total RAM, the alternative might be worse.)
Regards,
-jc
participants (2)
-
cleaver@terabithia.org
-
john.thurston@alaska.gov