On Wed, June 10, 2015 10:01 am, Scot Kreienkamp wrote:
Hi everyone,
I have a xymon server running 4.3.21 that seems to be accumulating processes like these:
hobbit 28430 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28435 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28440 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28444 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28449 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28452 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
It seemed related to drop messages, so I did a test.
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
161
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
162
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
163
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
164
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
165
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
166
[hobbit at retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
167
So every time I send a drop message, one more defunct process is left hanging around. Bug in Xymon?
This is on RHEL5, xymon 4.3.21.
Thanks!
Scot,
Some background: When doing a full drop on a host, xymond_hostdata (and xymond_history, IIRC) forks a child to perform the recursive directory removal of history files and whatnot in the background; the child exits when it's done. That's why the defunct processes correspond to those drop events.
Looks like xymond_hostdata.c is missing a SIGCHLD handler registration, so the exited children are never reaped and the defunct processes stack up. Strangely, I haven't observed this behavior on RHEL6 at all, even though we're dropping hosts all the time. Odd.
The following patch should fix the issue for you, I believe.
Regards,
-jc