[Xymon] xymond crashing! -- Please help!

30 Jan 2016 · *debug*


Hi J.C.,
It appears that moving the checkpoint file only fixed it temporarily.
If I stop the service and start it back up, it crashes again.
I think I figured out how to read the core file and get a backtrace for you.
Here's what I got from the most recent crash (with some host names
obfuscated):
[New LWP 13283]
Reading symbols from /usr/sbin/xymond...Reading symbols from
/usr/lib/debug/usr/sbin/xymond.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/33/97b0d696701dbd7c09eb4bf023f7f4eebec9ed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymond --restart=/var/lib/xymon/tmp/xymond.chk
--checkpoint-file=/var/lib/xymon'.
Program terminated with signal 6, Aborted.
#0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-106.el7_2.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64
libselinux-2.2.2-6.el7.x86_64 lz4-r131-1.el7.x86_64
openssl-libs-1.0.1e-51.el7_2.2.x86_64 pcre-8.32-15.el7.x86_64
xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) backtrace
#0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
#1  0x00007f570e29bce8 in abort () from /lib64/libc.so.6
#2  0x00007f570f53cdf5 in sigsegv_handler (signum=<optimized out>) at
sig.c:57
#3  <signal handler called>
#4  0x00007f570f5403b4 in xtree_i_compare (pa=0x7ffead8cb9a0,
pb=0x2020202020202020) at tree.c:47
#5  0x00007f570e3574c0 in tfind () from /lib64/libc.so.6
#6  0x00007f570f5405d4 in xtreeFind (treehandle=<optimized out>,
key=key@entry=0x7f57142cb320 "*<client hostname>*") at tree.c:140
#7  0x00007f570f5386bd in get_clientconfig
(hostname=hostname@entry=0x7f57142cb320
"*<client hostname>*", hostclass=hostclass@entry=0x7f57208e4612 "linux",
hostos=hostos@entry=0x7f57208e460c "linux") at clientlocal.c:192
#8  0x00007f570f535dec in do_message (msg=msg@entry=0x7f572064c300,
origin=origin@entry=0x7f570f550e97 "", can_respond=can_respond@entry=1) at
xymond.c:4955
#9  0x00007f570f5282c7 in main (argc=<optimized out>, argv=<optimized out>)
at xymond.c:6288
Is this what you wanted? Do you want me to install the debug package for
glibc or other packages?
Let me know what I can do.
Thanks!!
--
Matt Vander Werf
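One detail in frame #4 may be worth noting: pb=0x2020202020202020 is not a
plausible pointer. Each 0x20 byte is an ASCII space, so the value reads as
eight spaces of text. That would be consistent with (though not proof of)
text data having overwritten a tree node. A small sketch for decoding such a
value as bytes; the variable names are illustrative only:

```shell
# Decode a suspicious pointer value, byte by byte, as ASCII.
# The value is taken from frame #4 of the backtrace above.
ptr=2020202020202020

decoded=""
rest=$ptr
while [ -n "$rest" ]; do
    byte=${rest%"${rest#??}"}   # first two hex digits
    rest=${rest#??}             # drop them from the remainder
    # Convert the hex byte to an octal escape and let printf emit the character.
    decoded="$decoded$(printf "\\$(printf '%03o' "0x$byte")")"
done

printf 'decoded: [%s] (%s bytes)\n' "$decoded" "${#decoded}"
```

For this value, the brackets should contain nothing but eight spaces.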
On Sat, Jan 30, 2016 at 1:10 PM, Matt Vander Werf <matt1299 at gmail.com>
wrote:
...
Hi J.C.,
Moving the xymond.chk checkpoint file out of the way after it was stopped
seemed to fix this (at least so far).
I see that I lost all record of disabled tests (getting alerts for things
that were disabled).
Exactly what data did I lose by moving that checkpoint file out of the way?
Is there any way to get the data back? Or maybe to find the corruption in
the checkpoint file and then move the file back into place?
Also, see my most recent e-mail with the xymonlaunch log (if you haven't
already). Looks like this has happened in the past but resolved itself....
Regarding the backtrace....
I put those lines in /etc/sysconfig/xymonlaunch and I see the core files
being generated now.
I feel embarrassed to admit this, but how exactly do I get the backtrace
out of the binary core files, besides trying to read the files with an
editor? Any way to know which core file had the backtrace?
Also, I see this in journalctl:
Ignoring invalid environment assignment 'export
DAEMON_COREFILE_LIMIT=unlimited': /etc/sysconfig/xymonlaunch
Thanks for your help!!
--
Matt Vander Werf
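On the backtrace question: a core file is a memory image rather than text,
so it has to be loaded into gdb together with the binary that produced it.
Every core contains the state at the moment of its own crash, so the newest
one is usually the one to look at. A minimal sketch, assuming gdb is
installed and that the cores land in /var/lib/xymon/tmp with names beginning
with "core" (the naming pattern is an assumption and depends on the system's
core_pattern setting):

```shell
# Assumed locations; override via the environment if yours differ.
CORE_DIR=${CORE_DIR:-/var/lib/xymon/tmp}
XYMOND_BIN=${XYMOND_BIN:-/usr/sbin/xymond}

# Print the most recently modified core file in a directory (empty if none).
newest_core() {
    ls -t "$1"/core* 2>/dev/null | head -n 1
}

core=$(newest_core "$CORE_DIR")
if [ -n "$core" ] && command -v gdb >/dev/null 2>&1; then
    # -batch runs the given command and exits instead of opening a prompt.
    gdb -batch -ex backtrace "$XYMOND_BIN" "$core"
else
    echo "no core file found in $CORE_DIR (or gdb is not installed)"
fi
```

Running gdb interactively as "gdb /usr/sbin/xymond <core file>" and typing
"backtrace" at the (gdb) prompt gives the same result.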
On Sat, Jan 30, 2016 at 12:39 PM, J.C. Cleaver <cleaver at terabithia.org>
wrote:
...
Hi Matt,
The log lines you're seeing are actually from the new xymond process
trying to start up, then failing because the port is already in use. I
think the timeout right below it is from the previous process's signal
handler giving up, based on the timestamps.
Can you get a backtrace from xymond's core file? It should be left in
/var/lib/xymon/tmp/, or in the (*shudder*) systemd journal somewhere...
If your system is set not to keep them by default, add
''
export DAEMON_COREFILE_LIMIT="unlimited"
ulimit -c unlimited
''
to /etc/sysconfig/xymonlaunch
I suspect there might be something corrupted in the xymond checkpoint
file.
First, do a 'service xymon stop' and make sure all xymon processes are
completely gone, including any xymond's still pending, then start xymon
back up. If it crashes again, do the same, but move the
/var/lib/xymon/xymond.chk checkpoint file out of the way after it's off,
and let it come back up.
If it *still* doesn't come up, there's something else going on. Either
way, a full backtrace will help let us see where exactly it's dying.
HTH,
-jc
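The checkpoint step above can be sketched as a small helper that renames the
file rather than deleting it, so it stays available for inspection or for
moving back later. The function name and the ".bad.<timestamp>" suffix are
illustrative only, and the path is passed in as an argument because it
depends on how xymond was started:

```shell
# Move a checkpoint file aside, keeping it under a timestamped name.
# Prints the new name on success.
set_aside() {
    chk=$1
    [ -f "$chk" ] || { echo "no checkpoint at $chk" >&2; return 1; }
    backup="$chk.bad.$(date +%Y%m%d%H%M%S)"
    mv -- "$chk" "$backup" && echo "$backup"
}

# Intended sequence (run as root, with xymon fully stopped first):
#   service xymon stop
#   pgrep -l xymon          # should list nothing before continuing
#   set_aside /var/lib/xymon/xymond.chk
#   service xymon start
```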
On Sat, January 30, 2016 8:28 am, Matt Vander Werf wrote:
...
As a followup, xymond seems to try to start itself up again after a while
(probably because xymonlaunch is still running), works fine for a short
while, and then crashes again with the same messages and results.
--
Matt Vander Werf
On Sat, Jan 30, 2016 at 11:21 AM, Matt Vander Werf <matt1299 at gmail.com>
wrote:
...
Hello,
I'm having a major issue with xymond crashing shortly after the service
starts.
I'm using the latest Terabithia RPM for RHEL 7 (4.3.24-3.el7.terabithia).
When I check the status of the xymon service, it shows it as up, but with
only the xymonlaunch parent process and vmstat processes. Upon restarting
the service, I see it start normally (all the normal channel processes,
etc.), and then after a while they all go away, leaving the following
process behind:
       ├─2760 xymon-signal 0.0.0.0 status+1d/group:signal <server hostname>.xymond red (Check time of report) - xymond program crashed Fatal signal caught!
along with the xymonlaunch process and some vmstat processes. After a
while that process goes away. Sometimes a single xymond_rrd will also show
up alongside the xymonlaunch and vmstat processes after a little while.
I'm already running xymond in --debug mode.
This is what I see in the xymond log around the time of the crash:
2773 2016-01-30 11:02:32.515505 Status: Host=<host>, test=ntp
2773 2016-01-30 11:02:32.515507  -- create_hostlist_t for <host> (<client IP address>)
2773 2016-01-30 11:02:32.515513 Status: Host=<host>, test=conn
2773 2016-01-30 11:02:32.515520 Status: Host=<host>, test=raid
2773 2016-01-30 11:02:32.515529 Status: Host=<host>, test=memory
2773 2016-01-30 11:02:32.515534 Status: Host=<host>, test=files
2773 2016-01-30 11:02:32.515670 Status: Host=<host>, test=procs
2773 2016-01-30 11:02:32.515879 Status: Host=<host>, test=inode
2773 2016-01-30 11:02:32.515891 Status: Host=<host>, test=disk
2773 2016-01-30 11:02:32.516004 Status: Host=<host>, test=cpu
2773 2016-01-30 11:02:32.516605 Loaded 14419 status logs
2016-01-30 11:02:32 Setting up network listener on 0.0.0.0:1984
2016-01-30 11:02:32.516677 Cannot bind to listen socket (Address already in use)
2016-01-30 11:02:59.538906 Whoops ! Failed to send message (Timeout)
2016-01-30 11:02:59.539020 ->
2016-01-30 11:02:59.539023 ->  Recipient '<server IP address>', timeout 50
2016-01-30 11:02:59.539024 ->  1st line: 'status+1d/group:signal <server hostname>.xymond red (Check time of report) - xymond program crashed'
It seems to finish loading all the hosts and then it crashes (the last
host before it crashes is the last client I have alphabetically).
I've tried stopping the service, killing off any remaining xymon-owned
processes, and starting the service again, with the same results. I've also
tried restarting the xymon server machine itself, with the same crash
happening when the service starts the first time.
This just started happening out of the blue a couple of hours ago...
Looking in netstat, there are no active connections using port 1984 on the
local side, just a bunch of clients trying to connect to the server with
1984 in the foreign address.
ANY help would be much appreciated as currently our Xymon server is not
working!!
Thanks!!
--
Matt Vander Werf
