xymonnet - fatal signal caught
Hello,
This is my first post to the mailing list and I was hoping to get some help with an issue I am having.
My Xymon server (version xymon-server-4.3.17_3 on FreeBSD 10) has been running with no issues for months.
This morning xymonnet has crashed. Here is the output from the xymonlaunch.log file:
06:45:07 Task xymonnet terminated by signal 6
06:50:11 Task xymonnet terminated by signal 6
06:55:11 Task xymonnet terminated by signal 6
7:00:12 Task xymonnet terminated by signal 6
07:05:13 Task xymonnet terminated by signal 6
7:10:14 Task xymonnet terminated by signal 6
07:20:16 Task xymonnet terminated by signal 6
Server was rebooted:
07:26:48 xymonlaunch starting
07:26:48 Loading tasklist configuration from (location of file here)
07:26:48 Loading hostnames
07:26:49 Loading saved state
07:26:49 Setting up network listener on 0.0.0.0:1984
07:26:49 Setting up signal handlers
7:26:49 Setting up xymond channels
07:26:49 Setting up logfiles
07:31:02 Task xymonnet terminated by signal 6
07:36:04 Task xymonnet terminated by signal 6
7:41:09 Task xymonnet terminated by signal 6
7:46:11 Task xymonnet terminated by signal 6
Before it started crashing there were no changes (in xymon config files) or updates made. We add new hosts and checks every few days.
I checked the other Xymon logs and things look OK. Other areas of Xymon are working as well.
Maybe I am overlooking a log file and missing something obvious about why it keeps crashing. Any thoughts on where to look next?
Thank you,
--
On Tue, Oct 14, 2014, at 11:21, Wallace Barrow wrote:
Maybe I am overlooking a log file and missing something obvious about why it keeps crashing. Any thoughts on where to look next?
When it crashes does it create core.* files in Xymon's tmp directory? I think that's /usr/local/www/xymon/server/tmp
Perhaps it's related to this post:
http://lists.xymon.com/archive/2014-February/039058.html
If there's a core file could you paste the backtrace info?
cd /usr/local/www/xymon/server
gdb bin/xymonnet tmp/core.64739 (example core file name)
-- GDB informational output will be here --
(gdb) bt <-- type "bt" at the prompt to get the backtrace output and then copy/paste it back to the list
Core file used: Oct 14 11:49 xymonnet.core
A new core file is generated every 10 minutes (each one overwrites the previous one).
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `xymonnet'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/local/lib/libcares.so.2...done.
Loaded symbols for /usr/local/lib/libcares.so.2
Reading symbols from /usr/lib/libssl.so.7...done.
Loaded symbols for /usr/lib/libssl.so.7
Reading symbols from /lib/libcrypto.so.7...done.
Loaded symbols for /lib/libcrypto.so.7
Reading symbols from /usr/local/lib/libpcre.so.3...done.
Loaded symbols for /usr/local/lib/libpcre.so.3
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /lib/libthr.so.3...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /usr/local/lib/nss_ldap.so.1...done.
Loaded symbols for /usr/local/lib/nss_ldap.so.1
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0 0x0000000801457e1a in kill () from /lib/libc.so.7
[New Thread 802006400 (LWP 100281/xymonnet)]
(gdb) bt
#0 0x0000000801457e1a in kill () from /lib/libc.so.7
#1 0x0000000801456ac9 in abort () from /lib/libc.so.7
#2 0x000000000041c051 in sigsegv_handler (signum=<value optimized out>) at sig.c:57
#3 <signal handler called>
#4 0x000000000041028a in dns_simple_callback (arg=0x80259adc0, status=<value optimized out>, timeout=0, hent=0x8021221a0) at dns.c:120
#5 0x0000000800851fe0 in ares_gethostbyname_file () from /usr/local/lib/libcares.so.2
#6 0x0000000800851ec1 in ares_gethostbyname_file () from /usr/local/lib/libcares.so.2
#7 0x000000080085f4f7 in ares_search () from /usr/local/lib/libcares.so.2
#8 0x000000080085f0e6 in ares_search () from /usr/local/lib/libcares.so.2
#9 0x000000080085e858 in ares_query () from /usr/local/lib/libcares.so.2
#10 0x000000080085c5b2 in ares_process_fd () from /usr/local/lib/libcares.so.2
#11 0x000000080085ded6 in ares_process_fd () from /usr/local/lib/libcares.so.2
#12 0x000000080085d5ca in ares_process_fd () from /usr/local/lib/libcares.so.2
#13 0x000000080085bb54 in ares_process () from /usr/local/lib/libcares.so.2
#14 0x000000080085badf in ares_process () from /usr/local/lib/libcares.so.2
#15 0x000000000041010b in dns_ares_queue_run (channel=0x802176000) at dns.c:172
#16 0x0000000000409d15 in main (argc=7, argv=0x7fffffffb7a0) at xymonnet.c:2305
(gdb)
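For anyone wondering why the launcher log reports signal 6 rather than a segmentation fault: frames #1 to #3 of the backtrace show a SIGSEGV handler calling abort(), which ends the process with SIGABRT (signal 6 on FreeBSD and Linux). The following is only a minimal sketch of that pattern, not Xymon's actual sig.c. The names demo_segv_handler and demo_terminating_signal are made up for illustration, and the fault is simulated with raise() instead of a real bad memory access.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a handler like the one at sig.c:57 in the
 * backtrace: on SIGSEGV it aborts so that a core file can be written. */
static void demo_segv_handler(int signum)
{
    (void)signum;
    abort();                        /* raises SIGABRT */
}

/* Fork a child that simulates a fault, then report which signal
 * terminated it. Returns the signal number, or -1 on error. */
int demo_terminating_signal(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: disable core files so the demo leaves nothing behind. */
        struct rlimit no_core = { 0, 0 };
        setrlimit(RLIMIT_CORE, &no_core);

        signal(SIGSEGV, demo_segv_handler);
        raise(SIGSEGV);             /* simulated fault */
        _exit(0);                   /* not reached if the handler aborts */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling demo_terminating_signal() from a small main() should report SIGABRT, which matches the "terminated by signal 6" lines even though the underlying problem is a bad memory access at dns.c:120.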
So far, adding --no-ares to tasks.cfg has let the program start, and things seem to be working.
On Tue, Oct 14, 2014, at 17:53, Wallace Barrow wrote:
So far, adding --no-ares to tasks.cfg has let the program start, and things seem to be working.
I see that this has come up several times before. From a 2007 thread Henrik mentioned that this can happen randomly and was unsure why. Do we have any better ideas how to debug this problem? Would be nice to be able to come up with a permanent solution for everyone so this doesn't happen and catch people off guard.
On 15 October 2014 10:27, Mark Felder <feld at feld.me> wrote:
I see that this has come up several times before. From a 2007 thread Henrik mentioned that this can happen randomly and was unsure why. Do we have any better ideas how to debug this problem? Would be nice to be able to come up with a permanent solution for everyone so this doesn't happen and catch people off guard.
Being a random problem makes it difficult to track down the fault, but if we have a willing participant that can reproduce the fault, we might be able to make progress. From reading the code, it looks like the ARES library is resolving with success, but when xymonnet is copying the resolved address into its own data structure, it fails to copy. The troublesome line 120 is:
memcpy(&dnsc->addr, *(hent->h_addr_list), sizeof(dnsc->addr));
(From my poor knowledge of C) some problems that can arise here are:
a) dnsc->addr or dnsc is null
b) hent->h_addr_list or hent is null
c) dnsc->addr is larger than the address that hent->h_addr_list points to
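To make the three cases above concrete, here is a minimal sketch of a copy that checks each of them before calling memcpy. It is not taken from Xymon or c-ares. The helper name demo_copy_first_addr is hypothetical, and it assumes the destination is a struct in_addr (IPv4 only), which is a guess based on how dnsc->addr is used.

```c
#include <netdb.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not Xymon code): copy the first IPv4 address of a
 * hostent into dst. Returns 0 on success, -1 if a precondition fails. */
int demo_copy_first_addr(struct in_addr *dst, const struct hostent *hent)
{
    /* Cases (a) and (b): a null destination or a null hostent. */
    if (dst == NULL || hent == NULL)
        return -1;

    /* Case (b) continued: a missing or empty address list. */
    if (hent->h_addr_list == NULL || hent->h_addr_list[0] == NULL)
        return -1;

    /* Case (c): the source address must be IPv4 and exactly as large as
     * the destination, otherwise memcpy would read the wrong byte count. */
    if (hent->h_addrtype != AF_INET || hent->h_length != (int)sizeof(*dst))
        return -1;

    memcpy(dst, hent->h_addr_list[0], sizeof(*dst));
    return 0;
}
```

A guard like this would not explain the crash, but it would turn it into a logged failure for the affected hostname.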
Perhaps we need to see the values of these. Wallace, can you recompile after inserting these lines immediately before line 120:
dbgprintf("ARES host=%s\n", hent->h_name);
dbgprintf("ARES status=%d name=%s\n", status, dnsc->name);
dbgprintf("ARES addr size=%zu\n", sizeof(dnsc->addr));
dbgprintf("ARES addr hex=%#lx\n", (unsigned long)dnsc->addr.s_addr);
dbgprintf("ARES addr ascii=%s\n", inet_ntoa(dnsc->addr));
Assuming this compiles correctly for you (it did for me), back up the old xymonnet and copy the newly compiled one into place. Then wait for a core dump, and see what's in the logs.
Warning: This might break your monitoring, so you might not want to use this on a production system, depending on your stability requirements.
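As a rough illustration of what the debug lines above are meant to capture, the sketch below formats a name, an IPv4 address and the address size into one string. It is standalone and does not use Xymon's dbgprintf. The function name demo_format_dns_debug is hypothetical, and it assumes the address is a struct in_addr, as the inet_ntoa call suggests.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not Xymon code): write a one-line description of a
 * resolved IPv4 address into buf. Returns the snprintf result, or -1 if an
 * argument is null or the address cannot be converted to text. */
int demo_format_dns_debug(char *buf, size_t buflen,
                          const char *name, const struct in_addr *addr)
{
    char text[INET_ADDRSTRLEN];

    if (buf == NULL || name == NULL || addr == NULL)
        return -1;

    /* inet_ntop writes into a caller-supplied buffer, unlike inet_ntoa,
     * which returns a pointer to static storage. */
    if (inet_ntop(AF_INET, addr, text, sizeof(text)) == NULL)
        return -1;

    /* %zu is the matching conversion for a size_t such as sizeof(). */
    return snprintf(buf, buflen, "ARES name=%s addr=%s size=%zu",
                    name, text, sizeof(*addr));
}
```

The null checks matter here for the same reason as in the crash itself: a debug statement that dereferences a bad pointer would fault before printing anything useful.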
Alternatively, you might see if you can reproduce the problem by running the xymonnet binary manually, something like this:
xymonnet --debug --no-update name.of.server
If this dumps core, then you should be able to manually run the new binary in the same way, and check the log output for our debug statements.
J
On Tue, Oct 14, 2014, at 21:00, Jeremy Laidman wrote:
Perhaps we need to see the values of these. Wallace, can you recompile after inserting these lines immediately before line 120:
dbgprintf("ARES host=%s\n", hent->h_name);
dbgprintf("ARES status=%d name=%s\n", status, dnsc->name);
dbgprintf("ARES addr size=%zu\n", sizeof(dnsc->addr));
dbgprintf("ARES addr hex=%#lx\n", (unsigned long)dnsc->addr.s_addr);
dbgprintf("ARES addr ascii=%s\n", inet_ntoa(dnsc->addr));
Are you talking about line 120 in xymonnet/dns.c? It's not obvious which file in the xymonnet source code you mean. It would be clearer if you provided a diff in the future.
On Wed, Oct 15, 2014, at 07:38, Mark Felder wrote:
Are you talking about line 120 in xymonnet/dns.c? It's not obvious which file in the xymonnet source code you mean. It would be clearer if you provided a diff in the future.
It does appear Jeremy meant dns.c. I'm providing a diff -- it does compile as Jeremy said.
participants (3)
- feld@feld.me
- incin@incin.me
- jlaidman@rebel-it.com.au