To get more information I enabled "--debug" on both channels (status and data). This gives a bit more information in rrd-status.log:

....
2019-10-17 13:40:02.376153 Host 'synologyhost.domain.eu' reports netstat for an unknown OS
408 2019-10-17 13:40:02.376181 Flush, but xymonmsg is empty
408 2019-10-17 13:40:02.376185 0 status messages merged into 1 transmissions
408 2019-10-17 13:40:02.376203 xymond_rrd: Got message 612 @@status#612/synologyhost.domain.eu|1571308802.357389|83.99.221.6||synologyhost.domain.eu|procs|1571326802|green||green|1570620002|0||0||1571051696||p_cominder|0|
408 2019-10-17 13:40:02.376210 startpos 95710, fillpos 99309, endpos 97006
408 2019-10-17 13:40:02.376227 Flush, but xymonmsg is empty
408 2019-10-17 13:40:02.376233 0 status messages merged into 1 transmissions
408 2019-10-17 13:40:02.376244 xymond_rrd: Got message 613 @@status#613/synologyhost.domain.eu|1571308802.357673|83.99.221.6||synologyhost.domain.eu|raid|1571326802|green||green|1570620002|0||0||1571051696||p_cominder|0|
408 2019-10-17 13:40:02.376251 startpos 97010, fillpos 99309, endpos 97945
408 2019-10-17 13:40:02.376269 Flush, but xymonmsg is empty
408 2019-10-17 13:40:02.376276 0 status messages merged into 1 transmissions
408 2019-10-17 13:40:02.376288 xymond_rrd: Got message 614 @@status#614/synologyhost.domain.eu|1571308802.368308|83.99.221.6||synologyhost.domain.eu|temperature|1571326802|green||green|1570620002|0||0||1571051696||p_cominder|0|
408 2019-10-17 13:40:02.376294 startpos 97949, fillpos 99309, endpos 98645
2019-10-17 13:40:02.381339 Child process 408 died: Signal 6
2019-10-17 13:40:04.432302 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-17 13:40:04.452708 Peer not up, flushing message queue
13920 2019-10-17 13:40:04.557656 setup_feedback_queue: got ID -1 for key 0xA03EB91
13920 2019-10-17 13:40:04.558141 Opening file /u01/app/xymon/product/xymon4.3.30/server/etc/rrddefinitions.cfg
13920 2019-10-17 13:40:04.558326 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=1052671
13920 2019-10-17 13:40:04.558359 Got 6716 bytes
...

Here we can see the processing of data from our Synology NAS, which has the Synology Monitoring Tool 1.4.8 (http://www.sysco.ch/synomon/) enabled. Note that despite the RRD crash, the "temperature" status itself is displayed correctly, with text like:

--
Device             Temp(C) Temp(F)
---------------------------------------
green system       52      125
green /dev/sda     36      96
green /dev/sdb     38      100
green /dev/sdd     36      96
---------------------------------------
Synology Monitoring Tool 1.4.8, http://www.sysco.ch/synomon/
Model: RS812+ (synologyhost,domain.eu)
Processor: Intel(R) Atom(TM) CPU D2701 @ 2.13GHz
System temperature: 52°C
Serial number: serialnumberdata-replaced
Firmware: 6.2-24922
MAC address(s): number-replaced, number-replaced
Linux version 3.10.105 (root at build10) (gcc version 4.9.3 20150311 (prerelease) (crosstool-NG 1.20.0) ) #24922 SMP Fri May 10 02:51:01 CST 2019
--

After stopping the plugin on the Synology we received no more data from it, and xymond_rrd stopped crashing (red changed to purple, as expected).

I am not sure where the problem/bug is, so I have added the Synology Monitoring Tool developers' e-mail to this thread. Please review and give us a hint on how we can fix the problem; monitoring the state of our NAS is critical for us.

The suspicion was also confirmed by GDB output (obtained as instructed at: http://www.robertandrobert.com/xymon/help/known-issues.html ):

--
[xymon at synologyhost server]$ /bin/gdb /u01/app/xymon/product/xymon4.3.30/server/bin/xymond_rrd tmp/core.408
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
... copyright... ...
Reading symbols from /u01/app/xymon/product/xymon4.3.30/server/bin/xymond_rrd...done.
[New LWP 408]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymond_rrd --rrddir=/u01/app/xymon/product/xymon4.3.30/data/rrd --debug'.
Program terminated with signal 6, Aborted.
#0 0x00007f62fcd85337 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cairo-1.15.12-4.el7.x86_64 expat-2.1.0-10.el7_3.x86_64 fontconfig-2.13.0-4.3.el7.x86_64 freetype-2.8-14.el7.x86_64 fribidi-1.0.2-1.el7.x86_64 glib2-2.56.1-5.el7.x86_64 glibc-2.17-292.el7.x86_64 graphite2-1.3.10-1.el7_3.x86_64 harfbuzz-1.7.5-2.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 libX11-1.6.7-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libXrender-0.9.10-1.el7.x86_64 libcom_err-1.42.9-16.el7.x86_64 libffi-3.0.13-18.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libglvnd-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-egl-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-glx-1.0.1-0.8.git5baa1e5.el7.x86_64 libpng-1.5.13-7.el7_2.x86_64 libselinux-2.5-14.1.el7.x86_64 libthai-0.1.14-9.el7.x86_64 libtirpc-0.2.4-0.16.el7.x86_64 libuuid-2.23.2-61.el7.x86_64 libxcb-1.13-1.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pango-1.42.4-4.el7_7.x86_64 pcre-8.32-17.el7.x86_64 pixman-0.34.0-1.el7.x86_64 rrdtool-1.4.8-9.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb)
(gdb)
(gdb) bt
#0 0x00007f62fcd85337 in raise () at /lib64/libc.so.6
#1 0x00007f62fcd86a28 in abort () at /lib64/libc.so.6
#2 0x0000000000428e63 in sigsegv_handler (signum=<optimized out>) at sig.c:57
#3 0x00007f62fcd853b0 in <signal handler called> () at /lib64/libc.so.6
#4 0x00007f62fcd89f97 in ____strtoll_l_internal () at /lib64/libc.so.6
#5 0x000000000040f9c2 in do_temperature_rrd (__nptr=0x0) at /usr/include/stdlib.h:280
#6 0x000000000040f9c2 in do_temperature_rrd (hostname=hostname at entry=0x7f62fdfceb43 "synologyhost.domain.eu", testname=testname at entry=0x7f62fdfceb58 "temperature", classname=classname at entry=0x7f62fdfceb99 "p_cominder", pagepaths=pagepaths at entry=0x7f62fdfceba4 "0", msg=msg at entry=0x7f62fdfceba7 "status+300 synologyhost,domain.eu.temperature green 2019-10-17 13:40:01 [synologyhost.domain.eu] - temperature\nDevice", ' ' <repeats 13 times>, "Temp(C) Temp(F)\n", '-' <repeats 39 times>, "\n&green system"..., tstamp=tstamp at entry=1571308802) at rrd/do_temperature.c:100
#7 0x000000000041316b in update_rrd (hostname=hostname at entry=0x7f62fdfceb43 "synologyhost.domain.eu", testname=<optimized out>, testname at entry=0x7f62fdfceb58 "temperature", msg=msg at entry=0x7f62fdfceba7 "status+300 synologyhost,domain.eu.temperature green 2019-10-17 13:40:01 [synologyhost.domain.eu] - temperature\nDevice", ' ' <repeats 13 times>, "Temp(C) Temp(F)\n", '-' <repeats 39 times>, "\n&green system"..., tstamp=tstamp at entry=1571308802, sender=sender at entry=0x7f62fdfceb36 "83.99.221.6", ldef=<optimized out>, classname=classname at entry=0x7f62fdfceb99 "p_cominder", pagepaths=pagepaths at entry=0x7f62fdfceba4 "0") at do_rrd.c:714
#8 0x0000000000403434 in main (argc=<optimized out>, argv=0x7ffffb4bd4b8) at xymond_rrd.c:391
(gdb)
--

So we know which metric causes the RRD crash: the backtrace ends in do_temperature_rrd while it handles the "temperature" status. We have a workaround (stopping the plugin, so that RRD keeps working and generates graphs for the other metrics), but we need a better solution so that everything works as expected.

P.S. Note: the real hostname has been replaced in all output included in this e-mail (in case any checksums are used).

Best regards,
Andrey Chervonets
----------------------
CoMinder Support
http://www.cominder.eu/
mobile: +371 26517848

"Xymon" <xymon-bounces at xymon.com> wrote on 15.10.2019 13:00:01:
From: xymon-request at xymon.com
To: xymon at xymon.com
Date: 15.10.2019 13:00
Subject: Xymon Digest, Vol 105, Issue 9
Sent by: "Xymon" <xymon-bounces at xymon.com>
----------------------------------------------------------------------
Message: 1
Date: Mon, 14 Oct 2019 15:09:53 +0300
From: Andrey Chervonets <A.Chervonets at cominder.eu>
To: xymon at xymon.com
Subject: [Xymon] xymond_rrd - Program crashed after fresh install of Xymon 4.3.30 and data from Xymon 4.3.17
Message-ID: <OFD5D1CD2D.3E1D4B14-ONC2258493.00408D6C-C2258493.0042D300 at cominder.eu>
Content-Type: text/plain; charset="us-ascii"
Good day!
Recently we installed Xymon 4.3.30 on a new VM (CentOS Linux release 7.7.1908 (Core), a guest under KVM).
Guest kernel: 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Everything is OK, except that xymond_rrd crashes frequently: the "xymond_rrd" metric is always red (it has never been green), with the message:
- Program crashed
Fatal signal caught!
In rrd-status.log we can find frequent messages like:
2019-10-14 14:35:03.609265 Child process 2997 died: Signal 6
2019-10-14 14:35:04.239677 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-14 14:35:08.886124 Peer not up, flushing message queue
2019-10-14 14:36:45.883398 Host 'synologyhost.domain.eu' reports netstat for an unknown OS
2019-10-14 14:36:45.888875 Child process 21622 died: Signal 6
2019-10-14 14:36:52.510319 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-14 14:36:52.510720 Peer not up, flushing message queue
2019-10-14 14:40:02.689062 Host 'synologyhost.domain.eu' reports netstat for an unknown OS
2019-10-14 14:40:02.694320 Child process 28158 died: Signal 6
2019-10-14 14:40:05.119354 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-14 14:40:05.250422 Peer not up, flushing message queue
Note: lines like "Host 'synologyhost.domain.eu' reports netstat for an unknown OS" are coming from a Synology NAS with the Monitoring package installed. I am sure they are not related: this was working on the old Xymon 4.3.17 (CentOS 6.6).
After the fresh installation we just remapped the data directory (with a symbolic link) so that we can continue to use the old data logs and RRA files.
There are plenty of core files under server/tmp/:
srw-rw-rw- 1 xymon monitor 0 Oct 14 14:40 rrdctl.572
-rw------- 1 xymon monitor 3252224 Oct 14 14:45 core.572
srw-rw-rw- 1 xymon monitor 0 Oct 14 14:45 rrdctl.17027
-rw------- 1 xymon monitor 3248128 Oct 14 14:50 core.17027
srw-rw-rw- 1 xymon monitor 0 Oct 14 14:50 rrdctl.30574
-rw------- 1 xymon monitor 3248128 Oct 14 14:55 core.30574
srw-rw-rw- 1 xymon monitor 0 Oct 14 14:55 rrdctl.13275
-rw------- 1 xymon monitor 3239936 Oct 14 15:00 core.13275
-rw-r--r-- 1 xymon monitor 1887355 Oct 14 15:02 xymond.chk
-rw-r--r-- 1 xymon monitor 0 Oct 14 15:02 alert.chk.sub
-rw-r--r-- 1 xymon monitor 70921 Oct 14 15:02 alert.chk
srw-rw-rw- 1 xymon monitor 0 Oct 14 15:02 rrdctl.5887
srw-rw-rw- 1 xymon monitor 0 Oct 14 15:02 rrdctl.5954
-rw------- 1 xymon monitor 3764224 Oct 14 15:05 core.5887
srw-rw-rw- 1 xymon monitor 0 Oct 14 15:05 rrdctl.10234
Question: How can we diagnose what is the cause of the problem?
Best regards,
Andrey Chervonets
----------------------
SIA CoMinder
http://www.cominder.eu/
mobile: +371 26517848
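
Addendum on the backtrace earlier in this message: frame #5 shows do_temperature_rrd passing a NULL pointer (__nptr=0x0) into a strtol-family call, reached from rrd/do_temperature.c:100. What the code at that line actually does, and which row of the report triggers it, is not established here. The following is therefore only a minimal sketch in C of the kind of guard that avoids this class of crash, assuming each table row is split into whitespace-separated tokens. The function name parse_temp_row and its signature are hypothetical and are not taken from the Xymon source.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (NOT Xymon's actual code): parse one row of the
 * temperature table, e.g. "&green system 52 125".
 * Returns 0 and fills device/celsius on success, -1 if the row does not
 * have the expected shape. The point is that every token is checked for
 * NULL before it is handed to strtol().
 */
int parse_temp_row(const char *row, char *device, size_t devsz, int *celsius)
{
    char buf[256];
    char *color, *dev, *temp_c, *end;
    long val;

    assert(device != NULL && devsz > 0 && celsius != NULL);

    /* Reject missing or over-long rows instead of overflowing buf. */
    if (row == NULL || strlen(row) >= sizeof(buf)) return -1;
    strcpy(buf, row);   /* strtok() modifies its input, so work on a copy */

    /* strtok() keeps static state, so this sketch is not thread-safe. */
    color  = strtok(buf, " \t");
    dev    = strtok(NULL, " \t");
    temp_c = strtok(NULL, " \t");

    /* A short row leaves temp_c == NULL; passing that to strtol() or
     * atoi() would dereference a NULL pointer. */
    if (color == NULL || color[0] != '&' || dev == NULL || temp_c == NULL)
        return -1;

    val = strtol(temp_c, &end, 10);
    if (end == temp_c) return -1;   /* token did not start with a number */

    snprintf(device, devsz, "%s", dev);
    *celsius = (int)val;
    return 0;
}
```

With this guard a header line, a separator line, or a row with a missing column is skipped (return value -1) rather than crashing the process; a caller would simply ignore such rows and continue with the next line of the status message.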