Good day!
Recently we installed Xymon 4.3.30 on a new VM (CentOS Linux release 7.7.1908 (Core), running as a KVM guest; kernel: 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux).
Everything is OK, except that xymond_rrd crashes frequently: the "xymond_rrd" status is always red (it has never been green), with the message:
- Program crashed Fatal signal caught!
In rrd-status.log we see frequent messages like these:
2019-10-14 14:35:03.609265 Child process 2997 died: Signal 6
2019-10-14 14:35:04.239677 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-14 14:35:08.886124 Peer not up, flushing message queue
2019-10-14 14:36:45.883398 Host 'synologyhost.domain.eu' reports netstat for an unknown OS
2019-10-14 14:36:45.888875 Child process 21622 died: Signal 6
2019-10-14 14:36:52.510319 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-14 14:36:52.510720 Peer not up, flushing message queue
2019-10-14 14:40:02.689062 Host 'synologyhost.domain.eu' reports netstat for an unknown OS
2019-10-14 14:40:02.694320 Child process 28158 died: Signal 6
2019-10-14 14:40:05.119354 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-14 14:40:05.250422 Peer not up, flushing message queue
Note: lines like "Host 'synologyhost.domain.eu' reports netstat for an unknown OS" come from a Synology NAS with the Monitoring package installed. I am sure this is not related: it worked on the old Xymon 4.3.17 (CentOS 6.6).
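One way to see what was logged immediately before each crash is a small script along these lines. This is only a minimal sketch in Python (standard library only); the function name crashes_with_context and the sample list are illustrative, and it assumes one log entry per line in the format shown above.

```python
import re

# Matches entries such as:
#   2019-10-14 14:36:45.888875 Child process 21622 died: Signal 6
CRASH_RE = re.compile(r"Child process (\d+) died: Signal (\d+)")


def crashes_with_context(lines):
    """Return (pid, signal, previous_entry) for every child crash found.

    previous_entry is the non-empty log line directly before the crash
    entry, or None if the crash is the first entry seen.
    """
    results = []
    previous = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = CRASH_RE.search(line)
        if match:
            results.append((int(match.group(1)), int(match.group(2)), previous))
        previous = line
    return results


# Sample entries copied from the rrd-status.log excerpt above.
sample = [
    "2019-10-14 14:36:45.883398 Host 'synologyhost.domain.eu' reports netstat for an unknown OS",
    "2019-10-14 14:36:45.888875 Child process 21622 died: Signal 6",
    "2019-10-14 14:36:52.510319 Peer at 0.0.0.0:0 failed: Broken pipe",
]

for pid, sig, prev in crashes_with_context(sample):
    print(f"pid {pid} died with signal {sig}; previous entry: {prev}")
```

To run it against the real file, the sample list could be replaced by an open file object, since the function only iterates over lines.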
After the fresh installation we just remapped the data directory (with a symbolic link) so that we could keep using the old data logs and rra.
There are plenty of core files under server/tmp/:
srw-rw-rw- 1 xymon monitor       0 Oct 14 14:40 rrdctl.572
-rw------- 1 xymon monitor 3252224 Oct 14 14:45 core.572
srw-rw-rw- 1 xymon monitor       0 Oct 14 14:45 rrdctl.17027
-rw------- 1 xymon monitor 3248128 Oct 14 14:50 core.17027
srw-rw-rw- 1 xymon monitor       0 Oct 14 14:50 rrdctl.30574
-rw------- 1 xymon monitor 3248128 Oct 14 14:55 core.30574
srw-rw-rw- 1 xymon monitor       0 Oct 14 14:55 rrdctl.13275
-rw------- 1 xymon monitor 3239936 Oct 14 15:00 core.13275
-rw-r--r-- 1 xymon monitor 1887355 Oct 14 15:02 xymond.chk
-rw-r--r-- 1 xymon monitor       0 Oct 14 15:02 alert.chk.sub
-rw-r--r-- 1 xymon monitor   70921 Oct 14 15:02 alert.chk
srw-rw-rw- 1 xymon monitor       0 Oct 14 15:02 rrdctl.5887
srw-rw-rw- 1 xymon monitor       0 Oct 14 15:02 rrdctl.5954
-rw------- 1 xymon monitor 3764224 Oct 14 15:05 core.5887
srw-rw-rw- 1 xymon monitor       0 Oct 14 15:05 rrdctl.10234
Question: How can we diagnose the cause of the problem?
Best regards,
Andrey Chervonets
SIA CoMinder http://www.cominder.eu/ mobile: +371 26517848