Xymon 4.3.30-1 memory issues and core dumps
Hi,
After running for 5 hrs on my new installation on a RH 7.9, xymond has already allocated 11.5GB of memory... Last night it core-dumped multiple times, and threw "Cannot allocate memory" in multiple xymon logfiles, ala "newstrbuffer: Attempt to allocate failed (initialsize=1009956863): Cannot allocate memory". Monitoring 1900 hosts currently - on my primary system I do this with only 4 GB of memory with no issues.
Any idea where I should start to look - it's a terabithia installation.
Here's a couple of the core-dumps gdb'ed:
Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols from /usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
done.
[New LWP 10035]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymongen --recentgifs --subpagecolumns=4 --report --max-eventtime=1440 --max-ac'.
Program terminated with signal 6, Aborted.
#0 0x00007f8bb64aa387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x00007f8bb64aa387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007f8bb64aba78 in __GI_abort () at abort.c:90
#2 0x0000561f05bf6115 in sigsegv_handler (signum=<optimized out>) at sig.c:57
#3 <signal handler called>
#4 strbuf_addtobuffer (buf=0x0, newtext=0x561f0701db60 "extcombo", ' ' <repeats 192 times>..., newlen=2000) at strfunc.c:115
#5 0x0000561f05bf79b5 in addtobufferraw (buf=<optimized out>, newdata=<optimized out>, bytes=<optimized out>) at strfunc.c:184
#6 0x0000561f05c00d32 in combo_start () at sendmsg.c:908
#7 0x0000561f05bd7ccb in main (argc=6, argv=0x7ffe092839a8) at xymongen.c:706
Reading symbols from /usr/libexec/xymon/xymonnet...Reading symbols from /usr/lib/debug/usr/libexec/xymon/xymonnet.debug...done.
done.
[New LWP 15437]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymonnet --report --ping --checkresponse --dns-timeout=3 --dnslog=/var/log/xymo'.
Program terminated with signal 6, Aborted.
#0 0x00007f96383f0387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x00007f96383f0387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007f96383f1a78 in __GI_abort () at abort.c:90
#2 0x0000000000422d95 in sigsegv_handler (signum=<optimized out>) at sig.c:57
#3 <signal handler called>
#4 strbuf_addtobuffer (buf=0x0, newtext=0x2a99910 "extcombo", ' ' <repeats 192 times>..., newlen=2000) at strfunc.c:115
#5 0x0000000000424635 in addtobufferraw (buf=<optimized out>, newdata=<optimized out>, bytes=<optimized out>) at strfunc.c:184
#6 0x000000000042d9b2 in combo_start () at sendmsg.c:908
#7 0x00000000004064dc in main (argc=6, argv=0x7ffc4e0055d8) at xymonnet.c:2554
Seems like all core-dumps are from xymonnet and xymongen...
Where do I start?
Regards,
Carl Melgaard
This may not help, but....
I've had core dumps from xymonnet that I *think* are related to running LDAP checks against an Active Directory that won't talk back. Or possibly it gets junk back that it can't understand, I don't know. Sometimes xymonnet core-dumps, other times it hangs. Once it hangs, no more network tests happen - I guess the scheduler sees xymonnet already running and won't start another. I've had to install a cron job that looks for xymonnet running longer than 30 minutes and kills it. This is with xymon-4.3.12, RHEL5, compiled from source.
Can you separate out some of your network tests? Maybe spin up another copy of xymon and offload either your ping or ssh or http(s) tests to it?
Ralph Mitchell
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
On Tue, 15 Dec 2020 at 07:59, Ralph M <ralphmitchell at gmail.com> wrote:
I've had to install a cron job that looks for xymonnet running longer than 30 minutes and kills it.
In tasks.cfg you could add "MAXTIME 30m" for xymonnet. Then xymonlaunch will kill the process automatically.
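To illustrate the idea behind a maximum-runtime limit (this is not xymonlaunch's actual code), here is a minimal sketch in C of a launcher that starts a task in a child process and kills it once a deadline passes. The names run_with_maxtime, hang_forever and finish_quickly are hypothetical and exist only for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-ins: one task that hangs, one that returns at once. */
void hang_forever(void)   { for (;;) pause(); }
void finish_quickly(void) { }

/*
 * Run task() in a child process.
 * Returns 0 if the child ended on its own, 1 if it had to be killed
 * after max_seconds, and -1 if fork() or waitpid() failed.
 */
int run_with_maxtime(void (*task)(void), unsigned max_seconds)
{
    assert(task != NULL);

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                 /* child: do the work, then exit */
        task();
        _exit(0);
    }

    time_t deadline = time(NULL) + (time_t)max_seconds;
    struct timespec nap = { 0, 50 * 1000 * 1000 };  /* poll every 50 ms */
    int status;

    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) return 0;     /* child ended by itself */
        if (r < 0) return -1;
        if (time(NULL) >= deadline) {
            kill(pid, SIGKILL);         /* overdue: terminate it */
            waitpid(pid, &status, 0);   /* reap it so no zombie remains */
            return 1;
        }
        nanosleep(&nap, NULL);
    }
}
```

Calling run_with_maxtime(hang_forever, 1) should return 1 after about a second, which is the same effect the cron job achieves from the outside.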
Hi Carl
On Tue, 15 Dec 2020 at 01:53, Carl Melgaard <Carl.Melgaard at stab.rm.dk> wrote:
Hi,
After running for 5 hrs on my new installation on a RH 7.9, xymond has already allocated 11.5GB of memory...
xymond using a lot of RAM could be something different from the core dumps. But I suspect they're related. For instance, if it's having to keep lots of large combo messages in RAM while other modules send or receive them, but the other modules keep crashing. It's not clear if the xymonnet and xymongen crashes are causing the high RAM usage, or the other way around. It might be worth checking log timestamps to work out what happened first.
Last night it core-dumped multiple times, and threw "Cannot allocate memory" in multiple xymon logfiles, ala "newstrbuffer: Attempt to allocate failed (initialsize=1009956863): Cannot allocate memory".
"Cannot allocate memory" - do you have swap space? Is it being used?
Monitoring 1900 hosts currently - on my primary system I do this with only 4 GB of memory with no issues.
What version of Xymon are you running on the primary system? Similar OS?
Any idea where I should start to look - it's a terabithia installation.
Here's a couple of the core-dumps gdb'ed:
The two core dumps suggest the same cause.
#2 0x0000561f05bf6115 in sigsegv_handler (signum=<optimized out>) at sig.c:57
#3 <signal handler called>
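The sigsegv handler was called, which probably means there was a segmentation violation - typically an attempt to use memory that hasn't been allocated. The handler then called abort(), which is why gdb reports signal 6 (Aborted) rather than a segmentation fault.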
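As a rough illustration of that chain of events (not the code in Xymon's sig.c), the following C sketch installs a SIGSEGV handler that calls abort() and then checks which signal actually terminates the process. The function names are hypothetical, and raise(SIGSEGV) stands in for the real invalid memory access.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Handler in the spirit of the sigsegv_handler frame: give up via abort(). */
void segv_to_abort(int signum)
{
    (void)signum;
    abort();                /* raises SIGABRT, i.e. signal 6 */
}

/*
 * Fork a child that installs the handler and then receives SIGSEGV.
 * Returns the signal that terminated the child, 0 if it exited
 * normally, or -1 on error.
 */
int signal_reported_for_segv(void)
{
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = segv_to_abort;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        raise(SIGSEGV);     /* stand-in for the real bad memory access */
        _exit(0);           /* not reached */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux, signal_reported_for_segv() should return SIGABRT, matching the "Program terminated with signal 6, Aborted" line in the core dumps even though the underlying fault was a segmentation violation.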
I'm not a C programmer, but I'm guessing from this:
#4 strbuf_addtobuffer (buf=0x0, newtext=0x561f0701db60 "extcombo", ' ' <repeats 192 times>..., newlen=2000) at strfunc.c:115
that the code responsible is (in strfunc.c):
void strbuf_addtobuffer(strbuffer_t *buf, char *newtext, size_t newlen)
{
        if (buf->s == NULL) {
                buf->used = 0;
                buf->sz = newlen + BUFSZINCREMENT;
                buf->s = (char *) malloc(buf->sz);
                *(buf->s) = '\0';
        }
The "malloc()" operation may have failed due to running out of memory. Then the next line tries to store a "0" byte into unallocated RAM. I'd guess this would cause a sigsegv.
In other parts of the same file, malloc() is followed by a check for failure, before the memory is used:
For instance:
newbuf->s = (char *)malloc(initialsize);
if (newbuf->s == NULL) {
        errprintf("newstrbuffer: Attempt to allocate failed (initialsize=%d): %s\n", initialsize, strerror(errno));
        xfree(newbuf);
        return NULL;
}
*(newbuf->s) = '\0';
The above error checking has been added to some of the code, but perhaps there are places it still needs to be added.
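For what it's worth, here is a minimal sketch in C of what a more defensive append routine might look like. It is not the Xymon implementation: sketch_strbuf_t, sketch_addtobuffer and BUF_INCREMENT are hypothetical names that only mirror the shape of the strfunc.c excerpt. One detail worth noting is that frame #4 of the backtraces shows buf=0x0, so the buffer pointer itself appears to have been NULL on entry; the sketch therefore checks the pointer as well as the result of malloc().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUF_INCREMENT 4096  /* hypothetical; stands in for BUFSZINCREMENT */

typedef struct {
    char  *s;     /* NUL-terminated data, or NULL if nothing allocated yet */
    size_t used;  /* bytes in use, not counting the terminator */
    size_t sz;    /* bytes allocated */
} sketch_strbuf_t;

/* Append newlen bytes of newtext. Returns 0 on success, -1 on failure. */
int sketch_addtobuffer(sketch_strbuf_t *buf, const char *newtext, size_t newlen)
{
    /* Covers the buf=0x0 case: refuse instead of dereferencing NULL. */
    if (buf == NULL || newtext == NULL) return -1;

    if (buf->s == NULL) {
        size_t sz = newlen + BUF_INCREMENT;
        char *p = malloc(sz);
        if (p == NULL) return -1;       /* check before touching memory */
        p[0] = '\0';
        buf->s = p;
        buf->used = 0;
        buf->sz = sz;
    } else if (buf->used + newlen + 1 > buf->sz) {
        size_t sz = buf->used + newlen + BUF_INCREMENT;
        char *p = realloc(buf->s, sz);
        if (p == NULL) return -1;       /* the old buffer is still valid */
        buf->s = p;
        buf->sz = sz;
    }

    memcpy(buf->s + buf->used, newtext, newlen);
    buf->used += newlen;
    buf->s[buf->used] = '\0';
    return 0;
}
```

A caller would then have to test the return value and skip or retry the message, rather than carry on with a buffer that was never allocated.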
This appears to have happened during the addition of a combo message string to allocated memory, while creating a message to send to xymond (sendmsg.c):
#5 0x0000561f05bf79b5 in addtobufferraw (buf=<optimized out>, newdata=<optimized out>, bytes=<optimized out>) at strfunc.c:184
#6 0x0000561f05c00d32 in combo_start () at sendmsg.c:908
Combo messages can be large, and this could a) cause increased RAM usage, or b) be affected by it. Again, it's not clear if the behaviour of xymongen and xymonnet is the cause of your problems or the result of them.
It looks like the call to combo_start() in xymongen is in code that runs as a result of the "--report" switch. In xymonnet, it's in common code, but there seems to be a modified code path available if you were to add "--bfq" (backfeed queue). I know nothing about the backfeed queue feature, but there's a little about it in the README.backfeed file.
So some things to consider, mostly just work-arounds and troubleshooting:
- Make sure you have swap enabled, and monitor swap-in/swap-out.
- See if anything else is using excessive RAM.
- Play with combo message sizes. Perhaps a smaller size would help. You can set MAXMSGSPERCOMBO in xymonserver.cfg.
- Run an older version of Xymon on your new installation, perhaps the same as your current installation. Or perhaps just copy the binaries for xymonnet and/or xymongen to the new server?
- Patch the strfunc.c file to include the malloc error checking. You'd need to get the SRPM from Terabithia and build it yourself. Only the xymonnet and xymongen binaries would need to be replaced.
- Profile the xymond process's memory usage. I'm not sure how to do this. Perhaps you can get it to dump core, then analyse the core (perhaps just run "strings" over it) to see what's using up all the memory. Perhaps there's some gdb techniques for this.
- Try running xymongen without "--report", and xymonnet with "--bfq" or "--no-bfq".
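On the memory-profiling suggestion, a cheap first step on Linux is to log a process's resident and virtual size at intervals and watch which one grows. The sketch below is one possible way to do that in C by reading /proc/<pid>/status. The helper names status_field_kb and rss_kb_of_pid are made up for this example; VmRSS and VmSize are the field names the Linux kernel uses in that file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up a "Key:   12345 kB" field in text laid out like
 * /proc/<pid>/status. Returns the value in kB, or -1 if not found.
 */
long status_field_kb(const char *text, const char *key)
{
    size_t keylen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, keylen) == 0 && line[keylen] == ':') {
            long kb;
            if (sscanf(line + keylen + 1, "%ld", &kb) == 1) return kb;
            return -1;
        }
        line = strchr(line, '\n');      /* move to the next line, if any */
        if (line != NULL) line++;
    }
    return -1;
}

/* Read /proc/<pid>/status and return VmRSS in kB, or -1 on any failure. */
long rss_kb_of_pid(long pid)
{
    char path[64];
    char text[8192];

    snprintf(path, sizeof path, "/proc/%ld/status", pid);
    FILE *fp = fopen(path, "r");
    if (fp == NULL) return -1;

    size_t n = fread(text, 1, sizeof text - 1, fp);
    fclose(fp);
    text[n] = '\0';
    return status_field_kb(text, "VmRSS");
}
```

Calling rss_kb_of_pid() with the xymond PID every few minutes and logging the result would show whether the growth is in resident memory or only in the virtual size.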
Hope that helps.
Cheers Jeremy
Hi,
Thanks for the thorough walkthrough!
Combo messages can be large, and this could a) cause increased RAM usage, or b) be affected by it. Again, it's not clear if the behaviour of xymongen and xymonnet is the cause of your problems or the result of them.
It looks like the call to combo_start() in xymongen is in code that runs as a result of the "--report" switch. In xymonnet, it's in common code, but there seems to be a modified code path available if you were to add "--bfq" (backfeed queue). I know nothing about the backfeed queue feature, but there's a little about it in the README.backfeed file.
So some things to consider, mostly just work-arounds and troubleshooting:
- Make sure you have swap enabled, and monitor swap-in/swap-out.
Swap is enabled, current usage:
        Memory          Used    Total   Percentage
[green] Real/Physical   15706M  15885M  98%
[green] Actual/Virtual  13054M  15885M  82%
[green] Swap/Page       0M      8063M   0%
The old server running CentOS 5 and Xymon 4.37 is running on 4 GB of memory and using 96% - with the same tests...
- See if anything else is using excessive RAM.
It's mostly just xymon processes eating up RAM:
xymon 1116 0.0 0.0 37940 1540 ? Ss Dec14 0:01 /usr/sbin/xymonlaunch --no-daemon --log=/var/log/xymon/xymonlaunch.log
xymon 1138 0.5 1.0 12040844 171980 ? S Dec14 8:13 xymond --restart=/var/lib/xymon/tmp/xymond.chk --checkpoint-file=/var/lib/xymon/tmp/xymond.chk --checkpoint-interval=600 --admin-senders=127.0.0.1,<x.x.x.x> --store-clientlogs=!msgs
xymon 1695 0.0 0.0 6212724 8240 ? S Dec14 0:24 xymond_channel --channel=stachg xymond_history
xymon 1696 0.0 0.0 6211996 2764 ? S Dec14 0:26 xymond_channel --channel=page xymond_alert --checkpoint-file=/var/lib/xymon/tmp/alert.chk --checkpoint-interval=600
xymon 1697 0.0 0.0 6213976 8628 ? S Dec14 0:49 xymond_channel --channel=client xymond_client
xymon 1698 0.0 0.0 6213804 8704 ? S Dec14 1:17 xymond_channel --channel=status xymond_rrd --rrddir=/var/lib/xymon/rrd
xymon 1699 0.0 0.0 6211864 2084 ? S Dec14 0:12 xymond_channel --channel=data xymond_rrd --rrddir=/var/lib/xymon/rrd
xymon 1700 0.0 0.0 6212580 8392 ? S Dec14 0:00 xymond_channel --channel=clichg xymond_hostdata
xymon 1743 0.1 31.9 6315044 5200828 ? S Dec14 1:46 xymond_rrd --rrddir=/var/lib/xymon/rrd
xymon 1744 0.1 31.9 6218584 5194192 ? S Dec14 1:53 xymond_client
xymon 1745 0.0 0.4 5235992 74140 ? S Dec14 0:00 xymond_history
xymon 1746 0.0 0.3 5229068 48912 ? S Dec14 0:05 xymond_alert --checkpoint-file=/var/lib/xymon/tmp/alert.chk --checkpoint-interval=600
xymon 1747 0.0 6.8 6306576 1121700 ? S Dec14 0:32 xymond_rrd --rrddir=/var/lib/xymon/rrd
xymon 2213 0.0 0.0 5225548 15060 ? S Dec14 0:00 xymond_hostdata
xymon 13022 0.0 0.0 116340 3024 pts/0 S 10:53 0:00 -bash
xymon 14689 0.0 0.0 113420 1568 ? S 10:59 0:00 /bin/sh /usr/share/xymon/ext/ntpd.sh
xymon 14822 0.0 0.0 9568 1140 ? S 10:59 0:00 /bin/sh
xymon 14823 0.0 0.0 9568 1136 ? S 10:59 0:00 /bin/sh
xymon 14828 0.0 0.0 49016 1264 ? S 10:59 0:00 vmstat 300 2
xymon 14829 0.0 0.0 49016 1264 ? S 10:59 0:00 vmstat 300 2
xymon 15680 0.0 0.0 113420 720 ? S 11:02 0:00 /bin/sh /usr/share/xymon/ext/ntpd.sh
xymon 15681 0.0 0.0 23652 1504 ? S 11:02 0:00 /usr/sbin/ntpdate -t 1 -p 5 -u -q <server>
- Play with combo message sizes. Perhaps a smaller size would help. You can set MAXMSGSPERCOMBO in xymonserver.cfg.
- Profile the xymond process's memory usage. I'm not sure how to do this. Perhaps you can get it to dump core, then analyse the core (perhaps just run "strings" over it) to see what's using up all the memory. Perhaps there's some gdb techniques for this.
- Try running xymongen without "--report", and xymonnet with "--bfq" or "--no-bfq".
I'll play around with combomsg sizes and try omitting the reports - doing a bt full, I get this output:
Reading symbols from /usr/libexec/xymon/xymonnet...Reading symbols from /usr/lib/debug/usr/libexec/xymon/xymonnet.debug...done.
done.
[New LWP 15566]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymonnet --report --ping --checkresponse --dns-timeout=3 --dnslog=/var/log/xymo'.
Program terminated with signal 6, Aborted.
#0 0x00007f7484027387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt full
#0 0x00007f7484027387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
        resultvar = 0
        pid = 15566
        selftid = 15566
#1 0x00007f7484028a78 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 47120176, sa_restorer = 0x0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x0000000000422d95 in sigsegv_handler (signum=<optimized out>) at sig.c:57
No locals.
#3 <signal handler called>
No locals.
#4 strbuf_addtobuffer (buf=0x0, newtext=0x2ceff30 "extcombo", ' ' <repeats 192 times>..., newlen=2000) at strfunc.c:115
No locals.
#5 0x0000000000424635 in addtobufferraw (buf=<optimized out>, newdata=<optimized out>, bytes=<optimized out>) at strfunc.c:184
No locals.
#6 0x000000000042d9b2 in combo_start () at sendmsg.c:908
No locals.
#7 0x00000000004064dc in main (argc=6, argv=0x7ffd5619f5b8) at xymonnet.c:2554
        msg = "PING test completed (1913 hosts)", '\000' <repeats 479 times>
        handle = <optimized out>
        s = <optimized out>
        h = <optimized out>
        t = <optimized out>
        argi = <optimized out>
        concurrency = <optimized out>
        pingcolumn = <optimized out>
        egocolumn = <optimized out>
        failgoesclear = <optimized out>
        dumpdata = <optimized out>
        runtimewarn = 300
        servicedumponly = <optimized out>
        pingrunning = 1
        usebackfeedqueue = 0
        force_backfeedqueue = <optimized out>
        network_count = <optimized out>
Which looks like the report part?
My xymond.chk file is 55 MB - is that an issue?
Regards,
Carl
participants (3)
- Carl.Melgaard@STAB.RM.DK
- jeremy@laidman.org
- ralphmitchell@gmail.com