Hi Carl
On Tue, 15 Dec 2020 at 01:53, Carl Melgaard <Carl.Melgaard at stab.rm.dk> wrote:
Hi,
After running for 5 hrs on my new installation on a RH 7.9, xymond has already allocated 11.5GB of memory?
xymond using a lot of RAM could be something different from the core dumps. But I suspect they're related. For instance, xymond may be having to keep lots of large combo messages in RAM while other modules send or receive them, but those other modules keep crashing. It's not clear if the xymonnet and xymongen crashes are causing the high RAM usage, or the other way around. It might be worth checking log timestamps to work out what happened first.
Last night it core-dumped multiple times, and threw "Cannot allocate memory" in multiple xymon logfiles, ala "newstrbuffer: Attempt to allocate failed (initialsize=1009956863): Cannot allocate memory".
"Cannot allocate memory" - do you have swap space? Is it being used?
Monitoring 1900 hosts currently - on my primary system I do this with only 4 GB of memory with no issues.
What version of Xymon are you running on the primary system? Similar OS?
Any idea where I should start to look - it's a Terabithia installation.
Here's a couple of the core dumps, gdb'ed:
The two core dumps suggest the same cause.
#2 0x0000561f05bf6115 in sigsegv_handler (signum=<optimized out>) at sig.c:57
#3 <signal handler called>
The sigsegv handler was called, which probably means there was a segmentation violation - typically an access to memory that hasn't been allocated, such as through a NULL pointer.
I'm not a C programmer, but I'm guessing from this:
#4 strbuf_addtobuffer (buf=0x0, newtext=0x561f0701db60 "extcombo", ' ' <repeats 192 times>..., newlen=2000) at strfunc.c:115
that the code responsible is (in strfunc.c):
void strbuf_addtobuffer(strbuffer_t *buf, char *newtext, size_t newlen)
{
        if (buf->s == NULL) {
                buf->used = 0;
                buf->sz = newlen + BUFSZINCREMENT;
                buf->s = (char *) malloc(buf->sz);
                *(buf->s) = '\0';
        }
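The "malloc()" operation may have failed due to running out of memory. Then the next line tries to store a "0" byte into unallocated RAM. I'd guess this would cause a sigsegv. If I'm reading frame #4 right, though, it shows buf=0x0, so the buffer pointer itself may have been NULL; in that case the very first test of buf->s would fault before malloc() is even reached. That would fit with newstrbuffer() having returned NULL after the failed allocation reported in your logs.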
In other parts of the same file, malloc() is followed by a check for failure before the memory is used. For instance:
newbuf->s = (char *)malloc(initialsize);
if (newbuf->s == NULL) {
        errprintf("newstrbuffer: Attempt to allocate failed (initialsize=%d): %s\n", initialsize, strerror(errno));
        xfree(newbuf);
        return NULL;
}
*(newbuf->s) = '\0';
The above error checking has been added to some of the code, but perhaps there are places it still needs to be added.
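To illustrate the kind of checking I mean, here is a rough standalone sketch. It is not the real Xymon code and not a tested patch: demo_strbuffer_t, demo_addtobuffer() and DEMO_BUFSZINCREMENT are names I made up that only mimic the fields visible in the excerpt above (s, used, sz). It refuses a NULL buffer pointer and checks every allocation before writing to it:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration, NOT the real Xymon definitions. */
#define DEMO_BUFSZINCREMENT 4096

typedef struct demo_strbuffer_t {
    char *s;      /* NUL-terminated data, or NULL if nothing is allocated yet */
    size_t used;  /* bytes in use, not counting the terminating NUL */
    size_t sz;    /* bytes allocated */
} demo_strbuffer_t;

/*
 * Append newlen bytes of newtext to buf.
 * Returns 0 on success, -1 if an argument is NULL or memory cannot be
 * obtained. On failure the buffer is left unchanged.
 */
int demo_addtobuffer(demo_strbuffer_t *buf, const char *newtext, size_t newlen)
{
    if (buf == NULL || newtext == NULL) {
        fprintf(stderr, "demo_addtobuffer: called with a NULL argument\n");
        return -1;
    }

    if (buf->s == NULL) {
        /* First use: allocate, and check the result before touching it. */
        size_t want = newlen + DEMO_BUFSZINCREMENT;
        char *p = malloc(want);
        if (p == NULL) {
            fprintf(stderr, "demo_addtobuffer: malloc(%zu) failed: %s\n",
                    want, strerror(errno));
            return -1;
        }
        p[0] = '\0';
        buf->s = p;
        buf->used = 0;
        buf->sz = want;
    }
    else if (buf->used + newlen + 1 > buf->sz) {
        /* Grow via a temporary so the old block survives a failed realloc. */
        size_t want = buf->used + newlen + 1 + DEMO_BUFSZINCREMENT;
        char *p = realloc(buf->s, want);
        if (p == NULL) {
            fprintf(stderr, "demo_addtobuffer: realloc(%zu) failed: %s\n",
                    want, strerror(errno));
            return -1;
        }
        buf->s = p;
        buf->sz = want;
    }

    memcpy(buf->s + buf->used, newtext, newlen);
    buf->used += newlen;
    buf->s[buf->used] = '\0';
    return 0;
}
```

A caller would then have to test the return value and skip or retry the message rather than carry on. Whether the real callers in sendmsg.c could cope with that is something I haven't looked at.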
This appears to have happened during the addition of a combo message string to allocated memory, while creating a message to send to xymond (sendmsg.c):
#5 0x0000561f05bf79b5 in addtobufferraw (buf=<optimized out>, newdata=<optimized out>, bytes=<optimized out>) at strfunc.c:184
#6 0x0000561f05c00d32 in combo_start () at sendmsg.c:908
Combo messages can be large, and this could a) cause increased RAM usage, or b) be affected by it. Again, it's not clear if the behaviour of xymongen and xymonnet is the cause of your problems or the result of them.
It looks like the call to combo_start() in xymongen is in code that runs as a result of the "--report" switch. In xymonnet, it's in common code, but there seems to be a modified code path available if you were to add "--bfq" (backfeed queue). I know nothing about the backfeed queue feature, but there's a little about it in the README.backfeed file.
So some things to consider, mostly just work-arounds and troubleshooting:
- Make sure you have swap enabled, and monitor swap-in/swap-out.
- See if anything else is using excessive RAM.
- Play with combo message sizes. Perhaps a smaller size would help. You can set MAXMSGSPERCOMBO in xymonserver.cfg.
- Run an older version of Xymon on your new installation, perhaps the same as your current installation. Or perhaps just copy the binaries for xymonnet and/or xymongen to the new server?
- Patch the strfunc.c file to include the malloc error checking. You'd need to get the SRPM from Terabithia and build it yourself. Only the xymonnet and xymongen binaries would need to be replaced.
- Profile the xymond process's memory usage. I'm not sure how to do this. Perhaps you can get it to dump core, then analyse the core (perhaps just run "strings" over it) to see what's using up all the memory. Perhaps there are some gdb techniques for this.
- Try running xymongen without "--report", and xymonnet with "--bfq" or "--no-bfq".
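For the swap and memory-usage suggestions above, one low-tech approach is to sample the kernel's own counters at intervals and line them up against the log timestamps. Below is a rough sketch of that, assuming Linux, where /proc/<pid>/status carries lines such as "VmRSS:" and "VmSwap:" in kB. The function names demo_status_kb() and demo_read_proc_mem() are my own invention for illustration and have nothing to do with Xymon:

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up a "Key:   <number> kB" line in text laid out like
 * /proc/<pid>/status. Returns the value in kB, or -1 if the key is
 * missing or has no number after it.
 */
long demo_status_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* Match the whole key: it must be followed directly by ':'. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;
            if (sscanf(line + klen + 1, "%ld", &kb) == 1) return kb;
            return -1;
        }
        line = strchr(line, '\n');   /* move on to the next line, if any */
        if (line != NULL) line++;
    }
    return -1;
}

/*
 * Read /proc/<pid>/status and report resident and swapped-out memory
 * in kB. Returns 0 on success, -1 if the file cannot be opened.
 */
int demo_read_proc_mem(long pid, long *rss_kb, long *swap_kb)
{
    char path[64];
    char text[8192];
    size_t n;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%ld/status", pid);
    fp = fopen(path, "r");
    if (fp == NULL) return -1;

    n = fread(text, 1, sizeof(text) - 1, fp);
    fclose(fp);
    text[n] = '\0';

    *rss_kb = demo_status_kb(text, "VmRSS");
    *swap_kb = demo_status_kb(text, "VmSwap");
    return 0;
}
```

Calling demo_read_proc_mem() with xymond's PID every minute or so, and printing the two numbers with a timestamp, should show whether the growth is steady or jumps at the moments xymonnet or xymongen crash. The same demo_status_kb() parser should also work on the contents of /proc/meminfo, for example with the keys "SwapTotal" and "SwapFree", since that file uses the same "Key: value kB" layout.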
Hope that helps.
Cheers Jeremy