I think I can point to a specific cause for this issue. It seems to be a combination of the "uptime" test being in an alert condition and the same test failing during an exclusion window on the Critical Systems Page.
I have a number of Windows systems monitored for uptime.
In analysis.cfg: UP 10m 37d yellow
In critical.cfg: CTX_Template|uptime|||*:0400:2400|1|EPD|System has rebooted|rchicks 2017-08-04 07:58:11
I also set Xymon to send me alerts for ALL systems between 2:30AM and 3:30AM, the window in which the Critical Systems Page usually goes down.
In alerts.cfg:
HOST=%.*
    MAIL edschminke at hormel.com FORMAT=text REPEAT=1h TIME=*:0230:0330 FORMAT=text
    MAIL edschminke at hormel.com FORMAT=text TIME=*:0230:0330 FORMAT=text RECOVERED
Last night, around 2:45, 4 of these systems were rebooted. As soon as the first email was sent that a system went yellow for uptime, I got the alert that http went red for the Critical Systems Page. When the last email was sent that uptime recovered, I got the alert that http recovered.
This morning, I rebooted a different Windows host. I watched the test go yellow, but the Critical Systems Page was fine. In this case, the condition was within the "Monitoring Time" window. I then went into the Critical Systems Editor and changed the "Monitoring Time" so that the current time fell outside the window (e.g. current time 8AM, window: 12PM-12AM). As soon as I refreshed the Critical Systems Page, it crashed. When I changed the "Monitoring Time" so that the condition was back inside the window (e.g. starting at 4AM) and refreshed, the page loaded fine.
I repeated the same process with a few other tests: disk, memory, cpu. I could not duplicate the problem with those. I think the problem is limited to uptime, but it could well affect others. It also does not seem to matter whether it is the actual host config or a "cloned" host config. The crash happens with both.
If it matters, here's my environment:
I'm currently running Xymon v4.3.27. The OS is Red Hat Enterprise Linux v6.8. Kernel is 2.6.32-431.el6. Architecture is x86_64. glibc version is 2.12-1.192.el6; for what it's worth, both the i686 and x86_64 packages are installed.
A gdb backtrace shows that the crash occurs in a "strncmp" call in lib/loadcriticalconf.c on line 249:
(gdb) backtrace
#0  0x0000003603729420 in __strncmp_sse42 () from /lib64/libc.so.6
#1  0x000000000040fa40 in get_critconfig (key=<value optimized out>, flags=<value optimized out>, resultkey=<value optimized out>) at loadcriticalconf.c:249
#2  0x00000000004030eb in loadstatus (maxprio=3, maxage=31536000, mincolor=3, wantacked=0) at criticalview.c:115
#3  0x00000000004036f0 in main (argc=<value optimized out>, argv=<value optimized out>) at criticalview.c:513
(gdb) frame 1
#1  0x000000000040fa40 in get_critconfig (key=<value optimized out>, flags=<value optimized out>, resultkey=<value optimized out>) at loadcriticalconf.c:249
249         if (strncmp(realkey, rec->key, strlen (realkey)) != 0) handle=xtreeEnd(rbconf);
(gdb) print realkey
$1 = 0x1c20c80 "CTX_Template|uptime"
(gdb) print *rec
$2 = {key = 0x435f6c65746e6957 <Address 0x435f6c65746e6957 out of bounds>,
  priority = 1769236850,
  starttime = 7310575213499737428,
  endtime = 0,
  crittime = 0x1c1d8e0 "Wintel_Critical_Template",
  ttgroup = 0x21 <Address 0x21 out of bounds>,
  ttextra = 0x6364727673737763 <Address 0x6364727673737763 out of bounds>,
  updinfo = 0x3603003d31 <Address 0x3603003d31 out of bounds>}
All of the crash details are still in my GitHub repo at https://github.com/edschminke/xymon ...including the coredump file. I suspect better C developers than myself can put that to a lot better use.
Thanks!
Erik D. Schminke | Associate Systems Programmer Hormel Foods Corporation | One Hormel Place | Austin, MN 55912 Phone: (507) 434-6817 edschminke at hormel.com | www.hormelfoods.com