I was playing around with hobbit-clients.cfg file trying to create a LOG rule to ignore this alert:
May 21 00:54:21 redirect1-bo3.dl2.example.com monit[2029]: [ID 111343 daemon.error] 'gmond-sample.xml' timestamp test failed for /usr/local/Ganglia/logs/gmond-sample.xml
I *think* the rule that put it into conniptions was:
HOST=%redirect.*bo3.dl2.example.com LOG /var/adm/messages COLOR=yellow IGNORE=%(repeated|gmond|monit|puppetd)
Also, I am experiencing something I've seen a few other times this week - a service that is not reporting, that was signed out, stays blue even when signed back in. I can't get rid of the xymond_client blue. Where is blue status stored? (it does not appear as blue on the enable/disable page but I have a blue dot on the host page and a blue report when I drill in)
[xymon at netmon2 server]$ gdb bin/hobbitd_client tmp/core.24453
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /u1/xymon/server/bin/hobbitd_client...done.
Reading symbols from /lib64/libpcre.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpcre.so.0
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
#0 0x0000003833430265 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003833430265 in raise () from /lib64/libc.so.6
#1 0x0000003833431d10 in abort () from /lib64/libc.so.6
#2 0x0000000000427133 in sigsegv_handler (signum=<value optimized out>) at sig.c:57
#3 <signal handler called>
#4 0x00000000004179f6 in scan_log (hinfo=0x1679440, classname=0x2b9863ae507e "sunos", logname=0x2b9863aee44b "/var/adm/messages",
    logdata=0x2b9863aee45e "May 21 00:57:25 redirect2-bo3.dl2.e-dialog.com last message repeated 36 times\nMay 21 00:57:35 redirect2-bo3.dl2.example.com monit[10418]: [ID 111343 daemon.error] 'gmond-sample.xml' timestamp test fa"...,
    section=<value optimized out>, summarybuf=0x1683a80) at client_config.c:2491
#5 0x0000000000408d0a in msgs_report (hostname=0x2b9863ae5059 "redirect2-bo3.dl2.example.com", clientclass=0x2b9863ae507e "sunos", os=<value optimized out>, hinfo=0x1679440,
    fromline=0x7fff00bf2c50 "\nStatus message received from 10.200.32.51\n", timestr=0x2b9863ae50be "Sat May 21 01:11:24 EDT 2011", msgsstr=0x0) at xymond_client.c:1221
#6 0x000000000040fd6a in handle_solaris_client (hostname=0x2b9863ae5059 "redirect2-bo3.dl2.example.com", clienttype=0x2b9863ae507e "sunos", os=OS_SOLARIS, hinfo=0x1679440,
    sender=<value optimized out>, timestamp=<value optimized out>, clientdata=0x2b9863ae5085 "client redirect2-bo3,dl2,example,com.sunos sunos\n[date") at client/solaris.c:69
#7 0x0000000000411e5f in main (argc=<value optimized out>, argv=0x7fff00bf3368) at xymond_client.c:2199
Hi Elizabeth,
I was playing around with hobbit-clients.cfg [...]
Which version of Xymon is this? Since you're referring to hobbit-clients.cfg and hobbitd_client, I would assume it is 4.2.something, but that doesn't match some of the line numbers.
So I'll assume it's 4.3.something - the interesting line hasn't changed between the 4.3.x releases:
[xymon at netmon2 server]$ gdb bin/hobbitd_client tmp/core.24453
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2)
#2 0x0000000000427133 in sigsegv_handler (signum=<value optimized out>) at sig.c:57
#3 <signal handler called>
#4 0x00000000004179f6 in scan_log (hinfo=0x1679440, classname=0x2b9863ae507e "sunos", logname=0x2b9863aee44b "/var/adm/messages",
    logdata=0x2b9863aee45e "May 21 00:57:25 redirect2-bo3.dl2.e-dialog.com last message repeated 36 times\nMay 21 00:57:35 redirect2-bo3.dl2.example.com monit[10418]: [ID 111343 daemon.error] 'gmond-sample.xml' timestamp test fa"...,
    section=<value optimized out>, summarybuf=0x1683a80) at client_config.c:2491
#5 0x0000000000408d0a in msgs_report (hostname=0x2b9863ae5059 "redirect2-bo3.dl2.example.com", clientclass=0x2b9863ae507e "sunos", os=<value optimized out>, hinfo=0x1679440,
    fromline=0x7fff00bf2c50 "\nStatus message received from 10.200.32.51\n", timestr=0x2b9863ae50be "Sat May 21 01:11:24 EDT 2011", msgsstr=0x0) at xymond_client.c:1221
Looking at xymond/client_config.c, line 2491 reads:
/* Next, check for a match anywhere in the data*/
if (!patternmatch(logdata, rule->rule.log.matchexp->pattern,
rule->rule.log.matchexp->exp)) continue;
So I'd like to know a bit more about the state of some of those variables. Could you go back into gdb and then instead of getting the callstack, run these three commands:
p rule
p *rule
p *(rule->rule.log.matchexp)
If I'm unlucky, the "rule" variable will have been optimized out....
Also, I am experiencing something I've seen a few other times this week - a service that is not reporting, that was signed out, stays blue even when signed back in.
A blue status won't change to another color until it gets a status update (red, yellow or green).
I can't get rid of the xymond_client blue.
The xymond_client status shows up because you had a crash of the xymond_client module. Use xymon 127.0.0.1 "drop YOURXYMONSERVER xymond_client" to get rid of it.
Regards, Henrik
Thanks Henrik! I'm running 4.3.2
(tried sending a green status to get rid of the blue, and after some hours it turned purple, waking us up again)
p rule
p *rule
p *(rule->rule.log.matchexp)
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
#0 0x0000003833430265 in raise () from /lib64/libc.so.6
(gdb) p rule
No symbol "rule" in current context.
(gdb) p *rule
No symbol "rule" in current context.
(gdb) p *(rule->rule.log.matchexp)
No symbol "rule" in current context.
(I can recompile with other flags if you point me to it)
thanks much Betsy
PS just to be clear doing the *drop* did work. I'd tried the green status last night
On Sat, May 21, 2011 at 1:23 PM, Elizabeth Schwartz <betsy.schwartz at gmail.com> wrote:
Thanks Henrik! I'm running 4.3.2
(tried sending a green status to get rid of the blue, and after some hours it turned purple, waking us up again)
p rule
p *rule
p *(rule->rule.log.matchexp)
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
#0 0x0000003833430265 in raise () from /lib64/libc.so.6
(gdb) p rule
No symbol "rule" in current context.
Doh, sorry. You have to do a "fr 4" first to select that stack-frame.
Again, please ?
Thanks, Henrik
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
#0 0x0000003833430265 in raise () from /lib64/libc.so.6
(gdb) fr 4
#4 0x00000000004179f6 in scan_log (hinfo=0x1679440, classname=0x2b9863ae507e "sunos", logname=0x2b9863aee44b "/var/adm/messages",
    logdata=0x2b9863aee45e "May 21 00:57:25 redirect2-bo3.dl2.e-dialog.com last message repeated 36 times\nMay 21 00:57:35 redirect2-bo3.dl2.e-dialog.com monit[10418]: [ID 111343 daemon.error] 'gmond-sample.xml' timestamp test fa"...,
    section=<value optimized out>, summarybuf=0x1683a80) at client_config.c:2491
2491 if (!patternmatch(logdata, rule->rule.log.matchexp->pattern, rule->rule.log.matchexp->exp)) continue;
(gdb) p rule
$1 = (c_rule_t *) 0x168ad90
(gdb) p *rule
$2 = {hostexp = 0x168a7c0, exhostexp = 0x0, pageexp = 0x0, expageexp = 0x0, classexp = 0x0, exclassexp = 0x0, timespec = 0x0,
  statustext = 0x0, rrdidstr = 0x0, groups = 0x0, ruletype = C_LOG, cfid = 435, flags = 0, next = 0x168afa0,
  rule = {load = {warnlevel = 4.27369127e-38, paniclevel = 0},
    uptime = {recentlimit = 23637648, ancientlimit = 0},
    clock = {maxdiff = 23637648},
    disk = {fsexp = 0x168ae90, warnlevel = 0, paniclevel = 0, abswarn = 23637712, abspanic = 0, dmin = 4, dmax = 0, dcount = 0, color = 0, ignored = 0},
    inode = {fsexp = 0x168ae90, warnlevel = 0, paniclevel = 0, abswarn = 23637712, abspanic = 0, imin = 4, imax = 0, icount = 0, color = 0, ignored = 0},
    mem = {memtype = 23637648, warnlevel = 0, paniclevel = 0},
    zos_mem = {zos_memtype = 23637648, warnlevel = 0, paniclevel = 0},
    zvse_vsize = {warnlevel = 23637648, paniclevel = 0},
    zvse_getvis = {partid = 0x168ae90, warnlevel = 0, paniclevel = 0, anywarnlevel = 0, anypaniclevel = 0},
    cics = {applid = 0x168ae90, dsawarnlevel = 0, dsapaniclevel = 0, edsawarnlevel = 0, edsapaniclevel = 0},
    asid = {asidtype = 23637648, warnlevel = 0, paniclevel = 0},
    proc = {procexp = 0x168ae90, pmin = 0, pmax = 0, pcount = 0, color = 0},
    log = {logfile = 0x168ae90, matchexp = 0x0, matchone = 0x0, ignoreexp = 0x168aed0, color = 4},
    fcheck = {filename = 0x168ae90, color = 0, ftype = 0, minsize = 0, maxsize = 23637712, eqlsize = 4, minlinks = 0, maxlinks = 0, eqllinks = 0,
      fmode = 0, ownerid = 0, groupid = 0, ownerstr = 0x0, groupstr = 0x0, minctimedif = 0, maxctimedif = 0, ctimeeql = 0,
      minmtimedif = 0, maxmtimedif = 0, mtimeeql = 0, minatimedif = 0, maxatimedif = 0, atimeeql = 0, md5hash = 0x0, sha1hash = 0x0, rmd160hash = 0x0},
    dcheck = {filename = 0x168ae90, color = 0, maxsize = 0,
---Type <return> to continue, or q <return> to quit---
      minsize = 23637712},
    port = {localexp = 0x168ae90, exlocalexp = 0x0, remoteexp = 0x0, exremoteexp = 0x168aed0, stateexp = 0x4, exstateexp = 0x0, pmin = 0, pmax = 0, pcount = 0, color = 0},
    svc = {svcexp = 0x168ae90, stateexp = 0x0, startupexp = 0x0, svcname = 0x168aed0 "\360\256h\001", startup = 0x4 <Address 0x4 out of bounds>, state = 0x0, scount = 0, color = 0},
    paging = {warnlevel = 23637648, paniclevel = 0},
    mibval = {mibvalexp = 0x168ae90, keyexp = 0x0, color = 0, minval = 23637712, maxval = 4, matchexp = 0x0, havetree = 0, valdeftree = 0x0},
    rrdds = {rrdkey = 0x168ae90, rrdds = 0x0, column = 0x0, color = 23637712, limitval = 1.9762625833649862e-323, limitval2 = 0},
    mqqueue = {qmgrname = 0x168ae90, qname = 0x0, warnlen = 0, critlen = 0, warnage = 23637712, critage = 0},
    mqchannel = {qmgrname = 0x168ae90, chnname = 0x0, warnstates = 0x0, alertstates = 0x168aed0}}}
(gdb) p *(rule->rule.log.matchexp)
Cannot access memory at address 0x0
(gdb)
Hi Elizabeth,
(gdb) p rule $1 = (c_rule_t *) 0x168ad90
OK.
(gdb) p *rule
[snip]
log = {logfile = 0x168ae90, matchexp = 0x0, matchone = 0x0, ignoreexp = 0x168aed0, color = 4},
[snip]
(gdb) p *(rule->rule.log.matchexp)
Cannot access memory at address 0x0
Definitely not OK. The LOG check comes without any expression to match the log data against ("matchexp" is a NULL pointer). That explains why it crashes when we try to use the expression in line 2491:
if (!patternmatch(logdata, rule->rule.log.matchexp->pattern,
rule->rule.log.matchexp->exp)) continue;
Now, the "matchexp" setting is built from the regex in the LOG statement. If this turns out to be an invalid regex, it should log a line in the xymond_client logfile like
pcre compile 'your-pattern-here' failed (offset N): <error message>
So could you check if there's such a message in your log?
Still, it is inconvenient that xymond_client crashes because of a configuration error. I'll look into improving that.
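For illustration only, here is a minimal C sketch of the kind of guard that would turn this crash into a skipped rule. The types sketch_expr_t and sketch_log_rule_t and the functions patternmatch_stub() and rule_applies() are hypothetical stand-ins invented for this example, and a plain substring search replaces the PCRE match. This is not the actual Xymon code or the eventual fix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the structures in the gdb dump. */
typedef struct {
    const char *pattern;      /* source text of the expression */
    void *exp;                /* compiled expression; unused in this sketch */
} sketch_expr_t;

typedef struct {
    sketch_expr_t *matchexp;  /* NULL when the LOG rule has no match pattern */
    sketch_expr_t *ignoreexp; /* may legitimately be NULL */
} sketch_log_rule_t;

/* Stand-in for patternmatch(): substring search instead of a PCRE match. */
static int patternmatch_stub(const char *data, const sketch_expr_t *e)
{
    return strstr(data, e->pattern) != NULL;
}

/* Returns 1 if the rule should be applied to logdata, 0 if it is skipped.
 * The NULL test comes first, so a rule without a match pattern is
 * skipped instead of being dereferenced. */
static int rule_applies(const sketch_log_rule_t *rule, const char *logdata)
{
    if (rule == NULL || rule->matchexp == NULL) return 0;
    return patternmatch_stub(logdata, rule->matchexp);
}
```

With a guard of this kind, a LOG rule that carries only an IGNORE pattern would simply never match, rather than aborting the module.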
Regards, Henrik
Hi,
I was playing around with hobbit-clients.cfg file trying to create a LOG rule to ignore this alert:
May 21 00:54:21 redirect1-bo3.dl2.example.com monit[2029]: [ID 111343 daemon.error] 'gmond-sample.xml' timestamp test failed for /usr/local/Ganglia/logs/gmond-sample.xml
I *think* the rule that put it into conniptions was:
HOST=%redirect.*bo3.dl2.example.com LOG /var/adm/messages COLOR=yellow IGNORE=%(repeated|gmond|monit|puppetd)
This would trigger it, because there is no match-pattern, only an ignore-pattern.
If you want to match all lines, use something like
LOG /var/adm/messages %. COLOR=yellow IGNORE=%...
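As a rough illustration of how a match pattern and an ignore pattern can interact, the C sketch below counts the lines that would be reported. It assumes both patterns are tested against each complete line, and it uses POSIX regex.h in place of the PCRE library that Xymon uses. count_reported_lines() is a hypothetical helper written for this example, not a Xymon function.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Counts the lines in text that match match_re and do not match ignore_re.
 * ignore_re may be NULL (no ignore pattern). Lines longer than the local
 * buffer are truncated before matching. Returns -1 if a pattern fails
 * to compile. */
static int count_reported_lines(const char *text, const char *match_re,
                                const char *ignore_re)
{
    regex_t match, ignore;
    char line[1024];
    int count = 0;

    if (regcomp(&match, match_re, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    if (ignore_re && regcomp(&ignore, ignore_re, REG_EXTENDED | REG_NOSUB) != 0) {
        regfree(&match);
        return -1;
    }

    while (*text) {
        size_t len = strcspn(text, "\n");   /* length of the current line */
        size_t n = (len < sizeof(line) - 1) ? len : sizeof(line) - 1;
        memcpy(line, text, n);
        line[n] = '\0';

        /* Report the line only if it matches and is not ignored. */
        if (regexec(&match, line, 0, NULL, 0) == 0 &&
            (!ignore_re || regexec(&ignore, line, 0, NULL, 0) != 0)) {
            count++;
        }

        text += len;
        if (*text == '\n') text++;          /* step over the newline */
    }

    regfree(&match);
    if (ignore_re) regfree(&ignore);
    return count;
}
```

In this sketch a match pattern of "." selects every non-empty line, and the ignore pattern then removes the lines that should not be reported.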
Regards, Henrik
Thank you! I've fixed the offending test. Not knowing how to make it stop being blue/purple was the biggest problem, which you've also answered
pcre compile 'your-pattern-here' failed (offset N): <error message>
No errors with pcre anywhere in the log directory (possibly because it crashed first). I don't have a xymond_client log; I do have a hobbitclient.log, but it's empty.
Ideally what I want to do is this:
- ignore *any* errors from lpr
- go yellow on any level of error from puppet or gmond
- go yellow on any other "warning" message
- go red on any other "error" message
I think I need to use "greedier" regexps for the match that include white space. Am I understanding correctly that if I match on %ERROR then only the string "error" gets passed to the IGNORE statement?
thanks
Betsy
PS I would enjoy seeing other people's LOG tests if anyone has good ones
participants (2)
- betsy.schwartz@gmail.com
- henrik@hswn.dk