I have recently migrated from a large BigBrother/bbgen installation (hosts.cfg of 5300 lines) to Xymon 4.3.12. Surprisingly, there have been very few issues, and performance is very good compared to BigBrother/bbgen. However, we have just experienced a potentially major issue with alerting.
Our issue seems to be with alerts not being generated for a rule if the initial event transition to red is not within the "TIME" range for an alerting rule.
An example follows of the behaviour experienced: The "http" service went down for a system "butterfly.soe.uq.edu.au" at 03:07am and recovered 3 days later:
Mon Jan 20 10:14:31 2014  green  1 days 4:50:51
Fri Jan 17 03:07:44 2014  red    3 days 7:06:47
Alerting for this test is as follows:
alerts.cfg:
$AISMAILSVCS=cifs,cont,cpu,disk,fping,http,inode,login,loginc,memory,ssh,sslcert,rtmpe,rtmps,rtmpt,svcs,xfer_proxy_c,xfer_proxy_e,xfer_proxy_k
$AISSMSSVCS=cifs,cont,cpu,disk,fping,http,inode,login,loginc,memory,ssh,sslcert,rtmpe,rtmps,rtmpt,svcs
$AISTFHSVCS=fping,http,login
web/proxy/other/cert alerts
PAGE=%its-ais/ais-(web|proxy|other).*
    MAIL ais-web at domain SERVICE=$AISMAILSVCS DURATION>2m COLOR=red REPEAT=1w FORMAT=PLAIN RECOVERED
    MAIL ais-web-sms at domain SERVICE=$AISSMSSVCS DURATION>6m TIME=*:0701:2159 COLOR=red REPEAT=1w FORMAT=SMS RECOVERED
The "info" test output displays alerting rules as:
Alerting:
Service  Recipient                  1st Delay  Stop after  Repeat  Time of Day  Colors
         ais-web at domain (R)      2m 1s      -           1w      -            red
         ais-web-sms at domain (R)  6m 1s      -           1w      *:0701:2159  red
============
The notification log displays only email alert/recovery for "ais-web at domain", nothing for "ais-web-sms at domain" recipient:
Time                      Host                     Service  Recipient
Mon Jan 20 10:14:47 2014  butterfly.soe.uq.edu.au  http     ais-web at domain
Fri Jan 17 03:10:29 2014  butterfly.soe.uq.edu.au  http     ais-web at domain
No notification was sent to "ais-web-sms at domain" by the second "MAIL" rule above after its start time of 07:01 on the morning following the failure, even though the "http" test remained red for 3 days.
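As an illustration of the expectation described above (this is not Xymon's code), the following minimal sketch computes when the second rule would first be eligible to fire, assuming an alert becomes due once the event has lasted the DURATION and the clock is inside the daily TIME window. The function name `expected_first_alert` is hypothetical, the strictness of `>` in `DURATION>6m` is ignored, and treating the end minute of the window as inclusive is an assumption.

```python
from datetime import datetime, time, timedelta


def expected_first_alert(event_start, min_duration, window_start, window_end):
    """Earliest time at which the event is old enough AND the clock is
    inside the daily window (minute resolution, end minute inclusive;
    the inclusiveness is an assumption, not taken from Xymon)."""
    candidate = event_start + min_duration
    minute_of_day = candidate.time().replace(second=0, microsecond=0)
    if window_start <= minute_of_day <= window_end:
        # Already inside the window once the duration has elapsed.
        return candidate
    if minute_of_day < window_start:
        # Too early in the day: wait for the window to open today.
        return datetime.combine(candidate.date(), window_start)
    # Too late in the day: wait for the window to open tomorrow.
    return datetime.combine(candidate.date() + timedelta(days=1), window_start)


# Values from the report: http went red at Fri Jan 17 03:07:44 2014,
# the SMS rule has DURATION>6m and TIME=*:0701:2159.
went_red = datetime(2014, 1, 17, 3, 7, 44)
first_sms = expected_first_alert(
    went_red, timedelta(minutes=6), time(7, 1), time(21, 59)
)
print(first_sms)  # 2014-01-17 07:01:00
```

Under these assumptions the SMS recipient should have been notified at 07:01 on Fri Jan 17, roughly four hours after the event started and long before the recovery on Mon Jan 20.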
Manually testing the alerting rules with:
~/server/bin/xymoncmd xymond_alert --test butterfly.soe.uq.edu.au http --duration=362
indicates that the syntax is OK and that both emails will be sent when tested during the 0701:2159 TIME window of the second rule:
00029580 2014-01-17 11:31:30 Matching host:service:dgroup:page 'butterfly.soe.uq.edu.au:http: Linux Servers:its-usg/usg-linux,its-ais/ais-other' against rule line 1002
00029580 2014-01-17 11:31:30 *** Match with 'PAGE=%its-ais/ais-(web|proxy|other).*' ***
00029580 2014-01-17 11:31:30 Matching host:service:dgroup:page 'butterfly.soe.uq.edu.au:http: Linux Servers:its-usg/usg-linux,its-ais/ais-other' against rule line 1003
00029580 2014-01-17 11:31:30 *** Match with 'MAIL ais-web at domain SERVICE=$AISMAILSVCS DURATION>2m COLOR=red REPEAT=1w FORMAT=PLAIN RECOVERED' ***
00029580 2014-01-17 11:31:30 Mail alert with command '/usr/bin/mutt -s "Xymon [12345] butterfly.soe.uq.edu.au:http CRITICAL (RED)" ais-web at domain'
00029580 2014-01-17 11:31:30 Matching host:service:dgroup:page 'butterfly.soe.uq.edu.au:http: Linux Servers:its-usg/usg-linux,its-ais/ais-other' against rule line 1004
00029580 2014-01-17 11:31:30 *** Match with 'MAIL ais-web-sms at domain SERVICE=$AISSMSSVCS DURATION>6m TIME=*:0701:2159 COLOR=red REPEAT=1w FORMAT=SMS RECOVERED' ***
00029580 2014-01-17 11:31:30 Mail alert with command '/usr/bin/mutt ais-web-sms at domain'
Is there anything wrong with the alerting logic I have used in alerts.cfg, or am I misunderstanding how it works?
The BigBrother behaviour would have been to send the alert after the rule settle time at the start of the time window for the rule if an event happened prior to the start of the alerting time window.
I contrived a dummy test in hosts.cfg and alerts.cfg for an unpingable host "dummy.alerting.test" with the "fping" test. Event log for "dummy.alerting.test" "fping":
Tue Jan 21 15:47:43 2014  red  0:16:12
alerts.cfg:
HOST=dummy.alerting.test
    MAIL g.stone-tolcher at its.uq.edu.au DURATION>2m TIME=*:1600:1700 COLOR=red REPEAT=1w FORMAT=PLAIN RECOVERED
Notification:
Tue Jan 21 16:00:36 2014  dummy.alerting.test  fping  g.stone-tolcher at its.uq.edu.au
This seems to indicate that it is working similarly to what is expected, i.e. the notification is sent at the start of the TIME window if the event is still current (ignoring the duration/settle time, unlike BigBrother). I do not understand why the other alert did not occur.
Any help with this issue would be appreciated.
Cheers,
Gavin Stone-Tolcher,
IT Support Officer, Network Operations and Incident Response
Information Technology Services
The University of Queensland
Level 4, Prentice Building, St Lucia 4072
T: +61 7 334 66645, M: +61 401 140 838
E: g.stone-tolcher at its.uq.edu.au
W: www.its.uq.edu.au
Hi Gavin,
On 21-01-2014 07:16, Gavin Stone-Tolcher wrote:
> We have just experienced a potentially major issue with alerting.
> Our issue seems to be with alerts not being generated for a rule if the initial event transition to red is not within the "TIME" range for an alerting rule.
Thanks for a very thorough bug report. I can reproduce your problem here, and it is definitely a bug. I am looking into it and 4.3.14 will have to wait until this is fixed.
Regards, Henrik
I believe this patch should fix it. Applies against 4.3.12.
Regards, Henrik
Thanks, I will patch our systems with this today.
Is there a reliable way of reproducing the issue? My test case appeared to work OK unlike the documented failure case.
Cheers, Gavin...
On 22.01.2014 00:55, Gavin Stone-Tolcher wrote:
> Thanks, I will patch our systems with this today.
> Is there a reliable way of reproducing the issue? My test case appeared to work OK unlike the documented failure case.
I set up a test installation with the settings from your original report, just modified so that the TIME window on the second alert began 15 minutes after I started Xymon. The first alert therefore fired while the TIME restriction was still suppressing the second alert. The second alert then failed to trigger.
The heart of the problem is that when the first alert triggers, Xymon sets a timestamp for the next time it should send an alert - and since the alert that triggered has a "REPEAT=1w" setting, xymond_alert ignores the event for a week. It simply fails to take into account that there is another recipient defined which triggers before that time. So the circumstances required to hit this bug are:
- you need two or more recipients for an alert
- you must have a REPEAT setting of more than 24 hours
- one of the recipients must have a TIME restriction which disables alerts when the event first triggers.
Regards, Henrik
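The three conditions listed above can be reproduced in a toy model. The sketch below is only an illustration of the explanation given in this thread; it is not the code of xymond_alert or of the patch, and every name in it (`Recipient`, `simulate`, `per_recipient`) is hypothetical. It assumes the faulty behaviour amounts to keeping one "next alert" timestamp per event, and the corrected behaviour to tracking it per recipient. Times are seconds counted from midnight on the day the event starts.

```python
from dataclasses import dataclass

DAY = 24 * 3600
WEEK = 7 * DAY


@dataclass
class Recipient:
    name: str
    min_duration: int            # DURATION threshold in seconds
    repeat: int                  # REPEAT interval in seconds
    window: tuple = (0, DAY - 1)  # allowed seconds-of-day (TIME)


def simulate(recipients, event_start, horizon, per_recipient, step=60):
    """Return [(time, recipient_name)] for every notification sent.

    per_recipient=False models one shared next-alert timestamp per event
    (the behaviour described in the thread); True tracks it per recipient.
    """
    sent = []
    shared_next = event_start
    next_for = {r.name: event_start for r in recipients}
    for now in range(event_start, event_start + horizon, step):
        if not per_recipient and now < shared_next:
            continue  # the whole event is ignored until the shared timestamp
        for r in recipients:
            old_enough = now - event_start > r.min_duration
            start, end = r.window
            in_window = start <= now % DAY <= end
            if old_enough and in_window and now >= next_for[r.name]:
                sent.append((now, r.name))
                next_for[r.name] = now + r.repeat
                if not per_recipient:
                    shared_next = now + r.repeat
    return sent


# Rules modelled on the original report: mail after 2m at any time,
# SMS after 6m but only between 07:01 and 21:59, both with REPEAT=1w.
recipients = [
    Recipient("ais-web", 120, WEEK),
    Recipient("ais-web-sms", 360, WEEK, (7 * 3600 + 60, 21 * 3600 + 59 * 60 + 59)),
]
went_red = 3 * 3600 + 7 * 60 + 44  # 03:07:44

buggy = [name for _, name in simulate(recipients, went_red, 3 * DAY, per_recipient=False)]
fixed = [name for _, name in simulate(recipients, went_red, 3 * DAY, per_recipient=True)]
print(buggy)  # ['ais-web']
print(fixed)  # ['ais-web', 'ais-web-sms']
```

With a single recipient, or with a recipient whose TIME window is already open when the event starts, both modes of this model give the same result, which is consistent with the dummy test earlier in the thread not reproducing the problem.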
Was hoping this patch would fix an alerting issue of ours and saw this after only running for a couple hours:
Historical Status xymon.foo.net - xymond_alert Log time Wed Jan 22 13:55:57 2014
- Program crashed
Fatal signal caught!
On 22-01-2014 20:59, Mark Felder wrote:
> Was hoping this patch would fix an alerting issue of ours and saw this after only running for a couple hours:
> Historical Status xymon.foo.net - xymond_alert Log time Wed Jan 22 13:55:57 2014
> - Program crashed
Can you provide a gdb trace of this, as described on http://www.xymon.com/xymon/help/known-issues.html#bugreport ?
Regards, Henrik
This doesn't seem too helpful, but here goes:
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `xymond_alert'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/lib/libssl.so.6...done.
Loaded symbols for /usr/lib/libssl.so.6
Reading symbols from /lib/libcrypto.so.6...done.
Loaded symbols for /lib/libcrypto.so.6
Reading symbols from /usr/local/lib/libpcre.so.3...done.
Loaded symbols for /usr/local/lib/libpcre.so.3
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /lib/libthr.so.3...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0 0x0000000801197d6c in kill () from /lib/libc.so.7
[New Thread 801807400 (LWP 154134/xymond_alert)]
(gdb) bt
#0 0x0000000801197d6c in kill () from /lib/libc.so.7
#1 0x000000080119699b in abort () from /lib/libc.so.7
#2 0x0000000000410633 in sigsegv_handler (signum=Variable "signum" is not available.
) at sig.c:57
#3 0x00007ffffffff043 in ?? ()
#4 0x0000000000410610 in sigusr2_handler () at sig.c:70
#5 0x0000000000000000 in ?? ()
#6 0x0000000000000000 in ?? ()
#7 0x0000000000000000 in ?? ()
#8 0x0000000000000000 in ?? ()
#9 0x0000000801949280 in ?? ()
#10 0x0000000801919ba0 in ?? ()
#11 0x0000000000000001 in ?? ()
#12 0x00000000000004eb in ?? ()
#13 0x0000000000000000 in ?? ()
#14 0xffffff82335cec40 in ?? ()
#15 0x0000000000000000 in ?? ()
#16 0x0000000000000000 in ?? ()
#17 0x0000000801919ba0 in ?? ()
#18 0xfffffe0222ac9940 in ?? ()
#19 0x0000000000000206 in ?? ()
#20 0x000000005307a80b in ?? ()
#21 0x0000000801949280 in ?? ()
#22 0x0000000052e01b0b in ?? ()
#23 0x00007fffffffad14 in ?? ()
#24 0x001b00130000000c in ?? ()
#25 0x0000000000000008 in ?? ()
#26 0x003b003b00000001 in ?? ()
#27 0x0000000000000004 in ?? ()
#28 0x0000000000405d1c in next_alert (alert=0x801949280) at do_alert.c:708
#29 0x0000000000000000 in ?? ()
#30 0x0000ffff00001fa0 in ?? ()
#31 0x0000000000000028 in ?? ()
#32 0x000000000000ffff in ?? ()
#33 0x00000000fef0f000 in ?? ()
#34 0x000000000000ffff in ?? ()
#35 0x0000000000000001 in ?? ()
#36 0x000000000000ffff in ?? ()
#37 0x0000000000000000 in ?? ()
#38 0x000000000000ffff in ?? ()
#39 0x000000000000ea88 in ?? ()
#40 0x000000000000ffff in ?? ()
#41 0x0000000000000000 in ?? ()
#42 0x000000000000ffff in ?? ()
#43 0x0000000000000000 in ?? ()
#44 0x000000000000ffff in ?? ()
#45 0x0000000000000000 in ?? ()
#46 0x0000000000000000 in ?? ()
#47 0x413baf8080000000 in ?? ()
#48 0x0000000000000000 in ?? ()
#49 0x0000000000000000 in ?? ()
#50 0x0000000000000000 in ?? ()
#51 0x0000000000000000 in ?? ()
#52 0x0000000000000000 in ?? ()
#53 0x0000000000000000 in ?? ()
#54 0x0000000000000000 in ?? ()
#55 0x0000000000000000 in ?? ()
#56 0x0000000000000000 in ?? ()
#57 0x0000000000000000 in ?? ()
#58 0x0000000000000000 in ?? ()
#59 0x0000000000000000 in ?? ()
#60 0x0000000000000000 in ?? ()
#61 0x0000000000000000 in ?? ()
#62 0x0000000000000000 in ?? ()
#63 0x0000000000000000 in ?? ()
#64 0x0000000000000000 in ?? ()
#65 0x0000000000000000 in ?? ()
#66 0x0000000000000000 in ?? ()
#67 0x0000000000000000 in ?? ()
#68 0x0000000000000000 in ?? ()
#69 0x0000000000000000 in ?? ()
#70 0x0000000000000000 in ?? ()
#71 0x0000000000000000 in ?? ()
#72 0x0000000000000000 in ?? ()
#73 0x0000000000000000 in ?? ()
#74 0x0000000000000000 in ?? ()
#75 0x0000000000000000 in ?? ()
#76 0x0000000000000000 in ?? ()
#77 0x0000000000000000 in ?? ()
#78 0x0000000000000000 in ?? ()
#79 0x0000000000000000 in ?? ()
#80 0x0000000000000000 in ?? ()
#81 0x0000000000000000 in ?? ()
#82 0x0000000000000000 in ?? ()
#83 0x0000000000000000 in ?? ()
#84 0x0000000000000000 in ?? ()
#85 0x0000000000000000 in ?? ()
#86 0x0000000000000000 in ?? ()
#87 0x0000000000000000 in ?? ()
#88 0x0000000000000000 in ?? ()
#89 0x0000000000000000 in ?? ()
#90 0x0000000000000000 in ?? ()
#91 0x000000080063f7a8 in ?? ()
#92 0x0000000000000000 in ?? ()
#93 0x0000000000000000 in ?? ()
#94 0x0000000000000000 in ?? ()
#95 0x0000000000000000 in ?? ()
#96 0x0000000000000000 in ?? ()
#97 0x0000000000000000 in ?? ()
#98 0x0000000000000000 in ?? ()
#99 0x0000000000000000 in ?? ()
#100 0x0000000000000000 in ?? ()
#101 0x0000000000000000 in ?? ()
#102 0x0000000000000004 in ?? ()
#103 0x0000000000000000 in ?? ()
#104 0x0000000000000000 in ?? ()
#105 0x0000000000000000 in ?? ()
#106 0x0000000000000000 in ?? ()
#107 0x0000000000000000 in ?? ()
#108 0x0000000000000000 in ?? ()
#109 0x0000000000000000 in ?? ()
#110 0x0000000000000000 in ?? ()
#111 0x0000000000000000 in ?? ()
#112 0x0000000000000000 in ?? ()
#113 0x0000000000000000 in ?? ()
#114 0x0000000000000000 in ?? ()
#115 0x0000000000000000 in ?? ()
#116 0x0000000000000000 in ?? ()
#117 0x00000100ffffad14 in ?? ()
#118 0x00000000000003a0 in ?? ()
#119 0x00000008018a63c0 in ?? ()
#120 0x00000008018a6960 in ?? ()
#121 0x0000000801949280 in ?? ()
#122 0x0000000052e01b0b in ?? ()
#123 0x00007fffffffad14 in ?? ()
#124 0x0000000000412cfd in xtreeFind (treehandle=0x320, key=0x10002 <Address 0x10002 out of bounds>) at tree.c:100
#125 0x000000005307a80b in ?? ()
#126 0x0000000801949280 in ?? ()
#127 0x0000000052e01b0b in ?? ()
#128 0x00007fffffffad14 in ?? ()
#129 0x0000000000405d1c in next_alert (alert=0x20002) at do_alert.c:708
#130 0x0000000000403e7f in main (argc=Variable "argc" is not available.
) at xymond_alert.c:925
On 2014-01-22 20:59, Mark Felder wrote:
> Was hoping this patch would fix an alerting issue of ours and saw this after only running for a couple hours:
> Historical Status xymon.foo.net - xymond_alert Log time Wed Jan 22 13:55:57 2014
> - Program crashed
> Fatal signal caught!
My guess is that you ran into the second alert-related bug in 4.3.13, the one that triggers when Xymon tries to strip newlines from the alert message. See http://lists.xymon.com/archive/2014-January/038847.html
4.3.14 will be released over the weekend, so you can either wait for that to arrive, or apply the patch from the link above.
Regards, Henrik
Thanks for that information, Henrik. The criteria for the bug were met by alerting rules on our setup overnight, and I am glad to report that the behaviour is now correct, i.e. alerts were generated for the second recipient at the start of their TIME window.
BTW, I am using RHEL 6 systems and am not seeing core dumps.
Many thanks for the rapid response to this issue.
Cheers, Gavin....
participants (3)
- feld@feld.me
- g.stone-tolcher@its.uq.edu.au
- henrik@hswn.dk