On 12/1/2015 12:03 PM, John Thurston wrote:
On 12/1/2015 11:48 AM, J.C. Cleaver wrote: - snip -
Hmm. This seems to be fundamentally a different issue than the "hostdata module going rogue" thing, which was about zombies never being picked up.
AFAICT, somehow the hosts tree structure is getting clobbered as a result of the drop (assuming all of those hosts are expected to be existing).
- snip -
I haven't yet found a way to induce this failure, so I haven't yet identified the minimal recovery steps. I'm working on it, though.
I think I might be able to reproduce the failure :)

Start with the following, stable server arrangement:

+ x.bar.com is running xymon 4.3.22 on Solaris 10 SPARC
+ The following is defined in tasks.cfg:

    CMD xymond_channel --channel=page --log=$XYMONSERVERLOGS/alert.log \
        xymond_alert --debug --checkpoint-file=$XYMONTMP/alert.chk \
        --checkpoint-interval=600

+ Host foo.bar.com is defined in DNS, does not permit ICMP traffic, and does not have a xymon client installed on it

Throw a spanner in the works by the following actions:

+ Add host foo.bar.com to an existing page and group in hosts.cfg
+ ~/server/bin/xymoncmd ~/server/bin/xymonnet foo.bar.com

And see the trouble commence in alert.log:
6690 2015-12-14 10:52:06.859998 Got 415 bytes
6690 2015-12-14 10:52:06.860110 xymond_alert: Got message 95 @@page#95/foo.bar.com|1450122726.859873|10.10.10.55|foo.bar.com|conn|0.0.0.0|1450124526|red|none|1450122726|Page/Subpage|65234||||
6690 2015-12-14 10:52:06.860140 startpos 5659, fillpos 5659, endpos -1
6690 2015-12-14 10:52:06.860172 Got page message from foo.bar.com:conn
6690 2015-12-14 10:52:06.860249 Alert status changed from 0 to 1
6690 2015-12-14 10:52:06.860285 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861674 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861728 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861761 Found no first matching rule
6690 2015-12-14 10:52:06.861813 No files modified, skipping reload of /opt/xymon/server/etc/alerts.cfg
6690 2015-12-14 10:52:06.861861 No files modified, skipping reload of /opt/xymon/server/etc/holidays.cfg
6690 2015-12-14 10:52:06.861891 Checking criteria for host 'zebra.bar.com', which is not defined
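For anyone who wants to watch for this symptom automatically, here is a minimal Python sketch that pulls the affected host names out of debug lines like those in the excerpt above. It is not part of Xymon. The function name undefined_hosts and the regular expression are my own, written against the exact wording of these log lines, so they would need adjusting if the message text differs in another version.

```python
import re
from typing import Iterable, List

# Pattern copied from the xymond_alert --debug lines quoted above.
# Assumption: the message wording is the same in the version being watched.
UNDEFINED_HOST = re.compile(
    r"Checking criteria for host '([^']+)', which is not defined"
)


def undefined_hosts(lines: Iterable[str]) -> List[str]:
    """Return host names reported as 'not defined', in first-seen order."""
    seen: List[str] = []
    for line in lines:
        match = UNDEFINED_HOST.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Three lines taken from the alert.log excerpt above.
sample = [
    "6690 2015-12-14 10:52:06.860285 Checking criteria for host 'foo.bar.com', which is not defined",
    "6690 2015-12-14 10:52:06.861761 Found no first matching rule",
    "6690 2015-12-14 10:52:06.861891 Checking criteria for host 'zebra.bar.com', which is not defined",
]
print(undefined_hosts(sample))
```

Feeding it the lines of alert.log (for example, an open file object) would give the list of hosts that xymond_alert no longer recognises. What to do when the list is non-empty is left to the operator.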
After killing the "xymond_channel --channel=page" process, a new one is created as a child of xymonlaunch and everything behaves normally again.

I currently have a tail on my alert.log to warn me of the appearance of the string "which is not defined". When that appears, I know it is time to HUP the "page" channel. This is a rather crude hammer to leave lying on the table next to my production server, but it keeps us running :)

I have a core file from the xymond_channel process, but its stack contains only:
 feee041c _syscall6 (1, 1, 0, 1, 7d0, 3a0f4) + 20
 00013c90 _start (0, 0, 0, 0, 0, 0) + 5c
I have a core file from the xymond_alert process, but its stack contains only:
 feede7d8 __pollsys (ffbfcd50, 1, ffbfcdc0, 0, 0, 0) + 8
 fee79b8c pselect (ffbfcd50, fef56790, fef56790, 40, ffbfcdc0, 0) + 1c8
 fee79f04 select (1, ffbfce58, 0, 0, ffbfce48, ffbfced8) + a0
 00015fa4 get_xymond_message (4b400, 4b14c, 4b148, ffbfcf88, 4b16c, 35d50) + 270
 0003293c main (1, 566f245d, 0, 33b00, 4b000, 33bb8) + 378
 00014a34 _start (0, 0, 0, 0, 0, 0) + 5c

which is whatever it was happily processing when I killed it, not the stack at the time it ended up at line 815 of loadalerts.c.
What can I do, and what information can I gather, which will help narrow the fault domain?

-- 
Do things because you should, not just because you can.

John Thurston 907-465-8591
John.Thurston at alaska.gov
Enterprise Technology Services
Department of Administration
State of Alaska