On 12/1/2015 11:48 AM, J.C. Cleaver wrote:
- snip -
Hmm. This seems to be a fundamentally different issue from the "hostdata module going rogue" thing, which was about zombies never being picked up.
AFAICT, the hosts tree structure is somehow getting clobbered as a result of the drop (assuming all of those hosts are expected to exist).
See my later message for its relation to 'drop' activity.
There were a few patches for things in xymond.c at one point, and more error checking when going to POSIX btrees generally, but I hadn't encountered this in other intermittent hostlist readers.
- Which version of Solaris is this?
Solaris 10, most recent update, SPARC
- Have you experienced this in other workers for xymon? (i.e., xymond_client not being able to look up hostnames after a drop, which would probably lead to random purples)
I haven't seen behavior like that with other worker processes. Is there a way to interactively run a worker process and have it hit the daemon process for the hostnames? Aside from making the process dump core, is there a way to get the daemon to spill its current list of hostnames?
- Does issuing a "reload" command or -HUP to xymond_alert re-sync things?
I didn't do a 'reload', but I killed the "xymond_channel --channel=page --log=/var/log/xymon/alert.log xymond_alert" process and alerts started working again.
I haven't yet found a way to induce this failure, so I haven't yet identified the minimal recovery steps. I'm working on it, though.
Do things because you should, not just because you can.
John Thurston    907-465-8591
John.Thurston at alaska.gov
Enterprise Technology Services
Department of Administration
State of Alaska