I was bitten by this in the middle of November, and didn't notice it until a customer alerted me today to a shortage of email messages.
To recap:
Some alerts get sent correctly, but in other cases the alert daemon aborts message processing and no alert is sent. In the cases where the daemon stops processing, my debug log begins to accumulate messages of the sort:
1730 2015-12-01 07:58:39.501785 Checking criteria for host 'upsjdc.state.ak.us', which is not defined
There is sometimes a <defunct> process left hanging around. At other times there is not.
Performing a "xymon.sh restart" makes it all work again.
Today, I had a process tree something like:
29118 /opt/xymon/server/bin/xymonlaunch --config=/opt/xymon/server/etc/tasks.cfg --en
29119 xymond --pidfile=/var/log/xymon/xymond.pid --restart=/opt/xymon/server/tmp/xymo
29120 /opt/xymon/server/bin/xymonfetch --id=1 --interval=79 --no-daemon --pidfile=/va
29144 xymond_channel --channel=stachg --log=/var/log/xymon/history.log xymond_history
29201 xymond_history --pidfile=/var/log/xymon/xymond_history.pid
29145 xymond_channel --channel=page --log=/var/log/xymon/alert.log xymond_alert --deb
29307 xymond_alert --debug --checkpoint-file=/opt/xymon/server/tmp/alert.chk --checkp
1588 <defunct>
I killed off PID 29145, it was recreated, and the alerts began flowing again.
In this occurrence, it does not appear to be related to a "drop" message. My last recorded "drop" was at 20151103-0846, and the alert process didn't start logging "which is not defined" until 20151120-0007.
The only thing I can think to do now is make my xymon client monitor the alert.log and warn me when "which is not defined" starts appearing, so I can manually kill/restart the process.
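As a rough illustration of that workaround, here is a minimal Python sketch that scans log lines for the "which is not defined" marker. It assumes every debug line follows the layout of the sample quoted above (PID, date, time with microseconds, message). The function name find_undefined_host_lines and the optional since cutoff are hypothetical, not anything Xymon provides.

```python
import re
from datetime import datetime

# Marker text taken from the debug log line quoted in this message.
MARKER = "which is not defined"

# Assumed layout, based on the single sample line above:
# "<pid> <YYYY-MM-DD> <HH:MM:SS.ffffff> <message>"
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+(.*)$"
)


def find_undefined_host_lines(lines, since=None):
    """Return (pid, timestamp, message) for each line containing MARKER.

    Lines that do not match the assumed layout are skipped.
    If 'since' is given, lines stamped before it are ignored.
    """
    hits = []
    for line in lines:
        if MARKER not in line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(2), "%Y-%m-%d %H:%M:%S.%f")
        if since is not None and stamp < since:
            continue
        hits.append((int(match.group(1)), stamp, match.group(3)))
    return hits
```

Feeding it the lines of /var/log/xymon/alert.log (the path given to xymond_channel in the process list above) and raising a warning whenever the result is non-empty would give the early notice described here. The kill/restart itself would still be done by hand.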
Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska