On 6/10/2015 9:01 AM, Scot Kreienkamp wrote:
Hi everyone,
I have a xymon server running 4.3.21 that seems to be accumulating processes like these:
hobbit 28430 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28435 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28440 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28444 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28449 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28452 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
It seemed related to drop messages . . .
Hey, I think I'm seeing the same thing on Solaris with 4.3.21.
I've ended up here after a customer let me know that email alerts were not working as expected. After a few hours of digging around, I decided that the alert daemon was failing to retrieve hostnames and failing miserably.
Have other people seen this behavior?
Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
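One way to keep an eye on the symptom reported above is to count zombie entries in a process listing. The following is a minimal sketch, assuming `ps aux`-style columns like the ones pasted in the report; the function name `count_defunct` and the sample text are illustrative only and are not part of Xymon.

```python
def count_defunct(ps_output: str, name: str = "xymond_hostdata") -> int:
    """Count zombie (state Z) entries for `name` in `ps aux`-style text."""
    count = 0
    for line in ps_output.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # header, blank, or truncated line
        stat, command = fields[7], fields[10]
        if stat.startswith("Z") and name in command:
            count += 1
    return count


SAMPLE = """\
hobbit 28430 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28435 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28001 0.0 0.1 1234 567 ? S 12:00 0:01 xymond_hostdata
"""

if __name__ == "__main__":
    print(count_defunct(SAMPLE))
```

Only the first two sample lines are in state Z, so the live `xymond_hostdata` process on the third line is not counted.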
On 8/28/2015 12:45 PM, John Thurston wrote: . . .
I have duplicated this behavior on another xymon server on Solaris. It certainly looks like this behavior breaks the alert daemon. Fortunately, I "drop" hosts in batches, so I can restart Xymon at that time, but this is still pretty icky.
J.C., do you know if your patch made it into the code-base?
Has anyone else tested this patch? If so, on what operating systems?
John Thurston
On Fri, August 28, 2015 3:16 pm, John Thurston wrote: . . .
I thought this had sounded familiar.
The patch from http://lists.xymon.com/pipermail/xymon/2015-June/041833.html was checked in as https://sourceforge.net/p/xymon/code/7669/ , but it is not in the most recent Terabithia RPM.
If you could test the direct patch (for hostdata, at http://lists.xymon.com/pipermail/xymon/attachments/20150610/8b425efb/attachm... ) on your OS, that would be very helpful. Signal handling is always a bit tricky to get right across the board.
Regards,
-jc
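For readers unfamiliar with why exited children linger as `<defunct>`: a child stays a zombie until its parent collects the exit status. The contents of the patch are not shown in this thread, so the following is not that patch. It is only a generic sketch of reaping children on SIGCHLD, written in Python for brevity; all names in it (`reap_children`, `install_handler`, `spawn_child`) are hypothetical.

```python
import os
import signal
import time

reaped = []  # PIDs collected so far (illustrative bookkeeping only)


def reap_children(signum=None, frame=None):
    """Collect every child that has already exited, without blocking."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # this process has no children at all
        if pid == 0:
            return  # children exist, but none has exited yet
        reaped.append(pid)


def install_handler():
    """Run reap_children whenever a child changes state (SIGCHLD)."""
    signal.signal(signal.SIGCHLD, reap_children)


def spawn_child():
    """Fork a child that exits immediately; return its PID to the parent."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    return pid
```

The loop inside the handler matters because several child exits can be reported by a single SIGCHLD delivery; calling `waitpid` only once per signal can leave zombies behind.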
J.C. Cleaver wrote: . . .
Problem reproduced here on Solaris 10, and solved by the suggested patch.
Andy
On 8/28/2015 3:12 PM, J.C. Cleaver wrote: . . .
I have patched one of my servers and it behaves much better under my contrived tests :) This is under Solaris 10 (Update 11) on SPARC. The original report was under Red Hat Enterprise Linux 5.
If my understanding of this is correct, it is a pretty nasty defect :(
My failure scenario was non-delivery of some email alerts for hosts in dire straits. I have several customers who do not monitor the web interface, but rely on email notifications to warn them of impending problems. These folks had been without any alerting capability since early in July, when I "dropped" a host and unknowingly clobbered the child of xymond_hostdata.
John Thurston
On Mon, August 31, 2015 10:19 am, John Thurston wrote: . . .
Thanks for the confirmation... Yes, I believe it's probably time to start another release cycle, for this and a few other of the recent bug fixes still pending.
Regards,
-jc
On Mon, Aug 31, 2015, at 16:24, J.C. Cleaver wrote: . . .
For the record, I can't reproduce this on FreeBSD either.
On Fri, Sep 4, 2015, at 11:08, Mark Felder wrote:
For the record, I can't reproduce this on FreeBSD either.
This specifically was for the extra "xymond_hostdata" child processes...
Now that I think of it, I do recall being unable to identify why some alerts were not sent on a large xymon installation... perhaps this was the culprit? Do we know roughly how long this problem may have existed?
How embarrassing. I was composing a note to mention a problem with the list archives not capturing all messages . . . when I discovered that the message for which I was searching was never sent to the list.
I composed the following message back in early October and then sent it only to myself :p No wonder it didn't generate any chatter.
On 8/28/2015 3:12 PM, J.C. Cleaver wrote:
. . .
This patch took care of the defunct/zombie processes on "drop" events, but I've just discovered that it does not solve the underlying problem. It still appears that xymond_hostdata does not behave correctly following a "drop" command. The effect is that alerts fail to be delivered for _some_ messages because hostnames can no longer be retrieved.
Example:
My xymon server is humming along. I have the alert module debug-logging to alerts.log. Immediately after issuing a "drop" command of the sort:
#xymon localhost "drop foo.bar.com sslcert"
messages of the following sort appear in alerts.log. After this, some messages may result in alert emails being sent, but most quietly disappear. Currently, my resolution is to "xymon.sh restart", but that is much too heavy-handed for long-term use.
21178 2015-10-05 16:39:43.257559 get_xymond_message: Interrupted
21178 2015-10-05 16:39:43.257624 No files modified, skipping reload of /opt/xymon/server/etc/alerts.cfg
21178 2015-10-05 16:39:43.257680 No files modified, skipping reload of /opt/xymon/server/etc/holidays.cfg
21178 2015-10-05 16:39:43.257718 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257773 Found a first matching rule
21178 2015-10-05 16:39:43.257802 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257830 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257854 Found a first matching rule
21178 2015-10-05 16:39:43.257879 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
21178 2015-10-05 16:39:43.257910 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.257935 Found a first matching rule
21178 2015-10-05 16:39:43.257960 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.257986 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.258010 Found a first matching rule
21178 2015-10-05 16:39:43.258035 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.258061 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258088 Found a first matching rule
21178 2015-10-05 16:39:43.258113 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258140 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258164 Found a first matching rule
21178 2015-10-05 16:39:43.258188 Checking criteria for host 'upsjdc.bar.com', which is not defined
21178 2015-10-05 16:39:43.258211 0 alerts to go
21178 2015-10-05 16:39:43.258270 Want msg 5039, startpos 134769, fillpos 134769, endpos -1, usedbytes=0, bufleft=131470
21178 2015-10-05 16:39:47.962032 Got 2831 bytes
21178 2015-10-05 16:39:47.962143 xymond_alert: Got message 5039 @@page#5039/soajnuexhs1.bar.com|1444091987.961845|10.2.3.40|soajnuexhs1.bar.com|msgs|0.0.0.0|1444093787|red|red|1444088306|ETS/MsgDir|540754||||
21178 2015-10-05 16:39:47.962171 startpos 137600, fillpos 137600, endpos -1
21178 2015-10-05 16:39:47.962204 Got page message from soajnuexhs1.bar.com:msgs
21178 2015-10-05 16:39:47.962252 Want msg 5040, startpos 137600, fillpos 137600, endpos -1, usedbytes=0, bufleft=128639
21178 2015-10-05 16:39:58.022397 Got 297 bytes
21178 2015-10-05 16:39:58.022526 xymond_alert: Got message 5040 @@page#5040/doadofjdc-ea05p.bar.com|1444091998.022274|10.2.167.44|doadofjdc-ea05p.bar.com|msgs|0.0.0.0|1444093798|green|red|1444091998|DOA/IRIS|||||
21178 2015-10-05 16:39:58.022558 startpos 137897, fillpos 137897, endpos -1
21178 2015-10-05 16:39:58.022593 Got page message from doadofjdc-ea05p.bar.com:msgs
21178 2015-10-05 16:39:58.022630 Alert status changed from 1 to 0
21178 2015-10-05 16:39:58.022666 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022706 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022739 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022776 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022808 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022841 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022873 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022904 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022935 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022967 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.022998 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023028 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023059 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023089 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023120 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023151 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023187 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023221 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023252 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023282 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023313 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023342 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
21178 2015-10-05 16:39:58.023369 Found no first matching rule
21178 2015-10-05 16:39:58.023402 Want msg 5041, startpos 137897, fillpos 137897, endpos -1, usedbytes=0, bufleft=128342
21178 2015-10-05 16:40:10.109262 get_xymond_message: Returning NULL due to EOF
John Thurston
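A debug log like the one above is easier to read once the "which is not defined" entries are tallied per host. This is a minimal sketch, assuming the message wording shown in the excerpt; the regular expression and the name `undefined_hosts` are illustrative and not part of Xymon.

```python
import re
from collections import Counter

# Matches the debug message quoted in the log excerpt and captures the hostname.
UNDEFINED_HOST = re.compile(
    r"Checking criteria for host '([^']+)', which is not defined"
)


def undefined_hosts(log_text: str) -> Counter:
    """Return how often each host was reported as 'not defined'."""
    return Counter(UNDEFINED_HOST.findall(log_text))


SAMPLE_LOG = """\
21178 2015-10-05 16:39:43.257910 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.257935 Found a first matching rule
21178 2015-10-05 16:39:43.257960 Checking criteria for host 'steam.bar.com', which is not defined
21178 2015-10-05 16:39:43.258061 Checking criteria for host 'upsjdc.bar.com', which is not defined
"""

if __name__ == "__main__":
    for host, hits in undefined_hosts(SAMPLE_LOG).most_common():
        print(host, hits)
```

Because the pattern is searched across the whole text, it also works on log output whose line breaks were lost in transit.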
I was bitten by this in the middle of November, and didn't notice it until a customer alerted me today to a shortage of email messages.
To recap:
Some alerts get sent correctly, but in other cases the alert daemon aborts message processing and no alert is sent. In the cases where the daemon stops processing, my debug log begins to accumulate messages of the sort:
1730 2015-12-01 07:58:39.501785 Checking criteria for host 'upsjdc.state.ak.us', which is not defined
There is sometimes a <defunct> process left hanging around. At other times there is not.
Performing a "xymon.sh restart" makes it all work again.
Today, I had a process tree something like:
29118 /opt/xymon/server/bin/xymonlaunch --config=/opt/xymon/server/etc/tasks.cfg --en
29119 xymond --pidfile=/var/log/xymon/xymond.pid --restart=/opt/xymon/server/tmp/xymo
29120 /opt/xymon/server/bin/xymonfetch --id=1 --interval=79 --no-daemon --pidfile=/va
29144 xymond_channel --channel=stachg --log=/var/log/xymon/history.log xymond_history
29201 xymond_history --pidfile=/var/log/xymon/xymond_history.pid
29145 xymond_channel --channel=page --log=/var/log/xymon/alert.log xymond_alert --deb
29307 xymond_alert --debug --checkpoint-file=/opt/xymon/server/tmp/alert.chk --checkp
1588 <defunct>
I killed off PID 29145, it was recreated, and the alerts began flowing again.
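Picking the right PID out of a listing like the one above can be scripted. The following is a minimal sketch, assuming each line has the form "PID command ..." as in the pasted tree; `find_channel_pid` is a hypothetical helper, and which signal to send afterwards is left to the operator.

```python
from typing import Optional


def find_channel_pid(listing: str, channel: str = "page") -> Optional[int]:
    """Return the PID of the xymond_channel process serving `channel`."""
    wanted = ["xymond_channel", f"--channel={channel}"]
    for line in listing.splitlines():
        parts = line.split()
        # parts[0] is the PID, parts[1:3] the program name and its first option
        if len(parts) >= 3 and parts[0].isdigit() and parts[1:3] == wanted:
            return int(parts[0])
    return None


LISTING = """\
29144 xymond_channel --channel=stachg --log=/var/log/xymon/history.log xymond_history
29201 xymond_history --pidfile=/var/log/xymon/xymond_history.pid
29145 xymond_channel --channel=page --log=/var/log/xymon/alert.log xymond_alert --deb
29307 xymond_alert --debug --checkpoint-file=/opt/xymon/server/tmp/alert.chk --checkp
"""

if __name__ == "__main__":
    print(find_channel_pid(LISTING))
```

Matching on the exact `--channel=` option avoids confusing the page channel with the `stachg` channel, which is also served by `xymond_channel`.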
In this occurrence, it does not appear to be related to a "drop" message. My last recorded "drop" was at 20151103-0846 and the alert process didn't start logging "which is not defined" until 20151120-0007.
The only thing I can think to do now is have my xymon client monitor the alert.log and warn me when "which is not defined" starts appearing, so I can manually kill/restart the process.
John Thurston
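The log-watching idea can be sketched independently of any particular monitoring tool. The following is a minimal sketch, assuming a plain-text log that only grows (or is truncated on rotation); `count_new_matches` and its offset bookkeeping are hypothetical and do not represent a Xymon client feature.

```python
import os

MARKER = b"which is not defined"


def count_new_matches(path, offset=0, marker=MARKER):
    """Count marker lines appended since `offset`; return (hits, new_offset)."""
    if os.path.getsize(path) < offset:
        offset = 0  # file shrank, so assume it was rotated or truncated
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    # Only consume complete lines; a partial last line is left for next time.
    end = data.rfind(b"\n") + 1
    hits = sum(1 for line in data[:end].splitlines() if marker in line)
    return hits, offset + end
```

A caller would store the returned offset between runs and raise a warning whenever `hits` is non-zero, for example from a periodic job.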
On Tue, December 1, 2015 9:32 am, John Thurston wrote: *snip*
In this occurrence, it does not appear to be related to a "drop" message. My last recorded "drop" was at 20151103-0846 and the alert process didn't start logging "which is not defined" until 20151120-0007
Hmm. Okay, that does change things slightly. Fortunately, that means it's probably specifically caused by drops per se. Were there any other errors that occurred with other components around this time? Perhaps the system being low enough on memory that some re-allocations might have failed?
Regards, -jc
On 12/1/2015 11:51 AM, J.C. Cleaver wrote:
Hmm. Okay, that does change things slightly. Fortunately, that means it's probably specifically caused by drops per se. Were there any other errors that occurred with other components around this time?
I have several instances of "Oversize status msg from " in the xymond.log, but those appear six hours before the bad behavior appeared in xymond_alert. I have difficulty believing they are related.
Perhaps the system being low enough on memory that some re-allocations might have failed?
I think this is unlikely. The system has 256GB of RAM, and there are no memory caps placed on the non-global zone in which xymon is running. I don't have information on its size on Nov 20, but today it is using about 400MB of RAM. All of the zones on the system are consuming less than 10GB of the 256GB and it wouldn't have been significantly different a few weeks ago.
I've been doing some 'drops' today to try to break it, but haven't succeeded. I'll continue to beat on it and see if I can find a repeatable failure scenario.
fwiw, this is under 4.3.22
John Thurston
On Tue, December 1, 2015 1:41 pm, John Thurston wrote:
On 12/1/2015 11:51 AM, J.C. Cleaver wrote:
Hmm. Okay, that does change things slightly. Fortunately, that means it's probably specifically caused by drops per se. Were there any other errors that occurred with other components around this time?
I have several instances of "Oversize status msg from " in the xymond.log, but those appear six hours before the bad behavior appeared in xymond_alert. I have difficulty believing they are related.
Ack. Yeah, that should have been 'NOT specifically' :)
. . .
Hmm. This is an area where it's possible that glibc/NULL issues might be causing subtle things too. I could easily see the btree getting hosed by tree re-insertion of a key we weren't really expecting.
-jc
On Tue, December 1, 2015 9:14 am, John Thurston wrote: . . .
Hmm. This seems to be fundamentally a different issue than the "hostdata module going rogue" thing, which was about zombies never being picked up.
AFAICT, somehow the hosts tree structure is getting clobbered as a result of the drop (assuming all of those hosts are expected to be existing).
There were a few patches for things in xymond.c at one point, and more error checking when going to POSIX btrees generally, but I hadn't encountered this in other intermittent hostlist readers.
1) Which version of Solaris is this?
2) Have you experienced this in other workers for xymon? (IE, xymond_client not being able to look up hostnames after a drop -- would probably lead to random purples)
3) Does issuing a "reload" command or -HUP to xymond_alert re-sync things?
-jc
On 12/1/2015 11:48 AM, J.C. Cleaver wrote:
- snip -
Hmm. This seems to be fundamentally a different issue than the "hostdata module going rogue" thing, which was about zombies never being picked up.
AFAICT, somehow the hosts tree structure is getting clobbered as a result of the drop (assuming all of those hosts are expected to be existing).
See my later message for its relation to 'drop' activity.
There were a few patches for things in xymond.c at one point, and more error checking when going to POSIX btrees generally, but I hadn't encountered this in other intermittent hostlist readers.
- Which version of Solaris is this?
Solaris 10, most recent update, SPARC
- Have you experienced this in other workers for xymon? (IE, xymond_client not being able to look up hostnames after a drop -- would probably lead to random purples)
I haven't seen behavior like that with other worker processes. Is there a way to interactively run a worker process and have it hit the daemon process for the hostnames? Aside from making the process dump core, is there a way to get the daemon to spill its current list of hostnames?
- Does issuing a "reload" command or -HUP to xymond_alert re-sync things?
I didn't do a 'reload', but I killed the "xymond_channel --channel=page --log=/var/log/xymon/alert.log xymond_alert" process and alerts started working again.
I haven't yet found a way to induce this failure, so I haven't yet identified the minimal recovery steps. I'm working on it, though.
Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
On 12/1/2015 12:03 PM, John Thurston wrote:
On 12/1/2015 11:48 AM, J.C. Cleaver wrote:
- snip -
Hmm. This seems to be fundamentally a different issue than the "hostdata module going rogue" thing, which was about zombies never being picked up.
AFAICT, somehow the hosts tree structure is getting clobbered as a result of the drop (assuming all of those hosts are expected to be existing).
- snip -
I haven't yet found a way to induce this failure, so I haven't yet identified the minimal recovery steps. I'm working on it, though.
I think I might be able to reproduce the failure :)
Start with the following, stable server arrangement:
+ x.bar.com is running xymon 4.3.22 on Solaris 10 SPARC
+ The following is defined in tasks.cfg:
    CMD xymond_channel --channel=page --log=$XYMONSERVERLOGS/alert.log \
        xymond_alert --debug --checkpoint-file=$XYMONTMP/alert.chk \
        --checkpoint-interval=600
+ Host foo.bar.com is defined in DNS and does not permit ICMP traffic and does not have a xymon client installed on it
Throw a spanner in the works by the following actions:
+ Add host foo.bar.com to an existing page and group in hosts.cfg
+ ~/server/bin/xymoncmd ~/server/bin/xymonnet foo.bar.com
And see the trouble commence in alert.log:
6690 2015-12-14 10:52:06.859998 Got 415 bytes
6690 2015-12-14 10:52:06.860110 xymond_alert: Got message 95 @@page#95/foo.bar.com|1450122726.859873|10.10.10.55|foo.bar.com|conn|0.0.0.0|1450124526|red|none|1450122726|Page/Subpage|65234||||
6690 2015-12-14 10:52:06.860140 startpos 5659, fillpos 5659, endpos -1
6690 2015-12-14 10:52:06.860172 Got page message from foo.bar.com:conn
6690 2015-12-14 10:52:06.860249 Alert status changed from 0 to 1
6690 2015-12-14 10:52:06.860285 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861674 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861728 Checking criteria for host 'foo.bar.com', which is not defined
6690 2015-12-14 10:52:06.861761 Found no first matching rule
6690 2015-12-14 10:52:06.861813 No files modified, skipping reload of /opt/xymon/server/etc/alerts.cfg
6690 2015-12-14 10:52:06.861861 No files modified, skipping reload of /opt/xymon/server/etc/holidays.cfg
6690 2015-12-14 10:52:06.861891 Checking criteria for host 'zebra.bar.com', which is not defined
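For anyone triaging a log like the one above, a minimal Python sketch (not part of Xymon) can tally which hostnames xymond_alert reports as undefined. The regular expression is copied from the wording of the sample lines; the function name `undefined_hosts` is made up for illustration, and the sketch assumes the message text does not vary between Xymon versions.

```python
import re
from collections import Counter

# Wording taken from the alert.log sample; assumed stable across versions.
UNDEFINED = re.compile(
    r"Checking criteria for host '([^']+)', which is not defined"
)


def undefined_hosts(lines):
    """Count 'which is not defined' reports per hostname.

    findall() is used so that several log entries fused onto one
    physical line are still counted individually.
    """
    counts = Counter()
    for line in lines:
        for host in UNDEFINED.findall(line):
            counts[host] += 1
    return counts


# Three entries copied from the log excerpt above.
sample = [
    "6690 2015-12-14 10:52:06.860285 Checking criteria for host 'foo.bar.com', which is not defined",
    "6690 2015-12-14 10:52:06.861761 Found no first matching rule",
    "6690 2015-12-14 10:52:06.861891 Checking criteria for host 'zebra.bar.com', which is not defined",
]

for host, count in sorted(undefined_hosts(sample).items()):
    print(host, count)
```

A host that is present in hosts.cfg but shows up in this tally is a candidate symptom of the stale host list discussed in this thread.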
After killing the "xymond_channel --channel=page" process, a new one is created as a child of xymonlaunch and everything behaves normally again.
I currently have a tail on my alert.log to warn me of the appearance of the string "which is not defined". When that appears, I know it is time to HUP the "page" channel. This is a rather crude hammer to leave lying on the table next to my production server, but it keeps us running :)
I have a core file from the xymond_channel process, but its stack contains only:
feee041c _syscall6 (1, 1, 0, 1, 7d0, 3a0f4) + 20
00013c90 _start (0, 0, 0, 0, 0, 0) + 5c
I have a core file from the xymond_alert process, but its stack contains only:
feede7d8 __pollsys (ffbfcd50, 1, ffbfcdc0, 0, 0, 0) + 8
fee79b8c pselect (ffbfcd50, fef56790, fef56790, 40, ffbfcdc0, 0) + 1c8
fee79f04 select (1, ffbfce58, 0, 0, ffbfce48, ffbfced8) + a0
00015fa4 get_xymond_message (4b400, 4b14c, 4b148, ffbfcf88, 4b16c, 35d50) + 270
0003293c main (1, 566f245d, 0, 33b00, 4b000, 33bb8) + 378
00014a34 _start (0, 0, 0, 0, 0, 0) + 5c
which is whatever it was happily processing when I killed it, not the stack at the time it ended up at line 815 of loadalerts.c
What can I do and what information can I gather which will help narrow the fault domain?
--
Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
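The "tail on alert.log" workaround described in the message above could be sketched roughly as follows in Python. This is an illustration only, not something shipped with Xymon: the helper names `follow` and `first_hit` are hypothetical, the marker string is the one quoted in the message, and the log path in the usage comment is the one that appears earlier in this thread and may differ locally. What to do once the marker is seen (for example, signalling the "page" channel process) is left to the operator.

```python
import time

MARKER = "which is not defined"  # string quoted in the message above


def follow(path, poll_seconds=1.0):
    """Yield lines appended to `path`, roughly like `tail -f`.

    Hypothetical helper: starts at the current end of the file and
    polls for new data. Log rotation is not handled in this sketch.
    """
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, 2)  # 2 = os.SEEK_END, skip existing content
        while True:
            line = handle.readline()
            if line:
                yield line
            else:
                time.sleep(poll_seconds)


def first_hit(lines, marker=MARKER):
    """Return the first line containing `marker`, or None if input ends."""
    for line in lines:
        if marker in line:
            return line.rstrip("\n")
    return None


# Example use (blocks until the marker appears; path is an assumption):
#   hit = first_hit(follow("/var/log/xymon/alert.log"))
#   print("xymond_alert may have lost its host list:", hit)
```

Keeping the matching logic in `first_hit` separate from the file polling in `follow` means the match can be checked against a saved log excerpt without touching a live server.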
participants (4)
- abs@shadymint.com
- cleaver@terabithia.org
- feld@feld.me
- john.thurston@alaska.gov