hostname retrieval is broken after adding a host
This defect is getting to be a serious problem in my production Xymon.
For specifics, see my notes of 20151201 and 20151214 : http://lists.xymon.com/pipermail/xymon/2015-December/042712.html http://lists.xymon.com/pipermail/xymon/2015-December/042787.html
In general, adding a host to hosts.cfg corrupts the in-memory list of valid hosts. This causes other worker processes (specifically "alert") to fail. It doesn't fail _completely_. Some alerts continue to be sent, but there are footprints in the logs. I have a script watching for these footprints. When seen, I kill the "xymond_channel --channel=page" process, a new one is started, and business continues.
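The workaround described above can be sketched roughly as follows. This is an illustration only, not the script actually in use: the `FOOTPRINT` pattern is a placeholder (the exact log text is not given here), and the use of `pkill -f` to find the process is an assumption.

```python
import re
import subprocess

# Hypothetical pattern; substitute whatever text the failing alert worker logs.
FOOTPRINT = re.compile(r"hypothetical footprint text")

# Command line of the process to kill, as named in the message above.
PAGE_CHANNEL = "xymond_channel --channel=page"


def has_footprint(lines, pattern=FOOTPRINT):
    """Return True if any log line matches the footprint pattern."""
    return any(pattern.search(line) for line in lines)


def restart_page_channel(run=subprocess.run):
    """Kill the page channel; per the message above, a new one is then started."""
    # pkill -f matches against the full command line (assumes pkill is available).
    return run(["pkill", "-f", PAGE_CHANNEL], check=False)


def check_once(log_lines, run=subprocess.run):
    """Kill the page channel if a footprint is seen; return True if it was killed."""
    if has_footprint(log_lines):
        restart_page_channel(run)
        return True
    return False
```

Passing `run` as a parameter is just a convenience so the check can be exercised without actually killing anything.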
I need to squash this bug.
Is there a way to interactively run a worker process and have it hit the in-memory table of hostnames?
If not, is there a way to spill the in-memory table of hostnames without using a debugger?
Can anyone tell me which worker processes use the in-memory host list?
Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
Hi,
Actually, I think I must have missed your final response on this at http://lists.xymon.com/pipermail/xymon/2015-December/042787.html ; my apologies.
On what's happening, I think this might be a side-effect of https://sourceforge.net/p/xymon/code/7651/ , which added a dummy record for the purposes of command-line --test functionality when the host doesn't exist. For an incoming unknown host (from xymond_alert's perspective), the same path is being executed.
The problem is that localhostinfo re-initializes the hostlist, which would almost certainly cause problems somewhat similar to what you're describing here. The attached patch should fix that (by only doing it if we're in test mode). The only other place this is used is in xymond_client when it's itself running in --local mode, in which case it doesn't have a pre-existing tree to get corrupted and then exits immediately anyway.
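To make the failure mode concrete, here is a toy model of it. This is not Xymon's code: the module-level `hosts` table and every function name below are hypothetical stand-ins, and the only thing taken from the description above is the behaviour (building a dummy record re-initializes the shared host list, and the fix is to do that only in test mode).

```python
# Toy model only: a module-level table stands in for the in-memory host list.
hosts = {}


def load_hosts(names):
    """Replace the table with the given host names (stand-in for reading hosts.cfg)."""
    hosts.clear()
    hosts.update({name: {"name": name} for name in names})


def local_host_info(name):
    """Return a dummy record, re-initializing the shared table as a side effect."""
    hosts.clear()  # the damaging step: every previously known host is lost
    hosts[name] = {"name": name, "dummy": True}
    return hosts[name]


def lookup_unguarded(name):
    """Behaviour as described: an unknown host always takes the dummy-record path."""
    if name in hosts:
        return hosts[name]
    return local_host_info(name)


def lookup_guarded(name, test_mode=False):
    """Behaviour after the fix: build a dummy record only when in test mode."""
    if name in hosts:
        return hosts[name]
    if test_mode:
        return local_host_info(name)
    return None  # unknown host: leave the table untouched
```

In this model a single unknown host passed to `lookup_unguarded` wipes every other entry, while `lookup_guarded` leaves the table alone unless `test_mode` is set.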
This really calls for a re-factoring around host loading, but I'm leery of too much direct modification in 4.3, this probably being caused by that recent code.
Can you give it a test and let us know the result?
Regards, -jc
On Mon, February 1, 2016 11:10 am, John Thurston wrote:
- snip -
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
On 2/1/2016 2:41 PM, J.C. Cleaver wrote:
- snip -
I've applied the patch to my non-production server and performed my failure-reproduction steps. The behavior is certainly better. The alert process is no longer tanking for every message received :)
What I do get, for a newly added host, is "Checking criteria for host 'foo.bar.com', which is not defined. Will not alert until hostlist reload." This happens following all subsequent runs of xymonnet.
Is there anything which will trigger a hostlist reload?
Is there a tidy way to manually reload the list?
It doesn't seem to happen until I kill the "xymond_channel --channel=page" process. This seems like a hamfisted thing to do after every edit of hosts.cfg :(
Related question:
If this is in main code, and not some odd-ball null/EOF/posix problem (as has often tripped up my Solaris systems in the recent past), why am I the only one seeing this failure? Why aren't the folks running linux having their alerts fail?
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
On Mon, February 1, 2016 4:59 pm, John Thurston wrote:
- snip -
This one took me quite a while to figure out, mainly because I was looking at the wrong code base for a while.

It turns out the host info record here is *only* used for display groups and holiday lookups (probably rarely used), within the context of alerting. In all other cases, the host not being in the hostlist doesn't affect the application of alert rules, since all the needed info is coming in via the '@@page' message itself. The patch should be updated to let those come straight through instead of exiting out if it doesn't see the host.

My confusion came from a different issue: xymond_alert actually never reloads the hosts config at all! I found/fixed this back in Sept '14 in the RPMs but it wasn't applied into 4.3 back then. I'd been living with that code for so long I forgot that that reload wasn't needed here -- and, obviously, alerts have been working *in general*... (We only noticed the lack of reload because we were dependent on a dynamic value in the hosts.cfg line coming through to the alert script via XMH_RAW in updated form.)

xymond_alert reloading was put into 4.4 at https://sourceforge.net/p/xymon/code/7776/ among the patch bursts, but the live host add issue has probably been in since this release.

There are a few takeaways from this... but this needs to be fixed in 4.3 (among several other incoming issues that are pending confirmation).

Can you please check the included two patches? One is an update for the previous one, which passes the alert check through (only adding the dummy record in --test mode to begin with), the other adds hosts.cfg reloading on intervals or on demand. It's based on the 4.4 version, but with only a small change.

I'd like to add both, as I can't see any drawback to reloading hosts.cfg from xymond_alert's perspective, but the first may be sufficient to get back to the status quo.

Regards, -jc
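As a rough illustration of what "reloading on intervals or on demand" means for a long-running worker, here is a minimal sketch. It is not the patch itself: the `HostList` class, its `loader` callable, and the `request_reload` method are all hypothetical names chosen for this example.

```python
import time


class HostList:
    """Hypothetical holder for a host table that is refreshed periodically."""

    def __init__(self, loader, reload_interval, clock=time.monotonic):
        self._loader = loader              # callable returning the current host names
        self._interval = reload_interval   # seconds between automatic reloads
        self._clock = clock
        self._reload_requested = False
        self._hosts = set(loader())
        self._loaded_at = clock()

    def request_reload(self):
        """Ask for a reload on the next lookup (the "on demand" case)."""
        self._reload_requested = True

    def _maybe_reload(self):
        expired = self._clock() - self._loaded_at >= self._interval
        if expired or self._reload_requested:
            self._hosts = set(self._loader())
            self._loaded_at = self._clock()
            self._reload_requested = False

    def knows(self, name):
        """Return True if the host is defined, reloading first when due."""
        self._maybe_reload()
        return name in self._hosts
```

With this shape, a newly added host becomes visible once the interval has elapsed or once something calls `request_reload()`. How the real daemon receives an on-demand request is not covered by this sketch.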
Possibly worth chiming in here that I use the holidays list, in case anyone is thinking of "simplifying" the code. :-)
-- *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
Ryan Novosielski - Senior Technologist novosirj at rutgers.edu - 973/972.0922 (2x0922) OIRT/High Perf & Res Comp - MSB C630, Newark Rutgers Biomedical and Health Sciences
On Feb 2, 2016, at 09:42, J.C. Cleaver <cleaver at terabithia.org> wrote:
- snip -
<localalertmode-2.patch> <reloadalert.patch>
On 2/2/2016 5:42 AM, J.C. Cleaver wrote:
On Mon, February 1, 2016 4:59 pm, John Thurston wrote:
- snip -
. . . why am I the only one seeing this failure? Why aren't the folks running linux having their alerts fail?
- snip -
It turns out the host info record here is *only* used for display groups and holiday lookups (probably rarely used), within the context of alerting.
And I suspect I am one of the few people using display groups to drive my alerting. I resisted defining alert groups back in the BB days because it seemed like too much work. When I moved to Xymon and I could leverage my existing display groups, I jumped on board.
- snip -
Can you please check the included two patches? One is an update for the previous one, which passes the alert check through (only adding the dummy record in --test mode to begin with), the other adds hosts.cfg reloading on intervals or on demand.
With these patches, my non-production server running 4.3.22 on Solaris 10 is running much better. This is very encouraging :)
Looking at the patch files and reading the new source, am I correct that it adds a couple of startup options to xymond_alert?
--reload-interval=number-of-seconds
--loadhostsfromxymond
where the first specifies the number of seconds after which the contents of hosts.cfg should be reloaded, and the second says hosts.cfg could be retrieved from xymond rather than the filesystem (similar to the existing option for xymongen).
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
On Tue, February 2, 2016 12:24 pm, John Thurston wrote:
- snip -
And I suspect I am one of the few people using display groups to drive my alerting. I resisted defining alert groups back in the BB days because it seemed like too much work. When I moved to Xymon and I could leverage my existing display groups, I jumped on board.
Ahh, yes, this would definitely have affected this then...
Can you please check the included two patches? One is an update for the previous one, which passes the alert check through (only adding the dummy record in --test mode to begin with), the other adds hosts.cfg reloading on intervals or on demand.
With these patches, my non-production server running 4.3.22 on Solaris 10 is running much better. This is very encouraging :)
Indeed! This is slated for RC2 now.
Looking at the patch files and reading the new source, am I correct that it adds a couple of startup options to xymond_alert?
--reload-interval=number-of-seconds
--loadhostsfromxymond
where the first specifies the number of seconds after which the contents of hosts.cfg should be reloaded, and the second says hosts.cfg could be retrieved from xymond rather than the filesystem (similar to the existing option for xymongen).
Correct. Easier to grok in both cases. Actually, all long-running processes that manipulate hosts in something other than a textual way should be periodically reloading, from whichever source they're using.
Regards,
-jc
participants (3)
- cleaver@terabithia.org
- john.thurston@alaska.gov
- novosirj@ca.rutgers.edu