We've run xymon for years, and BB for years before that. We've recently bumped into a potential problem with xymonnet. In the middle of the night, all of our tests populated by xymonnet went purple. We think we've tracked down the problem, and attribute it to an ldap query which took too long to respond. It looks like xymonnet just stopped while it waited for this query to complete. While it was waiting on this response, none of the other xymonnet tests were being performed.
The man page for xymonnet indicates there is a --timeout=N option, which determines how long to wait for a service to accept a connection. Is there some parameter which will control how long xymonnet can try to get an answer after connecting?
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Department of Administration State of Alaska
We've had a similar problem with another service. Looking at the xymonnet man page, there doesn't appear to be a general timeout (http, yes, but not generally). I use an expect script (as an external test) where I can spawn a process (in this case the ldap call). I set a timeout in expect, so if the spawned connection hangs, the timeout triggers and terminates the connection.
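For illustration, the same idea (spawn the check, enforce a deadline, kill it when the deadline passes) can be sketched in Python instead of expect. This is only a sketch under stated assumptions: the stand-in command, the 2-second limit, and the mapping of results to "green" and "red" are invented for the example and do not come from the thread, and reporting the result back to Xymon is left out entirely.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (color, detail).

    subprocess.run() kills the child and raises TimeoutExpired if the
    command has not finished within timeout_s seconds.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
            universal_newlines=True,
        )
    except subprocess.TimeoutExpired:
        return "red", "no answer within %s seconds" % timeout_s
    except OSError as exc:
        return "red", "could not start check: %s" % exc
    # Assumption: exit status 0 means the check passed.
    color = "green" if proc.returncode == 0 else "red"
    return color, proc.stdout.strip()


if __name__ == "__main__":
    # Hypothetical stand-in for a slow LDAP query: a child that sleeps 30s.
    slow_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow_cmd, 2))
```

The point of the design is that the deadline covers the whole exchange, not just the connection attempt, so a service that accepts the connection and then never answers still gets cut off.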
From: Xymon <xymon-bounces at xymon.com> on behalf of John Thurston <john.thurston at alaska.gov> Sent: Thursday, September 20, 2018 7:48:59 AM To: xymon at xymon.com Subject: [Xymon] xymonnet blocking
In your tasks.cfg, try adding "MAXTIME 10m" to the "[xymonnet]" block. Adjust the "10m" to your taste. It should probably be something shorter than your STATUSLIFETIME.
Dave
This is exactly what I was looking for, Dave. Thank you.
I'll note here that it kinda-sorta works as expected. The documentation indicates, "The time is in seconds by default, you can specify minutes, hours or days by adding an "m", "h" or "d" after the number." So MAXTIME 30 means the limit on the task is 30 seconds, while MAXTIME 30m means the limit is 30 minutes.
Unfortunately, that is not the behavior I see (4.3.28 on Solaris 10). If I append any letters after the number, it appears to work as expected until it randomly stops and throws a line in xymonlaunch.log:
    Killing hung task xymonnet (PID 3210) after 5 seconds
The time mentioned varies a bit, but it bears no relationship to the time I've specified in the MAXTIME option.
When I specify 1m, it will sometimes kill while announcing a 5s limit. When I specify 7m, it will sometimes kill while announcing a 10s limit.
It seems to work as expected if I specify the time in seconds (with no alphabetic unit suffix). So to get my desired 7-minute max, I've put MAXTIME 420.
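As a small aid for that workaround, the documented syntax (seconds by default, with an "m", "h" or "d" suffix for minutes, hours or days) can be converted to plain seconds before the value goes into tasks.cfg. The Python sketch below only illustrates the rule quoted from the documentation; it is not the parsing code from xymonlaunch.c, and the function names are made up for this example.

```python
# Multipliers for the unit suffixes described in the documentation.
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def to_seconds(value):
    """Convert a string such as "30", "7m", "2h" or "1d" to whole seconds."""
    text = value.strip()
    if not text:
        raise ValueError("empty duration")
    multiplier = 1
    if text[-1] in UNIT_SECONDS:
        # Strip the suffix and remember how many seconds one unit is worth.
        multiplier = UNIT_SECONDS[text[-1]]
        text = text[:-1]
    if not text.isdigit():
        raise ValueError("not a valid duration: %r" % value)
    return int(text) * multiplier


def maxtime_line(value):
    """Return a MAXTIME setting written in plain seconds (hypothetical helper)."""
    return "MAXTIME %d" % to_seconds(value)


if __name__ == "__main__":
    print(maxtime_line("7m"))  # MAXTIME 420
```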
The syntax noted for MAXTIME is the same as is noted for INTERVAL. I am using 5m as the value for INTERVAL with no difficulty. I've also dug in the source code for xymonlaunch.c. The segments of code for the two options look the same. I am unable to explain why INTERVAL accepts my time with an 'm' unit suffix while MAXTIME will not.
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Department of Administration State of Alaska
participants (3)
- doughnut@doughnut.net
- john.thurston@alaska.gov
- Phil.Crooker@orix.com.au