How many times does xymonnet retry?
Hello, all,
I didn’t find this information in the man pages for xymonnet and xymonnet-again. How many times does xymonnet retry a failed test? Does it go red on the first failure (assuming default configuration), or only after all the retries failed?
Thanks,
glauber
On 1/8/2016 7:13 AM, Ribeiro, Glauber wrote:
Hello, all,
I didn’t find this information in the man pages for xymonnet and xymonnet-again. How many times does xymonnet retry a failed test? Does it go red on the first failure (assuming default configuration), or only after all the retries failed?
It goes red when xymonnet detects the failure. The test is then assigned to xymonnet-again which executes it more frequently for a total of 30 minutes. If it is still red, xymonnet-again quits hammering it.
From the man page for xymonnet-again
Only tests whose first failure occurred within 30 minutes are included in the tests that are run by xymonnet-again.sh. The 30 minute limit is there to avoid hosts that are down for longer periods of time to bog down xymonnet-again.sh. You can change this limit with the "--frequenttestlimit=SECONDS" option when you run xymonnet.
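The window rule in that man page excerpt can be sketched in a few lines of Python. This is only an illustration of the idea, not xymonnet's actual code: the function name, the constant, and the choice of an inclusive comparison at the boundary are all assumptions.

```python
# Hypothetical model of the "first failure within 30 minutes" rule.
# 1800 seconds mirrors the 30-minute default quoted above; the real
# limit is whatever --frequenttestlimit=SECONDS is set to.
DEFAULT_FREQUENT_TEST_LIMIT = 30 * 60


def in_frequent_window(first_failure, now, limit=DEFAULT_FREQUENT_TEST_LIMIT):
    """Return True while a failing test is still young enough to be retried.

    Both timestamps are in seconds. Whether the real boundary is inclusive
    is not stated in the man page, so <= is an assumption.
    """
    return (now - first_failure) <= limit


# A test that first failed 10 minutes ago is still retried frequently;
# one that first failed 45 minutes ago is left to the normal xymonnet runs.
print(in_frequent_window(first_failure=0, now=10 * 60))
print(in_frequent_window(first_failure=0, now=45 * 60))
```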
Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
Thanks, I got that, so there is no set number of repetitions? I.e. it will keep trying for 30 minutes?
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
On 1/8/2016 9:04 AM, Ribeiro, Glauber wrote:
Thanks, I got that, so there is no set number of repetitions? I.e. it will keep trying for 30 minutes?
I see no reference to the _number_ of retries, only to the _duration_ of the effort.
The number of retries will depend on how frequently the attempt is made and how long each attempt takes to fail. The first is probably controlled in code (and may be configurable at run time). The second is dependent on the protocol being tested, the behavior of the network, and the form of the failure.
An ICMP test, for example, may reliably fail and time out in 4 seconds.
An SSH test (also handled by xymonnet) may fail in 4 seconds when it can't initiate a TCP connection. It may also be able to linger on for several minutes if a TCP connection can be established but not maintained.
Is the number of retries significant in your business case?
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
Q: Is the number of retries significant in your business case?
A: Not really, I was just trying to understand how this works to see if it would provide precedent for one of our custom tests, which we are adding retries to.
I think I have a good idea how the retries work now. When a test fails, xymonnet writes information to a text file.
xymonnet-again is a simple script, kicked off once a minute, that looks for that text file; if the file is present, the script feeds it into xymonnet. The file (frequenttests) is simply the command line options for the xymonnet run, including the names of the hosts that had failed tests (but not which tests failed).
So theoretically, things could be retried up to 30 times.
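The mechanism described above (look for the file, and if it exists, hand its contents to xymonnet as command line options) might be sketched like this in Python. The real xymonnet-again.sh is a shell script; the function name, the file path passed in, and the idea of splitting the file contents with shlex are assumptions made for illustration only.

```python
import shlex
from pathlib import Path


def build_retry_command(frequenttests_path, xymonnet="xymonnet"):
    """Return the argv for a retry run, or None if there is nothing to retry.

    frequenttests_path is a hypothetical location for the file that
    xymonnet leaves behind after a run with failed tests.
    """
    path = Path(frequenttests_path)
    if not path.is_file():
        return None  # no recent failures were recorded
    # Assumption: the file holds options and hostnames as on a command line.
    options = shlex.split(path.read_text())
    if not options:
        return None
    return [xymonnet, *options]
```

A once-a-minute job would call build_retry_command() and, when it returns a list, pass that list to subprocess.run(). Nothing here counts attempts, which matches the observation that the limit is a duration rather than a retry count.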
This is correct. In some other monitoring systems (Nagios/Icinga come to mind), there's a notion of a Hard Fail vs Soft Fail and the scheduling system can run checks several times before a "Hard Fail" is recorded.
Because there's no discrete scheduling system (or dispatcher) within xymon, it doesn't really have that same model, and the built-in tools like xymonnet don't conceptualize it.
Fundamentally, you can have any number of things running tests, each at its own frequency and with its own decision process, and xymond simply accepts their reports and displays or handles them as needed.
As xymonnet runs at intervals, each run is distinct. If it's down/slow/hung/whatever, it's marked as such and is not tested again during that execution.
If you add that together, though, it provides other options for administrator-defined recurrence, such as the "xymonnet-again.sh" script, as you've seen.
When we were migrating from a system that had been configured to retry 3 times before alerting, we realized that xymon was so much more efficient (shameless plug ;) ) that we could lower our xymonnet interval greatly and just make sure that 3 entire runs would complete before the "red" alert was sent (using the DURATION value in alerts.cfg(5)).
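As a rough worked example of that arithmetic: if the alert should wait until a given number of full xymonnet runs have had a chance to complete, the DURATION threshold has to cover at least that many intervals. The helper below is a hypothetical sketch, not part of xymon. It assumes the DURATION value is expressed in minutes (check alerts.cfg(5) for the units your version accepts) and rounds up to stay on the conservative side.

```python
import math


def min_alert_duration_minutes(run_interval_seconds, runs_before_alert=3):
    """Smallest whole number of minutes that spans the requested runs.

    run_interval_seconds is how often xymonnet is launched; the result is
    a candidate threshold for a DURATION rule, rounded up to whole minutes.
    """
    total_seconds = run_interval_seconds * runs_before_alert
    return math.ceil(total_seconds / 60)


# With xymonnet launched every 60 seconds, 3 runs span 3 minutes;
# with a 300-second interval, the same 3 runs span 15 minutes.
print(min_alert_duration_minutes(60))
print(min_alert_duration_minutes(300))
```

The time each run itself takes is ignored here, so on a heavily loaded server a slightly larger value may be needed.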
xymonnet-again.sh itself is somewhat basic, but you can script up any number of additional ways of dispatching with the same concept. I have a script on another server that queries xymond for any non-green 'dns' tests every 10s and re-scans just those hosts with lower --timeout values.
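A script along those lines could be structured roughly as follows. This is not the script described above, only a sketch under stated assumptions: it assumes the status board has already been fetched from xymond as text with one "hostname|testname|color" line per status (the exact query and field list should be checked against xymon(1)), and the function names are hypothetical.

```python
def non_green_hosts(board_output, test="dns"):
    """Return hosts whose given test is not green, in first-seen order.

    board_output is assumed to contain 'hostname|testname|color' lines.
    """
    hosts = []
    for line in board_output.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        hostname, testname, color = fields[:3]
        if testname == test and color != "green" and hostname not in hosts:
            hosts.append(hostname)
    return hosts


def rescan_command(hosts, timeout_seconds=5, xymonnet="xymonnet"):
    """Build an argv that re-tests only the given hosts with a short timeout.

    Returns None for an empty list, so that a run with no hostnames
    (which would test every host) is never built by accident.
    """
    if not hosts:
        return None
    return [xymonnet, f"--timeout={timeout_seconds}", *hosts]
```

A loop that sleeps 10 seconds between iterations, fetches the board, and passes any non-None result to subprocess.run() would complete the picture.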
As above, I've found interval scanning and adjusting your duration to be simpler conceptually and to handle most of the cases that are needed. It also sidesteps the problem of an overloaded scheduler during a crisis, leaving just the extra time needed for failing TCP tests in the first place.
HTH, -jc
On Sat, Jan 9, 2016 at 5:41 AM John Thurston <john.thurston at alaska.gov> wrote:
On 1/8/2016 9:04 AM, Ribeiro, Glauber wrote:
Thanks, I got that, so there is no set number of repetitions? I.e. it will keep trying for 30 minutes?
The number of retries will depend on how frequently the attempt is made and how long each attempt takes to fail.
I don't think it is dependent on how long it takes to fail. The xymonnet-again.sh script is run every minute, and so it will probably retry either 29 or 30 times, depending on the exact timing of its launch.
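The "29 or 30" figure can be checked with a small simulation. This is a simplified model, not a description of xymon internals: it assumes the retry script starts at a fixed offset after the first failure, runs every 60 seconds, and only counts runs that start strictly before the 30-minute window closes.

```python
def count_retries(first_run_offset, period=60, window=30 * 60):
    """Count retry runs that start before the window closes.

    first_run_offset: seconds between the first failure and the first
    retry run. All values are in seconds and are assumptions of this model.
    """
    count = 0
    t = first_run_offset
    while t < window:
        count += 1
        t += period
    return count


# A retry run that starts 1 second after the failure fits 30 runs into
# the window; one that starts a full 60 seconds later fits only 29.
print(count_retries(1))   # 30
print(count_retries(60))  # 29
```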
participants (4): cleaver@terabithia.org, glauber.ribeiro@experian.com, jlaidman@rebel-it.com.au, john.thurston@alaska.gov