http checks red with timeout, xymonnet fast
Dear all,
From one moment to the next, all our https tests went red because of timeouts. As far as we can tell, there is no firewall or network issue. When we use another machine, everything is fine. We suspect a hardware problem with the main board (Sun T3, Solaris), but support says the machine is OK. When I make the request directly with xymonnet, the reply is quick. Any ideas how to debug this, or what I am doing wrong? What is the difference between the check made by the Xymon server and the manual check? We are using Xymon 4.3.17.
example entry in hosts.cfg:
ip column_name # noconn HIDEHTTP cont=xxx;https://address?id=354400185;keyword
Result: Server timeout Seconds: 18.94
console:
xymonnet --timing --ssl=ssl --content=keyword https://address?id=354400185
Result: TIME TOTAL 0.011761
Any help appreciated,
regards
Rolf
Rolf Schrittenlocher
LBS-IT Systembetreuung lbs-it at ub.uni-frankfurt.de
LBS-IT shared line: 069 798-28830
Personal: schritte at ub.uni-frankfurt.de
Direct: 069 798-28908
Hi Rolf
Can you show the red alert page for a failed https test?
Is there anything showing in the xymonnet.log file?
What is the status of the xymonnet test page for your Xymon server? (This will probably show the most recent messages from the xymonnet.log file, among other info.)
When running xymonnet, are you switching to the xymon user and running xymoncmd to set up the environment?
Cheers Jeremy
On Tue, 27 Apr 2021 at 23:12, Schrittenlocher, Rolf < R.Schrittenlocher at ub.uni-frankfurt.de> wrote:
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
Hi Jeremy and others,
Thanks for the help.
> Can you show the red alert page for a failed https test?
Wed Apr 28 09:02:48 2021: Server timeout
[red] https://xxx - Server timeout
Seconds: 18.92
> Is there anything showing in the xymonnet.log file?
No, it is empty.
> What is the status of the xymonnet test page for your Xymon server? (This will probably show the most recent messages from the xymonnet.log file, among other info.)
I'm not quite sure what that test page is. I added an http check http://our_xymon_server/xymon-cgi/svcstatus.sh?HOST=our_xymon_server&SERVICE... for our Xymon server (no certificate, no https) and the result is interesting. It alternates between less than 1 second and a timeout.
HTTP/1.1 200 OK
Date: Wed, 28 Apr 2021 07:13:15 GMT
Server: Apache/2.4.26 (Unix) OpenSSL/1.0.2u
Last-Modified: Wed, 28 Apr 2021 07:12:16 GMT
ETag: "1606f-5c103193e9130"
Accept-Ranges: bytes
Content-Length: 90223
Connection: close
Content-Type: text/html
Seconds: 0.81

HTTP/1.1 200 OK
Date: Wed, 28 Apr 2021 07:17:55 GMT
Server: Apache/2.4.26 (Unix) OpenSSL/1.0.2u
Last-Modified: Wed, 28 Apr 2021 07:17:19 GMT
ETag: "14eaa-5c1032b4f2e80"
Accept-Ranges: bytes
Content-Length: 85674
Connection: close
Content-Type: text/html
Seconds: 18.91
> When running xymonnet, are you switching to the xymon user and running xymoncmd to set up the environment?
I did it as the xymon user but without xymoncmd. I did it again with xymoncmd, with very little difference.
Once again: the error first occurred at a time when no one was making any changes. After switching the Xymon server to another machine in the same subnet, everything is fine. The Xymon, Apache, etc. installations on both machines are identical; only the hardware differs.
Any ideas?
Greetings
Rolf
On Wed, 28 Apr 2021 at 17:37, Schrittenlocher, Rolf < R.Schrittenlocher at ub.uni-frankfurt.de> wrote:
>> What is the status of the xymonnet test page for your Xymon server? (This will probably show the most recent messages from the xymonnet.log file, among other info.)
> not quite sure what that test page is.
Here's an example:
[image: image.png]
Click on the green dot to see the xymonnet test page.
> I added a http check http://our_xymon_server/xymon-cgi/svcstatus.sh?HOST=our_xymon_server&SERVICE... for our xymon server (no certificate, no https) and the result is interesting. It alternates between less than 1 second and timeout
This seems to show a failure and then 6 seconds later a success. It's like there are two xymonnet processes, one reporting a failure and the other reporting success.
Can you please take a look at two status messages (click on the red/green dot) that are a few seconds apart, and for each, check the IP address in the message near the bottom that says, "Status message received from <IP>" and see if they're both the same, and confirm that they are of the Xymon server?
I'm not sure what's going on here, but I would try running a packet trace (e.g. tcpdump), inspecting the traffic for the two connections, one that works and one that doesn't, and comparing them.
Also perhaps try running xymonnet manually again, but many times, to see if it's an intermittent fault.
Check the execution parameters in tasks.cfg for the [xymonnet] section, and make sure you're running xymonnet with the same parameters.
The problem might also be a DNS lookup issue. Try setting "testip" for the web server in your hosts.cfg file and see if the delay goes away.
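To make the "run it many times" suggestion concrete, here is a minimal Python sketch that runs a command repeatedly and records how long each run takes. It is only an illustration and not part of Xymon: the helper names time_runs and slow_runs, the 20 runs, the 1-second pause and the 5-second threshold are all assumptions, and the xymonnet command in the closing comment is simply the one from the first message (it should be adjusted to match the parameters in tasks.cfg).

```python
import subprocess
import time


def time_runs(cmd, runs=20, pause=1.0, timeout=60.0):
    """Run cmd repeatedly; return a list of (seconds, exit_code_or_None)."""
    results = []
    for i in range(runs):
        start = time.monotonic()
        try:
            proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL, timeout=timeout)
            code = proc.returncode
        except subprocess.TimeoutExpired:
            code = None  # the command did not finish within `timeout`
        results.append((time.monotonic() - start, code))
        if i < runs - 1:
            time.sleep(pause)  # small gap between runs
    return results


def slow_runs(results, threshold=5.0):
    """Return the runs that took longer than `threshold` seconds."""
    return [r for r in results if r[0] > threshold]


# Example use (run as the xymon user, inside the xymoncmd environment):
#
#   cmd = ["xymonnet", "--timing", "--ssl=ssl", "--content=keyword",
#          "https://address?id=354400185"]
#   results = time_runs(cmd)
#   for seconds, code in results:
#       print(f"{seconds:8.3f}s  exit={code}")
#   print(len(slow_runs(results)), "of", len(results), "runs exceeded 5 s")
```

If some runs are fast and others take close to 19 seconds, the fault is intermittent and can also be reproduced outside the scheduled test.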
Dear Jeremy, dear all,
> What is the status of the xymonnet test page for your Xymon server? (This will probably show the most recent messages from the xymonnet.log file, among other info.)
xymonnet is green, nothing peculiar.
>> I added a http check http://our_xymon_server/xymon-cgi/svcstatus.sh?HOST=our_xymon_server&SERVICE... for our xymon server (no certificate, no https) and the result is interesting. It alternates between less than 1 second and timeout
> This seems to show a failure and then 6 seconds later a success. It's like there are two xymonnet processes, one reporting a failure and the other reporting success.
> Can you please take a look at two status messages (click on the red/green dot) that are a few seconds apart, and for each, check the IP address in the message near the bottom that says, "Status message received from <IP>" and see if they're both the same, and confirm that they are of the Xymon server?
Yes, both are from the Xymon server's IP.
> I'm not sure what's going on here. But I would try running a packet trace (eg tcpdump) and inspecting the traffic for the two connections - one that works and one that doesn't - and compare.
We did a (general) tcpdump but couldn't detect anything. Well, no one here is a specialist in this. I'll try again more precisely.
I would rule out DNS, as Solaris looks in /etc/hosts first before asking a nameserver, and some of the addresses are included in /etc/hosts. Ping works fine as well.
Thank you for your help, Jeremy. I think this is really a very special and local problem. I'll try to get our network specialists involved. That is difficult, as they are more than busy in times when everything happens online :-)
We'll use another machine for Xymon, and I am afraid we have to live with the fact that this machine isn't totally reliable any more.
cheers
Rolf
On Thu, 29 Apr 2021 at 18:54, Schrittenlocher, Rolf < R.Schrittenlocher at ub.uni-frankfurt.de> wrote:
> I would exclude DNS as Solaris looks first in /etc/hosts before asking a nameserver and some of the addresses are included in /etc/hosts. As well, ping works fine.
Xymon (typically) uses its own DNS resolver library, so testing that DNS works for Solaris commands might not give you the same results.
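One way to look at name resolution independently of /etc/hosts is to send a query straight to a nameserver and time the answer. The following Python sketch does that with a hand-built A-record query over UDP. It is an illustration under stated assumptions and does not reproduce what Xymon's resolver library does: the function names are made up for this example, the reply is not validated, and the nameserver address has to be supplied by you (for example, one listed in /etc/resolv.conf).

```python
import socket
import struct
import time


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(hostname, nameserver, timeout=20.0):
    """Return seconds until the nameserver sends any reply, or None on timeout."""
    packet = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (nameserver, 53))
            sock.recvfrom(512)  # reply content is not checked in this sketch
        except socket.timeout:
            return None
        return time.monotonic() - start


# Example use; 192.0.2.53 is a placeholder, not a real nameserver:
#   print(time_query("address", "192.0.2.53"))
```

A reply time in the range of many seconds, or None, would point towards the nameserver path rather than the web server itself.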
One option you might look into is to run truss (strace on Linux) and attach it to the xymonnet process. Then when it performs its check, truss will show you all of the system calls that it makes. You might see it pause on a particular system call for 15 seconds before continuing on its way, and knowing that system call might lead you to the cause of the problem.
(It's been many years since I've used truss. I think that dtrace might have replaced it?)
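If the trace is written to a file with per-line time deltas, the long pause can be picked out mechanically. The Python sketch below assumes that each line carries a seconds.fraction delta near the start, which is what truss -D and strace -r print. The regular expression, the function name slow_calls and the 5-second threshold are assumptions for illustration, and the exact line layout on your system may differ.

```python
import re

# Assumed line shape: optional "pid:" or "pid/lwp:" prefix, then a
# seconds.fraction delta, then the system call text.
LINE_RE = re.compile(r"^\s*(?:\d+(?:/\d+)?:\s+)?(\d+\.\d+)\s+(.*)$")


def slow_calls(lines, threshold=5.0):
    """Yield (delta_seconds, call_text) for lines whose delta exceeds threshold."""
    for line in lines:
        match = LINE_RE.match(line)
        if match and float(match.group(1)) > threshold:
            yield float(match.group(1)), match.group(2).strip()


# Example use, with a hypothetical trace file name:
#   with open("xymonnet.truss") as trace:
#       for delta, call in slow_calls(trace):
#           print(f"{delta:10.4f}  {call}")
```

Note that the delta measures the gap before the flagged line. With truss -D it is the time since the previous reported event, and with strace -r it is the time since the previous call was entered, so the call that actually stalled may be the flagged line or the one just before it.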
Good luck with it.
Cheers Jeremy
participants (2): jeremy@laidman.org, R.Schrittenlocher@ub.uni-frankfurt.de