Josh's point about how the IP address used for probes comes from DNS rather than hosts.cfg is a good one. It would be worth adding "testip" to the hosts.cfg line, to isolate any issues with DNS. (Most of the servers I monitor are DNS servers, so they all have "testip" because I need to be able to monitor all hosts even if the DNS service is broken.)
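
For illustration only, here is roughly what that would look like if applied to the hosts.cfg line quoted further down in this thread. The redactions are kept as in the original, and putting "testip" at the end is just my choice; as far as I know the order of the tags after the "#" doesn't matter.

```
<IP address> <hostname>.ichabodcrane.org # route:10.5.0.1 https://<hostname>.ichabodcrane.org noconn testip
```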

Changes to hosts.cfg should be picked up automatically, although not immediately. The hosts.cfg file is checked for updates every 5 minutes by default. You can see when this happens because xymond.log says "Reloading hostnames". You can trigger an immediate reload by sending a HUP signal to the xymond process (e.g. "sudo -u xymon pkill -HUP -x xymond"). All other Xymon processes, such as xymonnet (unless configured otherwise), get their list of hosts from xymond, so once xymond has reloaded hosts.cfg, any updates should be live everywhere.
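
If it helps, a small shell helper along these lines could confirm when the last reload happened. It is only a sketch: the function name is made up, and it assumes nothing about the log beyond the "Reloading hostnames" text mentioned above, so you pass the path to your own xymond.log.

```shell
# Print the most recent "Reloading hostnames" line from a xymond log file.
# $1 = path to xymond.log (the location varies by install, so pass it explicitly)
# last_hosts_reload is a hypothetical helper name, not part of Xymon.
last_hosts_reload() {
    grep -F 'Reloading hostnames' "$1" | tail -n 1
}
```

After sending the HUP, something like "last_hosts_reload /path/to/xymond.log" should then show a line with a fresh timestamp.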

You can verify that changes to the hosts.cfg file have been picked up by asking the xymond process to show you what it has: xymon 127.0.0.1 "hostinfo host=hostname"

The "drop" command lines you provided don't put quotes around the message after 127.0.0.1. The xymon command takes at most two arguments (the recipient and the message), so you need to wrap the whole "drop hostname" message in quotes so that it's treated as a single argument:

$ xymon 127.0.0.1 "drop hostname sslcert"

Also, the hostname must match the second field in hosts.cfg, so you can't use an IP address.

It should be sufficient to drop the "http" and "sslcert" test status, rather than all statuses.
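
As a rough sketch of that, a tiny wrapper could drop just the named tests for one host. The function name and the XYMON override variable are my own inventions for illustration; the only Xymon-specific part is the quoted "drop hostname testname" message shown above.

```shell
# Drop only the named test statuses for one host, leaving the rest alone.
# $1 = hostname exactly as in the second field of hosts.cfg
# remaining arguments = test (column) names to drop
# XYMON (assumption) may hold the path to the xymon binary; defaults to "xymon".
drop_tests() {
    host=$1
    shift
    for t in "$@"; do
        # The whole "drop host test" message is passed as one quoted argument.
        "${XYMON:-xymon}" 127.0.0.1 "drop $host $t"
    done
}
```

For this case it would be called as "drop_tests <FQDN> http sslcert", with the FQDN spelled exactly as in hosts.cfg.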

One last thought is that I had an issue with a stuck status once, which I had to remedy by modifying the xymond checkpoint file. The xymond process periodically dumps its state to this file, so that if it crashes, it can restore the state. The file is called xymond.chk. In my case, I recall there was a state that had a timestamp in the future, or something silly like that. It might be interesting to search for the hostname in that file, and see what the state looks like.

To clear a faulty state, you'd need to stop the xymond process (probably by stopping all of Xymon) and then edit/delete the state file, and then start the xymond process back up again. Editing the host's entry in the file would be less disruptive than deleting the whole file, but the fields aren't well documented so the simplest thing would be to find and delete the line with the troublesome hostname.
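
A cautious way to do the find-and-delete step might look like the sketch below. It treats the checkpoint as opaque text and assumes only that each state is on a single line and that the hostname appears literally on it; I haven't verified the field layout. The function names are made up. Note that a plain substring match will also catch any other host whose name contains the one you give, so look at what the first function prints before using the second.

```shell
# Show the checkpoint lines that mention a host.
# $1 = path to the checkpoint file, $2 = hostname as it appears in hosts.cfg
chk_show_host() {
    grep -F -- "$2" "$1"
}

# Print the checkpoint contents with those lines left out.
# Writes to stdout only; the original file is not modified.
chk_without_host() {
    grep -v -F -- "$2" "$1"
}
```

With Xymon stopped, you would keep a backup copy of xymond.chk, redirect the output of chk_without_host to a new file, move that into place, and then start Xymon again.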

On Fri, 21 Mar 2025 at 07:56, Josh Luthman <josh@imaginenetworksllc.com> wrote:

Why is the http test red?

On Thu, Mar 20, 2025 at 4:46 PM Jaime Kikpole <jkikpole@ichabodcrane.org> wrote:
From hosts.cfg, with redactions between < and >:
<IP address> <hostname>.ichabodcrane.org # route:10.5.0.1 https://<hostname>.ichabodcrane.org noconn

The ghosts report shows no clients.

The hostname definitely resolves (tested on the VM with the xymon server) to the IP address. The purple alert is in a column labeled "sslcert" and the red is in "http". The data in the page with the details about sslcert shows the correct dates and a green alert. The data in the page with the http details shows the message "SSL error" after the https://.... URL. When viewing that URL, it redirects to a path on the same server. This hasn't given me trouble over the last several years and the only thing that changed on the monitored system is that its IP address changed at the VM hosting service.

Jaime Kikpole
Director of Technology
Ichabod Crane Central School District
(518) 758-7575, x5425

On Thu, Mar 20, 2025 at 4:33 PM Josh Luthman <josh@imaginenetworksllc.com> wrote:
The IP in the first column is used in the event the name can't be resolved (or # testip is appended).

The purple test means there is no report for the test (for half an hour I think?) In this case you're testing from the Xymon Server so it is responsible for doing the test.

Are you certain the name of the host matches? Are there any ghosts?

Are you looking at the "sslcert" column? What does your hosts.cfg line look like - does it have an https url?

On Thu, Mar 20, 2025 at 3:38 PM Jaime Kikpole via Xymon <xymon@xymon.com> wrote:
Thanks for the ideas. That gave me more to try and I made some progress, but not much.

I discovered that hosts.cfg still had the old IP address. I updated that, but there was no change. I restarted the xymon server process and there was no change. I did a "xymon 127.0.0.1 drop <IP address>" and there was no change. Would you suggest a "xymon 127.0.0.1 drop <FQDN> http" or something like that?

I see what I would expect when using "curl https://<FQDN>" and the logs only show one interesting line within 2025, so I suspect that it all comes down to that one hosts.cfg line with the old IP address.

When I used "grep <server> * | grep 2025 | less", I see the following line and then everything else was from notifications.log about emailing my department's outages notification email address.

history.log:2025-03-18 14:54:12.042214 Will not update /usr/local/www/xymon/data/hist/server,fqdn,redacted.sslcert - color unchanged (purple)

What do you think? Should I just deal with the loss of historical data and drop the host by FQDN?

Jaime Kikpole
Director of Technology
Ichabod Crane Central School District
(518) 758-7575, x5425

On Wed, Mar 19, 2025 at 7:49 PM Jeremy Laidman <jeremy@laidman.org> wrote:
Hi Jaime

On Thu, 20 Mar 2025 at 02:11, Jaime Kikpole via Xymon <xymon@xymon.com> wrote:
I have a Xymon install that was monitoring (among other things) an HTTPS site. That site moved to a new IP and while I did update the public DNS records it was a few days (or maybe two weeks?) before I remembered to update the internal DNS records. By then, Xymon was reporting red for HTTPS and purple for SSL.

I've corrected the DNS records and the VM running Xymon is resolving it correctly, but now it is still showing the same red and purple alerts. Every now and then, they'll switch to green for a little while and then back to red and purple. I'm not sure why.

Xymon's web interface shows that the certificate is valid for more than 2 months. So that isn't it. I've restarted the Xymon server process and even the whole OS, but this issue remains.

Any ideas on what I could check?

First, review the content of the https status page. There may be non-obvious clues here. There's a difference between "connection refused" and "connection timed out" and this can lead you in different directions to identify the cause. For example, if you have a content check configured ("cont=..." in hosts.cfg), which is failing due to new content on the new webserver, this will be explained on the status page.

Second, I would check that your Xymon server can actually connect to the new IP address on port 443. I usually use "telnet <IP> <port>" and see if it connects, gets a connection refused, or a timeout, but netcat/ncat/nc works too. For an HTTPS website (as distinct from HTTP) I sometimes use an openssl command to see if the TLS/SSL interaction works, which tells me that there's no issue on the webserver itself, something like "openssl s_client -connect <IP>:<port> </dev/null". Even simpler, run curl or wget to fetch a webpage, but if you have a proxy set anywhere, this may give you an invalid result.
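
One variation on the openssl check, sketched below: when connecting by IP address, openssl may not send a server name, so a webserver that picks its certificate by name could answer differently than it does for Xymon or a browser. The "-servername" option of s_client sets that name explicitly. Whether this matters for your server is a guess on my part, and the function name and the OPENSSL override variable are hypothetical.

```shell
# Attempt a TLS handshake with a specific IP while sending a chosen server name.
# $1 = IP address, $2 = port, $3 = hostname to send as the TLS server name (SNI)
# OPENSSL (assumption) may hold the path to the openssl binary; defaults to "openssl".
tls_probe() {
    # stdin comes from /dev/null so s_client exits after the handshake
    "${OPENSSL:-openssl}" s_client -connect "$1:$2" -servername "$3" </dev/null
}
```

For example, "tls_probe <IP> 443 <FQDN>" should print the certificate chain the server presents for that name, or an error if the handshake fails.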

Third, I would check the xymonnet.log file for any indications of problems, which might suggest a cause. Running xymonnet in debug mode might help (add "--debug" to the CMD line of the [xymonnet] block in tasks.cfg). Alternatively, you could run xymonnet directly yourself and look at the output. Something like this: xymoncmd xymonnet --no-update --debug <hostname>

Good luck with finding the problem.

Cheers
Jeremy
