On Mon, 28 Oct 2024 at 23:12, Neil Simmonds via Xymon <xymon@xymon.com> wrote:
Hi all,
This morning our Xymon system (Xymon 4.3.30-1.el8.terabithia on RHEL 8.6) was showing the main screens but when clicking on any status to see the data I was getting “no such host”. I restarted Xymon and it all came back but all pages completely reloaded. We’re missing RRD data for almost 24 hours.
Neil, that's quite a nasty incident. I'm hoping you had no undetected faults due to Xymon being inoperable.
At 08:30 yesterday we started getting the following in xymongen.log, which carried on every minute until I restarted it at 08:25 this morning.
2024-10-27 08:30:30.501407 xymond status-board not available, code 0
2024-10-27 08:30:30.501489 Failed to load current Xymon status, aborting page-update
There's nothing in xymonlaunch.log or xymonnet.log but at that time we got the following in xymond.log (actual servernames have been changed to “*servername*”):
2024-10-27 08:30:13.048311 WARNING: Cannot open directory /etc/xymon/hosts.d
2024-10-27 08:30:13.048338 WARNING: Cannot open directory /etc/xymon/v9hosts.d
2024-10-27 08:30:13.048344 WARNING: Cannot open directory /etc/xymon/dynamicHosts.d
Do these three directories exist, and have files in them?
2024-10-27 08:30:13 Flushing filecache
2024-10-27 08:30:13 Rescanning host tree
2024-10-27 08:30:13 Sending dropstate (from xymond) with * servername*
2024-10-27 08:30:13 Sending dropstate (from xymond) with * servername*
<snip>
2024-10-27 08:30:13 Sending dropstate (from xymond) with * servername * etc, etc …..
Presumably these servernames were all different entries from your hosts.cfg file(s)?
Then we repeatedly got
2024-10-27 08:35:13 Reloading hostnames
2024-10-27 08:35:13 Flushing filecache
2024-10-27 08:35:13 Reloading client config
2024-10-27 08:35:58.233008 Bogus message from 10.105.3.100: Invalid new hostname ‘*servername*'
2024-10-27 08:35:58.233102 Bogus message from 10.105.3.100: Invalid new hostname '* servername*'
Again, does servername stand for various _different_ servers? And did each one have a leading space inside the quotes, or is that perhaps a by-product of your sanitising?
The "Bogus message...Invalid new hostname" message comes from xymond when it receives a hostname it doesn't recognise (a ghost) AND when that hostname contains a character that's not in the set [a-zA-Z0-9:,._-]. This is consistent with a leading space in the hostname, as if misconfigured on the client, although if you suddenly get lots of these from different hosts, then it seems more likely that the bad hostname was corrupted within xymond.
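If it helps to test the raw (unsanitised) names from your log, here is a rough Python sketch of that character check. It only mirrors the set I described above; it is not copied from the xymond source, and the function names are made up for illustration.

```python
import re

# Allowed characters as described above: [a-zA-Z0-9:,._-]
# (an assumption based on this email, not taken from the xymond source).
ALLOWED_CHAR = re.compile(r"[A-Za-z0-9:,._-]")

def bad_chars(hostname):
    """Return the characters in hostname that fall outside the allowed set."""
    return [c for c in hostname if not ALLOWED_CHAR.fullmatch(c)]

def looks_valid(hostname):
    """True if hostname is non-empty and uses only allowed characters."""
    return bool(hostname) and not bad_chars(hostname)

if __name__ == "__main__":
    # Example names only; substitute the real strings from xymond.log.
    for name in ["ukawsmon01", " servername"]:
        print(repr(name), looks_valid(name), bad_chars(name))
```

Feeding it the exact strings between the quotes in the "Invalid new hostname" lines should show whether a leading space (or some other stray character) is really there.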
2024-10-27 08:36:00 Saving checkpoint file
2024-10-27 08:36:38 Generating stats
2024-10-27 08:36:38.074722 xymond servername MACHINE='ukawsmon01' not listed in hosts.cfg, dropping xymond status
2024-10-27 08:36:57.044131 Bogus message from 10.105.3.100: Invalid new hostname '* servername*’
2024-10-27 08:36:57.044237 Bogus message from 10.105.3.100: Invalid new hostname '* servername*’
2024-10-27 08:37:56.564893 Bogus message from 10.105.3.100: Invalid new hostname '* servername*’
2024-10-27 08:37:56.564989 Bogus message from 10.105.3.100: Invalid new hostname '* servername*’
2024-10-27 08:39:02.010103 Bogus message from 10.105.3.100: Invalid new hostname '* servername*’
2024-10-27 08:39:02.010177 Bogus message from 10.105.3.100: Invalid new hostname '* servername*’
2024-10-27 08:40:00.879305 Bogus message from 10.105.3.100: Invalid new hostname '* servername*’
2024-10-27 08:40:00.879360 Bogus message from 10.105.3.100: Invalid new hostname '* servername*’
Has anyone ever seen this before?
I haven't.
But my best guess is that xymond tried to re-read its hosts.cfg file(s) for some reason, but wasn't able to. If xymond knows no hosts, then xymonnet won't probe any hosts, and client messages will be dropped as ghost messages. Could this have been a filesystem error, causing corruption, and preventing access to the file? That doesn't explain why it all started working after a reload, however. Could this have been the result of someone accidentally altering the permissions of hosts.cfg (or the directory containing it) so xymond couldn't read it?
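To rule the permissions theory in or out, something like the following Python sketch could be run as the user xymond runs as. The directory paths are the ones from your log; /etc/xymon/hosts.cfg is my assumption about where your main file lives, and check_path is just a hypothetical helper name.

```python
import os

def check_path(path):
    """Classify a path as 'missing', 'unreadable', 'empty-dir' or 'ok'."""
    # Note: 'missing' can also mean a parent directory is not searchable
    # by the current user, since the path then cannot be seen at all.
    if not os.path.exists(path):
        return "missing"
    if os.path.isdir(path):
        # Listing a directory needs both read and search (execute) permission.
        if not os.access(path, os.R_OK | os.X_OK):
            return "unreadable"
        return "ok" if os.listdir(path) else "empty-dir"
    return "ok" if os.access(path, os.R_OK) else "unreadable"

if __name__ == "__main__":
    paths = [
        "/etc/xymon/hosts.cfg",       # assumed location of the main file
        "/etc/xymon/hosts.d",         # directories named in the warnings
        "/etc/xymon/v9hosts.d",
        "/etc/xymon/dynamicHosts.d",
    ]
    for p in paths:
        print(p, "->", check_path(p))
```

Running it as root would hide a permissions problem, so it only tells you something when run under the same account as xymond.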
Cheers,
Jeremy