On Mon, 28 Oct 2024 at 23:12, Neil Simmonds via Xymon <xymon@xymon.com> wrote:

Hi all,

This morning our Xymon system (Xymon 4.3.30-1.el8.terabithia on RHEL 8.6) was showing the main screens but when clicking on any status to see the data I was getting “no such host”. I restarted Xymon and it all came back but all pages completely reloaded. We’re missing RRD data for almost 24 hours.

Neil, that's quite a nasty incident. I hope no faults went undetected while Xymon was inoperable.

At 08:30 yesterday we started getting the following in xymongen.log, which carried on every minute until I restarted it at 08:25 this morning.

 

2024-10-27 08:30:30.501407 xymond status-board not available, code 0

2024-10-27 08:30:30.501489 Failed to load current Xymon status, aborting page-update

 

There's nothing in xymonlaunch.log or xymonnet.log, but at that time we got the following in xymond.log (actual servernames have been changed to “servername”).

 

2024-10-27 08:30:13.048311 WARNING: Cannot open directory /etc/xymon/hosts.d

2024-10-27 08:30:13.048338 WARNING: Cannot open directory /etc/xymon/v9hosts.d

2024-10-27 08:30:13.048344 WARNING: Cannot open directory /etc/xymon/dynamicHosts.d


Do these three directories exist, and have files in them?
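If it helps, here is a minimal Python sketch for answering that question quickly. The three paths are copied from the warnings above; the function name `describe_dir` is just illustrative. It should be run as whichever account xymond runs under (often `xymon`, but that is an assumption about your install), since root can read directories that the daemon cannot.

```python
import os

# Paths copied from the xymond.log warnings quoted above.
DIRS = [
    "/etc/xymon/hosts.d",
    "/etc/xymon/v9hosts.d",
    "/etc/xymon/dynamicHosts.d",
]

def describe_dir(path):
    """Return a one-line summary: missing, not listable, or entry count."""
    if not os.path.isdir(path):
        return f"{path}: missing or not a directory"
    try:
        entries = os.listdir(path)
    except OSError as exc:
        # Typically "Permission denied" when the current user lacks read access.
        return f"{path}: cannot list ({exc.strerror})"
    return f"{path}: {len(entries)} entries"

if __name__ == "__main__":
    for d in DIRS:
        print(describe_dir(d))
```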
 

2024-10-27 08:30:13 Flushing filecache

2024-10-27 08:30:13 Rescanning host tree

2024-10-27 08:30:13 Sending dropstate (from xymond) with servername

2024-10-27 08:30:13 Sending dropstate (from xymond) with servername

<snip> 

2024-10-27 08:30:13 Sending dropstate (from xymond) with servername  etc, etc …..

Presumably these servernames all came from different entries in your hosts.cfg file(s)?

Then we repeatedly got

 

2024-10-27 08:35:13 Reloading hostnames

2024-10-27 08:35:13 Flushing filecache

2024-10-27 08:35:13 Reloading client config

2024-10-27 08:35:58.233008 Bogus message from 10.105.3.100: Invalid new hostname ‘servername'

2024-10-27 08:35:58.233102 Bogus message from 10.105.3.100: Invalid new hostname ' servername'


Again, does servername stand for various _different_ servers? And did each one have a leading space inside the quotes, or is that perhaps a by-product of your sanitising?

The "Bogus message...Invalid new hostname" message comes from xymond when it receives a hostname it doesn't recognise (a ghost) AND when that hostname contains a character that's not in the set [a-zA-Z0-9:,._-]. This is consistent with a leading space in the hostname, as if misconfigured on the client, although if you suddenly get lots of these from different hosts, then it seems more likely that the bad hostname was corrupted within xymond.
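To test the un-sanitised hostnames from your log against that character set, something like the following Python sketch would do. It only mirrors the character class quoted above and is not xymond's actual code; the names `looks_valid` and `invalid_chars` are mine.

```python
import re

# Character class as quoted above; assumed here to be the whole rule.
ALLOWED_CHAR = re.compile(r"[a-zA-Z0-9:,._-]")

def invalid_chars(hostname):
    """Return the characters in hostname that fall outside the allowed set."""
    return [c for c in hostname if not ALLOWED_CHAR.fullmatch(c)]

def looks_valid(hostname):
    """True if hostname is non-empty and every character is allowed."""
    return bool(hostname) and not invalid_chars(hostname)

if __name__ == "__main__":
    for name in ["ukawsmon01", " servername"]:
        # repr() makes leading or trailing whitespace visible.
        print(repr(name), looks_valid(name), invalid_chars(name))
```

Printing with `repr()` should make it obvious whether the real names carry a leading space or some other stray character.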

2024-10-27 08:36:00 Saving checkpoint file

2024-10-27 08:36:38 Generating stats

2024-10-27 08:36:38.074722 xymond servername MACHINE='ukawsmon01' not listed in hosts.cfg, dropping xymond status

2024-10-27 08:36:57.044131 Bogus message from 10.105.3.100: Invalid new hostname ' servername

2024-10-27 08:36:57.044237 Bogus message from 10.105.3.100: Invalid new hostname ' servername

2024-10-27 08:37:56.564893 Bogus message from 10.105.3.100: Invalid new hostname ' servername

2024-10-27 08:37:56.564989 Bogus message from 10.105.3.100: Invalid new hostname ' servername

2024-10-27 08:39:02.010103 Bogus message from 10.105.3.100: Invalid new hostname ' servername

2024-10-27 08:39:02.010177 Bogus message from 10.105.3.100: Invalid new hostname ' servername

2024-10-27 08:40:00.879305 Bogus message from 10.105.3.100: Invalid new hostname ' servername

2024-10-27 08:40:00.879360 Bogus message from 10.105.3.100: Invalid new hostname ' servername

Has anyone ever seen this before?


I haven't.

But my best guess is that xymond tried to re-read its hosts.cfg file(s) for some reason, but wasn't able to. If xymond knows no hosts, then xymonnet won't probe any hosts, and client messages will be dropped as ghost messages. Could this have been a filesystem error, causing corruption, and preventing access to the file? That doesn't explain why it all started working after a reload, however. Could this have been the result of someone accidentally altering the permissions of hosts.cfg (or the directory containing it) so xymond couldn't read it?
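To check the permissions theory, a rough Python sketch like this prints the mode and owner of every directory level leading down to the file, so an odd change anywhere along the path stands out. The path `/etc/xymon/hosts.cfg` is my assumption based on the directories in your log; adjust it to your install. `path_report` is an illustrative name.

```python
import os
import stat

# Assumed location, inferred from the /etc/xymon/*.d paths in the log.
HOSTS_CFG = "/etc/xymon/hosts.cfg"

def path_report(path):
    """Return (path, mode string, uid, gid) for each level from / down to path."""
    levels = []
    current = os.path.abspath(path)
    while True:
        levels.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent

    report = []
    for p in reversed(levels):
        try:
            st = os.stat(p)
        except OSError as exc:
            report.append((p, f"stat failed: {exc.strerror}", None, None))
            continue
        report.append((p, stat.filemode(st.st_mode), st.st_uid, st.st_gid))
    return report

if __name__ == "__main__":
    for p, mode, uid, gid in path_report(HOSTS_CFG):
        print(f"{mode}  uid={uid} gid={gid}  {p}")
```

Comparing the output with a known-good server, or with a backup of the same box, would show whether anything changed around 08:30 on the 27th.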

Cheers
Jeremy