Another thing that happened after my recent migration/Xymon upgrade is that I started getting phantom "disk" alerts purporting to be from the Xymon server itself.
They looked like this:
--
To: sysadmins at my.do.main
Subject: Xymon [556665157] mgmt:disk CRITICAL (RED)
Message-Id: <20190318195008.53EF4635153 at mgmt.my.do.main>
From: xymon at mgmt.my.do.main (xymon Monitor (client))

red Mon Mar 18 12:50:04 PDT 2019 - Filesystems NOT ok
&red /export/bkd05d (98% used) has reached the PANIC level (98%)
[...]
The thing is, the partition "/export/bkd05d" does not exist on the Xymon server host "mgmt".
It exists on a completely different system (and I know exactly which one it is). I've seen other alerts like this where the disk partitions mentioned are from other systems, too.
In short, the Xymon server is receiving the reports from the clients, but somehow they are being mangled so that they appear to come from 127.0.0.1 and thus look local to the server, which then generates red alerts for itself.
In many cases these are filesystems for which I already had exception clauses in "analysis.cfg", so I never get alarms from the actual client host. Suddenly getting "back from the dead" red alarms for them was a surprise, to say the least.
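The mismatch described above (mount points in the "mgmt" disk report that do not exist on "mgmt") could be checked mechanically. What follows is only a minimal sketch, not part of Xymon: the function names are hypothetical, the df-style text is assumed to be the "df" section of the client data the server holds for "mgmt" (which I believe can be fetched with the "clientlog" command of xymon(1), though that is worth verifying against your version), and reading /proc/mounts assumes the server runs Linux. Mount points containing spaces are not handled.

```python
def reported_mounts(df_text):
    """Return the mount points listed in df-style output.

    Assumes the first line is the df header and that the mount point is
    the last whitespace-separated field of each data row (so mount points
    containing spaces are not handled).
    """
    mounts = []
    for line in df_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[-1].startswith("/"):
            mounts.append(fields[-1])
    return mounts


def foreign_mounts(df_text, local_mounts):
    """Return reported mount points that do not exist on this host."""
    local = set(local_mounts)
    return [m for m in reported_mounts(df_text) if m not in local]


def local_mount_points(path="/proc/mounts"):
    """Read the local mount points (Linux only; an assumption)."""
    mounts = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) > 1:
                mounts.append(fields[1])  # second field is the mount point
    return mounts
```

Calling foreign_mounts(df_text, local_mount_points()) on the server with the stored "mgmt" df section would list any filesystems that must have come from some other client's report, which at least confirms whether the mix-up is already present in the stored client data or only appears later.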
I've kludged around it by making a special pseudo-clause in "analysis.cfg" for the Xymon server for all of these disk partition exceptions:
--
XXX - KLUDGE dummy entries to prevent Xymon from reporting false
XXX - "red" disk alerts for systems with faulty "127.0.0.1" reports
HOST=mgmt
    DISK /export/brick1 101 101
    DISK /export/data 101 101
    DISK /export/data1 100 100
    DISK /export/data2 99 100
    DISK /export/work 99 100
    DISK %(?-i)^.*/Volumes/Time 101 101
    DISK /media/Oracle_Solaris-11_3-Text-SPARC 101 101
    DISK /media/Solaris-11_3_28_4_0-Boot-SPARC 101 101
    [... more here ...]
but obviously I would prefer to solve the problem so I can remove this.
What changed between Xymon 4.3.12 and 4.3.28 to cause this? How do I debug it?
- Greg