All:
Ever since I upgraded my work setup from Xymon 4.3.12 to 4.3.28 (on RHEL 7.6), I've been seeing alerts that should fire only once (or maybe twice) but instead keep repeating.
We keep getting the same alert every hour until someone restarts the Xymon service.
Once the service is restarted, the alert goes away and we get a "Xymon [hostname]:[service] recovered? (stale)" alert in its place.
How often this happens seems somewhat random. Some days I don't get any. Then 3 days ago I had 6 instances, 2 days ago there were only 2, then 5 more yesterday. They are usually "msgs" service alerts, but not always. I've seen some for "conn" or "cpu", and even one for a "telnet" check we have for an APC UPS.
What really surprised me is that in my 5 1/2+ years of archives of this mailing list, I don't see any mentions of this issue.
I could toss in a "cron" job to automatically restart Xymon once a day, but that's a kludge.
What could be possible causes of these 'stuck'/repeated alerts, which end up becoming stale?
I noticed that alerts.cfg(5) says:
(A stale alert is one where the service recovered during a time that xymond_alert was not running.)
but that doesn't seem to be applicable here - unless it's describing the brief period between stopping & restarting the Xymon service.
- Greg