On Wed, June 15, 2016 9:30 am, Christoph Berg wrote:
Re: Axel Beckert 2016-06-15 <20160615155816.GD29167 at phys.ethz.ch>
In the past few months I found more and more indications of a strange bug in (at least) Xymon 4.3.27 which occasionally mixes up hosts when handling reports:
- Machines with a single disk (e.g. VMs) occasionally report status of a "raid" test which is not deployed to them -- and then (for obvious reasons) go purple on it. On that server, there's only one machine that has a RAID, but its "raid" reports have been misassigned to at least three other hosts, all hosts which have rather many tests (compared to a bunch of sensors which send in only very few tests per host). [...]
Fwiw, I've seen instances of such behavior ever since I started taking care of a hobbit installation at a customer site in late 2007. The symptom is hosts getting randomly mixed up. I can't say if there are tests that are hit more than others; the problem is mostly visible through disk tests, by finding rrd files on disk for partitions that do not exist on that host.
It doesn't seem to happen constantly, but rather in bursts, though I don't have hard data on that. My impression was that it only happens during busy periods, but that could be totally wrong.
We were on 4.3.0 for a long time before finally upgrading about two years ago, and I thought the problem was gone then, but what Axel is describing is exactly what we were (are?) seeing there.
Christoph
In some cases, I've seen this and tracked it down to malformed messages resulting from incomplete client reports. Unfortunately, I wasn't able to track down all of them that way, but many correlated with periods of intense load.
The client message (well, all messages, really, but client messages might be more noticeable since they're the largest on a plain system) doesn't have an end-of-message (EOM) indicator, so the receiver has no way to tell whether something has been truncated.
This will be solved in V5-style messages (which have a size indicator) or when combining into an extcombo.
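
To illustrate why a size indicator helps, here is a minimal Python sketch of length-prefixed framing. This is not the actual V5 wire format; the 4-byte header and the names frame() and unframe() are made up for the example. The point is only that a receiver which knows the expected size can reject a short message instead of processing it.

```python
import struct

HEADER = struct.Struct("!I")  # hypothetical header: 4-byte big-endian payload length


def frame(payload: bytes) -> bytes:
    """Prepend the payload length so the receiver knows how much to expect."""
    return HEADER.pack(len(payload)) + payload


def unframe(data: bytes) -> bytes:
    """Return the payload, or raise ValueError if the message was cut short."""
    if len(data) < HEADER.size:
        raise ValueError("truncated header")
    (size,) = HEADER.unpack(data[:HEADER.size])
    body = data[HEADER.size:]
    if len(body) < size:
        raise ValueError(f"truncated body: expected {size} bytes, got {len(body)}")
    return body[:size]


if __name__ == "__main__":
    msg = frame(b"client host1.linux linux\n[df]\n...\n")
    print(unframe(msg))   # the complete message round-trips
    try:
        unframe(msg[:-5])  # simulate a message truncated in transit
    except ValueError as err:
        print("rejected:", err)
```

Without the length header, the truncated bytes would look like any other (shorter) message.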
One work-around is to add --filter=\[clock\] to the xymond_channel command for the client channel:

    xymond_channel --channel=client --filter=\[clock\] xymond_client (etc)
This will block partial client messages from getting further into xymond when they happen, at the expense of some increased CPU load on xymond_channel, with potential back-pressure into xymond if the message load is high enough.
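
Roughly, the filter amounts to the check below, sketched in Python. It assumes the [clock] section is among the last ones the client sends, so a message cut off earlier will not contain it; the sample message and the name passes_filter() are invented for the illustration and are not xymond_channel's actual code.

```python
import re

# Same pattern as in --filter=\[clock\]: a literal "[clock]" section header.
CLOCK_SECTION = re.compile(r"\[clock\]")


def passes_filter(message: str) -> bool:
    """True if the message contains a [clock] section, i.e. looks complete."""
    return CLOCK_SECTION.search(message) is not None


# Made-up client message for illustration only.
SAMPLE = (
    "client host1.linux linux\n"
    "[date]\nWed Jun 15 09:30:00 UTC 2016\n"
    "[df]\n/dev/sda1 100 50 50 50% /\n"
    "[clock]\nepoch: 1465983000\n"
)

# Simulate truncation somewhere before the [clock] section arrives.
TRUNCATED = SAMPLE[: SAMPLE.index("[clock]") - 10]

if __name__ == "__main__":
    print(passes_filter(SAMPLE))
    print(passes_filter(TRUNCATED))
```

Note that this only catches messages cut off before [clock]; a message truncated after that section header would still pass.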
Of course, not having truncated messages in the first place would be nice :)
HTH, -jc