Hi,
On Wed, Jun 15, 2016 at 02:36:14PM -0700, J.C. Cleaver wrote:
On Wed, June 15, 2016 9:30 am, Christoph Berg wrote:
Fwiw, I've seen instances of such behavior ever since I've started taking care of a hobbit installation at a customer site in late 2007.
Oops. I never knowingly ran into that before -- and I've been running Hobbit/Xymon servers since about 2007 or so, too.
Symptoms are randomly mixed up hosts. I can't say if there are tests that are hit more than others; the problem is mostly visible through disk tests, by finding rrd files on disk for partitions that do not exist on this host.
I remember having seen misnamed rrd files in the past, but that was more like bitflips in the test names or so. (We get some data from some embedded devices which were buggy in the beginning and occasionally sent garbled messages to our Xymon server.) But I never noticed device or path names which don't fit the machine. So probably not the same thing.
It doesn't seem to happen constantly, but rather in bursts,
That explains why it seemed to coincide with my upgrade to 4.3.27 and why I then found cases from a few days earlier, too.
but I don't have hard data on that. My impression was that it only happens during busy periods, but that could be totally wrong.
Hrm, according to xymon, that xymon server has an average load of 0.2 and only a few peaks which go over 1.0 (highest load in the RRD is 1.5).
So I wouldn't say that this server is often "busy".
I also checked the past two days: The times where I have load peaks of up to 1.4 differ from the times where disk tests got misassigned. :-(
In some cases, I've seen this and tracked it down to malformed messages resulting from incomplete client reports.
I can imagine that. Were the incomplete client reports from the host they were assigned to, or from the one they should have been assigned to?
Unfortunately, I wasn't able to track down all of them from that, but many correlated with periods of intense load.
Fits with what Christoph experienced, but I doubt that it's related to load on my server. Load on the clients might be possible, though.
The client message (well, all messages, really, but client messages might be more noticeable since they're the largest on a plain system) doesn't have an EOM indicator, so it's impossible to see if something's gotten truncated.
This will be solved in V5 style messages (which have a size indicator).
Nice!
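For the archives, here is how I understand the general idea of a size indicator. This is only a generic length-prefix sketch in Python, not the actual V5 message format (which I haven't looked at); the 4-byte header and all function names are made up for illustration.

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length (hypothetical 4-byte big-endian header)."""
    return struct.pack("!I", len(payload)) + payload

def unframe(data: bytes) -> bytes:
    """Return the payload, or raise ValueError if the data is incomplete."""
    if len(data) < 4:
        raise ValueError("truncated header")
    (size,) = struct.unpack("!I", data[:4])
    payload = data[4:]
    if len(payload) != size:
        raise ValueError(f"expected {size} bytes, got {len(payload)}")
    return payload

def is_complete(data: bytes) -> bool:
    """True if the announced size matches what actually arrived."""
    try:
        unframe(data)
        return True
    except ValueError:
        return False
```

With something like this, the receiver can reject a report that lost its tail instead of having to guess from the content.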
One work-around is to add --filter=\[clock\] to:

    xymond_channel --channel=client --filter=\[clock\] xymond_client (etc)
This will block partial client messages from getting further into xymond when they happen, at the expense of some increased CPU load on xymond_channel, with potential back-pressure into xymond if the message load is high enough.
Hrm, I'm a bit reluctant to add this since the man page says:
    --filter=EXPRESSION
        EXPRESSION is a Perl-compatible regular expression.
        xymond_channel will match the first line of each message
        against this expression, and silently drops any message that
        does not match the expression.
If I download the client data of an arbitrary host, the first line is always empty and the second line reads "[collector:]". "[clock]" only shows up at the very end:
---8<---
[collector:]
client <hostname>.linux linux
[date]
Thu Jun 16 16:31:51 CEST 2016
[uname]
Linux <hostname> 3.16.0-4-amd64 x86_64
[osversion]
Debian 8.5
Distributor ID: Debian
Description: Debian GNU/Linux 8.5 (jessie)
Release: 8.5
Codename: jessie
[uptime]
16:31:51 up 365 days, 28 min, 0 users, load average: 8.43, 8.46, 8.23
[who]
[df]
Filesystem 1024-blocks Used Available Capacity Mounted on
[...]
[clientversion]
Xymon version 4.3.17
[clock]
epoch: 1466087516.425434
local: 2016-06-16 16:31:56 CEST
UTC: 2016-06-16 14:31:56 GMT
--->8---
So filtering for messages containing "[clock]" seems to make sense as the message needs to be nearly complete to contain that string.
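To convince myself of the "nearly complete" part, I played with a small Python sketch. It is not Xymon code: the sample is a shortened client report with "myhost" as a placeholder hostname, and has_clock() is just my stand-in for a check that looks at the whole message.

```python
import re

# A "[clock]" section header on a line of its own.
CLOCK = re.compile(r"^\[clock\]$", re.MULTILINE)

# Shortened client report; "myhost" is a placeholder hostname.
SAMPLE = "\n".join([
    "",
    "[collector:]",
    "client myhost.linux linux",
    "[date]",
    "Thu Jun 16 16:31:51 CEST 2016",
    "[df]",
    "Filesystem 1024-blocks Used Available Capacity Mounted on",
    "[clientversion]",
    "Xymon version 4.3.17",
    "[clock]",
    "epoch: 1466087516.425434",
    "local: 2016-06-16 16:31:56 CEST",
    "UTC: 2016-06-16 14:31:56 GMT",
    "",
])

def has_clock(message: str) -> bool:
    """True if the message contains a [clock] section header."""
    return CLOCK.search(message) is not None

def undetected_truncations(message: str) -> list:
    """Lengths of proper prefixes that still pass the [clock] check."""
    return [n for n in range(len(message)) if has_clock(message[:n])]
```

Any cut before the "[clock]" header is caught. Only a cut inside the [clock] section itself slips through, and that would merely lose part of the timestamps.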
OTOH the xymond_channel(8) man page says it only matches the first line of the message. What's considered to be a "message"? Each block starting with "[something]" (but then the man page would claim that it drops all other blocks) or the whole set of data linked as "Client data" on service status pages?
Seems to me that either way something's wrong in the man page.
Or are those data blocks shown in reverse order on the web?
Can you confirm that adding "--filter=\[clock\]" won't drop nearly all of the valid messages?
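To make my worry concrete, here is a Python sketch of the two possible readings. It is not the xymond_channel implementation, and I'm assuming the filter sees the message the way the web page shows it; the function names and the shortened sample message (with placeholder hostname "myhost") are mine.

```python
import re

FILTER = re.compile(r"\[clock\]")

def passes_first_line(message: str) -> bool:
    """Literal reading of the man page: only the first line is matched."""
    first_line = message.split("\n", 1)[0]
    return FILTER.search(first_line) is not None

def passes_whole_message(message: str) -> bool:
    """Alternative reading: the expression is matched against the whole message."""
    return FILTER.search(message) is not None

# Shortened client report as shown on the web: empty first line, [clock] last.
message = (
    "\n"
    "[collector:]\n"
    "client myhost.linux linux\n"
    "[date]\n"
    "Thu Jun 16 16:31:51 CEST 2016\n"
    "[clock]\n"
    "epoch: 1466087516.425434\n"
)
```

Under the first reading this perfectly valid message would be dropped; under the second it passes. So the answer depends on what xymond_channel really matches against.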
Of course, not having truncated messages in the first place would be nice :)
:-)
Kind regards, Axel Beckert
-- 
Axel Beckert <beckert at phys.ethz.ch>    support: +41 44 633 26 68
IT Services Group, HPT H 6               voice: +41 44 633 41 89
Departement of Physics, ETH Zurich
CH-8093 Zurich, Switzerland              http://nic.phys.ethz.ch/