Hi Jeremy
Added some debug code to my script. Here's an extract.

DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$2 }') # Current IO latency
$XYMON $XYMSRV "data $ENAME.e-series-dcuriolat $(echo; echo; echo "$DATA"; echo)"
echo $XYMON $XYMSRV "data $ENAME.e-series-dcuriolat $(echo; echo; echo "$DATA"; echo)"
DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$3 }') # Max IO latency
$XYMON $XYMSRV "data $ENAME.e-series-dmaxiolat $(echo; echo; echo "$DATA"; echo)"
echo $XYMON $XYMSRV "data $ENAME.e-series-dmaxiolat $(echo; echo; echo "$DATA"; echo)"
DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$3 }') # Avg IO latency
$XYMON $XYMSRV "data $ENAME.e-series-davgiolat $(echo; echo; echo "$DATA"; echo)"
echo $XYMON $XYMSRV "data $ENAME.e-series-davgiolat $(echo; echo; echo "$DATA"; echo)"

And I managed to get a couple of bizarre data files.

e-series-dcuriolat,icmpOutParmProbs.rrd
e-series-dcuriolat,icmpOutRedirects.rrd
e-series-dcuriolat,ipv6InTruncatedPkts.rrd
e-series-dcuriolat,ipv6OutFragFails.rrd
e-series-dcuriolat,UDP_udpInDatagrams.rrd
e-series-dcuriolat,udpInCksumErrs.rrd
And if I grep in my log file for icmp or any of those terms, I come up with nothing. So I am guessing it's not coming from the client.
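One way to spot these stray files quickly is to compare the RRD file names against the names the script actually emits. The following is only a rough sketch: the function name list_unexpected_rrds and its arguments are made up for illustration, and it assumes the RRD files are named "testname,dataset.rrd" with the dataset taken unchanged from the first column of the $TMPFILE.drvperf file. If Xymon rewrites any characters in the dataset names, the comparison would need adjusting.

```shell
#!/bin/sh
# Hypothetical helper (not part of Xymon): list RRD files whose dataset
# name does not appear in the first column of the performance file.
#   $1 = directory holding the host's RRD files
#   $2 = test name prefix, e.g. e-series-dcuriolat
#   $3 = performance file whose first column is the dataset name
list_unexpected_rrds() {
    rrd_dir=$1
    prefix=$2
    perf_file=$3
    for f in "$rrd_dir/$prefix",*.rrd; do
        [ -e "$f" ] || continue        # glob matched nothing
        name=${f##*/}                  # strip the directory part
        name=${name#"$prefix",}        # strip the "testname," prefix
        name=${name%.rrd}              # strip the extension
        # -x: whole-line match, -F: fixed string, -q: quiet
        if ! awk '{ print $1 }' "$perf_file" | grep -qxF -- "$name"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling it as list_unexpected_rrds /path/to/rrd/HOSTNAME e-series-dcuriolat $TMPFILE.drvperf (the RRD path here is a placeholder) would print only the files that have no matching line in the script's own data.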
I want to try the snoop, but this client script is running on the server, as a client script. It collects data from a bunch of NetApp E-series devices, and sends it to the server in the normal way. So you can imagine what the snoop data is going to look like. But I will give it a go, and see if there is something in it.
As for debugging the rrd tasks, John was right. Adding --debug to the rrd config causes it to crash. Then I just get heaps of this.

2015-02-25 11:31:07 Peer not up, flushing message queue
2015-02-25 11:31:07 Peer not up, flushing message queue
2015-02-25 11:31:07 Peer not up, flushing message queue

And the occasional

19073 2015-02-25 11:31:14
2015-02-25 11:31:15 Child process 19073 died: Signal 6

But I think I am reasonably happy that the strange data isn't coming from the client script. Martin Flemming is a list member in Germany (I think) who is helping me test this script. I will ask him if he's seeing the same issues. If not, I think we can rule out the script.
Regards Vernon
On 24 February 2015 at 14:26, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
I'm assuming you've checked the debug output from your script to see if the $TMPFILE.* file contents look OK.
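A quick sanity check on the data before it is sent can help rule the script in or out. The sketch below is illustrative only: check_data_lines is a made-up name, and it assumes every line of $DATA should look like "NAME : NUMBER" with a plain decimal value, which is what the awk extract in the script appears to produce.

```shell
#!/bin/sh
# Hypothetical helper: read "NAME : VALUE" lines on stdin, print any line
# that does not fit that shape, and return non-zero if one was found.
# Assumes names contain no whitespace and values are plain decimals.
check_data_lines() {
    awk '
        # A good line has exactly three fields: NAME, ":", NUMBER
        NF == 3 && $2 == ":" && $3 ~ /^-?[0-9]+(\.[0-9]+)?$/ { next }
        { printf "line %d: %s\n", NR, $0; bad = 1 }
        END { exit bad + 0 }
    '
}
```

For example, running echo "$DATA" | check_data_lines || echo "suspect data for $ENAME" just before the $XYMON call would flag anything odd at the source.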
Perhaps run your own instance of "xymond_channel --channel=data" to capture the messages as they come from xymond to xymond_rrd. This will generate a lot of output, so you'll want to use "--filter" and perhaps "grep" to trim it down.
You could also run snoop/tcpdump at the same time and try to capture the data message as it arrives at your Xymon server. If you have lots of Xymon traffic it might be better to do so on the client side.
The trick is to get a snapshot at the time that the RRD file is created, without collecting so much data that you run out of disk! So do something like this:

while true; do tcpdump -w dump.out -n -c 10000 dst port 1984 and host blabla; gzip dump.out; mv dump.out.gz dump.out-$(date +%s); done

This will capture 10,000 packets at a time, then compress and rotate.
You can also run xymond in a host-specific debug mode, by appending "--dbghost=HOSTNAME". That will spit out all the traffic into /tmp/xymond.dbg for analysis. Again, you might need to periodically rotate that file and signal xymond to re-open output files (I'm guessing a HUP signal might do this, or just kill the process and have xymonlaunch restart it).
The path the data take would be:
[script] -> [xymon client] -> [TCP/1984] -> [xymond] -> [xymond_channel] -> [xymond_rrd] -> [rrd file]
What we want to do is to watch the traffic/messages to determine which of these components is causing the problem. My first step would be to try to isolate whether it's a client or server problem, hence watching the traffic with tcpdump/snoop. If the traffic is transmitted over the wire in the correct form, then I'd look at what xymond gives to xymond_channel. And so on. Once we can identify the process that creates the phantom entity, we can look for the root cause and then work-arounds/solutions.
J
On 24 February 2015 at 16:46, Vernon Everett <everett.vernon at gmail.com> wrote:
I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap.
Tomorrow I am off to a client where it's happening all the time. What can I send you to assist with investigating?
I am trying to figure out if it's a bug in Xymon, or a bug in my script. So far I have no evidence to support it being either.
Regards Vernon
On 24 February 2015 at 13:14, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 14 November 2014 at 14:43, Vernon Everett <everett.vernon at gmail.com> wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help.
As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries in the server's tasks.cfg, and the logs started showing heaps of this message:

2014-11-14 10:41:36 Peer not up, flushing message queue

What is that? It doesn't look right to me.
It's usually normal. See Henrik's response to a similar question:
http://lists.xymon.com/archive/2014-April/039461.html
Except every now and then, I get something like
zmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes. It's puzzling, but rare enough that I haven't had time to investigate.
J
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton