Hi all
Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help.
As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries on the server tasks.cfg And the logs started showing heaps of the message 2014-11-14 10:41:36 Peer not up, flushing message queue What is that? It doesn't look right to me.
I might as well explain why I am doing this. I am running an experimental script, zresources.ksh Its supposed to allow me to graph the resources being used by individual zones on Solaris. Here's part of the code $XYMON $XYMSRV "status $MACHINE.zmem green $(date) $OUT " $XYMON $XYMSRV "data $MACHINE.zmem $COLOR $(echo; cat "$TEMPFILE.mem" ;echo;echo)" $XYMON $XYMSRV "data $MACHINE.zcpu $COLOR $(echo; cat "$TEMPFILE.cpu" ;echo;echo)" $XYMON $XYMSRV "data $MACHINE.zswap $COLOR $(echo; cat "$TEMPFILE.swap" ;echo;echo)" echo $XYMON $XYMSRV "status $MACHINE.zmem green $(date) $OUT " echo $XYMON $XYMSRV "data $MACHINE.zmem $COLOR $(echo; cat "$TEMPFILE.mem" ;echo;echo)" echo $XYMON $XYMSRV "data $MACHINE.zcpu $COLOR $(echo; cat "$TEMPFILE.cpu" ;echo;echo)" echo $XYMON $XYMSRV "data $MACHINE.zswap $COLOR $(echo; cat "$TEMPFILE.swap" ;echo;echo)" cat $TEMPFILE.swap cat $TEMPFILE.cpu cat $TEMPFILE.mem The first 4 lines are "real" and the rest I added for debug. It's supposed to send data containing the zone names, and the amount of resources it's using. global : 293 zone1 : 575 zone2 : 899 zone3 : 754 All good, it makes nice pretty graphs and what have you. And I expect and get rrd files as follows zmem,global.rrd zcpu,global.rrd zswap,global.rrd zmem,zone1.rrd zcpu,zone.rrd zswap,zone1.rrd etc.etc.
Except every now and then, I get something like zmem,c2t0d1.rrd I get others too, like zmem,rmt_0 and sometimes even with database SIDs too. And not just for zmem, for zcpu and zswap with about equal frequency. And often, the database SID is not even a database running on that host! I look inside the rrd file, with rrdtool dump, and there will usualy be only a single data point. Looking in the client logs, which contains the output of the debug code above, there is no indication that the client ever sent any data using anything else but the zone names. So where are these spurious data points coming from?
Has anybody seen anything like this?
Regards Vernon
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
On 14 November 2014 at 14:43, Vernon Everett <everett.vernon at gmail.com> wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help.
As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries on the server tasks.cfg And the logs started showing heaps of the message 2014-11-14 10:41:36 Peer not up, flushing message queue What is that? It doesn't look right to me.
It's usually normal. See Henrik's response to a similar question:
http://lists.xymon.com/archive/2014-April/039461.html
Except every now and then, I get something like
zmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes. It's puzzling, but rare enough that I haven't had time to investigate.
J
I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap.
Tomorrow I am off to a client where it's happening all the time. What can I send you to assist with investigating?
I am trying to figure out if it's a bug in Xymon, or a bug in my script. So far I have no evidence to support it being either.
Regards Vernon
On 24 February 2015 at 13:14, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 14 November 2014 at 14:43, Vernon Everett <everett.vernon at gmail.com> wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help.
As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries on the server tasks.cfg And the logs started showing heaps of the message 2014-11-14 10:41:36 Peer not up, flushing message queue What is that? It doesn't look right to me.
It's usually normal. See Henrik's response to a similar question:
http://lists.xymon.com/archive/2014-April/039461.html
Except every now and then, I get something like
zmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes. It's puzzling, but rare enough that I haven't had time to investigate.
J
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
I'm assuming you've checked your debug output from your script to see if the $TEMPFILE.* file contents look OK.
Perhaps run your own instance of "xymond_channel --channel=data" to capture the messages as they come from xymond to xymond_rrd. This will generate a lot of output, so you'll want to use "--filter" and perhaps "grep" to trim it down.
You could also run snoop/tcpdump at the same time and try to capture the data message as it arrives at your Xymon server. If you have lots of Xymon traffic it might be better to do so on the client side.
The trick is to get a snapshot at the time that the RRD file is created, without collecting so much data that you run out of disk! So doing things like this:
while true; do tcpdump -w dump.out -n -c 10000 dest port 1984 and host
blabla; gzip dump.out; mv dump.out.gz dump.out-date +%s; done
This will capture 10k of packets at a time, then compress and rotate.
You can also run xymond in a host-specific debug mode, by appending "--dbghost=HOSTNAME". That will spit out all the traffic into /tmp/xymond.dbg for analysis. Again, you might need to periodically rotate that file and signal xymond to re-open output files (I'm guessing a HUP signal might do this, or just kill the process and have xymonlaunch restart it).
The path the data take would be:
[script] -> [xymon client] -> [TCP/1984] -> [xymond] -> [xymond_channel] -> [xymond_rrd] -> [rrd file]
What we want to do is to watch the traffic/messages to determine which of these components is causing the problem. My first step would be to try to isolate whether it's a client or server problem, hence watching the traffic with tcpdump/snoop. If the traffic is transmitted over the wire in the correct form, then I'd look at what xymond gives to xymond_channel. And so on. Once we can identify the process that creates the phantom entity, we can look for the root cause and then work-arounds/solutions.
J
On 24 February 2015 at 16:46, Vernon Everett <everett.vernon at gmail.com> wrote:
I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap.
Tomorrow I am off to a client where it's happening all the time. What can I send you to assist with investigating?
I am trying to figure out if it's a bug in Xymon, or a bug in my script. So far I have no evidence to support it being either.
Regards Vernon
On 24 February 2015 at 13:14, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 14 November 2014 at 14:43, Vernon Everett <everett.vernon at gmail.com> wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help.
As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries on the server tasks.cfg And the logs started showing heaps of the message 2014-11-14 10:41:36 Peer not up, flushing message queue What is that? It doesn't look right to me.
It's usually normal. See Henrik's response to a similar question:
http://lists.xymon.com/archive/2014-April/039461.html
Except every now and then, I get something like
zmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes. It's puzzling, but rare enough that I haven't had time to investigate.
J
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
Hi Jeremy
Added some debug code to my script. Here's an extract. DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$2 }') # Current IO latency $XYMON $XYMSRV "data $ENAME.e-series-dcuriolat $(echo; echo; echo "$DATA"; echo)" echo $XYMON $XYMSRV "data $ENAME.e-series-dcuriolat $(echo; echo; echo "$DATA"; echo)" DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$3 }') # Max IO latency $XYMON $XYMSRV "data $ENAME.e-series-dmaxiolat $(echo; echo; echo "$DATA"; echo)" echo $XYMON $XYMSRV "data $ENAME.e-series-dmaxiolat $(echo; echo; echo "$DATA"; echo)" DATA=$(cat $TMPFILE.drvperf | awk '{ print $1" : "$3 }') # Avg IO latency $XYMON $XYMSRV "data $ENAME.e-series-davgiolat $(echo; echo; echo "$DATA"; echo)" echo $XYMON $XYMSRV "data $ENAME.e-series-davgiolat $(echo; echo; echo "$DATA"; echo)"
And I managed to get a couple of bizarre data files. e-series-dcuriolat,icmpOutParmProbs.rrd e-series-dcuriolat,icmpOutRedirects.rrd e-series-dcuriolat,ipv6InTruncatedPkts.rrd e-series-dcuriolat,ipv6OutFragFails.rrd e-series-dcuriolat,UDP_udpInDatagrams.rrd e-series-dcuriolat,udpInCksumErrs.rrd
And if I grep in my log file for icmp or any of those terms, I come up with nothing. So I am guessing it's not coming from the client.
I want to try the snoop, but this client script is running on the server, as a client script. It collects data from a bunch of NetApp E-series devices, and sends it to the server in the normal way. So you can imagine what the snoop data is going to look like. But I will give it a go, and see if there is something in it.
As for debugging the rrd tasks, John was right. Adding --debug to the rrd config causes it to crash. Then I just het heaps of this. 2015-02-25 11:31:07 Peer not up, flushing message queue 2015-02-25 11:31:07 Peer not up, flushing message queue 2015-02-25 11:31:07 Peer not up, flushing message queue And the occasional 19073 2015-02-25 11:31:14 2015-02-25 11:31:15 Child process 19073 died: Signal 6
But I think I am reasonably happy that the strange data isn't coming from the client script. Martin Flemming is a list member in Germany (think) who is helping me test this script. I will ask him if he's seeing the same issues. If not, I think we can rule out the script.
Regards Vernon
On 24 February 2015 at 14:26, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
I'm assuming you've checked your debug output from your script to see if the $TEMPFILE.* file contents look OK.
Perhaps run your own instance of "xymond_channel --channel=data" to capture the messages as they come from xymond to xymond_rrd. This will generate a lot of output, so you'll want to use "--filter" and perhaps "grep" to trim it down.
You could also run snoop/tcpdump at the same time and try to capture the data message as it arrives at your Xymon server. If you have lots of Xymon traffic it might be better to do so on the client side.
The trick is to get a snapshot at the time that the RRD file is created, without collecting so much data that you run out of disk! So doing things like this:
while true; do tcpdump -w dump.out -n -c 10000 dest port 1984 and host blabla; gzip dump.out; mv dump.out.gz dump.out-
date +%s; doneThis will capture 10k of packets at a time, then compress and rotate.
You can also run xymond in a host-specific debug mode, by appending "--dbghost=HOSTNAME". That will spit out all the traffic into /tmp/xymond.dbg for analysis. Again, you might need to periodically rotate that file and signal xymond to re-open output files (I'm guessing a HUP signal might do this, or just kill the process and have xymonlaunch restart it).
The path the data take would be:
[script] -> [xymon client] -> [TCP/1984] -> [xymond] -> [xymond_channel] -> [xymond_rrd] -> [rrd file]
What we want to do is to watch the traffic/messages to determine which of these components is causing the problem. My first step would be to try to isolate whether it's a client or server problem, hence watching the traffic with tcpdump/snoop. If the traffic is transmitted over the wire in the correct form, then I'd look at what xymond gives to xymond_channel. And so on. Once we can identify the process that creates the phantom entity, we can look for the root cause and then work-arounds/solutions.
J
On 24 February 2015 at 16:46, Vernon Everett <everett.vernon at gmail.com> wrote:
I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap.
Tomorrow I am off to a client where it's happening all the time. What can I send you to assist with investigating?
I am trying to figure out if it's a bug in Xymon, or a bug in my script. So far I have no evidence to support it being either.
Regards Vernon
On 24 February 2015 at 13:14, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 14 November 2014 at 14:43, Vernon Everett <everett.vernon at gmail.com> wrote:
Am busy trying to investigate a curious problem with rrd graphs, and I stumbled on something else I don't understand, and was hoping somebody out there could help.
As part of my investigation, I added --debug to the [rrdstatus] and [rrddata] entries on the server tasks.cfg And the logs started showing heaps of the message 2014-11-14 10:41:36 Peer not up, flushing message queue What is that? It doesn't look right to me.
It's usually normal. See Henrik's response to a similar question:
http://lists.xymon.com/archive/2014-April/039461.html
Except every now and then, I get something like
zmem,c2t0d1.rrd
Has anybody seen anything like this?
Yes. It's puzzling, but rare enough that I haven't had time to investigate.
J
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
Yes, I'd be surprised if it's a problem in the script.
On 25 February 2015 at 14:40, Vernon Everett <everett.vernon at gmail.com> wrote:
e-series-dcuriolat,ipv6OutFragFails.rrd e-series-dcuriolat,UDP_udpInDatagrams.rrd e-series-dcuriolat,udpInCksumErrs.rrd
These all look to be coming from the [netstat] client data. Perhaps have a look at the server's client data, scan through to the [netstat] section, and see if there are an unusual characters or corruption, either within [netstat] or the sections before and after it.
I'm not sure what you'd be looking for, but worth a look.
J
Hmm, that theory makes a lot of sense. Since I have also seen files like e-series-dcuriolat,vnet.rrd which looks like it comes from the ifstat section.
But of course, that would be too easy. :-) I am also seeing e-series-dmaxiolat,syndev72.rrd And syndev72 is the hostname of a system completely unrelated to the test where I am seeing the issues. There are a few others like that which have hostnames in them.
I am also seeing the likes of e-series-dcuriolat,vdc1.rrd There are only 2 hosts throwing out "vdc?" data points, and they also has nothing to do with the e-series. The vdc? is from another custom script, which logs and graphs the % busy of LUNs on 2 servers that are sensitive to disk latency fluctuation.
These hosts all have nothing at all to do with the storage arrays being monitored, which makes me think the client data might be a red herring.
I have asked Martin if he's seeing the same issue, but he hasn't got back to me yet.
The Xymon version I have here is 4.3.12, but I have also seen it at my other customers on version 4.3.17 and 4.3.18 (although to a far lesser degree)
Regards Vernon
On 25 February 2015 at 12:30, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
Yes, I'd be surprised if it's a problem in the script.
On 25 February 2015 at 14:40, Vernon Everett <everett.vernon at gmail.com> wrote:
e-series-dcuriolat,ipv6OutFragFails.rrd e-series-dcuriolat,UDP_udpInDatagrams.rrd e-series-dcuriolat,udpInCksumErrs.rrd
These all look to be coming from the [netstat] client data. Perhaps have a look at the server's client data, scan through to the [netstat] section, and see if there are an unusual characters or corruption, either within [netstat] or the sections before and after it.
I'm not sure what you'd be looking for, but worth a look.
J
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
On 25 February 2015 at 19:16, Vernon Everett <everett.vernon at gmail.com> wrote:
These hosts all have nothing at all to do with the storage arrays being monitored, which makes me think the client data might be a red herring.
Yup, makes sense.
My best guess is memory corruption within xymond. So let's see if the corruption is visible in the messages being passed between xymond and xymond_channel. If we see corrupt messages in there, we can start to delve into the source code to see if there's a bug somewhere. Are you able to run your own instance of xymond_channel? Maybe something like this:
sudo -u xymon xymoncmd xymond_channel --channel=data --filter=zmem cat
One you get an idea what it looks like, change "cat" for something like "egrep -A5 ^@" to get only the first 5 lines. Also, redirect to a file until you notice a dodgy RRD file and then kill the process.
Did you try running xymond with "--dbghost=HOSTNAME" ? It might be too voluminous, but might be worth a try, if you can manage to snag the traffic at the right time.
J
Only back at that client again on Wednesday. But I do have remote access, so will see what I can do this evening.
On 25 February 2015 at 18:06, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 25 February 2015 at 19:16, Vernon Everett <everett.vernon at gmail.com> wrote:
These hosts all have nothing at all to do with the storage arrays being monitored, which makes me think the client data might be a red herring.
Yup, makes sense.
My best guess is memory corruption within xymond. So let's see if the corruption is visible in the messages being passed between xymond and xymond_channel. If we see corrupt messages in there, we can start to delve into the source code to see if there's a bug somewhere. Are you able to run your own instance of xymond_channel? Maybe something like this:
sudo -u xymon xymoncmd xymond_channel --channel=data --filter=zmem cat
One you get an idea what it looks like, change "cat" for something like "egrep -A5 ^@" to get only the first 5 lines. Also, redirect to a file until you notice a dodgy RRD file and then kill the process.
Did you try running xymond with "--dbghost=HOSTNAME" ? It might be too voluminous, but might be worth a try, if you can manage to snag the traffic at the right time.
J
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
Hi all
Back at the customer's site, and back on this problem.
I just captured data as this happened. Ran the command you suggested, slightly modified for our environment. I also redirected it to a file for analysis. Here's what I ran, with error output. ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out 2015-03-04 08:45:22 Using default environment file /opt/local/xymon/server/etc/xymonserver.cfg 2015-03-04 08:45:58 Peer not up, flushing message queue 2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low.
What is that GOCLIENT thing? It might be relevant, since it occurred just *after *some errant data files were created. (Note timestamp) The errant data files are -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,subversionsize.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,subversionrss.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,subversionmemory.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,subversioncpu.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,subversion.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:02 SINFSPNA01/e-series-davgiolat,energisesize.rrd
This is supposed to graph the average IO latency of the disks in the e-series, so we expect output to look like this. -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:24 SINFSPNA01/e-series-davgiolat,Tray99_Slot1.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:24 SINFSPNA01/e-series-davgiolat,Tray0_Slot1.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:22 SINFSPNA03/e-series-davgiolat,Tray99_Slot12.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:22 SINFSPNA03/e-series-davgiolat,Tray0_Slot12.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:21 SINFSPNA01/e-series-davgiolat,Tray99_Slot8.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:21 SINFSPNA01/e-series-davgiolat,Tray99_Slot7.rrd -rw-r--r-- 1 xymon xymon 19580 Mar 4 09:21 SINFSPNA01/e-series-davgiolat,Tray99_Slot6.rrd
The subversion and energize are actually host names for completely unrelated servers. The subversionsize and subversionrsss and similar data points are being sent for another host, but are not related to the e-series graphs.
In the output file, /var/tmp/xymon.out from ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out there is no mention of the subversion or energise stuff either.
Does this narrow the field at all? Based on Jeremy's earlier email, it looks like the issue is in xymond_rrd, unless that GOCLIENT error can tell us something more?
Regards Vernon
On 25 February 2015 at 18:06, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 25 February 2015 at 19:16, Vernon Everett <everett.vernon at gmail.com> wrote:
These hosts all have nothing at all to do with the storage arrays being monitored, which makes me think the client data might be a red herring.
Yup, makes sense.
My best guess is memory corruption within xymond. So let's see if the corruption is visible in the messages being passed between xymond and xymond_channel. If we see corrupt messages in there, we can start to delve into the source code to see if there's a bug somewhere. Are you able to run your own instance of xymond_channel? Maybe something like this:
sudo -u xymon xymoncmd xymond_channel --channel=data --filter=zmem cat
One you get an idea what it looks like, change "cat" for something like "egrep -A5 ^@" to get only the first 5 lines. Also, redirect to a file until you notice a dodgy RRD file and then kill the process.
Did you try running xymond with "--dbghost=HOSTNAME" ? It might be too voluminous, but might be worth a try, if you can manage to snag the traffic at the right time.
J
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
On 4 March 2015 at 12:40, Vernon Everett <everett.vernon at gmail.com> wrote:
Here's what I ran, with error output. ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out 2015-03-04 08:45:22 Using default environment file /opt/local/xymon/server/etc/xymonserver.cfg 2015-03-04 08:45:58 Peer not up, flushing message queue 2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low.
What is that GOCLIENT thing?
From what I can understand, it's a semaphore shared between xymond and all of the xymond_channel instances. When there are several channel readers, they all get sent the message address, and as each one accepts the message, she decrements GOCLIENT. When GOCLIENT is zero, it means all readers have received (and probably copied) the message, and the memory can be freed. Each reader waits until GOCLIENT goes back to zero before waiting for the next message.
There's a timeout of 1 second that xymond_channel waits for GOCLIENT to go back to zero. If the time is exceeded in a channel reader, it means another reader is taking too long to handle a message, and so the first reader gives up, logs the error you saw, and carries on with the next message loop. I'm not sure if this is a sign of trouble. Or it might be normal when you're running your own instance of xymond_channel. Or it might be a side-effect of the "cat" command blocking when writing to your output file due to a high message rate and contention on whatever filesystem has /var/tmp/.
There's a description of how GOCLIENT works in the file new-daemon.txt, in the source code.
In the output file, /var/tmp/xymon.out from ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out there is no mention of the subversion or energise stuff either.
Does it have mention of the correct data set names? We can't draw any conclusions if it's not collecting the data we expect.
Did any of the RRD files skip an update at the time the new rogue files were created? Do these files match up with entries in xymon.out? Or anything interesting at the same time as the rogue entries were created?
If you're seeing correct entries in xymon.out, but not the bogus entries, then I'm inclined to agree that xymond_rrd is at fault, and is possibly using memory it's not supposed to. I wonder if running xymond_rrd with "--no-cache" might have an effect. Obviously, it's better if you can cache updates to the RRD files, but it might narrow down the region of code that's responsible.
This is not conclusive. It's possible that when you have two instances of xymond_channel, only one is corrupting data names, and it just so happened that it was the one being used by xymond_rrd. Could be that another time you would see your extra reader getting the bogus entries. That's the problem with using a second instance for analysis, rather than somehow getting the analysis happening on the one that writes to the RRD files.
On the other hand, if you ran two instances of xymond_rrd, both on the same data channel, and if both instances create the bogus RRD files, then you know that you can probably use the second instance to narrow down the fault, without impacting the creation of RRD files for real work.
Are you still running xymond_rrd with "--debug"? Did this show anything interesting when the bogus RRD files were created?
What version of Xymon are you running? Did this start happening after an upgrade? I wonder if it's a bug with some versions but not others.
J
Thanks for your assistance Jeremy. Quite a bit to digest. Everything in the xymon.out file I am collecting looks exactly like I would expect it to look. There is a line, prefixed by @@data#<sequence> (or at least I think it's a sequence number),followed by pipe-separated data that looks like the host name, time stamp, IP address, an empty field, host name again, the rrd-file prefix, another blank field and the last field is other. I then get a blank line, and the data looks like what I am trying to send. Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either. At this site, we are running Xymon 4.3.12, but I have seen similar behaviour (although not to such an extent) elsewhere with 4.3.17, and I think I also saw it with 4.3.18, but I no longer have access to that site. I am not seeing any lost data points in the other graphs. But that could be difficult to spot. Will run a few rrdtool dumps, and look for gaps at that timestamp. Let you know what I find. As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go. It gets ugly. Earlier in this thread, John Thurston mentioned this behaviour too. It also creates a red xymond_rrd button on the xymon server, but the results are not overly helpful. - Program crashedFatal signal caught! Don't think it started after an upgrade. Something I did notice, the problem appears to be limited to data only, used to display graphs in trends. I am not seeing this for data when there is a status and data component. Or at least I haven't seen it yet. What are the implications of running with "--no-cache"? I have implemented this by adding "--no-cache" but if it's going to have a long-term impact, I don't want to leave it that way indefinitely. Regards Vernon On 4 March 2015 at 14:03, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 4 March 2015 at 12:40, Vernon Everett <everett.vernon at gmail.com> wrote:
Here's what I ran, with error output. ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out 2015-03-04 08:45:22 Using default environment file /opt/local/xymon/server/etc/xymonserver.cfg 2015-03-04 08:45:58 Peer not up, flushing message queue 2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low.
What is that GOCLIENT thing?
From what I can understand, it's a semaphore shared between xymond and all of the xymond_channel instances. When there are several channel readers, they all get sent the message address, and as each one accepts the message, she decrements GOCLIENT. When GOCLIENT is zero, it means all readers have received (and probably copied) the message, and the memory can be freed. Each reader waits until GOCLIENT goes back to zero before waiting for the next message.
There's a timeout of 1 second that xymond_channel waits for GOCLIENT to go back to zero. If the time is exceeded in a channel reader, it means another reader is taking too long to handle a message, and so the first reader gives up, logs the error you saw, and carries on with the next message loop. I'm not sure if this is a sign of trouble. Or it might be normal when you're running your own instance of xymond_channel. Or it might be a side-effect of the "cat" command blocking when writing to your output file due to a high message rate and contention on whatever filesystem has /var/tmp/.
There's a description of how GOCLIENT works in the file new-daemon.txt, in the source code.
In the output file, /var/tmp/xymon.out from ./xymoncmd xymond_channel --channel=data --filter=e-series cat > /var/tmp/xymon.out there is no mention of the subversion or energise stuff either.
Does it have mention of the correct data set names? We can't draw any conclusions if it's not collecting the data we expect.
Did any of the RRD files skip an update at the time the new rogue files were created? Do these files match up with entries in xymon.out? Or anything interesting at the same time as the rogue entries were created?
If you're seeing correct entries in xymon.out, but not the bogus entries, then I'm inclined to agree that xymond_rrd is at fault, and is possibly using memory it's not supposed to. I wonder if running xymond_rrd with "--no-cache" might have an effect. Obviously, it's better if you can cache updates to the RRD files, but it might narrow down the region of code that's responsible.
This is not conclusive. It's possible that when you have two instances of xymond_channel, only one is corrupting data names, and it just so happened that it was the one being used by xymond_rrd. Could be that another time you would see your extra reader getting the bogus entries. That's the problem with using a second instance for analysis, rather than somehow getting the analysis happening on the one that writes to the RRD files.
On the other hand, if you ran two instances of xymond_rrd, both on the same data channel, and if both instances create the bogus RRD files, then you know that you can probably use the second instance to narrow down the fault, without impacting the creation of RRD files for real work.
Are you still running xymond_rrd with "--debug"? Did this show anything interesting when the bogus RRD files were created?
What version of Xymon are you running? Did this start happening after an upgrade? I wonder if it's a bug with some versions but not others.
J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
On 04/03/2015 6:02 PM, "Vernon Everett" <everett.vernon at gmail.com> wrote:
Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either.
If there's a correlation it might help us work out where the fault is. But it might be only a symptom.
As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go.
Could be that thensame bug causing the crash during debug is also causing the corrupt filename. Have you analyzed the core dumps?
Don't think it started after an upgrade. Something I did notice, the problem appears to be limited to data only, used to display graphs in trends. I am not seeing this for data when there is a status and data component. Or at least I haven't seen it yet.
Interesting. Maybe you could modify your script to send a status also?
What are the implications of running with "--no-cache"?
I think then cache is only to save on disk I/O. If your disks aren't "running hot" (ie queueing requests long enough to cause problems) then youre probably OK to leave it.
I have implemented this by adding "--no-cache" but if it's going to have a long-term impact, I don't want to leave it that way indefinitely.
Some people have reported solving some other problems by disabling the cache, with presumably no I'll effects. For example http://lists.xymon.com/archive/2009-November/026572.html. It's a performance improvement that hasn't always been there and so is probably not necessary for average size monitoring service. Also I think that if you have operations staff that refresh lots of graphs (hence forcing the cache to flush) you get minimal benefit from the cache.
But it all depends on how many RRD updates per second your server needs to make, and how fast your disks and filesystem can update. If you have enough RAM, your OS might have sufficient buffers to not need the cache anyway.
J
On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:
On 04/03/2015 6:02 PM, "Vernon Everett" <everett.vernon at gmail.com> wrote:
Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either.
If there's a correlation it might help us work out where the fault is. But it might be only a symptom.
As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go.
Could be that thensame bug causing the crash during debug is also causing the corrupt filename. Have you analyzed the core dumps?
GOCLIENT is indeed the means by which xymond_channel listeners communicate with xymond for the picking up of messages over SysV IPC. I believe the messages there are just a side effect of it re-launching the channel listener pipe to xymond_rrd.
The cache routines in xymond_rrd should be stable at this point. Can you send a backtrace in from one of the cores? I'm curious where things could be acting up here.
Regards,
-jc
Hi all
And even with --no-cache, I am still getting these corrupted rrd files. I tried again with --debug (and --no-cache) and it core dumps.
Here's the backtrace.
adb /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd
./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core file = ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 -- program
``
/zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd'' on platform
SUNW,SPARC-Enterprise-T2000
SIGABRT: Abort
$c
libc.so.1_lwp_kill+8(6, 0, 0, 3a9a0, ffffffff, 6) libc.so.1abort+0x110(0, 1, feebafec, ffb3c, fef35518, 0)
sigsegv_handler+0x30(b, 0, ffbfa160, 0, fe8a2a00, ffbfa160)
libc.so.1__sighndlr+0xc(b, 0, ffbfa160, 3a970, 0, 1) libc.so.1call_user_handler+0x3b8(b, 0, 0, 0, fe8a2a00, ffbfa160)
libc.so.1sigacthandler+0x60(b, 0, ffbfa160, ffbfb290, 0, fee91924) libc.so.1strlen+0x50(514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0)
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c)
dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388)
dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
init_tcp_services+0x91c(6a400, 51400, 51400, 6c3f0, bf, 23)
rrd_setup+0x15c(93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400)
find_xymon_rrd+4(93314, 511e8, 54ff8bda, 0, 932f3, 2e)
main+0x948(93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374)
_start+0x5c(0, 0, 0, 0, 0, 0)
Anything else I can offer that will assist?
Regards Vernon
On 5 March 2015 at 09:11, J.C. Cleaver <cleaver at terabithia.org> wrote:
On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:
On 04/03/2015 6:02 PM, "Vernon Everett" <everett.vernon at gmail.com> wrote:
Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either.
If there's a correlation it might help us work out where the fault is. But it might be only a symptom.
As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go.
Could be that thensame bug causing the crash during debug is also causing the corrupt filename. Have you analyzed the core dumps?
GOCLIENT is indeed the means by which xymond_channel listeners communicate with xymond for the picking up of messages over SysV IPC. I believe the messages there are just a side effect of it re-launching the channel listener pipe to xymond_rrd.
The cache routines in xymond_rrd should be stable at this point. Can you send a backtrace in from one of the cores? I'm curious where things could be acting up here.
Regards,
-jc
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
If it helps, here's the same core, with pstack.
pstack ./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core './core_smcconsole_xymond_rrd_61_61_1426033626_20899' of 20899: xymond_rrd --debug --rrddir=/opt/local/xymon/data/rrd --no-cache feebebc4 _lwp_kill (6, 0, 0, 3a9a0, ffffffff, 6) + 8 fee329b0 abort (0, 1, feebafec, ffb3c, fef35518, 0) + 110 0003a9a0 sigsegv_handler (b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) + 30 feebafec __sighndlr (b, 0, ffbfa160, 3a970, 0, 1) + c feeaf69c call_user_handler (b, 0, 0, 0, fe8a2a00, ffbfa160) + 3b8 feeaf884 sigacthandler (b, 0, ffbfa160, ffbfb290, 0, fee91924) + 60 --- called from signal handler with signal 11 (SIGSEGV) --- fee22d10 strlen (514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) + 50 fee91924 vfprintf (6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) + ec 0002fb4c dbgprintf (514c0, 0, 51400, 6c3f0, bf, 2ab388) + a4 00033e1c dump_tcp_services (a0, 1c00, fef37940, 0, 51400, 51400) + 74 000347f0 init_tcp_services (6a400, 51400, 51400, 6c3f0, bf, 23) + 91c 00030188 rrd_setup (93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) + 15c 00030518 find_xymon_rrd (93314, 511e8, 54ff8bda, 0, 932f3, 2e) + 4 00016614 main (93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) + 948 00015a40 _start (0, 0, 0, 0, 0, 0) + 5c
On 11 March 2015 at 08:37, Vernon Everett <everett.vernon at gmail.com> wrote:
Hi all
And even with --no-cache, I am still getting these corrupted rrd files. I tried again with --debug (and --no-cache) and it core dumps.
Here's the backtrace.
adb /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd
./core_smcconsole_xymond_rrd_61_61_1426033626_20899 core file = ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 -- program `` /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd'' on platform SUNW,SPARC-Enterprise-T2000 SIGABRT: Abort $c libc.so.1
_lwp_kill+8(6, 0, 0, 3a9a0, ffffffff, 6) libc.so.1abort+0x110(0, 1, feebafec, ffb3c, fef35518, 0) sigsegv_handler+0x30(b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) libc.so.1__sighndlr+0xc(b, 0, ffbfa160, 3a970, 0, 1) libc.so.1call_user_handler+0x3b8(b, 0, 0, 0, fe8a2a00, ffbfa160) libc.so.1sigacthandler+0x60(b, 0, ffbfa160, ffbfb290, 0, fee91924) libc.so.1strlen+0x50(514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400) init_tcp_services+0x91c(6a400, 51400, 51400, 6c3f0, bf, 23) rrd_setup+0x15c(93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) find_xymon_rrd+4(93314, 511e8, 54ff8bda, 0, 932f3, 2e) main+0x948(93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) _start+0x5c(0, 0, 0, 0, 0, 0)Anything else I can offer that will assist?
Regards Vernon
On 5 March 2015 at 09:11, J.C. Cleaver <cleaver at terabithia.org> wrote:
On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:
On 04/03/2015 6:02 PM, "Vernon Everett" <everett.vernon at gmail.com> wrote:
Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either.
If there's a correlation it might help us work out where the fault is. But it might be only a symptom.
As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go.
Could be that thensame bug causing the crash during debug is also causing the corrupt filename. Have you analyzed the core dumps?
GOCLIENT is indeed the means by which xymond_channel listeners communicate with xymond for the picking up of messages over SysV IPC. I believe the messages there are just a side effect of it re-launching the channel listener pipe to xymond_rrd.
The cache routines in xymond_rrd should be stable at this point. Can you send a backtrace in from one of the cores? I'm curious where things could be acting up here.
Regards,
-jc
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
This might help too.
pflags ./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core './core_smcconsole_xymond_rrd_61_61_1426033626_20899' of 20899: xymond_rrd --debug --rrddir=/opt/local/xymon/data/rrd --no-cache data model = _ILP32 flags = MSACCT|MSFORK sigpend = 0x00020000,0x00000000 /1: flags = 0 sigmask = 0xffffbefc,0x0000ffff cursig = SIGABRT
On 11 March 2015 at 09:13, Vernon Everett <everett.vernon at gmail.com> wrote:
If it helps, here's the same core, with pstack.
pstack ./core_smcconsole_xymond_rrd_61_61_1426033626_20899
core './core_smcconsole_xymond_rrd_61_61_1426033626_20899' of 20899: xymond_rrd --debug --rrddir=/opt/local/xymon/data/rrd --no-cache feebebc4 _lwp_kill (6, 0, 0, 3a9a0, ffffffff, 6) + 8 fee329b0 abort (0, 1, feebafec, ffb3c, fef35518, 0) + 110 0003a9a0 sigsegv_handler (b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) + 30 feebafec __sighndlr (b, 0, ffbfa160, 3a970, 0, 1) + c feeaf69c call_user_handler (b, 0, 0, 0, fe8a2a00, ffbfa160) + 3b8 feeaf884 sigacthandler (b, 0, ffbfa160, ffbfb290, 0, fee91924) + 60 --- called from signal handler with signal 11 (SIGSEGV) --- fee22d10 strlen (514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) + 50 fee91924 vfprintf (6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) + ec 0002fb4c dbgprintf (514c0, 0, 51400, 6c3f0, bf, 2ab388) + a4 00033e1c dump_tcp_services (a0, 1c00, fef37940, 0, 51400, 51400) + 74 000347f0 init_tcp_services (6a400, 51400, 51400, 6c3f0, bf, 23) + 91c 00030188 rrd_setup (93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) + 15c 00030518 find_xymon_rrd (93314, 511e8, 54ff8bda, 0, 932f3, 2e) + 4 00016614 main (93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) + 948 00015a40 _start (0, 0, 0, 0, 0, 0) + 5c
On 11 March 2015 at 08:37, Vernon Everett <everett.vernon at gmail.com> wrote:
Hi all
And even with --no-cache, I am still getting these corrupted rrd files. I tried again with --debug (and --no-cache) and it core dumps.
Here's the backtrace.
adb /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd
./core_smcconsole_xymond_rrd_61_61_1426033626_20899 core file = ./core_smcconsole_xymond_rrd_61_61_1426033626_20899 -- program `` /zones/smcconsole/root/opt/local/xymon/server/bin/xymond_rrd'' on platform SUNW,SPARC-Enterprise-T2000 SIGABRT: Abort $c libc.so.1
_lwp_kill+8(6, 0, 0, 3a9a0, ffffffff, 6) libc.so.1abort+0x110(0, 1, feebafec, ffb3c, fef35518, 0) sigsegv_handler+0x30(b, 0, ffbfa160, 0, fe8a2a00, ffbfa160) libc.so.1__sighndlr+0xc(b, 0, ffbfa160, 3a970, 0, 1) libc.so.1call_user_handler+0x3b8(b, 0, 0, 0, fe8a2a00, ffbfa160) libc.so.1sigacthandler+0x60(b, 0, ffbfa160, ffbfb290, 0, fee91924) libc.so.1strlen+0x50(514cf, ffbfb3ec, ffbfa9d1, 0, 0, 0) libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400) init_tcp_services+0x91c(6a400, 51400, 51400, 6c3f0, bf, 23) rrd_setup+0x15c(93314, ffbfcfc4, 63c00, ffbfcf50, 6a400, 6a400) find_xymon_rrd+4(93314, 511e8, 54ff8bda, 0, 932f3, 2e) main+0x948(93314, ffbfcfc4, 63c00, ffbfcf50, 54ff8bda, 93374) _start+0x5c(0, 0, 0, 0, 0, 0)Anything else I can offer that will assist?
Regards Vernon
On 5 March 2015 at 09:11, J.C. Cleaver <cleaver at terabithia.org> wrote:
On Wed, March 4, 2015 2:52 pm, Jeremy Laidman wrote:
On 04/03/2015 6:02 PM, "Vernon Everett" <everett.vernon at gmail.com> wrote:
Looks like we might need to check with JC for more on that GOCLIENT thing. I just find it odd that it happened about the same time as the corruption. I haven't seen it again today, and haven't seen any other corruption either.
If there's a correlation it might help us work out where the fault is. But it might be only a symptom.
As for the --debug option, it caused xymond_rrd to crash and burn, dumping cores as we go.
Could be that thensame bug causing the crash during debug is also causing the corrupt filename. Have you analyzed the core dumps?
GOCLIENT is indeed the means by which xymond_channel listeners communicate with xymond for the picking up of messages over SysV IPC. I believe the messages there are just a side effect of it re-launching the channel listener pipe to xymond_rrd.
The cache routines in xymond_rrd should be stable at this point. Can you send a backtrace in from one of the cores? I'm curious where things could be acting up here.
Regards,
-jc
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
On 11 March 2015 at 11:37, Vernon Everett <everett.vernon at gmail.com> wrote:
And even with --no-cache, I am still getting these corrupted rrd files.
:-(
I tried again with --debug (and --no-cache) and it core dumps.
Here's the backtrace.
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
So dump_tcp_services() calls dbgprintf() (both on lib/netservices.c) which in turn calls vprintf() from libc, but with bad parameters. I've had a look through the code in dump_tcp_services() and I don't know enough C to recognize any problems. But it might be useful to know which call to dbgprintf() is causing the problem.
Does the log file for xymond_rrd show any debug output at all? If so, what's the last line that is shown.
It might be helpful if you can recompile xymond_rrd with dump_tcp_services() modified. Initially, I would simply try it with "return" added after the first call to dbgprintf(). That is, dump_tcp_services() will output "Service list dump" and return. This might stop the core dumps so that we can get debug output for other parts of the xymond_rrd processing.
If adding "return" at that point fixes this core dump, more diagnostic lines would be useful to determine what the problem is. For example, there's a global array called svcinfo that is iterated over, but if the array is empty, it might cause the core dump. So adding a line that checks whether the array is empty and displays the result would help to pin this down.
Note that "svcinfo" appears to be populated from the protocols.cfg file and/or XYMONNETSVCS. Is it possible that your protocols.cfg file is empty, or has some syntax error that causes it to be unparseable? The same for XYMONNETSVCS (in xymonserver.cfg)?
J
Looking at the logs below, it appears something might be broken in the [telnet] definition. Here's the first few lines from my protocols.cfg file [ftp] send "quit\r\n" expect "220" options banner port 21 [ftps] send "quit\r\n" expect "220" options ssl,banner port 990 [ssh|ssh1|ssh2] send "SSH-2.0-OpenSSH_4.1\r\n" expect "SSH" options banner port 22 [telnet] options banner,telnet port 23
From rrd-data.log 5614 2015-03-11 08:29:07 Opening file /opt/local/xymon/server/etc/rrddefinitions.cfg 5614 2015-03-11 08:29:07 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=2101247 5614 2015-03-11 08:29:07 Got 403 bytes 5614 2015-03-11 08:29:07 xymond_rrd: Got message 849930 @@data#849930/transfer|1426033746.620045|172.21.219.192||transfer|iostatdisk|sunos|unix/UnixInf 5614 2015-03-11 08:29:07 startpos 403, fillpos 403, endpos -1 5614 2015-03-11 08:29:07 Transport setup is: 5614 2015-03-11 08:29:07 xymondportnumber = 1984 5614 2015-03-11 08:29:07 xymonproxyhost = NONE 5614 2015-03-11 08:29:07 xymonproxyport = 0 5614 2015-03-11 08:29:07 Recipient listed as '172.21.219.192' 5614 2015-03-11 08:29:07 Standard protocol on port 1984 5614 2015-03-11 08:29:07 Will connect to address 172.21.219.192 port 1984 5614 2015-03-11 08:29:07 Connect status is 0 5614 2015-03-11 08:29:07 Sent 16 bytes 5614 2015-03-11 08:29:07 Read 8192 bytes 5614 2015-03-11 08:29:07 Read 7657 bytes 5614 2015-03-11 08:29:07 Closing connection 5614 2015-03-11 08:29:07 Opening file /opt/local/xymon/server/etc/analysis.cfg 5614 2015-03-11 08:29:07 Compiling regex ctrldom[A-D] 5614 2015-03-11 08:29:07 Compiling regex infdom[A-B]1 5614 2015-03-11 08:29:07 Compiling regex .+dom.. 5614 2015-03-11 08:29:07 Compiling regex infdom[A-B]1|notesdomD1 5614 2015-03-11 08:29:07 Compiling regex [.:]80$ 5614 2015-03-11 08:29:07 Compiling regex syn.*,%oradom.* 5614 2015-03-11 08:29:07 Compiling regex syn.*|oradom.*|SINFV.* 5614 2015-03-11 08:29:07 Compiling regex [.:]514$ 5614 2015-03-11 08:29:07 Compiling regex [.:]3389$ 5614 2015-03-11 08:29:07 Compiling regex energise.* 5614 2015-03-11 08:29:07 Compiling regex [.:]8989$ 5614 2015-03-11 08:29:07 Compiling regex [.:]80$ 5614 2015-03-11 08:29:07 Compiling regex [.:]4848$ 5614 2015-03-11 08:29:07 Compiling regex [.:]8080$ 5614 2015-03-11 08:29:07 Compiling regex ^/var/crash/.* 5614 2015-03-11 08:29:07 Compiling regex ^/var/core/.* 5614 2015-03-11 08:29:07 Compiling regex [.:]22$ 5614 2015-03-11 08:29:07 Compiling regex sinfct.*|domino.* 5614 2015-03-11 08:29:07 Compiling regex [.:]80$ 5614 2015-03-11 08:29:07 Compiling regex [.:]25$ 5614 2015-03-11 08:29:07 Compiling regex sinfct.*|domino.*|SINFV.*|hermes|sirocco 5614 2015-03-11 08:29:07 Compiling regex SINFV.* 5614 2015-03-11 08:29:07 Compiling regex .* 5614 2015-03-11 08:29:07 Compiling regex ^error.* 5614 2015-03-11 08:29:07 Compiling regex ^error.* 5614 2015-03-11 08:29:07 Compiling regex .*restart/shutdown.*|.*Citrix 5614 2015-03-11 08:29:07 Compiling regex .* 5614 2015-03-11 08:29:07 Compiling regex ^warning.* 5614 2015-03-11 08:29:07 Compiling regex ^warning.* 5614 2015-03-11 08:29:07 Compiling regex .*MetaFrameEvents.*|.*restart/shutdown.*|.*Citrix-Multimedia-Flash.*|.*Printer 5614 2015-03-11 08:29:07 Compiling regex .* 5614 2015-03-11 08:29:07 Compiling regex ^failure.* 5614 2015-03-11 08:29:07 Compiling regex ^failure.* 5614 2015-03-11 08:29:07 Compiling regex ^/var/crash/.* 5614 2015-03-11 08:29:07 Compiling regex ^/var/core/.* 5614 2015-03-11 08:29:07 Compiling regex [.:]22$ 5614 2015-03-11 08:29:07 Compiling regex ^/var/crash/.* 5614 2015-03-11 08:29:07 Compiling regex ^/var/core/.* 5614 2015-03-11 08:29:07 Compiling regex [.:]22$ 5614 2015-03-11 08:29:07 Opening file /opt/local/xymon/server/etc/protocols.cfg 5614 2015-03-11 08:29:07 Service list dump 5614 2015-03-11 08:29:07 Name : ftp 5614 2015-03-11 08:29:07 Sendtext: quit\r\n 5614 2015-03-11 08:29:07 Sendlen : 6 5614 2015-03-11 08:29:07 Exp.text: 220 5614 2015-03-11 08:29:07 Exp.len : 3 5614 2015-03-11 08:29:07 Exp.ofs : 0 5614 2015-03-11 08:29:07 Flags : 1 5614 2015-03-11 08:29:07 Port : 21 5614 2015-03-11 08:29:07 Name : ftps 5614 2015-03-11 08:29:07 Sendtext: quit\r\n 5614 2015-03-11 08:29:07 Sendlen : 6 5614 2015-03-11 08:29:07 Exp.text: 220 5614 2015-03-11 08:29:07 Exp.len : 3 5614 2015-03-11 08:29:07 Exp.ofs : 0 5614 2015-03-11 08:29:07 Flags : 5 5614 2015-03-11 08:29:07 Port : 990 5614 2015-03-11 08:29:07 Name : ssh 5614 2015-03-11 08:29:07 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5614 2015-03-11 08:29:07 Sendlen : 21 5614 2015-03-11 08:29:07 Exp.text: SSH 5614 2015-03-11 08:29:07 Exp.len : 3 5614 2015-03-11 08:29:07 Exp.ofs : 0 5614 2015-03-11 08:29:07 Flags : 1 5614 2015-03-11 08:29:07 Port : 22 5614 2015-03-11 08:29:07 Name : ssh1 5614 2015-03-11 08:29:07 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5614 2015-03-11 08:29:07 Sendlen : 21 5614 2015-03-11 08:29:07 Exp.text: SSH 5614 2015-03-11 08:29:07 Exp.len : 3 5614 2015-03-11 08:29:07 Exp.ofs : 0 5614 2015-03-11 08:29:07 Flags : 1 5614 2015-03-11 08:29:07 Port : 22 5614 2015-03-11 08:29:07 Name : ssh2 5614 2015-03-11 08:29:07 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5614 2015-03-11 08:29:07 Sendlen : 21 5614 2015-03-11 08:29:07 Exp.text: SSH 5614 2015-03-11 08:29:07 Exp.len : 3 5614 2015-03-11 08:29:07 Exp.ofs : 0 5614 2015-03-11 08:29:07 Flags : 1 5614 2015-03-11 08:29:07 Port : 22 5614 2015-03-11 08:29:07 Name : telnet 5614 2015-03-11 08:29:07 2015-03-11 08:29:08 Child process 5614 died: Signal 6
From rrd-status.log 5376 2015-03-11 08:29:06 Opening file /opt/local/xymon/server/etc/rrddefinitions.cfg 5376 2015-03-11 08:29:06 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=2101247 5376 2015-03-11 08:29:06 Got 6075 bytes 5376 2015-03-11 08:29:06 xymond_rrd: Got message 621101 @@status#621101/sinfctap01|1426033745.004291|172.21.198.73||sinfctap01|svcs|1426035545|green||green|1417272435|0||0||1426033594|sunos|unix/appzones|0| 5376 2015-03-11 08:29:06 startpos 6075, fillpos 6075, endpos -1 5376 2015-03-11 08:29:06 Transport setup is: 5376 2015-03-11 08:29:06 xymondportnumber = 1984 5376 2015-03-11 08:29:06 xymonproxyhost = NONE 5376 2015-03-11 08:29:06 xymonproxyport = 0 5376 2015-03-11 08:29:06 Recipient listed as '172.21.219.192' 5376 2015-03-11 08:29:06 Standard protocol on port 1984 5376 2015-03-11 08:29:06 Will connect to address 172.21.219.192 port 1984 5376 2015-03-11 08:29:06 Connect status is 0 5376 2015-03-11 08:29:06 Sent 16 bytes 5376 2015-03-11 08:29:06 Read 8192 bytes 5376 2015-03-11 08:29:06 Read 7657 bytes 5376 2015-03-11 08:29:06 Closing connection 5376 2015-03-11 08:29:06 Opening file /opt/local/xymon/server/etc/analysis.cfg 5376 2015-03-11 08:29:06 Compiling regex ctrldom[A-D] 5376 2015-03-11 08:29:06 Compiling regex infdom[A-B]1 5376 2015-03-11 08:29:06 Compiling regex .+dom.. 5376 2015-03-11 08:29:06 Compiling regex infdom[A-B]1|notesdomD1 5376 2015-03-11 08:29:06 Compiling regex [.:]80$ 5376 2015-03-11 08:29:06 Compiling regex syn.*,%oradom.* 5376 2015-03-11 08:29:06 Compiling regex syn.*|oradom.*|SINFV.* 5376 2015-03-11 08:29:06 Compiling regex [.:]514$ 5376 2015-03-11 08:29:06 Compiling regex [.:]3389$ 5376 2015-03-11 08:29:06 Compiling regex energise.* 5376 2015-03-11 08:29:06 Compiling regex [.:]8989$ 5376 2015-03-11 08:29:06 Compiling regex [.:]80$ 5376 2015-03-11 08:29:06 Compiling regex [.:]4848$ 5376 2015-03-11 08:29:06 Compiling regex [.:]8080$ 5376 2015-03-11 08:29:06 Compiling regex ^/var/crash/.* 5376 2015-03-11 08:29:06 Compiling regex ^/var/core/.* 5376 2015-03-11 08:29:06 Compiling regex [.:]22$ 5376 2015-03-11 08:29:06 Compiling regex sinfct.*|domino.* 5376 2015-03-11 08:29:06 Compiling regex [.:]80$ 5376 2015-03-11 08:29:06 Compiling regex [.:]25$ 5376 2015-03-11 08:29:06 Compiling regex sinfct.*|domino.*|SINFV.*|hermes|sirocco 5376 2015-03-11 08:29:06 Compiling regex SINFV.* 5376 2015-03-11 08:29:06 Compiling regex .* 5376 2015-03-11 08:29:06 Compiling regex ^error.* 5376 2015-03-11 08:29:06 Compiling regex ^error.* 5376 2015-03-11 08:29:06 Compiling regex .*restart/shutdown.*|.*Citrix 5376 2015-03-11 08:29:06 Compiling regex .* 5376 2015-03-11 08:29:06 Compiling regex ^warning.* 5376 2015-03-11 08:29:06 Compiling regex ^warning.* 5376 2015-03-11 08:29:06 Compiling regex .*MetaFrameEvents.*|.*restart/shutdown.*|.*Citrix-Multimedia-Flash.*|.*Printer 5376 2015-03-11 08:29:06 Compiling regex .* 5376 2015-03-11 08:29:06 Compiling regex ^failure.* 5376 2015-03-11 08:29:06 Compiling regex ^failure.* 5376 2015-03-11 08:29:06 Compiling regex ^/var/crash/.* 5376 2015-03-11 08:29:06 Compiling regex ^/var/core/.* 5376 2015-03-11 08:29:06 Compiling regex [.:]22$ 5376 2015-03-11 08:29:06 Compiling regex ^/var/crash/.* 5376 2015-03-11 08:29:06 Compiling regex ^/var/core/.* 5376 2015-03-11 08:29:06 Compiling regex [.:]22$ 5376 2015-03-11 08:29:06 Opening file /opt/local/xymon/server/etc/protocols.cfg 5376 2015-03-11 08:29:06 Service list dump 5376 2015-03-11 08:29:06 Name : ftp 5376 2015-03-11 08:29:06 Sendtext: quit\r\n 5376 2015-03-11 08:29:06 Sendlen : 6 5376 2015-03-11 08:29:06 Exp.text: 220 5376 2015-03-11 08:29:06 Exp.len : 3 5376 2015-03-11 08:29:06 Exp.ofs : 0 5376 2015-03-11 08:29:06 Flags : 1 5376 2015-03-11 08:29:06 Port : 21 5376 2015-03-11 08:29:06 Name : ftps 5376 2015-03-11 08:29:06 Sendtext: quit\r\n 5376 2015-03-11 08:29:06 Sendlen : 6 5376 2015-03-11 08:29:06 Exp.text: 220 5376 2015-03-11 08:29:06 Exp.len : 3 5376 2015-03-11 08:29:06 Exp.ofs : 0 5376 2015-03-11 08:29:06 Flags : 5 5376 2015-03-11 08:29:06 Port : 990 5376 2015-03-11 08:29:06 Name : ssh 5376 2015-03-11 08:29:06 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5376 2015-03-11 08:29:06 Sendlen : 21 5376 2015-03-11 08:29:06 Exp.text: SSH 5376 2015-03-11 08:29:06 Exp.len : 3 5376 2015-03-11 08:29:06 Exp.ofs : 0 5376 2015-03-11 08:29:06 Flags : 1 5376 2015-03-11 08:29:06 Port : 22 5376 2015-03-11 08:29:06 Name : ssh1 5376 2015-03-11 08:29:06 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5376 2015-03-11 08:29:06 Sendlen : 21 5376 2015-03-11 08:29:06 Exp.text: SSH 5376 2015-03-11 08:29:06 Exp.len : 3 5376 2015-03-11 08:29:06 Exp.ofs : 0 5376 2015-03-11 08:29:06 Flags : 1 5376 2015-03-11 08:29:06 Port : 22 5376 2015-03-11 08:29:06 Name : ssh2 5376 2015-03-11 08:29:06 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 5376 2015-03-11 08:29:06 Sendlen : 21 5376 2015-03-11 08:29:06 Exp.text: SSH 5376 2015-03-11 08:29:06 Exp.len : 3 5376 2015-03-11 08:29:06 Exp.ofs : 0 5376 2015-03-11 08:29:06 Flags : 1 5376 2015-03-11 08:29:06 Port : 22 5376 2015-03-11 08:29:06 Name : telnet 5376 2015-03-11 08:29:06 2015-03-11 08:29:07 Peer at 0.0.0.0:0 failed: Broken pipe 2015-03-11 08:29:07 Child process 5376 died: Signal 6
On 11 March 2015 at 09:52, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 11 March 2015 at 11:37, Vernon Everett <everett.vernon at gmail.com> wrote:
And even with --no-cache, I am still getting these corrupted rrd files.
:-(
I tried again with --debug (and --no-cache) and it core dumps.
Here's the backtrace.
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
So dump_tcp_services() calls dbgprintf() (both on lib/netservices.c) which in turn calls vprintf() from libc, but with bad parameters. I've had a look through the code in dump_tcp_services() and I don't know enough C to recognize any problems. But it might be useful to know which call to dbgprintf() is causing the problem.
Does the log file for xymond_rrd show any debug output at all? If so, what's the last line that is shown.
It might be helpful if you can recompile xymond_rrd with dump_tcp_services() modified. Initially, I would simply try it with "return" added after the first call to dbgprintf(). That is, dump_tcp_services() will output "Service list dump" and return. This might stop the core dumps so that we can get debug output for other parts of the xymond_rrd processing.
If adding "return" at that point fixes this core dump, more diagnostic lines would be useful to determine what the problem is. For example, there's a global array called svcinfo that is iterated over, but if the array is empty, it might cause the core dump. So adding a line that checks whether the array is empty and displays the result would help to pin this down.
Note that "svcinfo" appears to be populated from the protocols.cfg file and/or XYMONNETSVCS. Is it possible that your protocols.cfg file is empty, or has some syntax error that causes it to be unparseable? The same for XYMONNETSVCS (in xymonserver.cfg)?
J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
Hi Jeremy
As much as I like the idea of modifying the code and recompiling, we need to remember, this is a production system. I get frowned upon when I add --debug just to create a core dump. :-(
I doubt changing source code and recompiling is going to get a green light.
As for your question about XYMONNETSVC, it's not a variable that's defined anywhere in my config. etc]# grep XYMONNETSVC * return nothing.
Regards Vernon
On 11 March 2015 at 09:52, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 11 March 2015 at 11:37, Vernon Everett <everett.vernon at gmail.com> wrote:
And even with --no-cache, I am still getting these corrupted rrd files.
:-(
I tried again with --debug (and --no-cache) and it core dumps.
Here's the backtrace.
libc.so.1`vfprintf+0xec(6c3d0, 514c0, ffbfb3e8, 0, a0ba4, 33e1c) dbgprintf+0xa4(514c0, 0, 51400, 6c3f0, bf, 2ab388) dump_tcp_services+0x74(a0, 1c00, fef37940, 0, 51400, 51400)
So dump_tcp_services() calls dbgprintf() (both on lib/netservices.c) which in turn calls vprintf() from libc, but with bad parameters. I've had a look through the code in dump_tcp_services() and I don't know enough C to recognize any problems. But it might be useful to know which call to dbgprintf() is causing the problem.
Does the log file for xymond_rrd show any debug output at all? If so, what's the last line that is shown.
It might be helpful if you can recompile xymond_rrd with dump_tcp_services() modified. Initially, I would simply try it with "return" added after the first call to dbgprintf(). That is, dump_tcp_services() will output "Service list dump" and return. This might stop the core dumps so that we can get debug output for other parts of the xymond_rrd processing.
If adding "return" at that point fixes this core dump, more diagnostic lines would be useful to determine what the problem is. For example, there's a global array called svcinfo that is iterated over, but if the array is empty, it might cause the core dump. So adding a line that checks whether the array is empty and displays the result would help to pin this down.
Note that "svcinfo" appears to be populated from the protocols.cfg file and/or XYMONNETSVCS. Is it possible that your protocols.cfg file is empty, or has some syntax error that causes it to be unparseable? The same for XYMONNETSVCS (in xymonserver.cfg)?
J
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
On 11 March 2015 at 13:13, Vernon Everett <everett.vernon at gmail.com> wrote:
As much as I like the idea of modifying the code and recompiling, we need to remember, this is a production system. I get frowned upon when I add --debug just to create a core dump. :-(
I doubt changing source code and recompiling is going to get a green light.
Yep, fair enough. I'm guessing you can't reproduce the core dump on a non-production system.
As for your question about XYMONNETSVC, it's not a variable that's defined anywhere in my config. etc]# grep XYMONNETSVC * return nothing.
Haha, looks like it's an internal environment variable that Xymon uses to pass around the list of xymonnet services. It's mentioned in the man page for xymonnet, but otherwise it appears to be unexposed and intended to be for-internal-use-only.
Interesting about the telnet definition. Do you do any telnet testing? If not, can you comment the stanzas for telnet and telnets out of protocols.cfg and try again?
Something else to try is to put in a "send" and "expect" lines such as:
[telnet] send "" expect "" options banner,telnet port 23
I think this might be the cause of the core dumps. For protocols.cfg stanzas that don't have "send" or "expect" strings, the debug code tries to print such strings, and they turn out to be null pointers or something like that.
J
Hi Jeremy
I think you got that one nailed. I added send "" expect "" as you requested, and it cores again. This time on the next entry, which is telnets [telnets] options ssl,banner,telnet port 992 Again, no send and expect. Updated telnets, and it crapped out at ldap. [ldap] port 389
[ldaps] options ssl port 636
[rsync] expect "@RSYNCD" options banner port 873 About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.
Any suggestions? I think we have some debug code update recommendations for JC though. :-) A good win, but still doesn't solve my original problem. :-(
Regards Vernon
On 11 March 2015 at 11:02, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 11 March 2015 at 13:13, Vernon Everett <everett.vernon at gmail.com> wrote:
As much as I like the idea of modifying the code and recompiling, we need to remember, this is a production system. I get frowned upon when I add --debug just to create a core dump. :-(
I doubt changing source code and recompiling is going to get a green light.
Yep, fair enough. I'm guessing you can't reproduce the core dump on a non-production system.
As for your question about XYMONNETSVC, it's not a variable that's defined anywhere in my config. etc]# grep XYMONNETSVC * return nothing.
Haha, looks like it's an internal environment variable that Xymon uses to pass around the list of xymonnet services. It's mentioned in the man page for xymonnet, but otherwise it appears to be unexposed and intended to be for-internal-use-only.
Interesting about the telnet definition. Do you do any telnet testing? If not, can you comment the stanzas for telnet and telnets out of protocols.cfg and try again?
Something else to try is to put in a "send" and "expect" lines such as:
[telnet] send "" expect "" options banner,telnet port 23
I think this might be the cause of the core dumps. For protocols.cfg stanzas that don't have "send" or "expect" strings, the debug code tries to print such strings, and they turn out to be null pointers or something like that.
J
-- "Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton
On 11 March 2015 at 14:18, Vernon Everett <everett.vernon at gmail.com> wrote:
I think you got that one nailed.
You might be overselling this finding a little. But it's progress nonetheless.
I added send "" expect "" as you requested, and it cores again. This time on the next entry, which is telnets
I run v4.3.10, and I can't get it to dump core the same way yours does, when "--debug" is used. Mine says "(null)" when there is no send or expect defined. The source code file containing the code that triggers the core dump is identical to one from 4.3.18. I've also tested it on a 4.3.18 installation (using the RPM package from Terabithia) but also won't fail. Please can you try this: printf "@@data#0/host|||||conn||\ndata host.trends\n\n@@\n" | XYMONTMP=/tmp ~xymon/server/bin/xymond_rrd --rrddir=/dev/null --debug (adjust path to xymond_rrd and xymon username if required) For me, this displays the debug output at least as far as when parsing the protocols.cfg file. If this also dumps core for you then we can work on fixing the core dump without risking your production Xymon config. You could also build a binary and run it from an alternative location so that the production binary isn't touched. I realize this doesn't fix your problem for you. But I'm optimistic that the debug output might point to where your problem is. J
We have 4.3.12 And the command dumps core in the same way. $ printf "@@data#0/host|||||conn||\ndata host.trends\n\n@@\n" | XYMONTMP=/tmp ~xymon/server/bin/xymond_rrd --rrddir=/dev/null --debug 16816 2015-03-11 12:38:02 Opening file /opt/local/xymon/server/etc/rrddefinitions.cfg 16816 2015-03-11 12:38:02 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=528383 16816 2015-03-11 12:38:02 Got 46 bytes 16816 2015-03-11 12:38:02 /opt/local/xymon/server/bin/xymond_rrd: Got message 0 @@data#0/host|||||conn|| 16816 2015-03-11 12:38:02 startpos 46, fillpos 46, endpos -1 16816 2015-03-11 12:38:02 Transport setup is: 16816 2015-03-11 12:38:02 xymondportnumber = 1984 16816 2015-03-11 12:38:02 xymonproxyhost = NONE 16816 2015-03-11 12:38:02 xymonproxyport = 0 16816 2015-03-11 12:38:02 Recipient listed as '172.21.219.192' 16816 2015-03-11 12:38:02 Standard protocol on port 1984 16816 2015-03-11 12:38:02 Will connect to address 172.21.219.192 port 1984 16816 2015-03-11 12:38:02 Connect status is 0 16816 2015-03-11 12:38:02 Sent 16 bytes 16816 2015-03-11 12:38:02 Read 15849 bytes 16816 2015-03-11 12:38:02 Closing connection 16816 2015-03-11 12:38:02 Opening file /opt/local/xymon/server/etc/analysis.cfg 16816 2015-03-11 12:38:02 Compiling regex ctrldom[A-D] 16816 2015-03-11 12:38:02 Compiling regex infdom[A-B]1 16816 2015-03-11 12:38:02 Compiling regex .+dom.. 16816 2015-03-11 12:38:02 Compiling regex infdom[A-B]1|notesdomD1 16816 2015-03-11 12:38:02 Compiling regex [.:]80$ 16816 2015-03-11 12:38:02 Compiling regex syn.*,%oradom.* 16816 2015-03-11 12:38:02 Compiling regex syn.*|oradom.*|SINFV.* 16816 2015-03-11 12:38:02 Compiling regex [.:]514$ 16816 2015-03-11 12:38:02 Compiling regex [.:]3389$ 16816 2015-03-11 12:38:02 Compiling regex energise.* 16816 2015-03-11 12:38:02 Compiling regex [.:]8989$ 16816 2015-03-11 12:38:02 Compiling regex [.:]80$ 16816 2015-03-11 12:38:02 Compiling regex [.:]4848$ 16816 2015-03-11 12:38:02 Compiling regex [.:]8080$ 16816 2015-03-11 12:38:02 Compiling regex ^/var/crash/.* 16816 2015-03-11 12:38:02 Compiling regex ^/var/core/.* 16816 2015-03-11 12:38:02 Compiling regex [.:]22$ 16816 2015-03-11 12:38:02 Compiling regex sinfct.*|domino.* 16816 2015-03-11 12:38:02 Compiling regex [.:]80$ 16816 2015-03-11 12:38:02 Compiling regex [.:]25$ 16816 2015-03-11 12:38:02 Compiling regex sinfct.*|domino.*|SINFV.*|hermes|sirocco 16816 2015-03-11 12:38:02 Compiling regex SINFV.* 16816 2015-03-11 12:38:02 Compiling regex .* 16816 2015-03-11 12:38:02 Compiling regex ^error.* 16816 2015-03-11 12:38:02 Compiling regex ^error.* 16816 2015-03-11 12:38:02 Compiling regex .*restart/shutdown.*|.*Citrix 16816 2015-03-11 12:38:02 Compiling regex .* 16816 2015-03-11 12:38:02 Compiling regex ^warning.* 16816 2015-03-11 12:38:02 Compiling regex ^warning.* 16816 2015-03-11 12:38:02 Compiling regex .*MetaFrameEvents.*|.*restart/shutdown.*|.*Citrix-Multimedia-Flash.*|.*Printer 16816 2015-03-11 12:38:02 Compiling regex .* 16816 2015-03-11 12:38:02 Compiling regex ^failure.* 16816 2015-03-11 12:38:02 Compiling regex ^failure.* 16816 2015-03-11 12:38:02 Compiling regex ^/var/crash/.* 16816 2015-03-11 12:38:02 Compiling regex ^/var/core/.* 16816 2015-03-11 12:38:02 Compiling regex [.:]22$ 16816 2015-03-11 12:38:02 Compiling regex ^/var/crash/.* 16816 2015-03-11 12:38:02 Compiling regex ^/var/core/.* 16816 2015-03-11 12:38:02 Compiling regex [.:]22$ 16816 2015-03-11 12:38:02 Opening file /opt/local/xymon/server/etc/protocols.cfg 16816 2015-03-11 12:38:02 Service list dump 16816 2015-03-11 12:38:02 Name : ftp 16816 2015-03-11 12:38:02 Sendtext: quit\r\n 16816 2015-03-11 12:38:02 Sendlen : 6 16816 2015-03-11 12:38:02 Exp.text: 220 16816 2015-03-11 12:38:02 Exp.len : 3 16816 2015-03-11 12:38:02 Exp.ofs : 0 16816 2015-03-11 12:38:02 Flags : 1 16816 2015-03-11 12:38:02 Port : 21 16816 2015-03-11 12:38:02 Name : ftps 16816 2015-03-11 12:38:02 Sendtext: quit\r\n 16816 2015-03-11 12:38:02 Sendlen : 6 16816 2015-03-11 12:38:02 Exp.text: 220 16816 2015-03-11 12:38:02 Exp.len : 3 16816 2015-03-11 12:38:02 Exp.ofs : 0 16816 2015-03-11 12:38:02 Flags : 5 16816 2015-03-11 12:38:02 Port : 990 16816 2015-03-11 12:38:02 Name : ssh 16816 2015-03-11 12:38:02 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 16816 2015-03-11 12:38:02 Sendlen : 21 16816 2015-03-11 12:38:02 Exp.text: SSH 16816 2015-03-11 12:38:02 Exp.len : 3 16816 2015-03-11 12:38:02 Exp.ofs : 0 16816 2015-03-11 12:38:02 Flags : 1 16816 2015-03-11 12:38:02 Port : 22 16816 2015-03-11 12:38:02 Name : ssh1 16816 2015-03-11 12:38:02 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 16816 2015-03-11 12:38:02 Sendlen : 21 16816 2015-03-11 12:38:02 Exp.text: SSH 16816 2015-03-11 12:38:02 Exp.len : 3 16816 2015-03-11 12:38:02 Exp.ofs : 0 16816 2015-03-11 12:38:02 Flags : 1 16816 2015-03-11 12:38:02 Port : 22 16816 2015-03-11 12:38:02 Name : ssh2 16816 2015-03-11 12:38:02 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 16816 2015-03-11 12:38:02 Sendlen : 21 16816 2015-03-11 12:38:02 Exp.text: SSH 16816 2015-03-11 12:38:02 Exp.len : 3 16816 2015-03-11 12:38:02 Exp.ofs : 0 16816 2015-03-11 12:38:02 Flags : 1 16816 2015-03-11 12:38:02 Port : 22 16816 2015-03-11 12:38:02 Name : telnet 16816 2015-03-11 12:38:02 Abort (core dumped) On 11 March 2015 at 12:02, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
On 11 March 2015 at 14:18, Vernon Everett <everett.vernon at gmail.com> wrote:
I think you got that one nailed.
You might be overselling this finding a little. But it's progress nonetheless.
I added send "" expect "" as you requested, and it cores again. This time on the next entry, which is telnets
I run v4.3.10, and I can't get it to dump core the same way yours does, when "--debug" is used. Mine says "(null)" when there is no send or expect defined. The source code file containing the code that triggers the core dump is identical to one from 4.3.18. I've also tested it on a 4.3.18 installation (using the RPM package from Terabithia) but also won't fail.
Please can you try this:
printf "@@data#0/host|||||conn||\ndata host.trends\n\n@@\n" | XYMONTMP=/tmp ~xymon/server/bin/xymond_rrd --rrddir=/dev/null --debug
(adjust path to xymond_rrd and xymon username if required)
For me, this displays the debug output at least as far as when parsing the protocols.cfg file. If this also dumps core for you then we can work on fixing the core dump without risking your production Xymon config. You could also build a binary and run it from an alternative location so that the production binary isn't touched.
I realize this doesn't fix your problem for you. But I'm optimistic that the debug output might point to where your problem is.
J
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
On 3/10/2015 7:18 PM, Vernon Everett wrote:
Hi Jeremy
I think you got that one nailed. I added send "" expect "" as you requested, and it cores again. This time on the next entry, which is telnets [telnets] options ssl,banner,telnet port 992 Again, no send and expect. Updated telnets, and it crapped out at ldap. [ldap] port 389
[ldaps] options ssl port 636
[rsync] expect "@RSYNCD" options banner port 873 About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.
Any suggestions? I think we have some debug code update recommendations for JC though. :-) A good win, but still doesn't solve my original problem. :-(
I identified and shared this exact situation in August and September of last year: http://lists.xymon.com/pipermail/xymon/2014-August/040130.html and http://lists.xymon.com/pipermail/xymon/2014-September/040163.html
My C isn't good enough to offer a correction for the defect, but at least someone had now duplicated the problem. My solution has been not to enable the --debug option on rrd handler....and hope I don't have a problem I need to debug.
Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
On 12/03/2015 3:47 am, "John Thurston" <john.thurston at alaska.gov> wrote:
I identified and shared this exact situation in August and September of last year:
Oh, how embarrassment. I even contributed to that thread and didn't remember. I must be getting old!
My C isn't good enough to offer a correction for the defect, but at least someone had now duplicated the problem. My solution has been not to enable the --debug option on rrd handler....and hope I don't have a problem I need to debug.
OK, I'll work on a patch to stop these core dumps and try to get it into the mainstream. I don't think we can make any more progress on the original problem without a working debug mode.
J
On 11 March 2015 at 14:18, Vernon Everett <everett.vernon at gmail.com> wrote:
About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none. Any suggestions?
I think we have some debug code update recommendations for JC though. :-)
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release. --- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags); This produces "[null]" where we would have seen "(null)" on a GNU-based OS, to differentiate between the two situations. In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me: sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug' This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines: 15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619 @@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC ... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' ... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737 @@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF 15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603 J
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <everett.vernon at gmail.com> wrote:
About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none.
Any suggestions?
I think we have some debug code update recommendations for JC though. :-)
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release.
--- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags);
This produces "[null]" where we would have seen "(null)" on a GNU-based OS, to differentiate between the two situations.
In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me:
sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug'
This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines:
15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619 @@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC ... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' ... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737 @@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF 15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603
J
This is some excellent sleuthing! :) As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output. The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in. In the meantime, this will be in 4.3.19 and can be patched directly from below. HTH, -jc --- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i; - if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; } - if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */
Hi all I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system) However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files. Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this. Total power drawn- 487 Watts Number of trays- 2 Tray power input details- TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines. I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd So I thought I would check my configs. In xymonserver
From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE"
And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting. Regards Vernon On 13 March 2015 at 15:24, J.C. Cleaver <cleaver at terabithia.org> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <everett.vernon at gmail.com> wrote:
About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none.
Any suggestions?
I think we have some debug code update recommendations for JC though. :-)
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release.
--- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags);
This produces "[null]" where we would have seen "(null)" on a GNU-based OS, to differentiate between the two situations.
In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me:
sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug'
This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines:
15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' ... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603
J
This is some excellent sleuthing! :)
As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output.
The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in.
In the meantime, this will be in 4.3.19 and can be patched directly from below.
HTH,
-jc
--- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i;
- if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; }
- if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */
_______________________________________________ Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
Vernon The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN. Click on the graphs images for each version to get the 4-graph view and compare the URLs. J On Fri, 20 Mar 2015 19:35 Vernon Everett <everett.vernon at gmail.com> wrote:
Hi all
I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system)
However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files.
Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this.
Total power drawn- 487 Watts Number of trays- 2 Tray power input details-
TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts
All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines.
I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd
So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE"
And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting.
Regards Vernon
On 13 March 2015 at 15:24, J.C. Cleaver <cleaver at terabithia.org> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <everett.vernon at gmail.com> wrote:
About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none.
Any suggestions?
I think we have some debug code update recommendations for JC though. :-)
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release.
--- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags);
This produces "[null]" where we would have seen "(null)" on a GNU-based OS, to differentiate between the two situations.
In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me:
sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug'
This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines:
15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' ... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603
J
This is some excellent sleuthing! :)
As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output.
The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in.
In the meantime, this will be in 4.3.19 and can be patched directly from below.
HTH,
-jc
--- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i;
- if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; }
- if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */
_______________________________________________
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
Hi Jeremy That thought occurred to me, but I checked. There is only one [power] entry in the graphs.cfg file. And I put it there for this particular test. Would have made this one too easy if it was that. :-) Regards Vernon On 20 March 2015 at 16:43, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
Vernon
The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN.
Click on the graphs images for each version to get the 4-graph view and compare the URLs.
J
On Fri, 20 Mar 2015 19:35 Vernon Everett <everett.vernon at gmail.com> wrote:
Hi all
I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system)
However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files.
Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this.
Total power drawn- 487 Watts Number of trays- 2 Tray power input details-
TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts
All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines.
I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd
So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE"
And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting.
Regards Vernon
On 13 March 2015 at 15:24, J.C. Cleaver <cleaver at terabithia.org> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <everett.vernon at gmail.com> wrote:
About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none.
Any suggestions?
I think we have some debug code update recommendations for JC though. :-)
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release.
--- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags);
This produces "[null]" where we would have seen "(null)" on a GNU-based OS, to differentiate between the two situations.
In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me:
sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug'
This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines:
15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' ... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603
J
This is some excellent sleuthing! :)
As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output.
The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in.
In the meantime, this will be in 4.3.19 and can be patched directly from below.
HTH,
-jc
--- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i;
- if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; }
- if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */
_______________________________________________
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
So the URLs are different? But both have service=power in the URLs? On Sat, 21 Mar 2015 10:16 Vernon Everett <everett.vernon at gmail.com> wrote:
Hi Jeremy
That thought occurred to me, but I checked. There is only one [power] entry in the graphs.cfg file. And I put it there for this particular test.
Would have made this one too easy if it was that. :-)
Regards Vernon
On 20 March 2015 at 16:43, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
Vernon
The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN.
Click on the graphs images for each version to get the 4-graph view and compare the URLs.
J
On Fri, 20 Mar 2015 19:35 Vernon Everett <everett.vernon at gmail.com> wrote:
Hi all
I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system)
However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files.
Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this.
Total power drawn- 487 Watts Number of trays- 2 Tray power input details-
TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts
All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines.
I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd
So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE"
And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting.
Regards Vernon
On 13 March 2015 at 15:24, J.C. Cleaver <cleaver at terabithia.org> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <everett.vernon at gmail.com> wrote:
About now, I am getting a little nervous adding send and expect, because unlike telnet and telnets, we are doing ldap and ldaps testing.
That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none.
Any suggestions?
I think we have some debug code update recommendations for JC though. :-)
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release.
--- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags);
This produces "[null]" where we would have seen "(null)" on a GNU-based OS, to differentiate between the two situations.
In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me:
sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug'
This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines:
15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' ... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603
J
This is some excellent sleuthing! :)
As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output.
The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in.
In the meantime, this will be in 4.3.19 and can be patched directly from below.
HTH,
-jc
--- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i;
- if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; }
- if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */
_______________________________________________
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
Very confused now. In the test graph, showing the history graphs, the URL contains service=ncv:power And in the history graphs in the status, it's this. service=power That doesn't confuse me as much as what the graphs look like now. Both the test and the trends graph now contain the spurious values. Yesterday, they were only in the trends graph. A need to get that debug output fixed. Regards Vernon On 21 March 2015 at 17:48, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
So the URLs are different? But both have service=power in the URLs?
On Sat, 21 Mar 2015 10:16 Vernon Everett <everett.vernon at gmail.com> wrote:
Hi Jeremy
That thought occurred to me, but I checked. There is only one [power] entry in the graphs.cfg file. And I put it there for this particular test.
Would have made this one too easy if it was that. :-)
Regards Vernon
On 20 March 2015 at 16:43, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
Vernon
The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN.
Click on the graphs images for each version to get the 4-graph view and compare the URLs.
J
On Fri, 20 Mar 2015 19:35 Vernon Everett <everett.vernon at gmail.com> wrote:
Hi all
I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system)
However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files.
Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this.
Total power drawn- 487 Watts Number of trays- 2 Tray power input details-
TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts
All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines.
I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd
So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE"
And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting.
Regards Vernon
On 13 March 2015 at 15:24, J.C. Cleaver <cleaver at terabithia.org> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote:
On 11 March 2015 at 14:18, Vernon Everett <everett.vernon at gmail.com> wrote:
> About now, I am getting a little nervous adding send and expect, because > unlike telnet and telnets, we are doing ldap and ldaps testing. >
That's understandable. A read through the code suggests that at least in some places, an empty string is equivalent to an undefined string, as the string length (shown in Sendlen in the debug output) is zero in both cases. So until a patch is in place, a work-around might be to define empty "send" and "expect" strings for those that have none.
Any suggestions? > I think we have some debug code update recommendations for JC though. > :-) >
Here's my patch. I'll push this into the dev list for proposed inclusion in a future release.
--- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 @@ -328,9 +328,9 @@ dbgprintf("Service list dump\n"); for (i=0; (svcinfo[i].svcname); i++) { dbgprintf(" Name : %s\n", svcinfo[i].svcname); - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, svcinfo[i].sendlen)); + dbgprintf(" Sendtext: %s\n", svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, svcinfo[i].sendlen):"[null]"); dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, svcinfo[i].explen)); + dbgprintf(" Exp.text: %s\n", svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, svcinfo[i].explen):"[null]"); dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); dbgprintf(" Flags : %d\n", svcinfo[i].flags);
This produces "[null]" where we would have seen "(null)" on a GNU-based OS, to differentiate between the two situations.
In the mean time, you could compile a special version of xymond_rrd, and run it manually on the same data channel as the real one, but have it make RRD files and log file to a different location. This shouldn't interfere with your production Xymon. Here's one I prepared earlier that works for me:
sudo -u xymon mkdir /tmp/my-rrd-data/ sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; /usr/lib/xymon/server/bin/xymond_channel --channel=data --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch --rrddir=/tmp/my-rrd-data/ --debug'
This seems to show some really useful stuff that's relevant to solving your problem. Some sample debug lines:
15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619
@@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC
... 15306 2015-03-12 11:36:28 Creating rrd /tmp/my-rrd-data//servername/vmstat.rrd 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' 15306 2015-03-12 11:36:28 RRD create param 01: '/tmp/my-rrd-data//servername/vmstat.rrd' 15306 2015-03-12 11:36:28 RRD create param 02: '-s' 15306 2015-03-12 11:36:28 RRD create param 03: '300' 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' ... 15306 2015-03-12 11:39:42 Got 265 bytes 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737
@@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF
15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 15306 2015-03-12 11:39:42 Flushing '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, template 'sec' 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos 216644, endpos -1, usedbytes=0, bufleft=1884603
J
This is some excellent sleuthing! :)
As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output.
The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in.
In the meantime, this will be in 4.3.19 and can be patched directly from below.
HTH,
-jc
--- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i;
- if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; }
- if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */
_______________________________________________
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
Stupid question but can't you just set up a test Xymon server and have the client system point to both servers? That way you can update the test instance and do whatever you need to tweak it without impacting prod. On Sat, Mar 21, 2015 at 6:31 AM, Vernon Everett <everett.vernon at gmail.com> wrote:
Very confused now. In the test graph, showing the history graphs, the URL contains service=ncv:power And in the history graphs in the status, it's this. service=power
That doesn't confuse me as much as what the graphs look like now. Both the test and the trends graph now contain the spurious values. Yesterday, they were only in the trends graph.
A need to get that debug output fixed.
Regards Vernon
On 21 March 2015 at 17:48, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
So the URLs are different? But both have service=power in the URLs?
On Sat, 21 Mar 2015 10:16 Vernon Everett <everett.vernon at gmail.com> wrote:
Hi Jeremy
That thought occurred to me, but I checked. There is only one [power] entry in the graphs.cfg file. And I put it there for this particular test.
Would have made this one too easy if it was that. :-)
Regards Vernon
On 20 March 2015 at 16:43, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
Vernon
The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN.
Click on the graphs images for each version to get the 4-graph view and compare the URLs.
J
On Fri, 20 Mar 2015 19:35 Vernon Everett <everett.vernon at gmail.com> wrote:
Hi all
I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system)
However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files.
Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this.
Total power drawn- 487 Watts Number of trays- 2 Tray power input details-
TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts
All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines.
I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd
So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE"
And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting.
Regards Vernon
On 13 March 2015 at 15:24, J.C. Cleaver <cleaver at terabithia.org> wrote:
On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote: > On 11 March 2015 at 14:18, Vernon Everett <everett.vernon at gmail.com > > wrote: > >> About now, I am getting a little nervous adding send and expect, because >> unlike telnet and telnets, we are doing ldap and ldaps testing. >> > > That's understandable. A read through the code suggests that at least in > some places, an empty string is equivalent to an undefined string, as the > string length (shown in Sendlen in the debug output) is zero in both > cases. So until a patch is in place, a work-around might be to define > empty "send" and "expect" strings for those that have none. > > Any suggestions? >> I think we have some debug code update recommendations for JC though. >> :-) >> > > Here's my patch. I'll push this into the dev list for proposed inclusion > in a future release. > > --- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 > +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 > @@ -328,9 +328,9 @@ > dbgprintf("Service list dump\n"); > for (i=0; (svcinfo[i].svcname); i++) { > dbgprintf(" Name : %s\n", svcinfo[i].svcname); > - dbgprintf(" Sendtext: %s\n", binview(svcinfo[i].sendtxt, > svcinfo[i].sendlen)); > + dbgprintf(" Sendtext: %s\n", > svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, > svcinfo[i].sendlen):"[null]"); > dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); > - dbgprintf(" Exp.text: %s\n", binview(svcinfo[i].exptext, > svcinfo[i].explen)); > + dbgprintf(" Exp.text: %s\n", > svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, > svcinfo[i].explen):"[null]"); > dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); > dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); > dbgprintf(" Flags : %d\n", svcinfo[i].flags); > > This produces "[null]" where we would have seen "(null)" on a GNU-based > OS, > to differentiate between the two situations. > > In the mean time, you could compile a special version of xymond_rrd, and > run it manually on the same data channel as the real one, but have it make > RRD files and log file to a different location. This shouldn't interfere > with your production Xymon. Here's one I prepared earlier that works for > me: > > sudo -u xymon mkdir /tmp/my-rrd-data/ > sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; > /usr/lib/xymon/server/bin/xymond_channel --channel=data > --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch > --rrddir=/tmp/my-rrd-data/ --debug' > > This seems to show some really useful stuff that's relevant to solving > your > problem. Some sample debug lines: > > 15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message 165619 > @@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC > ... > 15306 2015-03-12 11:36:28 Creating rrd > /tmp/my-rrd-data//servername/vmstat.rrd > 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' > 15306 2015-03-12 11:36:28 RRD create param 01: > '/tmp/my-rrd-data//servername/vmstat.rrd' > 15306 2015-03-12 11:36:28 RRD create param 02: '-s' > 15306 2015-03-12 11:36:28 RRD create param 03: '300' > 15306 2015-03-12 11:36:28 RRD create param 04: 'DS:cpu_r:GAUGE:600:0:U' > 15306 2015-03-12 11:36:28 RRD create param 05: 'DS:cpu_b:GAUGE:600:0:U' > 15306 2015-03-12 11:36:28 RRD create param 06: 'DS:cpu_w:GAUGE:600:0:U' > ... > 15306 2015-03-12 11:39:42 Got 265 bytes > 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message 165737 > @@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF > 15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos -1 > 15306 2015-03-12 11:39:42 Flushing > '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, > template > 'sec' > 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos > 216644, > endpos -1, usedbytes=0, bufleft=1884603 > > J >
This is some excellent sleuthing! :)
As I was pouring through the thread (sorry, I've been out the last few days), I failed to take note of the SPARC-Enterprise-T2000 in the output.
The patch below should fix the immediate issue triggered by debug mode... letting us move on to the larger oddness. Unfortunately, I have a feeling there are other occasions where we're relying on GNU's printf(NULL) printing that out and thus might be caught by this. As I find them, I go ahead and work to put fixes in.
In the meantime, this will be in 4.3.19 and can be patched directly from below.
HTH,
-jc
--- lib/netservices.c (revision 7598) +++ lib/netservices.c (working copy) @@ -81,9 +81,9 @@ unsigned char *inp, *outp; int i;
- if (!buf) return NULL; + if (result) xfree(result); + if (!buf) { result = strdup("[null]"); return result; }
- if (result) xfree(result); if (buf && (buflen == 0)) buflen = strlen(buf); result = (char *)malloc(4*buflen + 1); /* Worst case: All binary */
_______________________________________________
Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
_______________________________________________ Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
I was hoping to get approval to patch the production system, but change control at $CLIENT is sclerotic on a good day. Your suggestion is probably the way I will do it. Back there on Wednesday. Watch this space. Cheers Vernon On 21 March 2015 at 22:40, Galen Johnson <solitaryr at gmail.com> wrote:
Stupid question but can't you just set up a test Xymon server and have the client system point to both servers? That way you can update the test instance and do whatever you need to tweak it without impacting prod.
On Sat, Mar 21, 2015 at 6:31 AM, Vernon Everett <everett.vernon at gmail.com> wrote:
Very confused now. In the test graph, showing the history graphs, the URL contains service=ncv:power And in the history graphs in the status, it's this. service=power
That doesn't confuse me as much as what the graphs look like now. Both the test and the trends graph now contain the spurious values. Yesterday, they were only in the trends graph.
A need to get that debug output fixed.
Regards Vernon
On 21 March 2015 at 17:48, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
So the URLs are different? But both have service=power in the URLs?
On Sat, 21 Mar 2015 10:16 Vernon Everett <everett.vernon at gmail.com> wrote:
Hi Jeremy
That thought occurred to me, but I checked. There is only one [power] entry in the graphs.cfg file. And I put it there for this particular test.
Would have made this one too easy if it was that. :-)
Regards Vernon
On 20 March 2015 at 16:43, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
Vernon
The power status page must refer to a different graph name in graphs.cfg with a different FNPATTERN.
Click on the graphs images for each version to get the 4-graph view and compare the URLs.
J
On Fri, 20 Mar 2015 19:35 Vernon Everett <everett.vernon at gmail.com> wrote:
Hi all
I was only back at the client today, and unfortunately have not managed to get that patch in yet. (As I mentioned before, it's a production system)
However, I did notice something really odd. I have focused my attention on the trends graphs, where I get all the extra values, but it's not happening in the test itself, despite the existence of the additional rrd files.
Example. I have something that plots the power usage of the PSUs on a NetApp e-series. There are 4 PSUs, output looks like this.
Total power drawn- 487 Watts Number of trays- 2 Tray power input details-
TRAY ID POWER SUPPLY SERIAL NUMBER INPUT POWER 99 0 145 Watts 99 1 151 Watts 0 0 99 Watts 0 1 92 Watts
All good. And I have a graph with 4 lines. Min, Max, Curr and Avg values are all there. It looks beautiful. But go look at the power graph in trends, and it's ugly. Heaps of additional data lines with no entries. All values are NaN And mixed in amongst the additional empty graphs, are the 4 valid lines.
I look at the rrd files, and they are all there, even the bad ones. Here's a few of them. power,tcpListenDrop.rrd power,tcpOutAck.rrd power,tcpOutDataSegs.rrd power,tcpOutRsts.rrd power,tcpOutUrg.rrd power,tcpOutWinProbe.rrd power,tcpRetransSegs.rrd power,tcpRtoMax.rrd power,tcpRttUpdate.rrd power,tcpTimKeepaliveProbe.rrd power,tcpTimRetransDrop.rrd power,Tray0_PSU0.rrd <--- Valid power,Tray0_PSU1.rrd <--- Valid power,Tray99_PSU0.rrd <--- Valid power,Tray99_PSU1.rrd <--- Valid power,trlogpool.rrd power,UDP_udpInDatagrams.rrd power,udpInCksumErrs.rrd power,udpOutDatagrams.rrd power,vnet.rrd
So I thought I would check my configs. In xymonserver From TEST2RRD= ,power=ncv, From GRAPHS= ,power::9, And further down SPLITNCV_power="*:GAUGE"
And in graphs.cfg [power] FNPATTERN power,(.*).rrd TITLE Database Power Consumption Per Tray PSU YAXIS Watts -l 0 DEF:p at RRDIDX@=@RRDFN@:lambda:AVERAGE LINE2:p at RRDIDX@#@COLOR@:@RRDPARAM@ GPRINT:p at RRDIDX@:LAST: \: %5.1lf (cur) GPRINT:p at RRDIDX@:MAX: \: %5.1lf (max) GPRINT:p at RRDIDX@:MIN: \: %5.1lf (min) GPRINT:p at RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
With luck I will get approval to recompile with the debugging bug-fix, and we can get more info, but I thought the extra entries in trends, but not in the test was interesting.
Regards Vernon
On 13 March 2015 at 15:24, J.C. Cleaver <cleaver at terabithia.org> wrote:
> On Wed, March 11, 2015 5:51 pm, Jeremy Laidman wrote: > > On 11 March 2015 at 14:18, Vernon Everett < > everett.vernon at gmail.com> > > wrote: > > > >> About now, I am getting a little nervous adding send and expect, > because > >> unlike telnet and telnets, we are doing ldap and ldaps testing. > >> > > > > That's understandable. A read through the code suggests that at > least in > > some places, an empty string is equivalent to an undefined string, > as the > > string length (shown in Sendlen in the debug output) is zero in > both > > cases. So until a patch is in place, a work-around might be to > define > > empty "send" and "expect" strings for those that have none. > > > > Any suggestions? > >> I think we have some debug code update recommendations for JC > though. > >> :-) > >> > > > > Here's my patch. I'll push this into the dev list for proposed > inclusion > > in a future release. > > > > --- lib/netservices.c.orig 2012-07-25 01:48:41.000000000 +1000 > > +++ lib/netservices.c 2015-03-12 11:18:18.000000000 +1100 > > @@ -328,9 +328,9 @@ > > dbgprintf("Service list dump\n"); > > for (i=0; (svcinfo[i].svcname); i++) { > > dbgprintf(" Name : %s\n", svcinfo[i].svcname); > > - dbgprintf(" Sendtext: %s\n", > binview(svcinfo[i].sendtxt, > > svcinfo[i].sendlen)); > > + dbgprintf(" Sendtext: %s\n", > > svcinfo[i].sendtxt!=NULL?binview(svcinfo[i].sendtxt, > > svcinfo[i].sendlen):"[null]"); > > dbgprintf(" Sendlen : %d\n", svcinfo[i].sendlen); > > - dbgprintf(" Exp.text: %s\n", > binview(svcinfo[i].exptext, > > svcinfo[i].explen)); > > + dbgprintf(" Exp.text: %s\n", > > svcinfo[i].exptext!=NULL?binview(svcinfo[i].exptext, > > svcinfo[i].explen):"[null]"); > > dbgprintf(" Exp.len : %d\n", svcinfo[i].explen); > > dbgprintf(" Exp.ofs : %d\n", svcinfo[i].expofs); > > dbgprintf(" Flags : %d\n", svcinfo[i].flags); > > > > This produces "[null]" where we would have seen "(null)" on a > GNU-based > > OS, > > to differentiate between the two situations. > > > > In the mean time, you could compile a special version of > xymond_rrd, and > > run it manually on the same data channel as the real one, but have > it make > > RRD files and log file to a different location. This shouldn't > interfere > > with your production Xymon. Here's one I prepared earlier that > works for > > me: > > > > sudo -u xymon mkdir /tmp/my-rrd-data/ > > sudo -u xymon xymoncmd /bin/sh -c 'XYMONTMP=/tmp; > > /usr/lib/xymon/server/bin/xymond_channel --channel=data > > --log=/tmp/my-rrd-data.log /path/to/xymond_rrd_debug_patch > > --rrddir=/tmp/my-rrd-data/ --debug' > > > > This seems to show some really useful stuff that's relevant to > solving > > your > > problem. Some sample debug lines: > > > > 15306 2015-03-12 11:36:28 xymond_rrd_debug_patch: Got message > 165619 > > > @@data#165619/servername|1426120588.401891|172.16.0.1||servername|vmstat|sunos|ABC > > ... > > 15306 2015-03-12 11:36:28 Creating rrd > > /tmp/my-rrd-data//servername/vmstat.rrd > > 15306 2015-03-12 11:36:28 RRD create param 00: 'rrdcreate' > > 15306 2015-03-12 11:36:28 RRD create param 01: > > '/tmp/my-rrd-data//servername/vmstat.rrd' > > 15306 2015-03-12 11:36:28 RRD create param 02: '-s' > > 15306 2015-03-12 11:36:28 RRD create param 03: '300' > > 15306 2015-03-12 11:36:28 RRD create param 04: > 'DS:cpu_r:GAUGE:600:0:U' > > 15306 2015-03-12 11:36:28 RRD create param 05: > 'DS:cpu_b:GAUGE:600:0:U' > > 15306 2015-03-12 11:36:28 RRD create param 06: > 'DS:cpu_w:GAUGE:600:0:U' > > ... > > 15306 2015-03-12 11:39:42 Got 265 bytes > > 15306 2015-03-12 11:39:42 xymond_rrd_debug_patch: Got message > 165737 > > > @@data#165737/servername|1426120782.080244|172.16.0.2||servername|trends||DEF > > 15306 2015-03-12 11:39:42 startpos 216644, fillpos 216644, endpos > -1 > > 15306 2015-03-12 11:39:42 Flushing > > '/servername/tcp.xopiy90404.parameter.rrd' with 1 updates pending, > > template > > 'sec' > > 15306 2015-03-12 11:39:42 Want msg 165738, startpos 216644, fillpos > > 216644, > > endpos -1, usedbytes=0, bufleft=1884603 > > > > J > > > > > This is some excellent sleuthing! :) > > As I was pouring through the thread (sorry, I've been out the last > few > days), I failed to take note of the SPARC-Enterprise-T2000 in the > output. > > > The patch below should fix the immediate issue triggered by debug > mode... > letting us move on to the larger oddness. Unfortunately, I have a > feeling > there are other occasions where we're relying on GNU's printf(NULL) > printing that out and thus might be caught by this. As I find them, > I go > ahead and work to put fixes in. > > In the meantime, this will be in 4.3.19 and can be patched directly > from > below. > > > HTH, > > -jc > > > --- lib/netservices.c (revision 7598) > +++ lib/netservices.c (working copy) > @@ -81,9 +81,9 @@ > unsigned char *inp, *outp; > int i; > > - if (!buf) return NULL; > + if (result) xfree(result); > + if (!buf) { result = strdup("[null]"); return result; } > > - if (result) xfree(result); > if (buf && (buflen == 0)) buflen = strlen(buf); > result = (char *)malloc(4*buflen + 1); /* Worst case: All > binary */ > _______________________________________________ > Xymon mailing list > Xymon at xymon.com > http://lists.xymon.com/mailman/listinfo/xymon >
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
_______________________________________________ Xymon mailing list Xymon at xymon.com http://lists.xymon.com/mailman/listinfo/xymon
-- "Accept the challenges so that you can feel the exhilaration of victory" - General George Patton
On 2/23/2015 8:46 PM, Vernon Everett wrote:
I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap.
It is my observation that the rrd listener is extremely brittle. If a badly formatted message of the "data" type arrives, it will kill the rrd listener. This leads to gaps in all of the data charts, not just for the client which sent the garbage message. If I remember correctly, and "empty" message is something which will not parse and kills the listener.
When the rrd listener crashes, it leaves a tmp file behind (which causes other problems). I've written a script which I run from time to time. It hunts down the orphaned tmp files and kills them off. The pertinent part of it is identifying the orphans:
rrdfiles=~/server/tmp/rrdctl; comm -13 <(pgrep -u xymon xymond_rrd | sort -n) <(for i in
ls -1 ${rrdfiles}.* | sort -n;do echo ${i##*.};done)
Take a look and see how many orphans you have, and how frequently they are occurring. Maybe you can identify the culprit.
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
On 2/23/2015 8:46 PM, Vernon Everett wrote:
I am getting those sporadic .rrd files in spades. :-( Sometimes, only a single data point in the file. But enough files, and your graphs start to look like crap.
Tomorrow I am off to a client where it's happening all the time. What can I send you to assist with investigating?
I was looking through my rrd notes. Back in August and September of 2014 I was chasing an RRD failure. While trying to find the cause, I enabled the --debug option on the rrd daemon which caused the daemon to core dump.
http://lists.xymon.com/archive/2014-August/040113.html
So be careful what steps you take to chase down your problems. You may make them worse :)
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
participants (5)
-
cleaver@terabithia.org
-
everett.vernon@gmail.com
-
jlaidman@rebel-it.com.au
-
john.thurston@alaska.gov
-
solitaryr@gmail.com