how to learn what is crashing my rrd handler?
I'm having difficulty with my RRD handlers crashing and leaving gaps in my databases.
I mentioned this back in April, 2014 but received no responses: http://lists.xymon.com/pipermail/xymon/2014-April/039547.html
Since that time, I stumbled across a client which was occasionally sending me empty messages of the "data" type. When I stopped that behavior, the behavior was corrected. The behavior has returned and I am not able to figure out how to find the cause.
I _suspect_ it is another client sending me empty messages, but how do I find it now that I have several hundred clients sending "data" messages?
Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
On Tue, August 26, 2014 11:04 am, John Thurston wrote:
I'm having difficulty with my RRD handlers crashing and leaving gaps in my databases.
I mentioned this back in April, 2014 but received no responses: http://lists.xymon.com/pipermail/xymon/2014-April/039547.html
Since that time, I stumbled across a client which was occasionally sending me empty messages of the "data" type. When I stopped that behavior, the behavior was corrected. The behavior has returned and I am not able to figure out how to find the cause.
I _suspect_ it is another client sending me empty messages, but how do I find it now that I have several hundred clients sending "data" messages?
As an initial step, run xymond_rrd in --debug mode... You can send the pid a -USR2 signal to toggle this setting without bouncing the process itself (be sure you're signalling xymond_rrd and not its xymond_channel parent).
Ideally that can help you see what exactly it was processing before it went down. If it's a full on crash, a backtrace from a core file would be helpful for debugging after a segfault or something similar.
HTH, -jc
On 8/26/2014 11:16 AM, J.C. Cleaver wrote:
> On Tue, August 26, 2014 11:04 am, John Thurston wrote:
>> I'm having difficulty with my RRD handlers crashing and leaving gaps in
>> my databases.
>>
>> I mentioned this back in April, 2014 but received no responses:
>> http://lists.xymon.com/pipermail/xymon/2014-April/039547.html
- snip -
>> I _suspect_ it is another client sending me empty messages, but how do I
>> find it now that I have several hundred clients sending "data" messages?
>> --
> As an initial step, run xymond_rrd in --debug mode... You can send the pid
> a -USR2 signal to toggle this setting without bouncing the process itself
> (be sure you're signalling xymond_rrd and not its xymond_channel parent).
Ahh. I figured out what I was doing wrong. I was placing --debug in the
wrong place on the line. When I put it at the end, I get debug output
from the 'data' handler rather than the parent.
When I do so, the log contains:
> 2014-08-27 09:02:07 Peer not up, flushing message queue
> 2014-08-27 09:02:08 Peer not up, flushing message queue
> 2014-08-27 09:02:09 Peer not up, flushing message queue
> 2014-08-27 09:02:10 Peer not up, flushing message queue
> 2014-08-27 09:02:10 Peer not up, flushing message queue
> 2014-08-27 09:02:16 Peer not up, flushing message queue
> 2014-08-27 09:02:17 Peer not up, flushing message queue
> 4595 2014-08-27 09:02:19 Opening file /opt/xymon/server/etc/rrddefinitions.cfg
> 4595 2014-08-27 09:02:19 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=528383
> 4595 2014-08-27 09:02:19 Got 230 bytes
> 4595 2014-08-27 09:02:19 xymond_rrd: Got message 2103 @@data#2103/soapsgdc02.soa.alaska.gov|1409158937.159587|10.210.36.22||soapsgdc02.soa.alaska.gov|trends||ETS/MsgDir
> 4595 2014-08-27 09:02:19 startpos 230, fillpos 230, endpos -1
> 4595 2014-08-27 09:02:19 Transport setup is:
> 4595 2014-08-27 09:02:19 xymondportnumber = 1984
> 4595 2014-08-27 09:02:19 xymonproxyhost = NONE
> 4595 2014-08-27 09:02:19 xymonproxyport = 0
> 4595 2014-08-27 09:02:19 Recipient listed as '146.63.81.42'
> 4595 2014-08-27 09:02:19 Standard protocol on port 1984
> 4595 2014-08-27 09:02:19 Will connect to address 146.63.81.42 port 1984
> 4595 2014-08-27 09:02:19 Connect status is 0
> 4595 2014-08-27 09:02:19 Sent 16 bytes
> 4595 2014-08-27 09:02:19 Read 8192 bytes
> 4595 2014-08-27 09:02:19 Read 32767 bytes
> 4595 2014-08-27 09:02:19 Read 28440 bytes
> 4595 2014-08-27 09:02:19 Closing connection
> 4595 2014-08-27 09:02:19 Opening file /opt/xymon/server/etc/analysis.cfg
> 4595 2014-08-27 09:02:19 Opening file /opt/xymon/server/etc/protocols.cfg
> 4595 2014-08-27 09:02:19 Opening file /opt/xymon/server/etc/protocols.d/soaprotocols.cfg
> 4595 2014-08-27 09:02:19 Service list dump
> 4595 2014-08-27 09:02:19 Name : ftp
> 4595 2014-08-27 09:02:19 Sendtext: quit\r\n
> 4595 2014-08-27 09:02:19 Sendlen : 6
> 4595 2014-08-27 09:02:19 Exp.text: 220
> 4595 2014-08-27 09:02:19 Exp.len : 3
> 4595 2014-08-27 09:02:19 Exp.ofs : 0
> 4595 2014-08-27 09:02:19 Flags : 1
> 4595 2014-08-27 09:02:19 Port : 21
> 4595 2014-08-27 09:02:19 Name : ftps
> 4595 2014-08-27 09:02:19 Sendtext: quit\r\n
> 4595 2014-08-27 09:02:19 Sendlen : 6
> 4595 2014-08-27 09:02:19 Exp.text: 220
> 4595 2014-08-27 09:02:19 Exp.len : 3
> 4595 2014-08-27 09:02:19 Exp.ofs : 0
> 4595 2014-08-27 09:02:19 Flags : 5
> 4595 2014-08-27 09:02:19 Port : 990
> 4595 2014-08-27 09:02:19 Name : ssh
> 4595 2014-08-27 09:02:19 Sendtext: SSH-2.0-OpenSSH_4.1\r\n
> 4595 2014-08-27 09:02:19 Sendlen : 21
> 4595 2014-08-27 09:02:19 Exp.text: SSH
> 4595 2014-08-27 09:02:19 Exp.len : 3
> 4595 2014-08-27 09:02:19 Exp.ofs : 0
> 4595 2014-08-27 09:02:19 Flags : 1
> 4595 2014-08-27 09:02:19 Port : 22
> 4595 2014-08-27 09:02:19 Name : ssh1
> 4595 2014-08-27 09:02:19 Sendtext: SSH-2.0-OpenSSH_4.1\r\n
> 4595 2014-08-27 09:02:19 Sendlen : 21
> 4595 2014-08-27 09:02:19 Exp.text: SSH
> 4595 2014-08-27 09:02:19 Exp.len : 3
> 4595 2014-08-27 09:02:19 Exp.ofs : 0
> 4595 2014-08-27 09:02:19 Flags : 1
> 4595 2014-08-27 09:02:19 Port : 22
> 4595 2014-08-27 09:02:19 Name : ssh2
> 4595 2014-08-27 09:02:19 Sendtext: SSH-2.0-OpenSSH_4.1\r\n
> 4595 2014-08-27 09:02:19 Sendlen : 21
> 4595 2014-08-27 09:02:19 Exp.text: SSH
> 4595 2014-08-27 09:02:19 Exp.len : 3
> 4595 2014-08-27 09:02:19 Exp.ofs : 0
> 4595 2014-08-27 09:02:19 Flags : 1
> 4595 2014-08-27 09:02:19 Port : 22
> 4595 2014-08-27 09:02:19 Name : telnet
> 4595 2014-08-27 09:02:19 2014-08-27 09:02:21 Child process 4595 died: Signal 6
> 2014-08-27 09:02:24 Peer at 0.0.0.0:0 failed: Broken pipe
> 2014-08-27 09:02:24 Peer not up, flushing message queue
> 2014-08-27 09:02:24 Peer not up, flushing message queue
and the stack from the core file (using pstack)
> fee5ebd4 _lwp_kill (6, 0, 0, fee3e0f0, ffffffff, 6) + 8
> fedd29f0 abort (0, 1, 6666c, ffb04, feed5518, 0) + 110
> 0003bb94 sigsegv_handler (b, 0, ffbfb588, 1, 0, 544a8) + 30
> fee5b00c __sighndlr (b, 0, ffbfb588, 3bb64, 0, 1) + c
> fee4f6bc call_user_handler (b, 0, 0, 0, fed32a00, ffbfb588) + 3b8
> fee4f8a4 sigacthandler (b, 0, ffbfb588, 20, 0, 0) + 60
> --- called from signal handler with signal 11 (SIGSEGV) ---
> fedc2d50 strlen (53e37, ffbfc804, ffbfbdf9, 0, 0, 0) + 50
> fee319d4 vfprintf (71990, 53e28, ffbfc800, 0, a0afc, fee314d4) + ec
> 0002f24c dbgprintf (53e28, 0, e9768, 6f800, 71990, 6d000) + a0
> 0003347c dump_tcp_services (53e88, 53ea0, 53eb8, c0, a0, 67f98) + a0
> 00033d70 init_tcp_services (168a78, 620, 67f98, 54060, 600, 168430) + 848
> 0002f858 rrd_setup (98906, 6d000, 6d000, 80808080, 6d000, 0) + 164
> 0002fc4c find_xymon_rrd (988f4, 492e8, 53fe08c6, 53fe08c6, 988c2, 2e) + 4
> 00048cb0 main (98907, ffbfdba4, 988fc, 68800, 3, 49528) + 728
> 00015d2c _start (0, 0, 0, 0, 0, 0) + 5c
Which, if I'm reading it correctly, makes me think the application tried
to read off the end of a string.
--
Do things because you should, not just because you can.
John Thurston 907-465-8591
John.Thurston at alaska.gov
Enterprise Technology Services
Department of Administration
State of Alaska
On 8/27/2014 9:27 AM, John Thurston wrote:
> On 8/26/2014 11:16 AM, J.C. Cleaver wrote:
>> On Tue, August 26, 2014 11:04 am, John Thurston wrote:
>>> I'm having difficulty with my RRD handlers crashing and leaving gaps in
>>> my databases.
>>>
>>> I mentioned this back in April, 2014 but received no responses:
>>> http://lists.xymon.com/pipermail/xymon/2014-April/039547.html
> - snip -
>>> I _suspect_ it is another client sending me empty messages, but how do I
>>> find it now that I have several hundred clients sending "data" messages?
>>> --
>
>> As an initial step, run xymond_rrd in --debug mode... You can send the
>> pid
>> a -USR2 signal to toggle this setting without bouncing the process itself
>> (be sure you're signalling xymond_rrd and not its xymond_channel parent).
>
> Ahh. I figured out what I was doing wrong. I was placing --debug in the
> wrong place on the line. When I put it at the end, I get debug output
> from the 'data' handler rather than the parent.
>
> When I do so, the log contains:
>
>> 2014-08-27 09:02:07 Peer not up, flushing message queue
>> 2014-08-27 09:02:08 Peer not up, flushing message queue
>> 2014-08-27 09:02:09 Peer not up, flushing message queue
>> 2014-08-27 09:02:10 Peer not up, flushing message queue
>> 2014-08-27 09:02:10 Peer not up, flushing message queue
>> 2014-08-27 09:02:16 Peer not up, flushing message queue
>> 2014-08-27 09:02:17 Peer not up, flushing message queue
>> 4595 2014-08-27 09:02:19 Opening file
>> /opt/xymon/server/etc/rrddefinitions.cfg
>> 4595 2014-08-27 09:02:19 Want msg 1, startpos 0, fillpos 0, endpos -1,
>> usedbytes=0, bufleft=528383
>> 4595 2014-08-27 09:02:19 Got 230 bytes
>> 4595 2014-08-27 09:02:19 xymond_rrd: Got message 2103
>> @@data#2103/soapsgdc02.soa.alaska.gov|1409158937.159587|10.210.36.22||soapsgdc02.soa.alaska.gov|trends||ETS/MsgDir
>>
- snip -
>> 4595 2014-08-27 09:02:19 Exp.len : 3
>> 4595 2014-08-27 09:02:19 Exp.ofs : 0
>> 4595 2014-08-27 09:02:19 Flags : 1
>> 4595 2014-08-27 09:02:19 Port : 22
>> 4595 2014-08-27 09:02:19 Name : telnet
>> 4595 2014-08-27 09:02:19 2014-08-27 09:02:21 Child process 4595 died:
>> Signal 6
>> 2014-08-27 09:02:24 Peer at 0.0.0.0:0 failed: Broken pipe
>> 2014-08-27 09:02:24 Peer not up, flushing message queue
>> 2014-08-27 09:02:24 Peer not up, flushing message queue
>
> and the stack from the core file (using pstack)
>> fee5ebd4 _lwp_kill (6, 0, 0, fee3e0f0, ffffffff, 6) + 8
>> fedd29f0 abort (0, 1, 6666c, ffb04, feed5518, 0) + 110
>> 0003bb94 sigsegv_handler (b, 0, ffbfb588, 1, 0, 544a8) + 30
>> fee5b00c __sighndlr (b, 0, ffbfb588, 3bb64, 0, 1) + c
>> fee4f6bc call_user_handler (b, 0, 0, 0, fed32a00, ffbfb588) + 3b8
>> fee4f8a4 sigacthandler (b, 0, ffbfb588, 20, 0, 0) + 60
>> --- called from signal handler with signal 11 (SIGSEGV) ---
>> fedc2d50 strlen (53e37, ffbfc804, ffbfbdf9, 0, 0, 0) + 50
>> fee319d4 vfprintf (71990, 53e28, ffbfc800, 0, a0afc, fee314d4) + ec
>> 0002f24c dbgprintf (53e28, 0, e9768, 6f800, 71990, 6d000) + a0
>> 0003347c dump_tcp_services (53e88, 53ea0, 53eb8, c0, a0, 67f98) + a0
>> 00033d70 init_tcp_services (168a78, 620, 67f98, 54060, 600, 168430) +
>> 848
>> 0002f858 rrd_setup (98906, 6d000, 6d000, 80808080, 6d000, 0) + 164
>> 0002fc4c find_xymon_rrd (988f4, 492e8, 53fe08c6, 53fe08c6, 988c2, 2e)
>> + 4
>> 00048cb0 main (98907, ffbfdba4, 988fc, 68800, 3, 49528) + 728
>> 00015d2c _start (0, 0, 0, 0, 0, 0) + 5c
>
> Which, if I'm reading it correctly, makes me think the application tried
> to read off the end of a string.
And I think the thing which ran off the end of the string was the debug
process :p I'm still looking at source (and my C is very, very bad), but
I suspect the debug print process is choking on the absence of an
attribute in the procols.cfg
When I turn debug off, my rrd handlers are working much better.
--
Do things because you should, not just because you can.
John Thurston 907-465-8591
John.Thurston at alaska.gov
Enterprise Technology Services
Department of Administration
State of Alaska
On 28 August 2014 05:10, John Thurston <john.thurston at alaska.gov> wrote:
And I think the thing which ran off the end of the string was the debug process :p I'm still looking at source (and my C is very, very bad), but I suspect the debug print process is choking on the absence of an attribute in the procols.cfg
The "Sendtext" line has only the date string, so that's probably the data that caused the crash. I'm guessing you haven't re-defined the protocols.cfg entry for [telnet], so I'm struggling to imagine where the bad data could have come from. But perhaps worth removing the "telnet" test for that host and see if the crashing stops.
J
On 8/27/2014 4:08 PM, Jeremy Laidman wrote:
On 28 August 2014 05:10, John Thurston <john.thurston at alaska.gov <mailto:john.thurston at alaska.gov>> wrote:
And I think the thing which ran off the end of the string was the debug process :p I'm still looking at source (and my C is very, very bad), but I suspect the debug print process is choking on the absence of an attribute in the procols.cfgThe "Sendtext" line has only the date string, so that's probably the data that caused the crash. I'm guessing you haven't re-defined the protocols.cfg entry for [telnet], so I'm struggling to imagine where the bad data could have come from. But perhaps worth removing the "telnet" test for that host and see if the crashing stops.
The telnet test is defined for only one host (out of 600) in my hosts.cfg, and it is never the host identified in the remnants of the crash.
It looks like (with 4.3.17), in debug mode, lines 324-340 of netservices.c dump the currently defined service list. They correctly send the contents of the service list for: ftp ftps ssh ssh1 ssh2 and they print the name for: telent and then they die when they are asked to print the Sendtext for "telnet"
My protocols.cfg contains:
[ssh|ssh1|ssh2] send "SSH-2.0-OpenSSH_4.1\r\n" expect "SSH" options banner port 22
[telnet] options banner,telnet port 23
There is no "send" attribute defined. I don't think there has ever been such an attribute defined. The files in the source directories (~/xymon-4.3.17/xymonnet/protocols.cfg) do not contain a "send" attribute. The "send" attribute is defined for most, but not all of the other protocols in the file.
Has anyone running 4.3.17 successfully run their rrd_data handler in debug mode? I don't think it can do it.
This looks to me like a code defect, wherein the dump_tcp_services function is not checking for a null string prior to finding its length and passing the information to dbgprintf. I'm just a lowly sysadmin whose last exposure to C was more than 25 years ago. Can someone with some C experience confirm or deny my hypothesis?
-- Do things because you should, not just because you can.
John Thurston 907-465-8591 John.Thurston at alaska.gov Enterprise Technology Services Department of Administration State of Alaska
participants (3)
-
cleaver@terabithia.org
-
jlaidman@rebel-it.com.au
-
john.thurston@alaska.gov