Hi,
I'm running trunk, and got a red "hobbitd_rrd" column for 30 minutes last night; it went purple after the "hobbitd_rrd" status said "program crash - Fatal signal caught".
-> What could be causing this?
olivier
The hobbitd_rrd column is a way of making sure you notice that there's been a crash of a Hobbit program. You can remove it with bb 127.0.0.1 "drop YOURHOBBITSERVER hobbitd_rrd" like you would remove any other status column.
It would of course be nice to figure out *why* hobbitd_rrd crashed. It should leave a core-file in the ~hobbit/server/tmp/ directory, so if you could run it through gdb as described in http://www.xymon.com/hobbit/help/known-issues.html#bugreport it would help a lot.
Thanks, Henrik
OK, I understand this column now.
Here is the backtrace (I'm running Debian 4.0, 32-bit):
$ gdb bin/hobbitd_rrd tmp/core
GNU gdb 6.4.90-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /usr/lib/librrd.so.2...done.
Loaded symbols for /usr/lib/librrd.so.2
Reading symbols from /usr/lib/libpng12.so.0...done.
Loaded symbols for /usr/lib/libpng12.so.0
Reading symbols from /usr/lib/libpcre.so.3...done.
Loaded symbols for /usr/lib/libpcre.so.3
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/i686/cmov/libssl.so.0.9.8...done.
Loaded symbols for /usr/lib/i686/cmov/libssl.so.0.9.8
Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.8
Reading symbols from /lib/tls/i686/cmov/libc.so.6...done.
Loaded symbols for /lib/tls/i686/cmov/libc.so.6
Reading symbols from /usr/lib/libfreetype.so.6...done.
Loaded symbols for /usr/lib/libfreetype.so.6
Reading symbols from /usr/lib/libart_lgpl_2.so.2...done.
Loaded symbols for /usr/lib/libart_lgpl_2.so.2
Reading symbols from /lib/tls/i686/cmov/libm.so.6...done.
Loaded symbols for /lib/tls/i686/cmov/libm.so.6
Reading symbols from /lib/tls/i686/cmov/libdl.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libdl.so.2
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Core was generated by `hobbitd_rrd --rrddir=/data/hobbit/data/rrd --extra-script=/data/hobbit/server/e'.
Program terminated with signal 6, Aborted.
#0 0xb7f72410 in ?? ()
(gdb) bt
#0 0xb7f72410 in ?? ()
#1 0xbfbe576c in ?? ()
#2 0x00000006 in ?? ()
#3 0x000017c0 in ?? ()
#4 0xb7c42811 in raise () from /lib/tls/i686/cmov/libc.so.6
#5 0xb7c43fb9 in abort () from /lib/tls/i686/cmov/libc.so.6
#6 0x0806d9d1 in sigsegv_handler (signum=11) at sig.c:58
#7 0xb7f72420 in ?? ()
#8 0x0000000b in ?? ()
#9 0x00000033 in ?? ()
#10 0x00000000 in ?? ()
(gdb)
Olivier
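Editorial note: the backtrace above does carry one usable detail. The core reports signal 6 because frame #5 is abort(), but frame #6 shows the handler was entered with signum=11. The sketch below pulls the resolved frames out of such a backtrace and names both signals. The helper names are hypothetical, and the number-to-name table assumes Linux x86 signal numbering (6 = SIGABRT, 11 = SIGSEGV).

```python
import re

# Frames #3 to #6 copied from the backtrace in this thread.
BACKTRACE = """\
#3 0x000017c0 in ?? ()
#4 0xb7c42811 in raise () from /lib/tls/i686/cmov/libc.so.6
#5 0xb7c43fb9 in abort () from /lib/tls/i686/cmov/libc.so.6
#6 0x0806d9d1 in sigsegv_handler (signum=11) at sig.c:58
"""

# Assumption: Linux x86 signal numbers, as on the Debian host in the report.
SIGNAL_NAMES = {6: "SIGABRT", 11: "SIGSEGV"}

FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-f]+ in (\S+) \((.*?)\)")


def named_frames(text):
    """Return (frame number, function, arguments) for frames gdb resolved."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match and match.group(2) != "??":
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames


def handler_signal(frames):
    """Return the signum argument of the first frame that has one, else None."""
    for _, _, args in frames:
        match = re.search(r"signum=(\d+)", args)
        if match:
            return int(match.group(1))
    return None


frames = named_frames(BACKTRACE)
original = handler_signal(frames)
print("resolved frames:", [name for _, name, _ in frames])
print("signal seen by the handler:", original, SIGNAL_NAMES.get(original))
print("signal recorded in the core:", 6, SIGNAL_NAMES.get(6))
```

This only reads what gdb already printed; it does not recover the unresolved "??" frames, which is where the faulting code would be.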
Hello Henrik,
This is happening once or twice a day. Each time I seem to be losing 15 to 30 minutes of some graphs (I suppose this is because of the rrdcache module).
Is there anything I could do to help fix this?
Olivier
Yuck, not much help there - I was hoping there would be some more meaningful stuff in there.
Since this happens regularly, could you try running hobbitd_rrd for a while with debugging enabled ? Either restart it with the "--debug" option, or do a "killall -USR2 hobbitd_rrd" while it is running (the USR2 signal toggles debugging output on/off). The debug output goes to the normal hobbitd_rrd logfile (rrd-status.log or rrd-data.log).
Regards, Henrik
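Editorial note: the message above describes USR2 as a toggle, so sending it twice returns the program to its previous state. A minimal sketch of that pattern follows, assuming a Unix system. It illustrates the general technique only and is not Xymon's implementation; `DebugToggle` and `send_toggle` are hypothetical names.

```python
import os
import signal


class DebugToggle:
    """Signal handler object that flips a debug flag on every delivery."""

    def __init__(self):
        self.enabled = False

    def __call__(self, signum, frame):
        # Each SIGUSR2 inverts the current state: off -> on -> off ...
        self.enabled = not self.enabled


def send_toggle(pid):
    """Send SIGUSR2 to one process, comparable to `kill -USR2 <pid>`."""
    os.kill(pid, signal.SIGUSR2)


# Register the handler in this process (must be done from the main thread).
toggle = DebugToggle()
signal.signal(signal.SIGUSR2, toggle)
```

Calling `send_toggle(os.getpid())` once turns `toggle.enabled` on, and a second call turns it off again. Because the signal carries no state, it is worth checking the logfile after sending it to confirm which way the toggle went.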
Hi Henrik,
It happened again today at 17:00:22. Nothing new when doing a bt on the coredump. An extract of rrd-status.log from 16h55 to 17h05 is available at http://www.qalpit.com/~olivier/tmp/rrd-status.log.gz
Olivier
PS: hobbitd_rrd only crashes on the status channel (the hobbitd_rrd running on the data channel has never crashed).
OK, the interesting part is here when it crashes:

2009-01-19 17:00:22 hobbitd_rrd: Got message 181436 @@status#181436/cedratnet-bdd1|1232380822.602633|127.0.0.1||cedratnet-bdd1|mysql|1232398822|green||green|1231215890|0||0||1232380812|0|linuxmysql|unix/mysql
2009-01-19 17:00:22 startpos 342639, fillpos 378880, endpos 342991
2009-01-19 17:00:22 hobbitd_rrd: Got message 181437 @@status#181437/moniteur-ora2|1232380822.618847|10.12.0.67||moniteur-ora2|cpu|1255363113|blue||blue|1228751913|0||1255363113|Disabled by
2009-01-19 17:00:22 startpos 342995, fillpos 378880, endpos -1
2009-01-19 17:00:22 Peer at 0.0.0.0:0 failed: Broken pipe
2009-01-19 17:00:22 Peer not up, flushing message queue
2009-01-19 17:00:22 Opening file /data/hobbit/server/etc/hobbit-rrddefinitions.cfg
2009-01-19 17:00:22 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=528383
2009-01-19 17:00:22 hobbitd_rrd: Got message 181450 @@status#181450/nurun-etam-bdd1|1232380822.807004|127.0.0.1||nurun-etam-bdd1|mysql|1232398822|green||green|1231768476|0||0||1232380582|0|linuxmysql|unix/mysql
2009-01-19 17:00:22 startpos 17100, fillpos 19357, endpos 17846
2009-01-19 17:00:22 Opening file /data/hobbit/server/etc/bb-hosts

It appears to be a "mysql" status from either cedratnet-bdd1 or nurun-etam-bdd1 that causes the crash (I cannot tell exactly, because output buffering comes into play when there's a crash). It *could* also be the cpu-report from moniteur-ora2, but I doubt that - the cpu-status is tested a lot more than the mysql-status.

In fact, "mysql" isn't part of hobbitd_rrd by default. So is this something you've added ? Is it something that you generate graphs for ? Or is it just a status that hobbitd_rrd should ignore ?

Regards, Henrik
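Editorial note: to compare the candidate messages named above, it helps to split each "Got message" line into host, test and color. The sketch below does that for one line from the excerpt. The field positions (4 = host, 5 = test, 7 = color) are read off the log lines in this thread and are an assumption, not a documented format; `parse_status` is a hypothetical helper.

```python
# One "Got message" line from the debug log excerpt in this thread.
LOG_LINE = (
    "2009-01-19 17:00:22 hobbitd_rrd: Got message 181437 "
    "@@status#181437/moniteur-ora2|1232380822.618847|10.12.0.67||"
    "moniteur-ora2|cpu|1255363113|blue||blue|1228751913|0||"
    "1255363113|Disabled by"
)

MARKER = "@@status#"


def parse_status(line):
    """Return seq, host, test and color from a 'Got message' line, or None.

    Assumption: pipe-separated fields, with host at index 4, test at
    index 5 and color at index 7, as seen in the quoted log lines.
    """
    start = line.find(MARKER)
    if start < 0:
        return None
    fields = line[start:].split("|")
    if len(fields) < 8:
        return None
    # First field looks like "@@status#<seq>/<host>".
    seq = int(fields[0][len(MARKER):].split("/", 1)[0])
    return {"seq": seq, "host": fields[4], "test": fields[5], "color": fields[7]}


print(parse_status(LOG_LINE))
```

For the cpu report from moniteur-ora2 this yields the color "blue", which matches the "Disabled by" text at the end of that line.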
Henrik,

Here are 2 other extracts from crashes:

2009-01-20 16:36:06 hobbitd_rrd: Got message 517875 @@status#517875/sw01.courrierinternational|1232465766.838715|192.168.255.32||sw01.courrierinternational|if_load|1232467566|green||green|1225102669|0||0||0|0||network/switch-dedie
2009-01-20 16:36:06 startpos 162634, fillpos 166552, endpos -1
2009-01-20 16:36:06 Want msg 517876, startpos 162634, fillpos 166552, endpos -1, usedbytes=3918, bufleft=361831
2009-01-20 16:36:06 Want msg 517876, startpos 162634, fillpos 170333, endpos -1, usedbytes=7699, bufleft=358050
2009-01-20 16:36:06 hobbitd_rrd: Got message 517876 @@status#517876/sw01.ctoutvert|1232465766.838761|192.168.255.32||sw01.ctoutvert|memory|1234247285|blue||blue|1231828085|0||1234247285|Disabled by
2009-01-20 16:36:06 startpos 172884, fillpos 172884, endpos -1
2009-01-20 16:36:06 Peer at 0.0.0.0:0 failed: Broken pipe
2009-01-20 16:36:06 Peer not up, flushing message queue
2009-01-20 16:36:06 Opening file /data/hobbit/server/etc/hobbit-rrddefinitions.cfg
2009-01-20 16:36:06 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=528383
2009-01-20 16:36:06 hobbitd_rrd: Got message 517913 @@status#517913/sw01.excenteurofac|1232465766.929692|192.168.255.32||sw01.excenteurofac|if_err|1232467566|green||green|1231866461|0||0||0|0||network/switch-dedie

if_load and if_err are statuses from devmon that I do not graph using ncv/extra-test. memory is also generated by devmon, and is graphed by default in Xymon.

2009-01-22 17:14:20 hobbitd_rrd: Got message 343666 @@status#343666/logicimmo-netapp2|1232640859.848737|127.0.0.1||logicimmo-netapp2|disk|2147483647|blue||blue|1232479545|0||-1|Disabled by
2009-01-22 17:14:20 startpos 417512, fillpos 419047, endpos -1
2009-01-22 17:14:20 Peer at 0.0.0.0:0 failed: Broken pipe
2009-01-22 17:14:20 Peer not up, flushing message queue
2009-01-22 17:14:20 Opening file /data/hobbit/server/etc/hobbit-rrddefinitions.cfg
2009-01-22 17:14:20 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=528383
2009-01-22 17:14:20 hobbitd_rrd: Got message 343677 @@status#343677/tif-netapp1|1232640860.884630|127.0.0.1||tif-netapp1|disk|1232644460|green||green|1230710616|0||0||0|0|stockage|unix/infrasys/stockage
2009-01-22 17:14:20 startpos 1335, fillpos 3954, endpos 2589

disk is generated by netapp.pl (from hobbit-client-perl).

-> I noticed that in my 3 extracts, the last message logged before the crash is a disabled status. Could this be the problem? (I've checked 2 other crashes, and there again, the last message logged is a disabled status.)

I checked those 3 disabled statuses: the hosts are up and running (so normal statuses are sent to hobbitd). We have disabled them for migration purposes; the migrations might happen in a few days or weeks...

For your mysql question: yes, I do graph mysql using NCV:
NCV_mysql="Questions:DERIVE,Threadsconnected:GAUGE,*:NONE"

Olivier
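Editorial note: the check described in the message above (look at the last status logged before each crash) can be repeated over a whole logfile. The sketch below is one way to do it, assuming that a "failed: Broken pipe" line marks the crash and that host, test and color sit at pipe-separated positions 4, 5 and 7; both assumptions come from the excerpts in this thread. `last_status_before_failure` is a hypothetical helper. As noted earlier in the thread, output buffering means the last logged message is not necessarily the one being processed when the crash happened.

```python
# Abbreviated copy of the 2009-01-22 extract quoted in this thread.
SAMPLE_LOG = [
    "2009-01-22 17:14:20 hobbitd_rrd: Got message 343666 "
    "@@status#343666/logicimmo-netapp2|1232640859.848737|127.0.0.1||"
    "logicimmo-netapp2|disk|2147483647|blue||blue|1232479545|0||-1|Disabled by",
    "2009-01-22 17:14:20 startpos 417512, fillpos 419047, endpos -1",
    "2009-01-22 17:14:20 Peer at 0.0.0.0:0 failed: Broken pipe",
    "2009-01-22 17:14:20 Peer not up, flushing message queue",
    "2009-01-22 17:14:20 hobbitd_rrd: Got message 343677 "
    "@@status#343677/tif-netapp1|1232640860.884630|127.0.0.1||"
    "tif-netapp1|disk|1232644460|green||green|1230710616|0||0||0|0|"
    "stockage|unix/infrasys/stockage",
]


def last_status_before_failure(lines):
    """Return one (host, test, color) tuple per 'Broken pipe' line.

    Each tuple describes the last status message logged before that failure.
    """
    results = []
    last = None
    for line in lines:
        if "@@status#" in line:
            fields = line[line.index("@@status#"):].split("|")
            if len(fields) >= 8:
                last = (fields[4], fields[5], fields[7])
        elif "failed: Broken pipe" in line and last is not None:
            results.append(last)
            last = None
    return results


for host, test, color in last_status_before_failure(SAMPLE_LOG):
    note = " (disabled)" if color == "blue" else ""
    print(f"{host} {test} {color}{note}")
```

If most crashes are preceded by a blue status, that supports the disabled-status theory without proving it.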
I enabled all disabled statuses that are used for graphing (cpu, disk, procs, ...) and I have not had a single crash in the last 36 hours (before, crashes would happen at least twice per day).

-> From my user point of view, it looks like disabled statuses can crash hobbitd_rrd.

Olivier
Hi Olivier,
just one quick question: What version of the RRDtool library are you using ?
I just had a rather nasty problem which looks like a problem with the RRD library on Debian (1.2.15); upgrading to 1.2.30 - the latest 1.2.x version - made the problem disappear. This was causing random crashes of hobbitd_rrd ...
Regards, Henrik
participants (2): henrik@hswn.dk, obeau79@gmail.com