For some reason hobbitd_larrd has started crashing on my main production server. Larrd-status.log has messages like this in it:
2006-06-09 12:42:39 Worker process died with exit code 139, terminating
2006-06-09 12:54:42 2006-06-09 12:54:42 Worker process died with exit code 139, terminating
2006-06-09 12:54:42 Our child has failed and will not talk to us: Channel status, PID 18254
2006-06-09 12:54:42 Worker process died with exit code 139, terminating
2006-06-09 12:56:40 2006-06-09 12:56:40 Worker process died with exit code 139, terminating
2006-06-09 12:56:40 Our child has failed and will not talk to us: Channel status, PID 18803
2006-06-09 12:56:40 Worker process died with exit code 139, terminating
2006-06-09 12:56:48 Host 'enormous' reports vmstat for an unknown OS
2006-06-09 12:58:43 2006-06-09 12:58:43 Worker process died with exit code 139, terminating
2006-06-09 12:58:43 2006-06-09 12:58:43 Worker process died with exit code 139, terminating
Loading the core file into gdb and executing "backtrace" yields "No stack". Any ideas what's going on? I'm running Hobbit 4.1.2.rc1 on a RedHat ES3 box.
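For readers decoding the log: the number 139 points at signal 11 (SIGSEGV) under either common reading, which matches the `signum=11` seen in the backtraces later in the thread. The sketch below is only an illustration; both helper names are hypothetical, and which of the two encodings Hobbit actually logs is an assumption not confirmed in the thread.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical helper: read the number as a shell-style exit code,
 * where a value above 128 means "killed by signal (code - 128)". */
int signal_from_exit_code(int code)
{
    return (code > 128) ? code - 128 : 0;
}

/* Hypothetical helper: read the same number as a raw wait() status.
 * Returns the terminating signal, or 0 if the process was not signalled. */
int signal_from_wait_status(int status)
{
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux both helpers return 11 for the value 139, so the worker is being killed by a segmentation fault rather than exiting on its own.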
Thanks, Larry Barber
On Fri, Jun 09, 2006 at 01:30:39PM -0500, Larry Barber wrote:
For some reason hobbitd_larrd has started crashing on my main production server. Larrd-status.log has messages like this in it:
2006-06-09 12:42:39 Worker process died with exit code 139, terminating
Loading the core file into gdb and executing "backtrace" yields "No stack". Any ideas what's going on? I'm running Hobbit 4.1.2.rc1 on a RedHat ES3 box.
4.1.2-rc1 is pretty old (almost one year).
I can think of several problems that might cause this, but my first suggestion would be to at least upgrade to the 4.1.2p1 release that is the current production-release.
From a testing perspective I'd like you to try out the 4.2 beta release that went out early this week, but I fully understand if you would rather not run the beta-version on a production system.
Regards, Henrik
I loaded p1, and hobbitd_rrd is still dumping, the stack trace looks like:
#0 0x00dfe60a in do_lookup_versioned () from /lib/ld-linux.so.2
#1 0x00dfd776 in _dl_lookup_versioned_symbol_internal () from /lib/ld-linux.so.2
#2 0x00e01473 in fixup () from /lib/ld-linux.so.2
#3 0x00e01330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x08054c6d in sigsegv_handler (signum=11) at sig.c:51
#5 <signal handler called>
#6 0x00dfe3da in do_lookup () from /lib/ld-linux.so.2
#7 0x00dfd103 in _dl_lookup_symbol_internal () from /lib/ld-linux.so.2
#8 0x00e0140f in fixup () from /lib/ld-linux.so.2
#9 0x00e01330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#10 0x0804a91f in create_and_update_rrd (hostname=0xb755d037 "stellent_pre-prod_v-ip", fn=0x805f6e0 "tcp.http.https:,,pws.tc.sc.egov.usda.gov,siteminderagent,dmsforms,login_banner.fcc?TYPE=33554433&REALMOID=06-d38f4375-a8bd-4190-b6f9-3c77f0901647&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=$SM$hIspF3"..., creparams=0x805e5c0, template=0x9cf6b20 "sec") at do_rrd.c:143
#11 0x0804f294 in do_net_rrd (hostname=0xb755d037 "stellent_pre-prod_v-ip", testname=0xb755d04e "http", msg=0xb755d07c "status stellent_pre-prod_v-ip.http green Fri Jun 9 16:16:31 2006: OK ; OK\n\n&green https://pws.tc.sc.egov.usda.gov/siteminderagent/dmsforms/login_banner.fcc?TY..."..., tstamp=1149887818) at rrd/do_net.c:48
#12 0x0805024a in update_rrd (hostname=0xb755d037 "stellent_pre-prod_v-ip", testname=0xb755d04e "http", msg=0xb755d07c "status stellent_pre-prod_v-ip.http green Fri Jun 9 16:16:31 2006: OK ; OK\n\n&green https://pws.tc.sc.egov.usda.gov/siteminderagent/dmsforms/login_banner.fcc?TY..."..., tstamp=1149887818, sender=0x1ca3f <Address 0x1ca3f out of bounds>, ldef=0x1ca3f) at do_rrd.c:291
#13 0x08049cf0 in main (argc=117311, argv=0xbfff8324) at hobbitd_rrd.c:199
larrd-status.log looks like:
...
2006-06-09 15:45:24 Our child has failed and will not talk to us: Channel status, PID 22591
2006-06-09 15:45:24 Worker process died with exit code 139, terminating
2006-06-09 15:56:03 2006-06-09 15:56:03 Worker process died with exit code 139, terminating
2006-06-09 15:57:03 2006-06-09 15:57:03 Worker process died with exit code 139, terminating
2006-06-09 15:57:24 Worker process died with exit code 139, terminating
2006-06-09 15:57:24 Our child has failed and will not talk to us: Channel status, PID 25060
2006-06-09 15:57:24 Worker process died with exit code 139, terminating
2006-06-09 15:59:24 2006-06-09 15:59:24 Worker process died with exit code 139, terminating
2006-06-09 15:59:24 Worker process died with exit code 139, terminating
2006-06-09 16:09:26 Worker process died with exit code 139, terminating
2006-06-09 16:13:01 2006-06-09 16:13:01 Worker process died with exit code 139, terminating
2006-06-09 16:13:02 Worker process died with exit code 139, terminating
2006-06-09 16:14:56 2006-06-09 16:14:56 Worker process died with exit code 139, terminating
2006-06-09 16:14:57 Worker process died with exit code 139, terminating
2006-06-09 16:16:58 2006-06-09 16:16:58 Worker process died with exit code 139, terminating
2006-06-09 16:16:58 Worker process died with exit code 139, terminating
It just started doing this today, I can't think of anything that I have done that could cause it.
Thanks, Larry Barber
To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk
On Fri, Jun 09, 2006 at 04:21:56PM -0500, Larry Barber wrote:
I loaded p1, and hobbitd_rrd is still dumping, the stack trace looks like:
#5 <signal handler called>
#6 0x00dfe3da in do_lookup () from /lib/ld-linux.so.2
#7 0x00dfd103 in _dl_lookup_symbol_internal () from /lib/ld-linux.so.2
#8 0x00e0140f in fixup () from /lib/ld-linux.so.2
#9 0x00e01330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#10 0x0804a91f in create_and_update_rrd (hostname=0xb755d037 "stellent_pre-prod_v-ip", fn=0x805f6e0 "tcp.http.https:,,pws.tc.sc.egov.usda.gov,siteminderagent,dmsforms,login_banner.fcc?TYPE=33554433&REALMOID=06-d38f4375-a8bd-4190-b6f9-3c77f0901647&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=$SM$hIspF3"..., creparams=0x805e5c0, template=0x9cf6b20 "sec") at do_rrd.c:143
OK, the call trace looks sane so I think we can rule out simple memory corruption here.
The crash happens while printing an error message from the RRDtool library, during creation of a new RRD file for tracking an http test response time (it has just called the rrd_create() function, which returned an error, and Hobbit is trying to print out the error message when it crashes).
The filename looks somewhat suspicious. It is generated from the URL that is tested, and it is a very long filename beginning with "tcp.http.https:,,pws.tc.sc.egov.usda.gov,siteminderagent,dmsforms,login_banner.fcc?TYPE=". It's an http test for the host "stellent_pre-prod_v-ip".
My guess is that this filename is just too long. It *could* overflow the buffer set aside for the RRD filename - in that case, the attached patch against 4.1.2p1 should help.
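To make the length concern concrete, here is a minimal sketch of how such a name grows with the URL. It assumes only what the backtrace shows: a "tcp.http." prefix and every '/' in the URL replaced by ','. The function name `url_to_rrd_name` is hypothetical and this is not Hobbit's actual code; any file suffix Hobbit appends is left out because it is not visible in the thread.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: write "tcp.http.<url>" into out, replacing each '/'
 * with ','. Returns the length the full name needs (excluding the NUL),
 * or -1 on bad arguments. A return value >= outsz means the name was
 * truncated to fit the buffer.
 */
int url_to_rrd_name(const char *url, char *out, size_t outsz)
{
    int needed;
    char *p;

    if (url == NULL || out == NULL || outsz == 0) return -1;

    /* snprintf never writes more than outsz bytes, NUL included */
    needed = snprintf(out, outsz, "tcp.http.%s", url);
    if (needed < 0) return -1;

    for (p = out; *p != '\0'; p++) {
        if (*p == '/') *p = ',';
    }
    return needed;
}
```

Comparing the return value against the size of the destination buffer shows whether a given URL would have overflowed an unbounded copy.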
It just started doing this today, I can't think of anything that I have done that could cause it.
I think you just added this http test for "stellent_pre-prod_v-ip".
Regards, Henrik
On Fri, Jun 09, 2006 at 11:40:48PM +0200, Henrik Stoerner wrote:
My guess is that this filename is just too long. It *could* overflow the buffer set aside for the RRD filename - in that case, the attached patch against 4.1.2p1 should help.
Correction, there is one more place that is sensitive to the filename length. Please use this corrected patch instead of the one I sent earlier.
Regards, Henrik
No joy, it is still crashing, stack trace:
(gdb)
#0 0x0046260a in do_lookup_versioned () from /lib/ld-linux.so.2
#1 0x00461776 in _dl_lookup_versioned_symbol_internal () from /lib/ld-linux.so.2
#2 0x00465473 in fixup () from /lib/ld-linux.so.2
#3 0x00465330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x08054c79 in sigsegv_handler (signum=11) at sig.c:51
#5 <signal handler called>
#6 0x004623da in do_lookup () from /lib/ld-linux.so.2
#7 0x00461103 in _dl_lookup_symbol_internal () from /lib/ld-linux.so.2
#8 0x0046540f in fixup () from /lib/ld-linux.so.2
#9 0x00465330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#10 0x0804a92b in create_and_update_rrd (hostname=0x7 <Address 0x7 out of bounds>, fn=0x805f6e0 "tcp.http.https:,,pws.tc.sc.egov.usda.gov,siteminderagent,dmsforms,login_banner.fcc?TYPE=33554433&REALMOID=06-d38f4375-a8bd-4190-b6f9-3c77f0901647&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=$SM$hIspF3"..., creparams=0x805e5c0, template=0x93f7b20 "sec") at do_rrd.c:145
#11 0x0804f2a0 in do_net_rrd (hostname=0xb755f036 "stellent_pre-prod_v-ip", testname=0xb755f04d "http", msg=0xb755f07b "status stellent_pre-prod_v-ip.http green Fri Jun 9 16:53:40 2006: OK ; OK\n\n&green https://pws.tc.sc.egov.usda.gov/siteminderagent/dmsforms/login_banner.fcc?TY..."..., tstamp=1149890052) at rrd/do_net.c:48
#12 0x08050256 in update_rrd (hostname=0xb755f036 "stellent_pre-prod_v-ip", testname=0xb755f04d "http", msg=0xb755f07b "status stellent_pre-prod_v-ip.http green Fri Jun 9 16:53:40 2006: OK ; OK\n\n&green https://pws.tc.sc.egov.usda.gov/siteminderagent/dmsforms/login_banner.fcc?TY..."..., tstamp=1149890052, sender=0x1ca3f <Address 0x1ca3f out of bounds>, ldef=0x1ca3f) at do_rrd.c:293
#13 0x08049cf0 in main (argc=117311, argv=0xbfffab14) at hobbitd_rrd.c:199
I was looking at your patch, and it doesn't look to me like the new lines are doing the same thing as the old:
-   strcat(filedir, "/"); strcat(filedir, fn);
+   snprintf(filedir, sizeof(filedir)-1, "%s/%s/%s", rrddir, hostname, fn);
+   filedir[sizeof(filedir)-1] = '\0';
    creparams[1] = filedir; /* Icky */
It looks like the original line creates something like "filedir/fn" while the new lines create something like "filedir/hostname/fn". Is this right?
Thanks, Larry Barber
After applying the second patch, it's still crashing, stacktrace:
(gdb) backtrace
#0 0x00d8960a in do_lookup_versioned () from /lib/ld-linux.so.2
#1 0x00d88776 in _dl_lookup_versioned_symbol_internal () from /lib/ld-linux.so.2
#2 0x00d8c473 in fixup () from /lib/ld-linux.so.2
#3 0x00d8c330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x08054c89 in sigsegv_handler (signum=11) at sig.c:51
#5 <signal handler called>
#6 0x0039078b in strlen () from /lib/tls/libc.so.6
#7 0x0035e621 in vfprintf () from /lib/tls/libc.so.6
#8 0x0037fd24 in vsnprintf () from /lib/tls/libc.so.6
#9 0x08050fb3 in errprintf (fmt=0x8057cd8 "RRD error creating %s: %s\n") at errormsg.c:51
#10 0x0804a93a in create_and_update_rrd (hostname=0x7 <Address 0x7 out of bounds>, fn=0x805f6e0 "tcp.http.https:,,pws.sc.egov.usda.gov,siteminderagent,dmsforms,login_banner.fcc?TYPE=33554433&REALMOID=06-d3b2e2ae-78ac-495d-a153-09f36b6aa237&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=$SM$2z10ILc8e"..., creparams=0x805e5c0, template=0x9098e68 "sec") at do_rrd.c:145
#11 0x0804f2af in do_net_rrd (hostname=0xb7560037 "FS_PVHOST", testname=0xb7560041 "http", msg=0xb756006f "status FS_PVHOST.http green Fri Jun 9 17:11:00 2006: OK ; OK ; OK\n\n&green http://poc.fs.usda.gov/wps/portal - OK\n\nHTTP/1.1 200 OK\r\nDate: Fri, 09 Jun 2006 22:11:57 GMT\r\nServer: IBM_HTTP_Server/2.0.47."..., tstamp=1149891084) at rrd/do_net.c:50
#12 0x08050266 in update_rrd (hostname=0xb7560037 "FS_PVHOST", testname=0xb7560041 "http", msg=0xb756006f "status FS_PVHOST.http green Fri Jun 9 17:11:00 2006: OK ; OK ; OK\n\n&green http://poc.fs.usda.gov/wps/portal - OK\n\nHTTP/1.1 200 OK\r\nDate: Fri, 09 Jun 2006 22:11:57 GMT\r\nServer: IBM_HTTP_Server/2.0.47."..., tstamp=1149891084, sender=0x706a4266 <Address 0x706a4266 out of bounds>, ldef=0x706a4266) at do_rrd.c:293
#13 0x08049cf0 in main (argc=1886012006, argv=0xbfffbab4) at hobbitd_rrd.c:199
BTW, those ultra-long URLs have been in there for quite a while, several months anyway.
Thanks, Larry Barber
On Fri, Jun 09, 2006 at 05:01:55PM -0500, Larry Barber wrote:
No joy, it is still crashing, stack trace:
Does rrdtool work for you? Try running
    rrdtool create /foo.rrd DS:sec:GAUGE:600:0:U RRA:AVERAGE:0.5:1:576
Assuming you're not root, this should print out the message
    ERROR: creating '/foo.rrd': Permission denied
I was looking at your patch, and it doesn't look to me like the new lines are doing the same thing as the old:
-   strcat(filedir, "/"); strcat(filedir, fn);
+   snprintf(filedir, sizeof(filedir)-1, "%s/%s/%s", rrddir, hostname, fn);
+   filedir[sizeof(filedir)-1] = '\0';
    creparams[1] = filedir; /* Icky */
It looks like the original line creates something like "filedir/fn" while the new lines create something like "filedir/hostname/fn". Is this right?
It is. In the old version, "filedir" contained the rrd top-level directory + the hostname, e.g. "/hobbit/rrd/myhost", and then it added an extra "/" and the rrd filename.
The new version just uses snprintf() to output the top-level directory + the hostname-directory + the rrd filename in one go.
sprintf(filedir, "%s/%s", rrddir, hostname);
if (stat(filedir, &st) == -1) {
...
}
snprintf(filedir, sizeof(filedir)-1, "%s/%s/%s", rrddir, hostname, fn);
filedir[sizeof(filedir)-1] = '\0';
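As a side note on the snippet above: snprintf() reports how many characters the full result needs, so truncation can be detected instead of silently producing a shortened path. The sketch below is illustrative only; `build_rrd_path` is a hypothetical name, not a function in Hobbit, and it assumes the caller wants to reject over-long paths rather than create a file under a truncated name. Note also that the first sprintf() in the quoted code is itself unbounded.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: join rrddir/hostname/fn into buf.
 * Returns 0 on success, -1 if the result would not fit in bufsz bytes
 * (including the terminating NUL) or if snprintf reports an error.
 */
int build_rrd_path(char *buf, size_t bufsz,
                   const char *rrddir, const char *hostname, const char *fn)
{
    /* snprintf always NUL-terminates when bufsz > 0 */
    int n = snprintf(buf, bufsz, "%s/%s/%s", rrddir, hostname, fn);

    if (n < 0 || (size_t)n >= bufsz) {
        return -1; /* error, or path too long for the buffer */
    }
    return 0;
}
```

A caller could log the host and test name and skip the RRD update when this returns -1.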
Regards, Henrik
rrdtool performs as expected:
-bash-2.05b$ /usr/local/rrdtool-1.2.10/bin/rrdtool create /foo.rrd DS:sec:GAUGE:600:0:U RRA:AVERAGE:0.5:1:576
ERROR: creating '/foo.rrd': Permission denied
Thanks, Larry Barber
On Fri, Jun 09, 2006 at 05:23:21PM -0500, Larry Barber wrote:
rrdtool performs as expected:
-bash-2.05b$ /usr/local/rrdtool-1.2.10/bin/rrdtool create /foo.rrd DS:sec:GAUGE:600:0:U RRA:AVERAGE:0.5:1:576
ERROR: creating '/foo.rrd': Permission denied
OK, but I still think it is odd that it crashes while printing an RRDtool error message.
What happens if you use this patch on top of the one you already installed?
Henrik
Still crashing, stack trace:
(gdb)
#0 0x00f1260a in do_lookup_versioned () from /lib/ld-linux.so.2
#1 0x00f11776 in _dl_lookup_versioned_symbol_internal () from /lib/ld-linux.so.2
#2 0x00f15473 in fixup () from /lib/ld-linux.so.2
#3 0x00f15330 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x08054cad in sigsegv_handler (signum=11) at sig.c:51
#5 <signal handler called>
#6 0x00b8f657 in strlen () from /lib/csa/sse2/sse2_boost.so.1
#7 0x00a8ac19 in OK_BOD_strncpy () from /lib/csa/libcsa.so.6
#8 0x00a883c0 in strncpy () from /lib/csa/libcsa.so.6
#9 0x0804a93c in create_and_update_rrd (hostname=0x7 <Address 0x7 out of bounds>, fn=0x63 <Address 0x63 out of bounds>, creparams=0x805e5c0, template=0x8fa8340 "sec") at do_rrd.c:150
#10 0x0804f2d3 in do_net_rrd (hostname=0xb7560036 "FS_PVHOST", testname=0xb7560040 "http", msg=0xb756006e "status FS_PVHOST.http green Fri Jun 9 17:34:51 2006: OK ; OK ; OK\n\n&green http://poc.fs.usda.gov/wps/portal - OK\n\nHTTP/1.1 200 OK\r\nDate: Fri, 09 Jun 2006 22:35:15 GMT\r\nServer: IBM_HTTP_Server/2.0.47."..., tstamp=1149892521) at rrd/do_net.c:50
#11 0x0805028a in update_rrd (hostname=0xb7560036 "FS_PVHOST", testname=0xb7560040 "http", msg=0xb756006e "status FS_PVHOST.http green Fri Jun 9 17:34:51 2006: OK ; OK ; OK\n\n&green http://poc.fs.usda.gov/wps/portal - OK\n\nHTTP/1.1 200 OK\r\nDate: Fri, 09 Jun 2006 22:35:15 GMT\r\nServer: IBM_HTTP_Server/2.0.47."..., tstamp=1149892521, sender=0xffffffc0 <Address 0xffffffc0 out of bounds>, ldef=0xffffffc0) at do_rrd.c:301
#12 0x08049cf0 in main (argc=-64, argv=0xbfffb634) at hobbitd_rrd.c:199
Notice entry #9, it appears that something is munging up the hostname variable.
Thanks, Larry Barber
On Fri, Jun 09, 2006 at 05:37:45PM -0500, Larry Barber wrote:
Still crashing, stack trace:
#8 0x00a883c0 in strncpy () from /lib/csa/libcsa.so.6
#9 0x0804a93c in create_and_update_rrd (hostname=0x7 <Address 0x7 out of bounds>, fn=0x63 <Address 0x63 out of bounds>, creparams=0x805e5c0, template=0x8fa8340 "sec") at do_rrd.c:150
It still crashes while handling the data we get from the rrd_get_error() routine.
I had a look at the rrdtool sources, and this crash doesn't make sense. rrd_get_error() returns a static buffer, so it should not be able to crash.
Notice entry #9, it appears that something is munging up the hostname variable.
Most likely, it is just a memory scribble that hits part of the stack as a result of the real error.
It's too late for me to do more about it now, but I would like to take a closer look at this. If you could tar up the 4.1.2p1 build directory, including the hobbitd_rrd binary and the core file, and mail it to me, I'll have a look at it in the morning when I'm a bit more awake.
Regards, Henrik
On Fri, Jun 09, 2006 at 05:37:45PM -0500, Larry Barber wrote:
Still crashing, stack trace:
As a final (desperate) fix, change the code to avoid printing the error message. I.e., around line 143 in hobbitd/do_rrd.c, add a line after
    result = rrd_create(pcount, creparams);
with
    if (result != 0) return 1;
If that doesn't crash, then I'm really suspicious of your rrdtool library. What version is that, by the way?
Regards, Henrik
On Fri, Jun 09, 2006 at 05:37:45PM -0500, Larry Barber wrote:
Still crashing, stack trace:
I just remembered: Check the size of your Hobbit logfiles, especially the "rrd-status.log" file.
The current Hobbit versions do not have large file support, so if the log gets around 2 GB, printing anything to the logfile will cause an I/O error, and this has been seen to crash programs. That would explain why this happens when it tries to print an error message.
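For reference, without large file support the largest file offset a 32-bit program can address is 2^31 - 1 bytes. A minimal sketch of a check for a log file approaching that size follows; the helper names and the idea of a safety margin are assumptions for illustration, not part of Hobbit.

```c
#define _FILE_OFFSET_BITS 64   /* must precede all includes; lets stat() report sizes above 2 GB */
#include <sys/types.h>
#include <sys/stat.h>

#define LFS_LIMIT 2147483647LL /* 2^31 - 1 bytes */

/* Hypothetical helper: 1 if size is within margin bytes of the limit, else 0. */
int size_near_lfs_limit(long long size, long long margin)
{
    return (size >= LFS_LIMIT - margin) ? 1 : 0;
}

/* Hypothetical helper: same check for a file on disk; -1 if stat() fails. */
int log_near_lfs_limit(const char *path, long long margin)
{
    struct stat st;

    if (stat(path, &st) != 0) return -1;
    return size_near_lfs_limit((long long)st.st_size, margin);
}
```

Calling `log_near_lfs_limit()` on each file in the Hobbit log directory with a margin of a few megabytes would flag logs that are about to hit the limit.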
Regards, Henrik
The larrd-status.log file wasn't unduly large, they get rotated daily. I am using version 1.2.10 of rrdtool, although 1.0.48 is also installed on the machine.
Even with the (desperate) fix, it is still coring.
Where should I mail those files to? I assume you don't want them on the mailing list.
Thanks, Larry Barber
On Fri, Jun 09, 2006 at 06:26:11PM -0500, Larry Barber wrote:
The larrd-status.log file wasn't unduly large, they get rotated daily. I am using version 1.2.10 of rrdtool, although 1.0.48 is also installed on the machine.
Even with the (desperate) fix, it is still coring.
Where should I mail those files to? I assume you don't want them on the mailing list.
My direct mail address, henrik at hswn.dk
Regards, Henrik
participants (2)
- henrik@hswn.dk
- lebarber@gmail.com