RRD crashing high availability hobbit
Hi again all,
I need some help configuring/debugging why our hobbit servers are crashing (due to rrd, which I shall explain shortly) and how to get around this. We have 3 hobbit servers with proxies, however I will simplify this explanation with just 2 hobbits and no proxies (as we discovered the same thing happens).
Detail of theoretical setup:
- 2 datacentres. Each datacentre contains a single hobbit server instance.
- Each client reports to their local datacentre hobbit server.
- Each hobbit server is configured such that they know about the other hobbit (through BBDISPLAYS).
The issue is that for what looks like most server-side tests, such as vmstat, we are getting feedback loops between the hobbit servers.
For instance: A hobbit server in DC1 tests a client in DC1 using vmstat. The client reports back to the hobbit in DC1, and that hobbit then also reports this data to the hobbit in DC2. The hobbit in DC2, however, is configured to report to DC1 and so bounces the message back (I think). The server therefore tries to update the rrd twice within the same second, which results in errors. Eventually this will crash the server. An example of the rrd error messages:
2009-08-20 11:04:04 RRD error updating /export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762644 when last update time is 1250762644 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
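To check how many of these log entries are true same-second duplicates (attempted time equal to the last update time), the lines can be parsed mechanically. The following is a minimal sketch in C, assuming the log format shown in the excerpt; the struct and function names are made up for illustration and are not part of Hobbit.

```c
#include <stdio.h>
#include <string.h>

/* Fields pulled out of one "RRD error updating ..." log line.
 * Hypothetical type for illustration; not part of Hobbit. */
struct rrd_update_error {
    char path[256];   /* RRD file named in the message */
    long attempted;   /* timestamp of the rejected update */
    long last;        /* timestamp already stored in the RRD */
};

/* Parse one log line. Returns 1 and fills *out on success, 0 if the
 * line is not an "illegal attempt to update" message. */
int parse_rrd_update_error(const char *line, struct rrd_update_error *out)
{
    const char *p = strstr(line, "RRD error updating ");
    if (p == NULL) return 0;
    p += strlen("RRD error updating ");

    /* The path runs up to the next whitespace. */
    if (sscanf(p, "%255s", out->path) != 1) return 0;

    p = strstr(p, "illegal attempt to update using time ");
    if (p == NULL) return 0;
    if (sscanf(p,
               "illegal attempt to update using time %ld"
               " when last update time is %ld",
               &out->attempted, &out->last) != 2) return 0;
    return 1;
}

/* A same-second duplicate: the rejected timestamp equals the stored one. */
int is_same_second_duplicate(const struct rrd_update_error *e)
{
    return e->attempted == e->last;
}
```

Feeding each line of rrd-status.log through parse_rrd_update_error() and counting hits per path would show which RRD files receive the repeated updates.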
My question is - how can we stop this happening? Also, why is this happening? Is there a way we can disable rrd graphing on one server so just one hobbit server handles the graphing?
I hope that makes sense. If you need further clarification please let me know.
Cheers James
On Thursday, 20 August 2009 11:06:30 j.sansford at ntlworld.com wrote:
> Therefore the server tries to update the rrd twice within a second resulting in errors. Eventually this will crash the server.
How did you determine that this is what is "crashing" the server?
> An example of the rrd error messages:
> 2009-08-20 11:04:04 RRD error updating /export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762644 when last update time is 1250762644 (minimum one second step)
> [...]
I have a number of setups where messages like this are common, due to running network tests and SNMP polling at intervals smaller than 5 minutes (without adjusting all the RRD files to cater to this), and I have not seen hobbit "crash" due to this.
What is the behaviour you see when it "crashes the server"? Does hobbitd_rrd die and leave a status message? Or does something else occur? Does the server reboot? Does the OS hang? How often does this occur?
> My question is - how can we stop this happening?
You would first need to tell us what is happening ...
> Also, why is this happening? Is there a way we can disable rrd graphing on one server so just one hobbit server handles the graphing?
If hobbitd or hobbitd_rrd or some other process actually crashes, you should be able to get a core file, from which you can get a backtrace (e.g. with gdb), which would allow someone to see why it is crashing, and possibly fix it.
Regards, Buchan
Hi Buchan,
We get a core dump; running pstack on it gives the following info:
core 'core' of 11142: hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
fed28a17 _lwp_kill (1, 6) + 7
fecd1d63 raise (6) + 1f
fecb1bad abort (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
08060291 xstrdup (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
08054044 main (2, 804613c, 8046148) + 4dc
080539fc _start (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80
Note that as of 5.30pm today rrd-status.log is 127MB and full of errors, spanning 607625 lines (this is just for today; we roll the logs each night). This seems abnormally large to me, and I think eventually this is crashing the server.
Hope this helps. I will try and take a deeper look at the logs next time it happens...it seems to happen around once or twice a week.
Cheers James.
j.sansford at ntlworld.com wrote:
> We get a core dump, running a pstack gives the following info:
> [...]
> 08060291 xstrdup (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
> 0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
> 0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
> [...]
That looks like you are running the extratest for a netapp. From what I can see in hobbitd/do_rrd.c, this is what handles the xtstats column reported by netapp.pl. This is just from a cursory glance at the code; I don't use it myself. You really need to look at the C code to check it's doing the right thing. You have 2 choices: the quick fix is to disable just that test in netapp.pl; the other option is to work out what format it should be and fix the test.
In 4.2.3, for example, the do_devmon.c RRD code doesn't actually implement what is documented, and I use a perl script with --extra-script instead.
The various RRD handlers are in hobbitd/rrd/do_*.c. Looking at the code for xstrdup in lib/memory.c (below), you should check your logs: it's probably getting called with a NULL pointer (it's unlikely you're out of memory), and the logs should tell you which.
char *xstrdup(const char *s)
{
    char *result;

    if (s == NULL) {
        errprintf("xstrdup: Cannot dup NULL string\n");
        abort();
    }

    result = strdup(s);
    if (result == NULL) {
        errprintf("xstrdup: Out of memory\n");
        abort();
    }

#ifdef MEMORY_DEBUG
    add_to_memlist(result, strlen(result)+1);
#endif

    return result;
}
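The pstack frame "xstrdup (0, ...)" shows a first argument of 0, which appears consistent with the NULL branch above. One common way a NULL reaches such a wrapper is an unchecked result from a search or tokenising call; whether that is what happens in do_netapp_extratest_rrd is not established here. A minimal sketch of the defensive pattern, with hypothetical helper names that are not Hobbit code:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not Hobbit code): copy s, or return NULL if s is
 * NULL or memory runs out, leaving the decision to the caller instead
 * of calling abort(). */
char *dup_or_null(const char *s)
{
    size_t n;
    char *copy;

    if (s == NULL) return NULL;
    n = strlen(s) + 1;
    copy = malloc(n);
    if (copy != NULL) memcpy(copy, s, n);
    return copy;
}

/* Hypothetical parsing step: return a copy of the text after the first
 * ':' in line, or NULL when the line has no ':' at all. The strchr()
 * result is checked before it is used, so a malformed line is reported
 * to the caller rather than passed on as a NULL string. */
char *value_after_colon(const char *line)
{
    const char *colon = (line != NULL) ? strchr(line, ':') : NULL;
    return dup_or_null(colon != NULL ? colon + 1 : NULL);
}
```

A caller would skip (and ideally log) any record for which value_after_colon() returns NULL, and free() the copy when done.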
---- Buchan Milne <bgmilne at staff.telkomsa.net> wrote:
> I have a number of setups where messages like this are common, due to running network tests and SNMP polling at intervals smaller than 5 minutes (without adjusting all the RRD files to cater to this), and I have not seen hobbit "crash" due to this.
These kinds of messages can also be due to duplicate keys being used in RRD reporting. You need to look at how the RRD data is generated to get to the bottom of these. Sometimes the duplicates are in one test; sometimes multiple tests report the same thing, or report too frequently (such as your possible loops). It is unlikely this will crash hobbitd_rrd, though.
For example, I had this on MacOSX for ifstat. By default it uses 'netstat -ibn', which produces multiple lines for the same interface. I changed that in hobbitclient-darwin.sh to 'netstat -ibn | egrep -v "^lo|^vmnet|<Link"'. Note that I had to filter out the vmnet interfaces, since netstat -i limits the interface name to 5 chars and there are actually vmnet1 and vmnet8 :( Luckily I don't really care about those.
bash-3.2# netstat -ibn | egrep -v "^lo|^vmnet|<Link"
Name   Mtu    Network        Address                  Ipkts    Ierrs  Ibytes      Opkts     Oerrs  Obytes       Coll
en0    1500   fe80::21f:f    fe80:6::21f:f3ff:        7709215  -      2307390372  23616260  -      32787591390  -
en0    1500   10.1/16        10.1.75.6                7709215  -      2307390372  23616260  -      32787591390  -
en2    1500   fe80::201:2    fe80:9::201:23ff:        0        -      0           0         -      781938       -
en2    1500   10.37.129/24   10.37.129.2              0        -      0           0         -      781938       -
en3    1500   fe80::210:3    fe80:a::210:32ff:        0        -      0           0         -      792748       -
en3    1500   10.211.55/24   10.211.55.2              0        -      0           0         -      792748       -
bash-3.2# netstat -ibn
Name   Mtu    Network        Address                  Ipkts    Ierrs  Ibytes      Opkts     Oerrs  Obytes       Coll
lo0    16384  <Link#1>                                196623   0      20477947    196620    0      20477947     0
lo0    16384  fe80::1%lo0    fe80:1::1                196623   -      20477947    196620    -      20477947     -
lo0    16384  127            127.0.0.1                196623   -      20477947    196620    -      20477947     -
lo0    16384  ::1/128        ::1                      196623   -      20477947    196620    -      20477947     -
gif0*  1280   <Link#2>                                0        0      0           0         0      0            0
stf0*  1280   <Link#3>                                0        0      0           0         0      0            0
en1    1500   <Link#4>       00:1f:5b:c3:ec:35        0        0      0           0         0      0            0
fw0    4078   <Link#5>       00:1f:f3:ff:fe:71:5e:18  0        0      0           0         0      346          0
en0    1500   <Link#6>       00:1f:f3:5c:32:e6        7709242  0      2307393391  23616262  0      32787591586  0
en0    1500   fe80::21f:f    fe80:6::21f:f3ff:        7709242  -      2307393391  23616262  -      32787591586  -
en0    1500   10.1/16        10.1.75.6                7709242  -      2307393391  23616262  -      32787591586  -
vmnet  1500   <Link#7>       00:50:56:c0:00:08        0        0      0           0         0      0            0
vmnet  1500   192.168.149    192.168.149.1            0        -      0           0         -      0            -
vmnet  1500   <Link#8>       00:50:56:c0:00:01        0        0      0           0         0      0            0
vmnet  1500   172.16.189/24  172.16.189.1             0        -      0           0         -      0            -
en2    1500   <Link#9>       00:01:23:45:67:89        0        0      0           0         0      781938       0
en2    1500   fe80::201:2    fe80:9::201:23ff:        0        -      0           0         -      781938       -
en2    1500   10.37.129/24   10.37.129.2              0        -      0           0         -      781938       -
en3    1500   <Link#10>      00:10:32:54:76:98        0        0      0           0         0      792748       0
en3    1500   fe80::210:3    fe80:a::210:32ff:        0        -      0           0         -      792748       -
en3    1500   10.211.55/24   10.211.55.2              0        -      0           0         -      792748       -
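For anyone who wants the same filtering outside the shell, the egrep -v pattern used above can be expressed as a small predicate. This is a sketch only: keep_ifstat_line is a hypothetical name, and it assumes each netstat row arrives as a single line.

```c
#include <string.h>

/* Returns 1 if a netstat -ibn line should be kept, 0 if it should be
 * dropped. Mirrors: egrep -v "^lo|^vmnet|<Link"
 * Hypothetical helper for illustration; not part of the Hobbit client. */
int keep_ifstat_line(const char *line)
{
    if (strncmp(line, "lo", 2) == 0) return 0;     /* ^lo: loopback rows */
    if (strncmp(line, "vmnet", 5) == 0) return 0;  /* ^vmnet: truncated vmnet names */
    if (strstr(line, "<Link") != NULL) return 0;   /* link-layer rows, anywhere in the line */
    return 1;
}
```

As in the shell version, the header row passes through, and an interface with both an IPv6 and an IPv4 address still yields two rows.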
To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk
-- David Baldwin - IT Unit Australian Sports Commission www.ausport.gov.au Tel 02 62147830 Fax 02 62141830 PO Box 176 Belconnen ACT 2616 david.baldwin at ausport.gov.au Leverrier Street Bruce ACT 2617
On Friday, 21 August 2009 00:42:59 David Baldwin wrote:
> j.sansford at ntlworld.com wrote:
>> We get a core dump, running a pstack gives the following info:
>> [...]
>> 08060291 xstrdup (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
>> 0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
>> 0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
>> [...]
OK, so it crashed in do_netapp_extratest_rrd from hobbitd/rrd/do_netapp.c. I'm not familiar with pstack, but it looks like this may be from a stripped binary (or you may be able to get more information from pstack).
If pstack can't show the values, then you may want to consider running hobbitd_rrd with the --debug flag, which should result in some logging of what it has received just before it crashes.
> In 4.2.3 for example, the do_devmon.c RRD code doesn't actually implement what is documented
What is not implemented?
Where do you see this documented?
There is one fix that I have committed in svn (Xymon 4.2 branch, Xymon 4.3 branch, devmon svn). I am not aware of any other requests or bugs filed on the devmon rrd collector.
> and I use a perl script with --extra-script instead
Is this the one shipped with devmon, or would you like to contribute a better one?
xstrdup is called twice in do_netapp_extratest_rrd, but seeing the string that it's aborting on would help narrow it down. If you can provide the status message that made hobbitd_rrd crash (retrieve it using: bb localhost 'hobbitdlog hostname.testname') it can be used to reproduce this by someone trying to fix the bug.
>> Note that as of 5.30pm today the logs for rrd-status.log is 127MB full of errors, which span over 607625 lines (this is just for today, we roll the logs each night). This seems abnormally large to me and I think eventually this is crashing the server.
It is still unlikely that this has anything to do with hobbitd_rrd crashing.
Regards, Buchan
Hi,
I have a test that produces a line number that changes randomly, and the name of the field to graph sometimes changes too.
So I use NCV_dts=*:AVERAGE; to track it.
The rrd files are created correctly (DS and values), but I don't know how to configure the DEF definition in hobbitgraph to see it.
thanks,
Marco
Hi, I saw that the problem is in the creation of the rrd for xtstats for netapp filers. Can you check which version of the netapp.pl package you have installed? Have you applied the latest patch included in the hobbit_perl_client distribution to the hobbit server 4.2.3?
In the latest version of the Hobbit_perl_client (v 1.21) there was a correction in the netapp.pl code, and also a patch to be applied to a clean 4.2.3 that should solve a hobbitd_rrd crashing problem in the xtstats function, caused by the different kinds of data sent by different storage software versions.
If your hobbitd_rrd still crashes after applying the patch, can you run hobbitd_rrd with --debug as suggested and try to extract the xtstats data that makes the server crash (or send me the last 5-6 minutes of those logs), so I can analyze what the module is receiving and what is going wrong?
Thanks Francesco
participants (5)
- bgmilne@staff.telkomsa.net
- david.baldwin@ausport.gov.au
- fduranti@q8.it
- j.sansford@ntlworld.com
- marco.avvisano@regione.toscana.it