[hobbit] RRD crashing high availability hobbit

21 Aug 2009

      Hi,
I can see that the problem is in the creation of the RRD for xtstats for NetApp filers.
Can you check which version of the netapp.pl package you have installed? Have you applied the latest patch included in the hobbit_perl_client distribution to the Hobbit server 4.2.3?
The latest version of hobbit_perl_client (v1.21) contains a correction in the netapp.pl code and also a patch, to be applied to a clean 4.2.3, that should solve a hobbitd_rrd crash in the xtstats function caused by the different kinds of data sent by different storage software versions.
If your hobbitd_rrd still crashes after applying the patch, can you run hobbitd_rrd with the -debug option, as suggested, and try to extract the xtstats data that makes the server crash (or send me the last 5-6 minutes of those logs), so I can analyze what the module is receiving and what is going wrong?
Thanks
Francesco
-----Original Message-----
From: j.sansford at ntlworld.com [mailto:j.sansford at ntlworld.com]
Sent: Thursday, 20 August 2009 18:34
To: hobbit at hswn.dk; Buchan Milne
Subject: Re: [hobbit] RRD crashing high availability hobbit
Hi Buchan,
We get a core dump; running pstack on it gives the following info:
core 'core' of 11142:   hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
fed28a17 _lwp_kill (1, 6) + 7
fecd1d63 raise    (6) + 1f
fecb1bad abort    (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
08060291 xstrdup  (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
08054044 main     (2, 804613c, 8046148) + 4dc
080539fc _start   (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80
Note that as of 5.30pm today, rrd-status.log is 127MB and full of errors, spanning 607625 lines (this is just for today; we roll the logs each night). This seems abnormally large to me, and I think this is what eventually crashes the server.
Hope this helps. I will try and take a deeper look at the logs next time it happens...it seems to happen around once or twice a week.
Cheers
James.
---- Buchan Milne <bgmilne at staff.telkomsa.net> wrote:
...
On Thursday, 20 August 2009 11:06:30 j.sansford at ntlworld.com wrote:
...
Hi again all,
I need some help configuring/debugging why our hobbit servers are crashing
(due to rrd, which I shall explain shortly) and how to get around this. We
have 3 hobbit servers with proxies, however I will simplify this
explanation with just 2 hobbits and no proxies (as we discovered the same
thing happens).
Detail of theoretical setup:

2 datacentres. Each datacentre contains a single hobbit server instance.
Each client reports to their local datacentre hobbit server.
Each hobbit server is configured such that they know about the other
hobbit (through BBDISPLAYS).

The issue is that, for what looks like most server-side tests (vmstat
etc.), we are getting feedback loops between the hobbit servers.
For instance: a hobbit server in DC1 tests a client in DC1 using vmstat.
The client reports back to hobbit in DC1, and hobbit then also reports this
data to the hobbit in DC2. The hobbit in DC2, however, is configured to
report to DC1 and so bounces the message back (I think). The server
therefore tries to update the rrd twice within a second, resulting in
errors. Eventually this will crash the server.
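As a rough illustration of why a bounced copy of a report would be rejected, here is a minimal sketch of the "minimum one second step" rule. MiniRRD and apply_updates are hypothetical names made up for this example; they are not part of Hobbit or rrdtool, and the toy class only tracks the last update time.

```python
class MiniRRD:
    """Toy stand-in for one RRD file: it only remembers the last update time."""

    def __init__(self):
        self.last_update = 0

    def update(self, timestamp):
        # Each update must be strictly newer than the previous one,
        # i.e. at least one second later.
        if timestamp <= self.last_update:
            raise ValueError(
                f"illegal attempt to update using time {timestamp} "
                f"when last update time is {self.last_update}"
            )
        self.last_update = timestamp


def apply_updates(rrd, timestamps):
    """Apply updates in order and return (accepted, rejected) counts."""
    accepted = rejected = 0
    for ts in timestamps:
        try:
            rrd.update(ts)
            accepted += 1
        except ValueError:
            rejected += 1
    return accepted, rejected


# One original report plus one bounced copy carrying the same timestamp.
result = apply_updates(MiniRRD(), [1250762644, 1250762644])
print(result)
```

In this toy model the second update is simply rejected and counted; nothing in it crashes, which is one reason to look for a separate cause of the core dump.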
How did you determine that this is what is "crashing" the server?
...
An example of the rrd error messages:
2009-08-20 11:04:04 RRD error updating
/export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762644 when last update time is
1250762644 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
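With hundreds of thousands of such lines per day, it may help to count the errors per RRD file and per sender before reading the log by hand. The following is a minimal sketch that assumes the log contains messages laid out like the ones quoted above; the summarize function and the regular expression are illustrative only and were not taken from Hobbit itself.

```python
import re
from collections import Counter

# Pattern derived from the quoted messages. The sender is assumed to be a
# token without colons (e.g. an IPv4 address) followed by ":".
ERROR_RE = re.compile(
    r"RRD error updating (?P<path>\S+) from (?P<sender>[^\s:]+):"
)


def summarize(log_text):
    """Return (errors per RRD file, errors per sender) as Counters."""
    # Collapse all whitespace so that messages wrapped over several lines
    # (as in the quoted example) still match the pattern.
    flat = " ".join(log_text.split())
    by_file = Counter()
    by_sender = Counter()
    for match in ERROR_RE.finditer(flat):
        by_file[match.group("path")] += 1
        by_sender[match.group("sender")] += 1
    return by_file, by_sender


sample = """
2009-08-20 11:04:04 RRD error updating
/export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762644 when last update time is
1250762644 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
"""

files, senders = summarize(sample)
for path, count in files.most_common():
    print(count, path)
```

For a real 127MB log, the same function could be fed the file contents read with open(); reading the file in chunks would use less memory but is left out to keep the sketch short.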
I have a number of setups where messages like this are common, due to running
network tests and SNMP polling at intervals smaller than 5 minutes (without
adjusting all the RRD files to cater to this), and I have not seen hobbit
"crash" due to this.
What is the behaviour you see when it "crashes the server" ? Does hobbitd_rrd
die and leave a status message? Or, does something else occur? Does the server
reboot? Does the OS hang? How often does this occur?
...
My question is - how can we stop this happening?
You would first need to tell us what is happening ...
...
Also, why is this
happening? Is there a way we can disable rrd graphing on one server so just
one hobbit server handles the graphing?
I hope that makes sense. If you need further clarification please let me
know.
If hobbitd or hobbitd_rrd or some other process actually crashes, you should
be able to get a core file, from which you can get a backtrace (e.g. with gdb),
which would allow someone to see why it is crashing, and possibly fix it.
Regards,
Buchan


fduranti＠q8.it