On Thursday, 20 August 2009 11:06:30 j.sansford at ntlworld.com wrote:
Hi again all,
I need some help working out why our hobbit servers are crashing (due to rrd, which I shall explain shortly) and how to get around it. We have 3 hobbit servers with proxies, but I will simplify this explanation to just 2 hobbits and no proxies (we found that the same thing happens in that setup).
Detail of theoretical setup:
- 2 datacentres. Each datacentre contains a single hobbit server instance.
- Each client reports to their local datacentre hobbit server.
- Each hobbit server is configured to know about the other hobbit (through BBDISPLAYS).
The issue is that for what looks like most server side tests, such as vmstat, we are getting feedback loops between the hobbit servers.
For instance: a hobbit server in DC1 tests a client in DC1 using vmstat. The client reports back to the hobbit in DC1, and that hobbit then also reports this data to the hobbit in DC2. The hobbit in DC2, however, is configured to report to DC1 and so bounces the message back (I think). The server therefore tries to update the rrd twice within a second, resulting in errors. Eventually this will crash the server.
How did you determine that this is what is "crashing" the server?
An example of the rrd error messages:
2009-08-20 11:04:04 RRD error updating /export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762644 when last update time is 1250762644 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
I have a number of setups where messages like this are common, due to running network tests and SNMP polling at intervals smaller than 5 minutes (without adjusting all the RRD files to cater to this), and I have not seen hobbit "crash" due to this.
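If you want to see which senders and RRD files account for these messages (for example, to check whether they all arrive from the other hobbit server's address), a small script along these lines could tally them. This is only a rough sketch: the regular expression assumes the log lines look exactly like the samples quoted above, and the function name is my own, not anything shipped with hobbit.

```python
import re
import sys
from collections import Counter

# Matches the "RRD error updating ..." lines as quoted above.
# The format is assumed from those samples only.
RRD_ERROR = re.compile(
    r"RRD error updating (?P<path>\S+) from (?P<sender>\S+): "
    r"illegal attempt to update using time (?P<ts>\d+) "
    r"when last update time is (?P<last>\d+)"
)


def tally_rrd_errors(text):
    """Count duplicate-timestamp errors per (sender, rrd file) pair."""
    counts = Counter()
    for match in RRD_ERROR.finditer(text):
        counts[(match.group("sender"), match.group("path"))] += 1
    return counts


if __name__ == "__main__":
    # Read the log from stdin and print the noisiest pairs first.
    for (sender, path), n in tally_rrd_errors(sys.stdin.read()).most_common():
        print(f"{n:6d}  {sender}  {path}")
```

Feed it the log file these messages came from on standard input. If every entry carries the address of the other hobbit server, that would support the feedback-loop theory; if the senders are the clients themselves, the cause is more likely elsewhere.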
What is the behaviour you see when it "crashes the server"? Does hobbitd_rrd die and leave a status message? Or does something else occur? Does the server reboot? Does the OS hang? How often does this occur?
My question is - how can we stop this happening?
You would first need to tell us what is happening ...
Also, why is this happening? Is there a way we can disable rrd graphing on one server so just one hobbit server handles the graphing?
I hope that makes sense. If you need further clarification please let me know.
If hobbitd, hobbitd_rrd or some other process actually crashes, you should be able to get a core file and extract a backtrace from it (e.g. with gdb). That would allow someone to see why it is crashing, and possibly fix it.
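As a rough illustration of pulling a backtrace out of a core file without an interactive session, something like the following could work. It is a sketch only: it assumes gdb is installed and on the PATH, the helper names are hypothetical, and the binary and core file paths are whatever applies on your system (passed on the command line here).

```python
import subprocess
import sys


def gdb_backtrace_command(binary, core):
    """Build a gdb command line that prints backtraces for all threads, then exits."""
    return ["gdb", "--batch", "-ex", "thread apply all bt", binary, core]


def backtrace(binary, core):
    """Run gdb on the given binary and core file; return its combined output as text."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,  # gdb may exit non-zero; we still want whatever it printed
    )
    return result.stdout


if __name__ == "__main__":
    # Usage: python backtrace.py <path-to-crashed-binary> <path-to-core-file>
    print(backtrace(sys.argv[1], sys.argv[2]))
```

The output is what would be worth posting to the list, together with the hobbit version and platform.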
Regards, Buchan