On Mon, December 4, 2006 10:03 am, Henrik Stoerner wrote:
On Sun, Dec 03, 2006 at 12:10:03AM +0100, Henrik Stoerner wrote:
Besides, "fail over" means a lot of different things. For a true fail over setup, you'll need some hardware support on top of Hobbit - providing a virtual IP for your resilient hosts, and probably some sort of shared storage. Most of that is handled outside Hobbit.
So what exactly do you have in mind?
I'm replying to my own mail to pick up all the responses that have come about this.
Trever Noggle:
I would like to do what you can do with BB. The master and the backups will be on completely different networks [...] I want to have a main monitor server at location 1 monitoring devices at both locations. I then want location 2 to take over if location 1 goes down.
Anton Burkhalter:
I use two independent Hobbit servers; each client reports to both servers. The question is how to synchronize the two servers after an outage of a server.
Ralph Mitchell:
The thing that concerns me is that I can't be running the same checks from two servers at the same time. People around here get irritated when their webserver stats are artificially inflated.
Daniel J McDonald:
I'd like hobbit-alerts to only run on one box at a time. Displays and tests can all run independently.
For myself, I might add that having access to the historical data - both graphs and history logs - is also a requirement.
The simple "do it like BB does" approach is inadequate: it cannot keep the historical data up-to-date on both servers, and it also fails to carry over the alerts that are currently active. If the master server sent out an alert for something before it crashed, and the next alert should go out 12 hours later, then this repeat setting isn't transferred to the slave server. So when the master server drops off the net and the slave server takes over, it will immediately start sending out alerts for everything that is down. Not good.
The current state of a Hobbit server can easily be shared among servers. The checkpoint files that go into ~hobbit/server/tmp/ can be copied across to another server, and if you do that often enough then starting up Hobbit on the other server will pick up all of the current status. So that part is easy - for convenience I might want to implement some sort of internal Hobbit protocol for distributing the checkpoint files, but you can already today just use scp, rsync or similar to copy those files over.
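As a rough illustration of the "just use rsync" option, here is a small Python sketch that builds the rsync command line for pushing the checkpoint directory to a standby server. It is not part of Hobbit. The function name, the "hobbit" login, the host name and the absolute paths in the usage note are assumptions made up for the example; substitute wherever ~hobbit/server/tmp/ actually lives on your systems.

```python
def checkpoint_sync_command(local_dir, standby_host, remote_dir, user="hobbit"):
    """Build an rsync command that mirrors the checkpoint files to a standby.

    All argument values are supplied by the caller; nothing here is
    discovered from a Hobbit installation.
    """
    # A trailing slash on the source tells rsync to copy the directory's
    # contents rather than the directory itself.
    source = local_dir.rstrip("/") + "/"
    destination = f"{user}@{standby_host}:{remote_dir.rstrip('/')}/"
    # -a preserves timestamps and permissions; -e ssh selects the transport.
    return ["rsync", "-a", "-e", "ssh", source, destination]
```

The returned list can be handed to subprocess.run() from a cron job or a small loop, for example checkpoint_sync_command("/home/hobbit/server/tmp", "standby.example.com", "/home/hobbit/server/tmp"). How often to run it depends on how stale a status you can tolerate on the standby.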
The downside of this of course is that something has to recognize when the primary site is gone, and start up Hobbit on the secondary server. That is not very attractive; I would rather have Hobbit running on both servers all the time - this would require some work. But let's assume for now that this is possible.
Then there are the on-disk files: History logs and graphs. This has changed recently: it is now possible to distribute these over multiple servers - and also to have more than one site perform all of the updates of those files. So instead of periodically copying the files from a master server to a slave, you just copy them once and then mirror all of the updates to the relevant servers. The code for this is in the current snapshots; it isn't documented yet, and hasn't had much testing. I use it currently for another purpose: Load-sharing of the updates.
Finally there are the various Hobbit tasks: The display, the network tests, the alerts.
The display tasks are easily distributed to multiple servers. It is somewhat inconvenient that the overview webpages are built as static pages; I want to eliminate those and have all of the webpages generated dynamically. But the web display does not have to be on the same physical server as the rest of Hobbit, so doing failover for the web interface is relatively simple.
Alerts - the code is *almost* ready. It is based on the same principle as what is used for distributing the history- and RRD-files across multiple servers; the hobbitd_alert module runs on all of the servers - so it keeps track of the repeat times etc - but it only actually sends the alerts from one of the servers at any time.
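The principle described here (every server tracks the repeat times, only one actually sends) can be sketched in a few lines of Python. This is only an illustration of the idea and not the hobbitd_alert code; the class, its method names and the is_sender flag are all hypothetical.

```python
class AlertTracker:
    """Tracks alert repeat timers; only the designated sender sends."""

    def __init__(self, repeat_seconds, is_sender):
        self.repeat_seconds = repeat_seconds
        self.is_sender = is_sender      # True on exactly one server at a time
        self.next_due = {}              # (host, test) -> time the next alert is due
        self.sent = []                  # alerts this server actually sent

    def status_down(self, host, test, now):
        """Process a 'down' status; return True if this server sent an alert."""
        key = (host, test)
        due = self.next_due.get(key)
        if due is not None and now < due:
            return False                # still inside the repeat interval
        # Every server advances the timer, whether or not it sends.
        self.next_due[key] = now + self.repeat_seconds
        if self.is_sender:
            self.sent.append((host, test, now))
            return True
        return False

    def status_ok(self, host, test):
        """Recovery clears the timer so a new outage alerts immediately."""
        self.next_due.pop((host, test), None)
```

Because the standby has been advancing the same timers all along, flipping its is_sender flag to True after the master disappears does not trigger a burst of alerts; the next one goes out only when the 12-hour repeat would have fired anyway.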
Network tests - I've heard arguments going both ways as to whether one should run network tests on all servers ("it is interesting to see if the site is down when tested from all of our locations, or only from the primary location"), or on just a single server ("we want to minimize traffic from the monitoring systems towards the webservers"). I'm still thinking about how to handle this - if they run on all Hobbit servers, there has to be some way of choosing which test result should be used; if they run only on a single server I will probably use the same method for choosing which server runs the tests as I use to decide who gets to send out alerts.
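The message leaves the selection method open, so the following is purely one possible policy, written as a Python sketch under that assumption: keep an ordered preference list of servers and use the result from the first one that has reported. The function name, the server names and the colour strings in the usage note are hypothetical.

```python
def choose_result(results, preference):
    """Pick which server's test result to use.

    results:    dict mapping server name -> reported status
    preference: list of server names, most preferred first
    Returns (server, status), or None if no listed server has reported.
    """
    for server in preference:
        if server in results:
            return server, results[server]
    return None
```

For example, choose_result({"site2": "red"}, ["site1", "site2"]) falls through to site2 when site1 has not reported. The same first-available rule could also decide which single server runs the tests or sends the alerts, but that is an assumption and not something this message specifies.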
Regards, Henrik
For what I want to do, I do not care about history data; it would be nice but is not required. I simply want the box at the remote site to take over all of the display, testing and paging if the master server (or network) is down. This way I will still get alerted if there is a problem at site 1. And since site 2 will also be monitored by site 1, I will be alerted if there is a problem at either site.
It would be nice in the future to be able to have the historical data in sync on both boxes but that is not something that is important to me at this point.
-Trever