Loadbalancing Hobbit Server
Hi
I would like to implement hobbit collecting data from a lot of hosts. Considering that, is there a way to load balance or distribute the monitoring across multiple hobbit servers?
- Ram
On Mon, Feb 12, 2007 at 02:02:16PM -0500, Prasad, Ram (GE, Corporate, consultant) wrote:
> I would like to implement hobbit collecting data from a lot of hosts. Considering that, is there a way to load balance or distribute the monitoring across multiple hobbit servers?
Not in the current (4.2.0) version, but some work has been done for the next release. Hobbit *is* designed to have only one server that has full knowledge of the current status of each system being monitored, but you can distribute various components of Hobbit onto other servers:
- Network tests can be distributed among multiple servers - typically you will have a network test server that handles a specific part of your network topology. E.g. one network test server for the DMZ systems, and another for your internal-only servers. This is available in 4.2.0.
- History logs and RRD files - i.e. the Hobbit modules that need to store data on disk - can be distributed among multiple servers. Hobbit will automatically send updates to the correct server, and fetch data from the server holding it when generating webpages. This also applies to the client-data logs that are stored when a critical event occurs. (4.3.0)
- Client data analysis can be performed on any server. Hobbit can feed the client data to any server capable of doing this. (4.3.0)
BTW, the stuff listed as present in the next (4.3.0) release is already in the current snapshot, and at least the RRD load balancing is in production use now.
But before you start planning to deploy large farms of Hobbit servers to handle all of your hosts, do a trial installation and see what the load on your Hobbit servers will be. At work, I have single servers handling over 3000 hosts - and a 5 year old Hobbit server handles it just fine. The only task we've split off is the RRD updates, since the disk system on our Hobbit server couldn't handle the flood of RRD file updates (we have about 25000 RRD files being updated every 5 minutes).
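For scale, the RRD figure above can be turned into an average update rate. This is plain arithmetic on the numbers quoted (25000 files, 5-minute interval), assuming the updates are spread evenly over the interval; it says nothing about the disk I/O each individual update costs.

```python
# Average RRD update rate implied by the figures quoted above.
rrd_files = 25_000            # RRD files updated per cycle (from the message)
interval_seconds = 5 * 60     # one update cycle every 5 minutes

updates_per_second = rrd_files / interval_seconds
print(f"{updates_per_second:.1f} RRD updates per second on average")
```

That is roughly 83 file updates every second, sustained, which is why the disk system rather than the CPU became the bottleneck.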
Regards, Henrik
> - History logs and RRD files - i.e. the Hobbit modules that need to store data on disk - can be distributed among multiple servers. Hobbit will automatically send updates to the correct server, and fetch data from the server holding it when generating webpages. This also applies to the client-data logs that are stored when a critical event occurs. (4.3.0)
Is this leading the way to a hot-cold or hot-hot HA setup? I understand how one server could distribute jobs to a farm. But what if the central server goes down? If data is sync'ed throughout the environment, how could the 'freshest' be guaranteed through failures?
I am looking to implement an HA Hobbit solution where 10 minutes of recovery is acceptable while preserving historical data.
Are you just writing 'load sharing logic' or do you plan on developing failover/recovery logic as well?
Scott Walters -PacketPusher
Hi Scott,
you're always asking interesting questions :-)
On Mon, Feb 19, 2007 at 03:41:47PM -0500, Scott Walters wrote:
> > - History logs and RRD files - i.e. the Hobbit modules that need to store data on disk - can be distributed among multiple servers. Hobbit will automatically send updates to the correct server, and fetch data from the server holding it when generating webpages. This also applies to the client-data logs that are stored when a critical event occurs. (4.3.0)
> Is this leading the way to a hot-cold or hot-hot HA setup? I understand how one server could distribute jobs to a farm. But what if the central server goes down? If data is sync'ed throughout the environment, how could the 'freshest' be guaranteed through failures?
> I am looking to implement an HA Hobbit solution where 10 minutes of recovery is acceptable while preserving historical data.
> Are you just writing 'load sharing logic' or do you plan on developing failover/recovery logic as well?
The immediate need I had was load sharing. But I believe it can be used to implement failover as well. Let me explain.
A Hobbit server consists of one core daemon which has all of the current state information, and a bunch of more-or-less stateless "task" handlers. There's an "update the RRD files" task, an "analyze the client data" task, a "send out the alerts" task, and a "run the network tests" task. Plus some more, but you get the picture.
My plan is that you can have multiple servers running each of these tasks, and you can duplicate the tasks so they run on multiple servers. When each task is initialized, it tells Hobbit that "hey, I'm here and I can do alerts" - and then it basically just goes to sleep until it is notified that now it should actually do something. So, whenever the Hobbit server needs to hand off some action to a task, it checks what servers can handle it and just picks one that is available.
The information about what servers are available for handling the various tasks is contained in a small daemon running on the Hobbit server; think of it as a kind of "Hobbit-DNS" except that it is updated automatically.
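The registration-and-lookup idea described above can be sketched in a few lines of Python. This is only an illustration of the concept, not Hobbit code: the class name `TaskRegistry`, its methods, and the server names are all hypothetical, and picking a random available server is an assumed policy.

```python
import random
from collections import defaultdict


class TaskRegistry:
    """Hypothetical registry of which servers have announced which tasks."""

    def __init__(self):
        self._servers = defaultdict(set)   # task name -> servers that announced it
        self._down = set()                 # servers currently considered unavailable

    def register(self, task, server):
        """A task handler announces: 'I'm here and I can do <task>'."""
        self._servers[task].add(server)
        self._down.discard(server)

    def mark_down(self, server):
        """Record that a server stopped responding."""
        self._down.add(server)

    def available(self, task):
        """Servers that announced the task and are not marked down."""
        return sorted(self._servers[task] - self._down)

    def pick(self, task):
        """Pick any available server for a task that can run anywhere."""
        candidates = self.available(task)
        if not candidates:
            raise LookupError(f"no server available for task {task!r}")
        return random.choice(candidates)


registry = TaskRegistry()
registry.register("client", "hobbit-a")
registry.register("client", "hobbit-b")
registry.mark_down("hobbit-a")
chosen = registry.pick("client")   # only hobbit-b is left to choose from
```

Because the registry is filled in by the task handlers themselves, nothing has to be configured by hand on the core server when a task server is added or removed.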
Some tasks can run on any of the available servers. E.g. analyzing the client data can be done on any server running the hobbitd_client module; so it doesn't matter which of the available "client task" servers is invoked. (Obviously, the config files must be kept in sync on the servers, but that's why we have tools like rsync).
Some tasks store data - e.g. the RRD files. Those tasks can run on multiple servers, BUT: For any given host, there will be only one server holding the data. It's no good feeding the RRD updates to server A at 10:00 AM, and server B at 10:05 - because that would break the RRD data. So if the RRD files for "www.foo.com" live on server A, and that server crashes, then you will lose access to the RRD files for www.foo.com - but RRD files for hosts on the other servers will still be available. History logs are handled like RRD files. Now, you can argue that it would be nice if you could replicate the RRD- or history-updates to multiple servers so you would have a complete failover where you wouldn't lose access to some of the data. If there are enough requests it can be added - there's nothing in the design that prevents it. But perhaps it would just be simpler to mirror those files between the servers at regular intervals through some other program.
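The "one server per host" rule for stored data can be illustrated as follows. This is a sketch under assumptions, not the Hobbit implementation: `DataPlacement`, the server names, and the policy of giving a new host to the least-loaded live server are all made up for the example. The point it shows is that a host's data is never redirected to another server when its owner is down.

```python
class DataPlacement:
    """Hypothetical host -> storage server table; not Hobbit's real data structure."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.owner = {}      # hostname -> the one server holding its RRD/history data
        self.down = set()    # servers currently unreachable

    def server_for(self, host):
        """Return the server holding this host's data, or None if it is unavailable."""
        if host not in self.owner:
            live = [s for s in self.servers if s not in self.down]
            if not live:
                return None
            # Assumed policy: a new host goes to the live server with the fewest hosts.
            load = {s: 0 for s in live}
            for s in self.owner.values():
                if s in load:
                    load[s] += 1
            self.owner[host] = min(live, key=load.get)
        server = self.owner[host]
        # Never redirect to another server: that would split the host's RRD data.
        return None if server in self.down else server


placement = DataPlacement(["server-a", "server-b"])
first = placement.server_for("www.foo.com")    # assigned once, then fixed
second = placement.server_for("db.foo.com")
placement.down.add(first)                      # the server holding www.foo.com crashes
```

After the crash, a lookup for www.foo.com yields nothing until its server returns, while db.foo.com on the other server is unaffected.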
There are some tasks that can only run on one server at a time: E.g. the "send out alerts" task is one you wouldn't want to duplicate. So for this type of task, Hobbit will initially pick one server to handle it, and only if that server fails will it switch to another server.
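A sketch of that "one active server, switch only on failure" behaviour, in Python. The class `SingleActiveTask`, the `is_alive` callback, and the server names are hypothetical; how Hobbit actually detects a failed server is not specified here.

```python
class SingleActiveTask:
    """Hypothetical selector for a task that must run on exactly one server."""

    def __init__(self, candidates):
        if not candidates:
            raise ValueError("at least one candidate server is required")
        self.candidates = list(candidates)
        self.active = self.candidates[0]   # initial pick

    def current(self, is_alive):
        """Return the active server, switching only if the active one has failed."""
        if not is_alive(self.active):
            for server in self.candidates:
                if is_alive(server):
                    self.active = server
                    break
            else:
                raise RuntimeError("no server available for this task")
        return self.active


alive = {"hobbit-a", "hobbit-b"}


def is_alive(server):
    return server in alive


alerts = SingleActiveTask(["hobbit-a", "hobbit-b"])
before = alerts.current(is_alive)   # hobbit-a handles alerts
alive.discard("hobbit-a")
after = alerts.current(is_alive)    # hobbit-a failed, so hobbit-b takes over
alive.add("hobbit-a")
later = alerts.current(is_alive)    # stays on hobbit-b; no switch without a failure
```

Not switching back automatically avoids a window where two servers both believe they should send the same alert.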
So now there's a mechanism in place for having fail-over servers for the critical tasks, and load-balancing tasks among multiple servers. The missing piece is to duplicate the core Hobbit server, and replicate the information that is stored there (the current state of the system, and the what-servers-run-what-tasks info). That's the part I haven't quite worked out yet.
Replicating the data is fairly straightforward. Hobbit already has a mechanism in place for saving the current state in a "checkpoint" file, so it can restart without losing the current state info. So replication can be done by putting in some method for requesting the checkpoint data. Sure, you'd lose a few minutes' worth of updates in case of a failover - depending upon how often you update the standby-servers' data from the checkpoint - but since Hobbit updates everything every 5 minutes, I don't think that will be a major issue.
The tricky part is deciding when to do the failover. My current plan is to have a "standby" option for the backup Hobbit daemon where it just loads and picks up the checkpoint data from the master server at regular intervals; once that fails it goes on-line and starts behaving like a regular Hobbit daemon. That would suffice for a 2-server/hot-cold setup, and makes matters a lot less complicated (e.g. I won't have to deal with the issue of deciding who has the most recent data).
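The standby behaviour described above amounts to a simple polling loop. The Python sketch below is an illustration only: `run_standby` and `fetch_checkpoint` are hypothetical names, the 60-second interval is an assumption, and a real implementation would probably want to tolerate a few failed polls before taking over.

```python
import time


def run_standby(fetch_checkpoint, interval_seconds=60, sleep=time.sleep):
    """Poll the master for checkpoint data until a poll fails.

    fetch_checkpoint is a hypothetical callable that returns the master's
    checkpoint data and raises OSError when the master cannot be reached.
    Returns the most recent checkpoint, which the standby then goes
    on-line with.
    """
    last_checkpoint = None
    while True:
        try:
            last_checkpoint = fetch_checkpoint()
        except OSError:
            # Master is unreachable: stop polling and go on-line.
            return last_checkpoint
        sleep(interval_seconds)


# Simulated master that answers twice and then disappears.
responses = iter([b"state-1", b"state-2"])


def fake_fetch():
    try:
        return next(responses)
    except StopIteration:
        raise ConnectionError("master unreachable") from None


state = run_standby(fake_fetch, interval_seconds=0, sleep=lambda seconds: None)
```

The standby ends up with the last checkpoint it managed to fetch, so the amount of lost state is bounded by the polling interval.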
There are still a couple of murky details, like how do you get the clients to send their data to the server that is up? One way would be to send them a list of the available Hobbit servers whenever they send their client reports, so they always (except the first time) have a list of the current servers. If sending data to the first server fails, they must try the next server in the list - if that works, then they'll get a new list back with the new Hobbit server as the first one to try.
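On the client side, the "try the next server in the list" idea could look roughly like this. It is a sketch, not the Hobbit client protocol: `send_report`, the `send` callback, and the server names are hypothetical, and the reply format (a plain list of servers, working server first) is an assumption.

```python
def send_report(report, servers, send):
    """Try each server in order; return the server list sent back by the first that answers.

    send(server, report) is a hypothetical transport function that returns
    the server's current list of Hobbit servers and raises OSError on failure.
    """
    for server in servers:
        try:
            return send(server, report)
        except OSError:
            continue   # this server is down, try the next one in the list
    raise ConnectionError("no Hobbit server reachable")


def fake_send(server, report):
    """Simulated transport: hobbit-a is down, hobbit-b answers."""
    if server == "hobbit-a":
        raise ConnectionRefusedError(server)
    # The answering server puts itself first in the list it returns.
    return [server, "hobbit-a"]


known_servers = ["hobbit-a", "hobbit-b"]
known_servers = send_report("client data", known_servers, fake_send)
```

After one failed attempt the client's list starts with the server that is actually up, so later reports go straight to it.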
Those are my ideas. Feedback is very welcome from anyone; this is a relatively new area for me to be working with (at least from a programmer perspective), so any input will be appreciated.
Regards, Henrik
On 2/19/07, Henrik Stoerner <henrik at hswn.dk> wrote:
> Hi Scott,
> you're always asking interesting questions :-)
Thanks. I changed majors to Philosophy since returning to college to finish my undergraduate degree. I'm glad to see it's paying off ;)
> Some tasks can run on any of the available servers. E.g. analyzing the client data can be done on any server running the hobbitd_client module; so it doesn't matter which of the available "client task" servers is invoked. (Obviously, the config files must be kept in sync on the servers, but that's why we have tools like rsync).
Hmmm...since you already have a "task master" it might be convenient to make it the "config master" as well. Similar to the hobbit-client.cfg?
> Some tasks store data - e.g. the RRD files. Those tasks can run on multiple servers, BUT: For any given host, there will be only one server holding the data. It's no good feeding the RRD updates to server A at 10:00 AM, and server B at 10:05 - because that would break the RRD data. So if the RRD files for "www.foo.com" live on server A, and that server crashes, then you will lose access to the RRD files for www.foo.com - but RRD files for hosts on the other servers will still be available. History logs are handled like RRD files. Now, you can argue that it would be nice if you could replicate the RRD- or history-updates to multiple servers so you would have a complete failover where you wouldn't lose access to some of the data. If there are enough requests it can be added - there's nothing in the design that prevents it. But perhaps it would just be simpler to mirror those files between the servers at regular intervals through some other program.
Yes, that's the million dollar question: Should HA with integrity of RRD/history files be part of Hobbit? Even if you do put the history-updates to multiple servers, you still have the nightmare of how to sync things up when the "dead" server comes back up.
> Those are my ideas. Feedback is very welcome from anyone; this is a relatively new area for me to be working with (at least from a programmer perspective), so any input will be appreciated.
Because of the complexity of HA solutions and data integrity, I am not sure the hobbit code is the right place for the logic. Similar to the database backend, you'll open yourself up to a lot of potential debugging. I am a keep it simple stupid kinda guy and I am reminded of a saying, "A man with one watch always knows what time it is."
I'd rather see the hobbit tool improve monitoring, reports, and other features that really matter. Let the HA happen outside of hobbit.
I also believe you should only cluster/load-balance when one box can't do the job. Introducing those complexities to increase availability is usually counterproductive -- you end up taking your system down because it's so hard to configure/maintain. And then it usually doesn't work anyway when it's supposed to.
Scott Walters -PacketPusher
On Mon, Feb 19, 2007 at 11:36:08PM -0500, Scott Walters wrote:
> Because of the complexity of HA solutions and data integrity, I am not sure the hobbit code is the right place for the logic. Similar to the database backend, you'll open yourself up to a lot of potential debugging. I am a keep it simple stupid kinda guy and I am reminded of a saying, "A man with one watch always knows what time it is."
> I'd rather see the hobbit tool improve monitoring, reports, and other features that really matter. Let the HA happen outside of hobbit.
I do try to keep it as simple as possible. The loadbalancing stuff had almost no impact on the existing code, and if at all possible I'll isolate this in a separate module so a "normal" single-site setup won't have to deal with it.
But I do sympathize with your point. You could build an HA Hobbit setup today using standard tools - shared storage and standard failover software like the Linux-HA tools - and perhaps that is the best way for this.
> I also believe you should only cluster/load-balance when one box can't do the job.
That is the problem I was facing recently, so there was no way to avoid that.
> Introducing those complexities to increase availability is usually counterproductive -- you end up taking your system down because it's so hard to configure/maintain. And then it usually doesn't work anyway when it's supposed to.
*grin* yes, this was clearly demonstrated in an incident we had last week at work.
Regards, Henrik
From where I sit, an active hobbit server with a running hot standby seems to be fairly easy to implement now. I haven't tried to set it up, but I've looked at the requirements.
I'm currently running this config; I use the hot standby for initial server testing on new releases and for checking out different bb-hosts layouts for cosmetic appeal. All my systems know both server addresses. And the hot standby does everything except network tests and alerting.
All that would need to happen in the event the primary hobbit server failed would be to update the hobbitlaunch.cfg to enable the network testing module and the alerting module. And move the IP address of the webserver. This should be doable with the currently available HA toolset for Linux (I'll know more in a few weeks -- I've been promised a new pair of hobbit servers to implement this on).
This does require suitable network bandwidth to run the data to both systems, and it will require playing with the hobbit checkpoint file so the failover system will know the proper enable/disable/ack statuses on restart.
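The "update hobbitlaunch.cfg to enable the modules" step could be scripted so that failover software can run it. The Python sketch below assumes the disabled tasks are marked with a `DISABLED` line inside their section of hobbitlaunch.cfg; the section names `nettest` and `alert` and the `CMD` values in the sample are placeholders, not the names from a real installation, so check your own file before using anything like this.

```python
def enable_sections(config_text, sections):
    """Return config_text with the DISABLED line dropped from the named sections."""
    wanted = {f"[{name}]" for name in sections}
    in_wanted = False
    kept = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # A new section starts; remember whether it is one we want enabled.
            in_wanted = stripped in wanted
        if in_wanted and stripped == "DISABLED":
            continue   # drop the marker so the launcher starts this task
        kept.append(line)
    return "\n".join(kept) + "\n"


# Placeholder config for a hot standby: network tests and alerting are off.
standby_config = (
    "[nettest]\n"
    "\tDISABLED\n"
    "\tCMD run-network-tests\n"
    "[alert]\n"
    "\tDISABLED\n"
    "\tCMD send-alerts\n"
    "[display]\n"
    "\tCMD build-pages\n"
)

active_config = enable_sections(standby_config, ["nettest", "alert"])
```

Moving the web server's IP address and restoring the enable/disable/ack state from the checkpoint file would still have to be handled separately by the failover tooling.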
Tom Kauffman NIBCO, Inc