New Hobbit stuff: Scalability and H/A work
A couple of weeks ago, I was asked if our Hobbit system at work could handle monitoring of one more customer. Of course, I said - no problem. Well, there was one gotcha: This customer has 1100+ servers that need to be monitored. Which means my Hobbit installation is about to double in the number of hosts monitored. Hmm ...
This will be interesting to watch. I am fairly confident that Hobbit can handle it, with one exception: The disks on my Hobbit server will be overloaded. It already spends about 50% of its time in I/O wait, so doubling the number of hosts with CPU/memory/disk etc. graphs will probably crash it.
So something needs to be done - fast: These hosts should go into our Hobbit before Christmas. That is why I currently may seem a bit absent from the mailing list.
The way I plan to handle it will be by distributing the load of RRD updates onto several servers, each handling a subset of the total set of hosts; Hobbit will automatically detect which of the 3 RRD-servers handles a specific host and direct rrd-updates to that server. New hosts are distributed across all of the RRD servers in a weighted round-robin fashion.
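As a rough illustration of the distribution idea (not the actual Hobbit implementation, which is not worked out in this post), here is a minimal Python sketch. It assumes a "smooth" weighted round-robin for placing new hosts and a simple in-memory lookup so that a known host always maps to the same RRD server. The class name `RRDServerPool`, the server names, and the weights are all hypothetical.

```python
from collections import Counter


class RRDServerPool:
    """Hypothetical sketch: sticky host-to-RRD-server mapping,
    with new hosts placed by smooth weighted round-robin."""

    def __init__(self, weights):
        # weights: mapping of server name -> positive integer weight
        self.weights = dict(weights)
        self.current = {name: 0 for name in self.weights}
        self.assignment = {}  # host name -> server name

    def _next_server(self):
        # Smooth weighted round-robin: raise every counter by its weight,
        # pick the largest counter, then lower it by the total weight.
        total = sum(self.weights.values())
        for name, weight in self.weights.items():
            self.current[name] += weight
        best = max(self.current, key=self.current.get)
        self.current[best] -= total
        return best

    def server_for(self, host):
        # Known hosts keep their server; only new hosts are distributed.
        if host not in self.assignment:
            self.assignment[host] = self._next_server()
        return self.assignment[host]


# Assumed example: three RRD servers, the first with twice the capacity.
pool = RRDServerPool({"rrd1": 2, "rrd2": 1, "rrd3": 1})
for i in range(8):
    pool.server_for(f"host{i}")
counts = Counter(pool.assignment.values())
print(dict(counts))
```

With these assumed weights, every four new hosts are split 2:1:1 across the three servers, and asking again for a host that is already known returns the server it was first assigned to.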
The way this is going to be done means that it can be used not only for distributing the load of the RRD file updates, but also for distributing the other hobbitd_* modules (alerting, history logs, client data processing etc). In other words, this will be a major win for Hobbit in large installations.
It also has one more benefit: I think this can be evolved to handle automatic failover, so you can run multiple Hobbit servers that process the same data - meaning all of the on-disk data will be identical across all of the Hobbit servers. This should make it possible to set up a group of Hobbit servers for very high availability of the monitoring system. I haven't worked out all of the implementation details yet, but I think it is possible.
Regards, Henrik
Henrik Stoerner wrote:
A couple of weeks ago, I was asked if our Hobbit system at work could handle monitoring of one more customer. Of course, I said - no problem. Well, there was one gotcha: This customer has 1100+ servers that need to be monitored. Which means my Hobbit installation is about to double in the number of hosts monitored. Hmm ...
This will be interesting to watch. I am fairly confident that Hobbit can handle it, with one exception: The disks on my Hobbit server will be overloaded. It already spends about 50% of its time in I/O wait, so doubling the number of hosts with CPU/memory/disk etc. graphs will probably crash it.
Strange that I/O seems to be an issue for you. What kind of system do you run the Hobbit server on?
I have ~3300 RRD files updated here on a blade with an old 40 GB, 5400 rpm, 2.5" hard drive and it is almost idle.
Cheers, Gildas
On Tue, Oct 31, 2006 at 04:48:34PM +0000, Gildas Le Nadan wrote:
Henrik Stoerner wrote:
This will be interesting to watch. I am fairly confident that Hobbit can handle it, with one exception: The disks on my Hobbit server will be overloaded. It already spends about 50% of its time in I/O wait, so doubling the number of hosts with CPU/memory/disk etc. graphs will probably crash it.
Strange that I/O seems to be an issue for you. What kind of system do you run the Hobbit server on?
The server is an oldish Sun E220 with two new 72 GB SCSI disks, 10K rpm.
I have ~3300 RRD files updated here on a blade with an old 40 GB, 5400 rpm, 2.5" hard drive and it is almost idle.
I have about 25000 RRD files :-) That's about 80 updates every second.
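The "about 80 updates every second" figure can be checked with a quick calculation. This sketch assumes each RRD file is updated once per 5-minute (300 second) interval, which is not stated in the message, and that the number of RRD files grows in proportion to the number of hosts when the host count doubles.

```python
def updates_per_second(rrd_files, interval_seconds=300):
    """Average RRD updates per second, assuming every file is updated
    once per interval (the 300-second interval is an assumption)."""
    return rrd_files / interval_seconds


current = updates_per_second(25000)
# Assumption: twice the hosts means roughly twice the RRD files.
doubled = updates_per_second(2 * 25000)

print(round(current, 1))   # 83.3
print(round(doubled, 1))   # 166.7
```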
Regards, Henrik
From: henrik at hswn.dk (Henrik Stoerner)
Reply-To: hobbit at hswn.dk
To: hobbit at hswn.dk
Subject: [hobbit] New Hobbit stuff: Scalability and H/A work
Date: Tue, 31 Oct 2006 17:03:34 +0100
A couple of weeks ago, I was asked if our Hobbit system at work could handle monitoring of one more customer. Of course, I said - no problem. Well, there was one gotcha: This customer has 1100+ servers that need to be monitored. Which means my Hobbit installation is about to double in the number of hosts monitored. Hmm ...
This will be interesting to watch. I am fairly confident that Hobbit can handle it, with one exception: The disks on my Hobbit server will be overloaded. It already spends about 50% of its time in I/O wait, so doubling the number of hosts with CPU/memory/disk etc. graphs will probably crash it.
So something needs to be done - fast: These hosts should go into our Hobbit before Christmas. That is why I currently may seem a bit absent from the mailing list.
The way I plan to handle it will be by distributing the load of RRD updates onto several servers, each handling a subset of the total set of hosts; Hobbit will automatically detect which of the 3 RRD-servers handles a specific host and direct rrd-updates to that server. New hosts are distributed across all of the RRD servers in a weighted round-robin fashion.
So this is load balancing: each Hobbit server keeps a portion of all the hosts' RRD files.
The way this is going to be done means that it can be used not only for distributing the load of the RRD file updates, but also for distributing the other hobbitd_* modules (alerting, history logs, client data processing etc). In other words, this will be a major win for Hobbit in large installations.
It also has one more benefit: I think this can be evolved to handle automatic failover, so you can run multiple Hobbit servers that process the same data - meaning all of the on-disk data will be identical across all of the Hobbit servers. This should make it possible to set up a group of Hobbit servers for very high availability of the monitoring system. I haven't worked out all of the implementation details yet, but I think it is possible.
Please check out FSTHA (R1), the open-source cluster/HA software for Solaris. Then the only part you need to work on is the "distributed BB message distribution".
R1: http://www.fstha.com/compare.html
Regards
tj
Regards, Henrik
participants (3): gn1@sanger.ac.uk, henrik@hswn.dk, tj_yang@hotmail.com