New Hobbit stuff: Scalability and H/A work

31 Oct 2006


A couple of weeks ago, I was asked if our Hobbit system at work could
handle monitoring of one more customer. Of course, I said - no problem.
Well, there was one gotcha: This customer has 1100+ servers that need
to be monitored, which means my Hobbit installation is about to double
in the number of hosts monitored. Hmm ...

This will be interesting to watch. I am fairly confident that Hobbit can
handle it, with one exception: The disks on my Hobbit server will be
overloaded. The server already spends about 50% of its time in I/O wait,
so doubling the number of hosts with cpu/memory/disk etc. graphs will
probably crash it.

So something needs to be done - fast: These hosts should go into our
Hobbit before Christmas. That is why I may currently seem a bit absent
from the mailing list.

I plan to handle it by distributing the load of RRD updates onto
several servers, each handling a subset of the total set of hosts.
Hobbit will automatically detect which of the 3 RRD servers handles a
specific host and direct RRD updates to that server. New hosts are
distributed across all of the RRD servers in a weighted round-robin
fashion.
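
To make the idea concrete, here is a rough sketch of that routing in
Python. It is not Hobbit code and not necessarily how the real
implementation will look: the class name RrdRouter, the server names and
the weights are all made up for illustration, and the "smooth" weighted
round-robin variant is just one common way to do the selection. A host
that has been seen before always goes back to the server holding its RRD
files; only new hosts go through the weighted round-robin pick.

```python
class RrdRouter:
    """Hypothetical sketch: sticky host-to-server routing, with
    smooth weighted round-robin used only for hosts not seen before."""

    def __init__(self, weights):
        # weights: server name -> positive integer weight (illustrative values)
        self.weights = dict(weights)
        self.total = sum(self.weights.values())
        # Running counters used by the smooth weighted round-robin pick.
        self.current = {name: 0 for name in self.weights}
        # host -> server, remembered once a host has been placed.
        self.assigned = {}

    def _next_server(self):
        # Raise every server's counter by its weight, pick the highest,
        # then lower the winner by the total weight so others catch up.
        for name, weight in self.weights.items():
            self.current[name] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen

    def server_for(self, host):
        # Known host: reuse the server that already has its RRD files.
        if host not in self.assigned:
            self.assigned[host] = self._next_server()
        return self.assigned[host]


if __name__ == "__main__":
    # Assumed setup: three RRD servers, the first with twice the capacity.
    router = RrdRouter({"rrd1": 2, "rrd2": 1, "rrd3": 1})
    for i in range(8):
        host = f"host{i}"
        print(host, "->", router.server_for(host))
```

With these weights, rrd1 ends up with half of the new hosts and the
other two servers with a quarter each. In a real setup the host-to-server
map would of course have to be stored somewhere persistent, so the
assignments survive a restart.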

The way this is going to be done means that it can be used not only for
distributing the load of the RRD file updates, but also for distributing
the other hobbitd_* modules (alerting, history logs, client data
processing etc). In other words, this will be a major win for Hobbit in
large installations.

It also has one more benefit: I think this can be evolved to handle
automatic failover, so you can run multiple Hobbit servers that process
the same data - meaning all of the on-disk data will be identical across
all of the Hobbit servers. This should make it possible to set up a group
of Hobbit servers for very high availability of the monitoring system. I
haven't worked out all of the implementation details yet, but I think it
is possible.

Regards,
Henrik

henrik＠hswn.dk