A couple of weeks ago, I was asked if our Hobbit system at work could handle monitoring of one more customer. Of course, I said - no problem. Well, there was one gotcha: This customer has 1100+ servers that need to be monitored. Which means my Hobbit installation is about to double in the number of hosts monitored. Hmm ...
This will be interesting to watch. I am fairly confident that Hobbit can handle it, with one exception: The disks on my Hobbit server will be overloaded. It already spends about 50% of its time in I/O wait, so doubling the number of hosts with CPU/memory/disk etc. graphs will probably crash it.
So something needs to be done - fast: These hosts should go into our Hobbit before Christmas. That is why I currently may seem a bit absent from the mailing list.
The way I plan to handle it is by distributing the load of RRD updates onto several servers, each handling a subset of the total set of hosts. Hobbit will automatically detect which of the 3 RRD servers handles a specific host and direct RRD updates to that server. New hosts are distributed across all of the RRD servers in a weighted round-robin fashion.
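To illustrate the idea, here is a minimal Python sketch of that placement logic. It is not Hobbit code (the real implementation would live in the C modules and may look quite different); the class name, server names and weights are made up for the example. It only shows the two rules described above: a host that already has a server stays there, and a new host gets the next server in a weighted round-robin.

```python
from itertools import cycle


class RRDDistributor:
    """Hypothetical sketch: sticky host-to-server mapping plus weighted round-robin."""

    def __init__(self, weights):
        # weights: server name -> positive integer weight (illustrative values only).
        # Each server appears in the slot list once per unit of weight,
        # and the slot list is cycled through forever.
        slots = [name for name, weight in weights.items() for _ in range(weight)]
        self._slots = cycle(slots)
        self._assigned = {}  # host -> server, remembered once chosen

    def server_for(self, host):
        # A known host always goes to the server that already holds its RRD files.
        if host not in self._assigned:
            # A new host takes the next slot in the weighted rotation.
            self._assigned[host] = next(self._slots)
        return self._assigned[host]
```

With weights such as `{"rrd1": 2, "rrd2": 1, "rrd3": 1}`, eight new hosts would end up as four on rrd1 and two each on rrd2 and rrd3, and asking again for any of those hosts returns the same server. This simple version places hosts on the heavier server in bursts; a smoother interleaving is possible but not needed to show the principle.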
The way this is going to be done means that it can be used not only for distributing the load of the RRD file updates, but also for distributing the other hobbitd_* modules (alerting, history logs, client data processing etc.). In other words, this will be a major win for Hobbit in large installations.
It also has one more benefit: I think this can be evolved to handle automatic failover, so you can run multiple Hobbit servers that process the same data, meaning all of the on-disk data will be identical across all of the Hobbit servers. This should make it possible to set up a group of Hobbit servers for very high availability of the monitoring system. I haven't worked out all of the implementation details yet, but I think it is possible.
Regards, Henrik