Hi Wolfgang,
your mails ended up in my spam mailbox for some reason. I started writing a reply directly to you, but decided to send it to the list as well, so it can be found through the mail archive if anyone else needs it.
Now I need your help, because my professor at my university wants me to present the software architecture of Hobbit in my thesis. He wants some pictures and diagrams showing the 'inner mechanisms' and how the different modules work together. So my question is, do you have any information about the architecture that you can share with me?
I don't have any formal design documents; the Hobbit design has evolved from a few basic principles and some ideas I had. I'll try to give you a quick summary.
Hobbit should be portable across Unix architectures. This is obviously important for the Hobbit client code, but I've done my best to use only "standard" Unix and C for both the server- and client-code. This has turned out to be easier than I thought; there are some really old Unix systems that cannot run the Hobbit server, but any recent version of a Unix-like system (released within the past 5-10 years) runs Hobbit without problems.
Hobbit also has to be backwards compatible with Big Brother, in the sense that you can use BB clients with a Hobbit server. I had over 1000 servers with a BB client on them (Unix and Windows), so doing a "Big Bang" switch changing the servers and clients at once would be impossible.
Hobbit must scale well. When I started using Big Brother I had 40 servers to monitor. When I got to 100, I had to re-implement parts of BB and this became the "bbgen" toolkit. When I got to 500, my BB server was getting overloaded, and I started to work on a replacement. Today I have 2500 servers, and before Christmas I will be monitoring nearly 4000 servers. That's a 100x increase in just 5 years.
Hobbit should not rely on a lot of other infrastructure to work. E.g. it doesn't require a huge database backend; you can add one if you like, but it is not needed. Keeping Hobbit simple makes it robust - and you really, REALLY want your monitoring to work when everything else is crashing. We once had a major power outage at our datacenter; the Hobbit server came up quickly, but there were a lot of systems that needed some manual intervention to get back online. It was quite interesting to see how the activity on the Hobbit server just sky-rocketed, because everyone was looking at Hobbit to see which systems were running, and which were down.
Analyzing data
For me, a key principle of handling the data that is poured into Hobbit is that as much of the data analysis as possible should take place on the Hobbit server. I believe it is a huge benefit to keep configuration settings in one place (the Hobbit server); and by having access to the raw data, you can also perform types of analysis that nobody thought of when a script was initially created to collect some data.
E.g. the Hobbit client reports data about who is logged on to a system. This information is not currently used by Hobbit, but I know that someone wrote a custom backend utility to check this data and alert him if there was someone logged in as "root". I had not thought of this, but by making the raw data from the client available, it was very easy for him to implement this check on all of his servers.
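A custom check like that could look roughly like the sketch below. It is not the utility mentioned above; it assumes that the client message is plain text split into sections introduced by a "[name]" line (e.g. "[who]"), and the sample message and the function names are made up for illustration.

```python
def section(client_message: str, name: str) -> list:
    """Return the lines of the "[name]" section of a client message."""
    lines, inside = [], False
    for line in client_message.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # A new section header: we are inside only if it is the one we want.
            inside = (stripped == f"[{name}]")
            continue
        if inside:
            lines.append(line)
    return lines


def root_logins(client_message: str) -> list:
    """Return the "who" lines whose first field (the user name) is root."""
    hits = []
    for line in section(client_message, "who"):
        fields = line.split()
        if fields and fields[0] == "root":
            hits.append(line)
    return hits


# Invented sample data, only for illustration.
SAMPLE = """\
[uptime]
 10:02:11 up 12 days
[who]
henrik   pts/0   Nov  2 09:14 (10.0.0.5)
root     pts/1   Nov  2 09:58 (10.0.0.9)
[df]
Filesystem 1024-blocks Used Available Capacity Mounted on
"""
```

Calling root_logins(SAMPLE) returns the single "root" line; a real backend would then send an alert or a status update instead of just returning the list.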
This is quite different from Big Brother, where the data is pre-processed into status messages. The BB client can check if a process is running, but then it just reports "process foo is OK". When process "foo" is NOT running you get an error status - but you only get part of the information, not the full process listing. So you cannot easily see if process "foo" has stopped because your backup is running at the same time (and they cannot coexist).
This may seem like a trivial example, but I realized early on that there are far more ways of using these data than I could possibly imagine. So instead of forcing my ideas of how to use the data upon others, it should be possible to just get the raw data and perform your own analysis of it.
Another example is that in Hobbit 4.2, I added a module which saves a copy of the client data if a status goes red on a host. This has turned out to be extremely helpful in diagnosing those "why did the webserver crash at 4 AM last Tuesday" questions ... because you have access to a lot of raw data collected by the client just before the crash happened, including all of the data that Hobbit didn't analyze by itself but which humans can use to put the whole picture together.
This principle is not implemented completely yet. The network test utility - which was also carried over from the bbgen toolkit - works the "Big Brother" way. One thing on my agenda is to change that, so the network tester just reports that the ping of host "foo" responded in 12.7 ms, the ping of host "bar" failed and so on. Then a module on the Hobbit server can decide if these should result in a red or yellow status, perhaps based on other information it has (e.g. that the response time shouldn't exceed 10 ms during working hours, unless the primary network connection was down so we were running on a backup line with less capacity).
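To illustrate the idea, here is a minimal sketch of such a server-side decision. It is not existing Hobbit code; the names (PingResult, ping_color), the working hours and the choice of yellow for a slow response are assumptions made only for this example.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass
class PingResult:
    """Raw result as the network tester might report it (hypothetical)."""
    host: str
    responded: bool
    rtt_ms: Optional[float] = None


def ping_color(result: PingResult, now: time, max_rtt_ms: float = 10.0,
               on_backup_line: bool = False,
               work_start: time = time(8, 0),
               work_end: time = time(17, 0)) -> str:
    """Decide the status color on the server from the raw ping result."""
    if not result.responded:
        return "red"
    working_hours = work_start <= now < work_end
    # The response-time limit only applies during working hours, and is
    # waived while traffic runs over the slower backup line.
    if working_hours and not on_backup_line and result.rtt_ms > max_rtt_ms:
        return "yellow"
    return "green"
```

With these assumptions, a 12.7 ms response from "foo" at 10:00 gives yellow, the same response at 22:00 or on the backup line gives green, and the failed ping of "bar" gives red.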
The core daemons
I wanted to have a network daemon holding all of the "current state" information. This information changes all the time as new status reports arrive, so it has to be kept in memory - writing it to disk would be too slow (BB did this, and it doesn't scale). So a core component of Hobbit would be this central daemon (hobbitd). The daemon NEVER does any disk I/O; this would slow it down and I don't want that, because Hobbit must support monitoring of thousands of servers. All communication between hobbitd and the outside world goes via a network connection; this is used both for in-band data (status updates and data messages) and for out-of-band data like control messages (drop a host, disable a server and so on). Tools that need to fetch the entire status of all servers, or just the detailed status of a host, also do this through a network connection to hobbitd.
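As a small illustration of the in-band traffic, the sketch below builds a Big Brother style status message and sends it to hobbitd over TCP. The message layout ("status host.test color text", with the dots in the host name written as commas) is the Big Brother convention; the port number 1984, the function names and the server name are assumptions for this example.

```python
import socket


def build_status(host: str, test: str, color: str, text: str) -> str:
    """Build a BB-style status message for one host/test pair."""
    # Dots in the host name are written as commas, so the only "." left
    # is the one separating the host name from the test name.
    return f"status {host.replace('.', ',')}.{test} {color} {text}"


def send_to_hobbitd(message: str, server: str, port: int = 1984) -> None:
    """Send one message to the daemon and close the connection."""
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall(message.encode("utf-8"))
```

A custom test could then call send_to_hobbitd(build_status("www.example.com", "http", "green", "OK"), "hobbit.example.com"), where both host names are placeholders.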
However, some things must be stored on disk - RRD (graph) files, for instance, or historical eventlogs. So this is handled by a bunch of independent "worker" modules - hobbitd_rrd (RRD updates), hobbitd_history (history logs), hobbitd_alert (sending out alerts). These obviously have to be fed information about the data that flows into the hobbitd daemon - e.g. hobbitd_rrd needs the full status message to extract the data it puts into the RRD files, and hobbitd_history needs information about the status changes from green->red and so on.
So I needed a fast inter-process communication mechanism between hobbitd and the worker modules. Also, I wanted to be able to start/stop/restart worker modules on-the-fly; this is extremely nice for testing and makes the system much more robust. Finally, I wanted an interface that was simple to use, so that end-users can hook into the data stream if they need to write some custom back-end script.
The solution for this was a mechanism that uses the System V "shared memory" IPC mechanism, combined with a group of semaphores to control access to the shared memory area. So hobbitd copies a message into the shared memory area and ups a semaphore, telling the workers that there is a new message. The workers then pick up the message and down another semaphore once they have secured their copy of the message; hobbitd then knows when it is safe to overwrite the shared-memory area with a new message.
I call this IPC mechanism a "channel", and there are in fact several of these: one for each type of message. So there is a channel which receives all of the raw "status" messages; another channel for the raw "data" messages; a channel that receives messages about status changes (for history logging); a channel that receives messages about critical red/yellow statuses (for alerts) and so on. Recently a new channel was added for the "client" messages that come from the Hobbit client.
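The handshake can be modelled in a few lines. The sketch below is only an in-process analogue using Python threads, not the System V shared memory and semaphore code in hobbitd: a plain attribute stands in for the shared memory area, and instead of the workers downing a semaphore, each worker ups an acknowledgement semaphore that the writer waits on. All names are invented for the example.

```python
import threading


class Channel:
    """One-slot channel: the writer overwrites the slot only after every
    worker has taken its own copy of the previous message."""

    def __init__(self, n_workers: int):
        self.n_workers = n_workers
        self.buffer = None  # stands in for the shared memory area
        self.ready = [threading.Semaphore(0) for _ in range(n_workers)]
        self.copied = threading.Semaphore(0)

    def post(self, message):
        self.buffer = message
        for sem in self.ready:
            sem.release()          # tell each worker there is a new message
        for _ in range(self.n_workers):
            self.copied.acquire()  # wait until every worker has its copy

    def fetch(self, worker_id: int):
        self.ready[worker_id].acquire()  # block until a message is posted
        message = self.buffer            # secure our own copy
        self.copied.release()            # let the writer know we are done
        return message


def run_demo(messages, n_workers: int = 2):
    """Post each message to all workers; return what each worker received."""
    channel = Channel(n_workers)
    received = [[] for _ in range(n_workers)]

    def worker(i):
        while True:
            msg = channel.fetch(i)
            if msg is None:  # None is used as a shutdown marker here
                break
            received[i].append(msg)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for m in messages:
        channel.post(m)
    channel.post(None)
    for t in threads:
        t.join()
    return received
```

Because post() does not return until all workers have acknowledged, every worker sees every message exactly once and in order, at the price of the writer waiting for the slowest worker.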
There are some early notes about this mechanism in the hobbitd/new-daemon.txt file in the hobbit sources. Not all of the ideas there have been implemented, e.g. the "streaming" protocol turned out not to be particularly important.
To make sure that the semaphore stuff is handled correctly, I decided to put a "buffer" module between hobbitd and the actual workers. This is the hobbitd_channel module; it serves only one purpose, which is to grab the messages that hobbitd sends out through the IPC mechanism, and queue them for the real worker module (hobbitd_rrd, hobbitd_history etc). The fact that hobbitd_channel acts as a message queue is useful to accommodate spikes in the activity; the alert module, for instance, sometimes gets a huge burst of messages when a network switch dies. hobbitd_channel also makes it easy to build your own backend modules, because it forwards the messages via a simple text-based pipe; so your custom backend modules can just read them from stdin.
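A custom backend module of that kind could be structured like the sketch below. The framing is an assumption made for the example (each message starts with a header line beginning with "@@" and ends with a line containing only "@@"), and the header fields in the sample are invented; check the Hobbit sources for the exact format before relying on it.

```python
import io


def read_messages(stream):
    """Yield (header, body_lines) for each complete message on the stream."""
    header, body = None, []
    for raw in stream:
        line = raw.rstrip("\n")
        if line == "@@":
            # End-of-message marker: hand over what we have collected.
            if header is not None:
                yield header, body
            header, body = None, []
        elif header is None:
            # Between messages: only a header line starts a new message.
            if line.startswith("@@"):
                header = line[2:]
        else:
            body.append(line)


# Invented sample input, only for illustration.
SAMPLE = (
    "@@status#1/www.example.com|field|field\n"
    "status www,example,com.http green OK\n"
    "@@\n"
    "@@status#2/db.example.com|field|field\n"
    "status db,example,com.disk red\n"
    "/var is 98% full\n"
    "@@\n"
)
```

A real module would loop over read_messages(sys.stdin) instead of the io.StringIO(SAMPLE) used for testing, and do its work inside that loop.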
Another benefit of having hobbitd_channel between hobbitd and the worker modules showed up recently; I am currently working on a new version of hobbitd_channel which can distribute the incoming messages between multiple worker "clones" running on different servers, to perform some load balancing of the heavy tasks (primarily RRD file updates). This has been implemented almost exclusively by changing hobbitd_channel, instead of having to modify all of the worker modules.
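How the messages are divided between the clones is not described above. One plausible scheme, shown here purely as an assumption, is to hash the host name so that all messages for a given host always go to the same clone (which would keep the RRD files for one host on one server). The function names are invented for the example.

```python
import zlib


def pick_worker(hostname: str, n_workers: int) -> int:
    """Map a host name to a worker index in a stable way."""
    # zlib.crc32 gives the same value in every process and on every run,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(hostname.encode("utf-8")) % n_workers


def distribute(hostnames, n_workers: int):
    """Group host names by the worker that would handle them."""
    buckets = [[] for _ in range(n_workers)]
    for name in hostnames:
        buckets[pick_worker(name, n_workers)].append(name)
    return buckets
```

The drawback of a simple modulo scheme is that changing the number of clones moves most hosts to a different clone, so it only fits a setup where the set of clones is fairly static.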
So the core design looks like this:
Network tests --
                 \ TCP:1984   IPC
Clients ----------> hobbitd --------> hobbitd_channel ---> worker module
                  /         Sh. mem.                 stdin
                 /
Custom tests ---/
The Web interface
The web interface is mostly carried over from Hobbit's predecessor, the "bbgen" toolkit. I wrote this for Big Brother, to speed up the generation of the Big Brother webpages, and by re-using this in Hobbit I would quickly get a working web interface - all I had to do was to change the programs to grab their data from the hobbitd daemon, instead of reading through the status logfiles that Big Brother uses.
This also means that the web interface is not tied in with the core daemons. Sure, they need to communicate and there are some things in the core daemons that are closely related to how the web interface works - e.g. disabling a host. But it should be possible for an adventurous programmer to use the core Hobbit daemons with their own web front-end tools and come up with a completely different user-interface.
So the web interface is probably the part of Hobbit that has evolved the least from its origins in Big Brother. Some new CGI programs have been added, but nothing revolutionary - it just picks up bits of information from hobbitd and the configuration files and displays them.
One design criterion for the web interface is that it should be as dynamic as possible; it must reflect the current status and configuration as closely as possible. That is why most of the web interface is done with CGI programs; the only static webpages in Hobbit are the overview pages generated by bbgen - and I hope to eliminate those soon.
The clients
So with this background, it is obvious that the Hobbit client is really, really dumb. It is basically just a shell script that runs some normal OS commands - df, ps, who and so on - and then it's up to the Hobbit server to analyze the output and generate some status columns. Client data is sent to hobbitd, which feeds it through a channel to the hobbitd_client module. hobbitd_client has parsers for each of the operating systems it knows about, and uses those to grab the interesting data and compare it to the client configuration rules. Then hobbitd_client generates some "status" messages and sends them to hobbitd.
The major challenge with this design is logfiles; you cannot realistically send entire logfiles - some of them are several GB of data - over to Hobbit for analysis every 5 minutes. So some filtering must be done on the client side; to keep all of the configuration data on the Hobbit server, this means that the client has to pick up its filter configuration from the Hobbit server.
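The client-side filtering could be sketched like this. It is not the Hobbit client code: the patterns are assumed to be regular expressions that the client has already fetched from the Hobbit server (how they are fetched is left out), and the function name and the line limit are made up for the example.

```python
import re


def filter_log(lines, patterns, max_lines: int = 200):
    """Keep only lines matching a server-supplied pattern, newest last."""
    if max_lines <= 0:
        return []
    compiled = [re.compile(p) for p in patterns]
    matches = [line for line in lines
               if any(c.search(line) for c in compiled)]
    # Cap the amount of data sent to the server; keep the most recent lines.
    return matches[-max_lines:]
```

For example, filter_log(["boot ok", "ERROR disk full", "kernel panic"], ["ERROR", "panic"]) keeps only the last two lines, so the server still decides what counts as interesting while the client only ships a small extract of the logfile.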
I hope this is enough of an overview for you. Good luck with your thesis.
Regards, Henrik