Hi Greg,
I've taken the liberty of sending this to the Xymon list also, since it is probably of general interest.
On Mon, Nov 02, 2009 at 11:39:09AM -0500, shea_greg at emc.com wrote:
> I'm having some trouble trying to figure out how to off-load RRD processing with the 4.3.0 code. I found hobbitd_locator and that's part of it, as well as hobbitd_channel, but it's not clear to me how to setup the master and peer(s). Also how does this affect the webpage generation?
> From earlier posts to the list, I have a single server running 4.2.0 with over 70000 RRD files and I'm experiencing serious delays in processing data and have to restart Hobbit every 15 minutes. One solution I'm aware of and also Buchan had mentioned is to add more spindles.
The standard 4.3.0 beta adds caching of RRD file updates, and this has a significant impact on the I/O load of the server - essentially, it means that hobbitd_rrd caches up to 12 updates (= 1 hour) before it does an actual update of the RRD file. Since the amount of disk I/O is almost identical whether you're doing one data update or 50, this caching eliminates about 90% of your disk I/O on the RRD files. So that would be the simplest solution to implement.
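To illustrate the caching idea described above, here is a minimal Python sketch. It is not the actual hobbitd_rrd code; the class name `UpdateCache`, the `flush_fn` callback and the method names are all hypothetical. The only figure taken from the text is the limit of 12 queued updates per file, which at one update every 5 minutes corresponds to the 1 hour mentioned.

```python
from collections import defaultdict


class UpdateCache:
    """Hypothetical write-behind cache: queue updates per RRD file, write in batches."""

    def __init__(self, flush_fn, max_pending=12):
        # flush_fn(filename, updates) stands in for the real RRD write (an assumption).
        self.flush_fn = flush_fn
        # 12 queued updates per file, i.e. about one hour at 5-minute intervals.
        self.max_pending = max_pending
        self.pending = defaultdict(list)

    def add(self, filename, update):
        """Queue one update; touch the disk only when the batch is full."""
        queue = self.pending[filename]
        queue.append(update)
        if len(queue) >= self.max_pending:
            self.flush(filename)

    def flush(self, filename):
        """Write all queued updates for one file in a single operation."""
        updates = self.pending.pop(filename, [])
        if updates:
            self.flush_fn(filename, updates)

    def flush_all(self):
        """Called at shutdown: every file with queued data needs one write."""
        for filename in list(self.pending):
            self.flush(filename)
```

With 12 updates per write, a file is written once per hour instead of 12 times, which is in line with the roughly 90% reduction in disk I/O. The `flush_all` step also shows why a shutdown is slow: every RRD file with queued data has to be written before the process can exit.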
I have 90000+ RRD files. I recently did a hardware upgrade of the server, but it isn't anything fancy - just a plain HP DL360 with a set of two 36 GB SCSI disks in hardware RAID-1. I used to off-load the RRD handling to another server, but it is now back on the main Xymon server. The amount of memory used for the cache isn't all that much - about 50 MB on my system.
The only downside to this is that shutting down Xymon means all of the cached data must be flushed to disk - and this takes a while, 10-15 minutes on my system.
Another optimization to eliminate disk I/O is to move the generated webpages to a RAM disk. I have ~hobbit/server/www/ on a RAM disk; the gifs, help, menu, notes, rep and snap sub-directories are symlinks that point to "real" (disk-based) storage. This means all of the webpages that are re-generated once a minute reside on a RAM disk, eliminating all of the disk I/O that rewriting them causes. And since they are regenerated so often, it doesn't matter that they're wiped out when you reboot the server - they are regenerated within a minute after you have Xymon up and running again.
But the remote RRD off-loading does work - I've used it for more than a year. Here's how to set it up.
First, webpage generation is unchanged: it still happens on the main Xymon server, and the fact that the RRD files are stored somewhere else is transparent.
The main server runs hobbitd_locator, which keeps track of where each host's RRD files are stored. The RRD server(s) only run hobbitd_rrd and a webserver.
On the main server, add these entries to your hobbitlaunch.cfg:
[locator]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
LOGFILE $BBSERVERLOGS/locator.log
NEEDS hobbitd
CMD hobbitd_locator --listen=0.0.0.0:9000
[netrrd-status]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
NEEDS locator
CMD hobbitd_channel --channel=status \
    --log=$BBSERVERLOGS/netrrd-status.log \
    --locator=127.0.0.1 \
    --service=rrd
[netrrd-data]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
NEEDS locator
CMD hobbitd_channel --channel=data \
    --log=$BBSERVERLOGS/netrrd-data.log \
    --locator=127.0.0.1 \
    --service=rrd
The locator listens on port 9000 - it is a UDP-based service (like DNS), so you may need to open up some firewalls to reach it.
On the RRD off-load servers, you run only the hobbitd_rrd module, with some additional options that tell it to listen for data from a network connection. Here's the hobbitlaunch.cfg entry, assuming your main Xymon server has IP 192.168.1.1 and the RRD off-load server has IP 192.168.1.2:
[netrrd-worker]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
CMD hobbitd_rrd \
    --log=$BBSERVERLOGS/netrrd-status.log \
    --rrddir=/var/lib/hobbit/rrd \
    --locator=192.168.1.1:9000 \
    --listen=192.168.1.2:9001 \
    --locatorid=192.168.1.2:9001 \
    --locatorextra=http://192.168.1.2/hobbit-cgi/
OK, this is a bit complicated - I'll try to explain what these options do.
hobbitd_locator needs to know that this RRD off-load server exists, and what hosts it is handling RRD files for. So hobbitd_rrd must announce itself to the locator, and the "--locator" option tells it how to contact the locator.
hobbitd_rrd receives data from the remote hobbitd_channel over a network connection, so the "--listen" option tells it what IP and port-number it will use to listen for incoming connections from hobbitd_channel.
The IP/portnumber that hobbitd_rrd listens on may not be the one that hobbitd_channel should use, because the RRD offload server could be hidden behind a NAT firewall or some other network-based address translation might be taking place. So the "--locatorid" option announces the IP+portnumber that hobbitd_channel should use to connect to the hobbitd_rrd service from the outside. Normally there is no NAT'ing, so "--listen" and "--locatorid" are identical.
Finally, the "--locatorextra" option tells the Xymon web-page tools what URL they should use when generating links to the Xymon graphs. Since the RRD files are no longer stored on the main Xymon server, you cannot access them via the same URL prefix that you use for all of the other Xymon webpages and CGIs, so this option supplies the URL prefix for the graphs instead. And yes, this means you will need to run a separate webserver on the RRD off-load server.
When hobbitd_rrd starts up with these options, it will first contact the locator and tell it "hey, I can handle RRD files - if someone wants to send me some RRD data, they can contact me on 192.168.1.2 port 9001. And please pass this information to anyone who asks for it: http://192.168.1.2/hobbit-cgi/". It then proceeds to scan the RRD directory to determine which hosts it has RRD files stored for, and for each host it then tells the locator "Hi, I am the RRD server on 192.168.1.2:9001, and I have RRD files for host foo.bar.com". After that, it just leans back and waits for someone to connect to it.
Over on the main Xymon server, the hobbitd_channel modules are receiving data about RRD updates. Each time a new message arrives, they'll ask the locator "where are the RRD files stored for host foo.bar.com?" If the locator knows, then it will respond with the IP:portnumber of the RRD server handling this host; if none of the known RRD servers handles this host (i.e. it is a new host) then it will just hand out the IP:portnumber of one of the RRD servers so new hosts can be added. When the hobbitd_channel module is told "send data for foo.bar.com to the RRD server at 192.168.1.2:9001" it will establish a TCP connection to that port (if it doesn't have one open already), and send the data to it.
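The register-then-lookup flow just described can be sketched in a few lines of Python. This is only a model of the bookkeeping, not the real hobbitd_locator (which speaks its own UDP protocol); the class and method names are hypothetical. How the real locator picks a server for a new host is not stated here, so the round-robin choice, and the fact that the choice is remembered, are assumptions.

```python
class Locator:
    """Toy model of the locator's bookkeeping (names are hypothetical)."""

    def __init__(self):
        self.servers = {}   # server id ("ip:port") -> extra info (the graph URL base)
        self.hosts = {}     # hostname -> server id
        self._next = 0      # round-robin counter (assumed selection policy)

    def register_server(self, server_id, extra):
        """An RRD server announces itself, like --locatorid / --locatorextra."""
        self.servers[server_id] = extra

    def register_host(self, server_id, hostname):
        """An RRD server reports a host it found in its RRD directory."""
        self.hosts[hostname] = server_id

    def query(self, hostname):
        """Return (server id, extra) for a host, or None if no server is known."""
        if hostname not in self.hosts:
            if not self.servers:
                return None
            # New host: hand out one of the known servers (assumed round-robin)
            # and remember the choice so later updates go to the same place.
            ids = sorted(self.servers)
            self.hosts[hostname] = ids[self._next % len(ids)]
            self._next += 1
        server_id = self.hosts[hostname]
        return server_id, self.servers[server_id]
```

For example, after `register_server("192.168.1.2:9001", "http://192.168.1.2/hobbit-cgi/")` and `register_host("192.168.1.2:9001", "foo.bar.com")`, a `query("foo.bar.com")` returns that server id together with the URL base.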
When hobbitd_rrd receives a new connection, it spawns an extra process to handle the connection, which receives the data and then does the actual RRD update.
The connections between hobbitd_channel and the RRD offload-server are persistent, so once it is up and running you'll see two connections to your RRD offload server; one for each of the hobbitd_channel instances.
The final piece of the puzzle is when you view the detailed status-log on the webpage, and the graph must show up on that page. The hobbitsvc.cgi utility will ask the locator "where are the RRD files for host foo.bar.com?" and get a response that includes the extra data that the locator was asked to pass on to anyone who asked. hobbitsvc.cgi knows that this data is the base of the CGI-URL for the RRD graph CGI, so instead of generating a link to the image URL on the main Xymon webserver, it generates a link that points to the RRD off-load server. The browser contacts the webserver running on the RRD-server, and the image is generated by the RRD-server.
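As a rough illustration of that last step, here is a small Python sketch of how a graph link could be assembled from the URL base returned by the locator. It is not the hobbitsvc.cgi code; the function name is hypothetical, and the script name `hobbitgraph.sh` and the `host` / `service` query parameters are assumptions used as placeholders, so check them against your own installation.

```python
from urllib.parse import urlencode


def graph_url(cgi_base, hostname, service, script="hobbitgraph.sh"):
    """Build a graph link from the locator's extra data.

    cgi_base is the value given via --locatorextra; the script name and the
    query parameter names are placeholders, not confirmed by the text above.
    """
    base = cgi_base if cgi_base.endswith("/") else cgi_base + "/"
    return base + script + "?" + urlencode({"host": hostname, "service": service})


# The link points at the RRD off-load server, not the main Xymon server.
print(graph_url("http://192.168.1.2/hobbit-cgi/", "foo.bar.com", "la"))
```

The point is only that the host part of the link comes from the locator's answer, so the browser fetches the graph image directly from the RRD server.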
I hope that is enough to get you going.
Regards, Henrik