Optimizing Xymon disk performance (was: Moving RRD processing to another server)
Hi Greg,
I've taken the liberty of sending this to the Xymon list also, since it is probably of general interest.
On Mon, Nov 02, 2009 at 11:39:09AM -0500, shea_greg at emc.com wrote:
I'm having some trouble trying to figure out how to off-load RRD processing with the 4.3.0 code. I found hobbitd_locator and that's part of it, as well as hobbitd_channel, but it's not clear to me how to setup the master and peer(s). Also how does this affect the webpage generation?
From earlier posts to the list, I have a single server running 4.2.0 with over 70000 RRD files and I'm experiencing serious delays in processing data and have to restart Hobbit every 15 minutes. One solution I'm aware of and also Buchan had mentioned is to add more spindles.
The standard 4.3.0 beta adds caching of RRD file updates, and this has a significant impact on the I/O load of the server - essentially, it means that hobbitd_rrd caches up to 12 updates (= 1 hour) before it does an actual update of the RRD file. Since the amount of disk I/O is almost identical whether you're doing one data update or 50, this caching eliminates about 90% of your disk I/O on the RRD files. So that would be the simplest solution to implement.
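The "about 90%" figure can be checked with simple arithmetic. This is only a rough sketch: it assumes that one flush to an RRD file costs about the same disk I/O no matter how many cached updates it carries, and the function name is made up for illustration.

```python
def io_reduction(updates_per_flush: int) -> float:
    """Fraction of RRD file writes avoided when updates are batched.

    Assumption: one flush costs about the same disk I/O as a single
    update, as described in the paragraph above.
    """
    if updates_per_flush < 1:
        raise ValueError("updates_per_flush must be at least 1")
    return 1.0 - 1.0 / updates_per_flush


# 12 cached updates per flush (= 1 hour, i.e. one update every 5 minutes)
print(f"{io_reduction(12):.1%}")  # prints 91.7%
```

With 12 updates per flush, 11 out of 12 writes are avoided, which is where the roughly 90% comes from.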
I have 90000+ RRD files. I recently did a hardware upgrade of the server, but it isn't anything fancy - just a plain HP DL360 with a set of two 36 GB SCSI disks in hardware RAID-1. I used to off-load the RRD handling to another server, but it is now back on the main Xymon server. The amount of memory used for the cache isn't all that much - about 50 MB on my system.
The only downside to this is that shutting down Xymon means all of the cached data must be flushed to disk - and this takes a while, 10-15 minutes on my system.
Another optimization to eliminate disk I/O is to move the generated webpages to a RAM disk. I have ~hobbit/server/www/ on a RAM disk; the gifs, help, menu, notes, rep and snap sub-directories are symlinks that point to "real" (disk-based) storage. This means all of the webpages that are regenerated once a minute reside on a RAM disk, eliminating all of the disk I/O that rewriting them causes. And since they are regenerated so often, it doesn't matter that they're wiped out when you reboot the server - they are regenerated within a minute after you have Xymon up and running again.
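The symlink layout could be created along these lines. This is a minimal sketch, not part of Xymon: the function name and the throw-away demo directories are hypothetical, and mounting the RAM disk itself is outside the sketch. Only the six sub-directory names come from the description above.

```python
import os
import tempfile
from typing import List

# Sub-directories holding static content that must survive a reboot
# (names taken from the description above).
PERSISTENT_SUBDIRS = ["gifs", "help", "menu", "notes", "rep", "snap"]


def link_persistent_dirs(www_dir: str, disk_dir: str) -> List[str]:
    """Create symlinks in www_dir (on the RAM disk) pointing to
    same-named directories under disk_dir (on real disk).

    Returns the names of the links that were created.
    """
    created = []
    for name in PERSISTENT_SUBDIRS:
        target = os.path.join(disk_dir, name)
        link = os.path.join(www_dir, name)
        os.makedirs(target, exist_ok=True)
        if not os.path.lexists(link):  # leave existing entries alone
            os.symlink(target, link)
            created.append(name)
    return created


if __name__ == "__main__":
    # Demonstration with temporary directories instead of real paths
    with tempfile.TemporaryDirectory() as ram, \
            tempfile.TemporaryDirectory() as disk:
        print(link_persistent_dirs(ram, disk))
```

Since the RAM disk is empty after a reboot, something like this would have to run before Xymon starts.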
But the remote RRD off-loading works, I've used it for more than a year. Here's how to set it up.
First, webpage generation is unchanged: it still happens on the main Xymon server, and the fact that the RRD files are stored somewhere else is transparent.
The main server runs hobbitd_locator, which keeps track of where the RRD files for each host are stored. The RRD server(s) only run hobbitd_rrd and a webserver.
On the main server, add these entries to your hobbitlaunch.cfg:
[locator]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
LOGFILE $BBSERVERLOGS/locator.log
NEEDS hobbitd
CMD hobbitd_locator --listen=0.0.0.0:9000
[netrrd-status]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
NEEDS locator
CMD hobbitd_channel --channel=status --log=$BBSERVERLOGS/netrrd-status.log --locator=127.0.0.1 --service=rrd
[netrrd-data]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
NEEDS locator
CMD hobbitd_channel --channel=data --log=$BBSERVERLOGS/netrrd-data.log --locator=127.0.0.1 --service=rrd
The locator listens on port 9000 - it is a UDP based service (like DNS), so you may need to open up some firewalls to reach it.
On the RRD offload-servers, you run only the hobbitd_rrd modules with some additional options that tell it to listen for data from a network connection. Here's the hobbitlaunch.cfg entry, assuming your main Xymon server has IP 192.168.1.1 and the RRD off-load server has IP 192.168.1.2:
[netrrd-worker]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
CMD hobbitd_rrd --log=$BBSERVERLOGS/netrrd-status.log --rrddir=/var/lib/hobbit/rrd --locator=192.168.1.1:9000 --listen=192.168.1.2:9001 --locatorid=192.168.1.2:9001 --locatorextra=http://192.168.1.2/hobbit-cgi/
OK, this is a bit complicated - I'll try to explain what these options do.
hobbitd_locator needs to know that this RRD offload-server exists, and which hosts it is handling RRD files for. So hobbitd_rrd must announce itself to the locator, and the "--locator" option tells it how to contact the locator.
hobbitd_rrd receives data from the remote hobbitd_channel over a network connection, so the "--listen" option tells it what IP and port-number it will use to listen for incoming connections from hobbitd_channel.
The IP/portnumber that hobbitd_rrd listens on may not be the one that hobbitd_channel should use, because the RRD offload server could be hidden behind a NAT firewall or some other network-based address translation might be taking place. So the "--locatorid" option announces the IP+portnumber that hobbitd_channel should use to connect to the hobbitd_rrd service from the outside. Normally there is no NAT'ing, so "--listen" and "--locatorid" are identical.
Finally, the "--locatorextra" option tells the Xymon web-page tools what URL they should use when generating links to the Xymon graphs. Since the RRD files are no longer stored on the main Xymon server, you cannot access them via the same URL prefix that you use for all of the other Xymon webpages and CGIs, so this option tells Xymon what the URL is for the graphs. And yes, this means you will need to run a separate webserver on the RRD off-load server.
When hobbitd_rrd starts up with these options, it will first contact the locator and tell it "hey, I can handle RRD files - if someone wants to send me some RRD data, they can contact me on 192.168.1.2 port 9001. And please pass this information to anyone who asks for it: http://192.168.1.2/hobbit-cgi/". It then proceeds to scan the RRD directory to determine which hosts it has RRD files stored for, and for each host it then tells the locator "Hi, I am the RRD server on 192.168.1.2:9001, and I have RRD files for host foo.bar.com". After that, it just leans back and waits for someone to connect to it.
Over on the main Xymon server, the hobbitd_channel modules are receiving data about RRD updates. Each time a new message arrives, they'll ask the locator "where are the RRD files stored for host foo.bar.com?" If the locator knows, then it will respond with the IP:portnumber of the RRD server handling this host; if it knows that none of the known RRD servers handles this host (i.e. it is a new host), then it will just hand out the IP:portnumber of one of the RRD servers so new hosts can be added. When the hobbitd_channel module is told "send data for foo.bar.com to the RRD server at 192.168.1.2:9001", it will establish a TCP connection to that port (if it doesn't have one open already) and send the data to it.
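The lookup behaviour described above can be modelled in a few lines. This is a toy in-memory sketch, not the real hobbitd_locator: the class and method names are invented, there is no network code, and two details are assumptions because the description does not specify them - new hosts are handed to the servers in turn, and the choice is remembered. The host new.bar.com is made up for the demonstration.

```python
from typing import Dict, List


class LocatorSketch:
    """Toy model of the locator lookups (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.servers: List[str] = []            # "ip:port" of each RRD server
        self.host_to_server: Dict[str, str] = {}
        self._new_hosts = 0                     # counter for taking turns

    def register_server(self, server_id: str) -> None:
        """An RRD server announces itself (cf. --locatorid)."""
        if server_id not in self.servers:
            self.servers.append(server_id)

    def register_host(self, server_id: str, hostname: str) -> None:
        """An RRD server reports that it holds RRD files for a host."""
        self.host_to_server[hostname] = server_id

    def query(self, hostname: str) -> str:
        """Return the "ip:port" that should receive data for hostname."""
        if hostname in self.host_to_server:
            return self.host_to_server[hostname]
        if not self.servers:
            raise LookupError("no RRD server has registered yet")
        # Assumption: pick servers in turn and remember the choice
        server_id = self.servers[self._new_hosts % len(self.servers)]
        self._new_hosts += 1
        self.host_to_server[hostname] = server_id
        return server_id


locator = LocatorSketch()
locator.register_server("192.168.1.2:9001")
locator.register_host("192.168.1.2:9001", "foo.bar.com")
print(locator.query("foo.bar.com"))  # known host
print(locator.query("new.bar.com"))  # new host, handed to a server
```

With a single RRD server registered, both queries return 192.168.1.2:9001; with several servers, only the handling of new hosts would differ.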
When hobbitd_rrd receives a new connection, it spawns an extra process to handle the connection, which receives the data and then does the actual RRD update.
The connections between hobbitd_channel and the RRD offload-server are persistent, so once it is up and running you'll see two connections to your RRD offload server; one for each of the hobbitd_channel instances.
The final piece of the puzzle is when you view the detailed status-log on the webpage, and the graph must show up on that page. The hobbitsvc.cgi utility will ask the locator "where are the RRD files for host foo.bar.com?" and get a response that includes the extra data that the locator was asked to pass on to anyone who asked. hobbitsvc.cgi knows that this data is the base of the CGI-URL for the RRD graph CGI, so instead of generating a link to the image URL on the main Xymon webserver, it generates a link that points to the RRD off-load server. The browser contacts the webserver running on the RRD-server, and the image is generated by the RRD-server.
I hope that is enough to get you going.
Regards, Henrik
Henrik Størner wrote:
<snip>
The standard 4.3.0 beta adds caching of RRD file updates, and this has a significant impact on the I/O load of the server - essentially, it means that hobbitd_rrd caches up to 12 updates (= 1 hour) before it does an actual update of the RRD file. Since the amount of disk I/O is almost identical whether you're doing one data update or 50, this caching eliminates about 90% of your disk I/O on the RRD files. So that would be the simplest solution to implement.
I am running 4.3.0-0.20080103 so quite an out of date version. However I would be interested in learning this. Would I have to upgrade my hobbit / xymon installation to get this feature? What version contains it? How do I enable it?
Currently I have > 100,000 rrd files and the server sits with between 20% and 50% i/o wait. Performance is acceptable I guess but anything to improve this would be welcome.
Cheers
Iain
To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk
In <4AF006C5.1000907 at shihad.org> Iain M Conochie <iain at shihad.org> writes:
I am running 4.3.0-0.20080103 so quite an out of date version. However I would be interested in learning this. Would I have to upgrade my hobbit / xymon installation to get this feature? What version contains it? How do I enable it?
I would really recommend upgrading to the current software in the Subversion tree http://hobbitmon.svn.sourceforge.net/viewvc/hobbitmon/branches/4.3.0/ and just install that.
You could probably get away with just copying the hobbitd/hobbitd_rrd and web/hobbitsvc.cgi binaries from this version on top of your current setup, but I cannot say for certain that it will work.
Regards, Henrik
I have some custom data that I collect from different hosts, and I generate RRDs from it. One problem I have is that not all hosts have the same data. Example: host1 has the series data1, data2 and data3, but host2 has only data1 and data2. If I define all three data series in hobbit-graphs.cfg, the graph fails for host2, since its RRD doesn't contain data3.
Is there a way to use variables in hobbit-graphs.cfg? I would like something like this to be possible:
DEF:$data=file.rrd:(data\d):AVERAGE
LINE2:$data#@COLOR@:$data
The @COLOR@ part is working.
Has anyone done something like this?
Øyvind
hi
You could probably get away with just copying the hobbitd/hobbitd_rrd and web/hobbitsvc.cgi binaries from this version on top of your current setup, but I cannot say for certain that it will work.
You have to update hobbitgraph.cgi too. It seems to work fine.
Setup: 148795 RRD files, 8000 devices, HA with DRBD.
Load: 7 without this hack, 1.47 now. I/O wait: 50% without this hack, 0% now.
Henrik, you rock!
oau
----- Original Message ----- From: "Henrik Størner" <henrik at hswn.dk> To: hobbit at hswn.dk Sent: Wednesday, 4 November 2009 11:21:51 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna Subject: Re: [hobbit] Optimizing Xymon disk performance
hi
And what about disabling BBHOSTHISTLOG?
As the man page says, the file it controls is not used by any of the standard tools:
BBHOSTHISTLOG This environment variable controls if the $BBHIST/HOSTNAME logfile is updated. This file holds a list of all status changes seen for a single host, but is not used by any of the standard Hobbit tools. If you do not want to save this, you can disable it by setting BBHOSTHISTLOG=FALSE.
Best regards
oau
----- Original Message ----- From: "Henrik Størner" <henrik at hswn.dk> To: "shea greg" <shea_greg at emc.com> Cc: hobbit at hswn.dk Sent: Monday, 2 November 2009 23:17:52 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna Subject: [hobbit] Optimizing Xymon disk performance (was: Moving RRD processing to another server)
participants (4)
- henrik@hswn.dk
- iain@shihad.org
- olivier@audry.fr
- oyvind.bjorge@telenor.com