Hello,
I currently manage a XyMon server that monitors around 5,000 devices, and we expect to keep adding more over time. The issue is that the system always uses the maximum of whatever resources we allocate to it. BTW, devmon seems to be the biggest resource hog.
Stats:
- About 5,000 devices
- XyMon 4.3.0-0-beta2
- DevMon 0.2
- 4 CPU(s)
- 16 GB RAM
- Suse Linux Enterprise 10 (32-Bit)
- 300GB Enterprise SAN Storage - Fiber Channel - (3 Years of archived data stored)
Would it do me any good to give the system more resources, such as CPU(s) or RAM?
What is the experience that other users have with monitoring this many devices?
What system configurations are you using to support this many monitored hosts?
Sorry to ask so many questions, but I was wondering what options we have before making any big decisions.
Matt
Hi Matthew,
On 22-12-2011 00:27, Matthew Neumark wrote:
Currently I'm managing a XyMon server which consists of around 5,000 devices. We are looking to keep continually adding more and more devices to it as time goes on. The issue is our system is currently always using max system resources we keep allocating to the server. BTW devmon seems to be the highest system resource hog. Stats: About 5,000 devices; XyMon 4.3.0-0-beta2; DevMon 0.2; 4 CPU(s); 16 GB RAM; Suse Linux Enterprise 10 (32-Bit); 300GB Enterprise SAN Storage - Fiber Channel - (3 Years of archived data stored). Would it do me any good to give the system more resources? CPU(s) or RAM? What is the experience that other users have with monitoring this many devices? What system configurations are you using to support this many monitored hosts?
Your installation is about the same size as the one I have at work. I recently upgraded it because it could no longer keep up with the load, and based on that I would say that your hardware specs should be more than adequate to handle the number of devices.
The only real difference between your system and mine is that I changed to SSD disks for storing the RRD files (graphs). I don't know how your FC disks compare with SSDs, but I could certainly see a significant effect of that change: when stopping Xymon, it used to take 15-20 minutes for xymond_rrd to flush all of the cached RRD updates to disk, but after changing to the SSD disks it only takes a few seconds. The interesting thing, of course, is how long they will last, since the number of write operations is limited on these devices; I plan to replace them once a year to be on the safe side.
Have a look at your vmstat1 graph for the Xymon server (on the "trends" status page), and see how much time is being spent in I/O wait state. If that is in the 20-25% range, then you probably have a problem with I/O bandwidth, and adding an SSD disk could help. (I say 20-25% because, as far as I know, Linux sends all I/O operations through one CPU, so if you have 4 CPUs and one of them is fully busy doing I/O, then it will show up in vmstat as I/O wait taking up 1/4 of the time.)
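For a quick check outside the graphs, the same I/O wait share can be estimated from two samples of the aggregate "cpu" line in /proc/stat. The following is only a minimal Python sketch, assuming a Linux host; the function names and the 5-second default interval are illustrative choices and are not part of Xymon.

```python
import time


def parse_cpu_line(line):
    """Return the jiffy counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in I/O wait between two samples.

    Only the first eight counters are summed (user, nice, system, idle,
    iowait, irq, softirq, steal); guest time is already included in user.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def read_cpu_counters(path="/proc/stat"):
    """Read the counters from the first line of /proc/stat (all CPUs combined)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def sample_iowait(interval=5.0):
    """Take two samples 'interval' seconds apart and return the I/O wait share."""
    before = read_cpu_counters()
    time.sleep(interval)
    after = read_cpu_counters()
    return iowait_percent(before, after)
```

Calling sample_iowait() on the Xymon server returns a single percentage averaged over all CPUs, so on a 4-CPU machine a value near 25 would correspond to the situation described above.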
Is there any swap being used? (Check the "free" output.) I wouldn't expect much swapping with 16 GB of RAM, so more RAM probably will not help.
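The swap figure that "free" reports can also be read from /proc/meminfo, which is handy for a scripted check. This is a small Python sketch under the assumption of a Linux host; the helper names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a 'Key: value' line
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def swap_used_kb(info):
    """Swap in use, in kB, computed as SwapTotal minus SwapFree."""
    return info["SwapTotal"] - info["SwapFree"]


def read_swap_used_kb(path="/proc/meminfo"):
    """Read /proc/meminfo and return the amount of swap currently in use."""
    with open(path) as f:
        return swap_used_kb(parse_meminfo(f.read()))
```

A result at or near zero from read_swap_used_kb() would support the expectation that memory is not the bottleneck.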
I've never used Devmon, so I don't know how much of a "hog" it is. If it really is the one using all of the resources, a solution (or workaround, really) might be to split the Devmon load between more servers. You can still have them report their data to the same Xymon server; you would only move the running of Devmon to different nodes.
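One way to decide which node polls which device is to derive the node from a hash of the hostname, so each node can work out its own share from the same host list. The Python sketch below is purely hypothetical: it does not describe how Devmon itself distributes work, and the function names and node numbering are assumptions made for illustration.

```python
import hashlib


def assign_node(hostname, node_count):
    """Map a hostname to a node index in the range 0..node_count-1."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # SHA-1 is used only as a stable hash, not for security.
    digest = hashlib.sha1(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def partition_hosts(hostnames, node_count):
    """Split the host list into one list per node."""
    shards = [[] for _ in range(node_count)]
    for host in hostnames:
        shards[assign_node(host, node_count)].append(host)
    return shards
```

With a fixed number of nodes, adding a new device does not move any existing device to another node, because each assignment depends only on the hostname. Changing the number of nodes does reshuffle most assignments.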
Just for the record, my current system is an HP DL380 G7 with 2 dual-core 2.4 GHz CPUs, 24 GB RAM, 6x300 GB SAS 10K disks in a RAID-1 configuration, and 2 64 GB SSD disks in RAID-1. It is currently handling about 4200 servers with clients installed, and an additional 3000 entries in hosts.cfg for network devices, websites etc. All in all, 50000 statuses are being tracked. On average, the CPU is 6% busy. To be fair, I must add that I have most of the network tests running on another node (for ease of firewall setup, mostly), and that node is 15% busy.
Regards, Henrik