Before I go inventing something, I want to find out whether anyone has already done this.
We have a lot of virtual Linux hosts (VMs on an ESX farm), and we monitor all of them with Xymon. When there is a widespread problem (as there was this past weekend), the virtualization team would like a report, built from Xymon historical data, showing which VMs top the list for a given metric such as CPU load. (Yes, there are ESX-based tools, but they have not spent the $$ to put them on all of the servers.) I pointed the team manager to the metrics report in Xymon and he was impressed, but he doesn't want to look through a graph containing plots for a few hundred hosts to find the top 10.
So, I'm looking at writing a script to mine the RRD or history data on the Xymon server to produce the list he wants. He is also interested in the top disk I/O numbers, but I'm focusing on load average for now.
He says he just wants an average for each host over the 48 hours of the weekend, which is when we usually see problems.
Has anyone done this or something like it? I don't see anything built into Xymon that gets close, so I was looking at rrdtool fetch. However, this is cumbersome and, frankly, I don't understand the data I'm getting back. For example, 1.22749483e+03 seems to correspond to 12.27... when I compare it to the graphs, so the e+03 seems to really mean *10^1, right?
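To make the question concrete, here is a rough Python sketch of the kind of script I have in mind. It is not tested against a real Xymon server. It rests on two assumptions: that `rrdtool fetch` prints lines of the form `timestamp: value`, and that the load average RRD holds the load multiplied by 100. The second assumption would explain the number above: 1.22749483e+03 is ordinary scientific notation for 1227.49, and 1227.49 / 100 = 12.27. The function names and the `scale` parameter are made up for illustration.

```python
import math


def parse_fetch(text):
    """Return the numeric samples from `rrdtool fetch` text output.

    Header lines (data source names), blank lines, and NaN rows are skipped.
    """
    values = []
    for line in text.splitlines():
        timestamp, sep, rest = line.partition(":")
        if not sep or not timestamp.strip().isdigit():
            continue  # not a "timestamp: value" row
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])  # first data source only
        except ValueError:
            continue
        if not math.isnan(value):
            values.append(value)
    return values


def mean_load(text, scale=100.0):
    """Average the samples and divide by `scale`.

    scale=100 is an ASSUMPTION about how the load is stored in the RRD.
    Returns None when there are no usable samples.
    """
    values = parse_fetch(text)
    if not values:
        return None
    return sum(values) / len(values) / scale


def top_hosts(fetch_output_by_host, n=10, scale=100.0):
    """Rank hosts by mean load, highest first, and keep the top n."""
    averages = {}
    for host, text in fetch_output_by_host.items():
        avg = mean_load(text, scale)
        if avg is not None:
            averages[host] = avg
    ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

The text for each host would come from running something like `rrdtool fetch <rrd dir>/<host>/la.rrd AVERAGE --start -48h` (for example through `subprocess.run`), where the directory layout and the `la.rrd` file name are my guesses and would need checking on the server. Hosts with no usable samples in the window are left out of the ranking rather than counted as zero.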
But I ramble. Thanks for any help.
Steve Holmes Purdue