top ten list of servers wrt cpu load
Before I go inventing something I want to find out if anyone already has done this.
We have a lot of virtual Linux hosts (VMs on an ESX farm). We monitor all of them with Xymon. When there is a widespread problem (as there was this past weekend), the virtualization team would like to have a report on which VMs top the list of, for example, CPU load from Xymon historical data. (Yes, there are ESX-based tools, but they have not spent the $$ to put them on all of the servers.) I pointed the team manager to the metrics report in Xymon and he was impressed, but he doesn't want to have to look at a graph containing plots for a few hundred hosts to find the top 10.
So, I'm looking at writing a script to mine the rrd or history data from the Xymon server to produce the list he wants. He is also interested in the top disk I/O numbers, but I'm focusing on load average for now.
He says he just wants an average for each host over the 48 hours of the weekend, which is when we usually see problems.
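A minimal sketch of how that 48-hour window could be pinned down, assuming "the weekend" means Saturday 00:00 to Monday 00:00 local time. The function names are hypothetical, and the resulting epoch values are meant for the start and end arguments of an RRD query.

```python
from datetime import datetime, timedelta


def last_weekend(now):
    """Return (start, end) of the most recent completed weekend.

    Assumption: the weekend runs from Saturday 00:00 to Monday 00:00
    in local time, which is exactly 48 hours.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # weekday() is 0 for Monday, so this steps back to the latest Monday 00:00.
    end = midnight - timedelta(days=midnight.weekday())
    start = end - timedelta(days=2)
    return start, end


def to_epoch(moment):
    """Convert a naive local datetime to whole epoch seconds."""
    return int(moment.timestamp())
```

Called on a Monday evening, `last_weekend` returns the Saturday and Monday midnights that have just passed; called mid-weekend, it returns the previous weekend, because the current one is not yet complete.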
Has anyone done this or something like it? I don't see anything already built into Xymon that gets close, so I was looking at rrdtool fetch. However, this is cumbersome and, frankly, I'm not understanding the data I'm getting back (for example, 1.22749483e+03 seems to be 12.27... when I compare it to the graphs, so the e+03 seems to really mean *10^1, right?)
But I ramble. Thanks for any help.
Steve Holmes Purdue
On 19 March 2013 06:55, Steve Holmes <sholmes42 at mac.com> wrote:
So, I'm looking at writing a script to mine the rrd or history data from the Xymon server to produce the list he wants. He is also interested in the top disk I/O numbers, but I'm focusing on load average for now.
Sounds useful. I've not seen anything that does this already.
I don't see anything already built into Xymon that gets close, so I was looking at rrdtool fetch. However, this is cumbersome and, frankly, I'm not understanding the data I'm getting back (for example, 1.22749483e+03 seems to be 12.27... when I compare it to the graphs, so the e+03 seems to really mean *10^1, right?)
Nope, 1.227e+03 means 1227. However, sometimes an adjustment is applied that's not always obvious. For example, my understanding is that the load average (in la.rrd) is recorded after multiplying by 100, so a stored 1227 corresponds to a load of 12.27. This is an artefact of the BigBrother legacy: floating-point comparisons were difficult to implement in a generic shell script that had to run on any *nix platform, so the BigBrother data collector would chop everything after two decimal places, then strip the dot out, giving a load average scaled up by 100. You can tell this is what's happening in Xymon by looking at the [la] entry in graphs.cfg, or to save you looking it up:
DEF:avg=la.rrd:la:AVERAGE
CDEF:la=avg,100,/
So the graphs.cfg entry scales it back down before graphing.
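The same rescaling can be applied to raw fetch output. The following is a sketch only. It assumes the usual text output of `rrdtool fetch la.rrd AVERAGE` (a header line naming the data source, then one `timestamp: value` line per sample, with `nan` where data is missing) and that the divisor of 100 from the CDEF above applies. The function names are hypothetical.

```python
import math


def parse_fetch_output(text):
    """Yield (timestamp, value) pairs from rrdtool fetch text output.

    Only the first data source on each line is read. Header lines,
    blank lines and NaN samples are skipped.
    """
    for line in text.splitlines():
        stamp, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue  # header or blank line
        try:
            timestamp = int(stamp)
            value = float(fields[0])  # "1.2274948300e+03" parses as 1227.49483
        except ValueError:
            continue
        if math.isnan(value):
            continue  # no sample recorded for this interval
        yield timestamp, value


def mean_load(text, scale=100.0):
    """Average the samples and undo the x100 storage factor.

    Returns None when the output contains no usable samples.
    """
    values = [value for _, value in parse_fetch_output(text)]
    if not values:
        return None
    return sum(values) / len(values) / scale
```

With a single sample of 1.2274948300e+03, `mean_load` returns roughly 12.27, which matches what the graph shows.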
Similar adjustments are made for things like interface load and TCP/IP stats, where bytes-per-second are converted to bits-per-second. Again, the graphs.cfg file often gives you a clue as to what's going on.
J
Wherever you go, there you are.
Ah, yes, of course! Thanks Steve.
Attached is the Perl script I came up with. It serves my purpose and might be useful to someone. $fudge will have to be expanded for other measures. I may do that if someone here at Purdue requires it. Otherwise, have at it.
Steve Holmes Purdue
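For readers without the attachment, here is a rough Python sketch of the general approach discussed in this thread. It is not the attached Perl script. It assumes one directory per host under an RRD root, each containing la.rrd, and the `SCALE` table is a hypothetical stand-in for per-measure divisors such as the 100 used for load average. The function names are illustrative too.

```python
import heapq
import subprocess
from pathlib import Path

# Hypothetical per-measure divisors; only the la value comes from this thread.
SCALE = {"la": 100.0}


def fetch_average(rrd_path, start, end, scale):
    """Run rrdtool fetch over [start, end] and return the scaled mean."""
    result = subprocess.run(
        ["rrdtool", "fetch", str(rrd_path), "AVERAGE",
         "--start", str(start), "--end", str(end)],
        capture_output=True, text=True, check=True,
    )
    values = []
    for line in result.stdout.splitlines():
        _, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue  # header or blank line
        try:
            value = float(fields[0])
        except ValueError:
            continue
        if value == value:  # NaN is the only float not equal to itself
            values.append(value)
    return sum(values) / len(values) / scale if values else None


def collect(rrd_root, start, end, measure="la"):
    """Map each host directory name to its average for the measure."""
    averages = {}
    for rrd in Path(rrd_root).glob(f"*/{measure}.rrd"):
        averages[rrd.parent.name] = fetch_average(
            rrd, start, end, SCALE[measure])
    return averages


def top_hosts(averages, n=10):
    """Return the n (host, average) pairs with the highest averages."""
    ranked = [(host, avg) for host, avg in averages.items() if avg is not None]
    return heapq.nlargest(n, ranked, key=lambda pair: pair[1])
```

A call such as `top_hosts(collect(rrd_root, start, end))` would then give the top ten, where `rrd_root` is wherever the Xymon server keeps its per-host RRD directories and `start` and `end` are epoch seconds bounding the weekend. Hosts with no data in the window are dropped rather than ranked as zero.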
participants (3)
- jlaidman@rebel-it.com.au
- sholmes42@gmail.com
- sholmes42@mac.com