My old BB setup has a customized vmstat-larrd.pl script which allows for variations in vmstat output based on the version of procps. In other words, it compensates for the fact that RHEL3 and old Red Hat Linux systems have vmstat output whose column ordering doesn't match up.
So I'd like to bring some of those vmstat changes into my new hobbit setup, most notably the ability to plot CPU wait for IO (wa) alongside user, system & idle time, but poking around in hobbitgraph.cfg doesn't reveal an easy way to do it.
Any tips on how I might accomplish this?
Tom
In <41F53C34.4080008 at nandomedia.com> Tom Georgoulias <tgeorgoulias at nandomedia.com> writes:
My old BB setup has a customized vmstat-larrd.pl script which allows for variations in vmstat output based on the version of procps. In other words, it compensates for the fact that RHEL3 and old Red Hat Linux systems have vmstat output whose column ordering doesn't match up.
Hobbit knows these two layouts of the vmstat data as "linux" and "debian3", the latter being the one for the older linux versions (essentially, systems running vmstat for a Linux 2.2 kernel).
So I'd like to bring some of those vmstat changes into my new hobbit setup, most notably the ability to plot CPU wait for IO (wa) alongside user, system & idle time, but poking around in hobbitgraph.cfg doesn't reveal an easy way to do it.
Any tips on how I might accomplish this?
As with LARRD, there are two steps: Collecting the data, and displaying them.
Data collection is handled by hobbitd_larrd. The interesting stuff here is in hobbitd/do_larrd.c, and hobbitd/larrd/*.c . do_larrd.c determines which function should parse an incoming status or data message, by looking at the test name in the "status" or "data" message. It also consults the LARRDS environment variable, e.g. to figure out that "ftp" is handled by the "tcp" parser. Each type of RRD file then has its own little routine in one of the hobbitd/larrd/*.c files to pick out the interesting data, and put it into the RRD file.
Where do you get the I/O wait information from?
Data display is handled by hobbitgraph.cgi, and the config file hobbitgraph.cfg. This is very similar to the looong set of definitions in larrd-grapher.cgi, except that you need not worry about hostnames in the RRD files, because Hobbit keeps all RRD files for a given host in a separate directory. So e.g. the "vmstat" graph can just get the CPU idle-time value with
DEF:cpu_idl=vmstat.rrd:cpu_idl:AVERAGE
i.e. grab the "vmstat.rrd" file, and extract the current average value of the "cpu_idl" dataset.
You can mix values from different RRD files in the same graph, e.g. the "vmstat2" graph uses both the "vmstat.rrd" file and the "la.rrd" file:
DEF:avg=la.rrd:la:AVERAGE
CDEF:la=avg,100,/
DEF:cpu_idl=vmstat.rrd:cpu_idl:AVERAGE
CDEF:cpu_idl2=cpu_idl,100,/
If you have more questions, please ask. And if you have something that could be of interest to others, I'll be happy to include it with Hobbit.
Regards, Henrik
How difficult is it to add custom graphs? For example we have several Oracle standby databases that "resync" (import binary database changelogs) for several hours every night. I have a perl script which parses one of the logfiles created by this process, and gets values like the total time it took to import each binary log. Having a graph of the average time it takes to process would be VERY handy for metric and scaling evaluations.
The long story short is that I have a script or process that outputs numbers that I want to graph with Hobbit+LARRD. Is it possible? Or, I should be asking, how difficult would it be, and can you give any pointers on what to do.
Thanks,
-Charles
On Mon, Jan 24, 2005 at 03:14:54PM -0700, Charles Jones wrote:
How difficult is it to add custom graphs?
Well, it does require some programming - it's an advantage if you are familiar with C, since that is the language Hobbit is written in. I don't think it's terribly hard, but I am biased.
For example we have several Oracle standby databases that "resync" (import binary database changelogs) for several hours every night. I have a perl script which parses one of the logfiles created by this process, and gets values like the total time it took to import each binary log. Having a graph of the average time it takes to process would be VERY handy for metric and scaling evaluations.
The hard part usually is getting the data, and you have that already with your perl script.
Next step is getting the data to the Hobbit server. This is easy; decide on a unique name for the type of data you're handling - e.g. "orasync" - and use the "bb" utility to send it off as a "data" message to Hobbit. E.g. the following script runs your perl script, stores the output in a temporary file, and uses the "bb" utility from a Big Brother client installation to send this datafile to Hobbit in a "data" message:
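#!/bin/sh
# Run the perl script and save its output
/foo/perlscript >/tmp/datafile
# Load the Big Brother client settings ($BB, $BBDISP, $MACHINE)
BBHOME=/usr/local/bbc
export BBHOME
. $BBHOME/etc/bbdef.sh
# Send the file contents to Hobbit as a "data" message
$BB $BBDISP "data $MACHINE.orasync
`cat /tmp/datafile`
"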
Now the fun bit starts. Hobbit will automatically pass data-messages to all tasks monitoring the "data" channel. hobbitd_larrd is one of them, so you can either add some code to this "worker module", or you can create your own module from scratch using your favorite programming language. An example of a hobbitd worker module is in the hobbitd_sample.c application included with Hobbit.
Assuming you just add stuff to the existing hobbitd_larrd module, you must do two things:
Write a routine do_orasync_larrd() similar to the other ones in the hobbitd/larrd/*.c files, that receives the message, picks out the numbers that you want to store, and saves it in an RRD file;
Add a line to do_larrd.c at the end of the file, so when it sees the "orasync" message, it calls your do_orasync_larrd() routine.
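The dispatch step in the second item can be pictured roughly like this. It is only a sketch of the idea, not the actual code in do_larrd.c: the handler signature, the lookup table, and the names dispatch_larrd_message, handle_bbgen and handle_orasync are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handler signature; the real routines in hobbitd/larrd/*.c
 * may take different arguments. */
typedef int (*larrd_handler_t)(const char *hostname, const char *msg);

/* Stand-in handlers that only return a distinct code, so the dispatch
 * can be observed. */
static int handle_bbgen(const char *hostname, const char *msg)
{
    (void)hostname; (void)msg;
    return 1;
}

static int handle_orasync(const char *hostname, const char *msg)
{
    (void)hostname; (void)msg;
    return 2;
}

static const struct {
    const char *testname;
    larrd_handler_t handler;
} handlers[] = {
    { "bbgen",   handle_bbgen },
    { "orasync", handle_orasync },   /* the one added line for the new data type */
};

/* Returns the handler's result, or -1 when no handler knows the test name. */
int dispatch_larrd_message(const char *testname, const char *hostname, const char *msg)
{
    size_t i;

    assert(testname != NULL);
    for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++) {
        if (strcmp(testname, handlers[i].testname) == 0)
            return handlers[i].handler(hostname, msg);
    }
    return -1;
}
```

Whatever form the real dispatch takes, the point is the same: supporting a new message type means adding one new entry that ties the name "orasync" to your routine.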
You need to learn about RRDtool to really do this; the "rrdcreate" and "rrdgraph" manpages include some tips on how to define RRD's and how you can setup graphs.
Take a look at one of the simple ones, e.g. hobbitd/larrd/do_bbgen.c which picks out a single value from the status message that bbgen sends when updating the web-pages - the do_bbgen_larrd routine simply finds the string "TIME TOTAL <some number>", picks out the number (which is the time bbgen takes to generate the webpages) and stores it in an RRD file.
The data stored in the RRD file is described in the bbgen_params variable (or "orasync_params" in your case):
static char *bbgen_params[] = { "rrdcreate", rrdfn,
                                "DS:runtime:GAUGE:600:0:U",
                                rra1, rra2, rra3, rra4, NULL };
The first and last lines of this stay the same; you only change the "DS:..." line to define the data you store in the rrd.
When you have the data you want, put all of it in the "rrdvalues" string, with the timestamp in front, and call the create_and_update_rrd routine to do the work of saving the data.
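As a rough illustration of the "pick out the number, put it in rrdvalues with the timestamp in front" step, here is a standalone sketch. It assumes the message contains a line like "Total Resync Time: 1530 seconds", which is a made-up format for the orasync data, and build_orasync_rrdvalues is a hypothetical helper, not a Hobbit function. It does not call create_and_update_rrd; it only builds the "timestamp:value" string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: finds "Total Resync Time: <n>" in msg and writes
 * "<tstamp>:<n>" into out. Returns 0 on success, -1 if the marker or the
 * number is missing, or if out is too small. */
int build_orasync_rrdvalues(const char *msg, long tstamp, char *out, size_t outlen)
{
    const char *marker = "Total Resync Time:";   /* assumed message format */
    const char *p;
    long seconds;
    int n;

    assert(msg != NULL && out != NULL);

    p = strstr(msg, marker);
    if (p == NULL)
        return -1;

    /* %ld skips the whitespace between the marker and the number */
    if (sscanf(p + strlen(marker), "%ld", &seconds) != 1)
        return -1;

    n = snprintf(out, outlen, "%ld:%ld", tstamp, seconds);
    return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}
```

In a real do_orasync_larrd() routine the resulting string would go into rrdvalues before the call that saves the data.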
So now you have an RRD file. Time to put it into hobbitgraph.cfg. This really depends on the kind of data you are handling.
The last step is to add "orasync" to the GRAPHS definition in hobbitserver.cfg. This causes the bb-larrdcolumn tool to include the orasync graph on the "trends" column page.
The first time you do it, it seems complex, and I admit: it isn't exactly trivial because there are many pieces that need to fit together. But once you get the first simple graph working you'll see that it isn't all that hard. And I'll be happy to help you if you run into problems along the way.
Henrik
Henrik Stoerner wrote:
and use the "bb" utility to send it off as a "data" message to Hobbit. E.g. the following script runs your perl script, stores the output in a temporary file, and uses the "bb" utility from a Big Brother client installation to send this datafile to Hobbit in a "data" message:
#!/bin/sh
/foo/perlscript >/tmp/datafile
BBHOME=/usr/local/bbc
export BBHOME
. $BBHOME/etc/bbdef.sh
$BB $BBDISP "data $MACHINE.orasync
`cat /tmp/datafile`
"
Now the fun bit starts. Hobbit will automatically pass data-messages to all tasks monitoring the "data" channel.
Can you tell me the difference between using "data" and "status"? The reason I ask is because I looked at the hobbitd/larrd/do_bea.c (because it was the smallest one), and I notice in the comments that script that feeds it is using "status" instead of "data".
Thanks, -Charles
Charles Jones wrote:
Oops, scratch that... do_bea.c is definitely not the smallest one, and I should have looked at the one you suggested :-) I would still like to know the difference between data and status, and when one should use each.
Now that I am looking at do_bbgen.c, it does look very simple... I will give making my own a try. Thanks again.
-Charles
Would it be possible, when the status of something goes Red, to have the code printed somewhere on the status page, so that one could use the "Acknowledge alert" option and copy/paste it in, rather than having to get the incident code from email/pager? If we wanted to be really spiffy the acknowledge alert could even have a dropdown/list of current alerts so you wouldn't even have to type it :)
-Charles
On Mon, 2005-01-24 at 17:07 -0700, Charles Jones wrote:
Would it be possible, when the status of something goes Red, to have the code printed somewhere on the status page, so that one could use the "Acknowledge alert" option and copy/paste it in, rather than having to get the incident code from email/pager? If we wanted to be really spiffy the acknowledge alert could even have a dropdown/list of current alerts so you wouldn't even have to type it :)
Then you wouldn't know that the person who was supposed to be notified really was... Making them copy it off a pager is a cheap "2-factor" authentication....
-- Daniel J McDonald, CCIE # 2495, CNX Austin Energy
dan.mcdonald at austinenergy.com
Daniel J McDonald wrote:
Then you wouldn't know that the person who was supposed to be notified really was... Making them copy it off a pager is a cheap "2-factor" authentication....
I see what you mean. In my case the alerts go to an email alias that includes both the alert pager and an email list. So it can be any one of a number of people who actually Ack the alert, and they always have to copy and paste it from the email subject. They use the explanation/cause field to indicate who acked ie. "Network cable came unplugged, plugged back in -cjones". I'm just trying to figure out a way to make the Acking process more efficient.
-Charles
In <1106611848.27890.333.camel at localhost.localdomain> Daniel J McDonald <dan.mcdonald at austinenergy.com> writes:
Then you wouldn't know that the person who was supposed to be notified really was... Making them copy it off a pager is a cheap "2-factor" authentication....
Exactly. But I understand Charles' question, because I've been wanting to do something like that.
Our monitoring is handled by a NOC manned 24x7, and when an alert pops up on the Hobbit NK view they raise a trouble-ticket in some other system. The NOC people don't get an e-mail or pager alert, but it would still be nice if they could acknowledge "yes, a TT has been raised about this" to get the problem off their monitor. So I will probably implement some way of putting an "acknowledge" function on the webpages - this would have to be protected with some sort of access control, obviously.
Henrik
Henrik Storner wrote:
Exactly. But I understand Charles' question, because I've been wanting to do something like that.
Our monitoring is handled by a NOC manned 24x7, and when an alert pops up on the Hobbit NK view they raise a trouble-ticket in some other system. The NOC people don't get an e-mail or pager alert, but it would still be nice if they could acknowledge "yes, a TT has been raised about this" to get the problem off their monitor. So I will probably implement some way of putting an "acknowledge" function on the webpages - this would have to be protected with some sort of access control, obviously.
Currently I am simply using a .htaccess file to restrict access, which has been working for me so far, but built-in access control would be nice, particularly for Acks and for Maint.pl. If there were a proper permissions system, you could even define what users could see which groups of hosts! Ahhh I feel the feature creature sneaking up on us! :-)
-Charles
On Mon, 2005-01-24 at 16:12 -0700, Charles Jones wrote:
Oops, scratch that... do_bea.c is definitely not the smallest one, and I should have looked at the one you suggested :-) I would still like to know the difference between data and status, and when one should use each.
AFAIK, status is the result of a test, which should be alarmed/displayed/etc, whereas data is not alarmed/displayed, just handed off to something else to deal with it. (I think in BB it was just appended to a file in bbvar/data/hostname....)
Regards, Adam
--
Adam Goryachev Website Managers Ph: +61 2 8304 0000 adam at websitemanagers.com.au Fax: +61 2 9345 4396 www.websitemanagers.com.au
On Mon, Jan 24, 2005 at 04:03:31PM -0700, Charles Jones wrote:
Henrik Stoerner wrote:
and use the "bb" utility to send it off as a "data" message to Hobbit.
Can you tell me the difference between using "data" and "status"?
A "status" message results in a column on the display, and also has a color (red, green, yellow) that might trigger an alert.
A "data" message is never displayed and cannot generate an alert, it is just a way of collecting data.
Henrik
Henrik Stoerner wrote:
A "status" message results in a column on the display, and also has a color (red, green, yellow) that might trigger an alert.
A "data" message is never displayed and cannot generate an alert, it is just a way of collecting data.
But data can be collected from a status message as well, right? At least I hope so, because I want to update a status AND graph the result. For instance, in my Oracle resync scenario, I want to send a status that basically says "Resync completed successfully. 500 files imported. Total Resync Time: 1530 seconds." I want the 1530 to be trended. Will I be able to do that, or will I have to send both a status and a separate data message?
Thanks, -Charles
On Tue, Jan 25, 2005 at 02:07:42AM -0700, Charles Jones wrote:
Henrik Stoerner wrote:
A "status" message results in a column on the display, and also has a color (red, green, yellow) that might trigger an alert.
A "data" message is never displayed and cannot generate an alert, it is just a way of collecting data.
But data can be collected from a status message as well right? At least I hope so, because I want to update a status, AND graph the result.
Certainly, no problem at all. Hobbit gets most of its RRD graph-data from status messages (the "cpu", "disk", "memory" and network test messages, for instance). That's why you'll see two hobbitd_larrd processes running: One gets the "status" messages, and the other gets the "data" messages.
So no, you don't need to do anything special. Whether you send your original data as a status- or a data-message is up to you; as far as collecting the data in an RRD and graphing them goes, there is no difference.
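To make the two message types concrete, here is a small sketch that builds the text of each. The "data" layout follows the shell script earlier in this thread; the "status" layout (test name, then color, then free text) is the usual Big Brother convention as I understand it, so treat it as an assumption. The function names format_status_msg and format_data_msg are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "status <machine>.<test> <color> <text>".
 * Returns 0 on success, -1 if the buffer is too small. */
int format_status_msg(char *out, size_t outlen, const char *machine,
                      const char *test, const char *color, const char *text)
{
    int n;

    assert(out != NULL && machine != NULL && test != NULL && text != NULL);
    assert(color != NULL && strlen(color) > 0);   /* a status always has a color */

    n = snprintf(out, outlen, "status %s.%s %s %s", machine, test, color, text);
    return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}

/* Builds "data <machine>.<name>" followed by the payload on the next line.
 * No color: a data message is never displayed and cannot raise an alert. */
int format_data_msg(char *out, size_t outlen, const char *machine,
                    const char *name, const char *text)
{
    int n;

    assert(out != NULL && machine != NULL && name != NULL && text != NULL);

    n = snprintf(out, outlen, "data %s.%s\n%s", machine, name, text);
    return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}
```

Either string could carry the "Total Resync Time" figure; only the status form also produces a column and a color on the display.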
Regards, Henrik
Will Hobbit play nice with bbfetch? If I recall, bbfetch is run on the BBDISPLAY server, and scp's the raw status files generated by the remote clients' modified $BBHOME/bin/bb. I'm wondering if it would still work since Hobbit has no bbvar directory... I think it would, because I think bb-fetch uses the $BBTMP variable, and from what I have seen Hobbit populates all the usual bb variables.
If bb-fetch won't work with Hobbit, it might be nice to incorporate similar functionality in, as it is quite useful for situations where bbproxy won't do the trick because of one-way firewall issues.
-Charles
On Wed, Jan 26, 2005 at 12:32:45PM -0700, Charles Jones wrote:
Will Hobbit play nice with bbfetch? If I recall, bbfetch is run on the BBDISPLAY server, and scp's the raw status files generated by the remote clients' modified $BBHOME/bin/bb.
It might not ... I haven't tried bbfetch myself, so I cannot say. But it would probably be pretty easy to come up with a script that picks up the status-files that bbfetch collects, and sends them off to the Hobbit daemon via the normal Hobbit "bb" command.
If bb-fetch won't work with Hobbit, it might be nice to incorporate similar functionality in, as it is quite useful for situations where bbproxy won't do the trick because of one-way firewall issues.
I have some ideas for a Hobbit client, and yes - making it work in both a "push" (normal client) and a "pull" (bbfetch style) setup is necessary.
Henrik
On Wed, 2005-01-26 at 21:56 +0100, Henrik Stoerner wrote:
On Wed, Jan 26, 2005 at 12:32:45PM -0700, Charles Jones wrote:
Will Hobbit play nice with bbfetch? If I recall, bbfetch is run on the BBDISPLAY server, and scp's the raw status files generated by the remote clients' modified $BBHOME/bin/bb.
It might not ... I haven't tried bbfetch myself, so I cannot say. But it would probably be pretty easy to come up with a script that picks up the status-files that bbfetch collects, and sends them off to the Hobbit daemon via the normal Hobbit "bb" command.
If bb-fetch won't work with Hobbit, it might be nice to incorporate similar functionality in, as it is quite useful for situations where bbproxy won't do the trick because of one-way firewall issues.
I have some ideas for a Hobbit client, and yes - making it work in both a "push" (normal client) and a "pull" (bbfetch style) setup is necessary.
I'm actually rather fond of the bb-central style - no clients, all of the "client like scripts" run via ssh from the server.
bb-fetch has trouble with time - the remote client wipes the status files on its own schedule, and the server picks them up on its own schedule, and if the client isn't done writing you end up with lots of purples...
I haven't tried bb-central with hobbit yet. bbmap is still my next priority now that I've got a really good bbmrtg.pl running. But I've got to get bb-central up soon.
My production BigBrother server is running BigBrother + bbgen 2.5 (I know there is newer bbgen, I plan on replacing BB with a Hobbit server). My current bb+bbgen setup has problems whenever a machine dies in such a way that it is pingable, but when you connect to any open TCP port you get nothing back (usually caused by a memory error or overheating). When my current bb+bbgen setup tries to test one of these machines that has zombified, it gets hung testing that host, and eventually everything turns purple since bb isn't updating anymore.
Does Hobbit have proper timeouts to timeout a hung TCP connection so this sort of thing does not happen? For all I know this behavior was fixed in bbgen 3.x , but as I said I plan on just phasing out my BB server in favor of Hobbit.
-Charles
On Wed, Jan 26, 2005 at 03:17:01PM -0700, Charles Jones wrote:
My production BigBrother server is running BigBrother + bbgen 2.5 (I know there is newer bbgen, I plan on replacing BB with a Hobbit server).
Wow, that's a pretty old bbgen version - 1½ years, in fact.
My current bb+bbgen setup has problems whenever a machine dies in such a way that it is pingable, but when you connect to any open TCP port you get nothing back (usually caused by a memory error or overheating). When my current bb+bbgen setup tries to test one of these machines that has zombified, it gets hung testing that host, and eventually everything turns purple since bb isn't updating anymore.
Does Hobbit have proper timeouts to timeout a hung TCP connection so this sort of thing does not happen?
If not, then it's definitely a bug. All network tests done by Hobbit must timeout if the other end doesn't respond. The default timeout is 10 seconds (set with the "--timeout=N" option to bbtest-net).
Looking back through the bbgen changelog, there are a couple of bugfixes through the 2.x series that seem likely to fix it. But without knowing exactly what's triggering this behaviour it is hard to say for sure.
Henrik
Henrik Stoerner wrote:
All network tests done by Hobbit must timeout if the other end doesn't respond. The default timeout is 10 seconds (set with the "--timeout=N" option to bbtest-net).
The problem is, the ports DO respond: you can telnet to port 25, for example, and it connects... but the daemon does not respond. You can input text but you will get nothing back, and unless you break out of the telnet session with ^], it will stay hung and connected indefinitely. It's this sort of hang I'm hoping Hobbit can sense and time out on. Unfortunately the only way for me to test it is for a machine to lock up in that manner, and although it happens every now and then, I cannot reproduce it at will.
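For what it's worth, the general technique for not hanging on a peer that accepts the connection but never sends anything is to wait for readability with a deadline before reading. The sketch below shows that technique with POSIX poll(); it is not taken from bbtest-net, and read_with_timeout and its return codes are made up for illustration.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Waits up to timeout_ms for data on fd, then reads it.
 * Returns the number of bytes read (> 0), 0 on end-of-file,
 * -1 on error (including an interrupted poll), -2 on timeout. */
long read_with_timeout(int fd, char *buf, size_t buflen, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(buf != NULL && buflen > 0);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc == 0)
        return -2;              /* peer is connected but silent */
    if (rc < 0)
        return -1;

    /* Readable, hung up, or in error: read() reports which */
    return (long)read(fd, buf, buflen);
}
```

With a check like this, a zombified host that accepts on port 25 but never sends its banner would show up as a timeout instead of blocking the tester.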
-Charles
I think it would be cool if Hobbit graphed the number of alerts it sent out. It could be included on the hobbitd status page. Trending alerts is good for showing how many pages the on-call people are responding to :-)
-Charles
I am still unable to get the elusive apache1-apache3 graphs to display.
Here are my relevant bb-hosts entries:
page WEB Web Sites
1.2.3.4 www.mysite.com # noconn http://www.mysite.com apache=http://1.2.3.4/server-status?auto LARRD:*,apache:apache1|apache2|apache3
I have verified that going to http://1.2.3.4/server-status?auto works, here is the data it returns:
Total Accesses: 237
Total kBytes: 1279
CPULoad: 9.10606
Uptime: 66
ReqPerSec: 3.59091
BytesPerSec: 19843.9
BytesPerReq: 5526.14
BusyWorkers: 2
IdleWorkers: 12
Scoreboard: _C_________W__..................................................................................................................................................................................................................................................
Can you see anything I'm doing wrong? Note I also tried just having the simple keyword "apache" instead of apache=..., still no luck. I'm not even getting an "apache" column (although I wouldn't mind if the graphs just appeared in the http status info page).
On Thu, Jan 27, 2005 at 10:19:29PM -0700, Charles Jones wrote:
I am still unable to get the elusive apache1-apache3 graphs to display.
Here are my relevant bb-hosts entries:
page WEB Web Sites
1.2.3.4 www.mysite.com # noconn http://www.mysite.com apache=http://1.2.3.4/server-status?auto LARRD:*,apache:apache1|apache2|apache3
Do you have an apache.rrd file in ~/data/rrd/www.mysite.com/ ?
If you do, then the graphs should show up on the "trends" page after a while; the "trends" page is updated every 15 minutes by default so it may take a while after you change the bb-hosts file for the new graphs to show up.
If not, then there's a problem with the data collection. But your bb-hosts entry looks right, and it seems your server sends the right data.
Henrik
Henrik Stoerner wrote:
On Thu, Jan 27, 2005 at 10:19:29PM -0700, Charles Jones wrote:
I am still unable to get the elusive apache1-apache3 graphs to display.
Here are my relevant bb-hosts entries:
page WEB Web Sites
1.2.3.4 www.mysite.com # noconn http://www.mysite.com apache=http://1.2.3.4/server-status?auto LARRD:*,apache:apache1|apache2|apache3
Do you have an apache.rrd file in ~/data/rrd/www.mysite.com/ ?
Yep: -rw-r--r-- 1 hobbit other 114492 Jan 28 02:56 apache.rrd
If you do, then the graphs should show up on the "trends" page after a while; the "trends" page is updated every 15 minutes by default so it may take a while after you change the bb-hosts file for the new graphs to show up.
It's been in the bb-hosts file for a few hours and still no apache column, nor extra graphs in the http column status page.
If not, then there's a problem with the data collection. But your bb-hosts entry looks right, and it seems your server sends the right data.
How do we troubleshoot this? I checked the apache server logs and the access log shows that Hobbit is hitting the server-status url. Is there some way to manually query data from the apache.rrd file to see if it has anything in it?
-Charles
Okay, scratch my last message. I re-read what you said and looked at the *trends* page and the graphs are there. This is what I get for working on stuff past midnight :-)
Now my question is, how can I get the apache graphs to display on the httpd page as well as in trends page?
-Charles
On Fri, Jan 28, 2005 at 03:06:15AM -0700, Charles Jones wrote:
Okay, scratch my last message. I re-read what you said and looked at the *trends* page and the graphs are there. This is what I get for working on stuff past midnight :-)
I had the same feeling last night.
Now my question is, how can I get the apache graphs to display on the httpd page as well as in trends page?
Right now you cannot.
Ideally, bb-hostsvc.cgi that generates the html view of a status log would pick up the LARRD setting from bb-hosts, and give you the same graphs that you get on the trends page. It doesn't right now - for historical reasons, mostly.
Henrik
I'm assuming this won't work with Hobbit, since Hobbit stores the rrd files differently. Do you think temperature-larrd.pl could be modified to run on the Hobbit server and work? Or should I instead attempt to hack the client temperature.sh to send the temp as a data message and then create a do_temp.c module?
Speaking of this, it sure would be nice to have some sort of plugin system, or something for easily creating custom graphs. I can think of many uses for simple one-element graphs (temperature, emails sent per day, etc). I've been up all night because of temperature issues in my server room, so forgive me if I'm not making much sense :-)
-Charles
On Fri, Jan 28, 2005 at 04:47:27AM -0700, Charles Jones wrote:
I'm assuming this won't work with Hobbit, since Hobbit stores the rrd files differently. Do you think temperature-larrd.pl could be modified to run on the Hobbit server and work? Or should I instead attempt to hack the client temperature.sh to send the temp as a data message and then create a do_temp.c module?
I looked at converting temperature-larrd.pl when doing the Hobbit larrd stuff, but I couldn't find the script that feeds it - and without some idea of what the input data looks like, it's a bit hard to do the data collection.
Where can I find the client side script ? Or perhaps you can just send me a sample of the status it reports.
Speaking of this, it sure would be nice to have some sort of plugin system, or something for easily creating custom graphs. I can think of many uses for simple one-element graphs (temperature, emails sent per day, etc).
You mean doing it in C is too hard :-)
The current work-around is to enable the hobbitd_filestore module to save status- and data-reports to files, the way Big Brother does.
There's an option for hobbitd_filestore so you need not save all status logs on disk, but only the ones you want to process with some other tool.
Henrik
Henrik Stoerner wrote:
On Fri, Jan 28, 2005 at 04:47:27AM -0700, Charles Jones wrote:
I'm assuming this won't work with Hobbit, since Hobbit stores the rrd files differently. Do you think temperature-larrd.pl could be modified to run on the Hobbit server and work? Or should I instead attempt to hack the client temperature.sh to send the temp as a data message and then create a do_temp.c module?
I looked at converting temperature-larrd.pl when doing the Hobbit larrd stuff, but I couldn't find the script that feeds it - and without some idea of what the input data looks like, it's a bit hard to do the data collection.
Where can I find the client side script ? Or perhaps you can just send me a sample of the status it reports.
The client script is on deadcat.net - http://www.deadcat.net/viewfile.php?fileid=501
Here is a sample status message, from my BigBrother server that is using it:
logs]# cat *temp
green Fri Jan 28 09:13:19 MST 2005 Temperature status:
Device Temp(C) Temp(F)
&green AMBIENT 24 75
&green CPU0 40 104
&green CPU1 40 104
&green CPU2 40 104
&green CPU3 40 104
Status green: All devices look okay
Status unchanged in 5.12 hours
Status message received from 1.2.3.4
Note that the output can vary depending on which kind of machine temperature.sh is run on, but I believe they all have AMBIENT, so that's the main value we want to grab and trend.
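To make the data-collection side concrete, here is a rough Python sketch of pulling the per-device readings out of a status message like the sample above. It is only an illustration of the parsing step, not Hobbit code (a real module would be written in C under hobbitd/larrd/). The function name `parse_temperatures` and the regular expression are assumptions of mine, and the sample text is copied from the message above.

```python
import re

# Sample status text, copied from the message above.
SAMPLE_STATUS = """green Fri Jan 28 09:13:19 MST 2005 Temperature status:
Device Temp(C) Temp(F)
&green AMBIENT 24 75
&green CPU0 40 104
&green CPU1 40 104
&green CPU2 40 104
&green CPU3 40 104
Status green: All devices look okay
"""

# One device entry: "&<color> <device> <celsius> <fahrenheit>".
# finditer() is used so the parse works whether or not the rows
# are on separate lines.
DEVICE_ROW = re.compile(r"&(\w+)\s+(\S+)\s+(-?\d+)\s+(-?\d+)")


def parse_temperatures(status_text):
    """Return {device_name: temperature_in_celsius} (hypothetical helper)."""
    temps = {}
    for match in DEVICE_ROW.finditer(status_text):
        _color, device, celsius, _fahrenheit = match.groups()
        temps[device] = int(celsius)
    return temps


if __name__ == "__main__":
    readings = parse_temperatures(SAMPLE_STATUS)
    # AMBIENT is the value the thread proposes to trend.
    print(readings.get("AMBIENT"))
```

Since the set of devices varies by machine, returning a dictionary and looking up AMBIENT afterwards keeps the parser independent of which sensors a given host reports.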
Speaking of this, it sure would be nice to have some sort of plugin system, or something for easily creating custom graphs. I can think of many uses for simple one-element graphs (temperature, emails sent per day, etc).
You mean doing it in C is too hard :-)
Okay ya got me there :P
The current work-around is to enable the hobbitd_filestore module to save status- and data-reports to files, the way Big Brother does.
There's an option for hobbitd_filestore so you need not save all status logs on disk, but only the ones you want to process with some other tool.
Blah...I'm trying to not use any of the backwards compatible features...I want new and improved all the way :-)
Henrik Storner wrote:
<snip>
Thanks for the explanation of larrd. It helped a lot.
Where do you get the I/O wait information from ?
On RHEL3 (procps-2.0.17-10), there is a value for it in column 14 of vmstat's output, labeled "wa" under "cpu", so I modified a section of larrd-0.43c's vmstat-larrd.pl so it would recognize this value and use it when dealing with rhel3 systems. I hacked my client's vmstat larrd bf script to make it determine whether the system is rhel3 or not, then exported the BBOSNAME as rhel3 so this array assignment would be used by vmstat-larrd.pl.
rhel3 => { cpu_r => 0, cpu_b => 1, mem_swpd => 2, mem_free => 3,
  mem_buff => 4, mem_cach => 5, mem_si => 6, mem_so => 7,
  dsk_bi => 8, dsk_bo => 9, cpu_int => 10, cpu_csw => 11,
  cpu_usr => 12, cpu_sys => 13, cpu_wait => 14, cpu_idl => 15,
I might try adding this to hobbitd/larrd/do_vmstat.c and see if I can make it work.
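For anyone following along, the idea behind that table is just "dataset name to column position". A minimal Python sketch of applying it to one line of vmstat output follows. The positions are copied from the rhel3 entry above; the function name `parse_vmstat_line` and the sample numbers are made up for illustration, and this is not the code in vmstat-larrd.pl or do_vmstat.c.

```python
# Column positions (0-based) copied from the rhel3 entry quoted above.
RHEL3_COLUMNS = {
    "cpu_r": 0, "cpu_b": 1, "mem_swpd": 2, "mem_free": 3,
    "mem_buff": 4, "mem_cach": 5, "mem_si": 6, "mem_so": 7,
    "dsk_bi": 8, "dsk_bo": 9, "cpu_int": 10, "cpu_csw": 11,
    "cpu_usr": 12, "cpu_sys": 13, "cpu_wait": 14, "cpu_idl": 15,
}


def parse_vmstat_line(line, columns=RHEL3_COLUMNS):
    """Map one whitespace-separated vmstat data line to named values."""
    fields = line.split()
    if len(fields) <= max(columns.values()):
        raise ValueError("vmstat line has fewer columns than the layout expects")
    return {name: int(fields[index]) for name, index in columns.items()}


# Made-up data line with 16 columns, for illustration only.
SAMPLE_LINE = " 1  0  0  123456  7890  45678  0  0  5  9  101  202  3  2  10  85"

if __name__ == "__main__":
    values = parse_vmstat_line(SAMPLE_LINE)
    print(values["cpu_wait"], values["cpu_idl"])
```

Supporting another vmstat layout then only means adding another dictionary with different positions and selecting it by OS type.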
DEF:cpu_idl=vmstat.rrd:cpu_idl:AVERAGE
i.e. grab the "vmstat.rrd" file, and extract the current average value of the "cpu_idl" dataset.
You can mix values from different RRD files in the same graph, e.g. the "vmstat2" graph uses both the "vmstat.rrd" file and the "la.rrd" file:
This is nice. Once I figured out what you were doing there, I thought "hey, all I've got to do is set up a def for cpu_wa|cpu_wait and I'm golden." Then I fired up rrdtool and checked the rrd file, only to realize that I didn't have the data to begin with...
If you have more questions, please ask. And if you have something that could be of interest to others, I'll be happy to include it with Hobbit.
I'll be happy to contribute any patches that I generate.
Tom
Tom Georgoulias wrote:
rhel3 => { cpu_r => 0, cpu_b => 1, mem_swpd => 2, mem_free => 3,
  mem_buff => 4, mem_cach => 5, mem_si => 6, mem_so => 7,
  dsk_bi => 8, dsk_bo => 9, cpu_int => 10, cpu_csw => 11,
  cpu_usr => 12, cpu_sys => 13, cpu_wait => 14, cpu_idl => 15,
I might try adding this to hobbitd/larrd/do_vmstat.c and see if I can make it work.
I was able to get this to work w/o much hassle at all: modifying hobbitd/larrd/do_vmstat.c to include the rhel3 array, and lib/misc.c and lib/misc.h to define rhel3 as an OS type, did the trick. Then I created a vmstat graph config (vmstat_rhel3) in hobbitgraph.cfg that uses all 4 cpu status parameters and referenced that in bb-hosts for my rhel3 systems. Like I said in my last message, the vmstat bottom feeders on the clients have to be configured to set the BBOSTYPE to rhel3 when sending the data to the hobbit server for this to take effect, so this is more of a positive test result than a general purpose solution. I guess the point of this email is that it works just like I wanted it to. Getting it more generalized is my next step.
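One possible direction for generalizing this, sketched in Python purely as an illustration: derive the column positions from vmstat's own header line instead of from an OS type set on the client. This is not what the patch described above does. The header-to-dataset mapping and the name `columns_from_header` are my assumptions, with dataset names borrowed from the rhel3 table earlier in the thread, and the two header lines are illustrative samples.

```python
# Assumed mapping from vmstat header labels to the dataset names used in this
# thread. "w" is the old third process-count column, "wa" is I/O wait.
HEADER_TO_DATASET = {
    "r": "cpu_r", "b": "cpu_b", "w": "cpu_w",
    "swpd": "mem_swpd", "free": "mem_free", "buff": "mem_buff",
    "cache": "mem_cach", "si": "mem_si", "so": "mem_so",
    "bi": "dsk_bi", "bo": "dsk_bo", "in": "cpu_int", "cs": "cpu_csw",
    "us": "cpu_usr", "sy": "cpu_sys", "wa": "cpu_wait", "id": "cpu_idl",
}


def columns_from_header(header_line):
    """Return {dataset_name: column_index} for the labels we recognize."""
    columns = {}
    for index, label in enumerate(header_line.split()):
        dataset = HEADER_TO_DATASET.get(label)
        if dataset is not None:
            columns[dataset] = index
    return columns


# Illustrative header lines with the two CPU orderings.
HEADER_WA_BEFORE_ID = "r b swpd free buff cache si so bi bo in cs us sy wa id"
HEADER_ID_BEFORE_WA = "r b swpd free buff cache si so bi bo in cs us sy id wa"

if __name__ == "__main__":
    print(columns_from_header(HEADER_WA_BEFORE_ID)["cpu_wait"])
    print(columns_from_header(HEADER_ID_BEFORE_WA)["cpu_wait"])
```

The trade-off is that this only works if the header line is included in what the client sends; otherwise an OS-type hint of some kind is still needed.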
Tom
On Tue, 2005-01-25 at 08:27 -0500, Tom Georgoulias wrote:
Henrik Storner wrote:
<snip>
Thanks for the explanation of larrd. It helped a lot.
Where do you get the I/O wait information from ?
On RHEL3 (procps-2.0.17-10), there is a value for it in column 14 of vmstat's output, labeled "wa" under "cpu", so I modified a section of larrd-0.43c's vmstat-larrd.pl so it'd recognize this value and use it when dealing with rhel3 systems. I hacked my client's vmstat larrd bf
Actually, that is present in all kernel 2.6 versions, e.g. Mandrake 10.0 and 10.1. I'd love to be able to capture that - I beat on bb-central for quite a while trying to track it.
Tracking wait state is great for figuring out which boxes need more ram.
-- Daniel J McDonald, CCIE # 2495, CNX Austin Energy
dan.mcdonald at austinenergy.com
On Tue, Jan 25, 2005 at 08:27:37AM -0500, Tom Georgoulias wrote:
Henrik Storner wrote:
Where do you get the I/O wait information from ?
On RHEL3 (procps-2.0.17-10), there is a value for it in column 14 of vmstat's output, labeled "wa" under "cpu"
Aha! So that's it - I had been wondering a bit why my load graphs didn't always add up to 100% !
This is quite interesting, and definitely something that should be tracked. So I hope you don't mind that I've tried adding it myself ...
One annoying bit with the RRD files is that changing the dataset (e.g. adding an extra variable) is not possible. So adding the cpu_wait data will break any existing vmstat data that has been collected. So if we're gonna break the vmstat RRD layout for Linux clients, we might as well do it now before the official release. And that should also include getting the very old layout (the one from Linux 2.2 kernels, with the "r b w" process counts) aligned with the new layout - effectively creating a single vmstat RRD format regardless of what Linux version you are running.
So: I've modified the Linux vmstat RRD layout to always include the "cpu_w" (from the very old vmstat version) and "cpu_wait" columns (from the latest vmstat versions). If the client doesn't report a value for these, they are set to the special RRD-value "undefined". So when someone upgrades a system from Linux 2.2 to 2.4, or from 2.4 to 2.6, the vmstat data will still work.
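As a rough illustration of the "undefined" handling, here is a small Python sketch that builds an rrdtool update string for a unified layout, writing "U" (rrdtool's marker for an unknown value) for any dataset the client did not report. The dataset order in `UNIFIED_DATASETS` and the function name are assumptions for the example; the real order is whatever the Hobbit vmstat code defines.

```python
# Assumed dataset order, for illustration only.
UNIFIED_DATASETS = [
    "cpu_r", "cpu_b", "cpu_w",
    "mem_swpd", "mem_free", "mem_buff", "mem_cach", "mem_si", "mem_so",
    "dsk_bi", "dsk_bo", "cpu_int", "cpu_csw",
    "cpu_usr", "cpu_sys", "cpu_wait", "cpu_idl",
]


def rrd_update_string(sample, timestamp="N"):
    """Build "<time>:<v1>:<v2>:..." with "U" for values missing from sample."""
    parts = [str(timestamp)]
    for name in UNIFIED_DATASETS:
        value = sample.get(name)
        parts.append("U" if value is None else str(value))
    return ":".join(parts)


if __name__ == "__main__":
    # A client that reports neither cpu_w nor cpu_wait.
    partial = {name: 0 for name in UNIFIED_DATASETS
               if name not in ("cpu_w", "cpu_wait")}
    print(rrd_update_string(partial, timestamp=1106900000))
```

Because every host writes the same set of datasets, a vmstat upgrade on the client changes which positions carry "U" but never the file layout.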
I've also defined a "vmstat1" graph similar to the normal "vmstat" graph, but with the cpu_wait data added (it stacks on top of the "system" time, below "user" time).
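For readers who have not written rrdtool graph definitions, a Python sketch of generating the DEF and stacking arguments for such a graph is below. It follows the stacking order described above (system at the bottom, then I/O wait, then user); putting idle on top, the colors, the legends and the function name are all assumptions, and this is not the actual "vmstat1" entry from hobbitgraph.cfg.

```python
def vmstat_cpu_graph_args(rrd_file="vmstat.rrd"):
    """Return rrdtool graph arguments for a stacked CPU graph (illustrative)."""
    # (dataset, color, legend), listed bottom-to-top. Colors are arbitrary.
    layers = [
        ("cpu_sys", "#FF0000", "System"),
        ("cpu_wait", "#CC00CC", "I/O wait"),
        ("cpu_usr", "#0000FF", "User"),
        ("cpu_idl", "#00FF00", "Idle"),
    ]
    # One DEF per dataset: DEF:<name>=<file>:<dataset>:<consolidation>.
    args = [f"DEF:{ds}={rrd_file}:{ds}:AVERAGE" for ds, _, _ in layers]
    # The first layer is drawn as an AREA, the rest are stacked on top of it.
    for position, (ds, color, legend) in enumerate(layers):
        kind = "AREA" if position == 0 else "STACK"
        args.append(f"{kind}:{ds}{color}:{legend}")
    return args


if __name__ == "__main__":
    for arg in vmstat_cpu_graph_args():
        print(arg)
```

A host that has no cpu_wait data simply contributes nothing to that layer, so the same definition can be used for old and new vmstat versions.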
Some sample graphs (they don't have any data yet, so you're probably better off waiting a couple of hours before you view them):
Linux 2.6 host: http://www.hswn.dk/hobbit-cgi/hobbitgraph.sh?host=voodoo.hswn.dk&service=vms...
Linux 2.4 host: http://www.hswn.dk/hobbit-cgi/hobbitgraph.sh?host=tyge.sslug.dk&service=vmst...
Linux 2.2 host (actually 2.4, but an old vmstat version): http://www.hswn.dk/hobbit-cgi/hobbitgraph.sh?host=fenris.hswn.dk&service=vms...
Henrik
Henrik Stoerner wrote:
On RHEL3 (procps-2.0.17-10), there is a value for it in column 14 of vmstat's output, labeled "wa" under "cpu"
Aha! So that's it - I had been wondering a bit why my load graphs didn't always add up to 100% !
This is quite interesting, and definitely something that should be tracked. So I hope you don't mind that I've tried adding it myself ...
Oh no, please do. Mine is a hack, yours would be a release. ;)
So adding the cpu_wait data will break any existing vmstat data that has been collected. So if we're gonna break the vmstat RRD layout for Linux clients, we might as well do it now before the official release. And that should also include getting the very old layout (the one from Linux 2.2 kernels, with the "r b w" process counts) aligned with the new layout - effectively creating a single vmstat RRD format regardless of what Linux version you are running.
Good. Very good.
So: I've modified the Linux vmstat RRD layout to always include the "cpu_w" (from the very old vmstat version)
Isn't that value the number of processes swapped out, the third column from old vmstat? That is basically going to be ignored, unless someone has a custom larrd graph that uses it, right?
and "cpu_wait" columns (from the latest vmstat versions). If the client doesn't report a value for these, they are set to the special RRD-value "undefined". So when someone upgrades a system from Linux 2.2 to 2.4, or from 2.4 to 2.6, the vmstat data will still work.
Cool. I'm looking forward to testing it out in the next beta.
Tom
participants (5)
- dan.mcdonald@austinenergy.com
- henrik@hswn.dk
- jonescr@cisco.com
- mailinglists@websitemanagers.com.au
- tgeorgoulias@nandomedia.com