trying to get netapp filer data into larrd graphs
I'm using the filerstats2bb script from deadcat.net to get data from my Netapp filers and display it in hobbit. This is what is displayed on the status page:
conn, cpu, disk, info, inode, qtree, trends, user_quota
The data displayed is accurate, but the only graph that works is conn. The rest are severely broken. I checked the rrd directory and the only files that existed were:
memory.real.rrd memory.swap.rrd tcp.conn.rrd
Not sure why memory* exist at all, but they do and are empty.
I would like to fix this, starting with CPU. I'm hoping that what I learn here can be used when I attempt to create custom graphs for user_quota & qtree with the custom RRD feature described in hobbitd_larrd. The rest of this message concerns only the load average/CPU graph problem, since I figure this ought to work without any modification.
For example, this is the contents of a status summary displayed on the CPU status page:
Wed Feb 16 14:59:46 EST 2005 - CPU Utilization on filerA.nandomedia.com is OK. Uptime: 63 days, 06:57:54.29, load=1
LOAD AVG on filerA.nandomedia.com is 1
Underneath the status message there is a link for "hobbit graph la" instead of a real graph.
There is data in the usual places in hobbit/data/hist/
cat hobbit/data/hist/filerA,nandomedia,com.cpu
<snip>
Sat Feb 12 16:32:47 2005 purple 1108243967 140478
Mon Feb 14 07:34:05 2005 green 1108384445 1050
Mon Feb 14 07:51:35 2005 purple 1108385495 304
Mon Feb 14 07:56:39 2005 green 1108385799
cat hobbit/data/hist/filerA.nandomedia.com | grep cpu
<snip>
cpu 1108128065 1108073455 54610 gr pu 1
cpu 1108243967 1108128065 115902 pu gr 2
cpu 1108384445 1108243967 140478 gr pu 1
cpu 1108385495 1108384445 1050 pu gr 2
cpu 1108385799 1108385495 304 gr pu 1
So why isn't the data being pulled from the status report messages and put into an rrd file so larrd can use it?
Tom
In <4213B272.7040306 at nandomedia.com> Tom Georgoulias <tgeorgoulias at nandomedia.com> writes:
I'm using the filerstats2bb script from deadcat.net to get data from my Netapp filers and display it in hobbit. This is what is displayed on the status page:
conn, cpu, disk, info, inode, qtree, trends, user_quota
The data displayed is accurate, but the only graph that works is conn. The rest are severely broken.
That figures, since the "conn" test is run by Hobbit (bbtest-net) and reports data in a form that Hobbit knows how to handle.
I would like to fix this, starting with CPU. I'm hoping that what I learn here can be used when I attempt to create custom graphs for user_quota & qtree with the custom RRD feature described in hobbitd_larrd.
You can use some of it, but there is a difference between fixing an existing handler (hobbit already handles some "cpu" data), and adding a new handler that hobbit does not know about. Simply because when fixing the cpu-handler, you really have to fix the current C code.
The rest of this message concerns only the load average/CPU graph problem, since I figure this ought to work without any modification.
For example, this is the contents of a status summary displayed on the CPU status page:
Wed Feb 16 14:59:46 EST 2005 - CPU Utilization on filerA.nandomedia.com is OK. Uptime: 63 days, 06:57:54.29, load=1
The best way of working with the RRD data that Hobbit handles is to snoop on the data that is sent from hobbitd to the hobbitd_larrd program. You can do that by listening on the hobbit "status" channel:
~/server/bin/bbcmd sh hobbitd_channel --channel=status cat
When the "cpu" status arrives, you'll see something like this:
@@status#121308|1108589727.548324|172.16.10.2||voodoo.hswn.dk|cpu|1108591527|green||green|1106668421|0||0|
status voodoo,hswn,dk.cpu green Wed Feb 16 22:35:27 CET 2005 up: 23 days, 2 users, 171 procs, load=11
top - 22:35:27 up 23 days, 48 min, 2 users, load average: 0.24, 0.11, 0.09
Tasks: 170 total, 1 running, 169 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.2% us, 1.5% sy, 0.1% ni, 91.2% id, 2.8% wa, 0.1% hi, 0.1% si
Mem: 646876k total, 635204k used, 11672k free, 194116k buffers
Swap: 787176k total, 23608k used, 763568k free, 123284k cached
[lots of lines from "top" snipped]
@@
The first line with "@@status..." is the beginning of a message - it has some information that hobbitd picks out from all messages, like the hostname, test-name, color etc. The important thing here is to see that hobbitd does see that it is a "cpu" status - there's "|cpu|" in the first line. That means hobbitd_larrd will send this message through the "cpu" handler in hobbitd/larrd/do_la.c.
So we need to look at what the do_la.c file does.
eoln = strchr(msg, '\n');
if (eoln) *eoln = '\0';
This finds the first new-line character, and cuts off anything after that. So essentially, it only looks at the first line of the status message.
p = strstr(msg, "up: ");
if (p) {
.... process the message ....
This searches the message (or rather, the first line of it), for the string "up: ". I suspect this is where it breaks for your Netapp reports, because they have "Uptime:", not "up: "
Wed Feb 16 14:59:46 EST 2005 - CPU Utilization on filerA.nandomedia.com is OK. Uptime: 63 days, 06:57:54.29, load=1
Yes, computers are picky about such details ...
So the first fix is to change those lines above to handle a report with the keyword "Uptime:" - e.g. like this:
p = strstr(msg, "up: ");
if (!p) p = strstr(msg, "Uptime:");
if (p) {
Just one line added. But in this case, I think it makes all the difference - because the rest of the report looks like it will be handled just fine by the current code in do_la.c.
I've added this fix to my sources.
Not much info here about doing custom graphs, I'm afraid. But if you look over the example in the hobbitd_larrd man-page, it should get you started. If not, feel free to ask for more help.
Henrik
PS: If you want me to look at that Netapp disk-report that isn't being graphed, just send me an example of what such a report looks like.
H.
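For readers following along, the first-line check and keyword fallback described in this message can be sketched as a small standalone function. This is only an illustration, not the actual do_la.c code: the function name first_line_load and the assumption that the value follows a literal "load=" on the first line are mine, inferred from the sample reports quoted in this thread.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, NOT the real do_la.c handler.
 * Truncates msg at its first newline (msg is modified in place), checks
 * that the remaining first line carries an uptime keyword, and stores the
 * number that follows "load=". Returns 1 on success, 0 otherwise.
 */
int first_line_load(char *msg, double *load)
{
    char *eoln, *p;

    /* Only the first line of the status message is examined. */
    eoln = strchr(msg, '\n');
    if (eoln) *eoln = '\0';

    /* Try the usual "up: " keyword first, then fall back to "Uptime:". */
    p = strstr(msg, "up: ");
    if (!p) p = strstr(msg, "Uptime:");
    if (!p) return 0;

    /* Assumption: the value follows a literal "load=" after the keyword. */
    p = strstr(p, "load=");
    if (!p) return 0;

    *load = atof(p + 5);   /* 5 == strlen("load=") */
    return 1;
}
```

Called on the Netapp report quoted above, the "up: " search fails, the "Uptime:" fallback matches, and the value 1 is stored. Without the fallback line the function would return 0, which matches the symptom of no rrd file being created.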
Henrik Storner wrote:
The best way of working with the RRD data that Hobbit handles is to snoop on the data that is sent from hobbitd to the hobbitd_larrd program. You can do that by listening on the hobbit "status" channel:
~/server/bin/bbcmd sh hobbitd_channel --channel=status cat
The first line with "@@status..." is the beginning of a message - it has some information that hobbitd picks out from all messages, like the hostname, test-name, color etc. The important thing here is to see that hobbitd does see that it is a "cpu" status - there's "|cpu|" in the first line. That means hobbitd_larrd will send this message through the "cpu" handler in hobbitd/larrd/do_la.c.
This was extremely useful to learn. Thanks for sharing it.
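When reading output captured from the status channel, the "@@status..." header line can be split on '|' to pick out the hostname and test name. The sketch below is a hypothetical helper, not part of Hobbit. The field positions (hostname in field 4, test name in field 5, counting from 0) are inferred only from the single example header shown in this thread, so treat them as an assumption.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split a header line in place on '|' and store
 * pointers to at most max fields. Empty fields are kept (strtok() would
 * silently skip them, shifting every later index). Returns the number of
 * fields stored; anything beyond max fields is ignored.
 */
size_t split_fields(char *line, char *fields[], size_t max)
{
    size_t n = 0;
    char *p = line;

    while (n < max) {
        fields[n++] = p;        /* current field starts here */
        p = strchr(p, '|');
        if (!p) break;          /* last field: no more separators */
        *p++ = '\0';            /* terminate field, step past the '|' */
    }
    return n;
}
```

With the example header from Henrik's message, fields[4] would be "voodoo.hswn.dk" and fields[5] would be "cpu", which is the value that decides which handler sees the message.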
So the first fix is to change those lines above to handle a report with the keyword "Uptime:" - e.g. like this:
p = strstr(msg, "up: ");
if (!p) p = strstr(msg, "Uptime:");
if (p) {
Just one line added. But in this case, I think it makes all the difference - because the rest of the report looks like it will be handled just fine by the current code in do_la.c.
I've added this fix to my sources.
I added the line to do_la.c and an rrd file is being created for la, but the data used in the graph was being converted or truncated in some manner on its way from the status report message to the rrd file. The "load average" collected by this script is actually the %CPU utilization, not a true unix load average. I thought that it may have been getting converted by the operation that converts load averages when DISPREALLOADAVG=FALSE, so I added a line to the perl script that adds 2 digits after a decimal when returning the CPU load avg to hobbit. Now a CPU utilization of 11% is displayed as "load=11.00", which seems to be working better.
So as it stands now, the trend charting works and I've found a new problem while pulling my hair out on this one: The CPU utilization data obtained by SNMP is not always accurate (netapp bug #145119). In my experience, it seems to be about 5-10% off. That's not something that I can fix, so I'm just going to have to live with it for now. Still didn't make troubleshooting this hobbit graphing any easier! ;)
Coincidence or not, it seems that after I applied the fix above and rebuilt hobbit, sometime later a hobbitd_larrd column appeared and stayed red then purple for a very long time. The error message was "fatal signal caught" or something like that. I ended up using the bb 127.0.0.1 "drop servername hobbitd_larrd" command just to get rid of it, with the intention of adding it back later once I was sure it wasn't a bogus message. I'm beginning to regret that, since in my haste I may have thrown out perfectly good data. Was that a new feature that was added in RC2? How would I get it back? Add hobbitd_larrd to bb-hosts?
PS: If you want me to look at that Netapp disk-report that isn't being graphed, just send me an example of what such a report looks like.
Sure thing. See below, sorry about the line wrap. After seeing what you looked at in the CPU case, I think I know what the problem could be. The rest of my systems use the phrase "Disk partitions" while the filer uses "NetAPP Volumes". I poked at the do_disk.c code but was clearly out of my league when it came to fixing it. The column ordering is different too, although I can reorder it in the perl script to match the other linux style systems if needed.
Thu Feb 17 08:12:36 EST 2005 - NetAPP Volumes on filerA.nandomedia.com OK
Volume: Size: Used: Avail: %Used
green /vol/test01/ 382G 92915122176 296G 22.63%
green /vol/test01/.snapshot 96G 27266535424 70G 26.56%
green /vol/test01/total 478G 120181657600 366G 23.41%
green /vol/vol0/ 96G 193298432 95G 0.19%
green /vol/vol0/.snapshot 24G 129028096 24G 0.50%
green /vol/vol0/total 120G 322326528 119G 0.25%
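To make the layout of this report concrete, here is a rough sketch of how one row could be picked apart in C. It is not taken from do_disk.c and says nothing about how that handler actually works. The function name, the VOLNAME_MAX constant, and the rule that a volume path starts with '/' are all assumptions based only on the sample rows above.

```c
#include <stdio.h>

#define VOLNAME_MAX 128  /* caller's name buffer must be at least this big */

/*
 * Hypothetical parser for one row of the filer report, e.g.
 *   "green /vol/test01/ 382G 92915122176 296G 22.63%"
 * Returns 1 and fills name and *pct_used for a volume row, 0 for any
 * other line (status line, column header). The outputs are only
 * meaningful when 1 is returned; they may be overwritten on failure.
 */
int parse_volume_line(const char *line, char *name, double *pct_used)
{
    char color[16];
    int end = -1;

    /* color, volume path, skip Size/Used/Avail, then the percentage. */
    if (sscanf(line, "%15s %127s %*s %*s %*s %lf%%%n",
               color, name, pct_used, &end) != 3)
        return 0;

    /* end stays -1 unless the trailing '%' was actually matched. */
    if (end < 0) return 0;

    /* Assumption: every volume path starts with '/'. */
    return name[0] == '/';
}
```

The '%' check and the leading '/' check matter because the first line of the report also has six whitespace-separated words followed by a number (the year), which would otherwise be mistaken for a row.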
On Thu, Feb 17, 2005 at 01:21:42PM -0500, Tom Georgoulias wrote:
Coincidence or not, it seems that after I applied the fix above and rebuilt hobbit, sometime later a hobbitd_larrd column appeared and stayed red then purple for a very long time. The error message was "fatal signal caught" or something like that.
Aha - all of the hobbitd programs have a built-in feature so that if they do crash, they'll try to let you know it happened by sending off a status-message about themselves, like the one you saw. Since hobbitd_larrd doesn't normally send status messages, it will eventually go purple.
I'll look over the code - there's probably something that needs more thorough error-checking to withstand all kinds of input.
PS: If you want me to look at that Netapp disk-report that isn't being graphed, just send me an example of what such a report looks like.
Sure thing. See below, sorry about the line wrap. After seeing what you looked at in the CPU case, I think I know what the problem could be. The rest of my systems use the phrase "Disk partitions" while the filer uses "NetAPP Volumes". I poked at the do_disk.c code but was clearly out of my league when it came to fixing it.
A bit of experience with the code does help :-) The disk handler is one of the more complicated ones.
The column ordering is different too, although I can reorder it in the perl script to match the other linux style systems if needed.
That won't be necessary.
I think I have something now that appears to work. I'll send you the latest source-files directly to test, and then it will be in the next release.
Regards, Henrik
participants (2): henrik@hswn.dk, tgeorgoulias@nandomedia.com