Here I am with some new data, because the problem still exists. I know that the rrd-data-daemon crashes with the "xstrdup: Cannot dup NULL string" error. I have set up netapp.pl with $Hobbit_fd_lib::debug = 2; and found that the sysstat output is different. I don't know whether that is the real cause of the crash.
orwell:/usr/lib/hobbit/server/ext # cat /var/lib/hobbit/tmp/netapp.sysstat.DEBUG.camelot
   CPU   NFS  CIFS  HTTP Total   Net  kB/s  Disk  kB/s  Tape  kB/s Cache Cache    CP    CP  Disk   FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out  read write  read write   age   hit  time    ty  util                in   out    in   out
   29%     0  7976     0  7976  3147  5098  3872  3104     0     0     3   96%   12%     T    8%     0     0     0     0     0     0
orwell:/usr/lib/hobbit/server/ext # cat /var/lib/hobbit/tmp/netapp.sysstat.DEBUG.noah
   CPU   NFS  CIFS  HTTP Total   Net  kB/s  Disk  kB/s  Tape  kB/s Cache Cache    CP    CP  Disk   FCP iSCSI   FCP  kB/s
                                  in   out  read write  read write   age   hit  time    ty  util                in   out
    8%     0     0     0     0     1     6   986  1988     0     0    60  100%   13%     T    9%     0     0     0     0
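To make the difference between the two sysstat layouts concrete, here is a minimal Python sketch that labels each data row with its column names. The column lists are transcribed from the two headers above; the names COMMON, NEW_ONLY and label_row are made up for this illustration and have nothing to do with how netapp.pl actually parses the output.

```python
# Column names transcribed from the sysstat headers of noah and camelot.
COMMON = [
    "CPU", "NFS", "CIFS", "HTTP", "Total",
    "Net kB/s in", "Net kB/s out",
    "Disk kB/s read", "Disk kB/s write",
    "Tape kB/s read", "Tape kB/s write",
    "Cache age", "Cache hit", "CP time", "CP ty", "Disk util",
    "FCP", "iSCSI", "FCP kB/s in", "FCP kB/s out",
]
# Columns that only appear in the camelot header.
NEW_ONLY = ["iSCSI kB/s in", "iSCSI kB/s out"]


def label_row(row, columns):
    """Pair each whitespace-separated field with its column name."""
    fields = row.split()
    if len(fields) != len(columns):
        raise ValueError(
            "expected %d fields, got %d" % (len(columns), len(fields))
        )
    return dict(zip(columns, fields))


camelot_row = "29% 0 7976 0 7976 3147 5098 3872 3104 0 0 3 96% 12% T 8% 0 0 0 0 0 0"
noah_row = "8% 0 0 0 0 1 6 986 1988 0 0 60 100% 13% T 9% 0 0 0 0"

camelot = label_row(camelot_row, COMMON + NEW_ONLY)
noah = label_row(noah_row, COMMON)

# Columns present for camelot but not for noah.
extra = [name for name in camelot if name not in noah]
print(extra)
```

A parser that expects exactly the 20 columns of the noah layout would reject, or misread, the 22-field camelot row; whether that has anything to do with the crash is still an open question.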
The other files (/var/lib/hobbit/tmp/netapp.xtstats.DEBUG.camelot) also show a change in the output: what used to be the end of the output file is now at the beginning. So now it begins with:
system:system:nfs_ops:3190/s
system:system:cifs_ops:0/s
system:system:http_ops:0/s
system:system:fcp_ops:0/s
system:system:iscsi_ops:0/s
system:system:read_ops:619/s
system:system:write_ops:144/s
system:system:net_data_recv:4187KB/s
system:system:net_data_sent:23328KB/s
system:system:disk_data_read:5493KB/s
system:system:disk_data_written:6156KB/s
system:system:cpu_busy:10%
system:system:avg_processor_busy:10%
system:system:total_processor_busy:20%
system:system:num_processors:2
system:system:time:1244021254s
system:system:uptime:1048085s
disk:2000001D:38B5ED6F:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:total_transfers:8/s
disk:2000001D:38B5ED6F:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:user_read_chain:3.60
Whereas on our pre-7.3.1.1 filers, the output ends with:
.....
disk:6BE7CF95:56AFA883:F30CAEF5:83103FAC:00000000:00000000:00000000:00000000:00000000:00000000:guarenteed_read_latency:0us
disk:6BE7CF95:56AFA883:F30CAEF5:83103FAC:00000000:00000000:00000000:00000000:00000000:00000000:guarenteed_read_blocks:0/s
disk:6BE7CF95:56AFA883:F30CAEF5:83103FAC:00000000:00000000:00000000:00000000:00000000:00000000:guarenteed_write_latency:0us
disk:6BE7CF95:56AFA883:F30CAEF5:83103FAC:00000000:00000000:00000000:00000000:00000000:00000000:guarenteed_write_blocks:0/s
disk:6BE7CF95:56AFA883:F30CAEF5:83103FAC:00000000:00000000:00000000:00000000:00000000:00000000:disk_busy:0%
system:system:nfs_ops:0/s
system:system:cifs_ops:0/s
system:system:http_ops:0/s
system:system:dafs_ops:0/s
system:system:fcp_ops:0/s
system:system:iscsi_ops:0/s
system:system:net_data_recv:13KB/s
system:system:net_data_sent:47KB/s
system:system:disk_data_read:986KB/s
system:system:disk_data_written:1988KB/s
system:system:cpu_busy:8%
system:system:avg_processor_busy:5%
system:system:total_processor_busy:10%
system:system:num_processors:2
system:system:time:1244021255s
system:system:uptime:7436873s
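Besides the changed ordering, the set of system counters itself differs between the two excerpts. The following Python sketch shows one way to read the object:instance:counter:value entries independently of their order and then compare the system counters. It is only an illustration built on the excerpts above; parse_xtstats and system_counters are hypothetical helper names, not functions from netapp.pl or do_netapp.c.

```python
def parse_xtstats(text):
    """Map (object, instance, counter) -> value, regardless of order."""
    stats = {}
    for token in text.split():
        if token.count(":") < 3:
            continue  # skip filler such as "....."
        # Counter and value are the last two fields; the disk instance
        # name contains colons itself, so split from the right.
        head, counter, value = token.rsplit(":", 2)
        obj, _, instance = head.partition(":")
        stats[(obj, instance, counter)] = value
    return stats


def system_counters(stats):
    """Names of all counters that belong to the 'system' object."""
    return {counter for (obj, _inst, counter) in stats if obj == "system"}


# System counter names copied from the camelot excerpt (values shortened).
new_output = """
system:system:nfs_ops:3190/s system:system:cifs_ops:0/s
system:system:http_ops:0/s system:system:fcp_ops:0/s
system:system:iscsi_ops:0/s system:system:read_ops:619/s
system:system:write_ops:144/s system:system:cpu_busy:10%
"""
# System counter names copied from the pre-7.3.1.1 excerpt.
old_output = """
.....
system:system:nfs_ops:0/s system:system:cifs_ops:0/s
system:system:http_ops:0/s system:system:dafs_ops:0/s
system:system:fcp_ops:0/s system:system:iscsi_ops:0/s
system:system:cpu_busy:8%
"""

new = system_counters(parse_xtstats(new_output))
old = system_counters(parse_xtstats(old_output))

added = sorted(new - old)    # counters only in the new output
removed = sorted(old - new)  # counters only in the old output
print("added:", added, "removed:", removed)
```

Going by these excerpts alone, the new output adds read_ops and write_ops and no longer reports dafs_ops. Anything that expects dafs_ops to be present would come up empty on the upgraded filers.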
2009/5/30 Peter Welter <peter.welter at gmail.com>:
Addendum: After turning off 'netapp.pl' for all filers and selectively turning it on again, it appears that there are no problems with ONTAP 7.2.3 and 7.2.4. The error does not show up, and all trending (including other data-dependent trending) shows no holes anymore.
But these 7.3.1.1 filers are very important, so I have to turn the monitoring on again for this NetApp cluster. I will see if debugging the Perl script gives more relevant data.
2009/5/29 Peter Welter <peter.welter at gmail.com>:
Hello all,
Last Friday, May 22, at 8:20 we finished the upgrade of our NetApp filers (from version 7.2.3 to 7.3.1.1). These filers were (and are) monitored by Xymon in combination with the perl-netapp-client. Together, a great combo!
However, since the upgrade I keep getting this error in /var/log/hobbit/rrd-data.log:
...
2009-05-22 08:22:00 xstrdup: Cannot dup NULL string
2009-05-22 08:22:00 Worker process died with exit code 6, terminating
....
This error appears every 5 minutes.
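The message suggests that a string the worker expected to find was missing (NULL) when it tried to copy it. The Python sketch below models that failure mode in the abstract: a lookup for a counter that is absent returns nothing, and a strict copy helper refuses to continue. Everything here is hypothetical. xstrdup_like and get_counter are stand-ins invented for this example, the choice of dafs_ops is only a guess, and none of it is taken from do_netapp.c.

```python
def xstrdup_like(s):
    """Stand-in for a strdup wrapper that refuses NULL (None here)."""
    if s is None:
        raise RuntimeError("xstrdup: Cannot dup NULL string")
    return str(s)


def get_counter(stats, name, default=None):
    """Return the value for a counter, or the default if it is absent."""
    return stats.get(name, default)


# Hypothetical parsed report that lacks a counter the consumer expects.
stats = {"nfs_ops": "3190/s", "cifs_ops": "0/s"}

# Unguarded: the missing counter yields None and the copy fails.
try:
    xstrdup_like(get_counter(stats, "dafs_ops"))
    message = None
except RuntimeError as err:
    message = str(err)

# Guarded: fall back to "U", which rrdtool accepts as an unknown value.
safe_value = xstrdup_like(get_counter(stats, "dafs_ops", default="U"))

print(message)
print(safe_value)
```

If something like this is what happens, the crash would repeat on every report from an affected filer, which would fit the 5-minute rhythm.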
Only one graph type has not been trended since the upgrade: the xtstats column, which delivers all statistics about each drive in the filer (about 20 graphs). Sometimes it does trend some data, but only for a very short time, say 5 or 15 minutes. Then, for hours, nothing.
One filer has not been upgraded, but shows the same lack of trending. That may be because I have set it up with MultiThreading (something that can be set using a parameter).
Now I will change this to 1 (a separate process for each filer) to see if the problem can be narrowed down, so I'll post an update later this weekend.
Regards, Peter
PS I do not know whether this has to do with Xymon or with netapp.pl, but since it is integrated into the Xymon source (hobbitd/rrd/do_netapp.c) I think it should be posted here.