Dominique Frise wrote:
Hi,
We track "surgemail" processes using the following rule in hobbit-clients.cfg:
HOST=xyz PROC ./surgemail min=0 TRACK=surgemail
The ps listing in msg.xyz.txt reports 315 "./surgemail" processes, while the rrd graph only shows ~30 processes.
Here is the last corresponding dataset of the processes.surgemail.rrd file (after flushing the cache by stopping Xymon):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE rrd SYSTEM "http://oss.oetiker.ch/rrdtool/rrdtool.dtd">
<!-- Round Robin Database Dump -->
<rrd>
  <version> 0003 </version>
  <step> 300 </step> <!-- Seconds -->
  <lastupdate> 1239775972 </lastupdate> <!-- 2009-04-15 08:12:52 CEST -->

  <ds>
    <name> count </name>
    <type> GAUGE </type>
    <minimal_heartbeat> 600 </minimal_heartbeat>
    <min> 0.0000000000e+00 </min>
    <max> NaN </max>

    <!-- PDP Status -->
    <last_ds> 30 </last_ds>
    <value> 5.1600000000e+03 </value>
    <unknown_sec> 0 </unknown_sec>
  </ds>

  <!-- Round Robin Archives -->
  <rra>
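As a quick sanity check on the dump, the PDP status can be cross-checked against the last update time. The sketch below assumes that, for a GAUGE data source, RRDtool accumulates the reading multiplied by the number of known seconds elapsed in the current step. That assumption is not stated in the dump itself, and the function names are made up for illustration. Under it, the accumulated value of 5160 corresponds to a reading of 30, so the rrd file appears to store exactly what it was given, and the missing processes were probably lost before the update reached the file.

```python
# Values copied from the rrd dump above.
LAST_UPDATE = 1239775972   # <lastupdate>
STEP = 300                 # <step>, in seconds
LAST_DS = 30               # <last_ds>
PDP_VALUE = 5.16e+03       # <value> in the PDP status


def seconds_into_step(last_update: int, step: int) -> int:
    """Seconds elapsed since the start of the current step."""
    return last_update % step


def implied_gauge(pdp_value: float, elapsed: int) -> float:
    """Reading implied by the accumulated PDP value.

    Assumption: the accumulated value is reading * elapsed seconds.
    """
    return pdp_value / elapsed


elapsed = seconds_into_step(LAST_UPDATE, STEP)
reading = implied_gauge(PDP_VALUE, elapsed)
print(f"elapsed={elapsed}s implied reading={reading} last_ds={LAST_DS}")
```

If the implied reading had differed from last_ds, the file itself would be suspect. Here both agree on 30, which is still far from the 315 processes in the ps listing.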
We tried letting Xymon recreate a fresh rrd, without success. The same configuration worked with Hobbit-4.2.0 and RRDtool 1.2.19 (the same RRDtool version as now).
The rrd code has changed quite a lot since 4.2.0, and I don't really see which part of the code is involved, so I don't know where to start debugging. Any help appreciated!
Dominique
This is a more general problem. The data messages passed to hobbitd_rrd are truncated. Debugging showed that the messages leave hobbitd correctly but are read incorrectly by hobbitd_channel. Below is the debug output of hobbitd and hobbitd_channel, with extra printf lines added to dump the messages.

------ hobbitd.log --------
2009-04-17 16:22:21 <- do_message/1
2009-04-17 16:22:21 -> do_message/1 (86 bytes): data blind.ifstat
2009-04-17 16:22:21 -> update_statistics
2009-04-17 16:22:21 <- update_statistics
2009-04-17 16:22:21 -> oksender
2009-04-17 16:22:21 <- oksender(1-a)
2009-04-17 16:22:21 ->handle_data
2009-04-17 16:22:21 -> posttochannel
2009-04-17 16:22:21 Posting message 2 to 1 readers
2009-04-17 16:22:21 <- posttochannel
2009-04-17 16:22:21 <-handle_data
2009-04-17 16:22:21 msg: data blind.ifstat solaris bge:0:bge0:obytes64 267829127 bge:0:bge0:rbytes64 1208836563
2009-04-17 16:22:21 <- do_message/1
2009-04-17 16:22:21 -> do_message/1 (104 bytes): data blind.vmstat
2009-04-17 16:22:21 -> update_statistics
2009-04-17 16:22:21 <- update_statistics
2009-04-17 16:22:21 -> oksender
2009-04-17 16:22:21 <- oksender(1-a)
2009-04-17 16:22:21 ->handle_data
2009-04-17 16:22:21 -> posttochannel
2009-04-17 16:22:21 Posting message 3 to 1 readers
2009-04-17 16:22:21 <- posttochannel
2009-04-17 16:22:21 <-handle_data
2009-04-17 16:22:21 msg: data blind.vmstat solaris 0 0 0 11938312 10700752 3 19 0 0 0 0 0 2 2 2 0 343 2099 1006 1 2 97
2009-04-17 16:22:21 <- do_message/1
2009-04-17 16:22:21 -> do_message/1 (1315 bytes): data blind.iostatdisk

------- rrd-data.log --------
2009-04-17 16:22:21 Peer not up, flushing message queue
2009-04-17 16:22:21 Connecting to peer 0.0.0.0:0
2009-04-17 16:22:21 Peer is UP
2009-04-17 16:22:21 inbuf: @@data#2/blind|1239978141.731166|130.223.27.23||blind|ifstat|sunos|intraDevServ,adminSys data blind.ifstat solaris bge:0:bge0:obytes64 267829127 bge:0:bge0:rbytes64 12088365 @@
2009-04-17 16:22:21 inbuf: @@data#3/blind|1239978141.731938|130.223.27.23||blind|vmstat|sunos|intraDevServ,adminSys data blind.vmstat solaris 0 0 0 11938312 10700752 3 19 0 0 0 0 0 2 2 2 0 343 2099 1006 1 2 @@

The last values of ifstat and vmstat (1208836563 and 97) become 12088365 and NULL respectively. I hope Henrik can help us solve this issue.

Dominique
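To make the comparison of the two logs less error-prone, the payload dumped on each side can be checked mechanically. The following is only a rough sketch: the function name lost_suffix is made up for illustration, and whitespace is normalised first because the dumps above do not preserve the original line breaks and spacing. It reports which trailing part of the message hobbitd sent is missing from what hobbitd_channel read.

```python
from typing import Optional


def lost_suffix(sent: str, received: str) -> Optional[str]:
    """Return the tail of `sent` that is missing from `received`.

    Both strings are whitespace-normalised first (an assumption made here
    because the log dumps flatten the original spacing). Returns None if
    `received` is not a prefix of `sent`, i.e. the damage is not a plain
    truncation.
    """
    sent_norm = " ".join(sent.split())
    recv_norm = " ".join(received.split())
    if not sent_norm.startswith(recv_norm):
        return None
    return sent_norm[len(recv_norm):]


# Payloads copied from the "msg:" lines (hobbitd.log) and the
# "inbuf:" lines (rrd-data.log), header and "@@" markers removed.
ifstat_sent = ("data blind.ifstat solaris bge:0:bge0:obytes64 267829127 "
               "bge:0:bge0:rbytes64 1208836563")
ifstat_recv = ("data blind.ifstat solaris bge:0:bge0:obytes64 267829127 "
               "bge:0:bge0:rbytes64 12088365")

vmstat_sent = ("data blind.vmstat solaris 0 0 0 11938312 10700752 3 19 "
               "0 0 0 0 0 2 2 2 0 343 2099 1006 1 2 97")
vmstat_recv = ("data blind.vmstat solaris 0 0 0 11938312 10700752 3 19 "
               "0 0 0 0 0 2 2 2 0 343 2099 1006 1 2")

print(repr(lost_suffix(ifstat_sent, ifstat_recv)))
print(repr(lost_suffix(vmstat_sent, vmstat_recv)))
```

In both cases the received text is a strict prefix of the sent text, which fits a truncation at the end of the message rather than corruption in the middle.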