[hobbit] server fails to receive all of client message
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Rodolfo Pilas wrote:
Adam, take a look at:
http://en.wikibooks.org/wiki/System_Monitoring_with_Hobbit/FAQ#Q._How_do_I_f...
I've checked that, and it does not seem to be the problem.
All original info is below, a quick recap, my hobbit server doesn't receive the complete client data, and so procs (and sometimes ports) go red since the data is missing, the procs are not found).
Currently, the bbproxy server (which is running a hobbit on 127.0.0.1) shows a green for procs for itself (the bbclient running locally) which means the bbclient passed a message to 10.30.10.9 (bbproxy) which passed the message to 127.0.0.1 (hobbit) which displayed it correctly.
The same bbproxy failed to send that message to the remote hobbit server, which has the following as the "Client Data" when I click at the bottom of the red procs page:
client mail,servername,com,au.linux linux [date] Tue Dec 16 15:56:14 EST 2008 [uname] Linux mail 2.6.18-6-686 i686 [osversion] Debian 4.0 [uptime] 15:56:14 up 11 days, 6:53, 1 user, load average: 0.22, 0.39, 0.43 [who] userag pts/0 Dec 16 13:29 (123.123.123.12.static.net.au) [df] Filesystem 1024-blocks Used Available Capacity Mounted on /dev/md0 9775120 7792092 1983028 80% / /dev/md1 146002196 23881268 122120928 17% /home [mount] /dev/md0 on / type reiserfs (rw,notail) tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755) proc on /proc type proc (rw,noexec,nosuid,nodev) sysfs on /sys type sysfs (rw,noexec,nosuid,nodev) procbususb on /proc/bus/usb type usbfs (rw) udev on /dev type tmpfs (rw,mode=0755) tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev) devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620) /dev/md1 on /home type reiserfs (rw) //ptserver/Shared$ on /mnt/ptserver/shared type smbfs (ro) //ptserver/Shared$ on /mnt/ptserver/sharedrw type smbfs (rw) [free] total used free shared buffers cached Mem: 3112548 2735760 376788 0 451828 1188956
- -/+ buffers/cache: 1094976 2017572 Swap: 1012072 88 1011984 [ifconfig] eth0 Link encap:Ethernet HWaddr 00:13:20:5F:EC:F3 inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0 inet6 addr: fe80::213:20ff:fe5f:ecf3/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:14847337 errors:0 dropped:0 overruns:0 frame:0 TX packets:15440612 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:795753357 (758.8 MiB) TX bytes:4174004073 (3.8 GiB)
eth1 Link encap:Ethernet HWaddr 00:09:5B:1A:16:26 inet addr:10.30.10.9 Bcast:10.30.15.255 Mask:255.255.240.0 inet6 addr: fe80::209:5bff:fe1a:1626/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:18232874 errors:0 dropped:0 overruns:0 frame:0 TX packets:25222742 errors:7 dropped:0 overruns:7 carrier:7 collisions:0 txqueuelen:1000 RX bytes:771706320 (735.9 MiB) TX bytes:2584475027 (2.4 GiB) Interrupt:74 Base address:0xc000
lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:7545664 errors:0 dropped:0 overruns:0 frame:0 TX packets:7545664 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:3978669335 (3.7 GiB) TX bytes:3978669335 (3.7 GiB)
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 inet addr:10.30.99.1 P-t-P:10.30.99.2 Mask:255.255.255.255 UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1 RX packets:767727 errors:0 dropped:0 overruns:0 frame:0 TX packets:873992 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 RX bytes:57714836 (55.0 MiB) TX bytes:210654878 (200.8 MiB)
[route] Kernel IP routing table Destination Gateway Genmask Flags MSS Window irtt Iface 10.30.99.2 0.0.0.0 255.255.255.255 UH 0 0 0 tun0 10.30.99.0 10.30.99.2 255.255.255.0 UG 0 0 0 tun0 192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0 10.30.0.0 0.0.0.0 255.255.240.0 U 0 0 0 eth1 10.30.0.0 10.30.10.254 255.255.0.0 UG 0 0 0 eth1 0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 eth0 [netstat] Ip: 40747731 total packets received 138429 with invalid addresses 1633704 forwarded 1 with unknown protocol 0 incoming packets discarded 38950970 incoming packets delivered 50359204 requests sent out 15 outgoing packets dropped 1 fragments dropped after timeout 1831 reassemblies required 915 packets reassembled ok 1 packet reassembles failed Icmp: 1657334 ICMP messages received 3902 input ICMP message failed. ICMP input histogram: destination unreachable: 1566573 timeout in transit: 54 redirects: 123 echo requests: 26880 echo replies: 61402 139635 ICMP messages sent 0 ICMP messages failed ICMP output histogram: destination unreachable: 112755 echo replies: 26880 Tcp: 1669446 active connections openings 1247657 passive connection openings 361872 failed connection attempts 132167 connection resets received 96 connections established 33096061 segments received 40931113 segments send out 715004 segments retransmited 0 bad segments received. 125842 resets sent Udp: 4173585 packets received 23298 packets to unknown port received. 62 packet receive errors 6868647 packets sent TcpExt: 482 resets received for embryonic SYN_RECV sockets 268 packets pruned from receive queue because of socket buffer overrun 14 ICMP packets dropped because they were out-of-window 1084404 TCP sockets finished time wait in fast timer 2236 time wait sockets recycled by time stamp 354 packets rejects in established connections because of timestamp 366541 delayed acks sent 179 delayed acks further delayed because of locked socket Quick ack mode was activated 111492 times 2510467 packets directly queued to recvmsg prequeue. 3003729 of bytes directly received from backlog 380000212 of bytes directly received from prequeue 7194390 packet headers predicted 724319 packets header predicted and directly queued to user 8998070 acknowledgments not containing data received 5065326 predicted acknowledgments 1472 times recovered from packet loss due to fast retransmit 42465 times recovered from packet loss due to SACK data 161 bad SACKs received Detected reordering 165 times using FACK Detected reordering 181 times using SACK Detected reordering 444 times using reno fast retransmit Detected reordering 3197 times using time stamp 1530 congestion windows fully recovered 17437 congestion windows partially recovered using Hoe heuristic TCPDSACKUndo: 155 73943 congestion windows recovered after partial ack 29880 TCP data loss events TCPLostRetransmit: 42 541 timeouts after reno fast retransmit 60499 timeouts after SACK recovery 11423 timeouts in loss state 84462 fast retransmits 6777 forward retransmits 45329 retransmits in slow start 261506 other TCP timeouts TCPRenoRecoveryFail: 203 14386 sack retransmits failed 2738 times receiver scheduled too late for direct processing 13320 packets collapsed in receive queue due to low socket buffer 89544 DSACKs sent for old packets 1294 DSACKs sent for out of order packets 21665 DSACKs received 446 DSACKs for out of order packets received 34120 connections reset due to unexpected data 3110 connections reset due to early user close 34312 connections aborted due to timeout [ports] Active Internet connections (servers and established) Proto Recv-Q Send-Q Local Address Foreign Address State tcp 0 0 0.0.0.0:20000 0.0.0.0:* LISTEN tcp 0 0 127.0.0.1:1984 0.0.0.0:* LISTEN tcp 0 0 10.30.10.9:1984 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:20002 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:55555 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:37 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:389 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:21000 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:873 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:9 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:139 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:13 0.0.0.0:* LISTEN tcp 0 0 127.0.0.1:783 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN tcp 0 0 10.30.10.9:8080 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:113 0.0.0.0:* LISTEN tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN tcp 0 0 10.30.10.9:1080 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:5432 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:25 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:445 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:14238 0.0.0.0:* LISTEN tcp 0 0 192.168.1.2:56464 123.123.32.80:80 ESTABLISHED tcp 0 0 192.168.1.2:51389 123.123.176.189:80 ESTABLISHED tcp 0 0 192.168.1.2:52865 123.123.28.123:80 ESTABLISHED tcp 0 0 10.30.10.9:8080 10.30.10.16:1028 TIME_WAIT tcp 0 0 192.168.1.2:40408 123.213.88.51:80 ESTABLISHED tcp 0 0 192.168.1.2:53207 123.123.50.28:443 ESTABLISHED tcp 0 0 10.30.10.9:33905 10.30.10.1:139 ESTABLISHED tcp 0 0 192.168.1.2:49885 123.123.28.124:80 ESTABLISHED tcp 0 0 192.168.1.2:53071 123.123.176.176:80 ESTABLISHED tcp 0 0 127.0.0.1:5432 127.0.0.1:53017 ESTABLISHED tcp 0 0 192.168.1.2:25 123.123.132.183:57226 TIME_WAIT tcp 0 0 10.30.10.9:8080 10.30.10.14:2826 ESTABLISHED tcp 0 0 127.0.0.1:54646 127.0.0.1:1984 TIME_WAIT tcp 0 0 127.0.0.1:54645 127.0.0.1:1984 TIME_WAIT tcp 0 0 127.0.0.1:54655 127.0.0.1:1984 TIME_WAIT tcp 0 0 127.0.0.1:54639 127.0.0.1:1984 TIME_WAIT tcp 0 0 10.30.10.9:8080 10.30.10.14:2829 ESTABLISHED tcp 0 0 127.0.0.1:54664 127.0.0.1:1984 TIME_WAIT tcp 0 0 127.0.0.1:54665 127.0.0.1:1984 TIME_WAIT tcp 0 0 10.30.10.9:8080 10.30.10.14:2830 ESTABLISHED tcp 0 0 192.168.1.2:41020 123.123.38.94:80 ESTABLISHED tcp 0 0 10.30.10.9:8080 10.30.10.16:1050 ESTABLISHED tcp 0 0 192.168.1.2:35851 123.123.88.59:80 TIME_WAIT tcp 0 0 10.30.10.9:8080 10.30.10.16:1051 ESTABLISHED tcp 0 0 192.168.1.2:47514 123.123.88.51:80 ESTABLISHED tcp 0 0 192.168.1.2:37393 123.123.176.45:443 ESTABLISHED tcp 0 0 10.30.10.9:49509 10.30.10.9:995 ESTABLISHED tcp 0 0 192.168.1.2:38569 123.123.15.124:80 ESTABLISHED tcp 0 0 10.30.10.9:8080 10.30.10.16:1068 ESTABLISHED tcp 0 1640 192.168.1.2:40677 123.213.176.45:443 ESTABLISHED tcp 0 1 192.168.1.2:59642 123.123.176.176:80 LAST_ACK tcp 0 0 10.30.10.9:8080 10.30.10.16:1066 ESTABLISHED tcp 0 0 192.168.1.2:46183 123.123.88.51:80 ESTABLISHED tcp 0 0 10.30.10.9:8080 10.30.10.16:1067 ESTABLISHED tcp 0 1640 192.168.1.2:32966 123.123.176.45:443 ESTABLISHED tcp 0 0 192.168.1.2:41246 123.123.88.56:80 ESTABLISHED tcp 0 0 192.168.1.2:35872 123.123.38.94:80 TIME_WAIT tcp 0 0 10.30.10.9:139 10.30.10.1:3133 ESTABLISHED tcp 0 0 10.30.10.9:47825 10.30.10.1:139 TIME_WAIT tcp 0 0 10.30.10.9:47828 10.30.10.1:139 TIME_WAIT tcp 0 0 192.168.1.2:33758 123.123.38.94:80 TIME_WAIT tcp 0 0 10.30.10.9:8080 10.30.10.17:3132 ESTABLISHED tcp
Note, this is clearly truncated mid-line/mid-report!
So, the procs test on the remote hobbit shows red, and alerts/etc... The interesting thing to note this time is that the actual procs report showed most of the procs were actually found with some (truncated) ps output data. Which means, hobbit actually had more data than is displayed on the Client Data page....
If anyone can advise how to resolve this, I would be exceptionally keen to hear about it.
A upgrade to the bandwidth for one end is scheduled for 2 weeks time, which may help this site, but I am still seeing the same problem with other hosts in other locations.
Two additional questions:
Can some sort of checksum/verification marker be added to the end of the client data so the server knows that if it doesn't see the end marker to discard the entire message?
Compression of the client data? Nevermind, I'll post this as a separate email so it can get a little better visibility...
Thanks, Adam
Adam Goryachev escribió:
Adam Goryachev wrote:
Anyway, the problem is that approximately since then, a number of client reports are not completely received. Sometimes some of the ps output is truncated, sometimes the ports sections is truncated, etc. This leads to false positive alerts (ie, procs goes red because some monitored procs are not running since they were after the truncated section).
I've increased the timeout on the hobbitd (--timeout=60) but this doesn't seem to have helped. The only common factor between the clients which have this problem are:
- Most of them are running bbproxy and passing status messages from a number of clients.
- The rest of them are on very slow connections, or frequently very busy connections.
I have made some 'progress' of sorts.
I've increased the MAX values as I was getting some "Oversize ... truncated" messages in my log file. I then went home thinking "Great, I managed to solve this one thing today at least". Except, I started getting messages a few hours later.
So after further investigation, I've decided I really can't work out what is happening, and why it isn't working. I've enabled debug output from bbproxy, but I don't really know what it all means.
I can see that if I set bbproxy to only forward messages to 127.0.0.1 the local hobbit server gets all the data correctly. If I add the remote server, then some things don't work properly. Since it is likely all a big jumbled mess by now, I'll post a few sections of config files, and hopefully someone will notice my stupid mistake (or multiple mistakes)...
I have a network 10.x.x.x which has a hobbit server at 10.30.10.9, all client machines report to 10.30.10.9 as the BBDISPLAY/BBPAGER (most are windows PC's using the BB windows client), one is a linux hobbit-client and of course 10.30.10.9 is a hobbit client (plus a couple of old ext scripts using the old BB env). I think all this is working fine, since nothing goes randomly purple/red.
10.30.10.9 is behind NAT but has complete access to the internet.
I have a remote server behind a NAT router which has port 1984 port forwarded to it. It is receiving reports from around 20 other hobbit client machines perfectly, so I don't suspect the NAT router/hobbit config itself.
Some config from 10.30.10.9:
hobbitserver.cfg: BBSERVERIP="127.0.0.1" BBDISP="127.0.0.1" BBDISPLAYS="" MAXLINE="32768"
hobbitclient.cfg BBDISP="10.30.10.9" BBDISPLAYS="" BB="$BBHOME/bin/bb --debug --timeout=60" MAXLINE="32768"
hobbitlaunch.cfg [hobbitd] ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log --admin-senders=127.0.0.1,$BBSERVERIP --store-clientlogs=!msgs --listen=127.0.0.1
[bbproxy] ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg CMD $BBHOME/bin/bbproxy --hobbitd --bbdisplay=123.234.456.567,127.0.0.1 --listen=10.30.10.9 --report=$MACHINE.bbproxy --no-daemon --timeout=30 --pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details CMD $BBHOME/bin/bbproxy --hobbitd --bbdisplay=127.0.0.1 --listen=10.30.10.9 --report=$MACHINE.bbproxy --no-daemon --timeout=30 --pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details LOGFILE $BBSERVERLOGS/bbproxy.log
[hobbitclient] ENVFILE /usr/lib/hobbit/client/etc/hobbitclient.cfg NEEDS hobbitd CMD /usr/lib/hobbit/client/bin/hobbitclient.sh LOGFILE $BBSERVERLOGS/hobbitclient.log INTERVAL 5m
On the remote hobbit server with the public IP I have: hobbitserver.cfg BBSERVERIP="192.168.2.6" BBDISP="192.168.2.6" BBDISPLAYS="" MAXLINE="32768" MAXMSG_STATUS="1024" MAXMSG_CLIENT="1024" MAXMSG_DATA="512"
hobbitlaunch.cfg [hobbitd] HEARTBEAT ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log --admin-senders=127.0.0.1,$BBSERVERIP --maint-senders=127.0.0.1,$BBSERVERIP -www-senders=127.0.0.1,$BBSERVERIP --store-clientlogs=!msgs --timeout=60
Any suggestions as to what is going wrong would be really appreciated.
BTW, bbnet tests from the 10.30.10.9 host are not submitted to the bbproxy at all because of the BBDISP setting in the hobbitserver.cfg, but if I change this to point to 10.30.10.9 then it seems to break the web interface. I'm not really too concerned about this right now though....
Thanks for any tips/pointers/etc
Regards, Adam
Adam Goryachev Website Managers Ph: +61 2 8304 0000 adam at websitemanagers.com.au Fax: +61 2 8304 0001 www.websitemanagers.com.au -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAklHQnsACgkQGyoxogrTyiVq/QCgjU1lIamzEs5lq5bKGo9K3Hgt XQAAnAvUe9sQC3PHeTeJjFi2+XGG50dT =65Io -----END PGP SIGNATURE-----
participants (1)
-
adam@websitemanagers.com.au