[hobbit] server fails to receive all of client message

16 Dec 2008

      -----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Rodolfo Pilas wrote:
...
Adam, take a look at:
http://en.wikibooks.org/wiki/System_Monitoring_with_Hobbit/FAQ#Q._How_do_I_f...
I've checked that, and it does not seem to be the problem.
All original info is below, a quick recap, my hobbit server doesn't
receive the complete client data, and so procs (and sometimes ports) go
red since the data is missing, the procs are not found).
Currently, the bbproxy server (which is running a hobbit on 127.0.0.1)
shows a green for procs for itself (the bbclient running locally) which
means the bbclient passed a message to 10.30.10.9 (bbproxy) which passed
the message to 127.0.0.1 (hobbit) which displayed it correctly.
The same bbproxy failed to send that message to the remote hobbit
server, which has the following as the "Client Data" when I click at the
bottom of the red procs page:
client mail,servername,com,au.linux linux
[date]
Tue Dec 16 15:56:14 EST 2008
[uname]
Linux mail 2.6.18-6-686 i686
[osversion]
Debian 4.0
[uptime]
15:56:14 up 11 days,  6:53,  1 user,  load average: 0.22, 0.39, 0.43
[who]
userag   pts/0        Dec 16 13:29 (123.123.123.12.static.net.au)
[df]
Filesystem         1024-blocks      Used Available Capacity Mounted on
/dev/md0               9775120   7792092   1983028      80% /
/dev/md1             146002196  23881268 122120928      17% /home
[mount]
/dev/md0 on / type reiserfs (rw,notail)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
procbususb on /proc/bus/usb type usbfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/md1 on /home type reiserfs (rw)
//ptserver/Shared$ on /mnt/ptserver/shared type smbfs (ro)
//ptserver/Shared$ on /mnt/ptserver/sharedrw type smbfs (rw)
[free]
total       used       free     shared    buffers     cached
Mem:       3112548    2735760     376788          0     451828    1188956

-/+ buffers/cache:    1094976    2017572
Swap:      1012072         88    1011984
[ifconfig]
eth0      Link encap:Ethernet  HWaddr 00:13:20:5F:EC:F3
inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
inet6 addr: fe80::213:20ff:fe5f:ecf3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:14847337 errors:0 dropped:0 overruns:0 frame:0
TX packets:15440612 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:795753357 (758.8 MiB)  TX bytes:4174004073 (3.8 GiB)

eth1      Link encap:Ethernet  HWaddr 00:09:5B:1A:16:26
inet addr:10.30.10.9  Bcast:10.30.15.255  Mask:255.255.240.0
inet6 addr: fe80::209:5bff:fe1a:1626/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:18232874 errors:0 dropped:0 overruns:0 frame:0
TX packets:25222742 errors:7 dropped:0 overruns:7 carrier:7
collisions:0 txqueuelen:1000
RX bytes:771706320 (735.9 MiB)  TX bytes:2584475027 (2.4 GiB)
Interrupt:74 Base address:0xc000
lo        Link encap:Local Loopback
inet addr:127.0.0.1  Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING  MTU:16436  Metric:1
RX packets:7545664 errors:0 dropped:0 overruns:0 frame:0
TX packets:7545664 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3978669335 (3.7 GiB)  TX bytes:3978669335 (3.7 GiB)
tun0      Link encap:UNSPEC  HWaddr
00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.30.99.1  P-t-P:10.30.99.2  Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
RX packets:767727 errors:0 dropped:0 overruns:0 frame:0
TX packets:873992 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:57714836 (55.0 MiB)  TX bytes:210654878 (200.8 MiB)
[route]
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt
Iface
10.30.99.2      0.0.0.0         255.255.255.255 UH        0 0          0
tun0
10.30.99.0      10.30.99.2      255.255.255.0   UG        0 0          0
tun0
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0
eth0
10.30.0.0       0.0.0.0         255.255.240.0   U         0 0          0
eth1
10.30.0.0       10.30.10.254    255.255.0.0     UG        0 0          0
eth1
0.0.0.0         192.168.1.1     0.0.0.0         UG        0 0          0
eth0
[netstat]
Ip:
40747731 total packets received
138429 with invalid addresses
1633704 forwarded
1 with unknown protocol
0 incoming packets discarded
38950970 incoming packets delivered
50359204 requests sent out
15 outgoing packets dropped
1 fragments dropped after timeout
1831 reassemblies required
915 packets reassembled ok
1 packet reassembles failed
Icmp:
1657334 ICMP messages received
3902 input ICMP message failed.
ICMP input histogram:
destination unreachable: 1566573
timeout in transit: 54
redirects: 123
echo requests: 26880
echo replies: 61402
139635 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 112755
echo replies: 26880
Tcp:
1669446 active connections openings
1247657 passive connection openings
361872 failed connection attempts
132167 connection resets received
96 connections established
33096061 segments received
40931113 segments send out
715004 segments retransmited
0 bad segments received.
125842 resets sent
Udp:
4173585 packets received
23298 packets to unknown port received.
62 packet receive errors
6868647 packets sent
TcpExt:
482 resets received for embryonic SYN_RECV sockets
268 packets pruned from receive queue because of socket buffer overrun
14 ICMP packets dropped because they were out-of-window
1084404 TCP sockets finished time wait in fast timer
2236 time wait sockets recycled by time stamp
354 packets rejects in established connections because of timestamp
366541 delayed acks sent
179 delayed acks further delayed because of locked socket
Quick ack mode was activated 111492 times
2510467 packets directly queued to recvmsg prequeue.
3003729 of bytes directly received from backlog
380000212 of bytes directly received from prequeue
7194390 packet headers predicted
724319 packets header predicted and directly queued to user
8998070 acknowledgments not containing data received
5065326 predicted acknowledgments
1472 times recovered from packet loss due to fast retransmit
42465 times recovered from packet loss due to SACK data
161 bad SACKs received
Detected reordering 165 times using FACK
Detected reordering 181 times using SACK
Detected reordering 444 times using reno fast retransmit
Detected reordering 3197 times using time stamp
1530 congestion windows fully recovered
17437 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 155
73943 congestion windows recovered after partial ack
29880 TCP data loss events
TCPLostRetransmit: 42
541 timeouts after reno fast retransmit
60499 timeouts after SACK recovery
11423 timeouts in loss state
84462 fast retransmits
6777 forward retransmits
45329 retransmits in slow start
261506 other TCP timeouts
TCPRenoRecoveryFail: 203
14386 sack retransmits failed
2738 times receiver scheduled too late for direct processing
13320 packets collapsed in receive queue due to low socket buffer
89544 DSACKs sent for old packets
1294 DSACKs sent for out of order packets
21665 DSACKs received
446 DSACKs for out of order packets received
34120 connections reset due to unexpected data
3110 connections reset due to early user close
34312 connections aborted due to timeout
[ports]
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address
State
tcp        0      0 0.0.0.0:20000           0.0.0.0:*
LISTEN
tcp        0      0 127.0.0.1:1984          0.0.0.0:*
LISTEN
tcp        0      0 10.30.10.9:1984         0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:20002           0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:55555           0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:37              0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:389             0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:21000           0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:873             0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:9               0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:139             0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:13              0.0.0.0:*
LISTEN
tcp        0      0 127.0.0.1:783           0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:80              0.0.0.0:*
LISTEN
tcp        0      0 10.30.10.9:8080         0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:113             0.0.0.0:*
LISTEN
tcp        0      0 127.0.0.1:53            0.0.0.0:*
LISTEN
tcp        0      0 10.30.10.9:1080         0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:5432            0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:25              0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:445             0.0.0.0:*
LISTEN
tcp        0      0 0.0.0.0:14238           0.0.0.0:*
LISTEN
tcp        0      0 192.168.1.2:56464       123.123.32.80:80
ESTABLISHED
tcp        0      0 192.168.1.2:51389       123.123.176.189:80
ESTABLISHED
tcp        0      0 192.168.1.2:52865       123.123.28.123:80
ESTABLISHED
tcp        0      0 10.30.10.9:8080         10.30.10.16:1028
TIME_WAIT
tcp        0      0 192.168.1.2:40408       123.213.88.51:80
ESTABLISHED
tcp        0      0 192.168.1.2:53207       123.123.50.28:443
ESTABLISHED
tcp        0      0 10.30.10.9:33905        10.30.10.1:139
ESTABLISHED
tcp        0      0 192.168.1.2:49885       123.123.28.124:80
ESTABLISHED
tcp        0      0 192.168.1.2:53071       123.123.176.176:80
ESTABLISHED
tcp        0      0 127.0.0.1:5432          127.0.0.1:53017
ESTABLISHED
tcp        0      0 192.168.1.2:25          123.123.132.183:57226
TIME_WAIT
tcp        0      0 10.30.10.9:8080         10.30.10.14:2826
ESTABLISHED
tcp        0      0 127.0.0.1:54646         127.0.0.1:1984
TIME_WAIT
tcp        0      0 127.0.0.1:54645         127.0.0.1:1984
TIME_WAIT
tcp        0      0 127.0.0.1:54655         127.0.0.1:1984
TIME_WAIT
tcp        0      0 127.0.0.1:54639         127.0.0.1:1984
TIME_WAIT
tcp        0      0 10.30.10.9:8080         10.30.10.14:2829
ESTABLISHED
tcp        0      0 127.0.0.1:54664         127.0.0.1:1984
TIME_WAIT
tcp        0      0 127.0.0.1:54665         127.0.0.1:1984
TIME_WAIT
tcp        0      0 10.30.10.9:8080         10.30.10.14:2830
ESTABLISHED
tcp        0      0 192.168.1.2:41020       123.123.38.94:80
ESTABLISHED
tcp        0      0 10.30.10.9:8080         10.30.10.16:1050
ESTABLISHED
tcp        0      0 192.168.1.2:35851       123.123.88.59:80
TIME_WAIT
tcp        0      0 10.30.10.9:8080         10.30.10.16:1051
ESTABLISHED
tcp        0      0 192.168.1.2:47514       123.123.88.51:80
ESTABLISHED
tcp        0      0 192.168.1.2:37393       123.123.176.45:443
ESTABLISHED
tcp        0      0 10.30.10.9:49509        10.30.10.9:995
ESTABLISHED
tcp        0      0 192.168.1.2:38569       123.123.15.124:80
ESTABLISHED
tcp        0      0 10.30.10.9:8080         10.30.10.16:1068
ESTABLISHED
tcp        0   1640 192.168.1.2:40677       123.213.176.45:443
ESTABLISHED
tcp        0      1 192.168.1.2:59642       123.123.176.176:80
LAST_ACK
tcp        0      0 10.30.10.9:8080         10.30.10.16:1066
ESTABLISHED
tcp        0      0 192.168.1.2:46183       123.123.88.51:80
ESTABLISHED
tcp        0      0 10.30.10.9:8080         10.30.10.16:1067
ESTABLISHED
tcp        0   1640 192.168.1.2:32966       123.123.176.45:443
ESTABLISHED
tcp        0      0 192.168.1.2:41246       123.123.88.56:80
ESTABLISHED
tcp        0      0 192.168.1.2:35872       123.123.38.94:80
TIME_WAIT
tcp        0      0 10.30.10.9:139          10.30.10.1:3133
ESTABLISHED
tcp        0      0 10.30.10.9:47825        10.30.10.1:139
TIME_WAIT
tcp        0      0 10.30.10.9:47828        10.30.10.1:139
TIME_WAIT
tcp        0      0 192.168.1.2:33758       123.123.38.94:80
TIME_WAIT
tcp        0      0 10.30.10.9:8080         10.30.10.17:3132
ESTABLISHED
tcp
Note, this is clearly truncated mid-line/mid-report!
So, the procs test on the remote hobbit shows red, and alerts/etc...
The interesting thing to note this time is that the actual procs report
showed most of the procs were actually found with some (truncated) ps
output data. Which means, hobbit actually had more data than is
displayed on the Client Data page....
If anyone can advise how to resolve this, I would be exceptionally keen
to hear about it.
A upgrade to the bandwidth for one end is scheduled for 2 weeks time,
which may help this site, but I am still seeing the same problem with
other hosts in other locations.
Two additional questions:

Can some sort of checksum/verification marker be added to the end of
the client data so the server knows that if it doesn't see the end
marker to discard the entire message?

Compression of the client data?
Nevermind, I'll post this as a separate email so it can get a little
better visibility...

Thanks,
Adam
...
Adam Goryachev escribió:
...
Adam Goryachev wrote:
...
Anyway, the problem is that approximately since then, a number of client
reports are not completely received. Sometimes some of the ps output is
truncated, sometimes the ports sections is truncated, etc. This leads to
false positive alerts (ie, procs goes red because some monitored procs
are not running since they were after the truncated section).
...
I've increased the timeout on the hobbitd (--timeout=60) but this
doesn't seem to have helped. The only common factor between the clients
which have this problem are:
...

Most of them are running bbproxy and passing status messages from a
number of clients.
The rest of them are on very slow connections, or frequently very
busy connections.

I have made some 'progress' of sorts.
I've increased the MAX values as I was getting some "Oversize ...
truncated" messages in my log file. I then went home thinking "Great, I
managed to solve this one thing today at least". Except, I started
getting messages a few hours later.
So after further investigation, I've decided I really can't work out
what is happening, and why it isn't working. I've enabled debug output
from bbproxy, but I don't really know what it all means.
I can see that if I set bbproxy to only forward messages to 127.0.0.1
the local hobbit server gets all the data correctly. If I add the remote
server, then some things don't work properly. Since it is likely all a
big jumbled mess by now, I'll post a few sections of config files, and
hopefully someone will notice my stupid mistake (or multiple mistakes)...
I have a network 10.x.x.x which has a hobbit server at 10.30.10.9, all
client machines report to 10.30.10.9 as the BBDISPLAY/BBPAGER (most are
windows PC's using the BB windows client), one is a linux hobbit-client
and of course 10.30.10.9 is a hobbit client (plus a couple of old ext
scripts using the old BB env). I think all this is working fine, since
nothing goes randomly purple/red.
10.30.10.9 is behind NAT but has complete access to the internet.
I have a remote server behind a NAT router which has port 1984 port
forwarded to it. It is receiving reports from around 20 other hobbit
client machines perfectly, so I don't suspect the NAT router/hobbit
config itself.
Some config from 10.30.10.9:
hobbitserver.cfg:
BBSERVERIP="127.0.0.1"
BBDISP="127.0.0.1"
BBDISPLAYS=""
MAXLINE="32768"
hobbitclient.cfg
BBDISP="10.30.10.9"
BBDISPLAYS=""
BB="$BBHOME/bin/bb --debug --timeout=60"
MAXLINE="32768"
hobbitlaunch.cfg
[hobbitd]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid
--restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk
--checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log
--admin-senders=127.0.0.1,$BBSERVERIP --store-clientlogs=!msgs
--listen=127.0.0.1
[bbproxy]
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
CMD $BBHOME/bin/bbproxy --hobbitd
--bbdisplay=123.234.456.567,127.0.0.1 --listen=10.30.10.9
--report=$MACHINE.bbproxy --no-daemon --timeout=30
--pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
CMD $BBHOME/bin/bbproxy --hobbitd --bbdisplay=127.0.0.1
--listen=10.30.10.9 --report=$MACHINE.bbproxy --no-daemon --timeout=30
--pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
LOGFILE $BBSERVERLOGS/bbproxy.log
[hobbitclient]
ENVFILE /usr/lib/hobbit/client/etc/hobbitclient.cfg
NEEDS hobbitd
CMD /usr/lib/hobbit/client/bin/hobbitclient.sh
LOGFILE $BBSERVERLOGS/hobbitclient.log
INTERVAL 5m
On the remote hobbit server with the public IP I have:
hobbitserver.cfg
BBSERVERIP="192.168.2.6"
BBDISP="192.168.2.6"
BBDISPLAYS=""
MAXLINE="32768"
MAXMSG_STATUS="1024"
MAXMSG_CLIENT="1024"
MAXMSG_DATA="512"
hobbitlaunch.cfg
[hobbitd]
HEARTBEAT
ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid
--restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk
--checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log
--admin-senders=127.0.0.1,$BBSERVERIP
--maint-senders=127.0.0.1,$BBSERVERIP -www-senders=127.0.0.1,$BBSERVERIP
--store-clientlogs=!msgs --timeout=60
Any suggestions as to what is going wrong would be really appreciated.
BTW, bbnet tests from the 10.30.10.9 host are not submitted to the
bbproxy at all because of the BBDISP setting in the hobbitserver.cfg,
but if I change this to point to 10.30.10.9 then it seems to break the
web interface. I'm not really too concerned about this right now
though....
Thanks for any tips/pointers/etc
Regards,
Adam

Adam Goryachev
Website Managers
Ph: +61 2 8304 0000                            adam at websitemanagers.com.au
Fax: +61 2 8304 0001                            www.websitemanagers.com.au
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAklHQnsACgkQGyoxogrTyiVq/QCgjU1lIamzEs5lq5bKGo9K3Hgt
XQAAnAvUe9sQC3PHeTeJjFi2+XGG50dT
=65Io
-----END PGP SIGNATURE-----

adam＠websitemanagers.com.au

tags

participants (1)