server fails to receive all of client message
I have a hobbit server which has been running quite nicely for a long time. Recently, I noticed it was consuming approximately 8G of data per month (from all the remote clients reporting their data). This was costing quite a bit of money (we pay per MB), so I modified all the clients to report using a different IP (and hence a different provider, which has much cheaper rates).
Anyway, the problem is that since approximately then, a number of client reports are not completely received. Sometimes some of the ps output is truncated, sometimes the ports section is truncated, etc. This leads to false positive alerts (i.e., procs goes red because some monitored processes appear not to be running, since they were listed after the truncation point).
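One way to confirm which reports arrive truncated is to save a client message as the server received it and list the sections it contains. The sketch below assumes the usual client message layout, where each section starts with a line such as [ps] or [ports]. The names EXPECTED_SECTIONS, sections_present and missing_sections are made up for this illustration and are not part of Hobbit; adjust the expected list to whatever your clients actually send.

```python
import re

# Hypothetical list of sections expected in every client report;
# adjust it to match what your own clients send.
EXPECTED_SECTIONS = ["date", "uname", "uptime", "df", "ports", "ps"]

# A section header is a line consisting only of "[name]".
SECTION_RE = re.compile(r"^\[([^\]]+)\][ \t]*$", re.MULTILINE)


def sections_present(message):
    """Return the section names in the order they appear in the message."""
    return SECTION_RE.findall(message)


def missing_sections(message, expected=EXPECTED_SECTIONS):
    """Return the expected sections that never show up in the message."""
    present = set(sections_present(message))
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # A made-up report that was cut off before the [ps] section.
    sample = "[date]\nMon\n[uname]\nLinux\n[uptime]\nup 3 days\n[df]\n/\n[ports]\ntcp"
    print(missing_sections(sample))
```

This only catches sections that are missing entirely; a section that was cut part-way through still counts as present.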
I've increased the timeout on hobbitd (--timeout=60), but this doesn't seem to have helped. The only common factors among the clients which have this problem are:
- Most of them are running bbproxy and passing status messages from a number of clients.
- The rest of them are on very slow connections, or frequently very busy connections.
Around the same time I actually 'fixed' bbproxy on the remote sites; prior to this the clients were reporting directly to both hobbit servers.
I've looked for an option to stop bbproxy from 'caching and combining' multiple clients into a single connection, but this doesn't seem to be possible. I don't seem to get any logs/alerts from hobbit when this happens.
Can anyone suggest where I should look, or what I can do to try and resolve this?
(My main problem is that I've started ignoring the late night SMS notifications, and I'm sure I will end up missing something important because of that).
Running hobbit version 4.2.0 from package 4.2.0-1 on the server.
Thanks, Adam
Adam Goryachev Website Managers www.websitemanagers.com.au
I have made some 'progress' of sorts.
I've increased the MAX values as I was getting some "Oversize ... truncated" messages in my log file. I then went home thinking "Great, I managed to solve this one thing today at least". Except, I started getting messages a few hours later.
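To see how close saved messages come to the configured limits, a quick size check can help. This is only a sketch: it assumes the MAXMSG_* settings are expressed in kB, which is my reading of the hobbitserver.cfg documentation, and it uses the values quoted for the remote server later in this message. The LIMITS_KB table and both helper functions are illustrative names, not anything Hobbit provides.

```python
# Limits as posted for the remote server, assumed here to be in kB.
LIMITS_KB = {"status": 1024, "client": 1024, "data": 512}


def exceeds_limit(size_bytes, msg_type, limits_kb=LIMITS_KB):
    """Return True if a message of this size is over the limit for its type."""
    return size_bytes > limits_kb[msg_type] * 1024


def headroom_bytes(size_bytes, msg_type, limits_kb=LIMITS_KB):
    """Return how many bytes remain before the limit (negative if over)."""
    return limits_kb[msg_type] * 1024 - size_bytes


if __name__ == "__main__":
    # Hypothetical 600 kB message checked against two of the limits.
    size = 600 * 1024
    print(exceeds_limit(size, "data"), exceeds_limit(size, "client"))
```

If the sizes stay well under the limits and truncation still happens, the limits are probably not the cause.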
So after further investigation, I've decided I really can't work out what is happening, and why it isn't working. I've enabled debug output from bbproxy, but I don't really know what it all means.
I can see that if I set bbproxy to only forward messages to 127.0.0.1 the local hobbit server gets all the data correctly. If I add the remote server, then some things don't work properly. Since it is likely all a big jumbled mess by now, I'll post a few sections of config files, and hopefully someone will notice my stupid mistake (or multiple mistakes)...
I have a 10.x.x.x network with a hobbit server at 10.30.10.9. All client machines report to 10.30.10.9 as the BBDISPLAY/BBPAGER (most are Windows PCs using the BB Windows client), one is a Linux hobbit-client, and of course 10.30.10.9 is itself a hobbit client (plus a couple of old ext scripts using the old BB env). I think all this is working fine, since nothing goes randomly purple/red.
10.30.10.9 is behind NAT but has complete access to the internet.
I have a remote server behind a NAT router which has port 1984 forwarded to it. It is receiving reports from around 20 other hobbit client machines perfectly, so I don't suspect the NAT router/hobbit config itself.
Some config from 10.30.10.9:

hobbitserver.cfg:
    BBSERVERIP="127.0.0.1"
    BBDISP="127.0.0.1"
    BBDISPLAYS=""
    MAXLINE="32768"

hobbitclient.cfg:
    BBDISP="10.30.10.9"
    BBDISPLAYS=""
    BB="$BBHOME/bin/bb --debug --timeout=60"
    MAXLINE="32768"

hobbitlaunch.cfg:
    [hobbitd]
        ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
        CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log --admin-senders=127.0.0.1,$BBSERVERIP --store-clientlogs=!msgs --listen=127.0.0.1

    [bbproxy]
        ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
        CMD $BBHOME/bin/bbproxy --hobbitd --bbdisplay=123.234.456.567,127.0.0.1 --listen=10.30.10.9 --report=$MACHINE.bbproxy --no-daemon --timeout=30 --pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
        CMD $BBHOME/bin/bbproxy --hobbitd --bbdisplay=127.0.0.1 --listen=10.30.10.9 --report=$MACHINE.bbproxy --no-daemon --timeout=30 --pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
        LOGFILE $BBSERVERLOGS/bbproxy.log

    [hobbitclient]
        ENVFILE /usr/lib/hobbit/client/etc/hobbitclient.cfg
        NEEDS hobbitd
        CMD /usr/lib/hobbit/client/bin/hobbitclient.sh
        LOGFILE $BBSERVERLOGS/hobbitclient.log
        INTERVAL 5m
On the remote hobbit server with the public IP I have:

hobbitserver.cfg:
    BBSERVERIP="192.168.2.6"
    BBDISP="192.168.2.6"
    BBDISPLAYS=""
    MAXLINE="32768"
    MAXMSG_STATUS="1024"
    MAXMSG_CLIENT="1024"
    MAXMSG_DATA="512"

hobbitlaunch.cfg:
    [hobbitd]
        HEARTBEAT
        ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
        CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log --admin-senders=127.0.0.1,$BBSERVERIP --maint-senders=127.0.0.1,$BBSERVERIP -www-senders=127.0.0.1,$BBSERVERIP --store-clientlogs=!msgs --timeout=60
Any suggestions as to what is going wrong would be really appreciated.
BTW, bbnet tests from the 10.30.10.9 host are not submitted to the bbproxy at all because of the BBDISP setting in the hobbitserver.cfg, but if I change this to point to 10.30.10.9 then it seems to break the web interface. I'm not really too concerned about this right now though....
Thanks for any tips/pointers/etc
Regards, Adam
I've made some more possible progress, but I still don't really know how to approach this problem or how to try to solve it.
Basically, I used tcpdump to capture all traffic sent to port 1984 on my local server. I then used Wireshark to analyse the data and find the specific stream of packets that led to hobbit raising a red alert due to a truncated client report.
It now seems to point toward some sort of transport 'problem' in that I get a number of 'errors' such as "TCP Previous segment lost" and "TCP Dup ACK" and "TCP Retransmission" and the final packet is a "RST" which I assume is when you would normally get a "Connection reset by peer" type error.
I would love to publish the trace, but I don't know how to obfuscate its contents to conceal some of the details (i.e., the contents of the hobbit client status that was being reported).
However, I do have the following questions:
If the connection died due to an error, why does hobbit still use the contents of what it received? (Is it a case of it being better to know half the information than none, or can it not tell the difference between a connection closed due to an error and a connection closed at the end of the transfer?)
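On the second half of that question, here is a small sketch of why a receiver that simply reads until the peer closes cannot tell a complete report from a partial one. It assumes messages are delivered by sending the data and then closing the connection, which is my understanding of BB-style messages; it is not hobbitd's actual code, and the function names are made up. It covers only the orderly-close case, where a sender or proxy gives up and closes cleanly. A connection that ends in a RST would normally surface as an error on the receiving side.

```python
import socket


def read_until_eof(sock):
    """Collect everything the peer sends until it closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # empty read: the peer closed its side
            break
        chunks.append(chunk)
    return b"".join(chunks)


def send_and_close(payload):
    """Send payload over a local socket pair, close, and return what arrived."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(payload)
        sender.close()  # orderly close, whether or not the payload was complete
        return read_until_eof(receiver)
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    full = b"[date]\n...\n[ports]\n...\n[ps]\n...\n"
    partial = full[: len(full) // 2]  # the sender gave up half-way
    # Both reads finish without any error; only the length differs.
    print(len(send_and_close(full)), len(send_and_close(partial)))
```

Without a length header or an end-of-message marker, the receiver has nothing to compare the received data against, so the partial report looks as valid as the full one.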
From what I know, TCP is meant to be fairly robust in the face of lost packets and other errors. Seeing these sorts of failures makes me worry that my network is unhappy in some way. Yet, from a user experience point of view, everything seems to be working perfectly (i.e., web browsing, ssh connections, etc.).
BTW, the network connection is quite busy during the times when these errors happen due to remote backups being done at those times. Could that be the cause of the problem?
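A rough way to judge whether a busy link could matter is to work out the sustained rate needed to deliver one message inside the timeout. This is back-of-the-envelope arithmetic only: it assumes the timeout covers the whole transfer (it may instead apply to each read), and the 300,000-byte message size is a made-up figure, not a measurement.

```python
def min_throughput_kbps(message_bytes, timeout_seconds):
    """Minimum sustained rate in kilobits/s to deliver a message within the timeout."""
    return message_bytes * 8 / 1000 / timeout_seconds


if __name__ == "__main__":
    # Hypothetical 300,000-byte combined message with the 30 s bbproxy timeout.
    print(min_throughput_kbps(300_000, 30))
```

If the backups leave less than that rate available to the connection, a transfer could plausibly run out of time before the last sections arrive.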
Any comments, suggestions, etc, would be greatly appreciated.
Regards, Adam
Adam Goryachev Website Managers www.websitemanagers.com.au
Adam, take a look at:
http://en.wikibooks.org/wiki/System_Monitoring_with_Hobbit/FAQ#Q._How_do_I_f...
participants (2)
- mailinglists@websitemanagers.com.au
- rodolfo@pilas.net