Adam Goryachev wrote:
Anyway, the problem is that approximately since then, a number of client reports have not been completely received. Sometimes part of the ps output is truncated, sometimes the ports section is truncated, etc. This leads to false-positive alerts (i.e., procs goes red because some monitored processes appear not to be running, since they were listed after the point of truncation).
I've increased the timeout on hobbitd (--timeout=60), but this doesn't seem to have helped. The only common factors among the clients that have this problem are:
- Most of them are running bbproxy and passing status messages from a number of clients.
- The rest of them are on very slow connections, or frequently very busy connections.
I may have made some more progress, but I still don't really know how to approach this problem or how to try to solve it.
Basically, I used tcpdump to capture all traffic sent to port 1984 on my local server. I then used wireshark to analyse the data and find the specific stream of packets that led to hobbit raising a red alert due to a truncated client report.
It now seems to point toward some sort of transport 'problem' in that I get a number of 'errors' such as "TCP Previous segment lost" and "TCP Dup ACK" and "TCP Retransmission" and the final packet is a "RST" which I assume is when you would normally get a "Connection reset by peer" type error.
I would love to publish the trace, but I don't know how to obfuscate its contents to conceal some of the details (i.e., the contents of the hobbit client status that was being reported).
However, I do have the following questions:
If the connection died due to an error, why does hobbit still use the contents of what it received? (Is this a case of "better to know half the information than none", or is it that hobbit can't tell the difference between a connection closed due to an error and a connection closed normally at the end of the transfer?)
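To make the question concrete, here is a minimal Python sketch of how a receiver could in principle tell the two cases apart. It is not hobbitd's code and says nothing about how hobbitd behaves; it assumes only that the sender marks the end of a report by closing the connection. The names read_report, demo and PAYLOAD are made up for illustration. An orderly close shows up as recv() returning an empty result, while an RST shows up as a "Connection reset by peer" error.

```python
import socket
import struct

# Stand-in for a client report; the content is purely illustrative.
PAYLOAD = b"[ps]\ninit\nsshd\n[ports]\ntcp 0 0 0.0.0.0:22\n"


def read_report(conn):
    """Read until the peer closes. Returns (data, complete)."""
    chunks = []
    try:
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                # Empty read: the peer closed in an orderly way (FIN).
                return b"".join(chunks), True
            chunks.append(chunk)
    except (ConnectionResetError, socket.timeout):
        # RST or timeout: whatever was collected may be truncated.
        return b"".join(chunks), False


def demo(abort):
    """Send PAYLOAD over loopback, closing cleanly or with an RST."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    conn, _ = server.accept()
    conn.settimeout(5)
    client.sendall(PAYLOAD)
    if abort:
        # SO_LINGER with a zero timeout makes close() send an RST
        # instead of the usual FIN.
        client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.pack("ii", 1, 0))
    client.close()
    try:
        return read_report(conn)
    finally:
        conn.close()
        server.close()


if __name__ == "__main__":
    for abort in (False, True):
        data, complete = demo(abort)
        print("abort=%s complete=%s bytes=%d" % (abort, complete, len(data)))
```

In the aborted case, how much of the data is still readable before the error surfaces depends on the operating system, so the sketch only relies on the complete flag. A receiver written this way could choose to discard, or at least flag, a report that ended with a reset.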
From what I know, TCP is meant to be fairly robust in the face of lost packets and other errors. The fact that I am seeing these sorts of failures makes me worry that my network must be unhappy in some way. Yet, from a user-experience point of view, everything seems to be working perfectly (web browsing, ssh connections, etc.).
BTW, the network connection is quite busy during the times when these errors happen due to remote backups being done at those times. Could that be the cause of the problem?
Any comments, suggestions, etc, would be greatly appreciated.
Regards, Adam
Adam Goryachev
Website Managers
www.websitemanagers.com.au