Hi Olivier,
What version of Hobbit are you running? And on what OS/hardware?
On Mon, Aug 01, 2005 at 01:36:03PM +0200, Olivier Beau wrote:
i'm having a problem:
- bbtest-net reports one or two "Whoops ! bb failed to send message - timeout" in the report
- which is causing a bunch of net tests to go purple (pretty embarrassing..)
- i tried to play with BBMAXMSGSPERCOMBO and BBSLEEPBETWEENMSGS, but it doesn't seem to have any effect...
- once in a while bbtest-net does report everything fine to hobbitd, without any changes on the server
here's the output from bbtest-net --debug where the whoops happens:
2005-08-01 13:15:24 Recipient listed as '127.0.0.1'
2005-08-01 13:15:24 Standard BB protocol on port 1985
2005-08-01 13:15:24 Will connect to address 127.0.0.1 port 1985
2005-08-01 13:15:24 Connect status is 0
2005-08-01 13:15:24 Sent 65532 bytes
2005-08-01 13:15:24 Sent 81921 bytes
2005-08-01 13:15:24 Sent 49152 bytes
2005-08-01 13:15:29 Whoops ! bb failed to send message - timeout
Is there an equivalent number of "Bogus/Timeout" messages reported in the Hobbit server's "hobbitd" status column? Are there any unusual messages in the hobbitd.log file?
The timeout that bbtest-net hits is the 5-second default used whenever a message is sent off to the Hobbit daemon. The 5 seconds was chosen back when bbtest-net was sending to the Big Brother daemon, and considering the fact that Hobbit can generate much larger messages, it might be worth a try to increase that timeout somewhat. Unfortunately, it is set at compile time and cannot be changed easily, so could you try editing the lib/sendmsg.h file and changing the line
   #define BBTALK_TIMEOUT 5
to
   #define BBTALK_TIMEOUT 15
Then run "make clean; make" and, as root, "make install" to build and install the tools with the new timeout setting.
Also, on the Hobbit server it might be necessary to up the timeout on the receiver side - so add a "--timeout=30" to the hobbitd command in ~hobbit/server/etc/hobbitlaunch.cfg
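To make the sender-side timeout more concrete, here is a minimal sketch in C of a send loop that gives up when the receiver stops accepting data. It is not the code from the Hobbit library; the function name send_with_timeout and the SEND_* return codes are made up for illustration, and applying the timeout to each wait for the socket (rather than to the message as a whole) is an assumption of the sketch.

```c
/* Illustrative sketch only; NOT the code from Hobbit's library. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical return codes for this sketch. */
enum { SEND_OK = 0, SEND_TIMEOUT = -1, SEND_ERROR = -2 };

/* Write len bytes to fd. Give up with SEND_TIMEOUT if the socket stays
 * unwritable for timeout_secs seconds during any single wait. */
static int send_with_timeout(int fd, const char *buf, size_t len, int timeout_secs)
{
    assert(buf != NULL && timeout_secs > 0);

    /* Non-blocking mode, so send() itself can never stall past the timeout. */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return SEND_ERROR;

    size_t done = 0;
    while (done < len) {
        fd_set wfds;
        struct timeval tmo;

        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tmo.tv_sec = timeout_secs;
        tmo.tv_usec = 0;

        /* Wait until the kernel can take more data, or the timeout expires. */
        int n = select(fd + 1, NULL, &wfds, NULL, &tmo);
        if (n == 0)
            return SEND_TIMEOUT;        /* receiver did not drain in time */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: wait again */
            return SEND_ERROR;
        }

        ssize_t sent = send(fd, buf + done, len - done, 0);
        if (sent < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
                continue;               /* buffer filled up again: wait again */
            return SEND_ERROR;
        }
        done += (size_t)sent;
    }
    return SEND_OK;
}
```

In a loop like this, the first chunks go out immediately while the socket buffers have room, and the timeout only fires once the receiving side stops reading for the whole interval. That would be consistent with a few "Sent ... bytes" lines followed by a timeout five seconds later.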
it looks like bbtest-net actually connected to hobbitd !
-> could bbtest-net re-open a connection and resend the affected statuses when a whoops happens ?
It's tricky. Basically these timeouts should not happen (especially not when we're connecting to "localhost"), so I'd rather try and figure out why they happen.
later in the bbtest-net log i see this, which is different since i suppose bbtest-net got a connection closed the first try:
2005-08-01 13:15:29 Recipient listed as '127.0.0.1'
2005-08-01 13:15:29 Standard BB protocol on port 1985
2005-08-01 13:15:29 Will connect to address 127.0.0.1 port 1985
2005-08-01 13:15:34 Timeout while talking to bbd at 127.0.0.1:1985 - retrying
2005-08-01 13:15:35 Will connect to address 127.0.0.1 port 1985
2005-08-01 13:15:35 Connect status is 0
2005-08-01 13:15:35 Sent 466 bytes
2005-08-01 13:15:35 Closing connection
Yes, this is a situation where the first connection attempt fails. This is retried and the second connection attempt succeeds and sends the message.
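The retry-on-failed-connect behaviour can be pictured with a small generic sketch in C. This is not how the Hobbit tools implement it; retry_attempts, attempt_fn, and the flaky_attempt stand-in (which just pretends to fail a given number of times instead of really connecting) are hypothetical names used only for illustration, and the attempt count and delay are arbitrary parameters.

```c
/* Illustrative sketch only; NOT the retry code used by the Hobbit tools. */
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* One attempt at "connect and send"; returns 0 on success. */
typedef int (*attempt_fn)(void *ctx);

/* Run attempt() up to max_attempts times, sleeping delay_secs between
 * failures. Returns the number of attempts used, or -1 if all failed. */
static int retry_attempts(attempt_fn attempt, void *ctx,
                          int max_attempts, unsigned delay_secs)
{
    assert(attempt != NULL && max_attempts > 0);

    for (int i = 1; i <= max_attempts; i++) {
        if (attempt(ctx) == 0)
            return i;               /* success on attempt number i */
        if (i < max_attempts)
            sleep(delay_secs);      /* back off before trying again */
    }
    return -1;                      /* every attempt failed */
}

/* Hypothetical stand-in for a real connect: fails the first
 * failures_left calls, then succeeds. */
struct flaky {
    int failures_left;
    int calls;
};

static int flaky_attempt(void *ctx)
{
    struct flaky *f = ctx;

    f->calls++;
    if (f->failures_left > 0) {
        f->failures_left--;
        return -1;                  /* simulated connect timeout */
    }
    return 0;                       /* simulated successful send */
}
```

With a stand-in that fails once, retry_attempts(flaky_attempt, &f, 3, 1) would return 2, which mirrors the log above: one timed-out attempt, then a second one that connects and sends.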
Any idea of what could be going on in hobbitd ? (my understanding is that hobbitd kind of drops heavy status connections..)
Not really. hobbitd is a single-threaded application that is designed to do as little disk I/O as possible - the only real disk I/O it performs is to read the bb-hosts file - and instead handle everything in memory. 5 seconds is a very long time; you can do a lot of cpu- and memory-bound activity during that time - *if* the hobbitd process is scheduled to run. I have seen some situations where a broken disk driver would cause the entire box to freeze up for several seconds at a time, and hobbitd doesn't like that at all ...
Henrik