I have been running hobbit for several months now without incident. I am running hobbit 4.1.2p1 on Red Hat Enterprise Linux 3 on IBM pSeries hardware. I hadn't had any issues until this morning. Now, after about an hour of running, the system flat out dies: I am sent a notification for every connected system, and then the network process appears to die. I ran tcpdump to see what was wrong; the last completed network test I see was about 30 minutes ago, to a machine on the same subnet. I am not running iptables/ipchains. I am not experienced at hard-core hobbit debugging. I looked in /var/log/hobbit and don't see anything strange, and there are no core files in the hobbit directory.
Any advice on where to start? All my network tests are now purple.
Thanks in advance, Jim
Jim Horwath
On Mon, Mar 13, 2006 at 12:45:27PM -0500, James B Horwath wrote:
Is there a "bbtest-net" and/or "fping" process that hangs? If there is, it would be interesting to attach to it with "gdb" and see what it is doing. Alternatively, kill it with "kill -6", which will trigger a core dump in ~hobbit/data/tmp/; you can run the core dump through gdb, which might give me an idea of what it is doing.
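As a rough aid for spotting a hung process, here is a Python 3 sketch (not part of hobbit) that scans the Linux /proc filesystem for processes whose command name is bbtest-net or fping and reports how long each has been running. The name set, the 10-minute threshold, and the helper names are assumptions for illustration only; a long-lived match is a candidate for attaching gdb or sending kill -6 as described above.

```python
import os

# Hypothetical: the process names asked about in this thread, matched
# against the kernel's "comm" field (bbtest-net fits in its 15-char limit).
SUSPECT_NAMES = {"bbtest-net", "fping"}


def parse_stat(text):
    """Return (pid, comm, starttime_ticks) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the LAST closing parenthesis.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren].strip())
    comm = text[open_paren + 1:close_paren]
    rest = text[close_paren + 2:].split()
    # rest[0] is field 3 (state); starttime is field 22, i.e. rest[19].
    return pid, comm, int(rest[19])


def find_suspects(min_age_seconds=600):
    """List (pid, comm, age_seconds) for suspect processes at least min_age_seconds old."""
    ticks = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    with open("/proc/uptime") as f:
        uptime = float(f.read().split()[0])  # seconds since boot
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat") as f:
                pid, comm, start = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or entry unreadable
        if comm not in SUSPECT_NAMES:
            continue
        age = uptime - start / ticks  # starttime is in ticks since boot
        if age >= min_age_seconds:
            found.append((pid, comm, age))
    return found


if __name__ == "__main__":
    for pid, comm, age in find_suspects():
        print(f"{comm} pid={pid} running for {age:.0f}s")
```

The parsing splits on the last closing parenthesis because the command name field may contain spaces, which would otherwise shift the field positions.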
You can also try switching to the hobbit user with su and running the command
bbcmd bbtest-net --debug host1 host2
(replace "host1" and "host2" with a couple of the hosts in your bb-hosts file).
Are DNS lookups working on this box? That is one of the few things that can cause the network tests to slow down dramatically, but they ought to time out automatically. The same goes for the other commands that run as part of the network tests (rpc and ntp queries).
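To check the DNS question quickly, a Python 3 sketch like the following times a lookup of each host name given on the command line through the system resolver (socket.getaddrinfo). The script, its helper names, and the 2-second "slow" threshold are hypothetical; whether bbtest-net resolves names exactly this way is not stated in this thread, so treat it only as a rough check of the box's resolver configuration.

```python
import socket
import sys
import time


def time_lookup(hostname):
    """Resolve hostname via the system resolver; return (seconds, addresses, error)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, None)
        # Each entry's sockaddr tuple starts with the address string.
        addresses = sorted({info[4][0] for info in infos})
        error = None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        addresses, error = [], str(exc)
    return time.monotonic() - start, addresses, error


def classify(elapsed, error, slow_after=2.0):
    """Label a lookup result; slow_after is an arbitrary threshold in seconds."""
    if error is not None:
        return "FAILED"
    return "SLOW" if elapsed >= slow_after else "ok"


if __name__ == "__main__":
    for name in sys.argv[1:]:
        elapsed, addresses, error = time_lookup(name)
        label = classify(elapsed, error)
        detail = error if error else ", ".join(addresses)
        print(f"{label:6} {name}: {elapsed:.2f}s ({detail})")
```

For example, saved as dnscheck.py (a hypothetical filename), it could be run as the hobbit user with a few names from bb-hosts: python3 dnscheck.py host1 host2. A lookup that takes many seconds or fails outright would point at the resolver rather than at hobbit itself.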
Regards, Henrik