On Mon, Mar 13, 2006 at 12:45:27PM -0500, James B Horwath wrote:
I have been running hobbit for several months now without incident. I am running hobbit 4.1.2p1 on Red Hat Enterprise Linux 3 on IBM pSeries hardware. I haven't had any issues until this morning. Now it appears that after about one hour of running, the system flat out dies. I am sent a notification for every system connected. Then it appears the network process dies. I was running tcpdump to see what was wrong, and I see the completion of a network test about 30 minutes ago to a machine on the same subnet. I am not running iptables/ipchains. I am not experienced at hard-core hobbit debugging. I looked in /var/log/hobbit and don't see anything strange. There are no core files in the hobbit directory.
Any advice on where to start? All my network tests are now purple.
Is there a "bbtest-net" and/or "fping" process that hangs? If there is, it would be interesting to attach to it with "gdb" and see what it is doing. Alternatively, kill it with "kill -6", which will trigger a core dump in ~hobbit/data/tmp/. You can then run the core dump through gdb, which might give me an idea of what it was doing.
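A minimal sketch of those steps, assuming a Linux box with ps, awk and gdb installed. The helper name test_net_pids is made up for illustration and is not part of Hobbit; the bbtest-net path is the one visible in the process listing later in this thread, and CORE stands for whatever core file turns up in ~hobbit/data/tmp/. "kill -6" sends SIGABRT, and a core file is only written if the core size limit of the process allows it.

```shell
#!/bin/sh
# test_net_pids (hypothetical helper): read "PID COMMAND" lines on stdin,
# as printed by `ps -u hobbit -o pid=,comm=`, and print the PIDs of any
# bbtest-net or fping processes.
test_net_pids() {
    awk '$2 == "bbtest-net" || $2 == "fping" { print $1 }'
}

# Typical use, as root or as the hobbit user:
#   ps -u hobbit -o pid=,comm= | test_net_pids
#
# For a PID that stays around much longer than a normal test run:
#   gdb -p PID      # attach; "bt" prints a backtrace, "detach" lets it go
#   kill -6 PID     # or send SIGABRT to force a core dump
#   gdb /usr/local/hobbit/server/bin/bbtest-net ~hobbit/data/tmp/CORE
#                   # then "bt" to see where it was stuck
```

The helper only lists candidates; deciding whether a process is really hung is still a judgment call based on how long it has been running.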
You can also try su'ing to the hobbit user and run the command
bbcmd bbtest-net --debug host1 host2
(replace "host1" and "host2" with a couple of the hosts in your bb-hosts file).
Are DNS lookups working on this box? That is one of the few things that can cause the network tests to slow down dramatically, although they ought to time out automatically. The same goes for the other commands that run as part of the network tests (rpc and ntp queries).
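One rough way to check for slow lookups is to time the resolution of every host in bb-hosts. The sketch below assumes bb-hosts host lines have the form "IP-address hostname # tags" and that getent is available (glibc systems); the function names list_hosts and time_lookups are made up for illustration. getent goes through the normal resolver order, so names found in /etc/hosts will answer instantly.

```shell
#!/bin/sh
# list_hosts FILE (hypothetical helper): print the hostname column of
# bb-hosts style lines, i.e. lines whose first field looks like an IPv4
# address. Comments and page/group directives are skipped.
list_hosts() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && NF >= 2 { print $2 }' "$1"
}

# time_lookups FILE (hypothetical helper): resolve each host and print
# "SECONDS HOST", slowest first. Resolution is one second, which is
# enough to spot lookups that stall until a resolver timeout.
time_lookups() {
    list_hosts "$1" | while read -r host; do
        start=$(date +%s)
        getent hosts "$host" >/dev/null 2>&1
        end=$(date +%s)
        echo "$((end - start)) $host"
    done | sort -rn
}

# Example (the bb-hosts path is an assumption based on the server
# directory used elsewhere in this thread):
#   time_lookups /usr/local/hobbit/server/etc/bb-hosts | head
```

Hosts that show up at the top with several seconds each are the ones worth checking against the name server.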
Henrik,
Thanks for the tips. DNS works fine and the network tests seem to work fine when I try them by hand. My system is pretty much idle and I don't see anything nasty in the system logs. I have included the process table and the lsof output for the bbtest-net process. When I did the kill -6 on the network process, it worked once and then stopped again. I ran strings on the core and may have found a machine with slow DNS resolution. I am keeping my fingers crossed.
Regards, Jim
[root at bigbrother etc]# ps -ef | grep hobbit
hobbit 18470 1 0 15:10 ? 00:00:00 /usr/local/hobbit/server/bin/hobbitlaunch --config=/usr/local/hobbit/server/etc/hobbitlaunch.cfg --env=/usr/local/hobbit/server/etc/hobbitserver.cfg --log=/var/log/hobbit/hobbitlaunch.log --pidfile=/var/log/hobbit/hobbitlaunch.pid
hobbit 18471 18470 0 15:10 ? 00:00:05 hobbitd --pidfile=/var/log/hobbit/hobbitd.pid --restart=/usr/local/hobbit/server/tmp/hobbitd.chk --checkpoint-file=/usr/local/hobbit/server/tmp/hobbitd.chk --checkpoint-interval=600 --log=/var/log/hobbit/hobbitd.log --admin-senders=127.0.0.1 10.98.200.46
hobbit 18473 18470 0 15:10 ? 00:00:00 hobbitd_channel --channel=stachg --log=/var/log/hobbit/history.log hobbitd_history
hobbit 18474 18473 0 15:10 ? 00:00:00 hobbitd_history
hobbit 18475 18470 0 15:10 ? 00:00:01 hobbitd_channel --channel=page --log=/var/log/hobbit/page.log hobbitd_alert --checkpoint-file=/usr/local/hobbit/server/tmp/alert.chk --checkpoint-interval=600
hobbit 18476 18475 0 15:10 ? 00:00:00 hobbitd_alert --checkpoint-file=/usr/local/hobbit/server/tmp/alert.chk --checkpoint-interval=600
hobbit 18477 18470 0 15:10 ? 00:00:19 hobbitd_channel --channel=status --log=/var/log/hobbit/rrd-status.log hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit 18478 18470 0 15:10 ? 00:00:00 hobbitd_channel --channel=data --log=/var/log/hobbit/rrd-data.log hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit 18479 18470 0 15:10 ? 00:00:00 hobbitd_channel --channel=client --log=/var/log/hobbit/clientdata.log hobbitd_client
hobbit 18480 18478 0 15:10 ? 00:00:00 hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit 18481 18477 0 15:10 ? 00:00:14 hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit 18482 18479 0 15:10 ? 00:00:00 hobbitd_client
hobbit 18634 18470 0 15:20 ? 00:00:00 bbtest-net --report --ping --checkresponse --timeout=60 --debug
hobbit 21820 1 0 22:02 ? 00:00:00 sh -c vmstat 300 2 1>/usr/local/hobbit/client/tmp/hobbit_vmstat.21809 2>&1; mv /usr/local/hobbit/client/tmp/hobbit_vmstat.21809 /usr/local/hobbit/client/tmp/hobbit_vmstat
hobbit 21821 21820 0 22:02 ? 00:00:00 vmstat 300 2
root 21861 21698 0 22:06 pts/0 00:00:00 grep hobbit
[root at bigbrother etc]# lsof -p 18634
COMMAND     PID USER     FD TYPE DEVICE    SIZE   NODE NAME
bbtest-ne 18634 hobbit  cwd DIR     8,3    4096 376833 /usr/local/hobbit/server
bbtest-ne 18634 hobbit  rtd DIR    8,11    4096      2 /
bbtest-ne 18634 hobbit  txt REG     8,3  170076 393236 /usr/local/hobbit/server/bin/bbtest-net
bbtest-ne 18634 hobbit  mem REG    8,11   61504  80026 /lib/libnss_files-2.3.2.so
bbtest-ne 18634 hobbit  mem REG    8,11   14592  80056 /lib/liblaus.so.1.0.0
bbtest-ne 18634 hobbit  mem REG    8,11   39468  82638 /lib/libpam.so.0.75
bbtest-ne 18634 hobbit  mem REG    8,11   29100  80014 /lib/libcrypt-2.3.2.so
bbtest-ne 18634 hobbit  mem REG     8,7   28672 144356 /usr/lib/libgdbm.so.2.0.0
bbtest-ne 18634 hobbit  mem REG     8,7   59608 144390 /usr/lib/libz.so.1.1.4
bbtest-ne 18634 hobbit  mem REG    8,11   19992  80016 /lib/libdl-2.3.2.so
bbtest-ne 18634 hobbit  mem REG    8,11   79916  80036 /lib/libresolv-2.3.2.so
bbtest-ne 18634 hobbit  mem REG     8,7   78360 272188 /usr/kerberos/lib/libk5crypto.so.3.0
bbtest-ne 18634 hobbit  mem REG     8,7   11072 272178 /usr/kerberos/lib/libcom_err.so.3.0
bbtest-ne 18634 hobbit  mem REG     8,7  391564 272198 /usr/kerberos/lib/libkrb5.so.3.1
bbtest-ne 18634 hobbit  mem REG     8,7   77448 272184 /usr/kerberos/lib/libgssapi_krb5.so.2.2
bbtest-ne 18634 hobbit  mem REG     8,7   57768 144429 /usr/lib/libsasl.so.7.1.11
bbtest-ne 18634 hobbit  mem REG    8,11 1608896  32013 /lib/tls/libc-2.3.2.so
bbtest-ne 18634 hobbit  mem REG    8,11 1104580  80070 /lib/libcrypto.so.0.9.7a
bbtest-ne 18634 hobbit  mem REG    8,11  220772  80071 /lib/libssl.so.0.9.7a
bbtest-ne 18634 hobbit  mem REG     8,7   49304 144433 /usr/lib/liblber.so.2.0.17
bbtest-ne 18634 hobbit  mem REG     8,7  186348 144435 /usr/lib/libldap.so.2.0.17
bbtest-ne 18634 hobbit  mem REG    8,11  115228  80005 /lib/ld-2.3.2.so
bbtest-ne 18634 hobbit   0r CHR     1,3          65675 /dev/null
bbtest-ne 18634 hobbit   1w REG     8,6 5775484 432036 /var/log/hobbit/bb-network.log
bbtest-ne 18634 hobbit   2w REG     8,6 5775484 432036 /var/log/hobbit/bb-network.log
bbtest-ne 18634 hobbit   3u IPv4 219456            UDP bigbrother:35123->n9000sd1.nro.glic.com:domain