In <A3D12FAD74FC8B46991703F40C182BAB01078343 at permls102.wde.woodside.com.au> "Everett, Vernon" <Vernon.Everett at woodside.com.au> writes:
My Hobbit server crashed and died.
This happened before, a few months ago, and I shrugged it off - sometimes sh1t happens. Then it happened last week again. This time I was concerned. Now it has just happened again, about 40 minutes ago.
I tried to restart hobbit, without much luck, then I walked away, put my son into bed, and then tried again. This time it worked.
The logs never showed anything conclusive, but maybe I just don't know what I am looking for.
The symptoms were the same all three times. All "passive" server-based tests go purple. By passive server-based, I mean conn, http, content, ssh, ftp, ftps, etc.: the tests that do not rely on a client. bbd and bbtest also went purple.
All client-based tests were unaffected. Graphing worked as normal, and alerts were being sent out.
Your description sounds very much as if the only thing that stopped were the network tests (bbtest-net). Since the client-side tests are updating, network tests go purple and alerts go out, I think that is where the problem is. "bbtest" going purple also points in this direction.
Next time it happens, see if there's a "bbtest-net" process running (and possibly a "hobbitping" or "fping" process as well); if there is, kill it with "kill -6" to make it dump core. Then do the usual stuff of getting a stack trace from the core file (http://www.hswn.dk/hobbit/help/known-issues.html#bugreport).
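The check described above could be scripted roughly as follows. This is only a sketch and not part of Hobbit: the function names (find_suspects, list_running_suspects, abort_with_core) are made up for illustration, and it assumes a POSIX-style "ps" that accepts "-e -o pid= -o comm=". Note that a core file is only written if the process's core size limit (ulimit -c) is not zero.

```python
import os
import signal
import subprocess

# Process names taken from the advice above.
SUSPECTS = ("bbtest-net", "hobbitping", "fping")


def find_suspects(ps_output, names=SUSPECTS):
    """Return (pid, command) pairs for lines of 'PID COMMAND' text
    whose command basename is one of the given names."""
    found = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        pid, comm = int(parts[0]), parts[1].strip()
        if os.path.basename(comm) in names:
            found.append((pid, comm))
    return found


def list_running_suspects():
    """Run ps (assumed POSIX-style) and return the matching processes."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_suspects(out)


def abort_with_core(pid):
    """Send SIGABRT (signal 6), the equivalent of 'kill -6 <pid>'."""
    os.kill(pid, signal.SIGABRT)
```

Calling list_running_suspects() while the tests are purple shows whether anything is hung; abort_with_core(pid) is kept as a separate step so that nothing gets killed until you have looked at the list.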
Are you running bbtest-net with the "--no-ares" option? If so, a hung or slow DNS server can make your network tests run very slowly.
Henrik