hobbit-client: segfault on several systems
Hi all,
I'm currently testing hobbit and I'm quite satisfied with it. Great monitoring tool! Hobbit will definitely replace our BigBrother installation.
But yesterday a somewhat strange error occurred. About 15 hobbit clients suddenly died (all at the same time, +/- 5 min). They all wrote a core file.
Some clients reported a segfault:
hobbitlaunch[15432]: segfault at 00007fff75d42a48 rip 00000000004047ad rsp 00007fff75d42a50 error 6
I compiled the clients from snapshot version 20071026. The server is already version 20080115.
The client systems are Debian Linux Sarge/Etch and Solaris.
gdb output (gdb ./client/bin/hobbitlaunch core):
GNU gdb 6.4.90-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

Reading symbols from /lib/tls/i686/cmov/libc.so.6...done.
Loaded symbols for /lib/tls/i686/cmov/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Core was generated by `/opt/hobbit/client/bin/hobbitlaunch --config=/opt/hobbit/client/etc/clientlaunc'.
Program terminated with signal 11, Segmentation fault.
#0 errprintf (fmt=0x805507c "Time warp detected: Adjusting returned clock by %d seconds\n") at errormsg.c:42
42      errormsg.c: No such file or directory.
        in errormsg.c
(gdb)
(gdb) bt
#0 errprintf (fmt=0x805507c "Time warp detected: Adjusting returned clock by %d seconds\n") at errormsg.c:42
#1 0x08050c8e in getcurrenttime (retparm=0x0) at timefunc.c:55
#2 0x0804c10b in errprintf (fmt=0x805507c "Time warp detected: Adjusting returned clock by %d seconds\n") at errormsg.c:42
#3 0x08050c8e in getcurrenttime (retparm=0x0) at timefunc.c:55
#4 0x0804c10b in errprintf (fmt=0x805507c "Time warp detected: Adjusting returned clock by %d seconds\n") at errormsg.c:42
#5 0x08050c8e in getcurrenttime (retparm=0x0) at timefunc.c:55
#6 0x0804c10b in errprintf (fmt=0x805507c "Time warp detected: Adjusting returned clock by %d seconds\n") at errormsg.c:42
#7 0x08050c8e in getcurrenttime (retparm=0x0) at timefunc.c:55
[cut]
#3966 0x0804c10b in errprintf (fmt=0x805507c "Time warp detected: Adjusting returned clock by %d seconds\n") at errormsg.c:42
#3967 0x08050c8e in getcurrenttime (retparm=0x0) at timefunc.c:55
#3968 0x0804ac2d in main (argc=4, argv=0x0) at hobbitlaunch.c:536
Are there any suggestions? Any ideas?
Thank you for your help!
Regards, Alexander
Alexander Keller wrote:
But yesterday a somewhat strange error occurred. About 15 hobbit clients suddenly died (all at the same time, +/- 5 min). They all wrote a core file.
Some clients reported a segfault: [snip]
Are your clients or server running on virtual machines of any kind?
The clients are all physical machines, the server indeed runs in a Xen domain (Debian Etch amd64 with Xen 3.0.3).
Regards, Alexander
On Sat, Jan 26, 2008 at 11:51:02AM +0100, Alexander Keller wrote:
But yesterday a somewhat strange error occurred. About 15 hobbit clients suddenly died (all at the same time, +/- 5 min). They all wrote a core file.
[snip]
Core was generated by `/opt/hobbit/client/bin/hobbitlaunch --config=/opt/hobbit/client/etc/clientlaunc'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0 errprintf (fmt=0x805507c "Time warp detected: Adjusting returned clock by %d seconds\n") at errormsg.c:42
#1 0x08050c8e in getcurrenttime (retparm=0x0) at timefunc.c:55
#2 0x0804c10b in errprintf (fmt=0x805507c "Time warp detected: Adjusting returned clock by %d seconds\n") at errormsg.c:42
#3 0x08050c8e in getcurrenttime (retparm=0x0) at timefunc.c:55
#4 0x0804c10b in errprintf (fmt=0x805507c "Time warp detected: Adjusting returned clock by %d seconds\n") at errormsg.c:42
hobbitlaunch ended up doing an endless recursion, thereby filling up the stack and crashing.
It was caused by the clock on your boxes suddenly being set back. Normally, time cannot go backwards (unless your systems are moving faster than light, which - according to Einstein - is impossible). Hobbit detects when it happens, but couldn't quite handle it.
I've attached a patch that should prevent this from happening again.
Regards, Henrik
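The attached patch is not reproduced in this archive. Purely as an illustration of the failure mode described above, here is a minimal sketch in C of how a clock wrapper and a timestamping logger can call each other, and how a re-entrancy guard breaks the cycle. All names (get_current_time_safe, log_error, fake_clock, warp_reports) are hypothetical stand-ins chosen for this example; they are not Hobbit's actual functions, and clamping to the last returned time is an assumption, not necessarily what the real patch does.

```c
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical example, not Hobbit source code. */

time_t fake_clock = 0;    /* if non-zero, used instead of time(NULL) to simulate a clock step */
time_t last_seen = 0;     /* latest timestamp handed out so far */
int in_timewarp = 0;      /* re-entrancy guard: set while a time warp is being reported */
int warp_reports = 0;     /* number of time warps reported */

time_t get_current_time_safe(void);

/* Raw clock source; the override only exists so the example can be exercised. */
static time_t raw_clock(void)
{
    return fake_clock ? fake_clock : time(NULL);
}

/* The logger timestamps each message, so it calls back into the clock wrapper. */
void log_error(const char *fmt, ...)
{
    va_list args;
    time_t now = get_current_time_safe();

    fprintf(stderr, "%ld ", (long)now);
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
}

/* Returns a time that never goes backwards, reporting a warp at most once per call chain. */
time_t get_current_time_safe(void)
{
    time_t now = raw_clock();

    if (now < last_seen) {
        /* Without this guard, log_error() would re-enter here, see the same
         * warp, log again, and recurse until the stack overflows. */
        if (!in_timewarp) {
            in_timewarp = 1;
            warp_reports++;
            log_error("Time warp detected: clock went back %ld seconds\n",
                      (long)(last_seen - now));
            in_timewarp = 0;
        }
        return last_seen;  /* clamp: never return a time earlier than one already returned */
    }

    last_seen = now;
    return now;
}
```

Setting fake_clock to 1000 and then to 400 before calling get_current_time_safe() simulates a clock being set back by 600 seconds: the warp is logged once and the function keeps returning 1000 until the raw clock catches up.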
hhhhm ok. I must admit that even our newest Opteron boxes are not moving faster than light ;-)
Your patch is already applied and so far everything runs smoothly.
Thanks!
Alexander
participants (3)
- henrik@hswn.dk
- hobbit@alexkeller.de
- jonescr@cisco.com