Hi,
the xymon.com server had a minor disk "hiccup" last Saturday. Unfortunately, this triggered a kernel panic and things went pretty bad after that - eventually causing the whole server to die last Monday Aug. 20th.
Unfortunately, I was 4000 km away - and even though I did manage to get an SSH session opened to the server, all attempts to reboot it remotely just gave me a "Bus error".
My apologies for the inconvenience. It was a classic case of Murphy's Law: everything goes bad at the worst possible time.
I expect the mails submitted to the mailing list will show up over the next 24 hours or so, once the various mailservers retry their connections to xymon.com.
Regards, Henrik
Thanks for the notice!
Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373
On 27-08-2012 17:19, Henrik Størner wrote:
the xymon.com server had a minor disk "hiccup" last Saturday. Unfortunately, this triggered a kernel panic and things went pretty bad after that - eventually causing the whole server to die last Monday Aug. 20th.
Turns out it was more than just a hiccup - I was bitten by a firmware bug in my Crucial M4 SSD: http://forum.crucial.com/t5/Solid-State-Drives-SSD/Firmware-Update-Notificat...
"an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point."
If any of you have Crucial M4 SSDs in use, I'd recommend checking the firmware version ASAP - it must be version 0309 or 000F. Running "smartctl -a" against the drive on Linux will show it.
Regards, Henrik
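For anyone who wants to script the firmware check described above, here is a minimal sketch in Python. It assumes the drive is an ATA device, so that "smartctl -a" prints "Device Model:" and "Firmware Version:" lines and a "Power_On_Hours" attribute row. The function names, the test for "M4" in the model string, and the set of accepted versions (taken only from the message above, so later firmware releases are not covered) are illustrative assumptions, not part of smartmontools.

```python
import re
import subprocess

# Versions named as OK in the message above; newer releases are not covered here.
FIXED_FIRMWARE = {"0309", "000F"}
# Power-on time at which the bug appears, per the quoted Crucial notice.
BUG_THRESHOLD_HOURS = 5184


def parse_smartctl(text):
    """Extract model, firmware and power-on hours from `smartctl -a` text output."""
    info = {"model": None, "firmware": None, "power_on_hours": None}
    for line in text.splitlines():
        if line.startswith("Device Model:"):
            info["model"] = line.split(":", 1)[1].strip()
        elif line.startswith("Firmware Version:"):
            info["firmware"] = line.split(":", 1)[1].strip()
        elif "Power_On_Hours" in line:
            # The raw value is the last column of the attribute row.
            match = re.search(r"\d+", line.split()[-1])
            if match:
                info["power_on_hours"] = int(match.group(0))
    return info


def needs_update(info):
    """True if this looks like an M4 whose firmware is not in FIXED_FIRMWARE.

    Matching on "M4" in the model string is an assumption for illustration.
    """
    model = (info["model"] or "").upper()
    return "M4" in model and info["firmware"] not in FIXED_FIRMWARE


def check_device(device):
    """Hypothetical helper: run smartctl on one device (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-a", device], capture_output=True, text=True
    )
    return parse_smartctl(result.stdout)
```

On a host with smartmontools installed, something like needs_update(check_device("/dev/sda")) would flag a drive worth a closer look; the device path is only an example. For scale, 5184 hours is 216 days of power-on time, so a server running continuously reaches the threshold about seven months after the drive is installed.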