[Xymon] external script problem - question

5 Oct 2011

      On 05-10-2011 17:41, Steve Holmes wrote:
...
> The test is an external script which basically does a 'sudo touch foo'
> on each file system and waits for it to either return with no problem,
> or return with an error indicating that the file system is read-only, or
> after 60 seconds declares that the file system is 'hung'.
[snip]
> The problem with the test is that once or twice a day we get a flurry of
> alerts from a dozen or so servers, all at about the same time reporting
> that there is a hung file system. Other file systems on the same server
> are reporting that it takes longer to do the touch than we think it
> should (e.g. 12 to 25 or even 60 seconds). The alerts all go away the
> next test cycle. The file systems are on local fiber channel disks (i.e.
> not NFS mounted). The servers getting the alerts are not all VMs and it
> is not always the same set of servers that show up.

Sounds nasty. Troubleshooting that kind of "only happens occasionally"
problem is really difficult.
OK, off the top of my head, here are some ideas:

sudo - what kind of user authentication are you using? If it's LDAP
or NIS, could that explain why the test suddenly takes longer?
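
One rough way to check this is to time a plain account lookup on its own, separately from the touch. The sketch below is only an illustration: the function name time_user_lookup is made up, and it exercises the name service lookup (files, LDAP, NIS, whatever the system is configured for) rather than sudo's full authentication path, so a fast result here does not rule sudo out.

```python
import os
import pwd
import time


def time_user_lookup(uid=None):
    """Return (seconds, found) for one passwd lookup via the system's name service.

    Hypothetical helper: it times only the account lookup, not sudo itself.
    """
    if uid is None:
        uid = os.getuid()
    start = time.monotonic()
    try:
        pwd.getpwuid(uid)
        found = True
    except KeyError:
        # No passwd entry for this uid; the lookup still took measurable time.
        found = False
    return time.monotonic() - start, found


if __name__ == "__main__":
    elapsed, found = time_user_lookup()
    print(f"lookup took {elapsed:.3f}s (entry found: {found})")
```

Logging this next to the touch timing would show whether the slow cycles line up with slow lookups. A local cache such as nscd may hide a slow directory server most of the time.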

clocks - how do you measure the time it takes to run the test? If you
just use "date" before and after the touch command, what happens if your
servers' clocks are stepped (jump a few seconds) while the test is
running? In my experience, clocks on virtual machines are horrible at
keeping correct time and can quite easily skip a couple of seconds if
set to follow the clock of the host OS.
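
To take clock steps out of the picture, the elapsed time can be taken from a monotonic clock instead of the wall clock. This is a minimal sketch, not the script from the original post: the function name timed_command and the "ok" / "error" / "hung" labels are assumptions made for illustration, and the path in the usage part is a placeholder.

```python
import subprocess
import time


def timed_command(cmd, timeout=60.0):
    """Run cmd and return (status, elapsed_seconds).

    Elapsed time comes from time.monotonic(), which is not affected
    when the wall clock is stepped during the run.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        status = "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        # The command did not finish within the timeout.
        status = "hung"
    return status, time.monotonic() - start


if __name__ == "__main__":
    # Placeholder path; substitute a file on the file system under test.
    status, elapsed = timed_command(["touch", "/tmp/xymon-fs-probe"], timeout=60.0)
    print(f"{status} after {elapsed:.2f}s")
```

Two caveats: on timeout, subprocess.run tries to kill the child and then waits for it, so a process stuck in uninterruptible I/O can still hold the call up, and if the command is run through sudo the kill may not be permitted for an unprivileged caller.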

Have you looked at the vmstat1 graphs for these systems? How is the
"I/O wait" on them? Some types of I/O on Linux systems can cause quite
a slow-down; deleting large files on ext2 or ext3 filesystems could be
quite time-consuming and cause the whole system to really stall. Also,
doing things that touch a lot of files - a large find, or grep'ing through
a large number of files, especially if you don't mount filesystems with
the "noatime" option - can cause a lot of I/O that slows down filesystem
operations.
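
If the graphs are too coarse to catch a short spike, the test script itself could record the I/O wait around each touch. The sketch below assumes a Linux /proc/stat, where the first line starts with "cpu" and the fifth counter after it is iowait; the function names are made up for illustration.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # user, nice, system, idle, iowait, irq, softirq, steal; guest time is
    # already counted in user, so it is left out of the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_sample(path="/proc/stat"):
    """Read the current aggregate CPU counters (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


if __name__ == "__main__":
    first = read_cpu_sample()
    time.sleep(5)
    second = read_cpu_sample()
    print(f"iowait over the interval: {iowait_percent(first, second):.1f}%")
```

Taking one sample just before the touch and one just after would show whether the slow cycles coincide with the machine waiting on disk.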

I've seen VMware Workstation consistently bring a system to its knees
when a VM was being shut down. Apparently some bad interaction between
the kernel version (2.6.18, if memory serves me right) and the way it
was updating the virtual disk images - it would just churn away for 5 or
10 minutes doing nothing but hitting the disk. Disappeared when I
upgraded the kernel on the box. No idea if your combination of RHEL and
ESX could do the same thing. But it was quite reproducible here, so it
should be easy to spot.

Just some thoughts.
Regards,
Henrik

henrik＠hswn.dk