I'm not completely sure if you believe there is a bug in Xymon, or in the Linux kernel of your RHEL system ... But I have a few comments.
On Tue, Feb 10, 2009 at 07:35:24AM +0000, Flyzone Micky wrote:
Well... We think it's a big bug, where 'we' is me and RedHat support. Of course I'm speaking of Linux and not about the Solaris bug, and my kernel parameters are OK.
I moved from a rhel4.5 with kernel 2.6.9-55 to a rhel5.3 with kernel 2.6.18-128 with bonding (active-passive) gigabit ethernet, and NFS files storing the xymon data in a Veritas cluster. The xymon server gets 3000 hosts and about 17093 status messages. The problem is the timeouts: the hobbit status page goes green, and the pages are sometimes slow to load or give a "Status not available".
3000 hosts is a fairly large setup. I assume you're doing data collection for graphs for all of these servers, and that you're running version 4.2.x of Xymon.
I would guess that your problems - at least in part - stem from the amount of I/O you're doing for updating all of the RRD-files. I know from personal experience that heavy disk I/O can cause network connections in Xymon to time out. Having your data on a network-filesystem is different from what I've tried, but it could make this problem worse - because the I/O is now entirely handled by the Linux kernel, whereas with a local disk for storage at least some of the I/O is handled by the disk controller.
What you could try - at least for a short period - would be to stop the [rrdstatus] and [rrddata] tasks in hobbitlaunch.cfg. This stops data from being collected into the graphs, but it will also reduce your disk I/O to practically nil. If your system then starts behaving properly, we need to look at reducing the load from your RRD updates (I have a couple of suggestions). If the problem persists, then some other explanation must be found.
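As a rough sketch of what I mean, assuming your hobbitlaunch.cfg still has the stock [rrdstatus] and [rrddata] sections: adding the DISABLED keyword to each section should keep hobbitlaunch from running those tasks. The comment lines are only placeholders for whatever your file already contains; leave your existing settings untouched.

```
[rrdstatus]
	DISABLED
	# your existing ENVFILE, NEEDS and CMD lines stay as they are

[rrddata]
	DISABLED
	# your existing ENVFILE, NEEDS and CMD lines stay as they are
```

hobbitlaunch should notice the change by itself after a short while; if the two tasks are still running after a few minutes, restart Hobbit. Remove the DISABLED lines again to resume graph collection.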
Speaking with Redhat premium support, I sent them a trace of the error (about 40MB gzip...) and for them the cause is a bug in the thread management, because in RHEL5 it is no longer possible to use the old POSIX implementation of threading; one has to use just the Linux Threading "version". Of course I have lost some of the sentences... sorry, but I'm not a programmer.
I don't know how the change in "POSIX threading" plays into this. Hobbit is not a threaded application; it is a plain and simple single-task application all the way through. The threading change may have some meaning in relation to NFS.
Regards, Henrik