RHEL5 and status-board not available bug?
Well... we think it's a big bug, where 'we' is me and Red Hat support. Of course I'm talking about Linux and not the Solaris bug, and my kernel parameters are OK.
I moved from RHEL 4.5 with kernel 2.6.9-55 to RHEL 5.3 with kernel 2.6.18-128, with bonded (active-passive) gigabit Ethernet and the xymon data stored on NFS in a Veritas cluster. The xymon server handles 3000 hosts and about 17093 status messages. The problem is the timeouts: the hobbit status page goes all green, and the pages are sometimes slow to load or show "Status not available".
Speaking with Red Hat premium support, I sent them a trace of the error (about 40MB gzipped...), and according to them the cause is a bug in thread management: in RHEL5 it is no longer possible to use the old POSIX threading implementation, only the Linux threading "version". Of course I have lost some of the details... sorry, but I'm not a programmer. They completely rule out a problem with the NFS share: xymon's throughput is a stable 30KB/s or so, while network tests indicate that 50-78MB/s is possible. However, I did have to modify the mount options to avoid many setattr calls.
As a workaround I have modified the sendmessage call in the lib folder so that it repeats the send when it times out:

    if (res == BB_ETIMEOUT) {
        usleep(5);
        res = sendtomany((recipient ? recipient : bbdisp), xgetenv("BBDISPLAYS"), msg, respfd, respstr, fullresponse, timeout);
    }

This of course increases the busy time, but the "all system green" problem has not come back. I'm running xymon 4.2.0 with allinonepatch; xymon 4.2.2 doesn't seem to have any changes in this area, but I'll try it in the next few days. One other issue: when shutting down xymon I always need to clean up with ipcrm because the segments are still present. There is nothing more in the logs, just the "status-board not available" message.
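For readers who want to experiment with the same idea outside the Xymon source, here is a minimal, self-contained sketch of a bounded retry on timeout. It is not the Xymon code: `send_with_retry`, `send_fn`, `flaky_send`, `pause_ms` and the `DEMO_*` result codes are hypothetical names made up for illustration, and the real `BB_ETIMEOUT` value and `sendtomany` signature come from Xymon's own headers. Note that `usleep()` takes microseconds, so `usleep(5)` pauses for only 5 microseconds; this sketch assumes a longer pause that doubles after each timeout, and a cap on the number of attempts.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical result codes; the real ones are defined by Xymon. */
enum { DEMO_OK = 0, DEMO_ETIMEOUT = 1 };

/* Hypothetical signature standing in for the real send call. */
typedef int (*send_fn)(const char *msg, void *ctx);

/* Sleep for the given number of milliseconds. */
static void pause_ms(long ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/*
 * Call send() until it returns something other than a timeout,
 * at most max_tries times. The pause between attempts starts at
 * first_pause_ms and doubles after every timeout.
 */
static int send_with_retry(send_fn send, void *ctx, const char *msg,
                           int max_tries, long first_pause_ms)
{
    int res = DEMO_ETIMEOUT;
    long pause = first_pause_ms;
    int attempt;

    assert(send != NULL);
    for (attempt = 0; attempt < max_tries; attempt++) {
        res = send(msg, ctx);
        if (res != DEMO_ETIMEOUT)
            break;                    /* success or a non-timeout error */
        if (attempt + 1 < max_tries) {
            pause_ms(pause);          /* back off before the next try */
            pause *= 2;
        }
    }
    return res;
}

/* Stand-in sender for testing: times out until *remaining hits zero. */
static int flaky_send(const char *msg, void *ctx)
{
    int *remaining = ctx;
    (void)msg;
    if (*remaining > 0) {
        (*remaining)--;
        return DEMO_ETIMEOUT;
    }
    return DEMO_OK;
}
```

With a counter set to 2 and `max_tries` of 5, `send_with_retry(flaky_send, &counter, "status", 5, 1)` makes three calls and returns `DEMO_OK`. Capping the attempts keeps a dead server from blocking the sender forever, which a plain loop on the timeout code would do.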
If someone has already hit this issue (it doesn't seem to be in the past posts), please give me a tip... Ah, here are my kernel parameters:
------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 67108864
max total shared memory (kbytes) = 17179869184
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 100
semaphore max value = 32767

------ Messages: Limits --------
max queues system wide = 16
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536
Thanks in advance.
I'm not completely sure if you believe there is a bug in Xymon, or in the Linux kernel of your RHEL system ... But I have a few comments.
On Tue, Feb 10, 2009 at 07:35:24AM +0000, Flyzone Micky wrote:
> Well... we think it's a big bug, where 'we' is me and Red Hat support. Of course I'm talking about Linux and not the Solaris bug, and my kernel parameters are OK.
> I moved from RHEL 4.5 with kernel 2.6.9-55 to RHEL 5.3 with kernel 2.6.18-128, with bonded (active-passive) gigabit Ethernet and the xymon data stored on NFS in a Veritas cluster. The xymon server handles 3000 hosts and about 17093 status messages. The problem is the timeouts: the hobbit status page goes all green, and the pages are sometimes slow to load or show "Status not available".
3000 hosts is a fairly large setup. I assume you're doing data collection for graphs for all of these servers, and that you're running version 4.2.x of Xymon.
I would guess that your problems - at least in part - stem from the amount of I/O you're doing for updating all of the RRD-files. I know from personal experience that heavy disk I/O can cause network connections in Xymon to time out. Having your data on a network-filesystem is different from what I've tried, but it could make this problem worse - because the I/O is now entirely handled by the Linux kernel, whereas with a local disk for storage at least some of the I/O is handled by the disk controller.
What you could try - at least for a short period - would be to stop the [rrdstatus] and [rrddata] tasks in hobbitlaunch.cfg. This stops data from being collected into the graphs, but it will also reduce your disk I/O to practically nil. If your system then starts behaving properly, then we need to look at reducing the load from your RRD updates (I have a couple of suggestions). If the problem persists, then some other explanation must be found.
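As a rough sketch of that test, assuming the DISABLED keyword described in the hobbitlaunch.cfg manual page is available in your version, the two sections could be switched off as shown here. The comment lines are placeholders for the ENVFILE, NEEDS and CMD lines your file already contains; they are left out because they differ between installations.

```
[rrdstatus]
	DISABLED
	# ... keep the existing ENVFILE, NEEDS and CMD lines unchanged ...

[rrddata]
	DISABLED
	# ... keep the existing ENVFILE, NEEDS and CMD lines unchanged ...
```

Removing the two DISABLED lines afterwards restores graph collection, so nothing else in the file needs to be touched for the experiment.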
> Speaking with Red Hat premium support, I sent them a trace of the error (about 40MB gzipped...), and according to them the cause is a bug in thread management: in RHEL5 it is no longer possible to use the old POSIX threading implementation, only the Linux threading "version". Of course I have lost some of the details... sorry, but I'm not a programmer.
I don't know how the change in "POSIX threading" plays into this. Hobbit is not a threaded application; it is a plain and simple single-task application all the way through. It may have some meaning in relation to NFS.
Regards, Henrik