[hobbit] RHEL5 and status-board not available bug?
On Tue, 10 Feb 2009 16:39:35 +0100, Henrik wrote:
I'm not completely sure if you believe there is a bug in Xymon, or in the Linux kernel of your RHEL system ...
I think it is in Hobbit. And I have news about it; I'll write more below.
3000 hosts is a fairly large setup. I assume you're doing data collection for graphs for all of these servers, and that you're running version 4.2.x of Xymon.
Correct. I'll try 4.3 in the lab next week, now that I know how the "bug" works.
I would guess that your problems - at least in part - stem from the amount of I/O you're doing for updating all of the RRD-files.
No, that is completely excluded: I already tried disabling all the ext tests. I also tried moving the data to local SCSI disks, and iostat indicates a really low I/O wait.
If the problem persists, then some other explanation must be found.
It must, for sure... it's big trouble seeing 3000 hosts becoming purple, then green, then purple :)
I don't know how the change in "POSIX threading" plays into this. Hobbit is not a threaded application, it is plain and simple single-task application all the way through. It may have some meaning in relation to NFS.
Oops... it is not multithreaded? I'm not a programmer, but... how can it keep up with 3000 hosts sending data without multithreading?
However, here is the news: the problem occurs only on RHEL5 with the x86_64 architecture, with all kinds of 2.6 kernels. On RHEL5 with x86 (32-bit) the bug does not appear. I would like to try Fedora on my notebook... I'll let you know. For us the best resolution is to reinstall everything in 32-bit, and I'm already working on it (the first server is already up, and Hobbit is now working correctly with just this "little" change).
However, the problem also exists in our Hobbit lab (also 64-bit) when stressing Hobbit with more than 20 "virtual hosts". Be sure of one thing: it is not a hardware or bottleneck-related problem. We did have a bottleneck before, on an old machine with really high I/O wait, but with these two new servers we do not.
Is there anyone else on x86_64 with similar problems? And if someone has a Red Hat developer support license, the RH support team has already told me that they can work on it.
Have a nice evening.
On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:
On Tue, 10 Feb 2009 16:39:35 +0100, Henrik wrote:
I'm not completely sure if you believe there is a bug in Xymon, or in the Linux kernel of your RHEL system ...
I think it is in Hobbit. And I have news about it; I'll write more below.
I would guess that your problems - at least in part - stem from the amount of I/O you're doing for updating all of the RRD-files.
No, that is completely excluded: I already tried disabling all the ext tests. I also tried moving the data to local SCSI disks, and iostat indicates a really low I/O wait.
"really low" as in ... how much ? If you're looking at the vmstat output, check the "vmstat1" graph and see how much I/O wait takes up of your cpu time. AND - remember that disk I/O in Linux is a single-processor task, so on a dual-CPU box your I/O system is saturated when your vmstat1 graph shows 50% of the time is spent in I/O wait.
On a quad-cpu box the limit is 25%, obviously.
I also have my RRD files on 10k RPM SCSI disks, hardware raid controller etc. Without the caching in Xymon 4.3, it couldn't keep up with the amount of RRD updates I was feeding it. Which also shows in the fact that flushing the cache - which essentially does the same amount of disk I/O as a full update of all the RRD files - takes about 8 minutes. No chance at all then of keeping up with 5-minute update cycles.
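Taken at face value, that rule of thumb is just 100% divided by the number of CPUs. A minimal sketch of the arithmetic follows; the function name is made up for illustration and is not part of Hobbit/Xymon:

```python
def iowait_saturation_pct(cpu_count: int) -> float:
    """Share of total CPU time (in percent) that one CPU stuck in I/O wait
    would account for, under the assumption stated above that disk I/O
    keeps only a single CPU busy."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return 100.0 / cpu_count


for cpus in (1, 2, 4):
    pct = iowait_saturation_pct(cpus)
    print(f"{cpus} CPU(s): I/O looks saturated at about {pct:.0f}% I/O wait")
```

So a "low" I/O wait figure only means something once it is compared with this per-box ceiling.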
I really think you should try shutting off the hobbitd_rrd tasks, just to see what happens.
If the problem persists, then some other explanation must be found.
It must, for sure... it's big trouble seeing 3000 hosts becoming purple, then green, then purple :)
For hosts to go purple they have to go more than 30 minutes without an update - they don't go purple just because they miss a single update.
I suppose you have checked the kernel logs ('dmesg' output) for anything odd?
I'm wondering if maybe you're running out of ports (there's only 64K of them, only about half can be used by normal apps). How many ports do you have in TIME_WAIT state ?
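One way to answer the TIME_WAIT question without extra tools is to count the entries in /proc/net/tcp whose state column is 06, the kernel's code for TIME_WAIT. This is only a sketch: the function name is illustrative, and it assumes a Linux host where /proc/net/tcp (and optionally /proc/net/tcp6) is readable.

```python
from pathlib import Path

TIME_WAIT = "06"  # hex state code for TIME_WAIT in /proc/net/tcp


def count_time_wait(proc_net_tcp_text: str) -> int:
    """Count sockets in TIME_WAIT in text formatted like /proc/net/tcp."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        # fields: slot, local address, remote address, state, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count


if __name__ == "__main__":
    total = 0
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            total += count_time_wait(path.read_text())
    print(f"sockets in TIME_WAIT: {total}")
```

A count that creeps toward the size of the local port range would support the "running out of ports" theory; a small count would rule it out.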
Another thing is the size of the ARP cache, if your hosts are all on the same IP network or your router/firewall is doing proxy-arp. This could be a problem - I've seen Hobbit break on a system with ~1200 hosts, because the network test would ping all of them, overflowing the ARP cache. This is tunable with
    sysctl net.ipv4.neigh.default.gc_thresh1=3072
    sysctl net.ipv4.neigh.default.gc_thresh2=4096
(see the arp(7) man-page for what these do).
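To see whether the ARP cache is anywhere near those thresholds, the current number of entries can be compared with the configured values. The sketch below assumes a Linux host exposing /proc/net/arp and /proc/sys/net/ipv4/neigh/default/; the helper names are illustrative only.

```python
from pathlib import Path

NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def count_arp_entries(proc_net_arp_text: str) -> int:
    """Count entries in text formatted like /proc/net/arp (one header line)."""
    lines = [ln for ln in proc_net_arp_text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def read_threshold(name: str):
    """Return the integer value of a gc_thresh setting, or None if unreadable."""
    try:
        return int((NEIGH_DIR / name).read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    arp_file = Path("/proc/net/arp")
    entries = count_arp_entries(arp_file.read_text()) if arp_file.exists() else 0
    print(f"ARP entries: {entries}")
    for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
        print(f"{name}: {read_threshold(name)}")
```

Running it while the network tests are pinging all hosts gives the most useful reading, since that is when the cache is fullest.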
Is this server also running the network tests ?
Network-wise, it makes sense to tune a busy Hobbit server in the same manner that you would a very busy webserver (which also has to handle lots of short-lived connections). Another possible tuning parameter would be
    sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT state for new connections. It goes against the recommended way of doing TCP, but unless you're running Hobbit over high-latency networks it should not cause any problems.
I don't know how the change in "POSIX threading" plays into this. Hobbit is not a threaded application, it is plain and simple single-task application all the way through. It may have some meaning in relation to NFS.
Oops... it is not multithreaded? I'm not a programmer, but... how can it keep up with 3000 hosts sending data without multithreading?
By avoiding all the overhead of using threads :-)
Seriously, 3000 hosts on a 5-minute cycle is only 10 hosts/second. Each host triggers perhaps 5-10 connections (e.g. an old client reporting cpu,disk,memory,msgs,procs,conn), and since the core daemon isn't doing any disk I/O handling 50-100 connections per second isn't that big a deal.
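Spelled out, the estimate above is simple arithmetic. The function name below is made up for illustration; the numbers are the ones quoted in this message.

```python
def connections_per_second(hosts: int, cycle_seconds: int, conns_per_host: int) -> float:
    """Average incoming connection rate if every host reports once per cycle."""
    return hosts / cycle_seconds * conns_per_host


HOSTS = 3000
CYCLE_SECONDS = 5 * 60  # 5-minute update cycle

print(f"hosts per second: {HOSTS / CYCLE_SECONDS:.0f}")
low = connections_per_second(HOSTS, CYCLE_SECONDS, 5)
high = connections_per_second(HOSTS, CYCLE_SECONDS, 10)
print(f"connections per second: {low:.0f} to {high:.0f}")
```

This is an average; clients that all report at the same moment would produce bursts above it.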
However, here is the news: the problem occurs only on RHEL5 with the x86_64 architecture, with all kinds of 2.6 kernels. On RHEL5 with x86 (32-bit) the bug does not appear.
It's quite odd that there is a problem on x86-64, but not on x86-32. One (I) would expect the 64-bit systems to have a bit more "oomph" so they should be the ones that worked best.
A datapoint here. I'm also running Hobbit on a 64-bit Linux platform, but it is using SPARC (Sun) hardware. Kernel is 2.6.18-6-sparc64. This hardware is *ancient* (about 10 years old), but handles twice the number of hosts and statuses that you have. I do have the RRD's on a different server, though.
However, the problem also exists in our Hobbit lab (also 64-bit) when stressing Hobbit with more than 20 "virtual hosts".
So you're saying that on a RHEL 5.3 64-bit Intel server, setting up Hobbit and feeding it with data from ~20 clients will make the system break?
I think I would have heard about it before if this was a general problem.
Regards, Henrik
participants (2)
- flyzone@technologist.com
- henrik@hswn.dk