On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:
On Tue, 10 Feb 2009 16:39:35 +0100, Henrik wrote:
I'm not completely sure if you believe there is a bug in Xymon, or in the Linux kernel of your RHEL system ...
I think it is in Hobbit. And I have news about it, which I'll write further down.
I would guess that your problems - at least in part - stem from the amount of I/O you're doing for updating all of the RRD-files.
No, that is ruled out entirely; I already tried disabling all the ext tests. I also tried moving the data to local SCSI disks, and iostat indicates a really low I/O wait.
"really low" as in ... how much ? If you're looking at the vmstat output, check the "vmstat1" graph and see how much I/O wait takes up of your cpu time. AND - remember that disk I/O in Linux is a single-processor task, so on a dual-CPU box your I/O system is saturated when your vmstat1 graph shows 50% of the time is spent in I/O wait.
On a quad-cpu box the limit is 25%, obviously.
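As a rough illustration of that rule of thumb, here is a small sketch. The function name is made up for this example, and it simply assumes the model described above, where disk I/O keeps only one CPU busy at a time:

```python
def iowait_saturation_pct(num_cpus):
    """I/O-wait percentage at which disk I/O is saturated, assuming
    (as described above) that disk I/O occupies only one CPU."""
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    return 100.0 / num_cpus


for cpus in (1, 2, 4):
    pct = iowait_saturation_pct(cpus)
    print(f"{cpus} CPU(s): saturated at about {pct:.0f}% I/O wait")
```

If the vmstat1 graph sits near the figure for your CPU count, the disks are probably the bottleneck.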
I also have my RRD files on 10k RPM SCSI disks, hardware raid controller etc. Without the caching in Xymon 4.3, it couldn't keep up with the amount of RRD updates I was feeding it. Which also shows in the fact that flushing the cache - which essentially does the same amount of disk I/O as a full update of all the RRD files - takes about 8 minutes. No chance at all then of keeping up with 5-minute update cycles.
I really think you should try shutting off the hobbitd_rrd tasks, just to see what happens.
If the problem persists, then some other explanation must be found.
It must, for sure... it's big trouble to see 3000 hosts becoming purple, then green, then purple :)
For hosts to go purple they have to go more than 30 minutes without an update - they don't go purple just because they miss a single update.
I suppose you have checked the kernel logs ('dmesg' output) for anything odd ?
I'm wondering if maybe you're running out of ports (there's only 64K of them, only about half can be used by normal apps). How many ports do you have in TIME_WAIT state ?
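One way to answer the TIME_WAIT question without extra tools is to count the entries in /proc/net/tcp, where the kernel reports TIME_WAIT as state code 06. This is only a sketch: the function name is invented for this example, and it looks at TCP sockets in the current network namespace only.

```python
import os

TIME_WAIT = "06"  # hex state code Linux uses for TIME_WAIT in /proc/net/tcp


def count_time_wait(lines):
    """Count TIME_WAIT sockets in lines formatted like /proc/net/tcp."""
    count = 0
    for line in list(lines)[1:]:  # the first line is the column header
        fields = line.split()
        # columns: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count


if __name__ == "__main__":
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        if os.path.exists(path):
            with open(path) as f:
                total += count_time_wait(f.readlines())
    print("sockets in TIME_WAIT:", total)
```

A count in the tens of thousands would support the port-exhaustion theory; a few hundred would not.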
Another thing is the size of the ARP cache, if your hosts are all on the same IP network or your router/firewall is doing proxy-arp. This could be a problem - I've seen Hobbit break on a system with ~1200 hosts, because the network test would ping all of them, overflowing the ARP cache. This is tunable with
    sysctl net.ipv4.neigh.default.gc_thresh1=3072
    sysctl net.ipv4.neigh.default.gc_thresh2=4096
(see the arp(7) man-page for what these do).
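To see how close the ARP cache is to those limits, something like the following sketch could be used. The function names are made up for this example; it reads /proc/net/arp and the gc_thresh files under /proc/sys/net/ipv4/neigh/default, and it counts every row, including incomplete entries.

```python
import os


def count_arp_entries(lines):
    """Count entries in text formatted like /proc/net/arp (header skipped)."""
    return sum(1 for line in list(lines)[1:] if line.strip())


def read_threshold(name, base="/proc/sys/net/ipv4/neigh/default"):
    """Return the current value of a gc_thresh setting, or None if unreadable."""
    try:
        with open(os.path.join(base, name)) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    if os.path.exists("/proc/net/arp"):
        with open("/proc/net/arp") as f:
            print("ARP entries:", count_arp_entries(f.readlines()))
    for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
        print(name, "=", read_threshold(name))
```

Run it while the network tests are pinging, since that is when the cache is fullest.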
Is this server also running the network tests ?
Network-wise, it makes sense to tune a busy Hobbit server in the same manner that you would a very busy webserver (which also has to handle lots of short-lived connections). Another possible tuning parameter would be
    sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT state for new connections. It goes against the recommended way of doing TCP, but unless you're running Hobbit over high-latency networks it should not cause any problems.
I don't know how the change in "POSIX threading" plays into this. Hobbit is not a threaded application, it is a plain and simple single-task application all the way through. It may have some meaning in relation to NFS.
Oops... it's not multithreaded? I'm not a programmer, but... how can it follow 3000 hosts sending data without multithreading?
By avoiding all the overhead of using threads :-)
Seriously, 3000 hosts on a 5-minute cycle is only 10 hosts/second. Each host triggers perhaps 5-10 connections (e.g. an old client reporting cpu,disk,memory,msgs,procs,conn), and since the core daemon isn't doing any disk I/O, handling 50-100 connections per second isn't that big a deal.
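Spelled out, the arithmetic behind those numbers looks like this (a sketch using only the figures quoted above; the function name is invented for this example):

```python
def connection_rate(hosts, cycle_seconds, conns_per_host):
    """Average connections per second the daemon has to accept."""
    return hosts * conns_per_host / cycle_seconds


hosts = 3000
cycle_seconds = 5 * 60  # one 5-minute update cycle

print(f"{hosts / cycle_seconds:.0f} hosts/second")
for conns in (5, 10):
    rate = connection_rate(hosts, cycle_seconds, conns)
    print(f"{conns} connections per host: {rate:.0f} connections/second")
```

These are averages; clients that all report at the same moment would produce short bursts above them.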
However, here is the news: the problem persists only on RHEL5 with the x86_64 architecture, with all kinds of 2.6 kernels. With RHEL5 on x86 (32-bit) the bug does not appear.
It's quite odd that there is a problem on x86-64, but not on x86-32. One (I) would expect the 64-bit systems to have a bit more "oomph" so they should be the ones that worked best.
A datapoint here. I'm also running Hobbit on a 64-bit Linux platform, but it is using SPARC (Sun) hardware. Kernel is 2.6.18-6-sparc64. This hardware is *ancient* (about 10 years old), but handles twice the number of hosts and statuses that you have. I do have the RRD's on a different server, though.
However, the problem also exists in our Hobbit lab (always 64-bit) when stressing Hobbit with more than 20 "virtual hosts".
So you're saying that on a RHEL 5.3 64-bit Intel server, setting up Hobbit and feeding it with data from ~20 clients will make the system break?
I think I would have heard about it before if this was a general problem.
Regards, Henrik