[hobbit] RHEL5 and status-board not available bug?

12 Feb 2009


On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:
...
On Tue, 10 Feb 2009 16:39:35 +0100, Henrik wrote:
...
I'm not completely sure if you believe there is a bug in Xymon,
or in the Linux kernel of your RHEL system ...
I think it is in Hobbit. And I have news about it; I'll write more below.
...
I would guess that your problems - at least in part - stem from
the amount of I/O you're doing for updating all of the RRD-files.
No, that is ruled out; I already tried disabling all the ext tests.
I also tried moving the data to local SCSI disks,
and iostat indicates a really low I/O wait.
"really low" as in ... how much ? If you're looking at the vmstat
output, check the "vmstat1" graph and see how much I/O wait
takes up of your cpu time. AND - remember that disk I/O in
Linux is a single-processor task, so on a dual-CPU box
your I/O system is saturated when your vmstat1 graph shows
50% of the time is spent in I/O wait.
On a quad-cpu box the limit is 25%, obviously.
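
To make the rule of thumb above concrete, here is a minimal sketch of the arithmetic. It takes the premise stated above (one CPU's worth of I/O wait means the I/O system is saturated) as given; the function name is made up for illustration.

```python
def iowait_saturation_percent(cpu_count: int) -> float:
    """I/O wait share (percent of total CPU time) at which one CPU
    is waiting on I/O all the time, per the rule of thumb above."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return 100.0 / cpu_count


# Thresholds for single-, dual- and quad-CPU boxes.
for cpus in (1, 2, 4):
    print(f"{cpus} CPU(s): saturated at {iowait_saturation_percent(cpus):.0f}% I/O wait")
```

So a "really low" I/O wait figure only means something once it is compared with this per-box limit.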
I also have my RRD files on 10k RPM SCSI disks, hardware raid
controller etc. Without the caching in Xymon 4.3, it couldn't
keep up with the amount of RRD updates I was feeding it.
Which also shows in the fact that flushing the cache - which
essentially does the same amount of disk I/O as a full update
of all the RRD files - takes about 8 minutes. No chance at all
then of keeping up with 5-minute update cycles.
I really think you should try shutting off the hobbitd_rrd tasks,
just to see what happens.
...
...
If the problem persists, then some other explanation must be found.
It must, for sure... it's a big problem to see 3000 hosts becoming purple,
then green, then purple :)
For hosts to go purple they have to go more than 30 minutes without
an update - they don't go purple just because they miss a single
update.
I suppose you have checked the kernel logs ('dmesg' output) for
anything odd ?
I'm wondering if maybe you're running out of ports (there's only
64K of them, only about half can be used by normal apps). How
many ports do you have in TIME_WAIT state ?
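
One way to answer the TIME_WAIT question is to count entries in /proc/net/tcp, where the fourth column is the socket state in hex and 06 means TIME_WAIT. The following is only a sketch: the helper name and the sample data are made up, and on a real server you would pass in the contents of /proc/net/tcp (and /proc/net/tcp6).

```python
TIME_WAIT = "06"  # hex state code for TIME_WAIT in /proc/net/tcp


def count_time_wait(proc_net_tcp_text: str) -> int:
    """Count sockets in TIME_WAIT in text formatted like /proc/net/tcp."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count


# Made-up sample in the /proc/net/tcp layout: one listener, two TIME_WAIT
# sockets and one established connection.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:07C0 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345 1
   1: 0100007F:07C0 0100007F:C350 06 00000000:00000000 03:00000A5B 00000000     0        0 0 3
   2: 0100007F:07C0 0100007F:C351 06 00000000:00000000 03:00000A60 00000000     0        0 0 3
   3: 0100007F:07C0 0100007F:C352 01 00000000:00000000 00:00000000 00000000     0        0 12346 1
"""

print(count_time_wait(SAMPLE))
```

On the live system, something like `count_time_wait(open("/proc/net/tcp").read())` would give the current figure to compare against the number of usable ports.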
Another thing is the size of the ARP cache, if your hosts are
all on the same IP network or your router/firewall is doing
proxy-arp. This could be a problem - I've seen Hobbit break
on a system with ~1200 hosts, because the network test would
ping all of them, overflowing the ARP cache. This is tunable
with
sysctl net.ipv4.neigh.default.gc_thresh1=3072
sysctl net.ipv4.neigh.default.gc_thresh2=4096
(see the arp(7) man-page for what these do).
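
To see whether the ARP cache is anywhere near such a threshold, you can count the entries in /proc/net/arp, which has one header line followed by one line per neighbour. This is a rough sketch with made-up helper names and sample data; what each gc_thresh value actually does is described in arp(7), not here.

```python
def count_arp_entries(proc_net_arp_text: str) -> int:
    """Count neighbour entries in text formatted like /proc/net/arp."""
    lines = [line for line in proc_net_arp_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)  # the first non-empty line is the header


def reaches_threshold(entries: int, threshold: int) -> bool:
    """True when the entry count has reached the given threshold."""
    return entries >= threshold


# Made-up sample in the /proc/net/arp layout (documentation addresses).
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
192.0.2.10       0x1         0x2         00:11:22:33:44:55     *        eth0
192.0.2.11       0x1         0x2         00:11:22:33:44:56     *        eth0
"""

entries = count_arp_entries(SAMPLE)
print(entries, reaches_threshold(entries, 3072))
```

On a server pinging ~1200 hosts on the same network, the count from the real /proc/net/arp right after a network test run is the number to compare with the thresholds.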
Is this server also running the network tests ?
Network-wise, it makes sense to tune a busy Hobbit server in the
same manner that you would a very busy webserver (which also
has to handle lots of short-lived connections). Another possible
tuning parameter would be
sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT
state for new connections. It goes against the recommended way
of doing TCP, but unless you're running Hobbit over high-latency
networks it should not cause any problems.
...
...
I don't know how the change in "POSIX threading" plays into this.
Hobbit is not a threaded application, it is a plain and simple
single-task application all the way through. It may have some
meaning in relation to NFS.
Oops... it is not multithreaded? I'm not a programmer, but... how can it
follow 3000 hosts sending data without multithreading?
By avoiding all the overhead of using threads :-)
Seriously, 3000 hosts on a 5-minute cycle is only 10 hosts/second.
Each host triggers perhaps 5-10 connections (e.g. an old client
reporting cpu,disk,memory,msgs,procs,conn), and since the core
daemon isn't doing any disk I/O handling 50-100 connections per
second isn't that big a deal.
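
The back-of-the-envelope figures above can be reproduced as follows. This is just the arithmetic from the paragraph, with a made-up function name, and it assumes the load is spread evenly over the cycle.

```python
def connections_per_second(hosts: int, cycle_seconds: int, connections_per_host: int) -> float:
    """Average incoming connection rate if every host reports once per cycle."""
    return hosts / cycle_seconds * connections_per_host


HOSTS = 3000
CYCLE_SECONDS = 5 * 60  # 5-minute update cycle

print(HOSTS / CYCLE_SECONDS)                              # hosts per second
print(connections_per_second(HOSTS, CYCLE_SECONDS, 5))    # low estimate
print(connections_per_second(HOSTS, CYCLE_SECONDS, 10))   # high estimate
```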
...
However, here is the news: the problem persists only on RHEL5 with
the x86_64 architecture, with all kinds of 2.6 kernels.
With RHEL5 on x86 (32-bit) the bug does not occur.
It's quite odd that there is a problem on x86-64, but not on x86-32.
One (I) would expect the 64-bit systems to have a bit more "oomph"
so they should be the ones that worked best.
A datapoint here. I'm also running Hobbit on a 64-bit Linux
platform, but it is using SPARC (Sun) hardware. Kernel is
2.6.18-6-sparc64. This hardware is *ancient* (about 10 years old),
but handles twice the number of hosts and statuses that you have.
I do have the RRD's on a different server, though.
...
However, the problem also exists in our Hobbit lab (always 64-bit)
when stressing Hobbit with more than 20 "virtual hosts".
So you're saying that on a RHEL 5.3 64-bit Intel server, setting
up Hobbit and feeding it with data from ~20 clients will make
the system break?
I think I would have heard about it before if this was a general
problem.
Regards,
Henrik

henrik＠hswn.dk