Well, as nobody has suggested anything for my problem, I guess I'm the only one having this issue. I have managed to find the root cause. The hobbitd_rrd process was in the "uninterruptible sleep" state most of the time, with high iowait on the CPU it was running on. I suspected the problem was disk IO while updating the rrds for the 2000 hosts, so I created a tmpfs filesystem and copied the rrd directory into it. Since then (48 hours ago) my rrd graphs have been updating continuously. I do, however, need to write the data back to disk periodically to avoid losing it after a reboot.
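For anyone wanting to try the same workaround, here is a rough sketch of the periodic write-back in Python. It is only an illustration under assumptions: the function name sync_back and the example paths are hypothetical, and it assumes that copying an rrd file while it is being updated is acceptable for a backup copy. It walks the tmpfs copy and copies every file that is missing on disk or newer than the disk copy.

```python
import os
import shutil


def sync_back(src_root, dst_root):
    """Copy files from src_root to dst_root when they are missing or
    older on the destination. Returns the number of files copied."""
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if (not os.path.exists(dst)
                    or os.stat(src).st_mtime_ns > os.stat(dst).st_mtime_ns):
                tmp = dst + ".tmp"
                shutil.copy2(src, tmp)  # copy2 also copies the mtime
                os.replace(tmp, dst)    # rename over the old disk copy
                copied += 1
    return copied


# Hypothetical usage, e.g. run from cron every few minutes:
#   sync_back("/mnt/rrd-tmpfs", "/var/lib/hobbit/rrd-on-disk")
```

Copying to a temporary name and renaming means an interrupted run does not leave a half-written file in place of the last good disk copy.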
This is OK as a temporary fix, but I would like a permanent solution. I would like to hear from other hobbit users who monitor more than 1000 hosts. What type of servers and disk subsystems are you using? Perhaps my problem has to do with the RedHat and Dell server combination. Perhaps I need to stripe over multiple spindles.
-Naeem
Naeem Maqsud/SYBASE
08/18/2005 05:02 PM
To: hobbit at hswn.dk
Subject: hobbit_rrd stops working after about 1 hour
Hi,
I'm testing hobbit 4.1.1 for a possible migration from Big Brother (with bbgen). I suspected scalability issues with BB because my rrd graphs were updated only intermittently. However, hobbit is exhibiting similar problems. About 1 hour after restarting hobbit, the rrd graphs stop updating, except for the CPU utilization graph of the hobbit server itself.
The hobbit server is running RedHat Linux AS 3.0. It has 2 x 2.4 GHz Xeon processors and 1 GB of memory. About 800 servers are sending updates to the hobbit server. Another 1200 servers are checked with remote tests.
Load average has stayed below 1 most of the time. CPU usage has been low, with 75% idle. Four CPUs show up due to hyperthreading, and I've noticed that after restarting the hobbit server, the hobbitd_rrd process stays on CPU3 at 100% utilization for the one hour that it is busy.
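As a side note on telling a busy process from one blocked on disk: on Linux, the third field of /proc/<pid>/stat is the process state, and "D" means uninterruptible sleep (usually waiting on IO). Below is a minimal sketch in Python, assuming a Linux /proc filesystem; the function names are hypothetical and the pid has to be looked up separately.

```python
def parse_proc_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before reading the state.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def read_state(pid):
    """Read the current state letter of a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_state(f.read())
```

Sampling read_state() for the hobbitd_rrd pid every second or so and counting how often it returns "D" gives a rough idea of how much time the process spends waiting on IO.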
I hope someone can shed some light on this.
Thanks, Naeem