Trying to set up split client load over multiple servers.
Sorry if this post gets a bit long. I've read through the man pages and the archives but have failed to reach understanding, and I need some help.
I have over 4300 Hobbit 4.2.0+all-patches clients currently reporting to one server running Xymon 4.3.0-0.beta2. This server is also used for other workloads, which normally do not produce much of a load but can at times.
We are adding more clients to this configuration at a rate of about 400/week for a final total of about 7200 clients.
As of this week it seems that my Xymon server is unable to keep up with the load and I suspect this is due to several reasons: the large number of clients, file system type (ext3) used for the data/.. directories (, the problem is.
I am trying to determine my best course to be able to handle the 7200+ clients ...
Some options I've considered:
A) simply split the clients over 2 or more independent Xymon servers
a. easiest to configure
b. lose 'single web page' overview
c. lose combined statistics/reporting
B) Split clients over multiple servers for data gathering/storage and (how?) use one server to display a bb2.html page which combines all non green from all 'data-gathering' servers
a. Is this even possible?
C) Some other method/configuration? Anyone on the list running this many clients?
a. How did you set up your environment?
b. What size/performance/type of system are you using for your Xymon server?
Some of my current symptoms:
The 'top' command on the server is indicating high I/O load (at times %iowait > 60%).
The 'procs' column warns/alerts about missing processes; closer examination shows that the client data received has been truncated, usually somewhere in the 'ps' output section.
The bb2.html page sometimes shows many 'purple' clients; the clients in purple change (e.g., a client goes purple for a while, 5-30+ minutes, then we get another message processed and the client goes green again).
In /var/log/xymon/clientdata.log, I am seeing many (2212 yesterday alone) messages like:
2010-05-27 11:38:09 hobbitd_client: Got message 55294, expected 55277
2010-05-27 12:08:12 Flushed 7 stale messages for 0.0.0.0:0
2010-05-27 12:08:13 Flushed 43 stale messages for 0.0.0.0:0
2010-05-27 12:08:13 hobbitd_client: Got message 81190, expected 81140
2010-05-27 12:38:36 Flushed 16 stale messages for 0.0.0.0:0
2010-05-27 12:38:37 Flushed 43 stale messages for 0.0.0.0:0
2010-05-27 12:38:38 Flushed 26 stale messages for 0.0.0.0:0
2010-05-27 12:38:38 hobbitd_client: Got message 107484, expected 107399
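One way to read the excerpt above is to compare each "Got message N, expected M" gap with the "Flushed" counts logged just before it. The following is a minimal sketch, not part of Xymon; the function name summarize and the pairing rule (sum the flushes since the previous gap line) are assumptions made for illustration. In this excerpt the last two gaps (50 and 85) equal the sums of the preceding flushed counts (7+43 and 16+43+26), which would be consistent with the worker skipping exactly the messages that were flushed as stale, though the log alone does not prove that.

```python
import re

# Sample lines copied from the clientdata.log excerpt above.
LOG = """\
2010-05-27 11:38:09 hobbitd_client: Got message 55294, expected 55277
2010-05-27 12:08:12 Flushed 7 stale messages for 0.0.0.0:0
2010-05-27 12:08:13 Flushed 43 stale messages for 0.0.0.0:0
2010-05-27 12:08:13 hobbitd_client: Got message 81190, expected 81140
2010-05-27 12:38:36 Flushed 16 stale messages for 0.0.0.0:0
2010-05-27 12:38:37 Flushed 43 stale messages for 0.0.0.0:0
2010-05-27 12:38:38 Flushed 26 stale messages for 0.0.0.0:0
2010-05-27 12:38:38 hobbitd_client: Got message 107484, expected 107399
"""

GAP_RE = re.compile(r"Got message (\d+), expected (\d+)")
FLUSH_RE = re.compile(r"Flushed (\d+) stale messages")


def summarize(log_text):
    """Return a list of (gap, flushed_since_previous_gap) tuples.

    gap is 'got' minus 'expected' from each "Got message" line;
    flushed_since_previous_gap is the sum of the "Flushed N" counts
    seen since the previous "Got message" line (hypothetical pairing).
    """
    results = []
    flushed = 0
    for line in log_text.splitlines():
        flush_match = FLUSH_RE.search(line)
        if flush_match:
            flushed += int(flush_match.group(1))
            continue
        gap_match = GAP_RE.search(line)
        if gap_match:
            got, expected = int(gap_match.group(1)), int(gap_match.group(2))
            results.append((got - expected, flushed))
            flushed = 0
    return results


for gap, flushed in summarize(LOG):
    print(f"gap={gap} flushed_before={flushed}")
```

Running the same function over the full log would show whether the gaps cluster at particular times of day, for example around the half-hour marks seen in this excerpt.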
The hobbitd page shows:
Statistics for Hobbit daemon Up since 26-May-2010 14:55:38 (0 days, 22:00:02)
Incoming messages : 16721697
- status : 11146704
- combo : 1122112
Incoming messages/sec : 216 (average last 300 seconds)
The bbtest is taking 225 seconds to complete
PING test completed (4390 hosts) 5005592.360761 203.397989 TIME TOTAL 225.391564
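As a rough cross-check of these figures, the sketch below derives the average message rate over the whole uptime and scales it to the planned client count. It assumes that message volume grows linearly with the number of hosts and that the 4390 hosts from the PING test line approximate the number of reporting clients; both are simplifications, and the variable names are illustrative only.

```python
# Figures copied from the hobbitd statistics and PING test line above.
incoming_messages = 16_721_697
uptime_seconds = 22 * 3600 + 0 * 60 + 2  # "0 days, 22:00:02"
current_hosts = 4390                      # host count from the PING test line
target_hosts = 7200                       # planned final client count

# Average rate since the daemon started.
avg_rate = incoming_messages / uptime_seconds

# Average number of messages each host contributes per 5-minute window.
per_host_per_5min = avg_rate / current_hosts * 300

# Assumption: load scales linearly with the number of hosts.
projected_rate = avg_rate * target_hosts / current_hosts

print(f"average rate:        {avg_rate:.1f} messages/sec")
print(f"per host, per 5 min: {per_host_per_5min:.1f} messages")
print(f"projected at {target_hosts}:   {projected_rate:.1f} messages/sec")
```

The long-run average comes out at roughly 211 messages/sec, close to the 216 reported for the last 300 seconds, so the current load looks steady rather than bursty. Under the linear assumption, 7200 clients would put the same server at roughly 346 messages/sec.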
Hoping the group-mind can help me out,
Thanks,
Tom
Thomas Brand
Disclaimer: 1) all opinions are my own, 2) I may be completely wrong, 3) my advice is worth at least as much as what you are paying for it, or your money cheerfully refunded.
What kind of hardware are you on now?
On 5/27/10, Brand, Thomas R. <TRBrand at cvs.com> wrote:
-- Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373
“Success is not final, failure is not fatal: it is the courage to continue that counts.” --- Winston Churchill
Hi,
> Some of my current symptoms:
> The ‘top’ command on the server is indicating high I/O load (at times %iowait > 60%).
Try changing the cache time of hobbitd_rrd from 30 minutes to 1 hour. It is hardcoded in do_rrd.c: change "#define CACHESZ 6" to "#define CACHESZ 12".
> The ‘procs’ column warns/alerts about missing processes; closer examination shows that the client data received has been truncated, usually somewhere in the ‘ps’ output section.
Try setting higher values for MAXLINE and MAXMSG* in hobbitserver.cfg.
> The bb2.html page sometimes shows many ‘purple’ clients; the clients in purple change (e.g., a client goes purple for a while, 5-30+ minutes, then we get another message processed and the client goes green again).
Are there any logs on the client itself (e.g., showing that it was unable to report to your hobbitserver)?
Olivier
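The CACHESZ suggestion above can be put into numbers. The sketch below assumes the simplest model that fits the figures in the reply: each cached slot holds one update, and updates arrive every 5 minutes, so 6 slots correspond to 30 minutes and 12 slots to 1 hour. How hobbitd_rrd actually uses CACHESZ has not been checked here, and the function names are hypothetical.

```python
def flush_interval_minutes(cachesz, update_interval_minutes=5):
    """Minutes between disk writes for one RRD file under the assumed model:
    one cached slot per update, flushed when all slots are full."""
    return cachesz * update_interval_minutes


def flushes_per_hour(cachesz, update_interval_minutes=5):
    """Disk writes per hour for one RRD file under the same assumption."""
    return 60 / flush_interval_minutes(cachesz, update_interval_minutes)


for cachesz in (6, 12):
    print(
        f"CACHESZ={cachesz}: flush every {flush_interval_minutes(cachesz)} min, "
        f"{flushes_per_hour(cachesz):.1f} flushes/hour per RRD file"
    )
```

Under this model, doubling CACHESZ halves the number of RRD writes per file, which is where the I/O relief would come from. The likely cost is that more unwritten updates are held in memory at any one time.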
participants (3)
- josh@imaginenetworksllc.com
- obeau79@gmail.com
- TRBrand@cvs.com