Hi all,
I've been doing systems monitoring for a very long time now... I was early on with BB, used HP OpenView back in the day, blah blah.
Anyway, recently I've been told that in very large installations (many thousands of devices) tools like Zabbix are the only ones that will do.
What are the group's thoughts on this? What ARE the scaling limits of Xymon, and can they be overcome somehow?
Well, I'm monitoring ~2000 hosts on a fairly modest box (eight 3 GHz cores, 8 GB of memory). I'm also running quite a few CPU-intensive scripts on the same box that could easily be moved to another host if needed. I do the network testing on separate hosts in each of our major security zones, more for the reliability of the tests than to unload the main Xymon server. The main server is not operating anywhere near its capacity: it's using less than 10% of its physical memory and the load average tends to stay around 1. I suspect that the box could handle 5000 hosts without too much trouble, maybe more.
If you do have scaling problems there are some things you can do, though. Move things like the network tests to separate hosts. You can also move the alerting to a different host using xymonproxy. I've found that the limit you're most likely to hit with Xymon is disk I/O; this can be helped by moving the data directory to a SAN.
Thanks, Larry Barber
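If disk I/O is the suspected limit, a rough first estimate is how many RRD files the server has to rewrite per reporting cycle. The following is only a sketch under two assumptions that do not come from this thread: that the RRD files live below one directory (the path in the comment is hypothetical) and that each file is updated about once every 300 seconds. rrd_update_rate is a made-up helper name.

```shell
#!/bin/sh
# rrd_update_rate DIR [INTERVAL_SECONDS]
# Count the *.rrd files below DIR and estimate the steady-state update rate,
# assuming every file is written once per interval (default: 300 seconds).
rrd_update_rate() {
    dir="$1"
    interval="${2:-300}"
    count=$(find "$dir" -type f -name '*.rrd' | wc -l | tr -d '[:space:]')
    awk -v n="$count" -v s="$interval" \
        'BEGIN { printf "%d RRD files, about %.1f updates/s\n", n, n / s }'
}

# Example (hypothetical path): rrd_update_rate /var/lib/xymon/rrd
```

Each update is a small write to a different file, so the resulting load tends to be random I/O rather than sequential throughput.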
On Fri, Apr 5, 2013 at 12:57 PM, Bruce Ferrell <bferrell at baywinds.org> wrote:
--snip--
_______________________________________________
Xymon mailing list
Xymon at xymon.com
http://lists.xymon.com/mailman/listinfo/xymon
Hello,
15,000 devices here. For me the key is SSD :)
I plan to monitor 60,000 devices with Xymon, network devices only.
We'll see how it goes.
oau
On Friday, 5 April 2013 at 13:54 -0500, Larry Barber wrote:
--snip--
Over 1,000 devices monitored here, and the only real issue is RRD updates keeping up. I have been told that putting the RRD files on an SSD will solve this.
Bruce White
Senior Enterprise Systems Engineer | Phone: 1-630-671-5169 | Fax: 630-893-1648 | bewhite at fellowes.com | http://www.fellowes.com/
-----Original Message-----
From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Olivier AUDRY
Sent: Friday, April 05, 2013 3:34 PM
To: Larry Barber
Cc: xymon at xymon.com
Subject: Re: [Xymon] Scaling
--snip--
On Wed, Apr 10, 2013 at 5:51 PM, White, Bruce <bewhite at fellowes.com> wrote:
> Over 1000 devices monitored here and only real issue is rrd keeping up. I have been told an ssd for the rrd files will solve this issue.
~2000 hosts, and that will double or triple in the next few weeks. I really don't see any I/O issues in the slightest: 6 x 15k RPM SCSI drives in RAID 5 on a Dell PowerEdge 2950 with 8 GB of RAM, and the thing is snoring (LA: 0.25).
Regards, Cami
We're currently processing ~2,000 incoming messages per second on a single xymond instance. This is a pretty beefy box, but it's also handling lots of other concurrent monitoring tasks that we're slowly moving over to Xymon... including a non-fping-enabled Icinga install >.<
]# xymon localhost "xymondboard test=info fields=hostname" | wc -l
42459
(Not all of those are full hosts; some are application nodes with statuses being generated server-side out of client-side jvm stats or the like.)
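The count above relies on xymondboard printing one line per matching status. A small sketch that wraps the same pipeline while ignoring blank or repeated hostnames; count_hosts is a hypothetical helper name, not part of Xymon:

```shell
#!/bin/sh
# count_hosts: read hostnames on stdin (one per line, as printed by
# "xymondboard ... fields=hostname") and print how many distinct,
# non-empty names there are.
count_hosts() {
    grep -v '^[[:space:]]*$' | sort -u | wc -l | tr -d '[:space:]'
}

# Usage against a live server (assumes the xymon client binary is on PATH):
#   xymon localhost "xymondboard test=info fields=hostname" | count_hosts
```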
At these levels it's important to ensure you're using whatever NUMA capabilities your system has properly, since message passing is basically just shoveling incoming TCP data around within memory. Also, you might want to tweak net.ipv4.ip_local_port_range and enable net.ipv4.tcp_tw_reuse and/or net.ipv4.tcp_tw_recycle on Linux to eke more simultaneous testing out of xymonnet.
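Before changing any of these settings it helps to see what the kernel currently uses. A minimal sketch that only reads the three sysctls named above from /proc/sys and changes nothing; show_tcp_tuning is a hypothetical helper name. Note that net.ipv4.tcp_tw_recycle was removed in Linux 4.12 and, where it does exist, can cause dropped connections for clients behind NAT, so on newer systems only the first two are likely to apply.

```shell
#!/bin/sh
# show_tcp_tuning: print the current value of each sysctl mentioned above.
# Reads /proc/sys directly, so it needs no root privileges.
show_tcp_tuning() {
    for key in net.ipv4.ip_local_port_range net.ipv4.tcp_tw_reuse net.ipv4.tcp_tw_recycle; do
        path="/proc/sys/$(printf '%s' "$key" | tr . /)"
        if [ -r "$path" ]; then
            printf '%s = %s\n' "$key" "$(tr '\t' ' ' < "$path")"
        else
            printf '%s = (not available on this kernel)\n' "$key"
        fi
    done
}

show_tcp_tuning

# To change a value as root (the numbers are illustrative assumptions,
# not recommendations from this thread):
#   sysctl -w net.ipv4.ip_local_port_range="1024 65000"
#   sysctl -w net.ipv4.tcp_tw_reuse=1
```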
One of the beauties of Xymon's architecture is the ability to cleanly disconnect the components... Xymongen can run on some other box, xymond_locator can be used to send rrd data off somewhere if IO becomes an issue, xymonnet pollers can be distributed, and xymonproxy can be used as needed to aggregate and smooth out incoming status reports, etc.
There are lots of different mechanisms for "scaling" efficiently depending on your particular needs, but I'd bet that on decently modern server hardware you'll probably want to scale for HA purposes long before you actually /need/ the additional power.
HTH,
-jc
Hello,
I'm impressed by your 2K incoming messages per second. I only get 600, and we have a lot of gaps in our trend graphs.
I suspect that either xymonproxy is adding latency to the process, or our huge, long-standing extra-rrd.pl is.
We don't see any load or iowait.
I'm not sure it could be a network issue. So if you have an idea :)
oau
On Thursday, 11 April 2013 at 17:18 +0000, cleaver at terabithia.org wrote:
--snip--
Hello,
Can you give us more information on your NUMA config?
As I understand it, I only see two nodes, one per physical CPU:
numactl --hardware
available: 2 nodes (0-1)
node 0 size: 12097 MB
node 0 free: 594 MB
node 1 size: 12120 MB
node 1 free: 12 MB
node distances:
node 0 1
0: 10 20
even though I have 24 CPUs (multi-core and hyperthreading). Is that correct?
As far as I can see, both of my nodes are full. Not good at all, I guess.
My policy is the default one. Perhaps you can advise a specific policy for a Xymon setup?
numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1
nodebind: 0 1
membind: 0 1
I'm looking into /proc/pid/numa_maps to find more info.
If you can help, that would be great :)
thx
oau
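To make the per-node figures in the "numactl --hardware" output easier to compare, the size and free lines can be reduced to one line per node. This is only a sketch: numa_free_summary is a hypothetical helper, and it assumes the "node N size: X MB" / "node N free: Y MB" layout shown in the message above. A low "free" value is not necessarily a problem, since Linux also uses otherwise idle memory for the page cache.

```shell
#!/bin/sh
# numa_free_summary: read "numactl --hardware" output on stdin and print,
# for every node, the free memory, the total size and the percentage free.
numa_free_summary() {
    awk '
        $1 == "node" && $3 == "size:" { size[$2] = $4 }
        $1 == "node" && $3 == "free:" { free[$2] = $4 }
        END {
            for (n in size)
                if (size[n] > 0)
                    printf "node %s: %d MB free of %d MB (%.1f%%)\n",
                           n, free[n], size[n], 100 * free[n] / size[n]
        }
    ' | sort
}

# Usage: numactl --hardware | numa_free_summary
```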
Hello,
As I understand it, I should run Xymon on a single node to improve memory access latency. Right?
I will test this if I find the right command :)
oau
On Thursday, 11 April 2013 at 20:40 +0200, Olivier AUDRY wrote:
> Hello,
> As I understand it, I should run Xymon on a single node to improve memory access latency. Right?
--snip--
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 size: 12097 MB
> node 0 free: 594 MB
> node 1 size: 12120 MB
> node 1 free: 12 MB
> node distances:
> node 0 1
> 0: 10 20
> even though I have 24 CPUs (multi-core and hyperthreading). Is that correct?
That seems odd; almost like hyperthreading is disabled? You should see "node 0 cpus: ..." above each size. I'm running RHEL 6.4; it's possible things have changed in that output over time if you're on a different system.
> As far as I can see, both of my nodes are full. Not good at all, I guess.
> My policy is the default one. Perhaps you can advise a specific policy for a Xymon setup?
> numactl --show
> policy: default
> preferred node: current
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> cpubind: 0 1
> nodebind: 0 1
> membind: 0 1
Generally speaking, yeah, use numactl in front of xymonlaunch to ensure the entire process tree gets assigned to a single node. But it really depends on your workload (can everything fit in that node?) and what else is going on on the box. If you have something which analyzes xymondata in a large dump, then does heavy munging on it and sends it back, it might be better to have that on a different node than (say) the xymond_* worker modules.
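As a sketch of what "numactl in front of xymonlaunch" could look like: the snippet below only prints the command line so it can be reviewed before it goes into an init script. The node number and both paths are assumptions for illustration, and numa_launch_cmd is a hypothetical helper name.

```shell
#!/bin/sh
# Assumed values: adjust the node number and paths for your installation.
NUMA_NODE="${NUMA_NODE:-0}"
XYMONLAUNCH="${XYMONLAUNCH:-/usr/lib/xymon/server/bin/xymonlaunch}"  # hypothetical path
TASKS_CFG="${TASKS_CFG:-/etc/xymon/tasks.cfg}"                       # hypothetical path

# numa_launch_cmd: print a command line that pins xymonlaunch, and therefore
# every task it starts, to the CPUs and memory of a single NUMA node.
numa_launch_cmd() {
    printf 'numactl --cpunodebind=%s --membind=%s %s --config=%s\n' \
        "$NUMA_NODE" "$NUMA_NODE" "$XYMONLAUNCH" "$TASKS_CFG"
}

numa_launch_cmd
```

With --membind, allocations fail once the chosen node runs out of memory; numactl's --preferred=N falls back to other nodes instead, which may be the safer choice if it is unclear whether everything fits in one node.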
'numastat -s -z -p xymon' is your friend
The RH Performance Tuning and Resource Management guides are definitely useful reading as well. I'm sure there's plenty of cgroup stuff that could be helpful if/when the time came, but there are only so many hours in the day and there's other low-hanging fruit at the moment :)
I'd definitely start with running the 'numad' service and seeing what it does over time; it really could be all that you need.
HTH,
-jc
Great, many thanks for your time. I will check this.
> but there are only so many hours in the day and there's other low-hanging fruit at the moment :)
so true :)
On Thursday, 11 April 2013 at 20:12 +0000, cleaver at terabithia.org wrote:
--snip--
[Sorry to respond so late, I am catching up on emails]
I monitor about 43,000 devices split across 8 instances. They run on ancient hardware: Sun X4200s with 2 CPUs and 8 GB of RAM.
I moved the RRDs to a different host, and xymongen and the histfiles are handled outside of stock Xymon.
The only issue I have run into (which I suspect will be fixed by beefier hardware) is that once I get to around 5,000 hosts, if Xymon crashes, the IPC/shared memory does not get cleaned up right away, and it goes into a continual restart loop. Henrik posted to the list earlier a way to restart that kills all those things, so I haven't had issues since (I'm still tracking down what causes the crash).
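The restart procedure Henrik posted is not reproduced in this thread, so the following is not that procedure. It is only a sketch of one way to spot shared memory segments left behind after a crash, assuming the Linux (util-linux) "ipcs -m" column layout of key, shmid, owner, perms, bytes, nattch; other systems print different columns. stale_shm_cleanup_cmds is a hypothetical helper, and it covers shared memory only, not semaphores.

```shell
#!/bin/sh
# stale_shm_cleanup_cmds OWNER
# Read "ipcs -m" output on stdin and print one "ipcrm -m SHMID" line for each
# segment owned by OWNER that no process is attached to (nattch == 0).
# Nothing is removed; review the printed commands before running them.
stale_shm_cleanup_cmds() {
    awk -v owner="$1" 'NF >= 6 && $3 == owner && $6 == 0 { print "ipcrm -m " $2 }'
}

# Example (assumes the daemons run as user "xymon" and are fully stopped):
#   ipcs -m | stale_shm_cleanup_cmds xymon
```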
On 4/11/13 4:23 PM, "Olivier AUDRY" <olivier at audry.fr> wrote:
--snip--
participants (7)
- bewhite@fellowes.com
- bferrell@baywinds.org
- cami@hack.co.za
- cleaver@terabithia.org
- lebarber@gmail.com
- olivier@audry.fr
- sean.clark@twcable.com