Looking for some thoughts and experiences on how folks have configured their systems. Mainly in regard to classification/grouping of servers for alerting purposes. I'll try to keep this short.
Currently I'm running a total of 3 Hobbit servers, each in a different data center. Each server monitors the clients local to its network, in addition to each partner server's SMTP box, etc. This all works fine. However, our alerting system, which also works fine, is overly complex and contains too many opportunities for bugs.
In a nutshell, we have 3 groups of sysadmins that rotate on call every nn interval. Each group may be involved with a number of systems in each location and some of the admins will work on multiple Operating Systems.
I'm looking for a way to avoid having specific alert rules for each server (lots of text, even with regex macros/vars). More to the point, I want to categorize the servers by sysadmin group so that the rules can be considerably less complex. Dividing the alerting by OS category does not work well, as some of the admins are cross-platform folks. Dividing the alerting by page does not work well, as the same 'page' may contain servers belonging to more than one sysadmin group. The 'Class' statement for bb-hosts seems like a possibility; however, I think its intended purpose is more related to whatever logs are defined in client-local, so I don't think that will work beyond log files.
Ideally I'd like to define the sysadmin group in the bb-hosts file but I don't think this is possible.
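To picture the kind of grouping being asked for, here is a minimal sketch in Python that lives outside Hobbit: one table maps hosts to a sysadmin group, and a second maps each group to whoever is currently on call, so a rotation change touches one line. The host names, group names and addresses are made up, and nothing here is a Hobbit feature.

```python
# Hypothetical tables; none of these names come from a real configuration.
HOST_GROUP = {
    "web01.dc1": "team-a",
    "db01.dc2": "team-b",
    "mail01.dc3": "team-a",
}
ON_CALL = {
    "team-a": "oncall-a@example.com",
    "team-b": "oncall-b@example.com",
}


def recipient_for(host, host_group=HOST_GROUP, on_call=ON_CALL, default=None):
    """Return the on-call address for the group that owns `host`."""
    group = host_group.get(host)
    # An unknown host has no group, so the lookup falls through to `default`.
    return on_call.get(group, default)
```

A lookup like this could sit behind a single catch-all alert rule that hands every alert to a script, at the cost of moving the routing logic out of the Hobbit configuration.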
In summary, if I maintain immense configuration files with somewhat repetitive data, Hobbit works quite well. I'd like to reduce the complexity but maintain the functionality. Maybe it's not in the cards, or maybe - and I am hoping this is the case - I missed some cool flag or config setting.
Thoughts?
Separate team, separate page(s). Look up PAGE= in the hobbit-alerts man page. Saved me a lot of pain.
Cheers Vernon
-----Original Message----- From: Tim McCloskey [mailto:tm at campnerd.com] Sent: Friday, 13 June 2008 11:31 AM To: hobbit at hswn.dk Subject: [hobbit] grouping methods
To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk
Thanks Vernon. I was really trying to avoid that route, even though it seems to be the cleanest approach available at this time. Seems to be a common thread here so I'll stop beating the bush on this one.....
Everett, Vernon wrote:
Separate team, separate page(s).
-----Original Message----- From: Tim McCloskey Dividing the alerting by page does not work well as the same 'page' may contain servers belonging to one or more sysadmin groups.
Currently I'm running a total of 3 hobbit servers, each in a different data center.
Why 3 servers?
We use one server to monitor hundreds of systems in data centers all over the world. Having one centralized configuration sure makes life a lot easier than trying to maintain three of them.
Doug Linder
Not sure what the real reasoning is behind this, but if you have 1000 servers monitored behind each of 3 Hobbit servers and one Hobbit server goes down, you lose monitoring for 1000 of the 3000. If you have 3000 servers monitored behind 1 Hobbit server, that single point of failure leaves you blind to all 3000 servers.
Those are my thoughts, at least =)
-- Josh Luthman Office: 937-552-2340 Direct: 937-552-2343 1100 Wayne St Suite 1337 Troy, OH 45373
Those who don't understand UNIX are condemned to reinvent it, poorly. --- Henry Spencer
Josh Luthman wrote:
Not sure what the real reasoning is behind this, but if you have 1000 servers monitored behind each of 3 Hobbit servers and one Hobbit server goes down, you lose monitoring for 1000 of the 3000. If you have 3000 servers monitored behind 1 Hobbit server, that single point of failure leaves you blind to all 3000 servers.
We do it with redundancy. Each server in our various data centers is monitored by two bb servers, with one of the two set up to send notifications, but in all other aspects the monitoring is active/active, and we get only one notification for alerts, rather than a pair of redundant notifications.
We've not had a bb server go down in all the years we've been using it, but sometimes WAN connectivity goes away due to circumstances beyond our control, and a bb server in Arizona can't talk to the corresponding bb server in California. The normally passive monitoring server then goes into failover mode and begins sending notifications for alerts, since it can't verify that the other bb server is alive.
Thus, we always receive notifications for all alerts, and in the worst case we may get redundant notifications in the case of a split brain situation, which is the lesser of the evils.
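The notification decision described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (both servers monitor, one notifies, the passive one takes over when it cannot reach its peer); the function and parameter names are hypothetical and do not come from bb or Hobbit.

```python
def should_notify(is_primary, peer_alive):
    """Decide whether this monitoring server sends notifications.

    The primary always sends. The normally passive server sends only
    when it cannot confirm that the primary is alive.
    """
    return is_primary or not peer_alive
```

In a split-brain situation neither side can see the other, so both calls return True and the notification is duplicated, which matches the "lesser of the evils" trade-off: duplicates are possible, silence is not.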
Once this notification failover capability makes it into hobbit, we can finally switch from bb to hobbit.
Joe
This is quite obviously a well-founded problem and a sought-after feature - getting redundant Hobbit servers.
Please help us, code monkeys =)
Josh
It would be nice to have an elegant solution, but I don't worry about it that much because 1) Linux servers go down so infrequently, and 2) it would be pretty trivial to set up your own redundancy between Hobbit servers. I can think of half a dozen ways to do it off the top of my head. For example:
Main Hobbit Server (MHS) does its thing normally. Backup Hobbit Server (BHS) syncs/mirrors the drive of the MHS server via rsync or whatever, and runs a copy of hobbit which monitors only one other system: the MHS. If the BHS detects that the MHS is down, the alert triggers a script that brings up its mirror copy of the server.
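A rough Python sketch of the backup side of that idea follows. It swaps the Hobbit alert trigger for a plain polling loop, which is an assumption made to keep the example self-contained; the class name, the threshold of three failed probes, and the `takeover` callback (which would start the mirrored copy) are all hypothetical.

```python
import socket


def port_answers(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailoverWatch:
    """Call `takeover` once after `max_failures` consecutive failed probes."""

    def __init__(self, probe, takeover, max_failures=3):
        self.probe = probe
        self.takeover = takeover
        self.max_failures = max_failures
        self.failures = 0
        self.active = False

    def tick(self):
        """Run one probe; return True once the backup has taken over."""
        if self.probe():
            self.failures = 0
        else:
            self.failures += 1
        if not self.active and self.failures >= self.max_failures:
            self.active = True
            self.takeover()
        return self.active
```

The BHS would call `tick()` on a timer with something like `probe=lambda: port_answers("mhs.example.com", 1984)`, where the host name is made up and the port should be whatever the MHS daemon actually listens on. Requiring several consecutive failures is a design choice that keeps a single dropped packet from starting a second server.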
Doug
Sloan [mailto:joe at tmsusa.com] wrote:
We've not had a bb server go down in all the years we've been using it, but sometimes wan connectivity goes away due to circumstances beyond our control
This is by far the biggest annoyance we have with all system monitoring - when networks go down. It's a problem with every monitoring tool there is and I can't think of any way to solve it: the monitoring system has no way of knowing whether a system is down because it crashed or if it's down because the network went down. All it knows is that it can't talk to the system anymore and something is wrong, so it generates an alert. When a whole network goes down, it can become hundreds of simultaneous alerts. And that's annoying enough when it's just email alerts. When you use Hobbit to generate cases in your trouble ticket system, that can be hundreds of new, useless cases to manually close.
We don't want to raise the amount of time a system has to be down before Hobbit generates an alert, because we want to know as soon as possible. But if we keep that number too low, then when the network has a brief hiccup, we get hundreds of redundant cases. This is especially a problem with overseas networks on the WAN.
I think the only possible solution would be for Hobbit to have some kind of flood-detection routine built in, where it could tell how rapidly it was sending alerts about connection problems for machines all on the same network, and was smart enough to think: "Whoa, I'm about to send 100 connection alarms about systems on the same network.... Instead of sending 100 of them, maybe I'll just send ONE alert saying 'You got a big problem here.'"
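As a sketch of what such flood detection might look like, the Python below groups the hosts that are currently failing their connection test by network and replaces a large group with a single summary line. It is not something Hobbit does; the /24 grouping, the threshold of 5 and the message wording are assumptions chosen for illustration.

```python
import ipaddress
from collections import defaultdict


def collapse_alerts(down_hosts, prefix_len=24, threshold=5):
    """Collapse connection alerts that share a network into one summary.

    down_hosts: iterable of (hostname, ip_address) pairs currently failing.
    Returns a list of alert strings.
    """
    by_network = defaultdict(list)
    for name, ip in down_hosts:
        # strict=False lets a host address stand in for its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        by_network[network].append(name)

    alerts = []
    for network, names in by_network.items():
        if len(names) >= threshold:
            alerts.append(f"{len(names)} hosts unreachable on {network}")
        else:
            alerts.extend(f"{name} unreachable" for name in names)
    return alerts
```

The approach only works if alerts are held briefly before being sent, so there is a batch to inspect; that delay is the price paid for fewer messages.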
Doug Linder
That is one thing I have thought about bringing up a few times - a summary alert.
When the power goes out or the WAN has issues, I get text messages about very important servers. The problem is that when they go up and down, it is very irritating to battle through even several messages on my phone. I have a BB8800 which allows me to go through them pretty quickly, but for an admin with a RAZR a dozen text messages would take several minutes to go through.
Maybe we could get some sort of toggle-able proxy for all alerts and the proxy sends out a summary every 60s? Just tossing ideas out here at this point.
Josh
If this is a situation of routed networks, Hobbit can know about that with directives in the bb-hosts file. If it knows a host behind a router is down, it will only notify for the router, not the hosts behind the router.
-- Rich Smrcina VM Assist, Inc. Phone: 414-491-6001 Ans Service: 360-715-2467 rich.smrcina at vmassist.com http://www.linkedin.com/in/richsmrcina
Catch the WAVV! http://www.wavv.org WAVV 2009 - Orlando, FL - May 15-19, 2009
Yes - I have that set up with customers' routers and CPEs.
The real problem is when, for example, 3 servers in one data center in New Mexico lose connectivity with us in Ohio. Then I get 3 SMS messages on my phone, followed by 3 more when it comes back up.
It would be very convenient to have 1 message saying this, that and another thing went down in the last 60s.
-- Josh Luthman
Oh, I think I get it.... you want to be able to consolidate notifications. Somehow, if Hobbit knows that the same person is going to get notified of multiple events, then it should only send one.
Yes, nice....
-- Rich Smrcina
It might not be perfect, but perhaps that could be managed via a couple of scripts. Configure Hobbit to alert using the SCRIPT option, and have that script append the message to a file named for the recipient. Have a second script fired by cron that would do the delivery via email, SMS, etc, then delete the file.
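The two scripts could look roughly like this in Python. It is a sketch under assumptions: Hobbit is understood to hand the recipient and message text to SCRIPT recipients through environment variables (the hobbit-alerts man page lists them), but here they are taken as plain arguments, and the actual delivery is left to a caller-supplied `send` function. All function names are hypothetical.

```python
from pathlib import Path


def queue_alert(spool_dir, recipient, message):
    """First script: append one alert to the recipient's pending file."""
    spool_dir = Path(spool_dir)
    spool_dir.mkdir(parents=True, exist_ok=True)
    with open(spool_dir / recipient, "a", encoding="utf-8") as fh:
        fh.write(message.rstrip("\n") + "\n")


def deliver_pending(spool_dir, send):
    """Second script (run from cron): send each file as one message, then delete it."""
    delivered = 0
    for path in sorted(Path(spool_dir).glob("*")):
        body = path.read_text(encoding="utf-8")
        if body:
            # The file name doubles as the recipient address.
            send(path.name, body)
            delivered += 1
        path.unlink()
    return delivered
```

Two caveats: an alert appended between the read and the delete would be lost, so a real version would want a lock or a rename before reading, and using the recipient as a file name assumes it contains no path separators.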
Ralph Mitchell
Exactly right! :)
-- Josh Luthman
I have used this method with great success, but it is a pain in the you-know-what to maintain. It would be nice if this "router" tagging could be made recursive so you only have to specify one upstream host for each host, assuming that the upstream host is also in Hobbit. As it is today you have to specify the full path to each "leaf" and this can get long.
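The recursive lookup wished for here is easy to do outside Hobbit, for example in a script that generates the host entries. The Python sketch below expands a table with one upstream device per host into the full path for any leaf. The ordering of the result (router nearest the monitoring server first) is an assumption; the exact tag syntax and required order should be taken from the bb-hosts man page.

```python
def full_route(host, upstream):
    """Expand a one-hop `upstream` map into the full router path for `host`.

    upstream maps each host to the single device directly upstream of it.
    The result lists routers from the one nearest the monitoring server
    down to the one directly in front of `host`.
    """
    path = []
    seen = {host}
    current = upstream.get(host)
    while current is not None:
        if current in seen:
            raise ValueError(f"routing loop at {current}")
        seen.add(current)
        path.append(current)
        current = upstream.get(current)
    # Walking upward collects nearest-to-host first, so flip the order.
    path.reverse()
    return path
```

With `upstream = {"leaf": "edge-rtr", "edge-rtr": "core-rtr"}` (made-up names), only one entry per host has to be maintained, and the long path for each leaf is derived.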
GLH
I would rather sever my you-know-what than have to go through all of that =)
That is a LOT of work to do. I'll put up with the annoying multi-message nights instead of doing that!
-- Josh Luthman
Might it be helpful if router paths could be 'macro-ed'? Something like notifications... so only the macro definitions had to be maintained?
Granted, I like Greg's idea much better... :)
-- Rich Smrcina
I wrote a custom alert script to handle this. The first alert is sent immediately, then the rest are spooled up and sent later, as a batch.
(1) The alert script first checks if its spool file exists. If so, the current alert is appended to that file and the custom alert script exits. There is one spool file per recipient address.
(2) If the spool file does not exist, the custom alert script sends out the current alert as normal and then creates a zero-length spool file. It also creates an "at" job. The "at" job will mail the spool file to its normal recipient after one hour's wait, and then delete it. This spoolfile deletion resets the spooling. You can vary the one-hour setting to suit your needs.
(3) When the "at" job fires it mails the spoolfile if it is non-zero length. Then it deletes the spoolfile (reset).
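The three steps above can be sketched roughly like this. It is not the script described in this message (which was not posted); the function names, the `.spool` suffix, and the `send` and `schedule_flush` callables are hypothetical stand-ins for the real mailer and the "at" job.

```python
import os


def spool_path(spool_dir, recipient):
    # One spool file per recipient address.
    return os.path.join(spool_dir, recipient.replace("/", "_") + ".spool")


def handle_alert(spool_dir, recipient, message, send, schedule_flush):
    """Send the first alert immediately; spool later ones for a batch mail."""
    path = spool_path(spool_dir, recipient)
    if os.path.exists(path):
        # Step 1: spooling is already active, so append and exit.
        with open(path, "a") as f:
            f.write(message + "\n")
        return "spooled"
    # Step 2: no spool file yet, so send now and start spooling.
    send(recipient, message)
    open(path, "w").close()          # zero-length spool file
    schedule_flush(path, recipient)  # stand-in for queuing the "at" job
    return "sent"


def flush(path, recipient, send):
    """Step 3: mail the spool file if it is non-empty, then delete it."""
    if not os.path.exists(path):
        return False
    with open(path) as f:
        body = f.read()
    if body:
        send(recipient, body)
    os.remove(path)  # deleting the file resets the spooling
    return bool(body)
```

In a real setup `schedule_flush` would queue whatever delayed job eventually calls `flush`, and `send` would hand the text to the mailer.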
Enhancements:
I found that if the server reboots while the spool file is spooling, the "at" job gets killed and you end up spooling forever. To work around this:
(1) The custom alert script was modified to check the age of the spoolfile as its first step. If it's "too old" (in my example, over 1 hour 15 minutes old), the alert script mails it immediately, deletes it, and then starts from the beginning with the current alert.
(2) Additionally, a cronjob was added to check for stale spoolfiles. The job runs every 15 minutes and looks for spoolfiles over 1 hour 15 minutes old. If any are found, the cronjob does the mailing and deleting.
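A stale-spool sweep of the kind described above might look like the sketch below. The `.spool` suffix and the `deliver` callable are hypothetical. It measures age from the file's modification time, which is an assumption: appending to a file refreshes that time, so a script that needs the age since spooling began would have to record the start time separately.

```python
import os
import time

MAX_AGE_SECONDS = 75 * 60  # 1 hour 15 minutes, as in the example above


def find_stale_spools(spool_dir, now=None, max_age=MAX_AGE_SECONDS):
    """Return spool files whose modification time is older than max_age."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if name.endswith(".spool") and now - os.path.getmtime(path) > max_age:
            stale.append(path)
    return stale


def sweep(spool_dir, deliver, now=None, max_age=MAX_AGE_SECONDS):
    """Mail and delete every stale spool file; return how many were reset."""
    count = 0
    for path in find_stale_spools(spool_dir, now, max_age):
        with open(path) as f:
            body = f.read()
        if body:
            deliver(path, body)  # stand-in for mailing the batch
        os.remove(path)
        count += 1
    return count
```

The same `find_stale_spools` check could serve both as the alert script's first step and as the body of the 15-minute cron job.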
Those are the basics. I enhanced it further so that different alert types could be grouped together into different spoolfiles and spooling could be for different lengths of time. I did this by symlinking the alert script to different names. The name of the symlink was structured, and the script looked at how it was invoked and parsed out the spooling group and length of time from its invocation name. The specific spoolfile was then named based on recipient, spool duration, and group.
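The symlink trick can be illustrated with a small sketch. The actual naming structure was not given, so the `script.group.minutes` layout, the defaults, and the spoolfile name format below are all made up for the example.

```python
import os


def parse_invocation(argv0):
    """Split a name like 'bbalert.network.30' into (group, minutes).

    The 'script.group.minutes' layout is a hypothetical convention.
    """
    parts = os.path.basename(argv0).split(".")
    if len(parts) != 3 or not parts[2].isdigit():
        return ("default", 60)  # plain name: fall back to one-hour spooling
    return (parts[1], int(parts[2]))


def spool_name(recipient, group, minutes):
    # Spoolfile named from recipient, spool duration, and group.
    return "{}.{}.{}.spool".format(recipient.replace("/", "_"), minutes, group)
```

A script would typically pass `sys.argv[0]` to `parse_invocation`, so each symlink name selects its own group and spool duration without any extra configuration.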
It is more complex to describe what I did than to actually code it! Unfortunately I cannot post the script. It does a bunch more than just this spooling function, some of that being company proprietary. It would take quite a bit of work for me to strip out the proprietary stuff to create a generic demonstration script for posting.
This script also does a function similar to spooling, but not quite. It is implemented as a symlink to a different name. I call it a "consolidate" function. It works pretty much the same as spooling, but instead of sending the spoolfile after an hour, it only waits 5 minutes, deletes the spoolfile without mailing it, and then basically does a "screen scrape" of the bb2.html page and lists all the non-green lights it finds there. This works well for pagers. Rather than getting a whole bunch of pages, you get one page that lists all the current light statuses. As part of the consolidation during the screen scrape (actually I open the actual html file, so I'm dependent on a consistent file structure, unfortunately) I heavily abbreviate things so they will fit in the tight SMS message length limits. A consolidate message might look like this cryptic example, but I know what it means! "!BB! R:testa:srv1 R:testc:srv7 Y:testf:srv2 P:testq:srv3" I list things in order of importance (reds before yellows, etc.) so if the message does get truncated, the most important parts make it through.
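The abbreviate-and-order step could be sketched as below. It skips the scraping of bb2.html entirely and starts from a list of `(color, test, host)` tuples, which is an assumed input shape. The 160-character limit and the ranking of purple after yellow are also assumptions taken from the example message, and the sketch cuts at whole entries rather than letting the carrier truncate mid-entry.

```python
SEVERITY = {"red": 0, "yellow": 1, "purple": 2}  # assumed ranking


def consolidate(statuses, limit=160, prefix="!BB!"):
    """Build one short message from (color, test, host) tuples."""
    ranked = sorted(
        (s for s in statuses if s[0] != "green"),
        key=lambda s: SEVERITY.get(s[0], len(SEVERITY)),
    )
    message = prefix
    for color, test, host in ranked:
        entry = " {}:{}:{}".format(color[0].upper(), test, host)
        if len(message) + len(entry) > limit:
            break  # least important entries are the ones dropped
        message += entry
    return message


statuses = [
    ("yellow", "testf", "srv2"),
    ("red", "testa", "srv1"),
    ("green", "cpu", "srv9"),
    ("purple", "testq", "srv3"),
    ("red", "testc", "srv7"),
]
page = consolidate(statuses)
```

Because Python's sort is stable, entries of the same color keep the order in which they were found.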
-----Original Message----- From: Linder, Doug (SABIC Innovative Plastics, consultant) [mailto:Doug.Linder at sabic-ip.com] Sent: Monday, June 16, 2008 12:08 PM To: hobbit at hswn.dk Subject: RE: [hobbit] grouping methods
Sloan [mailto:joe at tmsusa.com] wrote:
We've not had a bb server go down in all the years we've been using it, but sometimes wan connectivity goes away due to circumstances beyond our control
This is by far the biggest annoyance we have with all system monitoring - when networks go down. It's a problem with every monitoring tool there is and I can't think of any way to solve it: the monitoring system has no way of knowing whether a system is down because it crashed or if it's down because the network went down. All it knows is that it can't talk to the system anymore and something is wrong, so it generates an alert. When a whole network goes down, it can become hundreds of simultaneous alerts. And that's annoying enough when it's just email alerts. When you use Hobbit to generate cases in your trouble ticket system, that can be hundreds of new, useless cases to manually close.
We don't want to raise the amount of time a system has to be down before Hobbit generates an alert, because we want to know as soon as possible. But if we keep that number too low, then when the network has a brief hiccup, we get hundreds of redundant cases. This is especially a problem with overseas networks on the WAN.
I think the only possible solution would be for Hobbit to have some kind of flood-detection routine built in, where it could tell how rapidly it was sending alerts about connection problems for machines all on the same network, and was smart enough to think, "Whoa, I'm about to send 100 connection alarms about systems on the same network. Instead of sending 100 of them, maybe I'll just send ONE alert saying 'You got a big problem here.'"
Doug Linder
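The flood-detection idea above could be sketched as a filter that runs over the connection alerts collected in one short window. This is not an existing Hobbit feature. The `(hostname, ip)` input shape, the /24 grouping, and the threshold of 10 are all assumptions chosen for the example.

```python
import ipaddress
from collections import defaultdict


def collapse_floods(alerts, prefix_len=24, threshold=10):
    """Collapse connection alerts that share a network into one summary.

    `alerts` is a list of (hostname, ip) pairs seen in the same time window.
    """
    by_net = defaultdict(list)
    for host, ip in alerts:
        # strict=False lets a host address stand in for its network.
        net = ipaddress.ip_network("{}/{}".format(ip, prefix_len), strict=False)
        by_net[net].append(host)
    messages = []
    for net, hosts in by_net.items():
        if len(hosts) >= threshold:
            messages.append(
                "{} hosts unreachable on {}: possible network outage".format(
                    len(hosts), net
                )
            )
        else:
            messages.extend("{} unreachable".format(h) for h in hosts)
    return messages
```

With a ticketing integration, the single summary line would open one case per affected network instead of one per host.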
participants (10)
- devzero@cox.net
- Doug.Linder@sabic-ip.com
- greg.hubbard@eds.com
- haertig@avaya.com
- joe@tmsusa.com
- josh@imaginenetworksllc.com
- ralphmitchell@gmail.com
- rsmrcina@wi.rr.com
- tm@campnerd.com
- Vernon.Everett@woodside.com.au