[hobbit] Windows Cluster Monitor
That'd be great. We are demoing this for the new director on Monday, so that would really be nice. Thanks very much.
If not, we'll cobble something together based on the responses we've gotten so far.
The UNIX stuff is easy; it's been in place forever. The Windows stuff is new to me (I'm a home dabbler in Windows) as far as a production environment goes. And our Windows admins have resisted this every time we bring it up, so I'm also fighting that resistance :).
In any case, thanks for the reply...
Al Jeffcoat IBM Certified Support Specialist, AIX Enterprise Storage Administrator System Programmer II (321)843-1051 ajeffco at orhs.org
-----Original Message----- From: kevin grady [mailto:kevin.grady at gmail.com] Sent: Thursday, February 24, 2005 8:36 PM To: hobbit at hswn.dk Subject: Re: [hobbit] Windows Cluster Monitor
I'll post something this weekend as I have been working on this for a SQL cluster we have running.
On Thu, 24 Feb 2005 19:46:06 -0500, kevin grady <kevin.grady at gmail.com> wrote:
Use WMI to query the MSCluster_Resource groups and you can grab the status of each resource and then report back to hobbit.
Here's a link to some examples from MS.
http://www.microsoft.com/technet/scriptcenter/scripts/network/cluster/default.mspx
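A rough sketch of the reporting half of that suggestion, in Python. It assumes the resource names and numeric State values have already been read from MSCluster_Resource (for instance through a WMI binding on a cluster node; that part is not shown), and that state 2 means online, 3 offline and 4 failed, which should be checked against the Microsoft documentation. The function names, the host name and the test name "cluster" are made up for illustration, and the color policy is only one possible choice.

```python
# Sketch only: turns cluster resource states into a BB/Hobbit status message.
# The state numbers below are assumed MSCluster_Resource.State values.
ONLINE, OFFLINE, FAILED = 2, 3, 4

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def color_for_state(state):
    """Map one resource state to a status color (the policy is an assumption)."""
    if state == ONLINE:
        return "green"
    if state in (OFFLINE, FAILED):
        return "red"
    return "yellow"  # pending, initializing, unknown, ...


def build_status(host, test, resources):
    """Build the text of a 'status' message from (name, state) pairs."""
    worst = "green"
    lines = []
    for name, state in resources:
        color = color_for_state(state)
        if SEVERITY[color] > SEVERITY[worst]:
            worst = color
        lines.append("&%s %s (state %d)" % (color, name, state))
    # BB-style clients write the dots in the host name as commas.
    header = "status %s.%s %s cluster resources" % (
        host.replace(".", ","), test, worst)
    return "\n".join([header] + lines)


if __name__ == "__main__":
    print(build_status("sqlclus.example.com", "cluster",
                       [("SQL Server", ONLINE), ("Disk Q:", OFFLINE)]))
```

The resulting text would then be handed to whatever the client normally uses to send status messages to the Hobbit server.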
On Thu, 24 Feb 2005 18:27:35 -0500, Jeffcoat, Al <ajeffco at orhs.org> wrote:
Hello All,
Our new director would like to monitor EVERYTHING from BB/Hobbit. We have been monitoring our UNIX and storage devices for a few years now. Now that I have Windows servers to monitor, I'd like to know if anyone has a decent way to monitor Windows clusters. I had a thought to monitor by pinging each node in the cluster, and the cluster name, i.e.:
Nodea - Application Offline
Nodeb - Application Online
Clustername - Application Responding @ This address
How would you set up resource (process) monitoring for an Active / Passive cluster? Or an Active / Active cluster?
This is in response to a problem that has been occurring on a new 24x7 Windows server blue screening daily, in spite of all the "fixes" that have occurred to solve the problem (more hardware, patches, reload os, etc, etc).
We'll soon be moving the application to an AIX server, but I'll have the same questions on an HACMP cluster at that point :)
TIA
Al Jeffcoat IBM Certified Support Specialist, AIX Enterprise Storage Administrator System Programmer II (321)843-1051 ajeffco at orhs.org
This e-mail message and any attached files are confidential and are intended solely for the use of the addressee(s) named above. If you are not the intended recipient, any review, use, or distribution of this e-mail message and any attached files is strictly prohibited. This communication may contain material protected by Federal privacy regulations, attorney-client work product, or other privileges. If you have received this confidential communication in error, please notify the sender immediately by reply e-mail message and permanently delete the original message. To reply to our email administrator directly, send an email to: postmaster at orlandoregional.org . If this e-mail message concerns a contract matter, be advised that no employee or agent is authorized to conclude any binding agreement on behalf of Orlando Regional Healthcare by e-mail without express written confirmation by an officer of the corporation. Any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of Orlando Regional Healthcare.
To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe at hswn.dk
Check out bb-mscs on deadcat. It does most of what I am looking for and maybe enough for your demo. You'd need to adjust the script if you want red when a resource goes offline. Right now it will turn yellow.
Couple of things:
This may be a dumb question, but I'm gonna ask it anyway. Should I expect to be able to take the ACK code in purple pages and use it with the acknowledge alert feature? I've been experimenting with turning off big brother on a client and causing purple pages, but using the ACK code in the emails does not prevent purple pages from continuing, nor does my explanation get recorded into the acklog. ACK works for red & yellow, though.
I could've sworn that I had read about the ability to merge all the purple alerts into a single email or behavior that did that automatically, but I can't seem to find it in the docs. Is that possible? Where can I read up on how to use it? I'd love to get a single alert if a client goes purple that can use a single ACK code to disable pages.
Tom
On Fri, Feb 25, 2005 at 01:13:50PM -0500, Tom Georgoulias wrote:
Couple of things:
- This may be a dumb question, but I'm gonna ask it anyway. Should I expect to be able to take the ACK code in purple pages and use it with the acknowledge alert feature?
Yes, that's the idea.
I've been experimenting with turning off big brother on a client and causing purple pages, but using the ACK code in the emails does not prevent purple pages from continuing, nor does my explanation get recorded into the acklog. ACK works for red & yellow, though.
Hmm - odd. I'll try it out later tonight.
- I could've sworn that I had read about the ability to merge all the purple alerts into a single email or behavior that did that automatically, but I can't seem to find it in the docs. Is that possible?
No. I'd like to do some more general merging of alerts - not just purple ones - but that'll be later.
Where can I read up on how to use it? I'd love to get a single alert if a client goes purple that can use a single ACK code to disable pages.
OK, I'll let you in on a secret: If you send an acknowledge with minus-ACKCODE, it will work as an ack for all current alerts on that host.
Henrik
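For anyone sending the ack from a script rather than pasting the code into the web form, a small sketch of the minus-ACKCODE trick. The "hobbitdack COOKIE DURATION TEXT" message layout used here is an assumption about the Hobbit protocol and should be checked against the version in use; the helper name is hypothetical. The only detail taken from this thread is the leading minus sign.

```python
def build_ack(ack_code, minutes, text, all_alerts_on_host=False):
    """Return an ack message; a leading '-' on the code covers the whole host."""
    code = str(ack_code).lstrip("-")
    if all_alerts_on_host:
        code = "-" + code
    # Assumed layout: hobbitdack COOKIE DURATION TEXT (duration in minutes).
    return "hobbitdack %s %d %s" % (code, minutes, text)
```

Calling it with all_alerts_on_host=True only changes the code from, say, 123456 to -123456; everything else about the ack stays the same.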
Henrik Stoerner wrote:
- This may be a dumb question, but I'm gonna ask it anyway. Should I expect to be able to take the ACK code in purple pages and use it with the acknowledge alert feature?
Yes, that's the idea.
OK.
It seems that when I acknowledge a red/yellow alert, the trend chart is not updated during the acknowledgment time period, but resumes after the time period is over (without using any of the data that would've been collected during that time). Is that also expected?
Also, how can I unacknowledge a host, if I fix a problem before the time that I estimated it would take?
No. I'd like to do some more general merging of alerts - not just purple ones - but that'll be later.
OK, that explains why I couldn't find anything about it in the docs.
OK, I'll let you in on a secret: If you send an acknowledge with minus-ACKCODE, it will work as an ack for all current alerts on that host.
:) Sounds good.
Tom
On Fri, Feb 25, 2005 at 02:03:50PM -0500, Tom Georgoulias wrote:
Henrik Stoerner wrote:
- This may be a dumb question, but I'm gonna ask it anyway. Should I expect to be able to take the ACK code in purple pages and use it with the acknowledge alert feature?
Yes, that's the idea.
I tried it now, and ack'ing a purple status seems to work ok. I'll see if it stops sending me alerts.
It seems that when I acknowledge a red/yellow alert, the trend chart is not updated during the acknowledgment time period, but resumes after the time period is over (without using any of the data that would've been collected during that time). Is that also expected?
Ack'ing should not have any influence on whether data is collected or not. What matters is if there are any updates - if the host is down, you obviously won't be getting any new reports, and then the graphs won't update.
Also, how can I unacknowledge a host, if I fix a problem before the time that I estimated it would take?
You cannot, but the acknowledge should clear automatically as soon as an OK status arrives.
Regards, Henrik
Henrik Stoerner wrote:
I tried it now, and ack'ing a purple status seems to work ok. I'll see if it stops sending me alerts.
I am able to ack as well, so that works.
While we're on the topic of purple status messages... Hobbit is configured to turn a host purple if it hasn't heard from it in 30 mins. I want mine to go purple after 15, so I changed the PURPLEDELAY from "30" to "15" in hobbitserver.cfg, but that doesn't seem to make a difference. What else needs to be changed?
Ack'ing should not have any influence on whether data is collected or not. What matters is if there are any updates - if the host is down, you obviously won't be getting any new reports, and then the graphs won't update.
In the cases where I was testing and observed the behavior above (a 97% full disk partition), the client was online and sending data but the graphs had stalled.
This doesn't seem to be happening on RC4, so something was either fixed or the fresh install on my end helped.
Also, how can I unacknowledge a host, if I fix a problem before the time that I estimated it would take?
You cannot, but the acknowledge should clear automatically as soon as an OK status arrives.
I think I found a loophole that may cause problems in certain circumstances: Say I get a red alert for something, give an estimate of 120 mins to fix it, and the host goes purple 45 mins later (i.e. it crashes), before the ack clears. That ack stays in the red state and I won't get a page for the red -> purple transition until after the 120 mins have passed and paging resumes (presumably because the ack wasn't cleared, since it never went green before going purple). This could be bad news if I have a system that crashes when the support tech is busy with other things or if a system is brought back online after a purple status and returns to something non-green (e.g. disk is the only thing that is monitored on the system, and it immediately goes to red after boot up and stays that way for a while).
Tom
On Mon, Feb 28, 2005 at 01:28:18PM -0500, Tom Georgoulias wrote:
While we're on the topic of purple status messages... Hobbit is configured to turn a host purple if it hasn't heard from it in 30 mins. I want mine to go purple after 15, so I changed the PURPLEDELAY from "30" to "15" in hobbitserver.cfg, but that doesn't seem to make a difference. What else needs to be changed?
It's the program that generates the status message, that also determines how long it is valid. So this is something you set on each BB client or extension script. You actually cannot set it anywhere for the network tests performed by bbtest-net (I just checked and was a bit surprised that I had not provided some way of changing this).
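To make the point concrete, here is a toy model in Python; it is not Hobbit's actual code. Each status carries a lifetime chosen by whoever sent it, and the server side only compares the age of the last update against that lifetime. The names are hypothetical, and the 30-minute default is simply the figure mentioned earlier in this thread.

```python
DEFAULT_LIFETIME_MIN = 30  # the 30 minutes mentioned in this thread


def is_purple(last_update_min, now_min, lifetime_min=DEFAULT_LIFETIME_MIN):
    """True when the last status is older than the lifetime its sender gave it."""
    return now_min - last_update_min > lifetime_min
```

In this model, changing a server-wide setting has no effect on when a status goes purple; only the lifetime attached to the individual status matters.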
I think I found a loophole that may cause problems in certain circumstances: Say I get a red alert for something, give an estimate of 120 mins to fix it, and the host goes purple 45 mins later (i.e. it crashes), before the ack clears. That ack stays in the red state and I won't get a page for the red -> purple transition until after the 120 mins have passed and paging resumes (presumably because the ack wasn't cleared, since it never went green before going purple). This could be bad news if I have a system that crashes when the support tech is busy with other things or if a system is brought back online after a purple status and returns to something non-green (e.g. disk is the only thing that is monitored on the system, and it immediately goes to red after boot up and stays that way for a while).
There are lots of ways you can outsmart the system. And you needn't have a purple status in-between:
- Disk fills up and goes red
- Clueless admin ack's the disk alert for 60 minutes, then reboots the server because that "usually fixes things"
- Disk stays red and no alerts go out until an hour has passed
In such cases there is little Hobbit can do. When you ack an alert, you take over the responsibility for that status for the time the ack is valid. If you "fix" something without checking that it actually did solve the problem, you're asking for trouble.
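The behaviour discussed here can be modelled in a few lines of Python. This is a simplified sketch of the rules as described in this thread (an ack silences alerts until it expires, and an OK status clears it), not the real alerting code; the class and method names are invented, and time is counted in plain minutes.

```python
class Status:
    """One host/test status with an optional acknowledgement."""

    def __init__(self):
        self.color = "green"
        self.acked_until = None  # minute at which the ack expires

    def ack(self, now, minutes):
        self.acked_until = now + minutes

    def update(self, color, now):
        """Record a new color; return True if an alert would be sent."""
        self.color = color
        if color == "green":
            self.acked_until = None  # an OK status clears the ack
            return False
        if self.acked_until is not None and now < self.acked_until:
            return False  # still acked: red -> purple stays silent
        return True
```

With a 120-minute ack set at minute 0, an update to purple at minute 45 returns False, which is exactly the silent red -> purple transition Tom describes.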
If you really want it, it's not a big problem to implement a "de-acknowledge" function. It might even be worthwhile for reporting purposes, to keep track of how much time your admins are using on troubleshooting. I'm open to suggestions.
Regards, Henrik
Henrik Stoerner wrote:
It's the program that generates the status message, that also determines how long it is valid. So this is something you set on each BB client or extension script.
OK, that is different than BB, which only needed to have the PURPLEDELAY set on the server side, in bbdef-server.sh.
In such cases there is little Hobbit can do. When you ack an alert, you take over the responsibility for that status for the time the ack is valid. If you "fix" something without checking that it actually did solve the problem, you're asking for trouble.
I've been thinking about this a bit and I cannot see a clean, easy way to solve it either. Having an ack clear each time the status changes could be rather annoying, and a complicated set of if/then conditions is bad too. So I'm voting for leaving it as is for now. I trust our team to do the right thing and we generally strive to keep things in the green anyway. :)
If you really want it, it's not a big problem to implement a "de-acknowledge" function. It might even be worthwhile for reporting purposes, to keep track of how much time your admins are using on troubleshooting. I'm open to suggestions.
I can see this being helpful in cases where I'd like to wipe out all the various acks for whatever reason and return a system to its normal, paging self, but those situations are quite uncommon. If it's easy to implement, I wouldn't mind having it.
Tom
On Tue, Mar 01, 2005 at 04:24:55PM -0500, Tom Georgoulias wrote:
Henrik Stoerner wrote:
It's the program that generates the status message, that also determines how long it is valid. So this is something you set on each BB client or extension script.
OK, that is different than BB, which only needed to have the PURPLEDELAY set on the server side, in bbdef-server.sh.
No, this actually works exactly like in BB. PURPLEDELAY in BB only determines the interval between updates of a purple status *after* it has gone purple; it doesn't determine how long to wait before a normal status changes to purple.
That's why when you have scripts that run once an hour, you need to send in the status beginning with "status+65 ..." or it will go purple before the next planned update.
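A minimal sketch of how an extension script might build such a message, in Python. The helper name and the 5-minute margin are assumptions, picked so that an hourly script ends up with the "status+65" from the message above; the host and test names are placeholders.

```python
def status_with_lifetime(host, test, color, text, interval_min, margin_min=5):
    """Build a status message that stays valid until the next run plus a margin."""
    lifetime = interval_min + margin_min
    # BB-style clients write the dots in the host name as commas.
    return "status+%d %s.%s %s %s" % (
        lifetime, host.replace(".", ","), test, color, text)


if __name__ == "__main__":
    # An hourly job: the status must survive 60 minutes plus some slack.
    print(status_with_lifetime("db1.example.com", "backup", "green",
                               "backup completed", interval_min=60))
```

The margin only needs to cover the jitter between runs; making it much larger just delays the purple alert when the script really has stopped reporting.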
In such cases there is little Hobbit can do. When you ack an alert, you take over the responsibility for that status for the time the ack is valid. If you "fix" something without checking that it actually did solve the problem, you're asking for trouble.
I've been thinking about this a bit and I cannot see a clean, easy way to solve it either.
Well, we agree then :-)
If you really want it, it's not a big problem to implement a "de-acknowledge" function. It might even be worthwhile for reporting purposes, to keep track of how much time your admins are using on troubleshooting. I'm open to suggestions.
I can see this being helpful in cases where I'd like to wipe out all the various acks for whatever reason and return a system to its normal, paging self, but those situations are quite uncommon. If it's easy to implement, I wouldn't mind having it.
I knew you wouldn't :-))
Henrik
Henrik Stoerner wrote:
I've been experimenting with turning off big brother on a client and causing purple pages, but using the ACK code in the emails does not prevent purple pages from continuing, nor does my explanation get recorded into the acklog. ACK works for red & yellow, though.
Hmm - odd. I'll try it out later tonight.
A follow-up: I restarted hobbit and repeated the experiment, turning off bbc on my client and waiting for 30 mins until it was put into a purple status. Then I took one of the ACK codes from my purple alert emails and it worked just as expected, disabling paging for the time duration I entered and displaying the text message that I entered. I also tested putting a "-" in front of the ACK code and it acknowledged all the purples for the host, so that's a neat little trick.
I cannot explain why this didn't work the first times that I tried it, but I swear it didn't.
Tom
participants (4)
- ajeffco@orhs.org
- henrik@hswn.dk
- kevin.grady@gmail.com
- tgeorgoulias@nandomedia.com